[Info] Genome Model Applied to Software
By Danny O'Brien
02:00 AM Oct. 04, 2004 PT
What does uncovering the secret language of DNA have in common with
reverse-engineering Microsoft software?
Quite a lot, according to Marshall Beddoe, a security analyst who is
turning to algorithms used in bioinformatics research to understand the
arcane mysteries of closed, proprietary software.
Beddoe explained his methodology in a presentation at ToorCon, last week's
San Diego security and hacker conference, showing how biologists' work can
make the often tedious task of reverse-engineering network software a
little simpler. In his presentation, Beddoe noted that over the last 30
years, biologists have developed an impressive battery of algorithms to
spot commonalities between samples of DNA.
"It suddenly became very obvious to me I could use the algorithms used in
genomics in protocol analysis," said Beddoe. "So obvious, that I've been
having real problems explaining to other engineers how it works."
Beddoe calls his approach "Protocol Informatics."
For years, reverse engineers have struggled to understand the arcane
mysteries of proprietary software from the only clues they had: the way
the code communicates over open networks. Groups like the open-source
Samba project, whose code lets Unix systems interoperate with Windows file
servers, have spent hundreds of volunteer hours laboriously probing and
analyzing data packets emitted by Microsoft's software.
Reverse-engineering these protocols generally consists of looking through
dozens of separate computer conversations, scouring them for patterns that
repeat themselves and working out what those sequences might mean.
The trick, according to Beddoe, is to make the right connections between
the terminology of genetics and protocol analysis. Much of bioinformatics
is devoted to finding a known DNA sequence that is interrupted by a long
gap of unknown data and then continues on the other side. Since much of DNA
is filled with repeating, seemingly irrelevant noise, eliminating these
gaps is a common problem in genomics.
The same is true in protocol reverse-engineering. To researchers like
Beddoe, network conversations are full of "junk" -- usually the actual
data being sent -- which interferes with the analysis of the occasional
command sequence that controls what to do with that junk. Beddoe dug up
some of the oldest algorithms in the bioinformatics armory, and used them
to eliminate junk data among patterns of repeated commands.
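
As a rough illustration of the idea, the sketch below aligns two captured
messages so that shared command bytes line up and the differing payload
falls into gaps. The article does not name the algorithms Beddoe used;
Needleman-Wunsch global alignment is chosen here only because it is one of
the oldest in the field. The function names, the scoring values and the
sample messages are all assumptions made for the example.

```python
def align(a: bytes, b: bytes, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two byte strings.

    Returns two equal-length lists of byte values; None marks a gap.
    The scoring values are illustrative, not taken from the article.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of pair, gap in b, gap in a.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append(None); i -= 1
        else:
            row_a.append(None); row_b.append(b[j - 1]); j -= 1
    row_a.reverse(); row_b.reverse()
    return row_a, row_b


def render(row):
    """Show an aligned row as text, with '-' for gaps."""
    return "".join("-" if x is None else chr(x) for x in row)


if __name__ == "__main__":
    # Hypothetical text-protocol messages, not real captured traffic.
    top, bottom = align(b"USER bob\r\n", b"USER alice\r\n")
    print(repr(render(top)))
    print(repr(render(bottom)))
```

Columns where both rows agree across many messages are candidates for
command bytes; columns full of gaps and mismatches are more likely payload.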
Geneticists have also spent many years analyzing the rate of mutation
between different DNA samples. Given two pieces of DNA, biologists have
devised complex algorithms to discover whether they're descended from the
same ancestors. The method works by comparing the genetic differences with
the known mutation rates of certain DNA components.
Beddoe applied the same principles to his mutating network conversations.
He notes, for example, that ASCII text is much more likely to "mutate"
into other text than it is to mutate into something else. By feeding in
probabilities about text instead of DNA nucleotides, Beddoe discovered that
he could more easily spot related fields in network exchanges.
The genetics algorithms told him that some chunks of data were close
relations; in fact, they were bits of the network protocol that were
performing similar actions.
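
One way to picture "feeding in probabilities about text" is a substitution
score that rewards a byte changing into another byte of the same kind. The
sketch below is an assumption-laden toy: the two byte classes, the score
values and the function names are invented for illustration and are not
Beddoe's actual scoring tables.

```python
import string

# Byte values that count as "text" for this toy example.
PRINTABLE = set(string.printable.encode("ascii"))


def byte_class(x: int) -> str:
    """Classify a byte value as 'text' or 'binary'."""
    return "text" if x in PRINTABLE else "binary"


def substitution_score(x: int, y: int) -> int:
    """Score how plausible it is that byte x 'mutated' into byte y.

    The numbers are illustrative assumptions, playing the role that a
    substitution matrix plays for nucleotides or amino acids.
    """
    if x == y:
        return 2                      # unchanged byte
    if byte_class(x) == byte_class(y) == "text":
        return 1                      # text changing into other text: likely
    if byte_class(x) != byte_class(y):
        return -2                     # text changing into binary: unlikely
    return -1                         # binary changing into other binary


def similarity(a: bytes, b: bytes) -> int:
    """Sum the substitution scores of two fields, position by position."""
    return sum(substitution_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(similarity(b"bob", b"amy"))            # two text fields
    print(similarity(b"bob", b"\x00\x01\x02"))   # text field vs binary field
```

With scores like these, two fields that both hold user names rate as closer
relatives than a user name and a binary length field, even when no byte
matches exactly.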
Geneticists have also had a head start in visualizing unimaginable heaps of
data. Beddoe took the equations used by geneticists to display a species'
family tree and created a family tree of his analyzed protocols. The
result: a phylogenetic tree of Microsoft's SMB protocol, clumping
interesting fragments together for further investigation.
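
A family tree of messages can be sketched with average-linkage clustering
(UPGMA), a classic tree-building method from phylogenetics. The article
does not say which tree-building equations Beddoe used, so treat the
choice as an assumption. The distance function here leans on Python's
difflib as a stand-in for a real alignment score, and the messages are
made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a: bytes, b: bytes) -> float:
    """Stand-in distance: 0.0 for identical messages, 1.0 for nothing shared."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def upgma(messages):
    """Build a nested-tuple tree by repeatedly merging the closest clusters.

    Expects at least one message. Leaves are the original byte strings.
    """
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (msg, 1) for i, msg in enumerate(messages)}
    dist = {frozenset((i, j)): distance(messages[i], messages[j])
            for i, j in combinations(range(len(messages)), 2)}
    next_id = len(messages)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        i, j = sorted(pair)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        del dist[pair]
        # Distance from the merged cluster to each other cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    # Hypothetical messages: two similar text commands and one binary blob.
    msgs = [b"USER alice", b"USER bob", b"\x00\x01\x02\x03\xff\xfe"]
    print(upgma(msgs))
```

Messages that end up in the same subtree are the "clumps" worth inspecting
together; here the two USER commands should pair up before either joins
the binary blob.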
Beddoe isn't the only one in the computer security world casting an envious
eye over the bioinformatics sector's research. Dan Kaminsky, senior
security consultant for Avaya, said he has been investigating using genomic
pattern analysis for identifying and clustering "mutant" machines on a
corporate network: PCs whose variation from the company's standard
installation might make them vulnerable to compromise.
Kaminsky thinks this is only the beginning for the spread of bioinformatics
ideas into other fields.
"Generating an ordered, hierarchal breakdown of interrelationships from
huge piles of information is a problem that crops up everywhere. I'm not
surprised to see bioinformatics solutions finally being applied to the rest
of our poorly understood, oversized networks."
On the biology end, Terry Gaasterland, associate professor of computational
genomics at Rockefeller University, agrees that there's a wide field of
uses for the algorithms her discipline has developed -- and tricks to be
learned by biologists from other fields, too.
"The problem of decoding the language of networks and the problem of
finding signals in DNA are really two related instances of machine
learning problems. We're almost bound to discover universal principles of
information communication by investigating both," she said.
For the time being, though, Beddoe and others have one more decoding
problem to battle: understanding the jargon of another field's
documentation.
Justin Mason, the creator of SpamAssassin, is investigating bioinformatics
approaches to spam identification. He said that to outsiders, the genomics
world can seem more closed than the world of network engineers.
"A lot of the interesting research takes place in expensive journals and
seminars that we can't really get hold of. It's a bit of a difference from
the free exchange you get between coders online," he said.
Beddoe himself deduced many of the algorithms he used from PowerPoint
slides he downloaded from biologists' websites.
Gaasterland disagreed with Mason's assessment and said many bioinformatics
papers become freely available six months after publication. She added that
the publication of Beddoe's work might provide him with more assistance
from the bioinformatics community.
That'll come as a relief to Beddoe, who until now assumed that biologists
wouldn't pay much heed to his project.
<Wired News> http://www.wired.com/news/infostructure/0,1377,65191,00.html
--
※ Posted from: PTT (ptt.cc)
◆ From: 61.222.173.26