Information Distance
Ming Li
Canada Research Chair in Bioinformatics
University of Waterloo
A story about chain letters
Charles H. Bennett collected 33 copies, 1980-1997.
Like a virus, they have infected billions of people.
Like a gene, they are about 2000 characters long and mutate.
Traditional phylogeny methods fail:
Can’t do multiple alignment due to translocations
No models of evolution
They are not alone: programs, music scores, genomes, ...
A sample letter:
A very pale letter reveals the evolutionary path: ((copy)*mutate)*
Information Distance
Bennett, Gacs, Li, Vitanyi, Zurek, STOC’93
Li et al: Bioinformatics, 17:2(2001), 149-154; Li et al, IEEE Trans. Info. Theory, 2004
In the classical Newtonian world, we use length to measure distance: 10 miles, 2 km.
In the modern information world, what measure do we use for the distance between
• Two documents?
• Two genomes?
• Two computer viruses?
• Two junk emails?
• Two (possibly plagiarized) programs?
• Two pictures?
• Two internet homepages?
• The same two objects may be measured at different granulation levels.
A general theory must satisfy:
Application independent
Information granulation independent
Dominate all other theories
Useful in practice.
Outline
A theory of information distance
Applications: a paradigm of parameter-free data mining
Part I: A Theory of Information Distance
The classical approaches do not work
None of the distances we know (Euclidean distance, Hamming distance, edit distance) is adequate. For example, they do not reflect our intuition on:
[Figure: two items labeled "Austria" and "Byelorussia"]
But from where shall we start?
We will start from the first principles of physics and make no further assumptions. We wish to derive a general theory of information distance.
Kolmogorov complexity
K(x) = length of the shortest description of x
K(x|y) = length of the shortest description of x given y.
K(x) - K(x|y) is the information y knows about x.
Theorem (Mutual Information). K(x) - K(x|y) = K(y) - K(y|x)
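One way to see why the two sides agree is through the symmetry of information for the joint complexity K(x,y). This is a standard result that the slide does not spell out, and the equalities hold only up to additive logarithmic terms:

```latex
\begin{align*}
K(x,y) &= K(x) + K(y \mid x) + O(\log K(x,y)) \\
       &= K(y) + K(x \mid y) + O(\log K(x,y)) \\
\Rightarrow\quad K(x) - K(x \mid y) &= K(y) - K(y \mid x) + O(\log K(x,y)).
\end{align*}
```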
Thermodynamics of Computing
Physical Law: 1kT is needed to irreversibly process 1 bit (Von Neumann, Landauer)
Reversible computation is free.
[Figure: heat dissipation in computing: Input, Compute, Output]
[Figure: a billiard ball computer with inputs A and B and outputs A AND B, B AND NOT A, A AND NOT B, A AND B; the input/output bit table is not recoverable]
Ultimate thermodynamic cost of erasing x:
• “Reversibly compress” x to x*
• Then erase x*. Cost ~ K(x) bits.
• The longer you compute, the less heat dissipation.
Cost of computing x from y, define:
E(x,y) = min { |p| : U(x,p) = y, U(y,p) = x }.
Fundamental Theorem:
E(x,y) = max{ K(x|y), K(y|x) }
Bennett, Gacs, Li, Vitanyi, Zurek STOC’93
Normalized information distance:
$$ d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}} $$
First proposed in Li et al: Bioinformatics, 17:2(2001), 149-154, in slightly different form.
In this form in: Li et al, IEEE Trans. Info. Theory, 2004
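Because K is not computable, any practical use of d(x,y) has to substitute a real compressor. The sketch below is one such substitution, not the authors' implementation: it assumes zlib as the compressor, approximates K(x|y) by C(yx) - C(y), and uses the illustrative names `clen`, `cond_clen`, and `nid_estimate`.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes: a computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def cond_clen(x: bytes, y: bytes) -> int:
    """Approximate K(x|y) as C(yx) - C(y), clamped at zero."""
    return max(clen(y + x) - clen(y), 0)


def nid_estimate(x: bytes, y: bytes) -> float:
    """Estimate d(x,y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}."""
    return max(cond_clen(x, y), cond_clen(y, x)) / max(clen(x), clen(y))


# Toy data (assumed for illustration): a text, a lightly mutated copy, and noise.
text = b"the quick brown fox jumps over the lazy dog. " * 20
mutated = text.replace(b"fox", b"cat")
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

print("text vs mutated copy:", round(nid_estimate(text, mutated), 3))
print("text vs noise:       ", round(nid_estimate(text, noise), 3))
```

The mutated copy should come out much closer to the original than the noise does. Real compressors have header overhead and finite windows, so the estimate can slightly exceed 1 and is only meaningful for inputs well within the compressor's window.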
Theorem. d(x,y) is a nontrivial distance. It is symmetric, satisfies the triangle inequality, etc.
Open Question. We wish to show d(x,y) is universal: if x and y are “close” in any sense, then they are “close” under d(x,y). That is, for any reasonable computable distance D, there exists a constant c such that, for all x,y,
d(x,y) ≤ D(x,y) + c
d(x,y) Properties
For any computable D, for all x,y:
d(x,y) ≤ D(x,y) + c
Proof Ideas: Naively, by the density assumption |{y : |y| = n and D(x,y) ≤ d}| ≤ 2^{dn}, we have K(x|y), K(y|x) ≤ nD(x,y). So
$$ d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}} \le \frac{n\,D(x,y)}{\max\{K(x),\, K(y)\}} \qquad (1) $$
Then we are stuck. This will work only if K(x) or K(y) = n. To solve this, we first prove:
Lemma: There exist shortest programs x* for x, and y* for y, such that K(x|y) ≤ K(x*|y*).
Now,
$$ d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}} \le \frac{\max\{K(x^*|y^*),\, K(y^*|x^*)\}}{\max\{K(x),\, K(y)\}} \le \frac{\max\{|x^*|,\, |y^*|\}\, D(x,y)}{\max\{|x^*|,\, |y^*|\}} \le D(x,y) + c $$
Part II: A paradigm of parameter-free data mining
Keogh, Lonardi, Ratanamahatana, KDD 2004
Perils of parameter-laden data mining algorithms:
Incorrect settings miss true patterns
Too much tuning leads to overfitting: excellent performance on one dataset, but bad failures on new but similar datasets.
Parameters impose presumptions on data
Parameter-free data mining
Our theory provides a paradigm of parameter-free data mining, as d(x,y) is universal.
Works at all granularity levels of information
No assumptions on data.
But d(x,y) is not computable. Does this theory work at all? We decided to run extensive real-life experiments.
Application 1: Reconstructing History of Chain Letters
For each pair of chain letters (x, y) we computed d(x,y) by GenCompress, giving a distance matrix.
We then used a standard phylogeny program to construct their evolutionary history from the d(x,y) distance matrix.
The resulting tree is a perfect phylogeny: distinct features are all grouped together.
Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6(June 2003) (feature article), 76-81.
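The two steps above (pairwise distances, then a tree) can be sketched in a few lines. This is only an illustration under stated assumptions: zlib stands in for GenCompress, a naive average-linkage clustering stands in for the phylogeny program, and the toy texts and function names are made up for the example.

```python
import zlib
from itertools import combinations


def clen(data: bytes) -> int:
    """Compressed length in bytes, a computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def distance(x: bytes, y: bytes) -> float:
    """Estimate d(x,y), approximating K(x|y) by C(yx) - C(y)."""
    cx, cy = clen(x), clen(y)
    return max(clen(y + x) - cy, clen(x + y) - cx, 0) / max(cx, cy)


def distance_matrix(items):
    """Symmetric matrix of pairwise distance estimates."""
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = distance(items[i], items[j])
    return d


def average_linkage_tree(names, d):
    """Repeatedly merge the two clusters with the smallest mean distance."""
    # cluster id -> (subtree as nested tuples, indices of member items)
    clusters = {i: (name, [i]) for i, name in enumerate(names)}

    def avg(a, b):
        ma, mb = clusters[a][1], clusters[b][1]
        return sum(d[i][j] for i in ma for j in mb) / (len(ma) * len(mb))

    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        (ta, ma), (tb, mb) = clusters.pop(a), clusters.pop(b)
        clusters[next_id] = ((ta, tb), ma + mb)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical toy "letters": two unrelated texts, each with one mutated copy.
a = (b"this paper has been sent to you for good luck. the original is in "
     b"new england. it has been around the world nine times. send no money.")
b = (b"kolmogorov complexity is the length of the shortest description of "
     b"an object on a fixed universal machine, up to an additive constant.")
letters = {"A": a, "A2": a.replace(b"nine", b"ten"),
           "B": b, "B2": b.replace(b"object", b"string")}

tree = average_linkage_tree(list(letters), distance_matrix(list(letters.values())))
print(tree)
```

On this toy input each mutated copy should be grouped with its original before the two families are joined. Average linkage assumes a roughly constant rate of change, which a dedicated phylogeny method does not, so it is a simplification rather than a replacement.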
A typical chain letter input file:
with love all things are possible
this paper has been sent to you for good luck. the original is in new
england. it has been around the world nine times. the luck has been sent to
you. you will receive good luck within four days of receiving this letter.
provided, in turn, you send it on. this is no joke. you will receive good
luck in the mail. send no money. send copies to people you think need good
luck. do not send money as faith has no price. do not keep this letter. It
must leave your hands within 96 hours. an r.a.f. (royal air force) officer
received $470,000. joe elliot received $40,000 and lost them because he
broke the chain. while in the philippines, george welch lost his wife 51
days after he received the letter. however before her death he received
$7,755,000. please, send twenty copies and see what happens in four days.
the chain comes from venezuela and was written by saul anthony de grou, a
missionary from south america. since this letter must tour the world, you
must make twenty copies and send them to friends and associates. after a
few days you will get a surprise. this is true even if you are not
superstitious. do note the following: constantine dias received the chain
in 1953. he asked his secretary to make twenty copies and send them. a few
days later, he won a lottery of two million dollars. carlo daddit, an office
employee, received the letter and forgot it had to leave his hands within
96 hours. he lost his job. later, after finding the letter again, he mailed
twenty copies; a few days later he got a better job. dalan fairchild received
the letter, and not believing, threw the letter away, nine days later he died.
in 1987, the letter was received by a young woman in california, it was very
faded and barely readable. she promised herself she would retype the letter
and send it on, but she put it aside to do it later. she was plagued with
various problems including expensive car repairs, the letter did not leave
her hands in 96 hours. she finally typed the letter as promised and got a
new car. remember, send no money. do not ignore this. it works.
st. jude
Phylogeny of 33 Chain Letters
Confirmed by VanArsdale’s study, answers an open question
Application 2: Evolution of Species
Li et al: Bioinformatics, 17:2(2001)
Traditional methods: for a single gene
• Max. likelihood: multiple alignment, assumes statistical evolutionary models, computes the most likely tree.
• Max. parsimony: multiple alignment, then finds the best tree, minimizing cost.
• Distance-based methods: multiple alignment, NJ; Quartet methods, Fitch-Margoliash method.
Problem: different gene trees, manual alignment, horizontally transferred genes, do not handle genome-level events.
Whole Genome Phylogeny
Many complete genomes sequenced (400 eukaryote projects).
No evolutionary models
Multiple alignment not possible
Single-gene trees often give conflicting results.
Snel, Bork, Huynen: compare gene contents. Boore, Brown: gene order. Sankoff, Pevzner, Kececioglu: reversal/translocation.
All of the above are either too simplistic or NP-hard and need approximation anyway.
Our method using shared information is robust.
Uses all the information in the genome.
No need of evolutionary model: universal.
No need of alignment
Special cases: gene contents, gene order, reversal/translocation
Eutherian Orders:
It has been a disputed issue which two of the three groups of placental mammals are closer: Primates, Ferungulates, Rodents.
In mtDNA, 6 proteins say primates are closer to ferungulates; 6 proteins say primates are closer to rodents.
Hasegawa’s group concatenated 12 mtDNA proteins from: rat, house mouse, grey seal, harbor seal, cat, white rhino, horse, finback whale, blue whale, cow, gibbon, gorilla, human, chimpanzee, pygmy chimpanzee, orangutan, sumatran orangutan, with opossum, wallaroo, platypus as outgroup, 1998, using the maximum likelihood method in MOLPHY.
Who is our closer relative?
Eutherian Orders ...
We use the complete mtDNA genomes of exactly the same species.
We computed d(x,y) for each pair of species, and used Neighbor Joining in the MOLPHY package (and our own Hypercleaning software).
We constructed exactly the same tree, confirming that Primates and Ferungulates are closer to each other than to Rodents.
Evolutionary Tree of Mammals:
Li et al: Bioinformatics, 17:2(2001)
Application 3: “Uncheatable” Plagiarism Test
X. Chen, B. Francia, M. Li, B. Mckinnon, A. Seker. IEEE Trans. Information Theory, 50:7(July 2004), 1545-1550.
The shared information measure also works for checking student program assignments. We have implemented the system SID.
Our system takes input on the web, strips user comments, and unifies variables. Unlike other programs, we openly advertise our method: we check shared information between each pair. It is uncheatable because it is universal.
Available at http://genome.math.uwaterloo.ca/SID
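The preprocessing described above (strip comments, unify variables) can be illustrated as follows. This is a hypothetical sketch, not SID's actual pipeline: it assumes the submissions are Python source, uses the standard `tokenize` module, and renames every non-keyword identifier in order of first appearance. The normalized texts would then be compared pairwise with a compression-based distance.

```python
import io
import keyword
import tokenize

# Token types that carry no program content for this comparison.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalize(source: str) -> str:
    """Drop comments and layout, and rename identifiers to v0, v1, ..."""
    names = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # The same identifier always maps to the same placeholder.
            out.append(names.setdefault(tok.string, f"v{len(names)}"))
        else:
            out.append(tok.string)
    return " ".join(out)


original = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
disguised = "def plus(x, y):\n    return x + y  # add them\n"
different = "def mul(a, b):\n    return a * b\n"

print(normalize(original))
print(normalize(original) == normalize(disguised))
```

Renaming variables and rewriting comments leaves the normalized form unchanged, so such edits do not reduce the measured shared information. Dropping indentation tokens loses some block structure, which a more careful version would keep.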
Application 4: A language tree created using the UN’s Universal Declaration of Human Rights, by three Italian physicists, in Phys. Rev. Lett. & New Scientist
Application 5: Classifying Music
By Rudi Cilibrasi, Paul Vitanyi, Ronald de Wolf, reported in New Scientist, April 2003.
They took 12 jazz, 12 classical, and 12 rock music scores, which were classified well.
Potential application in identifying authorship.
The technique's elegance lies in the fact that it is
tone deaf. Rather than looking for features such as
common rhythms or harmonies, says Vitanyi, "it
simply compresses the files obliviously."
Parameter-Free Data Mining: Keogh, Lonardi, Ratanamahatana, KDD’04
Time series clustering
• Compared against 51 different parameter-laden measures from SIGKDD, SIGMOD, ICDM, ICDE, SSDB, VLDB, PKDD, PAKDD, the simple parameter-free shared information method outperformed all, including HMM, dynamic time warping, etc.
Approximating Normalized Information distance for non-literal objects
(R. Cilibrasi, P. Vitanyi)
Internet distribution:
g(x) = (Internet page count for “x”) / (# pages indexed)
Theorem. -log m(x) = K(x) + O(1), where m(x) is the universal distribution. (The shorter, the more likely.)
If we assume the internet distribution roughly follows m(x), then we can approximate the normalized information distance by replacing K(x) by -log m(x).
Shannon-Fano Code
Consider N symbols 1, 2, …, N, with decreasing probabilities $p_1 \ge p_2 \ge \dots \ge p_N$. Let $P_r = \sum_{i=1}^{r-1} p_i$.
The binary code E(r) for r is obtained by truncating the binary expansion of $P_r$ at length |E(r)| such that
$$ -\log p_r \le |E(r)| < -\log p_r + 1 $$
Highly probable symbols are mapped to shorter codes, and
$$ 2^{-|E(r)|} \le p_r < 2^{-|E(r)|+1} $$
Near optimal: Let $H = -\sum_r p_r \log p_r$, the average number of bits needed to encode 1…N (the entropy). Then the average code length satisfies
$$ H \le \sum_r p_r\,|E(r)| < \sum_r (-\log p_r + 1)\,p_r = 1 - \sum_r p_r \log p_r $$
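The construction can be checked on a small example. The sketch below assumes the usual convention that the cumulative probability for symbol r sums the probabilities of the symbols before r, and it uses exact `Fraction` arithmetic to avoid rounding; the function name `shannon_fano` is illustrative.

```python
from fractions import Fraction


def shannon_fano(probs):
    """Codewords for probabilities given in decreasing order."""
    probs = [Fraction(p) for p in probs]
    if any(p <= 0 for p in probs):
        raise ValueError("probabilities must be positive")
    if any(a < b for a, b in zip(probs, probs[1:])):
        raise ValueError("probabilities must be in decreasing order")

    codes = []
    cumulative = Fraction(0)  # sum of the probabilities of earlier symbols
    for p in probs:
        # Smallest length with 2^-length <= p, i.e. ceil(-log2 p).
        length = 0
        while Fraction(1, 2 ** length) > p:
            length += 1
        # First `length` bits of the binary expansion of `cumulative`.
        frac, bits = cumulative, []
        for _ in range(length):
            frac *= 2
            bit = int(frac >= 1)
            bits.append(str(bit))
            frac -= bit
        codes.append("".join(bits))
        cumulative += p
    return codes


probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
codes = shannon_fano(probs)
print(codes)  # ['0', '10', '110', '111']
```

For these dyadic probabilities the average code length is 1.75 bits, which equals the entropy, so the lower bound is met exactly.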
Examples
“horse”: #hits = 46,700,000
“rider”: #hits = 12,200,000
“horse” “rider”: #hits = 2,630,000
#pages indexed: 8,058,044,651
d”(horse,rider) = 0.453
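The slide does not give the formula behind this number. A minimal sketch, assuming the normalized Google distance of Cilibrasi and Vitanyi (K(x) replaced by the negative log of the page frequency), is below; the function name `ngd` and its parameter names are illustrative.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """(max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy))."""
    lx, ly, lxy, ln = (math.log(v) for v in (fx, fy, fxy, n))
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))


value = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
print(round(value, 3))
```

On the counts above this formula evaluates to about 0.443, slightly below the 0.453 quoted on the slide, so the slide may have used a different variant or rounding. The result does not depend on the base of the logarithm, since the base cancels in the ratio.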
Theoretically and empirically: scale-invariant
Cilibrasi-Vitanyi classified numbers vs. colors, 17th century Dutch painters, prime numbers, electrical terms, religious terms, translation English->Spanish.
New ways of doing expert systems, WordNet, AI, translation, all sorts of stuff.
Query-Answer System
Y. Hao, X. Zhang, X. Zhu, M. Li
Adding conditions to normalized information distance, we built a Query-Answer system.
Example: “Who invented the light bulb?” Our system computes d”(who, light bulb | invent)
Result:
Candidates      Distance d”
tomas edison    0.4801
light bulb      0.6087
latimer         0.7283
joseph swan     0.7750
Other applications
C. Ane and M.J. Sanderson: Phylogenetic reconstruction
K. Emanuel, S. Ravela, E. Vivant, C. Risi: Hurricane risk assessment
Protein sequence classification
Fetal heart rate detection
Ortholog detection
Authorship, topic, domain identification
Worms and network traffic analysis
Summary
A robust method that works when there is no clear data model: English text, music, genome.
A quick, primitive, and dirty way that (almost) always works, when other methods don’t.
A solid theory behind it.
When a domain is well understood, it is usually better to combine it with domain-specific methods, perhaps with parameters.
Open Questions & Research Issues
Other applications: Authorship inference, Internet plagiarism detection.
Better compression algorithms: entropy estimation.
Conjecture: There are problems that you cannot solve or approximate, whereas simple algorithms usually work in “practice”, but this fact is not provable.
Provably computable approximation?
•
An obvious example is Shannon information
Collaborators & Credits:
Chain letters: C. Bennett, B. Ma
GenCompress: X. Chen, S. Kwong
DNACompress: X. Chen, B. Ma, J. Tromp
Tree programs: Jiang, Kearney, Zhang
Biological experiments: J. Badger
Plagiarism, SID: X. Chen, B. McKinnon, A. Seker
Literature comparison: B. Ma, P. Vitanyi, X. Chen, X. Li