# Ming Li Talk about Bioinformatics - PAMI - University of Waterloo

2 Oct 2013

Information Distance

Ming Li

University of Waterloo

Charles H. Bennett collected 33 copies of chain letters, 1980-1997.

Like a virus, they have infected billions of people.

Like a gene, they are about 2000 characters long and mutate.

Can’t do multiple alignment due to translocations

No models of evolution

They are not alone: programs, music scores, genomes, ...

A sample letter:

A very pale letter reveals evolutionary path:

((copy)*mutate)*

Information Distance

Bennett, Gacs, Li, Vitanyi, Zurek, STOC’93

Li et al: Bioinformatics, 17:2(2001), 149-154; Li et al, IEEE Trans. Info. Theory, 2004

In the classical Newtonian world, we use length to
measure distance: 10 miles, 2 km

In the modern information world, what measure do
we use to measure the distances between

Two documents?

Two genomes?

Two computer viruses?

Two junk emails?

Two (possibly plagiarized) programs?

Two pictures?

Two internet homepages?

The same two objects may be measured at different
granularity levels

A general theory must satisfy:

Application independent

Information granulation independent

Dominate all other theories

Useful in practice.

Outline

A theory of information distance

Parameter-free data mining

Part I: A Theory of Information Distance

The classical approaches do not work

For all the distances we know: Euclidean distance,
Hamming distance, edit distance, none is proper. For
example, they do not reflect our intuition on:

But from where shall we start?

We will start from first principles of physics and make no
more assumptions. We wish to derive a general theory of
information distance.

(Figure labels: Austria, Byelorussia)

Kolmogorov complexity

K(x) = length of the shortest description of x.

K(x|y) = length of the shortest description of x given y.

K(x) - K(x|y) is the information y has about x.

Theorem (Mutual Information). K(x) - K(x|y) = K(y) - K(y|x)

Thermodynamics of Computing

Physical Law: 1kT is needed to irreversibly process 1 bit (Von Neumann, Landauer)

Reversible computation is free.

Figure (Heat Dissipation): Input, Compute, Output.

Figure: A billiard ball computer, with inputs A and B and outputs A AND B, B AND NOT A, A AND NOT B, A AND B.
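The billiard ball gate in the figure can be read as a small reversible function: no information about the inputs is lost, so in principle nothing has to be erased. The sketch below is only an illustration under that reading. The function names and the order of the four output paths are assumptions, not taken from the slides.

```python
def interaction_gate(a: int, b: int):
    """Billiard-ball interaction gate: two input paths, four output paths.

    The output order (A AND B, B AND NOT A, A AND NOT B, A AND B) is an
    assumed labelling of the paths.
    """
    return (a & b, b & (1 - a), a & (1 - b), a & b)


def invert(outputs):
    """Recover (a, b) from the four outputs, showing the gate is reversible."""
    a_and_b, b_not_a, a_not_b, _ = outputs
    return (a_and_b | a_not_b, a_and_b | b_not_a)


def plain_and(a: int, b: int) -> int:
    """Ordinary AND gate: three different inputs collapse to the same output."""
    return a & b


all_inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
# Every input pair can be reconstructed from the gate outputs.
round_trips = [invert(interaction_gate(a, b)) == (a, b) for a, b in all_inputs]
# The number of balls (1 bits) going in equals the number coming out.
balls_conserved = [sum(interaction_gate(a, b)) == a + b for a, b in all_inputs]
```

By contrast, `plain_and` maps (0,0), (0,1) and (1,0) to the same output, so the input cannot be recovered; this is the kind of irreversible step that the physical law above charges for.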

Ultimate thermodynamic cost of erasing x:

“Reversibly compress” x to x*.

Then erase x*. Cost ~ K(x) bits.

The longer you compute, the less heat dissipation.

Cost of computing x from y, define:

E(x,y) = min { |p| : U(x,p) = y, U(y,p)=x }.

Fundamental Theorem:

E(x,y) = max{ K(x|y), K(y|x) }

Bennett, Gacs, Li, Vitanyi, Zurek STOC’93

Normalized Information distance:

$$d(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}$$

First proposed in Li et al: Bioinformatics, 17:2(2001), 149-154, in slightly different form.

In this form in: Li et al, IEEE Trans. Info. Theory, 2004

Theorem. d(x,y) is a nontrivial distance. It is
symmetric, satisfies the triangle inequality, etc.

Open Question. We wish to show d(x,y) is universal: if x and y are “close” in any sense, then
they are “close” under d(x,y).

That is, for any reasonable computable distance D, there exists a constant c such that, for all x,y,

d(x,y) ≤ D(x,y) + c

d(x,y) Properties

For any computable D, for all x,y:

d(x,y) ≤ D(x,y) + c

Proof Ideas: Naively, by the density assumption $|\{y : |y| = n \text{ and } D(x,y) \le d\}| \le 2^{dn}$, we have

K(x|y), K(y|x) ≤ nD(x,y).

So

$$d(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \le \frac{nD(x,y)}{\max\{K(x), K(y)\}} \qquad (1)$$

Then we are stuck. This will work only if K(x) or K(y) = n. To solve this, we first prove,

Lemma: There exist shortest programs x* for x, and y* for y, such that: K(x|y) ≤ K(x*|y*).

Now,

$$d(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \le \frac{\max\{K(x^*|y^*), K(y^*|x^*)\}}{\max\{K(x), K(y)\}} \le \frac{\max\{|x^*|, |y^*|\}\, D(x,y)}{\max\{|x^*|, |y^*|\}} \le D(x,y) + c$$

Part II: A paradigm of parameter-free data mining

Keogh, Lonardi, Ratanamahatana, KDD 2004

Perils of parameter-laden algorithms:

Incorrect settings miss true patterns

Too much tuning leads to overfitting: excellent performance on one dataset, fails
badly on new but similar datasets.

Parameters impose presumptions on data

Parameter-free data mining

Our theory provides a paradigm of parameter-free data mining, as d(x,y) is universal.

Works at all granularity levels of information

No assumptions on data.

But d(x,y) is not computable. Does this theory
work at all? We decided to run extensive
real-life experiments.

Application 1: Reconstructing History of Chain Letters

For each pair of chain letters (x, y) we computed d(x,y) by GenCompress, which gives
a distance matrix.

We used a standard phylogeny program to
construct their evolutionary history based
on the d(x,y) distance matrix.
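Since K is not computable, a real compressor has to stand in for it. The sketch below shows one common way to do that, assuming K(x) is estimated by the compressed length C(x) and K(x|y) by C(yx) - C(y), which leads to the compression-based estimate (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}. It uses Python's general-purpose `zlib` rather than GenCompress, and the function names and sample strings are chosen here only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for K."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based estimate of d(x,y)."""
    cx, cy = compressed_size(x), compressed_size(y)
    # K(x|y) is estimated as C(yx) - C(y), so max{K(x|y), K(y|x)}
    # becomes roughly C(xy) - min{C(x), C(y)}.
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items):
    """Symmetric matrix of pairwise estimates, with zeros on the diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # zlib is not exactly symmetric in its arguments, so one
            # value is computed per pair and mirrored.
            matrix[i][j] = matrix[j][i] = ncd(items[i], items[j])
    return matrix


letter = (b"this paper has been sent to you for good luck. the original is in "
          b"new england. it has been around the world nine times. the luck has "
          b"been sent to you.")
mutated = letter.replace(b"nine", b"seven")   # one small "mutation"
unrelated = bytes(range(256))                  # shares nothing with the letter
matrix = distance_matrix([letter, mutated, unrelated])
```

In this toy input the mutated copy lands much closer to the original letter than the unrelated data does. A matrix of this shape is what a phylogeny program takes as input.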

The resulting tree is a perfect phylogeny:
distinct features are all grouped together.

Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6 (June 2003) (feature article), 76-81.

A typical chain letter input file:

with love all things are possible

this paper has been sent to you for good luck. the original is in new

england. it has been around the world nine times. the luck has been sent to

you. you will receive good luck within four days of receiving this letter.

provided, in turn, you send it on. this is no joke. you will receive good

luck in the mail. send no money. send copies to people you think need good

luck. do not send money as faith has no price. do not keep this letter. It

must leave your hands within 96 hours. an r.a.f. (royal air force) officer

broke the chain. while in the philippines, george welch lost his wife 51

days after he received the letter. however before her death he received

\$7,755,000. please, send twenty copies and see what happens in four days.

the chain comes from venezuela and was written by saul anthony de grou, a

missionary from south america. since this letter must tour the world, you

must make twenty copies and send them to friends and associates. after a

few days you will get a surprise. this is true even if you are not

superstitious. do note the following: constantine dias received the chain

in 1953. he asked his secretary to make twenty copies and send them. a few

days later, he won a lottery of two million dollars. carlo daddit, an office

employee, received the letter and forgot it had to leave his hands within

96 hours. he lost his job. later, after finding the letter again, he mailed

twenty copies; a few days later he got a better job. dalan fairchild received

the letter, and not believing, threw the letter away, nine days later he died.

in 1987, the letter was received by a young woman in california, it was very

faded and barely readable. she promised herself she would retype the letter

and send it on, but she put it aside to do it later. she was plagued with

various problems including expensive car repairs, the letter did not leave

her hands in 96 hours. she finally typed the letter as promised and got a

new car. remember, send no money. do not ignore this. it works.

st. jude

Phylogeny of 33 Chain Letters

Confirmed by VanArsdale’s study, answers an open question

Application 2: Evolution of Species

Li et al: Bioinformatics, 17:2(2001), 149-154

Traditional methods: for a single gene

Max. likelihood: multiple alignment, assumes statistical evolutionary models, computes the most likely
tree.

Max. parsimony: multiple alignment, then finds the
best tree, minimizing cost.

Distance-based methods: multiple alignment, NJ; Quartet methods, Fitch-Margoliash method.

Problems: different gene trees, manual
alignment, horizontally transferred genes; these methods do
not handle genome-level events.

Whole Genome Phylogeny

Many complete genomes sequenced (400 eukaryote projects).

No evolutionary models

Multiple alignment not possible

Single-gene trees often give conflicting results.

Snel, Bork, Huynen: compare gene contents. Boore, Brown:
gene order. Sankoff, Pevzner, Kececioglu:
reversal/translocation.

All of the above are either too simplistic or NP-hard and need
approximation anyway.

Our method using shared information is robust.

Uses all the information in the genome.

No need for an evolutionary model: universal.

No need of alignment

Special cases: gene contents, gene order, reversal/translocation

Eutherian Orders:

It has been a disputed issue which two of the three
groups of placental mammals are closer: Primates,
Ferungulates, Rodents.

In mtDNA, 6 proteins say primates closer to
ferungulates; 6 proteins say primates closer to
rodents.

Hasegawa’s group (1998) concatenated 12 mtDNA proteins from
rat, house mouse, grey seal, harbor seal, cat, white
rhino, horse, finback whale, blue whale, cow, gibbon, gorilla, human,
chimpanzee, pygmy chimpanzee, orangutan, and sumatran orangutan, with
opossum, wallaroo, and platypus as outgroup, using the maximum
likelihood method in MOLPHY.

Who is our closer relative?

Eutherian Orders ...

We use complete mtDNA genome of exactly the
same species.

We computed d(x,y) for each pair of species, and
used Neighbor Joining in the MOLPHY package (and
our own Hypercleaning software).

We constructed exactly the same tree, confirming that
Primates and Ferungulates are closer to each other than to
Rodents.
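To make the step from a distance matrix to a tree concrete, here is a minimal sketch that uses average-linkage clustering (UPGMA). This is a simpler stand-in and not the Neighbor Joining or Hypercleaning procedure used in the study; the function name, the four species labels and the distance values are made up for illustration only.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples."""
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)              # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Distance to the merged cluster is the size-weighted average.
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every entry that involves one of the merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy distances (hypothetical, not measured values).
species = ["human", "chimpanzee", "rat", "house mouse"]
toy_dist = [
    [0.0, 0.1, 0.8, 0.8],
    [0.1, 0.0, 0.8, 0.8],
    [0.8, 0.8, 0.0, 0.2],
    [0.8, 0.8, 0.2, 0.0],
]
tree = upgma(species, toy_dist)
```

With these toy values the two primates are joined first, then the two rodents, and the two pairs are merged last.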

Evolutionary Tree of Mammals:

Li et al: Bioinformatics, 17:2(2001)

Application 3: “Uncheatable” Plagiarism Test

X. Chen, B. Francia, M. Li, B. Mckinnon, A. Seker. IEEE Trans. Information Theory, 50:7 (July 2004), 1545-1550.

The shared information measure also works for
checking student program assignments. We have
implemented the system SID.

Our system would take input on the web, strip user
our methods (unlike other programs) that we
check shared information between each pair. It is
uncheatable because it is universal.

Available at http://genome.math.uwaterloo.ca/SID

Application 4: A language tree

Created using the UN’s Universal Declaration of Human Rights by three Italian physicists, in Phys. Rev. Lett. & New Scientist

Application 5: Classifying Music

By Rudi Cilibrasi, Paul Vitanyi, Ronald de Wolf, reported in New Scientist, April 2003.

They took 12 Jazz, 12 classical, 12 rock music
scores. Classified well.

Potential application in identifying authorship.

The technique's elegance lies in the fact that it is
tone deaf. Rather than looking for features such as
common rhythms or harmonies, says Vitanyi, "it
simply compresses the files obliviously."

Parameter-Free Data Mining: Keogh, Lonardi, Ratanamahatana, KDD’04

Time series clustering

Compared against 51 different parameter-laden
measures from SIGKDD, SIGMOD, ICDM,
ICDE, SSDB, VLDB, PKDD, PAKDD, the
simple parameter-free shared information
method outperformed all, including HMM,
dynamic time warping, etc.

Approximating Normalized Information distance for non-literal objects
(R. Cilibrasi, P. Vitanyi)

Internet distribution:

$$g(x) = \frac{\text{Internet page count for “x”}}{\text{\# pages indexed}}$$

Theorem. -log m(x) = K(x) + O(1), where m(x) is the universal
distribution. (The shorter the more likely.)

If we assume the internet distribution roughly follows m(x), then we
can approximate the normalized information distance by replacing
K(x) by -log m(x), estimated as -log g(x).

Shannon-Fano Code

Consider N symbols 1, 2, …, N, with decreasing
probabilities $p_1 \ge p_2 \ge \dots \ge p_N$. Let $P_r = \sum_{i=1}^{r-1} p_i$.
The binary code E(r) for r is obtained by
truncating the binary expansion of $P_r$ at length
|E(r)| such that

$$-\log p_r \le |E(r)| < -\log p_r + 1$$

Highly probable symbols are mapped to shorter
codes, and

$$2^{-|E(r)|} \le p_r < 2^{-|E(r)|+1}$$

Near optimal: Let $H = -\sum_r p_r \log p_r$ be the entropy, a lower bound on the average
number of bits needed to encode 1…N. Then the average code length satisfies

$$-\sum_r p_r \log p_r \le \sum_r p_r |E(r)| < \sum_r (-\log p_r + 1)\, p_r = 1 - \sum_r p_r \log p_r$$
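The construction can be traced in a few lines of code. The sketch below is an illustration, assuming the cumulative probability for a symbol is the sum of the probabilities of all more likely symbols, and that the code length is the smallest integer l with 2^-l ≤ p. The function names are chosen here, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction


def code_length(p: Fraction) -> int:
    """Smallest integer l with 2**-l <= p, i.e. l = ceil(-log2 p)."""
    length = 0
    while Fraction(1, 2 ** length) > p:
        length += 1
    return length


def shannon_fano_codes(probs):
    """Code words for symbols sorted by decreasing probability."""
    probs = sorted((Fraction(p) for p in probs), reverse=True)
    codes = []
    cumulative = Fraction(0)          # sum of the probabilities seen so far
    for p in probs:
        frac, bits = cumulative, []
        # Truncate the binary expansion of the cumulative probability.
        for _ in range(code_length(p)):
            frac *= 2
            bit = 1 if frac >= 1 else 0
            bits.append(str(bit))
            frac -= bit
        codes.append("".join(bits))
        cumulative += p
    return codes


example_probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
example_codes = shannon_fano_codes(example_probs)
```

For this example the code lengths are 1, 2, 3 and 3 bits, the more probable symbols get the shorter code words, and no code word is a prefix of another.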

Examples

“horse”: #hits = 46,700,000

“rider”: #hits = 12,200,000

“horse” “rider”: #hits = 2,630,000

#pages indexed: 8,058,044,651

d”(horse,rider) = 0.443
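The example can be checked by hand. The sketch below assumes K(x) is replaced by -log g(x) and K(x|y) by K(x,y) - K(y), which turns the normalized information distance into (max{log f(x), log f(y)} - log f(x,y)) / (log N - min{log f(x), log f(y)}), where f is a page count and N is the number of pages indexed. This is the form usually called the normalized Google distance; the function and parameter names are chosen here for illustration.

```python
import math


def normalized_web_distance(fx: int, fy: int, fxy: int, n: int) -> float:
    """Page-count approximation of the normalized information distance.

    fx, fy : page counts for each term alone
    fxy    : page count for both terms together
    n      : total number of pages indexed
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


horse_rider = normalized_web_distance(
    fx=46_700_000,        # "horse"
    fy=12_200_000,        # "rider"
    fxy=2_630_000,        # "horse" "rider"
    n=8_058_044_651,      # pages indexed
)
```

With the counts listed above, the formula evaluates to about 0.443. The base of the logarithm cancels in the ratio, so the natural logarithm is used here.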

Theoretically+empirically: scale-invariant

Cilibrasi-Vitanyi classified numbers vs colors, 17th century
Dutch painters, prime numbers, electrical terms, religious
terms, translation English->Spanish.

New ways of doing expert systems, wordnet, AI,
translation, all sorts of stuff.

Query-Answer System

Y. Hao, X. Zhang, X. Zhu, M. Li

Adding conditions to normalized information distance,
we built a Query-Answer system.
Example: “Who invented the light bulb?” Our system
computes

d”(who, light bulb | invent)

Result:

| Candidates | Distance d” |
|---|---|
| tomas edison | 0.4801 |
| light bulb | 0.6087 |
| latimer | 0.7283 |
| joseph swan | 0.7750 |

Other applications

C. Ane and M.J. Sanderson: Phylogenetic
reconstruction

K. Emanuel, S. Ravela, E. Vivant, C. Risi:
Hurricane risk assessment

Protein sequence classification

Fetal heart rate detection

Ortholog detection

Authorship, topic, domain identification

Worms and network traffic analysis

Summary

A robust method that works when there is no clear
data model: English text, music, genome.

A quick, primitive, and dirty way that (almost)
always works, when other methods don’t.

A solid theory behind.

When a domain is well understood, it is usually
better to combine this method with domain-specific methods,
perhaps with parameters.

Open Questions & Research Issues

Other applications: Authorship inference, Internet
plagiarism detection.

Better compression algorithms, entropy estimation.

Conjecture: There are problems that you cannot
solve or approximate, whereas simple algorithms
usually work in “practice”, but this fact is not
provable.

Provably computable approximation?

An obvious example is Shannon information

Collaborators & Credits:

Chain letters: C. Bennett, B. Ma

GenCompress: X. Chen, S. Kwong

DNACompress: X. Chen, B. Ma, J. Tromp

Tree programs: Jiang, Kearney, Zhang