Ming Li Talk about Bioinformatics - PAMI - University of Waterloo

Oct 2, 2013

Information Distance


Ming Li

Canada Research Chair in Bioinformatics

University of Waterloo

A story about chain letters


Charles H. Bennett collected 33 copies, 1980-1997.



Like a virus, they have infected billions of people.


Like a gene, they are about 2000 characters and
mutate;


Traditional phylogeny methods fail:


Can’t do multiple alignment due to translocations


No models of evolution


They are not alone: programs, music scores, genomes, ...

A sample letter:

A very pale letter reveals evolutionary path:


((copy)*mutate)*

Information Distance

Bennett, Gacs, Li, Vitanyi, Zurek, STOC’93

Li et al: Bioinformatics, 17:2(2001), 149-154; Li et al, IEEE Trans. Info. Theory, 2004


In the classical Newtonian world, we use length to
measure distance: 10 miles, 2 km


In the modern information world, what measure do
we use to measure the distances between


Two documents?


Two genomes?


Two computer viruses?


Two junk emails?


Two (possibly plagiarized) programs?


Two pictures?


Two internet homepages?


The same two objects may be measured at different
granulation levels

A general theory must satisfy:


Application independent


Information granulation independent


Dominate all other theories


Useful in practice.

Outline


A theory of information distance



Applications: a paradigm of parameter-free data mining

Part I: A Theory of Information Distance



The classical approaches do not work


For all the distances we know (Euclidean distance,
Hamming distance, edit distance), none is proper. For
example, they do not reflect our intuition on:


[Figure: Austria, Byelorussia]


But from where shall we start?


We will start from first principles of physics and make no
more assumptions. We wish to derive a general theory of
information distance.

Kolmogorov complexity


K(x) = length of the shortest description of x.


K(x|y) = length of the shortest description of x given y.


K(x) - K(x|y) is the information y knows about x.


Theorem (Mutual Information).


K(x) - K(x|y) = K(y) - K(y|x), up to an additive logarithmic term.

Thermodynamics of Computing


Physical law: about 1 kT of energy (more precisely, kT ln 2) is needed to
irreversibly process 1 bit (von Neumann, Landauer)


Reversible computation is free.


[Figure: heat dissipation; Input, Compute, Output]

[Figure: a billiard ball computer. Inputs: A, B. Outputs: A AND B, B AND NOT A, A AND NOT B, A AND B]
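The billiard ball computer is a standard illustration of reversible logic. As a minimal sketch (the function names `interaction_gate` and `and_gate` and the ordering of the four outputs are assumptions made here for illustration), the two-ball collision gate with the outputs labeled in the figure can be checked to be one-to-one, whereas an ordinary AND gate is not:

```python
from itertools import product


def interaction_gate(a, b):
    """Two-ball collision gate: returns (A AND B, B AND NOT A, A AND NOT B, A AND B)."""
    return (a & b, b & (1 - a), a & (1 - b), a & b)


def and_gate(a, b):
    """Ordinary AND gate: two input bits, one output bit."""
    return a & b


inputs = list(product((0, 1), repeat=2))
gate_table = {inp: interaction_gate(*inp) for inp in inputs}
and_table = {inp: and_gate(*inp) for inp in inputs}

# A gate is reversible when distinct inputs always give distinct outputs.
gate_reversible = len(set(gate_table.values())) == len(inputs)
and_reversible = len(set(and_table.values())) == len(inputs)

for inp, out in gate_table.items():
    print(inp, "->", out)
print("interaction gate reversible:", gate_reversible)
print("AND gate reversible:", and_reversible)
```

Because no two inputs share an output, the input can always be recovered from the output, which is the property that lets such a gate avoid, in principle, the erasure cost stated in the physical law above.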


Ultimate thermodynamic cost of erasing x:


“Reversibly compress” x to x*.


Then erase x*. Cost ~ K(x) bits.


The longer you compute, the less heat dissipation.


For the cost of computing x from y, define:


E(x,y) = min { |p| : U(x,p) = y, U(y,p)=x }.


Fundamental Theorem:


E(x,y) = max{ K(x|y), K(y|x) }



Bennett, Gacs, Li, Vitanyi, Zurek STOC’93


Normalized information distance:

$$ d(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} $$

First proposed in Li et al: Bioinformatics, 17:2(2001), 149-154, in slightly different form.

In this form in: Li et al, IEEE Trans. Info. Theory, 2004


Theorem. d(x,y) is a nontrivial distance. It is
symmetric, satisfies the triangle inequality, etc.



Open Question. We wish to show d(x,y) is universal: if x and y
are “close” in any sense, then they are “close” under d(x,y).
That is, for any reasonable computable distance D, there exists
a constant c such that, for all x,y,


d(x,y) ≤ D(x,y) + c


d(x,y) Properties

For any computable D, for all x,y:


d(x,y) ≤ D(x,y) + c

Proof Ideas: Naively, by the density assumption $|\{y : |y| = n \text{ and } D(x,y) \le d\}| \le 2^{dn}$, we have


K(x|y), K(y|x) ≤ nD(x,y).


So

$$ d(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \le \frac{n D(x,y)}{\max\{K(x), K(y)\}} \qquad (1) $$


Then we are stuck. This will work only if K(x) or K(y)=n. To solve this, we first prove,

Lemma: There exist shortest programs x* for x, and y* for y, such that: K(x|y) ≤ K(x*|y*).

Now,

$$ d(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \le \frac{\max\{K(x^*|y^*), K(y^*|x^*)\}}{\max\{K(x), K(y)\}} \le \frac{\max\{|x^*|, |y^*|\}\, D(x,y)}{\max\{|x^*|, |y^*|\}} \le D(x,y) + c $$


Part II: A paradigm of parameter-free data mining

Keogh, Lonardi, Ratanamahatana,
KDD 2004

Perils of parameter-laden data mining algorithms:


Incorrect settings miss true patterns


Too much tuning leads to overfitting: excellent
performance on one dataset, but bad failures
on new but similar datasets.


Parameters impose presumptions on data

Parameter-free data mining


Our theory provides a paradigm of parameter-free
data mining, as d(x,y) is universal.


Works at all granularity levels of information


No assumptions on data.


But d(x,y) is not computable. Does this theory
work at all? We decided to run extensive
real-life experiments.
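Since K is not computable, experiments have to replace it with the output length of a real compressor. The sketch below is a stand-in under stated assumptions, not the GenCompress-based procedure used in the talk: it uses Python's `zlib` as the compressor C, and it assumes K(x|y) is roughly C(xy) - C(y), so that the numerator max{K(x|y), K(y|x)} becomes C(xy) - min{C(x), C(y)}. The names `compressed_size` and `approx_distance` are chosen here for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data, a crude stand-in for K."""
    return len(zlib.compress(data, 9))


def approx_distance(x: bytes, y: bytes) -> float:
    """Approximate d(x,y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.

    Assumes K(x|y) ~ C(xy) - C(y), so the numerator is C(xy) - min{C(x), C(y)}.
    """
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


letter = (b"this paper has been sent to you for good luck. the original is in "
          b"new england. it has been around the world nine times. you will "
          b"receive good luck within four days of receiving this letter, "
          b"provided, in turn, you send it on. this is no joke.")
mutated = letter.replace(b"nine times", b"seven times")  # one small "mutation"
rng = random.Random(0)  # fixed seed so the unrelated text is reproducible
unrelated = bytes(rng.choice(b"abcdefghijklmnopqrstuvwxyz ") for _ in range(len(letter)))

print(approx_distance(letter, mutated))    # small: the texts share almost everything
print(approx_distance(letter, unrelated))  # much larger: almost nothing is shared
```

A general-purpose compressor such as zlib only sees repeats within a limited window, so for long inputs such as genomes a specialized compressor would be the more faithful choice.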

Application 1: Reconstructing History of
Chain Letters


For each pair of chain letters (x, y) we
computed d(x,y) by GenCompress, hence
a distance matrix.


We used a standard phylogeny program to
construct their evolutionary history based
on the d(x,y) distance matrix.


The resulting tree is a perfect phylogeny:
distinct features are all grouped together.


C. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories.
Scientific American, 288:6(June 2003) (feature article), 76-81.
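The tree-building step takes only the distance matrix as input. The talk relied on a standard phylogeny program; purely as an illustration of the idea, the sketch below swaps in UPGMA (average-linkage clustering), a much simpler distance-based method, and runs it on a made-up four-item matrix. The function name `upgma`, the labels, and the distances are all hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-parenthesis tree from pairwise distances by UPGMA.

    names: list of leaf labels; dist: {(a, b): distance} for each unordered pair.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical distances: letters A and B are near-copies, as are C and D.
toy = {("A", "B"): 0.10, ("C", "D"): 0.20,
       ("A", "C"): 0.80, ("A", "D"): 0.85,
       ("B", "C"): 0.82, ("B", "D"): 0.80}
tree = upgma(["A", "B", "C", "D"], toy)
print(tree)  # ((A,B),(C,D))
```

UPGMA assumes a constant rate of change along all branches, which real phylogeny programs such as neighbor joining do not, so this is only meant to show how a tree falls out of the matrix.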


A typical chain letter input file:

with love all things are possible

this paper has been sent to you for good luck. the original is in new

england. it has been around the world nine times. the luck has been sent to

you. you will receive good luck within four days of receiving this letter.

provided, in turn, you send it on. this is no joke. you will receive good

luck in the mail. send no money. send copies to people you think need good

luck. do not send money as faith has no price. do not keep this letter. It

must leave your hands within 96 hours. an r.a.f. (royal air force) officer

received $470,000. joe elliot received $40,000 and lost them because he

broke the chain. while in the philippines, george welch lost his wife 51

days after he received the letter. however before her death he received

$7,755,000. please, send twenty copies and see what happens in four days.

the chain comes from venezuela and was written by saul anthony de grou, a

missionary from south america. since this letter must tour the world, you

must make twenty copies and send them to friends and associates. after a

few days you will get a surprise. this is true even if you are not

superstitious. do note the following: constantine dias received the chain

in 1953. he asked his secretary to make twenty copies and send them. a few

days later, he won a lottery of two million dollars. carlo daddit, an office

employee, received the letter and forgot it had to leave his hands within

96 hours. he lost his job. later, after finding the letter again, he mailed

twenty copies; a few days later he got a better job. dalan fairchild received

the letter, and not believing, threw the letter away, nine days later he died.

in 1987, the letter was received by a young woman in california, it was very

faded and barely readable. she promised herself she would retype the letter

and send it on, but she put it aside to do it later. she was plagued with

various problems including expensive car repairs, the letter did not leave

her hands in 96 hours. she finally typed the letter as promised and got a

new car. remember, send no money. do not ignore this. it works.

st. jude

Phylogeny of 33 Chain Letters

Confirmed by VanArsdale’s study, answers an open question


Application 2: Evolution of Species


Li et al: Bioinformatics, 17:2(2001), 149-154


Traditional methods: for a single gene


Max. likelihood: multiple alignment, assumes
statistical evolutionary models, computes the most likely
tree.


Max. parsimony:
multiple alignment, then finds the
best tree, minimizing cost.



Distance-based methods: multiple alignment, NJ;
Quartet methods, Fitch-Margoliash method.


Problems: different genes give different trees, alignment
is manual, genes are horizontally transferred, and
genome-level events are not handled.

Whole Genome Phylogeny


Many complete genomes sequenced (400 eukaryote projects).


No evolutionary models


Multiple alignment not possible


Single-gene trees often give conflicting results.


Snel, Bork, Huynen: compare gene contents. Boore, Brown:
gene order. Sankoff, Pevzner, Kececioglu:
reversal/translocation.


All of the above are either too simplistic or NP-hard and need
approximation anyway.


Our method using shared information is robust.


Uses all the information in the genome.


No need of evolutionary model: universal.


No need of alignment


Special cases: gene contents, gene order, reversal/translocation

Eutherian Orders:


It has been a disputed issue which two of the three
groups of placental mammals are closer: Primates,
Ferungulates, Rodents.


In mtDNA, 6 proteins say primates closer to
ferungulates; 6 proteins say primates closer to
rodents.


Hasegawa’s group (1998) concatenated 12 mtDNA
proteins from:
rat, house mouse, grey seal, harbor seal, cat, white
rhino, horse, finback whale, blue whale, cow, gibbon, gorilla, human,
chimpanzee, pygmy chimpanzee, orangutan, sumatran orangutan, with
opossum, wallaroo, platypus as outgroup, using the maximum
likelihood method in MOLPHY.

Who is our closer relative?

Eutherian Orders ...


We use the complete mtDNA genomes of exactly the
same species.


We computed d(x,y) for each pair of species, and
used Neighbor Joining in the MOLPHY package (and
our own Hypercleaning software).


We constructed exactly the same tree, confirming
that Primates and Ferungulates are closer to each
other than to Rodents.

Evolutionary Tree of Mammals:

Li et al: Bioinformatics, 17:2(2001)

Application 3: “Uncheatable”
Plagiarism Test

X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker. IEEE Trans. Information Theory, 50:7(July
2004), 1545-1550.


The shared information measure also works for
checking student program assignments. We have
implemented the system SID.


Our system takes input on the web, strips user
comments, and unifies variables. We openly advertise
our method (unlike other programs): we check
shared information between each pair. It is
uncheatable because it is universal.


Available at http://genome.math.uwaterloo.ca/SID
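The preprocessing described above (strip comments, unify variables) can be sketched as follows. This is not SID's code: it is a hypothetical illustration that applies Python's standard `tokenize` module to Python source, whereas the slides do not say which languages SID handles or how it normalizes them. The function name `normalize` and the placeholder names `v0`, `v1`, ... are assumptions of this sketch.

```python
import io
import keyword
import tokenize


def normalize(source: str) -> str:
    """Drop comments and rename identifiers in order of first appearance."""
    renamed = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines carry no program logic
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(renamed.setdefault(tok.string, f"v{len(renamed)}"))
        elif tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            out.append(tokenize.tok_name[tok.type])  # keep block structure only
        else:
            out.append(tok.string)
    return " ".join(out)


original = "total = 0  # running sum\nfor item in data:\n    total += item\n"
disguised = "s = 0\nfor x in data:\n    # add it\n    s += x\n"
different = "s = 1\nfor x in data:\n    s *= x\n"

print(normalize(original) == normalize(disguised))  # renaming and comments do not help
print(normalize(original) == normalize(different))  # a real change in logic still shows
```

After normalization, each pair of submissions would be compared by their shared information, for instance with a compression-based estimate of d(x,y).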


Application 4: A language tree created using the UN’s
Universal Declaration of Human Rights, by three Italian
physicists, in Phys. Rev. Lett. & New Scientist

Application 5: Classifying Music


By Rudi Cilibrasi, Paul Vitanyi, Ronald de Wolf,
reported in
New Scientist, April 2003.


They took 12 Jazz, 12 classical, 12 rock music
scores. Classified well.


Potential application in identifying authorship.


The technique's elegance lies in the fact that it is
tone deaf. Rather than looking for features such as
common rhythms or harmonies, says Vitanyi, "it
simply compresses the files obliviously."

Parameter-Free Data Mining: Keogh,
Lonardi, Ratanamahatana, KDD’04



Time series clustering


Compared against 51 different parameter-laden
measures from SIGKDD, SIGMOD, ICDM,
ICDE, SSDB, VLDB, PKDD, PAKDD, the
simple parameter-free shared information
method outperformed all, including HMM,
dynamic time warping, etc.

Approximating Normalized Information
distance for non-literal objects
(R. Cilibrasi, P. Vitanyi)

Internet distribution:

$$ g(x) = \frac{\text{Internet page count for “x”}}{\text{\# pages indexed}} $$

Theorem. $-\log m(x) = K(x) + O(1)$, where m(x) is the universal
distribution. (The shorter the more likely.)


If we assume the internet distribution roughly follows m(x), then we
can approximate the normalized information distance by replacing
K(x) by $-\log m(x)$.


Shannon-Fano Code


Consider N symbols 1, 2, …, N, with decreasing
probabilities $p_1 \ge p_2 \ge \dots \ge p_N$. Let $P_r = \sum_{i=1}^{r-1} p_i$.
The binary code E(r) for r is obtained by
truncating the binary expansion of $P_r$ at length
|E(r)| such that

$$ -\log p_r \le |E(r)| < -\log p_r + 1 $$


Highly probable symbols are mapped to shorter
codes, and

$$ 2^{-|E(r)|} \le p_r < 2^{-|E(r)|+1} $$


Near optimal: let $H = -\sum_r p_r \log p_r$ be the entropy, and let
$L = \sum_r p_r |E(r)|$ be the average number of bits used to encode 1…N.
Then we have

$$ -\sum_r p_r \log p_r \le L < \sum_r (-\log p_r + 1)\, p_r = 1 - \sum_r p_r \log p_r $$
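The construction can be traced on a small example. The sketch below assumes exact rational probabilities (Python's `fractions.Fraction`), base-2 logarithms, and P_r taken as the total probability of the symbols preceding r; the function name `shannon_fano_code` and the sample distribution are chosen here for illustration.

```python
from fractions import Fraction


def shannon_fano_code(probs):
    """Return code words for probabilities given in decreasing order."""
    codes = []
    cumulative = Fraction(0)  # P_r: total probability of earlier symbols
    for p in probs:
        # |E(r)| is the smallest integer with 2^-|E(r)| <= p_r, i.e. ceil(-log2 p_r).
        length = 0
        while Fraction(1, 2 ** length) > p:
            length += 1
        # Truncate the binary expansion of P_r to `length` bits.
        bits, frac = [], cumulative
        for _ in range(length):
            frac *= 2
            bit = int(frac >= 1)
            bits.append(str(bit))
            frac -= bit
        codes.append("".join(bits))
        cumulative += p
    return codes


probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
codes = shannon_fano_code(probs)
print(codes)  # ['0', '10', '110', '111']
average_length = sum(p * len(c) for p, c in zip(probs, codes))
print(average_length)  # 7/4, which equals the entropy for these dyadic probabilities
```

For probabilities that are not powers of 1/2, the average length exceeds the entropy by less than one bit, as the bound above states.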

Examples


“horse”: #hits = 46,700,000


“rider”: #hits = 12,200,000


“horse” “rider”: #hits = 2,630,000


#pages indexed: 8,058,044,651


d”(horse,rider) = 0.453


Theoretically+empirically: scale-invariant


Cilibrasi-Vitanyi classified numbers vs colors, 17th century
Dutch painters, prime numbers, electrical terms, religious
terms, translation English->Spanish.


New ways of doing expert systems, wordnet, AI,
translation, all sorts of stuff.
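The horse/rider example can be recomputed from the page counts. The sketch below assumes that K(x) is replaced by -log g(x) and K(x|y) by K(x,y) - K(y), with g(x,y) taken from the joint page count; these substitutions and the function name `web_distance` are assumptions of this sketch rather than the exact definition used in the talk.

```python
import math


def web_distance(hits_x: int, hits_y: int, hits_xy: int, pages: int) -> float:
    """Approximate max{K(x|y), K(y|x)} / max{K(x), K(y)} from page counts."""
    k_x = -math.log2(hits_x / pages)
    k_y = -math.log2(hits_y / pages)
    k_xy = -math.log2(hits_xy / pages)
    # K(x|y) ~ K(x,y) - K(y), so the numerator is K(x,y) - min{K(x), K(y)}.
    return (k_xy - min(k_x, k_y)) / max(k_x, k_y)


d = web_distance(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(d, 3))
```

With the counts listed above this evaluates to about 0.443, close to, though not identical with, the 0.453 quoted in the example; the distance used in the talk may differ in detail from this sketch.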

Query-Answer System


Y. Hao, X. Zhang, X. Zhu, M. Li


Adding conditions to normalized information distance,
we built a Query-Answer system.


Example: “Who invented the light bulb?” Our system
computes



d”(who, light bulb | invent)



Result:

Candidates      Distance d’’
tomas edison    0.4801
light bulb      0.6087
latimer         0.7283
joseph swan     0.7750

Other applications


C. Ane and M.J. Sanderson: Phylogenetic
reconstruction


K. Emanuel, S. Ravela, E. Vivant, C. Risi:
Hurricane risk assessment


Protein sequence classification


Fetal heart rate detection


Ortholog detection


Authorship, topic, domain identification


Worms and network traffic analysis

Summary


A robust method that works when there is no clear
data model: English text, music, genome.


A quick, primitive, and dirty way that (almost)
always works, when other methods don’t.


A solid theory behind it.


When a domain is well understood, it is usually
better to combine it with domain-specific methods,
perhaps with parameters.

Open Questions & Research Issues


Other applications: Authorship inference, Internet
plagiarism detection.


Better compression algorithms: entropy
estimation.


Conjecture: there are problems that you cannot
solve or approximate, whereas simple algorithms
usually work in “practice”, but this fact is not
provable.


Provably computable approximation?


An obvious example is Shannon information

Collaborators & Credits:


Chain letters: C. Bennett, B. Ma


GenCompress: X. Chen, S. Kwong


DNACompress: X. Chen, B. Ma, J. Tromp


Tree programs: Jiang, Kearney, Zhang


Biological experiments: J. Badger


Plagiarism, SID: X. Chen, B. McKinnon, A. Seker


Literature comparison: B. Ma, P. Vitanyi, X.
Chen, X. Li