Ran Libeskind-Hadas
Department of Computer Science
Harvey Mudd College
Joint work with:
Mike Charleston (Univ. of Sydney)
Chris Conow (USC)
Ben Cousins (Clemson)
Daniel Fielder (HMC)
John Peebles (HMC)
Tselil Schramm (HMC)
Anak Yodpinyanee (HMC)
Integrated CS/Bio Course
Send e-mail to: ran@cs.hmc.edu
Overview
• A 75-minute “research lecture” to first-year students in our CS/Bio intro course
• Show first-year students that what they’ve learned is relevant to current research
• Showcase research done with senior students
• What have they done so far?
– Biology: Genes, alignment, phylogenetic trees, RNA folding
– CS: Programming, recursion, “memoization”
Specifically…
• Pairwise global alignment and RNA folding
– Why you should care
– Designed and implemented recursive solutions
– Why are they slow?
– How do we make them faster?
– “Memoization” idea
– Wow, that’s fast! (but no actual analysis yet)
– Designed and implemented “memoized” versions
– Used their implementations to investigate questions
Around 10 lines of Python code!
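To give a flavor of what a "memoized" solution of about this size can look like, here is a minimal sketch of RNA folding. It assumes a simplified model that maximizes the number of non-crossing base pairs; the pairing rules, the function names, and the use of `functools.lru_cache` are illustrative assumptions, not the course's actual code.

```python
from functools import lru_cache

# Assumed pairing rules (Watson-Crick plus G-U wobble); the course may use others.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(rna):
    """Return the maximum number of non-crossing base pairs in rna."""

    @lru_cache(maxsize=None)  # memoization: each (i, j) is solved only once
    def fold(i, j):
        # Best score for the substring rna[i..j], inclusive.
        if i >= j:
            return 0
        best = fold(i + 1, j)              # option 1: base i stays unpaired
        for k in range(i + 1, j + 1):      # option 2: base i pairs with base k
            if (rna[i], rna[k]) in PAIRS:
                best = max(best, 1 + fold(i + 1, k - 1) + fold(k + 1, j))
        return best

    return fold(0, len(rna) - 1)
```

Removing the `@lru_cache` line gives back the plain recursive version, which returns the same answers but repeats the same subproblems many times.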
Specifically…
• Phylogenetic trees
– Why you should care
– Implemented simple algorithm (e.g. UPGMA)
– Used their implementation to answer questions…
– Existence and relative merits of other algorithms (mention maximum likelihood… but it’s slow!)
A 75-minute lecture in 30 minutes (or less)
Cophylogenetics
“I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.”
Charles Darwin, The Origin of Species
Actual 75-minute lecture starts here! (Also a chapter in new B4B)
Obligate Mutualism of Figs and Fig Wasps
From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
[Figure label: ovipositor]
The Cophylogeny Problem
From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259
Indigobirds and Finches
www.indigobirds.com
• High level of host specificity (e.g. mouth markings)
The Question…
Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution?
This seems to be a “hard” problem!
Measuring the “Hardness” of Computational Problems
There are three kinds of problems…
1. Easy
2. Hard
3. Impossible!
“Easy” Problems
Sorting a list of n numbers:
[42, 3, 17, 26, … , 100]
Multiplying two n x n matrices:
( 3  5  2  7 )   ( 1  5  5  4 )
( 1  6  8  9 ) x ( 5 12  8  6 ) = ( n x n matrix )
( 2  4  6 10 )   ( 7  6  1  5 )
( 9  3  2 12 )   ( 9 23  5  8 )
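One way to see why matrix multiplication counts as "easy" is that the textbook method is just three nested loops, so it takes on the order of n^3 steps. The sketch below is a plain-Python illustration of that schoolbook method using the two matrices from the slide; the function name `mat_mul` is an assumption made for this example.

```python
def mat_mul(a, b):
    """Multiply two n x n matrices (lists of rows) with the schoolbook method."""
    n = len(a)
    result = [[0] * n for _ in range(n)]
    for i in range(n):            # row of a
        for j in range(n):        # column of b
            for k in range(n):    # walk along the row and down the column
                result[i][j] += a[i][k] * b[k][j]
    return result


A = [[3, 5, 2, 7], [1, 6, 8, 9], [2, 4, 6, 10], [9, 3, 2, 12]]
B = [[1, 5, 5, 4], [5, 12, 8, 6], [7, 6, 1, 5], [9, 23, 5, 8]]
product = mat_mul(A, B)
print(product)
```

The innermost statement runs n x n x n times, which is 64 times for these 4 x 4 matrices and still only a million times for n = 100.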
Global Alignment is “easy”!
• Reminder of 2^n running time of alignment
• Informally motivate n^2 running time of memoized version
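The gap between the two running times can be made concrete by counting function calls. The sketch below is an illustration under assumed scoring (match +1, mismatch -1, gap -1); the function names and the call counter are hypothetical and not taken from the course materials. The plain recursion branches on every call, so its call count grows exponentially, while the memoized version does real work at most once per pair of positions.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring scheme


def align_naive(s, t, counter):
    """Global alignment score by plain recursion; counter[0] counts calls."""
    counter[0] += 1
    if not s:
        return GAP * len(t)
    if not t:
        return GAP * len(s)
    diag = align_naive(s[1:], t[1:], counter) + (MATCH if s[0] == t[0] else MISMATCH)
    up = align_naive(s[1:], t, counter) + GAP
    left = align_naive(s, t[1:], counter) + GAP
    return max(diag, up, left)


def align_memo(s, t, counter):
    """Same recurrence, memoized on positions; counter[0] counts computed cells."""

    @lru_cache(maxsize=None)
    def score(i, j):
        counter[0] += 1  # only reached when (i, j) is not already cached
        if i == len(s):
            return GAP * (len(t) - j)
        if j == len(t):
            return GAP * (len(s) - i)
        diag = score(i + 1, j + 1) + (MATCH if s[i] == t[j] else MISMATCH)
        return max(diag, score(i + 1, j) + GAP, score(i, j + 1) + GAP)

    return score(0, 0)


naive_calls, memo_calls = [0], [0]
print(align_naive("ACGTACGT", "TGCATGCA", naive_calls), naive_calls[0])
print(align_memo("ACGTACGT", "TGCATGCA", memo_calls), memo_calls[0])
```

For two strings of length n, the memoized version computes at most (n + 1) x (n + 1) cells, which is the informal reason for the n^2 running time.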
“Hard” Problems
Snowplows of Northern Minnesota
[Map: Burrsburg, Frostbite City, Shiversville, Tundratown, Freezeapolis]
Brute-force? Greed?
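The two strategies named on the slide can be compared on a toy instance. The sketch below assumes the snowplow puzzle asks for a shortest round trip through all five towns, in the style of the travelling salesperson problem; the coordinates are made up for illustration (the slide gives no distances), and the function names are hypothetical.

```python
from itertools import permutations
from math import dist

# Hypothetical map coordinates; the slide gives no distances.
TOWNS = {
    "Burrsburg": (0, 0),
    "Frostbite City": (4, 0),
    "Shiversville": (4, 3),
    "Tundratown": (0, 3),
    "Freezeapolis": (2, 5),
}


def tour_length(order):
    """Length of the round trip that visits the towns in the given order."""
    legs = zip(order, order[1:] + order[:1])
    return sum(dist(TOWNS[a], TOWNS[b]) for a, b in legs)


def brute_force(start):
    """Try every ordering of the other towns: (n - 1)! tours."""
    others = [town for town in TOWNS if town != start]
    tours = ([start] + list(p) for p in permutations(others))
    return min(tours, key=tour_length)


def greedy(start):
    """Always drive to the nearest unvisited town: fast, but not always optimal."""
    tour = [start]
    remaining = [town for town in TOWNS if town != start]
    while remaining:
        nearest = min(remaining, key=lambda town: dist(TOWNS[tour[-1]], TOWNS[town]))
        tour.append(nearest)
        remaining.remove(nearest)
    return tour


print(tour_length(brute_force("Burrsburg")), tour_length(greedy("Burrsburg")))
```

Brute force is guaranteed to find the best tour but examines (n - 1)! orderings, which is 24 here and already more than 10^17 for 20 towns. Greed is quick but carries no such guarantee.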
n^2 versus 2^n
The Ran-O-Matic performs 10^9 operations/sec

          n = 10      n = 30      n = 50      n = 70
n^2       100         900         2500        4900
          < 1 sec     < 1 sec     < 1 sec     < 1 sec
2^n       1024        10^9        10^15       10^21
          < 1 sec     1 sec       13 days     37 thousand years

Computers double in speed every 2 years. Let’s just wait 10 years!
37 thousand years -> still more than 1,000 years!
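The arithmetic behind this comparison can be reproduced in a few lines. The sketch below assumes 10^9 operations per second as stated on the slide, a year of 365.25 days, and one doubling of speed every 2 years; the names are chosen for this example only.

```python
OPS_PER_SEC = 10**9                      # the Ran-O-Matic's speed, from the slide
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumed year length


def seconds_needed(operations, speedup=1):
    """Time to perform the given number of operations on a machine speedup times faster."""
    return operations / (OPS_PER_SEC * speedup)


for n in (10, 30, 50, 70):
    print(n, n**2, seconds_needed(n**2), 2**n, seconds_needed(2**n))

days_50 = seconds_needed(2**50) / SECONDS_PER_DAY
years_70 = seconds_needed(2**70) / SECONDS_PER_YEAR

# Waiting 10 years with a doubling every 2 years gives 5 doublings, a factor of 32.
speedup_after_10_years = 2 ** (10 // 2)
years_70_later = seconds_needed(2**70, speedup_after_10_years) / SECONDS_PER_YEAR
print(days_50, years_70, years_70_later)
```

Under these assumptions, 2^50 operations take about 13 days and 2^70 operations about 37 thousand years. Five doublings in speed reduce the latter to roughly 1,200 years, so waiting for faster hardware does not rescue an exponential-time algorithm.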
Snowplows and Travelling Salesperson Revisited!
NP-complete problems
• Travelling Salesperson Problem
• Snowplow Problem
• Protein Folding
• Phylogenetic trees by maximum likelihood
• Multiple sequence alignment
Tens of thousands of other known problems go in this cloud!!
“I can’t find an efficient algorithm. I guess I’m too dumb.”
Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
“I can’t find an efficient algorithm because no such algorithm is possible!”
“I can’t find an efficient algorithm, but neither can all these famous people.”
$1 million
Vinay Deolalikar
Coping with NP-completeness…
• Brute force
• Ad hoc heuristics
• Metaheuristics
• Approximation algorithms
Obligate Mutualism of Figs and Fig Wasps
From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
[Figure label: ovipositor]
The Cophylogeny Problem
[Figure: host tree and parasite tree with tip associations; tips labeled a to e]
Possible Solutions
[Figure: the input and possible solutions; tips labeled a to e]
Event Cost Model
• cospeciation
• duplication
• host-switch
• loss
[Figures: one pair of tree diagrams per event type; tips labeled a to e]
Event Cost Model
[Figure: two solutions with their events marked; tips labeled a to e]
Cost = duplication + cospeciation + 3 * loss
Cost = cospeciation + host-switch + loss
Some typical costs
cospeciation + 0
duplication + 2
host-switch + 3
loss + 2
[Figure: the two solutions with each event annotated by its cost]
Cost = 8
Cost = 5
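The two totals follow directly from the per-event costs annotated on the slide (cospeciation + 0, duplication + 2, host-switch + 3, loss + 2). A minimal sketch of that bookkeeping follows; the dictionary and function names are assumptions made for this example.

```python
from collections import Counter

# Per-event costs as annotated on the slide.
EVENT_COSTS = {"cospeciation": 0, "duplication": 2, "host-switch": 3, "loss": 2}


def solution_cost(events, costs=EVENT_COSTS):
    """Total cost of a solution given the list of events it uses."""
    counts = Counter(events)
    return sum(costs[event] * count for event, count in counts.items())


# duplication + cospeciation + 3 * loss
first = ["duplication", "cospeciation", "loss", "loss", "loss"]
# cospeciation + host-switch + loss
second = ["cospeciation", "host-switch", "loss"]

print(solution_cost(first), solution_cost(second))
```

Changing the numbers in `EVENT_COSTS` can change which solution is cheaper, which is why the choice of event costs matters when judging plausibility.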
This problem is hard!
• How hard? NP-complete! (Joint work with Charleston, Ovadia, Conow, Fielder)
• The host-switches are the culprits
[Figure: example with labels e, f, g, h]
Existing Methods
                 | TreeMap     | Tarzan/CoRe-PA
Technique        | Brute force | Ignore timing incompatibilities
Solution         | Optimal     | Can be BETTER than optimal!
Running Time     | Exponential | Polynomial, very fast
Tree Builder     | No          | Yes
Solution Viewer  | Yes         | Yes
A Metaheuristic Approach
• Fix a timing
• We can solve the problem optimally for a given timing using Dynamic Programming
[Figure: tree drawn against time points t = 0, 1, 2, 3, 4]
Dynamic Programming
[Figure: host tree (vertices r, s, t, u, v, w, x, y) drawn against times t = 0 to t = 4, and the parasite tree with root a and children b and c]
Compute Cost[a,su,2]
Cost[b,tw,3]
Cost[c,uy,4]
Events between a and its children: loss, host-switch, loss
Candidate for Cost[a,su,2]:
Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch
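The candidate formula is one term in a minimization: each way of placing the two children yields a candidate, and the table cell keeps the cheapest. The sketch below illustrates that step only. The event costs (loss 2, host-switch 3) are taken from the "typical costs" slide, while the two sub-costs and the second candidate are hypothetical placeholders, since the slides give no numeric table entries.

```python
LOSS = 2          # event costs as on the "typical costs" slide
HOST_SWITCH = 3


def candidate_cost(child_costs, n_losses, n_host_switches):
    """Cost of one placement of the children: their table entries plus event costs."""
    return sum(child_costs) + n_losses * LOSS + n_host_switches * HOST_SWITCH


def best_cost(candidates):
    """A table cell is the minimum over all of its candidates."""
    return min(candidate_cost(*candidate) for candidate in candidates)


# Hypothetical table entries, for illustration only.
cost_b_tw_3 = 4
cost_c_uy_4 = 0

# The candidate from the slide: two losses and one host-switch.
example = candidate_cost([cost_b_tw_3, cost_c_uy_4], n_losses=2, n_host_switches=1)

# A second, made-up candidate shows the minimization.
cell = best_cost([([cost_b_tw_3, cost_c_uy_4], 2, 1), ([6, 0], 0, 0)])
print(example, cell)
```

Filling every cell this way, from the tips of the parasite tree up to its root, gives the optimal cost for the fixed timing.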
Dynamic Programming Running Time
• O(n^3) cells to fill in
• O(n^2) positions for first child
• O(n^2) positions for second child
• O(n) to count #losses from each child, but this is precomputable
O(n^3 x (n^2 x n^2)) = O(n^7) total
Can be improved to O(n^3)
Genetic Algorithm
Existing Software
                 | TreeMap     | Tarzan/CoRe-PA                      | Jane 2
Technique        | Brute force | DP, ignore timing incompatibilities | Genetic algorithm + DP
Solution         | Optimal     | Can be BETTER than optimal!         | Sometimes suboptimal
Running Time     | Exponential | Polynomial, very fast               | Polynomial, a lot faster! Can control running time
Tree Builder     | No          | Yes                                 | No, but Jane 2 can read CoRe-PA’s trees
Solution Viewer  | Yes         | Yes                                 | Yes, also interactive
The Fig/Wasp Challenge
Results
The Fig/Wasp Dataset…
Randomly Generated Problem Instances
Original Problem Instance
Paper recently completed…
30 Coauthors
18 Institutes
10 Countries
Results
Demo
Future Work…
• One parasite, many hosts (“failure to diverge”)
• Reticulate phylogenies
• Multifurcations
• Suggestions?
Questions/Comments