Ran Libeskind-Hadas Department of Computer Science Harvey Mudd College

Oct 23, 2013



Joint work with:

Mike Charleston (Univ. of Sydney)
Chris Conow (USC)
Ben Cousins (Clemson)
Daniel Fielder (HMC)
John Peebles (HMC)
Tselil Schramm (HMC)
Anak Yodpinyanee (HMC)


Integrated CS/Bio Course


Send e-mail to: ran@cs.hmc.edu



Overview


A 75-minute “research lecture” to first-year students in our CS/Bio intro course


Show first-year students that what they’ve learned is relevant to current research


Showcase research done with senior students


What have they done so far?


Biology: Genes, alignment, phylogenetic trees, RNA folding

CS: Programming, recursion, “memoization”



Specifically…


Pairwise global alignment and RNA folding


Why you should care


Designed and implemented recursive solutions


Why are they slow?


How do we make them faster?



“Memoization” idea


Wow, that’s fast! (but no actual analysis yet)


Designed and implemented “memoized” versions


Used their implementations to investigate questions


Around 10 lines of Python code!
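
To give a sense of how short such a solution can be, here is a minimal sketch of a memoized global-alignment score. The scoring values (match +1, mismatch -1, gap -1) and the name `align_score` are assumptions made for this example; they are not taken from the course materials.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring scheme, for illustration only


def align_score(s1: str, s2: str) -> int:
    """Best global alignment score of s1 and s2 under the assumed scoring."""

    @lru_cache(maxsize=None)  # memoization: each (i, j) pair is solved once
    def best(i: int, j: int) -> int:
        if i == len(s1):  # s1 exhausted: the rest of s2 aligns to gaps
            return GAP * (len(s2) - j)
        if j == len(s2):  # s2 exhausted: the rest of s1 aligns to gaps
            return GAP * (len(s1) - i)
        pair = MATCH if s1[i] == s2[j] else MISMATCH
        return max(pair + best(i + 1, j + 1),  # align s1[i] with s2[j]
                   GAP + best(i + 1, j),       # s1[i] against a gap
                   GAP + best(i, j + 1))       # s2[j] against a gap

    return best(0, 0)
```

Removing the `@lru_cache` line gives the plain recursive version, which returns the same scores but repeats the same subproblems many times.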

Specifically…


Phylogenetic trees


Why you should care


Implemented simple algorithm (e.g. UPGMA)


Used their implementation to answer questions…


Existence and relative merits of other algorithms
(mention maximum likelihood… but it’s slow!)
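
For reference, UPGMA can be sketched in a few lines. This is an illustrative version, not the students' code: it returns only the tree shape as nested tuples (no branch lengths), and the input format (a dictionary keyed by `frozenset` pairs) is an assumption chosen for brevity.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a tree (nested tuples) by repeatedly merging the closest clusters.

    labels: list of tip names; dist: dict mapping frozenset({x, y}) -> distance.
    """
    size = {name: 1 for name in labels}   # number of tips in each cluster
    d = dict(dist)                        # working copy of the distances
    clusters = list(labels)
    while len(clusters) > 1:
        # pick the pair of clusters with the smallest average distance
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # size-weighted average keeps this the mean over all tip pairs
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / size[merged]
        clusters.append(merged)
    return clusters[0]
```

With four hypothetical tips where A and B are closest and C and D are next closest, the result groups them as `(('A', 'B'), ('C', 'D'))`.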

A 75-minute lecture in 30 minutes (or less)

Cophylogenetics

“I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.”

Charles Darwin, The Origin of Species



Actual 75-minute lecture starts here! (Also a chapter in new B4B)


Obligate Mutualism of Figs and Fig Wasps

From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004

[Figure label: ovipositor]

The Cophylogeny Problem

From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259

Indigobirds and Finches

www.indigobirds.com



High level of host specificity (e.g. mouth markings)


The Question…

Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution?

This seems to be a “hard” problem!

Measuring the “Hardness” of Computational Problems

There are three kinds of problems…

1. Easy
2. Hard
3. Impossible!

“Easy” Problems

Sorting a list of n numbers:

[42, 3, 17, 26, … , 100]

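Multiplying two $n \times n$ matrices:

\[
\begin{pmatrix} 3 & 5 & 2 & 7 \\ 1 & 6 & 8 & 9 \\ 2 & 4 & 6 & 10 \\ 9 & 3 & 2 & 12 \end{pmatrix}
\begin{pmatrix} 1 & 5 & 5 & 4 \\ 5 & 12 & 8 & 6 \\ 7 & 6 & 1 & 5 \\ 9 & 23 & 5 & 8 \end{pmatrix}
= \Bigl(\;\cdots\;\Bigr)
\]

(each matrix has $n$ rows and $n$ columns)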

Global Alignment is “easy”!




Reminder of $2^n$ running time of alignment

Informally motivate $n^2$ running time of memoized version
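
The gap between the two running times can be made concrete by counting work. The sketch below assumes the usual three-way alignment recursion (match/mismatch, gap in one string, gap in the other) that stops as soon as either string is exhausted; the function names are hypothetical. It counts how many calls the plain recursion makes on two strings of lengths n and m, and how many distinct subproblems a memoized version can ever meet.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # the counter itself is memoized so that counting stays fast
def count_naive_calls(n: int, m: int) -> int:
    """Number of calls made by the plain three-way recursion on lengths n and m."""
    if n == 0 or m == 0:
        return 1  # base case: a single call, no further recursion
    return (1
            + count_naive_calls(n - 1, m - 1)   # align the two first characters
            + count_naive_calls(n - 1, m)       # gap in the second string
            + count_naive_calls(n, m - 1))      # gap in the first string


def count_memoized_cells(n: int, m: int) -> int:
    """Number of distinct (i, j) subproblems: one per pair of suffix lengths."""
    return (n + 1) * (m + 1)


for n in (5, 10, 15, 20):
    print(n, count_naive_calls(n, n), count_memoized_cells(n, n))
```

The first count grows exponentially in n, while the second is about $n^2$, which is the informal argument for why memoization helps so much.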



“Hard” Problems

Snowplows of Northern Minnesota

[Map: Burrsburg, Frostbite City, Shiversville, Tundratown, Freezeapolis]

Brute-force? Greed?
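
The slides do not spell out the snowplow problem, so the sketch below illustrates brute force on the closely related travelling-salesperson setting instead: try every ordering of the towns and keep the shortest round trip. The function name and the distance format (a dictionary keyed by `frozenset` pairs) are assumptions for this example, and it expects at least two towns.

```python
from itertools import permutations


def shortest_tour(cities, dist):
    """Try every ordering of the cities; return (length, tour) of the best round trip."""
    start, *rest = cities
    best = None
    for order in permutations(rest):  # (n - 1)! orderings of the remaining cities
        tour = (start, *order, start)
        length = sum(dist[frozenset(leg)] for leg in zip(tour, tour[1:]))
        if best is None or length < best[0]:
            best = (length, tour)
    return best
```

For the five towns on the map this checks only 4! = 24 tours, but for 20 towns it would check 19!, roughly $1.2 \times 10^{17}$ of them, which is why brute force stops being an option so quickly.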

$n^2$ versus $2^n$

The Ran-O-Matic performs $10^9$ operations/sec

           $n^2$              $2^n$
n = 10     100, < 1 sec       1024, < 1 sec
n = 30     900, < 1 sec       $10^9$, 1 sec
n = 50     2500, < 1 sec      $10^{15}$, 13 days
n = 70     4900, < 1 sec      $10^{21}$, 37 thousand years

Computers double in speed every 2 years. Let’s just wait 10 years!

37 thousand years -> about 1,200 years!
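
The timings follow from dividing the operation count by the machine speed, and the short calculation below redoes that arithmetic. The only figures taken from the slide are the speed ($10^9$ operations per second) and the doubling period of 2 years; the variable names are illustrative.

```python
OPS_PER_SEC = 10**9                      # speed of the Ran-O-Matic, from the slide
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(operations: int) -> float:
    """Wall-clock seconds needed to perform the given number of operations."""
    return operations / OPS_PER_SEC


for n in (10, 30, 50, 70):
    print(f"n = {n}: n^2 takes {seconds_needed(n**2):.1e} s, "
          f"2^n takes {seconds_needed(2**n):.1e} s")

days_for_50 = seconds_needed(2**50) / SECONDS_PER_DAY     # about 13 days
years_for_70 = seconds_needed(2**70) / SECONDS_PER_YEAR   # about 37 thousand years
speedup = 2 ** (10 // 2)   # doubling every 2 years for 10 years gives 32x
print(days_for_50, years_for_70, years_for_70 / speedup)
```

Even after the 32-fold speedup, the n = 70 case still takes more than a thousand years, while the $n^2$ column never leaves the "< 1 sec" range.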

Snowplows and Travelling Salesperson Revisited!

NP-complete problems

Travelling Salesperson Problem
Snowplow Problem
Protein Folding
Phylogenetic trees by maximum likelihood
Multiple sequence alignment

Tens of thousands of other known problems go in this cloud!!

“I can’t find an efficient algorithm. I guess I’m too dumb.”

Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson


“I can’t find an efficient algorithm because no such algorithm is possible!”


“I can’t find an efficient algorithm, but neither can all these famous people.”



$1 million

Vinay Deolalikar

Coping with NP-completeness…



Brute force



Ad hoc Heuristics



Metaheuristics



Approximation algorithms



The Cophylogeny Problem

[Figure: host tree and parasite tree joined by tip associations; tips labeled a, b, c, d, e]

Possible Solutions

[Figure: the input and two possible solutions; tips labeled a, b, c, d, e]

Event Cost Model

Event types: cospeciation, duplication, host-switch, loss

[Figure: two reconstructions on tips a, b, c, d, e, with each event labeled]

First reconstruction: Cost = duplication + cospeciation + 3 * loss

Second reconstruction: Cost = cospeciation + host-switch + loss

Some typical costs

cospeciation = 0, duplication = 2, host-switch = 3, loss = 2

First reconstruction: Cost = duplication + cospeciation + 3 * loss = 2 + 0 + 3 * 2 = 8

Second reconstruction: Cost = cospeciation + host-switch + loss = 0 + 3 + 2 = 5
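
The same totals can be computed mechanically from event counts. This small sketch uses the event costs shown on the slide (cospeciation 0, duplication 2, host-switch 3, loss 2); the function name and the dictionary layout are illustrative choices.

```python
# Event costs read off the "typical costs" slide.
EVENT_COST = {"cospeciation": 0, "duplication": 2, "host-switch": 3, "loss": 2}


def reconstruction_cost(event_counts: dict) -> int:
    """Total cost of a reconstruction, given how many times each event occurs."""
    return sum(EVENT_COST[event] * count for event, count in event_counts.items())


first = reconstruction_cost({"duplication": 1, "cospeciation": 1, "loss": 3})
second = reconstruction_cost({"cospeciation": 1, "host-switch": 1, "loss": 1})
print(first, second)  # prints "8 5"
```

Under these costs the reconstruction with the host-switch is cheaper, so it would be preferred.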

This problem is hard!


How hard? NP-complete! (Joint work with Charleston, Ovadia, Conow, Fielder)

The host-switches are the culprits

[Figure: tree nodes labeled e, f, g, h]

Existing Methods

                   TreeMap        Tarzan/CoRe-PA
Technique          Brute force    Ignore timing incompatibilities
Solution           Optimal        Can be BETTER than optimal!
Running Time       Exponential    Polynomial, very fast
Tree Builder       No             Yes
Solution Viewer    Yes            Yes

A Metaheuristic Approach

Fix a timing

We can solve the problem optimally for a given timing using dynamic programming

[Figure: host tree with nodes r, s, t, u, v, w, x, y placed at times t = 0 through t = 4; parasite tree with node a and children b and c]

Dynamic Programming

Compute Cost[a,su,2]

[Figure: subproblems Cost[b,tw,3] and Cost[c,y,4], connected by a host-switch and two losses]

Candidate for Cost[a,su,2]:

Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch

Dynamic Programming

Running Time

$O(n^3)$ cells to fill in

$O(n^2)$ positions for first child

$O(n^2)$ positions for second child

$O(n)$ to count #losses from each child, but this is precomputable

$O(n^3 \times (n^2 \times n^2)) = O(n^7)$ total

Can be improved to $O(n^3)$

Genetic Algorithm
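
The slides do not describe the genetic algorithm itself, so the following is a generic sketch of the idea rather than Jane's implementation. It searches over orderings (standing in for timings), scoring each with a caller-supplied `cost` function that plays the role of the dynamic program. It ignores the requirement that a real timing must respect the host tree's ancestor order, and all names and parameter values are assumptions.

```python
import random


def order_crossover(p1, p2, rng):
    """Child keeps a slice of p1 and fills the remaining slots in p2's order."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    kept = p1[i:j]
    rest = [x for x in p2 if x not in kept]
    return rest[:i] + kept + rest[i:]


def mutate(order, rng):
    """Return a copy of the ordering with two random positions swapped."""
    a, b = rng.sample(range(len(order)), 2)
    order = list(order)
    order[a], order[b] = order[b], order[a]
    return order


def genetic_search(items, cost, generations=50, pop_size=20, seed=0):
    """Search over orderings of `items`, minimizing the supplied `cost` function."""
    rng = random.Random(seed)
    population = [rng.sample(items, len(items)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]   # selection: keep the cheaper half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(order_crossover(p1, p2, rng), rng))
        population = survivors + children
    return min(population, key=cost)
```

Because the cheaper half of each generation is carried over unchanged, the best cost found never gets worse, and the number of generations gives direct control over the running time. There is no guarantee that the final answer is optimal.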

Existing Software

                   TreeMap        Tarzan/CoRe-PA                         Jane 2
Technique          Brute force    DP, ignore timing incompatibilities    Genetic algorithm + DP
Solution           Optimal        Can be BETTER than optimal!            Sometimes suboptimal
Running Time       Exponential    Polynomial, very fast                  Polynomial, a lot faster! Can control running time
Tree Builder       No             Yes                                    No, but Jane 2 can read CoRe-PA’s trees
Solution Viewer    Yes            Yes                                    Yes, also interactive

The Fig/Wasp Challenge


Results

The Fig/Wasp Dataset…

Randomly Generated Problem Instances

Original Problem Instance

Paper recently completed…

30 Coauthors

18 Institutes

10 Countries

Results



Demo

Future Work…


One parasite, many hosts (“failure to diverge”)


Reticulate phylogenies


Multifurcations


Suggestions?


Questions/Comments