Bioinformatics challenges of new sequencing ... - cdn.oreilly.com

2 Oct 2013

“How Perl Saved the Human Genome Project”

DATE: Early February, 1996

LOCATION: Cambridge, England, in the conference room of the
largest DNA sequencing center in Europe.

OCCASION: A high level meeting between the computer scientists of
this center and the largest DNA sequencing center in the United
States.

THE PROBLEM: Although the two centers use almost identical
laboratory techniques, almost identical databases, and almost
identical data analysis tools, they still can't interchange data or
meaningfully compare results.

THE SOLUTION: Perl.

Lincoln Stein, TPJ Vol 1 #2, Summer 1996


“How Perl Saved the Human Genome Project”


Perl solved issues of:



- a rapidly-changing situation
- text-manipulation to convert between data formats
- building pipelines to glue data analysis programs together

10 years on

Obligatory tenuous coding analogy

- The genome is the source of a program to build and run a human
- But: the author is not available for comment
- It’s 3GB in size
- Due to constant forking, there are about 7 billion different versions
- It’s full of copy-and-paste and cruft
- And it’s completely undocumented
- Q: How do you debug it?
- A: Diff a working copy and a broken copy
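
To make the "diff a working copy and a broken copy" punchline concrete, here is a minimal sketch in Python. It assumes two already-aligned sequences of equal length and reports only single-base substitutions; the function name `diff_sequences` and the toy sequences are illustrative and do not come from the talk. Real variant calling also has to handle insertions, deletions, and sequencing error.

```python
def diff_sequences(working, broken):
    """Return (1-based position, working base, broken base) for each mismatch.

    Assumption: both sequences are already aligned and have the same length.
    """
    if len(working) != len(broken):
        raise ValueError("this sketch only compares equal-length sequences")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(working, broken), start=1)
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    for position, ref_base, alt_base in diff_sequences("GATTACA", "GATCACA"):
        print(f"position {position}: {ref_base} -> {alt_base}")
```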

Same as it ever was


We still have the same problems as in 1996



- a rapidly-changing situation
- text-manipulation to convert between data formats
- building pipelines to glue data analysis programs together

A rapidly changing situation

MR Stratton et al., Nature 458, 719-724 (2009)

Many data formats


“a sea of incompatible data formats”

“[for each new piece of software] you could always count on it to
sport its own idiosyncratic user interface and data format.”

Lincoln Stein, TPJ Vol 1 #2, Summer 1996
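
As a small illustration of the text manipulation that format conversion involves, the sketch below turns FASTQ records into FASTA records. It is written in Python rather than Perl, and the choice of FASTQ and FASTA is an assumption made for the example, since the talk does not name these two formats. It also assumes the simple case of exactly four lines per record; the function name `fastq_to_fasta` is hypothetical.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to two-line FASTA records.

    Assumption: every record is exactly four lines
    (@header, sequence, +, qualities); wrapped sequences are not handled.
    """
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta_lines = []
    for start in range(0, len(lines), 4):
        header, sequence, separator, _qualities = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {start + 1}")
        fasta_lines.append(">" + header[1:])  # FASTA headers start with '>'
        fasta_lines.append(sequence)          # quality scores are dropped
    return "".join(line + "\n" for line in fasta_lines)


if __name__ == "__main__":
    # Toy record, invented for illustration only.
    print(fastq_to_fasta("@read1\nACGT\n+\nIIII\n"), end="")
```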




Building pipelines

[Diagram: sequencing pipeline. Stages shown: Initial data QC; Data QC; Submission to public archives; Sample reception; Library prep; Sequence ordering; Sequencing; Tracking; Genotype check; Library QC; Recalibration; Mapping to reference; Merging libraries; To collaborators; SNP calling; Structural variants; Filtering; Build release BAM files; Collaborator data; Visualization; Downstream analysis]
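
The "glue" idea behind such a pipeline can be sketched as running external programs in sequence, with the output of one stage becoming the input of the next. The Python sketch below is an assumption-laden illustration, not the pipeline from the talk: `run_pipeline`, `FILTER_SHORT`, and `UPPERCASE` are hypothetical names, and the two stages are tiny Python one-liners standing in for real analysis tools.

```python
import subprocess
import sys


def run_pipeline(stages, data):
    """Pass `data` through each command; stdout of one stage feeds the next."""
    for command in stages:
        result = subprocess.run(
            command,
            input=data,
            capture_output=True,
            text=True,
            check=True,  # stop the pipeline if any stage fails
        )
        data = result.stdout
    return data


# Hypothetical stand-ins for real tools: drop lines shorter than 4 characters,
# then convert what remains to upper case.
FILTER_SHORT = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    if len(line.strip()) >= 4: sys.stdout.write(line)",
]
UPPERCASE = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

if __name__ == "__main__":
    print(run_pipeline([FILTER_SHORT, UPPERCASE], "acgt\nac\nggcc\n"), end="")
```

Keeping each stage a separate program means any one tool can be replaced without touching the others, which is the property the pipeline slide relies on.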

In conclusion



“Although it's not perfect, Perl fills the needs of the genome centers
remarkably well, and is usually the first tool we turn to when we
have a problem to solve.”