“How Perl Saved the Human Genome Project”
DATE: Early February, 1996
LOCATION: Cambridge, England, in the conference room of the largest DNA sequencing center in Europe.
OCCASION: A high level meeting between the computer scientists of this center and the largest DNA sequencing center in the United States.
THE PROBLEM: Although the two centers use almost identical laboratory techniques, almost identical databases, and almost identical data analysis tools, they still can't interchange data or meaningfully compare results.
THE SOLUTION: Perl.
Lincoln Stein, TPJ Vol 1 #2 Summer 1996
Perl solved issues of:
a rapidly changing situation
text manipulation to convert between data formats
building pipelines to glue data analysis programs together
10 years on
Obligatory tenuous coding analogy
The genome is the source of a program to build and run a human
But: the author is not available for comment
It’s 3GB in size
Due to constant forking, there are about 7 billion different versions
It’s full of copy-and-paste and cruft
And it’s completely undocumented
Q: How do you debug it?
A: Diff a working copy and a broken copy
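To make the analogy concrete, here is a minimal sketch in Python of "diffing" two copies. It assumes the two sequences are already aligned and the same length, so it only finds single-base substitutions; real variant detection also has to cope with insertions, deletions, and sequencing errors. The function name `diff_sequences` and the example sequences are made up for illustration.

```python
def diff_sequences(working, broken):
    """Return (position, working_base, broken_base) for each mismatch.

    Assumes both sequences are already aligned and of equal length,
    so only single-base substitutions are detected.
    """
    if len(working) != len(broken):
        raise ValueError("sequences must be the same length")
    return [
        (pos, w, b)
        for pos, (w, b) in enumerate(zip(working, broken), start=1)
        if w != b
    ]


# Made-up example sequences, not real genomic data.
working_copy = "GATTACAGATTACA"
broken_copy = "GATTACTGATAACA"

for pos, w, b in diff_sequences(working_copy, broken_copy):
    print(f"position {pos}: {w} -> {b}")
```

Positions are reported 1-based, which is the usual convention when talking about coordinates on a reference sequence.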
Same as it ever was
We still have the same problems as in 1996
a rapidly changing situation
text manipulation to convert between data formats
building pipelines to glue data analysis programs together
A rapidly changing situation
MR Stratton et al., Nature 458, 719-724 (2009)
Many data formats
“a sea of incompatible data formats”
“[for each new piece of software] you could always count on it to sport its own idiosyncratic user interface and data format.”
Lincoln Stein, TPJ Vol 1 #2 Summer 1996
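As an illustration of the kind of text manipulation this involves, the sketch below (in Python) converts FASTA-formatted text into one tab-separated line per record. The choice of FASTA as input and tab-separated text as output is an assumption made for the example, as are the function name `fasta_to_tsv` and the sample data; the slides do not name specific formats.

```python
def fasta_to_tsv(fasta_text):
    """Convert FASTA-formatted text to one 'id<TAB>sequence' line per record."""
    records = []  # each entry is [identifier, sequence collected so far]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            fields = line[1:].split()
            records.append([fields[0] if fields else "", ""])
        elif records:
            # Sequence line: append to the most recent record.
            records[-1][1] += line
    return "\n".join(f"{name}\t{seq}" for name, seq in records)


# Made-up example input, not real data.
example = """>seq1 made-up example
GATTACA
GATTACA
>seq2
CCGGTTAA
"""

print(fasta_to_tsv(example))
```

Sequence lines that appear before any header are silently ignored here; a production converter would more likely report them as an error.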
Building pipelines
Initial data QC
Data QC
Submission to public archives
Sample reception
Library prep
Sequence ordering
Sequencing
Tracking
Genotype check
Library QC
Recalibration
Mapping to reference
Merging libraries
To collaborators
SNP calling
Structural variants
Filtering
Build release BAM files
Collaborator data
Visualization
Downstream analysis
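The gluing itself can be sketched as a list of named stages run in order, shown here in Python. This is only a toy under stated assumptions: the two stage names are borrowed from the labels above, but the stage functions, their rules, and the sample reads are hypothetical, and in a real pipeline each stage would more likely wrap an external analysis program than a one-line filter.

```python
def run_pipeline(stages, data):
    """Pass data through each (name, function) stage in order.

    Returns the final data plus a log of how many records survived each stage.
    """
    log = []
    for name, stage in stages:
        data = stage(data)
        log.append(f"{name}: {len(data)} records")
    return data, log


# Hypothetical stand-in stages; real stages would call external tools.
def initial_data_qc(reads):
    # Drop reads that contain an ambiguous base call ('N').
    return [r for r in reads if "N" not in r]


def filtering(reads):
    # Keep only reads of at least 5 bases (an arbitrary example threshold).
    return [r for r in reads if len(r) >= 5]


stages = [
    ("Initial data QC", initial_data_qc),
    ("Filtering", filtering),
]

# Made-up reads, not real data.
reads = ["GATTACA", "GANTACA", "ACG", "CCGGTTAA"]

result, log = run_pipeline(stages, reads)
for entry in log:
    print(entry)
```

Keeping the stages in a plain list makes it easy to insert, remove, or reorder steps as the situation changes.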
In conclusion
“Although it's not perfect, Perl fills the needs of the genome centers remarkably well, and is usually the first tool we turn to when we have a problem to solve.”