Complete Genomics

hostitchAI and Robotics

Oct 23, 2013



It’s a business model: sequence only human
genomes on a commercial scale, at the lowest
cost in the industry.


First mover.


It’s a technology platform, too.

Complete Genomics sold!!


CG sold to BGI this past week for $117 M!


Not surprising that there was no answer to the speaker request.


Will operate as an independent company, at
least for now.

History

Complete Genomics was established in June 2005 by Dr. Clifford Reid, Dr. Radoje (Rade) Drmanac, and
Mr. John Curson, and began operations in March 2006.

Our founders shared a vision to provide academic and biopharmaceutical researchers with whole human
genomic data and analysis at an unprecedented quality, cost and scale without requiring researchers to
invest in in-house sequencing instruments, high-performance computing resources and specialized
personnel. Complete Genomics makes human genome sequencing, data analysis and management
accessible for all scientific and medical researchers to conduct large-scale genomic research studies. It’s
through these studies that researchers understand the genetic basis of disease and gain insight into how we
might prevent, diagnose and treat genetic diseases.

Cliff Reid, chairman, president and chief executive officer of Complete Genomics, brought with him more
than 25 years of experience in startup and growth companies managing the commercialization of
innovative data management technologies. Rade Drmanac, Complete Genomics’ chief scientific officer, a
dedicated and talented scientist, is one of the founders of the field of genomics, and an early participant
in and grant recipient of the Human Genome Project. Rade was also one of the pioneers of a massively
parallel sequencing technique called sequencing by hybridization. In the company’s early stages of development, John
Curson provided the financial expertise to ensure the company was funded for growth and innovation.

One of the difficult challenges facing the genomics industry is improving our understanding of how genes
contribute to diseases that have a complex pattern of inheritance. For many diseases, multiple genes each
make a subtle contribution to a person’s predisposition or susceptibility to a disease or response to a drug
treatment protocol. We believe that unraveling this complex network will be critical to understanding
human health and disease.

History

Knowing that whole human genome sequencing needed to be done on a large number of samples in order to
provide researchers with meaningful perspectives into human diseases, our insight was that large-scale
genomic studies required a radically new business and scientific approach.

Large-scale disease studies require accurate and affordable sequencing at a high throughput, and a completely
new set of data management and analytics capabilities. The challenge was not just technological; it also required
the ability to bring the technology to market effectively.

Complete Genomics human-focused technology

Complete Genomics decided to focus exclusively on human DNA.

Creating a technology designed specifically for human sequencing ensured that it would be done with the
highest standards of accuracy and efficiency.
Because we have optimized our technology platform and our operations for the unique requirements of
high-throughput whole human genome sequencing, we are able to achieve accuracy levels of 99.9997% at a total
cost significantly less than the total cost of purchasing and using commercial DNA sequencing instruments.

Complete Genomics set out to make the process simple

CG decided to combine proprietary human genome sequencing technology with advanced informatics and
data management software to provide an innovative, end-to-end, outsourced service to customers. Researchers
receive highly accurate genomic data, assembled and annotated, ready for biological interpretation for their
large-scale disease projects. Our solution provides researchers with whole human genomic data and analysis at
an unprecedented quality, cost and scale without requiring them to invest in in-house sequencing instruments,
high-performance computing resources and specialized personnel.

Today, Complete Genomics' genome sequencing center, which began commercial operations in May 2010,
combines a high-throughput sample preparation facility, a collection of proprietary high-throughput
sequencing instruments, and a large-scale data center. Customers ship their samples via common carrier
services. We then sequence and analyze these samples and provide customers with highly accurate genomic
data, assembled and annotated, enabling them to focus exclusively on their single highest priority: biological
interpretation and discovery.



Integrated Services

DNA Isolation, Fragmentation, and Size Capture

Cells are lysed and DNA is extracted from the cell lysate. High-molecular-weight DNA, often
several megabase pairs long, is sonicated to break the DNA double strands at random
intervals. Bioinformatic mapping of the sequencing reads is most efficient when the sample
DNA falls within a narrow length range. Therefore, to select the ideal fragment lengths for
sequencing, the fragments are size-separated by polyacrylamide gel electrophoresis
(PAGE). DNA of suitable size range is purified by gel extraction, resulting in DNA with
lengths within a narrow range (typically 400–500 base pairs).
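The size-capture step can be pictured as a simple filter on fragment length. The sketch below is only an in-silico analogy for the gel cut, assuming the 400–500 bp window quoted above; the function name and the sample lengths are made up for illustration.

```python
def select_fragments(lengths, low=400, high=500):
    """Keep only fragment lengths (in bp) inside the inclusive window [low, high].

    The 400-500 bp default mirrors the range quoted in the text; the real
    selection is done physically by PAGE and gel extraction, not in software.
    """
    return [n for n in lengths if low <= n <= high]


# Hypothetical fragment lengths after sonication (illustrative values only).
fragments = [120, 410, 455, 500, 730, 399, 2500]
selected = select_fragments(fragments)
print(selected)
```

Fragments just outside the window (399 bp here) are discarded, which is what keeps the mate-pair spacing predictable later in the pipeline.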

Attaching adapter sequences

Adapter DNA sequences must be attached to the unknown
DNA so that DNA with known sequences flank the unknown
DNA. In the first round of adapter ligation, a right and left
adapter (Ad1) is attached to the right and left flanks of the
fragmented DNA, and the DNA is PCR amplified. The right
and left Ad1 are modified to create complementary single
strand ends that bind to each other and form circular DNA. A
restriction enzyme is added, which cleaves the DNA 13 bp to
the right of the right adapter. This results in linear double-stranded
DNA. Right and left adapter sequences (Ad2) are
ligated onto the ends of the linear DNA and the product is
PCR amplified. The Ad2 sequences are modified to allow
them to bind each other and form circular DNA. The
restriction enzyme is used again to cleave the circular DNA
13 bp to the left of Ad1. The result is a linear DNA fragment.
Right and left adapter sequences (Ad3) are ligated to the
right and left flank of the linear DNA and the product is PCR
amplified. The adapters are modified so that they bind to
each other and form circular DNA. The type III restriction
enzyme EcoP15 is added, which cleaves the DNA 26 bp to
the left of Ad3 and 26 bp to the right of Ad2. This step
removes a large segment of DNA and linearizes the DNA
again. Right and left adapters (Ad4) are ligated to the DNA,
the product is PCR amplified, and the Ad4 sequences are
modified so that they bind each other. The result is the
completed circular DNA template.

Rolling circle replication

Once a circular DNA template,
containing sample DNA that is
ligated to four unique adapter
sequences has been generated, the
full sequence is amplified into a
long string of DNA. This is
accomplished by rolling circle
replication with the Phi 29 DNA
polymerase which binds and
replicates the DNA template. The
newly synthesized strand is
released from the circular
template, resulting in a long single-stranded
DNA comprising several head-to-tail
copies of the circular
template. The four adapter
sequences contain palindromic
sequences which hybridize and
cause the single strand to fold onto
itself, resulting in a tight ball of
DNA approximately 300
nanometers (nm) across. This
allows the nanoballs to remain
separated from each other and
reduces any tangling between
different single stranded DNA
lengths.

DNA nanoball microarray

To obtain DNA sequence, the DNA nanoballs are attached to a microarray flow cell. The flow cell
is a 25 mm by 75 mm silicon wafer coated with silicon dioxide, titanium, hexamethyldisilazane
(HMDS), and a photoresistive material. The DNA nanoballs are added to the flow cell and
selectively bind to the aminosilane in a highly ordered pattern, allowing a very high density of
DNA nanoballs to be sequenced.

Substrate

Flow cell array density

The DNA nanoball sequencing flow cell (top) has a high density
of sequencing reads, with most positions occupied, compared
to other next-generation sequencing platforms (bottom).

Unchained sequencing by ligation

The order of the DNA bases between the adapter
sequences is determined after being arrayed onto a flow
cell. First, oligonucleotide anchor DNA that is
complementary to either the right or left end of one of the
adapters is added to the flow cell. Next, T4 DNA ligase is
added to a pool of four 10-mer DNA sequences that have
degenerate nucleotides in all but one position (for
example, position 1 next to the anchor), and the mixture is
added to the flow cell. The interrogative position in the
DNA probe contains an "A" nucleotide with a red
fluorophore attached, a "C" with a yellow fluorophore
attached, a "G" with a green fluorophore attached or a
"T" with a blue fluorophore attached. Only the probe that
has a complementary nucleotide in the interrogative
position will bind. The T4 DNA ligase attaches the probe
to the anchor, the non-binding probes are washed away,
and the fluorescence is detected. The probe/anchor is
removed from the DNA nanoball and another anchor is
added. A new pool of probes is added with a different
interrogative position. The correct probe hybridizes, is
ligated, rinsed and the fluorescence is read and recorded.
This process is repeated with all ten interrogation
positions next to an anchor sequence. Once all ten
positions are recorded, an anchor is added that binds to a
different adapter and the process is repeated to identify
the ten nucleotides next to that adapter.
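The read-out logic described above can be sketched in a few lines. This is a toy model, not Complete Genomics' software: it assumes perfect ligation, ignores strand orientation, and uses the color scheme given in the text (A red, C yellow, G green, T blue). All function names are hypothetical.

```python
# Fluorophore carried by the probe's interrogative base, as listed in the text.
COLOR_OF_BASE = {"A": "red", "C": "yellow", "G": "green", "T": "blue"}
BASE_OF_COLOR = {color: base for base, color in COLOR_OF_BASE.items()}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def ligate_and_image(template, position):
    """Return the color seen when one position next to the anchor is interrogated.

    Only the probe whose interrogative base is complementary to the template
    base at `position` binds and is ligated, so its fluorophore is what is imaged.
    """
    probe_base = COMPLEMENT[template[position]]
    return COLOR_OF_BASE[probe_base]


def read_adjacent_bases(template, n_positions=10):
    """Interrogate each position independently, then decode colors to template bases."""
    colors = [ligate_and_image(template, i) for i in range(n_positions)]
    return "".join(COMPLEMENT[BASE_OF_COLOR[c]] for c in colors)


template = "GATTACAGGC"  # made-up 10 bases next to an adapter
print(ligate_and_image(template, 0))
print(read_adjacent_bases(template))
```

Each call to `ligate_and_image` depends only on the position being read, never on the previous cycle, which is the sense in which the chemistry is "unchained".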

Simpler View

CG Instrument

Complete Genomics’ sequencing instrument consists of three loosely coupled standardized
subsystems:

• DNA nanoarrays, packaged into flow slides

• Standard liquid-handling robot. Each instrument can run 2 to 16 slides in parallel; while
one slide is imaging, the remaining slides are in various stages of preparation for imaging.

• High-speed imager: a four-color fluorescence microscope.

This modular design enables Complete Genomics to adjust components easily as specifications or
performance criteria change and to rapidly reconfigure hardware as technologies advance. Each of
its components may be independently upgraded as suppliers release new, improved versions.

Flow Slides

Complete Genomics has developed a powerful flow slide platform to minimize reagent
use and simplify fluorescence imaging. Micro-channels formed on top of the patterned
substrates enable efficient reagent delivery and eliminate dead volume while
simultaneously satisfying the optical requirements for high-resolution imaging. Process
capacity (DNA spots measured per cycle) may be increased by adding more flow slides
to the liquid handling deck. This addition ensures that increases in imager speed are
matched by increases in process capacity.

Imaging and Genome Assembly

After each DNA probe/ligation step, the flow cell is
imaged to determine which nucleotide base bound to the
DNA nanoball. The fluorophore is excited with an arc
lamp that radiates specific wavelengths of light towards
the flow cell. The wavelength of the fluorescence of
each DNA nanoball is captured on a high resolution
CCD camera. The image is then processed to remove
background noise and assess the intensity of each point.
The color of each DNA nanoball corresponds to a base
at the interrogative position. The order of 70 nucleotides
per DNA nanoball is determined using 80 images (8
rounds of 10 interrogative positions per round).
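As a rough illustration of the image-processing step, the sketch below calls a base for one nanoball from its four channel intensities after subtracting background. The thresholds, the intensity values, and the function name are assumptions for illustration; the actual pipeline is not described in this deck.

```python
# Color-to-base mapping taken from the probe description in the text.
CHANNEL_TO_BASE = {"red": "A", "yellow": "C", "green": "G", "blue": "T"}


def call_base(intensities, background=0.0, min_signal=0.0):
    """Call the base for one nanoball in one image.

    intensities: dict mapping channel name -> measured intensity.
    background:  value subtracted from every channel (assumed uniform here).
    min_signal:  if the brightest corrected channel is not above this, return "N".
    """
    corrected = {ch: max(value - background, 0.0) for ch, value in intensities.items()}
    brightest = max(corrected, key=corrected.get)
    if corrected[brightest] <= min_signal:
        return "N"  # no confident call for this spot
    return CHANNEL_TO_BASE[brightest]


# Hypothetical intensities for a single spot (arbitrary units).
spot = {"red": 12, "yellow": 250, "green": 30, "blue": 18}
print(call_base(spot, background=10))
```

Repeating this call over every spot in each of the 80 images yields the per-nanoball reads described above.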

In generating the circular template, a large segment of the original 400–500 base pair fragment was replaced
with the adapter Ad4. The 70 bp that are sequenced are therefore the first 35 bp of the original 400–500 bp
fragment, and the last 35 bp of the 400–500 bp fragment. Therefore, the sequence is identified for two 35 bp
reads of DNA separated by about 330–430 bp. These 35 bp reads are compared, using bioinformatics, to a
reference genome and assigned to a genetic locus. Massively parallel genome mapping is accomplished through
high coverage of the reference nucleotide positions, and the complete genome of the DNA sample is
assembled. Single base-pair mismatches between the sequenced reads and the reference sequence are used to
identify possible single nucleotide polymorphisms (SNPs). In addition, mapping of one 35-bp portion of the
mate pair may identify DNA insertions and deletions (indels) via bioinformatics algorithms that detect possible
discrepancies between the mate pairs.
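Two pieces of the reasoning above are easy to check with a short sketch: the spacing between the two 35 bp reads, and the use of mismatches against the reference as SNP candidates. The code assumes the read has already been placed at a known reference position; the sequences and function names are invented for illustration and do not represent the actual mapper.

```python
def mate_gap(fragment_len, read_len=35):
    """Unsequenced distance between the two end reads of one fragment (bp)."""
    return fragment_len - 2 * read_len


def mismatches(read, reference, start):
    """List (reference_position, reference_base, read_base) for each mismatch
    when `read` is aligned without gaps at `start` on `reference`."""
    window = reference[start:start + len(read)]
    return [
        (start + i, ref_base, read_base)
        for i, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]


# 400-500 bp fragments with two 35 bp reads leave a gap of 330-430 bp.
print(mate_gap(400), mate_gap(500))

# Toy reference and read (made up); one mismatch marks a candidate SNP.
reference = "TTACGTACGTGG"
read = "ACGTTCGT"
print(mismatches(read, reference, start=2))
```

A single read mismatch is only a candidate; in practice a call would rest on many overlapping reads agreeing at the same position, which is why high coverage matters.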

Complete Genomics Analysis Tools

Complete Genomics Analysis Tools (CGA™ Tools) are a set of open source
software tools for downstream analysis of sequencing data produced by Complete
Genomics. These tools allow multi-genome comparisons to be performed, or can
convert our native file formats to SAM or BAM formats for easier use of other open
source tools. Customers use these tools to conduct various analyses including
family-based analysis or case-control analysis.

Analysis functionality includes:


• Genome comparison tools: compare different types of variant calls between genomes

• Format conversion tools: export data into other standard formats for analysis

• Filtering and annotation tools: manipulate variant files

• Reference tools: help build a copy of the reference genome for other CGA Tools analyses

CGA Tools are available as pre-compiled binary distributions for 64-bit Linux and
Mac OS X platforms. In addition, the CGA Tools C++ APIs can be used on these
platforms by installing the source code.

Download CGA Tools and find additional documentation here.


Advantages


• Use of very high-density arrays. The array design permits one DNA nanoball to attach to each
pit that is part of an ordered array, and therefore a higher concentration of DNA can be added.
This allows a high percentage of the pits to be occupied by a DNA nanoball, thus maximizing
the number of reads per flow cell compared to other sequencing arrays, where molecules of
DNA are added to a flow cell in a random orientation.

• Sequencing reactions are non-progressive; after each reading of the probe, the probe and
anchor are removed and a new anchor and probe set are added. Therefore, if a probe did not
bind in the previous reaction, this has no effect on the next probe ligation, thus eliminating a
major source of reading error that may occur in other next generation sequencing platforms. It
also reduces the use of expensive probes, since DNA nanoball sequencing does not require
the probe ligation reaction to be run to completion.

• Other advantages of DNA nanoball sequencing include the use of high-fidelity Phi 29 DNA
polymerase to ensure accurate amplification of the circular template; several hundred copies
of the circular template compacted into a small area, resulting in an intense signal; and
attachment of the fluorophore to the probe at a long distance from the ligation point, which
improves ligation.

Disadvantages


• The main disadvantage of DNA nanoball sequencing is the short read
length of the DNA sequences obtained with this method. Short reads,
especially from DNA rich in repeats, may map to two or more
regions of the reference genome.

• A second disadvantage of this method is that multiple rounds of PCR
have to be used. This can introduce PCR bias, and possibly amplify
contaminants in the template construction phase.
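The multi-mapping problem with short reads can be shown on a toy reference. The sketch below uses exact string matching only, with a made-up sequence containing a short tandem repeat; real mappers tolerate mismatches and use indexes, so this is an illustration of the ambiguity, not of any production algorithm.

```python
def find_all(reference, read):
    """Return every start position where `read` matches `reference` exactly,
    including overlapping matches."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


# Hypothetical reference with a CA repeat followed by unique sequence.
reference = "GGCACACACACATTGACCTA"

print(find_all(reference, "CACA"))    # read drawn from the repeat: several loci
print(find_all(reference, "TTGACC"))  # read drawn from unique sequence: one locus
```

The read from the repeat cannot be assigned to a single locus, whereas the read from unique sequence can; mate-pair spacing is one source of extra evidence for resolving such cases.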

CG Whole Human Genome Sequencing Service


Complete Genomics provides barcoded, 96-well plates enabling safe, convenient
shipping of your genomic DNA samples to our sequencing facility. Complete Genomics
recommends 10 micrograms of genomic DNA (minimum of 7.5 micrograms required)
and performs inbound sample quality control (QC) including confirmation of gender and
sample matching, library construction, sequencing, assembly, analysis, and data QC for
each genome sample. Complete data sets are then uploaded to Amazon Web Services
for direct delivery on hard disk drives.


Approximately three to four months after sample acceptance, customers receive
assembled and annotated sequence data on a hard disk drive. The data consist of a
minimum of 40X average coverage or approximately 120 Gigabases (Gb) per sample,
with over 90% of the genome fully called (both alleles called). For the High Coverage
Product, double the data is delivered, resulting in a minimum of 80X average coverage
or approximately 240 Gb per sample. In addition to data delivery on a hard drive,
Complete Genomics stores the entire data set for 30 days as part of the standard service.
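The coverage figures above follow from simple arithmetic: delivered bases ≈ average coverage × genome size. The sketch below assumes a genome size of about 3 Gb, which is the value implied by 120 Gb at 40X; the constant and function name are illustrative.

```python
# Assumption: ~3 Gb human genome, as implied by 120 Gb / 40X in the text.
GENOME_SIZE_GB = 3.0


def delivered_gigabases(coverage, genome_size_gb=GENOME_SIZE_GB):
    """Approximate sequence delivered (Gb) for a given average coverage."""
    return coverage * genome_size_gb


print(delivered_gigabases(40))  # standard product
print(delivered_gigabases(80))  # High Coverage Product
```

Doubling the coverage doubles the delivered data, which matches the 120 Gb and 240 Gb figures quoted for the two products.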


Also provided are analytical tools enabling researchers to rapidly conduct further
analysis and compare genomic data, resulting in an expedited and standardized approach
to their genomic studies. Immediate access to this information provides rapid
identification of genes and pathways modified in human disease with minimal use of
in-house computational resources.


As of March 2012, CG’s genome center has the capacity to sequence more than 600
genomes per month with accuracy levels of 99.9998%.

CG Service Deliverables


Complete Genomics Cancer Sequencing Service


• Whole genome sequencing of cancer-normal pairs and trios using 3.5 micrograms of
DNA minimum, 5 micrograms recommended, as of Aug. 2012.

• High coverage for all samples (≥ 80x guaranteed)

• Algorithms in our Analysis Pipeline are tailored to handle the complexities of cancer
samples, which are typically heterogeneous, aneuploid, and may include stromal
contamination.

• Specialized bioinformatics to accommodate heterogeneity and copy number
aberrations. All germline and somatic variants are called, facilitating a much greater
understanding of the genetic basis of cancer.

• Comprehensive, genome-wide reporting of SNPs, indels, CNVs, structural variations,
and mobile element insertions

• Optimized paired analysis identifying somatic variations

• Informative scores and annotations to help prioritize identified variations

• The highest exome call rate (median of >98% delivered in the first half of 2012)

With the Complete Genomics Cancer Sequencing Service, cancer pairs and trios are
confirmed to match prior to sequencing, to ensure efficient processing of the correct
samples. Advanced algorithms increase the power to detect allele variants in complex
genomes where tumor heterogeneity, widespread aneuploidy, and normal tissue
contamination may be present. Reports highlight somatic small variants, CNVs, and
structural variations, which are summarized in VCF files to simplify downstream analysis.

Long Fragment Read (LFR) Technology


In July 2012, CG published a paper in Nature, in collaboration with George Church
and others, describing a low-cost DNA sequencing and haplotyping process,
long fragment read (LFR) technology, which is similar to sequencing long
single DNA molecules without cloning or separation of metaphase
chromosomes. In this study, ten LFR libraries were made using only
~100 picograms of human DNA per sample. Up to 97% of the heterozygous
single nucleotide variants were assembled into long haplotype contigs.
Removal of false positive single nucleotide variants not phased by multiple
LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases.
Cost-effective and accurate genome sequencing and haplotyping from 10–20
human cells, as demonstrated here, will enable comprehensive genetic studies
and diverse clinical applications.


Nature 487, 190–195 (12 July 2012)


This advance shows that it should be possible to achieve clinical quality and
scale in personal genome sequencing of microbiopsies and circulating cancer
cells.
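To put the LFR error rate of 1 in 10 megabases in perspective, the sketch below converts it into an expected number of errors per genome. It assumes a genome of about 3 billion bases, which is an assumption made here and is not stated in the abstract; the function name is illustrative.

```python
def expected_errors(bases_per_error, genome_size_bases=3_000_000_000):
    """Expected number of erroneous bases in one genome, given one error
    per `bases_per_error` bases. The 3 Gb default is an assumed genome size."""
    return genome_size_bases / bases_per_error


# LFR result quoted above: 1 error in 10 megabases.
print(expected_errors(10_000_000))
```

Under that assumption the quoted rate corresponds to roughly a few hundred erroneous bases per genome.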