Department of Biomedical Informatics - Bioinformatics

hordeprobableBiotechnology

Oct 4, 2013 (4 years and 10 days ago)

126 views

Department of

Biomedical Informatics

Next Generation Sequencing Data
Processing for
Epigenetics

Research

Kun Huang

Department of Biomedical Informatics

OSUCCC Biomedical Informatics Shared Resource

The Ohio State University Medical Center


IEEE BIBM 2010 Tutorial


Department of

Biomedical Informatics

2

Outline


Overview of NGS


Sequence Alignment


Peak Finding


Data
Management/Computing/Visualization


Summary


Department of

Biomedical Informatics

The

sequencing

of

the

equivalent

of

an

entire

human

genome

for

$
1
,
000

has

been

announced

as

a

goal

for

the

genetics

community,

and

new

technologies

suggest

that

reaching

this

goal

is

a

matter

of

when
,

rather

than

if
.



what

would

you

do

if

this

sequencing

capacity

were

available

immediately?


http://www.nature.com/ng/qoty/index.html

NGS
-

paradigm shifting

Department of

Biomedical Informatics

4

Instruments


Illumina
/
Solexa

Genome Analyzer/GAII/HiSeq2000


Short sequences


36
-
150nt


Total throughput


40~50Gb (GAII), 200Gb (HiSeq2000)


Life Technology (ABI)/
SOLiD

2/3/4/HQ


Short sequences


50
-
75nt


Total throughput
-

~50Gb (
SOLiD

4), 300Gb (HQ)


Roche/454 GS FLX Titanium/GS Junior


Long reads


average 400nt


Total throughput


400
-
600Mb (FLX), 35Mb (Junior)


Department of

Biomedical Informatics

5

Applications


RNA
-
seq


Transcriptome


Splice variants


Gene fusion


Allelic expression imbalance


Functional SNPs


Small/non
-
coding RNA


ChIP
-
seq


Protein
-
DNA binding


Epigenetics


DNA
-
methylation


Bi
-
sulfite sequencing


MIRA
-
seq


DNA
-
resequencing


Structural variations


SNP,
CNV,
InDel
,
Transposon
, etc.


De Novo sequencing


Translation


Department of

Biomedical Informatics

6

Considerations of Choosing
NGS


Cost $$$


$$


$


Sequencing depth


Directionality


E.g., microRNA


Pair
-
end/mate
-
pair


Experimental protocol


E.g., RNA
-
seq protocol


poly A fishing, ribominus, DNS


Currently NGS is high output, but not really high
-
throughput. It not only takes longer time to run the
instrument, it also takes
much longer time
to analyze.

Department of

Biomedical Informatics

Sequencing

Sequence Files

Quality Scores

Department of

Biomedical Informatics

8

Sequence Files


~20
-
40 million
sequences per lane

Department of

Biomedical Informatics

9

Typical Alignment Results

Department of

Biomedical Informatics

10

NGS analysis workflow

Primary analysis

Secondary

analysis

Tertiary analysis




Image acquisition and bead

processing


Quality metrics


Color calls


䉡獥Bc慬ls



卥煵敮捥 慬i杮浥nt


卥煵敮捥 stats


Consensus calling


Create QC files


T慧 c潵oti湧

䅰灬icati潮
-
獰散楦ic

慮慬y獩s



-
獥煵敮捩湧


De novo sequencing


ChIP
-
seq


坨潬攠tr慮獣sipt潭o


䑎䄠浥th祬yti潮




Sequencing Core

Bioinformatics Core/Research Team

Department of

Biomedical Informatics

11

Computational Challenges


Large data size


High precision


Experimental protocol and bias


Sequence alignment (all but
de novo
sequencing)


Peak / enrichment finding (ChIP
-
seq, RNA
-
seq
)


Variation detection (RNA
-
seq
, re
-
sequencing)


Assembly


Visualization (all)


Data management and computing (all)

Department of

Biomedical Informatics

12

Outline


Overview of NGS


Sequence Alignment


Peak Finding


Data Management and Computing


Summary


Department of

Biomedical Informatics

Alignment strategy


BLAST (Basic Local Alignment Search Tool)


BLAT (BLAST
-
Like Alignment Tool)

Sanford et al. Genome Research, 2009

Department of

Biomedical Informatics

14

Mapping Short Reads to
Reference Genome


Mapping Considerations


Need to allow mismatches and gaps


SNP locations


Sequencing errors


Reading errors


Use of quality scores


Use of SNP knowledge


Indexing and hashing


Genome


Reads

Department of

Biomedical Informatics

15

Tradeoffs


Limiting the number of allowed mismatches


Ignoring insertions and deletions or limiting their number
and length


Ignoring base quality score information


Limiting the number of reported matching locations


Imposing constraints on read length


Ignoring information about errors particular to each
sequencing technology


Department of

Biomedical Informatics

16

SOAP (Li
et al
, 2008)


Both reads and reference genome are converted to numeric data
type using 2
-
bits
-
per
-
base coding


Load reference genome into memory


For human genome, 14GB RAM is required


300(gapped) to 1200(ungapped) times faster than BLAST


2 mismatches or 1
-
3bp continuous gap


Errors accumulate during the sequencing process


Much higher number of sequencing errors at the 3’
-
end
(sometimes make the reads unalignable to the reference genome)


Iteratively trim several basepairs at the 3’
-
end and redo the
alignment


Improve sensitivity



Department of

Biomedical Informatics

17

Mapping Procedure


Two steps:


Data or genome transform


Hashing table


Borrows
-
Wheeler transform


Mapping


Table lookup or index search

Department of

Biomedical Informatics

18

Mapping Methods


ELAND and ELAND Extended(Cox, unpublished)


“Efficient Large
-
Scale Alignment of Nucleotide Databases” (Solexa Ltd.)


SeqMap (Jiang, 2008)



Mapping massive amount of oligonucleotides to the genome”


RMAP (Smith, 2008)


“Using quality scores and longer reads improves accuracy of Solexa read
mapping”


MAQ (Li, 2008)


“Mapping short DNA sequencing reads and calling variants using
mapping quality scores”


Bowtie (Langmead et al 2009, Genome Biol, 10:R25)



Department of

Biomedical Informatics

19

ELAND (Cox, unpublished)


Commercial sequence mapping program comes with
Illumina
/
Solexa

sequencer



Allow at most 2 mismatches


Map sequences up to 32
nt

in length


All sequences have to be same length



ELAND Extended
allows up to six mismatches but only
two mismatches in the first 32
nt


Department of

Biomedical Informatics

20

RMAP (Smith
et al
, 2008)


Improve mapping accuracy


Possible sequencing errors at 3’
-
ends of longer reads


Base
-
call quality scores



Use of base
-
call quality scores


Quality cutoff


High quality positions are checked for mismatches


Low quality positions always induce a match


Quality control step eliminates reads with too many low quality
positions



Allow any number of mismatches

Department of

Biomedical Informatics

21

Mapping for Bi
-
sulfite
Seq


RMap
, MAQ


BSMAP





BS Seeker

Xi and Li,
BMC
Bioinformatics
2009, 10:232

Chen,
Cokus

and
Pellegrini
,
BMC Bioinformatics 2010, 11:203

Department of

Biomedical Informatics

TopHat

Trapnell et al. Bioinformatics 2009

Department of

Biomedical Informatics

After TopHat


You got this:







But you want this:

Cufflinks

Department of

Biomedical Informatics


Assigning each reads to
its potential isoform by
maximizing a function
that assigns a
likelihood to all
possible sets of relative
abundances of the
different isoforms.



Open source software

Trapnell et al. Nat. Biot 2010

Cufflinks

Department of

Biomedical Informatics

25

Outline


Overview of NGS


Sequence Alignment


Peak Finding


Data Management and Computing


Summary


Department of

Biomedical Informatics

26

PeakFinder


PeakFinder


Fejes et al, Bioinformatics, 2008


Monte Carlo based false discovery ratio

Department of

Biomedical Informatics

Analysis


Kharchenko

et al, Nature Biotechnology 26:12:1351
-
1359


QuEST

Department of

Biomedical Informatics

The BELT algorithm

Frietze

and
Lan

et al JBC 2010 ,285:1393
-
1403.

Department of

Biomedical Informatics

29

Peak Finding and Motif Analysis


Determining the exact binding sites from short reads
generated from ChIP
-
Seq experiments


SISSRs (Site Identification from Short Sequence Reads) (Jothi
2008)


MACS (Model
-
based Analysis of ChIP
-
Seq) (Zhang
et al
, 2008)


Department of

Biomedical Informatics

30

Peaks Are Not Everything


Department of

Biomedical Informatics

31

Multiscale

Approach


Many proteins are prevalent on the genome


E.g., histones, RNA Pol II


Multiscale approach


E.g., Zang et al, Bioinformatics 2009

Department of

Biomedical Informatics

32

Histone

Coding


Barski

et al,
Cell 129:823
-
837,
2007.

Department of

Biomedical Informatics

33

Histone

Binding Patterns

Heintzman

et al,
Nature Genetics 39:311
-
318, 2007.

Department of

Biomedical Informatics

34


Department of

Biomedical Informatics

35


TS
S

TS
S





TS
S

TS
S

















TS
S

TS
S

TS
S

TS
S

Department of

Biomedical Informatics

36


Asymptotic distribution of the test statistics (correlation)
is approximated using simulation and 95% is used as the
threshold to determine significant correlation.


Scan genomic regions (Pol II)

Scan genomic regions (H3K4me2)

Department of

Biomedical Informatics

37


Total

p
aired H3K4me2
and
Pol

II peaks

Within 10kb of
RefSeq

genes

Within 10kb

of
microRNAs

Others

40110

23443

161

16863

Examples of paired H3K4me2 (top) and
Pol

II (down)
peaks

Department of

Biomedical Informatics

38

Data Integration Using Bayesian
Network

Inferring causal relationships among different
histone

modifications and gene expression,
Yu et al, Genome Research, 2008

Department of

Biomedical Informatics

39

Pipelines and tools


Illumina

-

GERALD and CASAVA


ELAND is part of GERALD pipeline


TOPHAT will be included in CASAVA in the coming version


Life Technologies (ABI)


BioScope


Cistrome


Galaxy


R/
Bioconductor

packages


Commercial ones

Department of

Biomedical Informatics

40

Outline


Overview of NGS


Sequence Alignment


Peak Finding


Data Management and Computing


Summary


Department of

Biomedical Informatics

41

Mapping Example


Hash table
construction
using sliding
window



Table lookup
to find
matches for
each read


Department of

Biomedical Informatics

42

Modeling Run Time Costs

c
g

: Time to hash a single
genome subsequence

G
: Size of genome

c
r

: Time to process a single
read if no collision

c
c

: Time to resolve a collision

R
: Number of reads

N
: Number of computation
nodes

Department of

Biomedical Informatics

43

Partition Reads Only

(PRO)


Partition reads into
N equal parts.


Useful when R is
large and G is
small.


Memory
requirement does
not scale


Department of

Biomedical Informatics

44

Partition Genome Only

(PGO)


Partition genome
into N equal parts


Useful when G is
large and R is small.


Memory
requirement scales
perfectly

Department of

Biomedical Informatics

45

Partition Reads and Genome (PRG)


A generalization
of PRO and PGO


Nodes are
arranged in
N=N
R
xN
G
mesh


Useful unless
G>>R or G<<R


Memory scales
worse than PGO,
but better than
PRO


Department of

Biomedical Informatics

46

Suffix Based Assignment (SBA)

c
gs

: Time to compare a
genome sequence
against suffixes

c
rs

: Time to compare a read
against suffixes



Under perfect balance G
and R are partitioned
equally


Limited scalability due
to
c
gs

and
c
rs

terms


Useful for medium
values of N


Memory requirement
scales well



Department of

Biomedical Informatics

47

Varying Number of Nodes


G: 800M, R: 130M,
N: (4, 16, 64)


Up to 22x speedup:
From a day to an hour!

Department of

Biomedical Informatics

48

Experiment Using Amazon EC2


Cloud

computing



Amazon

Elastic

Computing

Cloud

(EC
2
)


Low
-
cost

:

pay

per

use


Easy

to

maintain

and

set

up


Mapping

7
.
8

million

short

reads

to

the

human

genome

in

less

than

0
.
5

hour

for

less

than

$
4

Department of

Biomedical Informatics

49


Automation pipeline: from Sequencer to
Supercomputer


Python scripts based system


Data
portal:
QUEST


Metadata based query


Data access management and sharing


Integration with
GenePattern

and Integrated Genome
Browser


Integration

Illumina

sequencers

Local HP
Buffer Server

QUEST Server
and

Data Storage
Servers

OSC Glenn Cluster

(> 9000 CPUs). GERALD,
CASAVA, BLAST, RMap,
etc.

Visualization server
(UCSC Genome
Browser, IGV)

BMI Cluster (576
CPU) . Velvet,
TOPHAT,
RMap
,
GMap
,
PolyBayes
,
Genome Studio

Amazon EC2
cloud computing
system

Data Management and Sharing

Department of

Biomedical Informatics

50

Visualization


Whole Genome
View

Department of

Biomedical Informatics

51

Visualization


Inter
-
chromosome Relationship

Department of

Biomedical Informatics

52

Pol II Distribution with Histone
Markers

Landscapes of H3K4me2 (left) and Pol II (right) binding profiles

Department of

Biomedical Informatics

53

Outline


Overview of NGS


Sequence Alignment


Peak Finding


Data Management and Computing


Summary


Department of

Biomedical Informatics

54

Challenges


Large data size


Illumina

GAII to
HiSeq

(40Gb to 200
Gb

per run)


Network bandwidth problem


File transferring protocol


Accurate metadata is important


Challenges in sharing data analysis algorithms, software
and workflows


Public software may not support the
file format and
may only be
useful for routine/general analysis


Linking existing software with the domain specific components is
NOT easy


GenePattern

provides a possible solution for organizing the
workflow for codes from heterogeneous sources


QUEST
integrates
GenePattern

Department of

Biomedical Informatics

55


Helicose


Projection from Pacific Science: in five years, 200
Gbases/hour


What about “next generation
NGS”?

http://www.rithme.eu

Carlson, www.synthesis.cc

Department of

Biomedical Informatics


NCI CCSB center


Dr. Tim Huang (OSU)


Dr. Jeffrey
Parvin

(OSU)


Dr.
Shili

Lin (OSU
)


Dr.
Umit

Catalyurek


Dr.
Hatice

Gulcin

Ozer

(OSU)


Terry
Camerlengo

(OSU)





Dr
.
Cenny

Taslim

(OSU)


Dr
. Victor Jin (OSU)


Dr.
Yunlong

Liu (IUPUI
)


Dr.
Umit

Catalyurek

(OSU)


Dr.
Durek

Bozdag

(OSU)






Acknowledgement

Thank You!

Position available

We

have

postdoctoral

researcher

openings

for

analyzing

large

biomedical

data

including

NGS,

microarray,

and

large

images
.

Interested

candidates

please

e
-
mail
:

kun
.
huang@osumc
.
edu