Lecture 2. Topics in Next-Generation Sequencing

The Chinese University of Hong Kong

CSCI5050 Bioinformatics and Computational Biology

Lecture outline

1. Sequencing and next-generation sequencing

2. Standard data processing

a. Data preprocessing

b. Sequence alignment

c. Sequence assembly

3. Applications

4. Specific data processing and analysis

Last update: 10-Sep-2013

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2013

SEQUENCING AND NEXT-GENERATION SEQUENCING

Part 1

The sequencing problem


Input: a sample containing some DNA

Output: the exact content of the DNA (i.e., the strings of As, Cs, Gs and Ts)

Remarks:

The DNA usually comes from multiple cells. If the DNA differs between cells, the sequencing result will be an average over them.

If the amount of DNA is small, more copies may need to be made by an experimental procedure called “amplification”. This could affect the results if quantity is important.


Sequencing experiments: Basic ideas


Use one strand as template, grow the other strand

Different ways to detect which nucleotide is added. For example,

Give a different color for each type of nucleotide

Supply only one type of nucleotide at a time, and see if some signals (e.g., light) can be detected

Stop whenever a certain nucleotide is added. Then deduce the nucleotide by DNA lengths (Sanger sequencing)

Can only handle up to a certain length of DNA

Need to break down a DNA into small fragments if it is too long


Sanger sequencing


Low-throughput, but accurate and can handle up to 1000bp

Still standard for small-scale laboratory use

Image credit: the-scientist.com

Components:

DNA to be sequenced

Primer

Free nucleotides that allow further extension (dNTP): N = A, C, G or T; all four types are present

Free nucleotides that terminate extension (ddNTP): N = A, C, G or T; only one type is present

DNA polymerase

See these videos for animations:

http://www.youtube.com/watch?v=oYpllbI0qF8

http://www.youtube.com/watch?v=6ldtdWjDwes
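The chain-termination idea behind Sanger sequencing can be illustrated with a toy simulation. This is only a sketch, not a model of the chemistry: it assumes four separate reactions (one per ddNTP), that every possible terminated fragment is observed, and that fragment lengths are measured exactly. The function names are made up for this illustration.

```python
def fragment_lengths(new_strand):
    """For each ddNTP reaction, list the lengths at which extension can stop.

    In the reaction containing ddA, a growing strand can terminate at every
    position where an A is incorporated, and similarly for C, G and T.
    """
    lengths = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lengths[base].append(position)
    return lengths


def read_sequence(lengths):
    """Deduce the sequence by ordering fragments from all reactions by length."""
    base_at_length = {
        length: base for base, stops in lengths.items() for length in stops
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


lengths = fragment_lengths("GATTACA")
print(lengths)                 # fragment lengths seen in each reaction
print(read_sequence(lengths))  # strand recovered from the lengths alone
```

Sorting all fragments by length plays the role of the gel or capillary separation: the reaction in which a fragment of a given length appears tells us the base at that position.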

Next-generation sequencing

“Next-generation” refers to high-throughput methods, as compared to low-throughput Sanger-like sequencing

Also called “deep sequencing” or “massively parallel sequencing”

“Third-generation” is already underway

Motivated by the large size of the human genome (i.e., the set of all chromosomes, about 3 billion base pairs) and the high sequencing cost

Key idea: parallelization

Some next-generation sequencing methods

Last update: 11-Sep-2013

Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010)

Naming convention:

Roche: Company

454: Sequencing method

GS FLX Titanium: Machine type

See these videos for details:

Pyrosequencing: http://www.youtube.com/watch?v=nFfgWGFe0aA

Solexa: http://www.youtube.com/watch?v=77r5p8IBwJk

SOLiD: http://www.youtube.com/watch?v=nlvyF8bFDwM

Note: numbers are from a 2010 paper

More updated comparison tables

Image credit: Liu et al., Journal of Biomedicine and Biotechnology 2012:251364, (2012)

Some next-generation sequencing methods

Platform for library/template preparation

Droplet

Solid-phase

(single molecule, no amplification)

Immobilization

Primer

Template

Polymerase

Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010)

Some next-generation sequencing methods

Chemistry for identification of nucleotide:

Reversible dye-terminators: terminating base → fluorescence → removal of terminating group

Pyrosequencing: pyrophosphate, which fuels a reaction to give out visible light

Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010)

Sequencing a long DNA


Cut the long DNA into shorter fragments, using one of

Restriction enzymes that recognize specific sequences

Mechanical shearing

Acoustic waves

Sequence one or both ends of the fragments

Determine the original DNA from the sequenced fragments

Shotgun sequencing


Difficult to keep track of the order of fragments


Shotgun: random fragmentation


See
http://www.youtube.com/watch?v=vg7Y5EeZsjk

Whole genome shotgun

Hierarchical approach: slightly easier to get back the original sequence

Image credit: Jennifier et al., Biological Procedures Online 11(1):52-78, (2009)

Major milestones in sequencing

Genome | Type | Size | Completed year | Time needed | Cost (USD)
--- | --- | --- | --- | --- | ---
Bacteriophage MS2 | Virus (RNA) | 3,569nt | 1976 | ? | ?
Bacteriophage ΦX174 | Virus (DNA) | 5,386bp | 1977 | ? | ?
Haemophilus influenzae | Bacteria | 1.8Mb | 1995 | ? | ?
Saccharomyces cerevisiae | Fungus (yeast) | 12.1Mb | 1996 | ? | ?
Caenorhabditis elegans | Nematode (worm) | 100Mb | 1998 | ? | ?
Arabidopsis thaliana | Plant | 157Mb | 2000 | ? | ?
Homo sapiens | Mammal (human) | 3.2Gb | 2003 | 15 years | 3B
Craig Venter | Mammal (human) | 2.8Gb | 2007 | 5 years | 100M
James Watson | Mammal (human) | 6Gb (diploid) | 2008 | 4 months | 1.5M
YanHuang 1 (Chinese) | Mammal (human) | ~3Gb | 2008 | 2 months | 0.5M
Neanderthal | Mammal | 3.2Gb | 2010 | 4 years | 6.4M
Anyone | Mammal (human) | ~3Gb | 2011 | 1 week | 10K

Cost and number of human genomes sequenced


Estimates only


Image source: http://www.existencegenetics.com/fullgenome.php

STANDARD DATA PROCESSING

Part 2

From images to formatted data


Image credit: Geospiza

Sequencing reads produced


Sequences:

Single-end sequencing: one sequencing read (i.e., a short string) per fragment

Paired-end sequencing: two sequencing reads per fragment

Quality score:

How reliable each sequenced base is

While sequencing is quite reliable, errors do occur

[Figure labels: read length, insert size, fragment length, mate pair]

Note: Some define insert size as the same as fragment length

Quality score


High-throughput sequencing has a relatively high error rate

Based on the sequencing signal, the sequencing machine can estimate an error probability p for each base call

A corresponding quality score can be defined.

One commonly used quality score is the Phred quality score q:

$q = -10 \log_{10} p$

q can take values from 0 (p = 1) to infinity (p = 0)

Higher q → better base quality

In practice, a Phred score of 30 (p = 0.001) or more indicates good quality

There are other types of quality scores
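The relation between the error probability p and the Phred quality q can be checked numerically. The following is a minimal sketch; the function names are chosen here for illustration only.

```python
import math


def phred_quality(p):
    """Phred quality score q = -10 * log10(p) for an error probability p > 0."""
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse mapping: the error probability p implied by a Phred score q."""
    return 10 ** (-q / 10)


for p in (1.0, 0.1, 0.01, 0.001):
    print(p, phred_quality(p))
```

Each additional 10 points of q corresponds to a ten-fold lower error probability, which is why q = 30 (one expected error in 1000 base calls) is a common practical threshold.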

The FASTQ file format


See http://genome.ucsc.edu/FAQ/FAQformat.html for a list of commonly used file formats in genomics

FASTQ: read sequences and quality scores (see http://maq.sourceforge.net/fastq.shtml)

This is the “raw data” bioinformaticians deal with

We seldom need to work on the raw images directly

Another famous file format for genomics is the FASTA format, mainly for sequences. FASTQ is like FASTA + quality

FASTQ


Each sequence occupies four lines:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1: @, followed by sequence ID and descriptions

Line 2: the sequence

Line 3: +, optionally followed by sequence ID and descriptions

Line 4: quality scores (mapped to ASCII characters)

Standard: <score> = <ASCII number of character> - 33

E.g., ‘!’ has an ASCII number of 33, which means the first base has a quality score of 33 - 33 = 0, i.e., very bad (p = 1 for Phred score)

Illumina has a different standard (with different versions)

Example source: Wikipedia
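To make the four-line layout and the ASCII offset concrete, here is a minimal sketch of a FASTQ reader applied to the example record. It assumes the standard offset of 33 and records that are not wrapped over multiple lines; the function names are illustrative and this is not a replacement for an established parser.

```python
def parse_fastq(lines):
    """Yield (seq_id, sequence, quality_string) from an iterable of FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        yield header[1:], sequence, quality


def phred33_scores(quality):
    """Decode a quality string: score = ASCII number of character - 33."""
    return [ord(char) - 33 for char in quality]


RECORD = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

for seq_id, sequence, quality in parse_fastq(RECORD):
    scores = phred33_scores(quality)
    print(seq_id, len(sequence), scores[:5])
```

For files that use one of the older Illumina encodings, the offset would differ, so the value 33 should be treated as a parameter of the data rather than a constant.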

Data preprocessing


Read level:

Removing adapter sequences

Removing poly-A tails (for RNA data)

Filtering of reads of low quality

Trimming of reads with low quality ends

Global level:

Checking fraction of reads that pass quality thresholds

Comparing distribution of read lengths with expectation

Checking other application-specific statistics (e.g., distribution of nucleotides)
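Some of the read-level steps can be sketched in a few lines. The code below is a simplified illustration, not how production trimmers work: it assumes exact adapter matches, a fixed quality threshold of 20, and already-decoded Phred scores. The adapter string in the usage example and all function names are hypothetical.

```python
def remove_adapter(sequence, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = sequence.find(adapter)
    return sequence if index == -1 else sequence[:index]


def trim_low_quality_ends(sequence, scores, min_quality=20):
    """Remove bases from both ends while their quality is below the threshold."""
    start, end = 0, len(sequence)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], scores[start:end]


def passes_filter(scores, min_mean_quality=20, min_length=1):
    """Keep a read only if it is long enough and good enough on average."""
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_quality


print(remove_adapter("ACGTTTAGATCGG", "AGATCGG"))
print(trim_low_quality_ends("ACGTACGT", [2, 10, 30, 35, 40, 30, 12, 5]))
```

The fraction of reads for which `passes_filter` returns True is one example of the global-level statistics listed above.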

Quality checking reports


Base quality


Usually lower at first and last bases even for good cases


A good case

A bad case

Image credit: FastQC

Quality checking reports


Duplication (reads with exactly the same sequences)


Indication of amplification bias or insufficient starting materials

A good case

A bad case

Image credit: FastQC

Meaning of “sequenced genome”


How to get the actual sequence (i.e., string) of nucleotides from a set of sequencing reads?

Sequence assembly: assemble the original sequences using the short reads

Also called “de novo assembly”

Sequence alignment/mapping: using a reference sequence, find out the position of each read and identify differences between the current DNA sequence and the reference

This kind of study is called “re-sequencing”

Only a small number of human genomes have been assembled. Most have only been mapped to reference genomes

Sequence assembly


Main idea: If two reads overlap substantially, there is a high chance that they come from adjacent positions in the original sequence

Example:

Original sequence: ACCGGGTCTACGTTCCAT

Read 1: ACCGGGT

Read 2: CGGGTCT

Alignment:

ACCGGGT__
__CGGGTCT

Partial assembly: ACCGGGTCT

Now, suppose Read 3 is GTTCCAT

It can align with read 1 as follows:

ACCGGGT_____
_____GTTCCAT

Which results in a wrong partial assembly: ACCGGGTTCCAT

Problem: overlap too short. Easy to get false hits from other positions
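The example can be reproduced with a small suffix-prefix overlap function. This is a minimal sketch assuming error-free reads and exact matching; the `min_overlap` parameter is introduced here to show how a minimum overlap length rejects the false hit.

```python
def overlap(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge(left, right, min_overlap=1):
    """Merge two reads on their overlap, or return None if it is too short."""
    length = overlap(left, right, min_overlap)
    return left + right[length:] if length else None


read1, read2, read3 = "ACCGGGT", "CGGGTCT", "GTTCCAT"
print(merge(read1, read2, min_overlap=3))  # correct partial assembly
print(merge(read1, read3, min_overlap=1))  # false hit from a 2-base overlap
print(merge(read1, read3, min_overlap=3))  # rejected when 3 bases are required
```

Raising the minimum overlap removes the false hit here, but it also discards true overlaps that happen to be short, which is one reason deeper coverage is needed.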

Sequence assembly


Main challenges:

Find reads with substantial overlaps efficiently

Determine the size of overlap necessary to eliminate false hits

Handling repeats (consider the following example):

Original sequence: ACCTCCTCCTCCTG

Suppose we get all length-4 reads: ACCT, CCTC, CTCC, TCCT, CCTG

They can also be produced from this sequence: ACCTCCTG

Need to check read count (number of copies for each type of reads) to determine number of CCT occurrences

Demonstrating why read length and quantitative precision matter

Handling sequencing errors
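The repeat example can be verified directly: the two sequences give the same set of length-4 reads and differ only in how many times each read occurs. A minimal sketch, assuming every position yields exactly one error-free read:

```python
from collections import Counter


def all_reads(sequence, length):
    """Count every substring of the given length (one read per start position)."""
    return Counter(
        sequence[i:i + length] for i in range(len(sequence) - length + 1)
    )


long_version = all_reads("ACCTCCTCCTCCTG", 4)
short_version = all_reads("ACCTCCTG", 4)

print(sorted(long_version) == sorted(short_version))  # same kinds of reads
print(long_version)   # reads inside the repeat occur several times
print(short_version)  # every read occurs once
```

In real data the counts are noisy, so the number of repeat copies can only be estimated, which is why the slide stresses both read length and quantitative precision.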

Sequence assembly


In order to have overlaps between reads, a base needs to be covered by more than one read

Suppose

The whole DNA has length N

Each read has length n

There are m reads

What is the probability that a base is not covered, if the reads were independently and uniformly sampled?

Ignoring boundary effects (i.e., some reads are at the ends of the DNA),

m = 1: $(N - n) / N$

In general: $[(N - n) / N]^m$

Can use similar calculations to estimate the number of reads needed to provide good coverage of all bases

The average number of times each base is covered is called the “read depth”: 30x, 60x, etc.
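The coverage formula can be turned into a small calculator. This sketch uses the same simplifications as the slide (independent, uniformly sampled reads, no boundary effects); the function names and the example numbers are illustrative assumptions.

```python
import math


def prob_uncovered(N, n, m):
    """Probability that a given base is missed by all m reads: [(N - n) / N]^m."""
    return ((N - n) / N) ** m


def read_depth(N, n, m):
    """Average number of reads covering each base."""
    return m * n / N


def reads_needed(N, n, max_uncovered):
    """Smallest m for which the uncovered probability is at most max_uncovered."""
    return math.ceil(math.log(max_uncovered) / math.log((N - n) / N))


N, n = 1000, 100
m = reads_needed(N, n, 0.01)
print(m, prob_uncovered(N, n, m), read_depth(N, n, m))
```

For N much larger than n, the uncovered probability is approximately exp(-depth), so a given target can be expressed either as a number of reads or as a read depth.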

One formulation


de Bruijn graphs

Treat each read (suppose all are of length k) as a directed edge (i.e., an arrow) from the node labeled with its first k-1 bases to the node labeled with its last k-1 bases

Two reads are therefore chained at a node when the last k-1 bases of the former are exactly the same as the first k-1 bases of the latter

In the ideal case, the goal is to start from a node and traverse all edges exactly once

Image credit: Compeau et al., Nature Biotechnology 29(11):987-991, (2011)
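A minimal sketch of the usual de Bruijn construction, in which each length-k read is an edge from its first k-1 bases to its last k-1 bases and assembly means traversing every edge once (an Eulerian path). It assumes the ideal case: error-free reads, exactly one read per position, and a graph in which such a path exists. The traversal is Hierholzer's algorithm, a standard choice that is assumed here rather than taken from the lecture.

```python
from collections import defaultdict


def de_bruijn(reads):
    """Map each (k-1)-base prefix to the list of (k-1)-base suffixes it leads to."""
    graph = defaultdict(list)
    for read in reads:
        graph[read[:-1]].append(read[1:])
    return graph


def eulerian_path(graph):
    """Return a node sequence that uses every edge exactly once (Hierholzer)."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, successors in graph.items():
        balance[node] += len(successors)
        for successor in successors:
            balance[successor] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((node for node, b in balance.items() if b == 1), next(iter(graph)))
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell the sequence: the first node plus the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


sequence = "ACCGGGTCTACGTTCCAT"
reads = sorted(sequence[i:i + 4] for i in range(len(sequence) - 3))
print(spell(eulerian_path(de_bruijn(reads))))
```

The reads are sorted before assembly to show that their original order is not used. With repeats longer than k-1 bases, several Eulerian paths can exist, so the reconstruction is no longer unique.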

Sequence alignment/mapping


Sequence assembly is difficult for long sequences

If the sequenced DNA is similar to an assembled one, it would be much easier to simply find out the location of each read in the reference, and identify the differences

Major challenges:

Finding out the locations of many reads in the reference efficiently

Handling mismatches

Distinguishing between genetic variants and sequencing errors

Handling insertions, deletions, duplications, and other types of genomic variations

More details about sequence assembly and alignment are covered by the projects

The SAM and BAM formats


SAM: a text-based file format for storing sequence alignments

BAM: a compressed binary version of SAM

SAM


Conceptual alignment:

Coor    12345678901234  5678901234567890123456789012345
ref     AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
+r001/1       TTAGATAAAGGATA*CTG
+r002        aaaAGATAA*GGATA
+r003      gcctaAGCTAA
+r004                    ATAGCT..............TCAGC
-r003                           ttagctTAGGC
-r001/2                                       CAGCGCCAT

SAM format:

@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

CIGAR string: M = alignment match (the base may match or mismatch the reference); I = insertion; D = deletion; N = skipped region; S = soft clipping; H = hard clipping; P = padding

Example source: http://samtools.sourceforge.net/SAM1.pdf
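The CIGAR strings in the example can be interpreted with a few lines of code. The sketch below relies on the SAM convention that M, D, N, = and X consume reference positions while M, I, S, = and X consume read bases; the helper names are illustrative.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_end(position, cigar):
    """Last 1-based reference position covered by an alignment starting at `position`."""
    span = sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
    return position + span - 1


def read_length(cigar):
    """Number of read bases stored in the SEQ field for this CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


print(parse_cigar("8M2I4M1D3M"))
print(reference_end(7, "8M2I4M1D3M"), read_length("8M2I4M1D3M"))
```

For r001, the alignment starts at position 7 and covers reference positions 7 to 22, while the read itself has 17 bases because the 2-base insertion counts toward the read but not the reference.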

APPLICATIONS

Part 3

Applications


While next-generation sequencing was first used for decoding DNA sequences, now it is used for many other applications

Gene expression (RNA-seq, CAGE, ...)

Protein binding to DNA (ChIP-seq, ChIP-exo, ...)

DNA methylation (BS-seq, MeDIP-seq, MethylCap-seq, RRBS-seq, ...)

Histone modifications (ChIP-seq)

Open chromatin (DNase-seq, FAIRE-seq, ...)

DNA long-range interactions (ChIA-PET, Hi-C, TCC, ...)

RNA-protein interactions (CLIP-Seq, HITS-CLIP, PAR-CLIP, RIP-seq, ...)

...

RNA-seq

Can select a subset of the RNA based on:

Presence or absence of poly-A tail

Length

Image credit: Wang et al., Nature Reviews Genetics 10(1):57-63, (2009)

ChIP-seq

Chromatin immunoprecipitation followed by sequencing

Use antibody to “pull down” target DNA, such as DNA bound by a certain protein

Image credit: Mardis, Nature Methods 4:613-614, (2007)

Bisulfite sequencing


To find out cytosines methylated at the carbon-5 position

Usually occurring at CpG, CpHpG and CpHpH nucleotide patterns

Bisulfite sequencing: Use bisulfite treatment to turn unmethylated cytosines into uracils (which are sequenced as thymines)

Determining methylated locations: Mapping sequencing reads to both original and C→T transformed references

Image source: http://www.hgu.mrc.ac.uk/img/researchers_img/meehan/DNA_Methylation_in_Vertebrates_a.jpg, Wikipedia
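The logic of bisulfite sequencing can be simulated in silico. This is a simplified sketch that considers only the forward strand, assumes complete conversion and no sequencing errors, and uses 0-based positions; the function names and the toy reference are illustrative assumptions.

```python
def bisulfite_convert(dna, methylated_positions):
    """Simulate treatment: unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(dna)
    )


def c_to_t(sequence):
    """Fully C->T transformed version of a sequence, used for mapping."""
    return sequence.replace("C", "T")


def call_methylation(reference, read, start):
    """For each reference C covered by the read, report True if it looks methylated."""
    calls = {}
    for offset, base in enumerate(read):
        position = start + offset
        if reference[position] == "C" and base in "CT":
            calls[position] = base == "C"
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(c_to_t(read) == c_to_t(reference))  # the read maps after C->T transformation
print(call_methylation(reference, read, start=0))
```

Comparing the read with the original reference at the mapped location then separates retained Cs (methylated) from Cs read as T (unmethylated).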

SPECIFIC PROCESSING AND ANALYSIS

Part 4

Some more specific tasks


General:

Comparing with controls

Handling replicates

DNA sequencing:

Detecting single nucleotide variants/polymorphisms (SNVs/SNPs)

Detecting small insertions/deletions (indels)

Detecting duplications

Detecting other large-scale rearrangements

RNA-seq:

Measuring expression levels [discussion paper]

Determining differentially expressed genes [discussion paper]

Determining isoforms [project]

Detecting gene fusion events [project]

ChIP-seq:

Detecting signal peaks [project]

Comparing with controls


Observed data contain both desired signals and unwanted signals

For example:

In DNA sequencing, the sample may contain contaminating DNA (e.g., normal cells contaminating a cancer sample)

In ChIP-seq, some regions with no protein binding may also be pulled down

By comparing with a control, we can pinpoint the real signals

Cancer: Normal cells (tumor-adjacent / normal tissue / blood)

ChIP-seq: Same DNA, but without ChIP

Handling replicates


Random noise can be filtered by having replicates


Biological replicates (different samples)


Technical replicates (different experiments)


Main idea: If a signal is consistently observed in multiple experiments, it is more likely to be real

Usual steps:

Computing probability of consistency

Filtering inconsistent signals

Combining unfiltered data from replicates

Comparing signal ranks


Suppose we have two ChIP-seq datasets. For each dataset, we have ranked each region by the signal strength (number of reads)

If the real signals have high and consistent ranks, while the noise has low and random ranks, we would get a curve like this:

[Figure: fraction of regions within the top t ranks in both datasets]

Image credit: Li et al., arXiv:1110.4705

Detecting SNVs


Single nucleotide variant: at one base, the observed data is different from the reference

Called SNP at the population scale; some definitions require at least 1% frequency

SNVs can be detected by allowing mismatches when mapping reads to reference

Would take more time if more mismatches are allowed

If the reference is ACCG, how many kinds of reads can be mapped if we allow

No mismatches?

1 mismatch?

2 mismatches?

To distinguish between SNVs and sequencing errors, usually a minimum number of reads that support the SNV is required

For diploid organisms, also need to distinguish between

Homozygous (maternal and paternal copies are the same at the base)

Heterozygous (maternal and paternal copies are different at the base)
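The counting question about the reference ACCG can be answered both with a formula and by brute force. The sketch below assumes the read has the same length as the reference and that only substitutions are allowed; under those assumptions the number of reads within d mismatches of a length-L reference is the sum over i = 0..d of C(L, i) * 3^i.

```python
from itertools import product
from math import comb


def count_reads(length, max_mismatches):
    """Number of distinct reads within max_mismatches substitutions of a reference."""
    return sum(comb(length, i) * 3 ** i for i in range(max_mismatches + 1))


def enumerate_reads(reference, max_mismatches):
    """Brute-force list of all reads within the allowed number of mismatches."""
    return [
        "".join(read)
        for read in product("ACGT", repeat=len(reference))
        if sum(a != b for a, b in zip(read, reference)) <= max_mismatches
    ]


for d in (0, 1, 2):
    print(d, count_reads(4, d), len(enumerate_reads("ACCG", d)))
```

The rapid growth with the number of allowed mismatches is one way to see why mapping takes more time when more mismatches are allowed.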

Detecting small indels


Small indels (that span one or a few bases) can be difficult to detect

If the reference is ACCGGTA, how many kinds of reads can be mapped if we allow

1 mismatch?

1 insertion of size 1?

Hard to determine whether a SNV near the end of a read is actually an indel

Good to perform local realignment by combining information from multiple reads

Example:

Reference: CGACCGT

Read 1: ACCAGT (more likely to be one insertion than two SNVs)

Read 2: CGACCA (not sure whether it is insertion or SNV by itself; more likely to be an insertion after considering read 1)

Duplications


Genomes contain many kinds of repeats that
differ by


Size of repeating unit


Number of copies


Locations of the copies


How similar are the copies


Mechanisms by which the repeats were produced


Some proposed definitions

Term | Definition
--- | ---
Structural variant | A genomic alteration (e.g., a CNV, an inversion) that involves segments of DNA >1kb
Copy number variant (CNV) | A duplication or deletion event involving >1kb of DNA
Duplicon | A duplicated genomic segment >1kb in length with >90% similarity between copies
Indel | Variation from insertion or deletion event involving <1kb of DNA
Intermediate-sized structural variant (ISV) | A structural variant that is ~8kb to 40kb in size. This can refer to a CNV or a balanced structural rearrangement (e.g., an inversion)
Low copy repeat (LCR) | Similar to segmental duplication
Multisite variant (MSV) | Complex polymorphic variation that is neither a PSV nor a SNP
Paralogous sequence variant (PSV) | Sequence difference between duplicated copies (paralogs)
Segmental duplication | Duplicated region ranging from 1kb upward with a sequence identity of >90%
Segmental duplication, interchromosomal | Duplications distributed among nonhomologous chromosomes
Segmental duplication, intrachromosomal | Duplications restricted to a single chromosome
Single nucleotide polymorphism (SNP) | Base substitution involving only a single nucleotide; ~10 million are thought to be present in the human genome at >1%, leading to an average of one SNP difference per 1250 bases between randomly chosen individuals

Table source: Freeman et al., Genome Research 16(8):949-961, (2006)

Detecting duplications


As mentioned, short repeats cause problems in sequence assembly

Need long reads

For duplications that are long,

Determining number of copies: based on read counts

Determining boundaries (“breakpoints”): could be helped by paired-end reads

Other types of large-scale rearrangements

Detection requires special handling:

Computing the distance of a mate pair (two reads from the two ends of a fragment) from their mapped positions, and comparing it with the expected distance

Checking the mapped orientation of a mate pair

Looking for “split reads”: reads with different parts coming from different regions in the reference

Image source: Wikipedia

Measuring expression levels


How to compute an expression level from a distribution of read counts?

Calculate the average

Based on a statistical model

Normalization: needed if expression levels of different genes, or of the same gene in different datasets, are to be compared

Longer genes are expected to get more reads

For a dataset with more reads, each gene gets more reads on average

RPKM: Reads per Kilobase per Million reads

Image credit: Wang et al., Nature Reviews Genetics 10(1):57-63, (2009)
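RPKM applies both normalizations at once: it divides the read count of a gene by its length in kilobases and by the dataset size in millions of reads. A minimal sketch, with illustrative parameter names and made-up numbers:

```python
def rpkm(read_count, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million reads in the dataset."""
    kilobases = gene_length_bp / 1_000
    millions_of_reads = total_reads / 1_000_000
    return read_count / kilobases / millions_of_reads


# Same gene in two datasets; the second dataset was sequenced twice as deeply.
print(rpkm(1_000, 2_000, 10_000_000))
print(rpkm(2_000, 2_000, 20_000_000))
# A gene that is twice as long and receives twice as many reads.
print(rpkm(2_000, 4_000, 10_000_000))
```

All three calls give the same value, which is the intended effect: the raw counts differ only because of gene length or sequencing depth, not expression.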

Differential expression


Suppose we have measured gene expression in two samples (e.g., tumor vs. normal), how to identify the list of genes with differential expression?

May not want to consider genes with very low expression (where random fluctuation has a large influence)

Consider genes with a statistically significant difference (compare 1 vs. 2 and 100 vs. 200)

Consider genes with a large fold change (large numbers can easily get statistical significance: 10000 vs. 11000)
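The three pairs of numbers on this slide can be compared with a simple test. The sketch below uses an exact two-sided binomial test as a stand-in: it assumes equal sequencing depth in the two samples and that, without differential expression, each read of the gene is equally likely to come from either sample. This choice of test is an assumption made for illustration, not the method used in the course.

```python
def fold_change(count_a, count_b):
    """Ratio of the larger count to the smaller one."""
    return max(count_a, count_b) / min(count_a, count_b)


def binomial_p_value(count_a, count_b):
    """Two-sided exact binomial p-value under a 50:50 null hypothesis."""
    n, k = count_a + count_b, min(count_a, count_b)
    term, tail = 1, 0  # term holds C(n, i), tail accumulates C(n, 0..k)
    for i in range(k + 1):
        tail += term
        term = term * (n - i) // (i + 1)
    return min(1.0, 2 * tail / 2 ** n)


for a, b in [(1, 2), (100, 200), (10_000, 11_000)]:
    print(a, b, fold_change(a, b), binomial_p_value(a, b))
```

With these assumptions, 1 vs. 2 has a two-fold change but no statistical support, 100 vs. 200 has both, and 10000 vs. 11000 is highly significant despite a fold change of only 1.1, which is why both criteria are usually applied together.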

Isoforms


Alternative isoforms: same gene producing multiple types of transcript

Image credit: Wang et al., Nature 456(7221):470-476, (2008)

Reconstructing isoforms


Paired-end reads and split reads could help connect neighboring exons

Still not easy to determine isoforms

If there are reads that connect exon 1 and exon 2, and reads that connect exon 2 and exon 3, do we have a transcript with all 3 exons?

Also need to consider read counts

E1:50, E2:60, E3:10

In general, need to make some assumptions, construct a statistical model, and do predictions

Signal peaks


Protein binding sites are usually short (~10bp) but the DNA pulled down can be much longer

With a large number of reads from random positions around the binding site, a distribution will be formed

Image credit: Rozowsky et al., Nature Biotechnology 27(1):66-75, (2009)

Calling signal peaks


Things to consider:


Signals in control (e.g., due to open chromatin)


Height of peaks


Fluctuations


Local bias


Read distribution in the two strands

Image credit: Kharchenko et al., Nature Biotechnology 26(12):1351-1359, (2008)

Some other common file formats


WIG: For storing signals at base resolution


BED: For storing intervals


GFF, GTF: For storing annotations (will be introduced later)

Wiggle Track Format (WIG)


Text-based format for storing signals for individual bases

Fixed step:

fixedStep chrom=chr1 start=14051 step=100
18.6
2.4
44.7

Chromosome | Position | Value
--- | --- | ---
1 | 14051 | 18.6
1 | 14151 | 2.4
1 | 14251 | 44.7

Variable step:

variableStep chrom=chr1 span=3
143001 12.5

Chromosome | Position | Value
--- | --- | ---
1 | 143001 | 12.5
1 | 143002 | 12.5
1 | 143003 | 12.5

There is also a binary format called BigWig with more efficient data access
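The two WIG examples can be expanded into per-base values with a short parser. This is a minimal sketch that handles only the `fixedStep` and `variableStep` declarations with the `chrom`, `start`, `step` and `span` parameters; track and browser lines are skipped, and the function name is illustrative.

```python
def parse_wig(lines):
    """Yield (chrom, position, value) for every base described by a WIG file."""
    mode = chrom = None
    position = step = span = None
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        if fields[0] in ("fixedStep", "variableStep"):
            mode = fields[0]
            params = dict(field.split("=", 1) for field in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
        elif mode == "fixedStep":
            value = float(fields[0])
            for offset in range(span):
                yield chrom, position + offset, value
            position += step
        elif mode == "variableStep":
            start, value = int(fields[0]), float(fields[1])
            for offset in range(span):
                yield chrom, start + offset, value


fixed = ["fixedStep chrom=chr1 start=14051 step=100", "18.6", "2.4", "44.7"]
variable = ["variableStep chrom=chr1 span=3", "143001 12.5"]
print(list(parse_wig(fixed)))
print(list(parse_wig(variable)))
```

The output reproduces the two position tables: the fixed-step values land 100 bases apart, and the single variable-step value is repeated over a span of 3 bases.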

BED format


Text-based, tab-delimited format for storing signals for intervals

3 required fields: chrom, chromStart, chromEnd

9 optional fields: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts (last ones for visualization)

Example:

chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

There is also a binary format called BigBed with more efficient data access

Many variations, such as the commonly-used bedGraph format with only 4 fields: chrom, chromStart, chromEnd, dataValue

Example source: UCSC Genome Browser

Formatting traps


When you work with genomic data files, be careful of the following:

Whether genomic locations start with 0 or start with 1

Whether the first position of a region is included

Whether the last position of a region is included

For example, for the bedGraph format:

First position is counted as 0.

First specified position is included.

Last specified position is not included.

Therefore, “chr1 2 4” means the third and fourth positions of chromosome 1.
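The bedGraph example can be checked with a tiny conversion helper. This sketch converts a 0-based, half-open interval (as in BED and bedGraph) into 1-based positions with both ends included; the function names are illustrative.

```python
def to_one_based_inclusive(chrom_start, chrom_end):
    """Convert a 0-based, end-exclusive interval to 1-based, end-inclusive."""
    return chrom_start + 1, chrom_end


def covered_positions(chrom_start, chrom_end):
    """List the 1-based positions covered by a 0-based, end-exclusive interval."""
    return list(range(chrom_start + 1, chrom_end + 1))


print(to_one_based_inclusive(2, 4))  # (3, 4)
print(covered_positions(2, 4))       # [3, 4]
```

A useful consequence of the 0-based, half-open convention is that the interval length is simply chromEnd minus chromStart.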

Summary


Characteristics of next-generation sequencing:

High-throughput

Relatively low cost

Short reads

Standard data processing

Data preprocessing

Sequence alignment

Sequence assembly

Applications (X-seq)

Specific data processing and analysis

Tasks for processing and analyzing DNA sequencing, RNA-seq and ChIP-seq data