NGS cancer genomics data processing and analysis

Nov 2, 2013 (3 years and 9 months ago)


Somak Roy, MD


Clinical fellow

Division of Urologic Surgical Pathology

University of Pittsburgh Medical Center

Outline

http://www.genome.gov/sequencingcosts/

Background

NGS: epigenetic profiling, gene fusion detection, mutation profiling, copy number variations, structural variants

Application in Cancer Genomics

NGS in clinical domain

Sequence the sample DNA to obtain a string of characters (ATGC)

Compare the obtained sequence to the reference sequence (expected normal)

Any deviation from the reference (single or multiple base(s)) is a variant.

Theme of DNA Sequencing
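
The three steps above can be illustrated with a minimal sketch. It assumes the sample and reference strings are already aligned and of equal length, so it only reports substitutions; real pipelines align reads first and also handle insertions and deletions. The function name and the example sequences are made up for illustration.

```python
def find_variants(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumption: both strings are pre-aligned and the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical sequences: the sample differs from the reference at position 5.
reference = "ATGCCGTA"
sample = "ATGCTGTA"
print(find_variants(reference, sample))  # [(5, 'C', 'T')]
```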

Evolution of Sequencing

Robison. Nat Biotechnol 2011;29:805-7

Rothberg et al. Nature. 2011;475:348-52

Semiconductor Sequencing

Arch Pathol Lab Med. 2012;136:000-000; doi: 10.5858/arpa.2012-0107-RA

Optics-based Sequencing

NGS data processing elements

Signal processing

Raw signal

Normalization

Base calling

Quality control

FASTQ / unaligned BAM

Signal Processing - Non-optical

CCGCTAGCTATATTATATCGGGGCCCTAGATAGCTAGATATAGAGGGCTCTAGAGATCGATAGCTAGAG
CTAGCTCGCCGGGGCCCTAGAGTATATTATAGGCTCTAGAGATCGATAGCTGATAGCTAGATATAAGAG
ATATAAGCGCGGCTCGATCGGTCTAGAGAGGCCCTAGAGTATATTACTAGCTTAAGCTGATAGCTAGAG

Signal Processing - Optical

CCGCTAGCTATATTATATCGGGGCCCTAGATAGCTAGATATAGAGGGCTCTAGAGATCGATAGCTAGAG
CTAGCTCGCCGGGGCCCTAGAGTATATTATAGGCTCTAGAGATCGATAGCTGATAGCTAGATATAAGAG
ATATAAGCGCGGCTCGATCGGTCTAGAGAGGCCCTAGAGTATATTACTAGCTTAAGCTGATAGCTAGAG

Signal Processing - Homopolymer

Take a Peek into FASTQ!

Header: Sequence ID, additional info

Sequence

Optional header

Quality score

Phred score / Phred-like score

$Q = -10 \log_{10} p$

Per-base call score
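
The four-line record layout listed above (header, sequence, optional header, quality score) can be read with a short sketch like the one below. It assumes the common unwrapped layout where each record occupies exactly four lines; the names `FastqRecord` and `read_fastq` and the example read are hypothetical, not part of any particular toolkit.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str           # sequence ID plus any additional info
    sequence: str         # base calls
    optional_header: str  # text after '+', often empty
    quality: str          # one ASCII character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Made-up single-read example.
example = io.StringIO("@read1 sample=demo\nGATTACA\n+\nIIIIIII\n")
for record in read_fastq(example):
    print(record.header, record.sequence, record.quality)
```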

Take a Peek into FASTQ!

$Q = -10 \log_{10} p$

$30 = -10 \log_{10}(10^{-3})$

$20 = -10 \log_{10}(10^{-2})$

What are these characters?

ASCII format

$67 = -10 \log_{10}(p = ?)$
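
The relationship between the quality score and the error probability can be sketched in a few lines. The conversion functions follow the formula on the slide directly. The ASCII decoding step assumes an offset of 33 (the Sanger-style encoding); other encodings use a different offset, so treat `offset` as an assumption to check against the instrument's documentation. All function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(p) to get the base-call error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into per-base scores.

    Assumption: scores are stored as ASCII code minus `offset` (33 here).
    """
    return [ord(char) - offset for char in quality]


print(phred_to_error_probability(30))
print(error_probability_to_phred(0.01))
print(decode_quality("I5"))
```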

Read pile-up (figure: three reads carry the variant base G)

Depth of coverage: 5x; 3x

Variant frequency: Var (G) frequency = 3/5 (60%)
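
The arithmetic behind depth of coverage and variant frequency can be sketched as follows. The pile-up column is represented as a plain string with one base per overlapping read; the reference base `A` in the example is an assumption, since the slide only states that 3 of 5 reads carry G.

```python
from collections import Counter


def allele_frequencies(pileup_bases: str) -> dict[str, float]:
    """Return the fraction of reads supporting each base at one position."""
    depth = len(pileup_bases)  # depth of coverage at this position
    if depth == 0:
        raise ValueError("no reads cover this position")
    counts = Counter(pileup_bases.upper())
    return {base: count / depth for base, count in counts.items()}


# Five reads cover the position; three carry G (the other base is hypothetical).
column = "GGGAA"
frequencies = allele_frequencies(column)
print(len(column), frequencies["G"])  # 5 0.6
```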

Mapping, Assembly & Variant Identification

Mapping / Alignment

Next-Generation DNA Sequencing Informatics. Ed. Brown SM. Cold Spring Harbor Laboratory Press. 2013

Mapping / Alignment

Pabinger et al. Briefings in Bioinformatics. Jan 2013

Mapping / Alignment - QC

Next-Generation DNA Sequencing Informatics. Ed. Brown SM. Cold Spring Harbor Laboratory Press. 2013

Variant identification

Pabinger et al. Briefings in Bioinformatics. Jan 2013

Variant identification - QC

Next-Generation DNA Sequencing Informatics. Ed. Brown SM. Cold Spring Harbor Laboratory Press. 2013

Annotation

1522648G>A

44512584G>C

8124526T>A

2544856_2544860 AATGC

..

55124785GA>CC

Public / custom databases

Nomenclature

Biological implication (Gene, transcript and protein level)

Genotype-phenotype correlations

Prognostic implication

Predictive implication
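
Before a variant call can be looked up in a database, its raw notation has to be split into position, reference allele and alternate allele. The sketch below handles only the simple substitution shape shown above (for example `1522648G>A` or `55124785GA>CC`). The regular expression is an assumption made for illustration and is not a full implementation of any nomenclature standard; other notations, such as ranges, are returned as `None` rather than guessed at.

```python
import re
from typing import Optional

# Assumed shape: <position><reference bases>><alternate bases>
SUBSTITUTION = re.compile(r"^(\d+)([ACGT]+)>([ACGT]+)$")


def parse_substitution(text: str) -> Optional[tuple[int, str, str]]:
    """Split a substitution call into (position, reference, alternate)."""
    match = SUBSTITUTION.match(text.strip())
    if match is None:
        return None  # not a simple substitution; leave for other handling
    position, reference, alternate = match.groups()
    return int(position), reference, alternate


for call in ["1522648G>A", "55124785GA>CC", "2544856_2544860 AATGC"]:
    print(call, parse_substitution(call))
```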

Report level: Definition / explanation

1: Missense SNVs, insertions and deletions occurring in the coding, untranslated and splice site regions, which are of known clinical significance.

2: Missense SNVs, insertions and deletions occurring in the coding, untranslated and splice site regions, which are of uncertain clinical significance.

3: Variant(s) not automatically classifiable into levels 1, 2, 4 and 5 by the program. A pathologist is required to review and reclassify them appropriately based on available evidence.

4: Synonymous SNVs, SNV(s) with established evidence of benign biological and/or clinical outcome, and intronic variants (except splice site and deep intronic variants with known significance).

5: Sequence variants which have been proven to be platform-specific sequencing errors based on repeated experimental data, Sanger sequence confirmation and thorough manual review of read pile-ups (e.g. using IGV).

Variant Classification
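
One way to picture the five report levels is as a small rule cascade. The sketch below is an illustration only, not the program referred to in level 3: the `Variant` fields, the category strings and the rule order are all assumptions chosen to mirror the definitions, and anything the rules do not cover falls through to level 3 for pathologist review.

```python
from dataclasses import dataclass

# Hypothetical category labels, chosen to mirror the level definitions.
REPORTABLE_TYPES = {"missense", "insertion", "deletion"}
REPORTABLE_REGIONS = {"coding", "utr", "splice_site"}


@dataclass
class Variant:
    variant_type: str   # e.g. "missense", "synonymous", "insertion"
    region: str         # e.g. "coding", "utr", "splice_site", "intronic"
    significance: str   # "known", "uncertain", "benign" or "" if unannotated
    platform_artifact: bool = False  # confirmed platform-specific error


def report_level(variant: Variant) -> int:
    """Assign a report level from 1 to 5 using the assumed rules above."""
    if variant.platform_artifact:
        return 5
    if variant.variant_type == "synonymous" or variant.significance == "benign":
        return 4
    if variant.region == "intronic" and variant.significance != "known":
        return 4
    if (variant.variant_type in REPORTABLE_TYPES
            and variant.region in REPORTABLE_REGIONS):
        if variant.significance == "known":
            return 1
        if variant.significance == "uncertain":
            return 2
    return 3  # not classifiable automatically; needs manual review


print(report_level(Variant("missense", "coding", "known")))
print(report_level(Variant("nonsense", "coding", "known")))
```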

Visualization

Result Reporting, Management and Sharing

Targeted sequencing

Whole exome sequencing

Whole genome sequencing