OMICS and Pathway Integration for Knowledge Discovery - cceHUB


4 Oct 2013


Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery

Jake Y. Chen, Ph.D.

IUPUI

Indiana Center for Systems Biology & Personalized Medicine

http://bio.informatics.iupui.edu

Polyp and Colorectal Cancer


Polyp vs. Colorectal Cancer


Polyps are benign tumors of the large intestine.

They do not invade nearby tissue or spread to other parts of the body.

If not removed from the large intestine, they may become malignant (cancerous) over time.

Most cancers of the large intestine are believed to have developed from polyps.

Photo Courtesy of National Cancer Institute


Colon Cancer vs. Rectal Cancer


Share many commonalities, including molecular mechanisms.


Tend to be treated differently.

Colorectal Cancer Molecular Pathways

A. Walther, et al. (2009) Nature Reviews Cancer, 9(7), pp. 489-99

Omics/Clinical Data Source

Proteomics/Metabolomics/Lipidomics/Clinical Data

H = healthy (normal), PP = polyp, CR = colorectal cancer, N = total subjects

Data set               H    PP   CR   N
Diet                   70   54   29   153
Oxidative Stress       50   32   12   94
LC-MS Proteomics       80   72   40   192
Vitamin D              83   81   31   195
GCxGC MS Metabolomics  83   84   30   197
Lipidomics             47   35   15   97
NMR Metabolomics       53   35   15   103

Scientific Questions to Answer


Data Analysis


Which Omics data set has the best prediction power?

Which features in the Omics data are important?

Data Mining

Does integration of Omics data improve the prediction?

Which combination of Omics data has the best prediction power?

Knowledge Discovery

Why do those features in the Omics data have the best prediction power?

Roadmap


Knowledge Discovery of Proteomics Data

Knowledge Discovery of Metabolomics Data

Integrative Data Mining

Proteomics Data Description


Group: Bindley Biosciences Center at Purdue University

Instruments: Agilent's Chip Cube coupled to the XCT Plus ESI ion trap

Data format at CCE web portal: mzXML

Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40


LC-MS Proteomics Data Processing

LC/MS data “heat map”

Total Ion Chromatogram (TIC) summarized from enhanced heat map

Methods Adapted from

N. Jeffries (2005) Bioinformatics, vol. 21, (no. 14), pp. 3066.

S.A. Kazmi, et al., (2006) Metabolomics, vol. 2, (no. 2), pp. 75-83

Image Enhanced LC/MS data “heat map”
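
The step from the LC/MS heat map to a Total Ion Chromatogram can be sketched as follows. This is a minimal illustration, assuming the heat map is stored as a matrix with one row per scan (retention time) and one column per m/z bin; the function name and the toy intensities are hypothetical, and the image-enhancement step named on the slide is not reproduced here.

```python
from typing import List, Sequence


def total_ion_chromatogram(heat_map: Sequence[Sequence[float]]) -> List[float]:
    """Sum the intensities over all m/z bins for each scan (row)."""
    return [sum(row) for row in heat_map]


# Toy heat map (assumed layout): 3 scans x 3 m/z bins, made-up intensities.
heat_map = [
    [0.0, 1.0, 2.0],
    [5.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

tic = total_ion_chromatogram(heat_map)
# Index of the scan with the highest summed intensity.
strongest_scan = max(range(len(tic)), key=tic.__getitem__)
```

Here `tic` holds one summed intensity per scan, and `strongest_scan` is the row index where the chromatogram peaks.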

LC-MS Major Protein Identification

~25-28 characteristic proteins/sample identified

Identify Most Informative TIC R.T. “Grid”

Apply the R.T. Grid to Original Spectra

Use Mascot to Search for Protein ID at R.T. Grid Regions

No  Scan  RT      Uniprot_ID   Score  Expect  Evidence
1   119   139.48  ADAD2_HUMAN  38     3.3     0
2   229   265.87  NNMT_HUMAN   43     1.1     2
3   372   429.15  ZSA5D_HUMAN  42     1.2     0
4   656   749.8   BRAF_HUMAN   40     2.2     479
5   1162  1276.6  RGS7_HUMAN   47     0.39    1
6   1310  1407.2  TTC9C_HUMAN  35     6.3     0
7   1669  1713.9  CP042_HUMAN  38     3.1     0
8   1866  1879.1  HXD11_HUMAN  34     8.4     0
9   1987  1980.3  ING4_HUMAN   38     3.1     2
10  2114  2086    ZN423_HUMAN  33     10      0
11  2353  2285.7  CL065_HUMAN  37     3.9     0
12  2539  2441.3  CA5BL_HUMAN  47     0.4     1
13  2722  2594.7  NPDC1_HUMAN  38     3.6     0
14  2874  2722.2  DJC27_HUMAN  37     3.8     0
15  3001  2828.5  BORG4_HUMAN  40     2.2     1
16  3165  2965.1  KC1G1_HUMAN  27     43      0
17  3440  3196.1  TPPC5_HUMAN  40     2       0
18  3656  3377.6  UB2D3_HUMAN  43     0.99    1
19  3997  3665.5  TM208_HUMAN  34     8.1     0
20  4257  3885.4  ZBED3_HUMAN  29     23      0

Proteomics Result Interpretation

Proteins Identified from the Colon Cancer and Healthy Groups

Uniprot_ID   Frequency in Colon (10)  Frequency in Healthy (10)  Evidence in PubMed
BRAF_HUMAN   3                        0                          508
DMP46_HUMAN  3                        0                          0
NNMT_HUMAN   3                        1                          4
MRP_HUMAN    1                        3                          0
STK33_HUMAN  0                        3                          0
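
The frequency columns above count in how many of the 10 samples per group a protein was identified. A minimal sketch of that tally follows, assuming each sample is reduced to the set of UniProt IDs found in it. The function names are hypothetical, the 0.3 cutoff is borrowed from the detection-frequency threshold quoted for the network figure in this deck, and the toy samples only reproduce the counts for three rows of the table (which sample contains which protein is made up).

```python
from typing import Dict, List, Set


def detection_frequency(samples: List[Set[str]]) -> Dict[str, float]:
    """Fraction of samples in which each protein was identified."""
    n = len(samples)
    if n == 0:
        return {}
    counts: Dict[str, int] = {}
    for proteins in samples:
        for protein in proteins:
            counts[protein] = counts.get(protein, 0) + 1
    return {protein: c / n for protein, c in counts.items()}


def frequent_in_cases_only(cases: List[Set[str]],
                           controls: List[Set[str]],
                           cutoff: float = 0.3) -> List[str]:
    """Proteins at or above the cutoff in cases but below it in controls."""
    case_freq = detection_frequency(cases)
    control_freq = detection_frequency(controls)
    return sorted(p for p, f in case_freq.items()
                  if f >= cutoff and control_freq.get(p, 0.0) < cutoff)


# Toy groups of 10 samples each (sample membership is illustrative only).
colon = [{"BRAF_HUMAN", "NNMT_HUMAN"}, {"BRAF_HUMAN"},
         {"BRAF_HUMAN", "NNMT_HUMAN"}, {"NNMT_HUMAN"}] + [set() for _ in range(6)]
healthy = [{"NNMT_HUMAN", "STK33_HUMAN"}, {"STK33_HUMAN"},
           {"STK33_HUMAN"}] + [set() for _ in range(7)]

candidates = frequent_in_cases_only(colon, healthy)
```

With these toy groups, BRAF_HUMAN and NNMT_HUMAN reach 3/10 in the colon group and stay below the cutoff in the healthy group, while STK33_HUMAN is excluded because it is frequent only among healthy samples.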

Proteins Interacting with High-Frequency Proteins from the Colon Cancer Group

Uniprot_ID    Gene   Protein Name                                         Evidence in PubMed
BRAF1_HUMAN   BRAF   Serine/threonine-protein kinase B-raf                508
P53_HUMAN     TP53   Cellular tumor antigen p53                           443
CD44_HUMAN    CD44   CD44 antigen                                         411
MDM2_HUMAN    MDM2   E3 ubiquitin-protein ligase Mdm2                     131
BCR_HUMAN     BCR    Breakpoint cluster region protein                    59
LCK_HUMAN     LCK    Tyrosine-protein kinase Lck                          29
Q7RTZ3_HUMAN  LCK    Tyrosine-protein kinase Lck                          29
CAV1_HUMAN    CAV1   Caveolin-1                                           21
PNPH_HUMAN    PNP    Purine nucleoside phosphorylase                      13
CBL_HUMAN     CBL    E3 ubiquitin-protein ligase CBL                      11
RAF1_HUMAN    RAF1   RAF proto-oncogene serine/threonine-protein kinase   10
CD38_HUMAN    CD38   ADP-ribosyl cyclase 1                                8
NNMT_HUMAN    NNMT   Nicotinamide N-methyltransferase                     4
IRAK1_HUMAN   IRAK1  Interleukin-1 receptor-associated kinase 1           3
DMPK_HUMAN    DMPK   Myotonin-protein kinase                              2
ITA5_HUMAN    ITGA5  Integrin alpha-5                                     1
ITB1_HUMAN    ITGB1  Integrin beta-1                                      1
ZAP70_HUMAN   ZAP70  Tyrosine-protein kinase ZAP-70                       1

Proteomics Result Interpretation

A Network Biology Context

Protein Network Constructed from the Top 3 Differential Proteins

Green-circled proteins are frequently (>=0.3) detected in the colon patient blood samples by using LC/MS.

Node: Protein with evidence from PubMed by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).

Edge: Protein interaction with confidence score from HAPPI 1.31 (4&5-Star).
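
The PubMed evidence search quoted in the caption can be templated per gene symbol. The sketch below only builds the query string and a request URL for the NCBI E-utilities `esearch` endpoint; it does not contact the server. The deck does not say how the counts were actually retrieved, so using E-utilities (and the helper names) is an assumption made for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(gene_symbol: str) -> str:
    """Fill the gene symbol into the query pattern shown in the caption."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            f'AND ("cancer" OR "carcinoma"))')


def esearch_count_url(gene_symbol: str) -> str:
    """URL asking esearch for the number of matching PubMed records."""
    params = {"db": "pubmed", "term": pubmed_query(gene_symbol), "rettype": "count"}
    return ESEARCH_URL + "?" + urlencode(params)


query = pubmed_query("BRAF")
url = esearch_count_url("BRAF")
```

Fetching `url` would return an XML document whose count element gives the number of hits; counts retrieved today will differ from the figures on the slides because PubMed keeps growing.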

Proteomics Result Interpretation

A Biological Pathway Context

BRAF (Serine/threonine-protein kinase B-raf) plays major roles in Colorectal Cancer Pathway (KEGG data)

NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)


Proteomics Result Interpretation

A Biological Pathway Context for NNMT

Roadmap


Knowledge Discovery of Proteomics Data


Knowledge Discovery of Metabolomics Data

NMR Data

GCxGC MS Data


Integrative Data Mining


Metabolomics Data Description

Group: Daniel Raftery Laboratory at Purdue University

1. NMR Data

Instruments: Bruker Avance 500 MHz NMR

Data format at CCE web portal: Excel spreadsheet

Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15

2. GCxGC MS Data

Instruments: LECO Pegasus 4D GCxGC-TOF

Data format at CCE web portal: Excel spreadsheet

Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30





NMR Data Analysis Workflow

Signal Processing

Extract peaks’ ppm

Search against the Human Metabolome Database (2.5) to identify metabolites

Report only significant metabolites

Sample_ID  1                           2
Top1       Delta-Hexanolactone         Delta-Hexanolactone
Top2       Hypotaurine                 Hypotaurine
Top3       2,3-Diphosphoglyceric acid  Diethanolamine
Top4       Diethanolamine              3,7-Dimethyluric acid
Top5       3-Phosphoglyceric acid      Methyl isobutyl ketone
Top6       3,7-Dimethyluric acid       1,3,7-Trimethyluric acid
Top7       1,3,7-Trimethyluric acid    Cysteine-S-sulfate
Top8       L-Allothreonine             L-Allothreonine
Top9
Top10

NMR Peak Metabolite Identification using the Human Metabolome Database

1) Input the peak lists

2) Get the metabolites; leave out those with fewer than 2 matches
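
The two identification steps above can be sketched as a tolerance match between a sample's peak list and a reference library, keeping only metabolites with at least two matched peaks. This is a hedged illustration: the reference entries, their chemical shifts, the 0.02 ppm tolerance, and the function name are all hypothetical and do not reproduce the Human Metabolome Database search itself.

```python
from typing import Dict, List


def match_metabolites(peaks_ppm: List[float],
                      reference: Dict[str, List[float]],
                      tolerance_ppm: float = 0.02,
                      min_matches: int = 2) -> Dict[str, int]:
    """Count reference shifts matched by a peak; drop weakly supported hits."""
    hits: Dict[str, int] = {}
    for name, ref_shifts in reference.items():
        matched = sum(
            1 for ref in ref_shifts
            if any(abs(ref - peak) <= tolerance_ppm for peak in peaks_ppm)
        )
        if matched >= min_matches:
            hits[name] = matched
    return hits


# Hypothetical reference library and sample peak list (values are made up).
reference = {
    "metabolite_A": [1.20, 3.65, 3.80],
    "metabolite_B": [2.10, 7.30],
}
sample_peaks = [1.21, 3.66, 5.00, 7.31]

identified = match_metabolites(sample_peaks, reference)
```

In this toy case `metabolite_A` is kept with two matched shifts, while `metabolite_B` is left out because only one of its shifts is matched.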

Significant Metabolites Identified from NMR Metabolomics Data

Group                  Metabolites
Polyp vs Healthy       D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp    None
Colorectal vs Healthy  D-Arabitol (2/15 vs 0/53)

Population Frequency =

Marker metabolites?

Shared metabolites

D-Arabitol Identified from NMR Results Involved in Pentose and Glucuronate Interconversions Pathways

Roadmap


Knowledge Discovery of Proteomics Data


Knowledge Discovery of Metabolomics Data

NMR Data

GCxGC MS Data


Integrative Data Mining


Results from GCxGC MS Data I

Metabolite identification is more straightforward

Metabolites (comparisons: Polyp vs Healthy; Colorectal vs Polyp; Colorectal vs Healthy)

Methanesulfinic acid, trimethylsilyl ester
Acetic acid, (methoxyimino)-, trimethylsilyl ester
Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Hexanedioic acid, bis(2-ethylhexyl) ester
Methanesulfinic acid, trimethylsilyl ester
Cholesterol trimethylsilyl ether
Mefloquine
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Hexanoic acid, trimethylsilyl ester
Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Tetradecanoic acid, trimethylsilyl ester
Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
Silanol, trimethyl-, pyrophosphate (4:1)
Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Trimethylsilyl ether of glycerol
L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
Ethylbis(trimethylsilyl)amine
Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
Benzene, (1-hexadecylheptadecyl)-
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Results from GCxGC MS Data II

A. Polyp vs Healthy

B. Polyp vs Colorectal

C. Colorectal vs Healthy

Comparative Results (Intensity vs. Population)

Marker Metabolite Panel Clustering of three groups

Intensity based Heat map

Population Frequency based Heat map

Metabolites identified from GCxGC MS Results Involved in Fatty Acid Biosynthesis Pathways

Roadmap


Knowledge Discovery of Proteomics Data


Knowledge Discovery of Metabolomics Data


Integrative Data Mining


Data Set Description


Diet, Lipidomics, Oxidative and VD

The # of features and the total # of subjects vary

Three classes are balanced to the least common denominator

Healthy vs. Polyp

Healthy vs. Colorectal

Polyp vs. Colorectal

                Diet  Lipid  Oxidative  VD
Total Subjects  150   97     94         195
Total Features  38    49     3          2
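
One possible reading of "balanced to the least common denominator" is that, for each pairwise comparison, the larger class is randomly downsampled to the size of the smaller one. The sketch below illustrates that reading only; it is an assumption, not a confirmed description of the procedure used, and the function name, seed, and subject IDs are hypothetical. The group sizes are the lipidomics counts listed earlier in the deck (H=47, CR=15).

```python
import random
from typing import List, Sequence, Tuple


def balance_pair(group_a: Sequence[str],
                 group_b: Sequence[str],
                 seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly downsample both groups to the size of the smaller one."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    n = min(len(group_a), len(group_b))
    return rng.sample(list(group_a), n), rng.sample(list(group_b), n)


# Hypothetical subject IDs sized like the lipidomics groups.
healthy_ids = [f"H{i:03d}" for i in range(47)]
colorectal_ids = [f"CR{i:03d}" for i in range(15)]

healthy_bal, colorectal_bal = balance_pair(healthy_ids, colorectal_ids)
```

After balancing, both lists hold 15 subjects, so a classifier cannot reach high accuracy simply by predicting the majority class.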

Predictive Modeling Methods


Data Preprocessing

Filtering outliers (three standard deviations away from the mean)

Data normalization (transforming to the 0-1 range)

Binning categorical data using the quantile binning method

Missing Value Treatment

Replaced with the mean value of the attribute in the group

Support Vector Machine (SVM) Classifier Kernel

Radial Basis Function (RBF) kernel is used

Feature Selection Methods

Approach #1: Two-sample unpaired t-tests at the 5% significance level.

Approach #2: SVM Attribute Evaluator with Ranker algorithm.

Features from t-tests are filtered using p-values

K-fold Cross-validation
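
Three of the preprocessing steps listed above (outlier filtering, group-mean imputation, and 0-1 normalization) can be sketched for a single attribute as follows. This is a minimal illustration using only the Python standard library: the function names, the order in which the steps are applied, the use of the sample standard deviation, and the toy values are assumptions. Quantile binning is not covered.

```python
from statistics import mean, stdev
from typing import List, Optional


def drop_outliers(values: List[Optional[float]], n_sd: float = 3.0) -> List[Optional[float]]:
    """Remove observed values more than n_sd sample SDs from the mean."""
    observed = [v for v in values if v is not None]
    m, sd = mean(observed), stdev(observed)
    return [v for v in values if v is None or abs(v - m) <= n_sd * sd]


def impute_group_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the group's observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Linearly rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy attribute for one group: one extreme value and one missing value.
raw = [1.0] * 10 + [2.0] * 9 + [100.0, None]
clean = min_max_scale(impute_group_mean(drop_outliers(raw)))
```

In the toy data the value 100.0 lies more than three standard deviations from the mean and is dropped, the missing entry is filled with the mean of the remaining values, and the result is rescaled so the smallest value maps to 0 and the largest to 1.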

[Workflow diagram: Raw Dataset; Clean Dataset; Hypothesis (three boxes); Classification Model]

Dietary Attributes as Predictors

Comparisons: Polyp vs. Healthy; Colorectal vs. Healthy

P-values: 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01, 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02

SVM Predictor Accuracy = 64%

SVM Predictor Accuracy = 65%

Attributes: Ice cream, Rice, Tea, Shellfish, Salad, Tomato, Egg, Milk
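
Per-attribute p-values and a cross-validated SVM accuracy of the kind reported above can be obtained along the following lines. This is a sketch under stated assumptions: it uses SciPy and scikit-learn as stand-ins (the deck does not name the software used), the data are synthetic rather than the dietary attributes, and the function names, fold count, and seeds are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def select_by_ttest(X, y, alpha=0.05):
    """Indices of features with a two-sample unpaired t-test p-value below alpha."""
    _, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.flatnonzero(p_values < alpha), p_values


def rbf_svm_cv_accuracy(X, y, n_splits=5, seed=0):
    """Mean k-fold accuracy of an RBF-kernel SVM on 0-1 scaled features."""
    model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())


# Synthetic two-class data: 40 subjects per class, 10 features,
# only feature 0 actually differs between the classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 10))
X[y == 1, 0] += 3.0

selected, p_values = select_by_ttest(X, y)
# Note: selecting features on all subjects before cross-validation can make
# the accuracy estimate optimistic; nesting the selection inside each fold
# avoids that at the cost of a longer example.
accuracy = rbf_svm_cv_accuracy(X[:, selected], y)
```

Scaling is placed inside the pipeline so that the 0-1 transformation is fitted on the training folds only.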

Lipidomics T-Test Results

Significant Features Selected from T-Test with Their Corresponding p-Values

Features      Polyp vs. Healthy  Polyp vs. Colorectal  Colorectal vs. Healthy
16:0/18:1 PE  1.76E-02           -                     -
24:1 Cer      6.90E-03           -                     -
LPE 18:1      -                  -                     <1.00E-04
LPE 20:0      1.50E-03           2.00E-04              -
An-16:0 LPA   -                  -                     3.23E-02
An-18:1 LPA   -                  3.38E-02              1.33E-02
AA            -                  1.13E-02              -
18:2 LPA      -                  1.13E-02              4.50E-03
20:4 LPA      -                  -                     2.40E-02
22:6 FA       -                  4.28E-02              3.24E-02
LPE 16:0      -                  3.08E-02              3.40E-03
LPE 18:0      -                  3.90E-03              1.00E-04
LPE 18:1      -                  2.18E-02              -

Integrating lipidomics with clinical features

Performance comparisons

                         Accuracy                 Accuracy                       Accuracy
                         (without pre-selection)  (with t-test pre-selection)    (automatic selection)
Polyp vs. Healthy        0.54                     0.71                           0.78
Colorectal vs. Healthy*  0.57                     0.63                           0.73
Polyp vs. Colorectal*    0.70                     0.90                           0.87

* Since the number of subjects was fewer than 15, 3-fold cross-validation accuracy was reported.

                         Accuracy
Polyp vs. Healthy        0.55
Colorectal vs. Healthy*  0.60
Polyp vs. Colorectal*    0.60

Without Clinical Features

With Clinical Features
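
Combining lipidomics measurements with clinical features amounts to joining the two tables on a shared subject identifier before modeling. A minimal sketch follows, assuming each data set is keyed by subject ID; the IDs, the clinical field names, and all values are hypothetical, and only the lipid names are taken from the t-test table.

```python
from typing import Dict

Row = Dict[str, float]


def integrate_features(omics: Dict[str, Row],
                       clinical: Dict[str, Row],
                       prefix: str = "clin_") -> Dict[str, Row]:
    """Join two feature tables on subject ID, keeping subjects present in both."""
    merged: Dict[str, Row] = {}
    for subject_id in sorted(set(omics) & set(clinical)):
        row = dict(omics[subject_id])
        # Prefix clinical columns so they cannot collide with omics columns.
        row.update({prefix + k: v for k, v in clinical[subject_id].items()})
        merged[subject_id] = row
    return merged


# Made-up values for illustration only.
lipidomics = {
    "S1": {"LPE 18:0": 0.42, "18:2 LPA": 0.10},
    "S2": {"LPE 18:0": 0.37, "18:2 LPA": 0.15},
    "S3": {"LPE 18:0": 0.51, "18:2 LPA": 0.08},
}
clinical = {
    "S1": {"age": 61.0, "bmi": 27.5},
    "S2": {"age": 54.0, "bmi": 24.1},
}

combined = integrate_features(lipidomics, clinical)
```

Subject `S3` is dropped here because it has no clinical record; restricting to the intersection is one simple choice, and imputing the missing clinical values would be an alternative.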

Messages


Individual Omics data sets have variable predictive performance

Thorough statistical filtering plus biological knowledge integration is needed to combat the inherently high level of data noise

Integration of different Omics data with clinical data can improve predictive performance



Acknowledgment

We thank all the members in our team.