Copyright

2003 limsoon wong

From Informatics to Bioinformatics:

The Knowledge Discovery Perspective

Limsoon Wong

Institute for Infocomm Research

Singapore


Plan


Overview of recent knowledge discovery
successes in bioinformatics


Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy


Recognition of translation initiation sites from DNA sequences



Overview of recent knowledge discovery successes in bioinformatics


What is Datamining?

Jonathan’s rules: Blue or Circle

Jessica’s rules: All the rest

Whose block is this?

[Figure: Jonathan’s blocks and Jessica’s blocks]
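
The two rules on this slide can be read as a tiny rule-based classifier. The sketch below is only an illustration: the attribute names `colour` and `shape` and the sample blocks are assumptions, not taken from the slide's figure.

```python
def owner(colour: str, shape: str) -> str:
    """Assign a block to an owner using the two rules from the slide."""
    # Jonathan's rule: the block is blue, or it is a circle.
    if colour == "blue" or shape == "circle":
        return "Jonathan"
    # Jessica's rule: all the rest.
    return "Jessica"


# Hypothetical blocks, described as (colour, shape) pairs.
blocks = [("blue", "square"), ("red", "circle"), ("red", "triangle")]
labels = [owner(colour, shape) for colour, shape in blocks]
print(labels)
```

Datamining is the reverse task: given only the labelled blocks, recover rules like these.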


What is Datamining?

Question: Can you explain how?


What is Bioinformatics?


Bioinformatics brings benefits

To the patient:

Better drug, better treatment


To the pharma:

Save time, save cost, make more $


To the scientist:

Better science


To figure these out, we bet on...

“solution” = Data Mgmt + Knowledge Discovery

Data Mgmt = Integration + Transformation + Cleansing

Knowledge Discovery = Statistics + Algorithms + Databases


History: 8 years of bioinformatics R&D in Singapore

[Timeline figure with axis marks 1994, 1996, 1998, 2000, 2002]

Integration Technology (Kleisli)
Cleansing & Warehousing (FIMM)
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Venom Informatics

ISS, KRDL, LIT/I2R

GeneticXchange, Molecular Connections, Biobase


Predict Epitopes,

Find Vaccine Targets


Vaccines are often the only solution for viral diseases

Finding & developing effective vaccine targets (epitopes) is a slow and expensive process


Recognize Functional Sites, Help Scientists

Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments

Data mining of bio seqs to find rules for recognizing & understanding functional sites

Dragon’s 10x reduction of TSS recognition false positives


Diagnose Leukaemia,

Benefit Children


Childhood leukaemia is a heterogeneous disease

Treatment is based on subtype

3 different tests and 4 different experts are needed for diagnosis

Curable in USA, fatal in Indonesia


Understand Proteins,

Fight Diseases


Understanding the function and role of a protein needs organised info on interaction pathways

Such info is often reported in scientific papers but is seldom found in structured databases

Knowledge extraction system to process free text

extract protein names

extract interactions


Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy


Childhood ALL

Heterogeneous Disease


Major subtypes are

T-ALL

E2A-PBX1

TEL-AML1

MLL genome rearrangements

Hyperdiploid>50

BCR-ABL


Childhood ALL

Treatment Failure


Overly intensive treatment leads to


Development of secondary cancers


Reduction of IQ


Insufficiently intensive treatment leads to


Relapse


Childhood ALL

Risk-Stratified Therapy

Different subtypes respond differently to the same treatment intensity

Match patient to optimum treatment intensity for his subtype & prognosis

[Figure: subtypes grouped by risk. Labels: BCR-ABL, MLL; TEL-AML1, Hyperdiploid>50; T-ALL; E2A-PBX1. Captions: "Generally good-risk, lower intensity"; "Generally high-risk, higher intensity"]


Childhood ALL

Risk Assignment


The major subtypes look similar





Conventional diagnosis requires


Immunophenotyping


Cytogenetics


Molecular diagnostics


Mission


Conventional risk assignment procedure requires difficult, expensive tests and collective judgement of multiple specialists

Generally available only in major advanced hospitals

Can we have a single-test, easy-to-use platform instead?


Single-Test Platform of Microarray & Machine Learning


Overall Strategy

[Figure: Diagnosis of subtype; Subtype-dependent prognosis; Risk-stratified treatment intensity]

For each subtype, select genes to develop classification model for diagnosing that subtype

For each subtype, select genes to develop prediction model for prognosis of that subtype


Childhood ALL


Subtype Diagnosis by PCL


Gene expression data collection

Gene selection by χ2

Classifier training by emerging pattern

Classifier tuning (optional for some machine learning methods)

Apply classifier for diagnosis of future cases by PCL


Childhood ALL Subtype Diagnosis

Our Workflow

A tree-structured diagnostic workflow was recommended by our doctor collaborator


Childhood ALL Subtype Diagnosis

Training and Testing Sets


Childhood ALL Subtype Diagnosis

Signal Selection Basic Idea


Choose a signal w/ low intra-class distance

Choose a signal w/ high inter-class distance
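
One way to turn this idea into a number is a signal-to-noise style score: the gap between the class means divided by the spread inside the classes. The sketch below is an assumption for illustration, not necessarily the measure used in this work, and the expression values are made up.

```python
from statistics import mean, pstdev


def separation_score(class_a, class_b):
    """Inter-class distance divided by intra-class spread for one signal."""
    inter = abs(mean(class_a) - mean(class_b))   # distance between classes
    intra = pstdev(class_a) + pstdev(class_b)    # spread within the classes
    if intra == 0:
        return float("inf") if inter > 0 else 0.0
    return inter / intra


# Hypothetical expression values of two genes in two classes of samples.
good_gene = separation_score([1.0, 1.2, 0.8], [5.0, 5.2, 4.8])
poor_gene = separation_score([1.0, 5.0, 3.0], [2.0, 4.0, 3.5])
print(good_gene > poor_gene)
```

A signal with tight classes that sit far apart scores high; a signal whose classes overlap scores low.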


Childhood ALL Subtype Diagnosis

Signal Selection by χ2
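
The formula on this slide did not survive extraction. As a hedged illustration, the standard chi-square statistic on a contingency table of (discretised signal value) by (class) can be computed as below. Discretising a gene into "high" and "low" and the sample counts are assumptions made for the example.

```python
def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if signal and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Rows: gene is high / low. Columns: class A / class B. Counts are hypothetical.
informative = chi_square([[18, 2], [3, 17]])
uninformative = chi_square([[10, 10], [10, 10]])
print(informative, uninformative)
```

Signals are then ranked by this statistic and the top-scoring ones are kept.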


Childhood ALL Subtype Diagnosis

Emerging Patterns


An emerging pattern is a set of conditions, usually involving several features, that most members of a class satisfy but few or none of the other class satisfy

A jumping emerging pattern is an emerging pattern that some members of a class satisfy but no members of the other class satisfy

We use only jumping emerging patterns
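
The definition above can be checked mechanically. In this minimal sketch each sample is a set of discretised conditions such as "g1=high"; that encoding and the toy samples are assumptions for illustration.

```python
def support(pattern, samples):
    """Fraction of samples (sets of conditions) that satisfy the whole pattern."""
    if not samples:
        return 0.0
    return sum(pattern <= sample for sample in samples) / len(samples)


def is_jumping_emerging_pattern(pattern, home, other):
    """True if some members of `home` satisfy the pattern and no member of `other` does."""
    return support(pattern, home) > 0 and support(pattern, other) == 0


# Hypothetical samples from two classes.
class_a = [{"g1=high", "g2=low"}, {"g1=high", "g2=low", "g3=high"}, {"g1=low", "g2=low"}]
class_b = [{"g1=high", "g2=high"}, {"g1=low", "g2=low"}]

jep = frozenset({"g1=high", "g2=low"})
not_jep = frozenset({"g2=low"})
print(is_jumping_emerging_pattern(jep, class_a, class_b))
print(is_jumping_emerging_pattern(not_jep, class_a, class_b))
```

The second pattern fails because one sample of the other class also satisfies it.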


Childhood ALL Subtype Diagnosis

PCL: Prediction by Collective Likelihood


Childhood ALL Subtype Diagnosis

Accuracy of PCL (vs. other classifiers)

The classifiers are all applied to the 20 genes selected by χ2 at each level of the tree


Multidimensional Scaling Plot


Subtype Diagnosis


Multidimensional Scaling Plot

Subtype-Dependent Prognosis

Similar computational analysis was carried out to predict relapse and/or secondary AML in a subtype-specific manner

>97% accuracy achieved


Childhood ALL

Is there a new subtype?


Hierarchical
clustering of gene
expression
profiles reveals a
novel subtype of
childhood ALL


Childhood ALL

Cure Rates in ASEAN Countries


Conventional risk
assignment
procedure requires
difficult expensive
tests and collective
judgement of multiple
specialists


Not available in less
advanced ASEAN
countries


Childhood ALL

Treatment Cost


Treatment for childhood ALL over 2 yrs


Intermediate intensity: US$60k


Low intensity: US$36k


High intensity: US$72k


Treatment for relapse: US$150k


Cost for side-effects: Unquantified


Childhood ALL in ASEAN Countries

Current Situation (2000 new cases/yr)

Intermediate intensity conventionally applied in less advanced ASEAN countries

Over intensive for 50% of patients, thus more side effects

Under intensive for 10% of patients, thus more relapse

5-20% cure rates

US$120m (US$60k * 2000) for intermediate intensity treatment

US$30m (US$150k * 2000 * 10%) for relapse treatment

Total US$150m/yr plus unquantified costs for dealing with side effects


Childhood ALL in ASEAN Countries

Using Our Platform (2000 new cases/yr)

Low intensity applied to 50% of patients

Intermediate intensity to 40% of patients

High intensity to 10% of patients

Reduced side effects

Reduced relapse

75-80% cure rates

US$36m (US$36k * 2000 * 50%) for low intensity

US$48m (US$60k * 2000 * 40%) for intermediate intensity

US$14.4m (US$72k * 2000 * 10%) for high intensity

Total US$98.4m/yr

Save US$51.6m/yr
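
The totals on the two cost slides can be re-derived from the per-patient figures. The sketch below uses only numbers stated on the slides; the function and variable names are made up for the example. Like the slide, the platform total counts treatment cost only and leaves out relapse and side-effect costs.

```python
NEW_CASES_PER_YEAR = 2000

# Cost per patient in US$, as given on the Treatment Cost slide.
COST = {"low": 36_000, "intermediate": 60_000, "high": 72_000, "relapse": 150_000}


def annual_cost(percent_of_patients):
    """Yearly cost in US$ given the percentage of patients incurring each cost item."""
    return sum(
        COST[item] * NEW_CASES_PER_YEAR * percent // 100
        for item, percent in percent_of_patients.items()
    )


# Current situation: everyone gets intermediate intensity, 10% relapse.
current = annual_cost({"intermediate": 100, "relapse": 10})
# Using the platform: 50% low, 40% intermediate, 10% high intensity.
platform = annual_cost({"low": 50, "intermediate": 40, "high": 10})
saving = current - platform
print(current, platform, saving)
```

Integer percentages keep the arithmetic exact, so the results match the slide figures to the dollar.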


Acknowledgements


Recognition of translation initiation sites from DNA sequences


Translation Initiation Site


A Sample mRNA


299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

............................................................ 80

................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE


What makes the second ATG the translation initiation site?


Translation Initiation Site Recognition:

Steps of a General Approach


Training data gathering

Signal generation

k-grams, colour, texture, domain know-how, ...

Signal selection

Entropy, χ2, CFS, t-test, domain know-how...

Signal integration

SVM, ANN, PCL, CART, C4.5, kNN, ...



Translation Initiation Site Recognition:

Training & Testing Data


Vertebrate dataset of Pedersen & Nielsen [ISMB’97]

3312 sequences

13503 ATG sites

3312 (24.5%) are TIS

10191 (75.5%) are non-TIS

Use for 3-fold x-validation expts
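
The class proportions follow directly from the counts, and a 3-fold cross-validation split can be sketched in a few lines. How the original experiments formed their folds is not stated on the slide; splitting the ATG sites at random by index, the seed, and the function name are all assumptions here.

```python
import random


def three_fold_splits(n_items, seed=0):
    """Yield (train_indices, test_indices) pairs for 3-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::3] for i in range(3)]  # three near-equal folds
    for i in range(3):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


TOTAL_ATG, TIS = 13503, 3312
non_tis = TOTAL_ATG - TIS
tis_percent = round(100 * TIS / TOTAL_ATG, 1)
splits = list(three_fold_splits(TOTAL_ATG))
print(non_tis, tis_percent, [len(test) for _, test in splits])
```

Each ATG site is tested exactly once, by a model trained on the other two folds.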


Translation Initiation Site Recognition:

Signal Generation


k-grams (i.e., k consecutive letters)

k = 1, 2, 3, 4, 5, …

Window size vs. fixed position

Upstream, downstream vs. anywhere in window

In-frame vs. any frame


Signal Generation:


An Example


299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT


Window = 100 bases

In-frame, downstream

GCT = 1, TTT = 1, ATG = 1…

Any-frame, downstream

GCT = 3, TTT = 2, ATG = 2…

In-frame, upstream

GCT = 2, TTT = 0, ATG = 0, ...
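
The counting scheme in this example can be sketched as follows. This is a minimal illustration under assumptions: the function name, its parameters, and the short toy sequence are made up, and it is not meant to reproduce the exact counts shown for HSU27655.1.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k=3, window=100, direction="down", in_frame=True):
    """Count k-grams in a window upstream or downstream of a candidate ATG.

    atg_pos is the 0-based index of the A of the candidate ATG.
    In-frame k-grams start at offsets that are multiples of 3 from the ATG.
    """
    step = 3 if in_frame else 1
    if direction == "down":
        start = atg_pos + 3
        region_end = min(len(seq), start + window)
        starts = range(start, region_end - k + 1, step)
    else:
        region_start = max(0, atg_pos - window)
        # Walk backwards from the ATG so in-frame k-grams stay aligned with it.
        starts = range(atg_pos - k, region_start - 1, -step)
    return Counter(seq[i:i + k] for i in starts)


toy = "GCTGCTATGGCTTTTATGGCT"  # hypothetical sequence, candidate ATG at index 6
down_in = kgram_counts(toy, 6, direction="down", in_frame=True)
down_any = kgram_counts(toy, 6, direction="down", in_frame=False)
up_in = kgram_counts(toy, 6, direction="up", in_frame=True)
print(down_in["GCT"], down_any["TTT"], up_in["GCT"])
```

Each (k-gram, direction, frame) combination becomes one numeric signal for the candidate ATG.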


Signal Generation:

Too Many Signals


For each value of k, there are 4^k * 3 * 2 k-grams

If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!

This is too many for most machine learning algorithms
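
The per-k counts can be checked by evaluating the formula directly. Note that for k = 1 to 5 the formula yields the terms 24 through 6144, which sum to 8184; the slide's sum also carries a leading term of 4, which brings it to 8188. The sketch only evaluates the formula and does not try to explain that extra term.

```python
def kgram_feature_count(k):
    """Number of k-gram signals for one k: 4^k possible k-grams times 3 times 2."""
    return 4 ** k * 3 * 2


counts = {k: kgram_feature_count(k) for k in range(1, 6)}
total = sum(counts.values())
print(counts, total)
```

Either way the count runs into the thousands, which motivates the signal selection step.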


Translation Initiation Site Recognition:

Signal Selection
(eg.,




Translation Initiation Site Recognition:

Signal Selection (eg., CFS)

Instead of scoring individual signals, how about scoring a group of signals as a whole?

CFS: Correlation-based Feature Selection

A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
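
The slide does not give a formula. A common way to score a group in CFS is the merit heuristic k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch below assumes that heuristic and uses made-up correlation values.

```python
from math import sqrt


def cfs_merit(class_corrs, pair_corrs):
    """Merit of a signal group from signal-class and signal-signal correlations."""
    k = len(class_corrs)
    r_cf = sum(abs(r) for r in class_corrs) / k
    r_ff = sum(abs(r) for r in pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Three signals, each correlated 0.6 with the class (hypothetical values).
independent = cfs_merit([0.6, 0.6, 0.6], [0.1, 0.1, 0.1])  # nearly uncorrelated signals
redundant = cfs_merit([0.6, 0.6, 0.6], [0.9, 0.9, 0.9])    # highly inter-correlated signals
print(independent > redundant)
```

The redundant group scores lower even though each of its signals is individually just as predictive.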


Signal Selection: Sample k-grams Selected

Position -3 (Kozak consensus)

in-frame upstream ATG (Leaky scanning)

in-frame downstream TAA, TAG, TGA (Stop codon)

in-frame downstream CTG, GAC, GAG, and GCC (Codon bias)


Translation Initiation Site Recognition:

Signal Integration


kNN

Given a test sample, find the k training samples that are most similar to it. Let the majority class win.

SVM

Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.

Naïve Bayes, ANN, C4.5, PCL, ...
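
The kNN rule described above fits in a few lines. This is a minimal sketch: squared Euclidean distance as the similarity measure, the default k = 3, and the toy feature vectors are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training samples nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-signal training samples.
train = [((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
         ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS")]
prediction = knn_predict(train, (5, 4))
print(prediction)
```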


Translation Initiation Site Recognition:

Results
(on Pedersen & Nielsen’s mRNA)


Translation Initiation Site Recognition:

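[Figure: mRNA translated into protein via the genetic code table. Amino acids shown: F, L, I, M, V, S, P, T, A, Y, H, Q, N, K, D, E, C, W, R, G. Example translation: A T E L R S stop]

How about using k-grams from the translation?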
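
As a sketch of this idea, a DNA sequence can be translated codon by codon with the standard genetic code and k-grams counted over the resulting amino-acid string. The DNA string below is hypothetical, chosen only so that it translates to the A T E L R S stop example; the function names are also assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate DNA codon by codon, starting at position 0."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def amino_kgrams(protein, k):
    """Count k-grams over an amino-acid string."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


protein = translate("GCTACTGAACTGCGTTCTTAA")  # hypothetical DNA
pairs = amino_kgrams(protein, 2)
print(protein, pairs)
```

Because synonymous codons map to the same letter, amino-acid k-grams pool signals that DNA k-grams keep apart.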


Signal Generation: Amino-Acid Features



Signal Selection: Amino Acid K-grams Discovered


Translation Initiation Site Recognition:

Results
(based on amino acid features)

Performance based on amino-acid features is better than performance based on DNA seq. features


Acknowledgements


Huiqing Liu


Jinyan Li


Roland Yap


Zeng Fanfan


A.G. Pedersen


H. Nielsen


To give this lecture to SMA students.

Date: 28 Oct 2003

Time: 10-11.30am

Venue: Video Conference Room, S15-04-30