Gene Feature Recognition

NUS - KI Course on Bioinformatics, Nov 2005

For written notes on this lecture, please read Chapters 4 and 7 of
The Practical Bioinformatician

Gene Feature Recognition

Limsoon Wong


Recognition of Splice Sites

A simple example to start the day




Copyright 2005 © Limsoon Wong

Splice Sites

Acceptor

Donor


Acceptor Site (Human Genome)


If we align all known acceptor sites (with their
splice junction site aligned), we have the
following nucleotide distribution







Acceptor site: CAG | TAG | coding region


Image credit: Xu


Donor Site (Human Genome)


If we align all known donor sites (with their splice
junction site aligned), we have the following
nucleotide distribution







Donor site: coding region | GT

Image credit: Xu


What Positions Have “High” Information Content?

For a weight matrix, information content of each column is calculated as

$\sum_{X \in \{A,C,G,T\}} \mathrm{Prob}(X) \cdot \log\left(\mathrm{Prob}(X) / 0.25\right)$

When a column has evenly distributed nucleotides, its information content is lowest

Only need to look at positions having high information content


Information Content Around Donor Sites in Human Genome

Information content

column 3 = $0.34 \log(0.34/0.25) + 0.363 \log(0.363/0.25) + 0.183 \log(0.183/0.25) + 0.114 \log(0.114/0.25) = 0.04$

column 1 = $0.092 \log(0.092/0.25) + 0.033 \log(0.033/0.25) + 0.803 \log(0.803/0.25) + 0.073 \log(0.073/0.25) = 0.30$

Image credit: Xu
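
A minimal Python sketch of the per-column calculation, for readers who want to check the numbers. The slide does not state the logarithm base; base 10 is an assumption here, chosen because it reproduces the two worked values (0.04 and 0.30). The function name is illustrative, and the frequencies are passed as a plain list because the slide does not say which nucleotide each value belongs to.

```python
import math

def information_content(column_probs, background=0.25, base=10):
    """Sum of p * log(p / background) over the four nucleotide frequencies.

    base=10 is an assumption: it matches the worked values on the slide.
    """
    return sum(p * math.log(p / background, base) for p in column_probs if p > 0)

# Frequencies taken from the two worked columns on the slide.
column_3 = [0.34, 0.363, 0.183, 0.114]
column_1 = [0.092, 0.033, 0.803, 0.073]

ic_3 = information_content(column_3)
ic_1 = information_content(column_1)
ic_uniform = information_content([0.25, 0.25, 0.25, 0.25])  # evenly distributed column
```

An evenly distributed column gives 0, the lowest possible value, which is why such positions can be ignored.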


Weight Matrix Model for Splice Sites


Weight matrix model


build a weight matrix for donor, acceptor,
translation start site, respectively


use positions of high information content


Image credit: Xu


Splice Site Prediction: A Procedure


Add up the frequencies of the corresponding letters in the corresponding positions:

AAG GT AAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24

TGT GT CTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56

Make a prediction on the splice site based on some threshold

Image credit: Xu
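
The procedure can be sketched in Python as follows. This is only an illustration: the matrix holds just the frequencies that appear in the two worked examples, every other entry is unknown and defaults to 0.0, and the threshold of 4.0 is a hypothetical value, not one given in the lecture.

```python
# Partial donor weight matrix: only the entries visible in the two worked
# examples. Missing letters default to 0.0 (an assumption of this sketch).
DONOR_MATRIX = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

def score(site, matrix=DONOR_MATRIX):
    """Add up the frequency of each letter of `site` at its own position."""
    if len(site) != len(matrix):
        raise ValueError("site length must match the number of matrix columns")
    return sum(column.get(base, 0.0) for base, column in zip(site, matrix))

def is_donor(site, threshold=4.0):
    """Predict a donor site when the score reaches a (hypothetical) threshold."""
    return score(site) >= threshold

strong = score("AAGGTAAGT")  # first worked example
weak = score("TGTGTCTCA")    # second worked example
```

With this threshold the first candidate is accepted and the second rejected. In practice the cutoff would be tuned on known sites.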


Recognition of Translation Initiation Sites

An introduction to the world’s simplest TIS recognition system

A simple approach to accuracy and understandability



Translation Initiation Site


A Sample cDNA


What makes the second ATG the TIS?


299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80

CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160

GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240

CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

............................................................ 80

................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE



Approach


Training data gathering


Signal generation

k-grams, distance, domain know-how, ...

Signal selection

Entropy, χ², CFS, t-test, domain know-how, ...


Signal integration


SVM, ANN, PCL, CART, C4.5, kNN, ...


Training & Testing Data


Vertebrate dataset of Pedersen & Nielsen
[ISMB’97]


3312 sequences


13503 ATG sites


3312 (24.5%) are TIS


10191 (75.5%) are non-TIS

Use for 3-fold cross-validation experiments


Signal Generation


k-grams (i.e., k consecutive letters)

k = 1, 2, 3, 4, 5, …

Window size vs. fixed position

Upstream, downstream vs. anywhere in window

In-frame vs. any frame


Signal Generation: An Example


Window = ±100 bases

In-frame, downstream

GCT = 1, TTT = 1, ATG = 1, ...

Any-frame, downstream

GCT = 3, TTT = 2, ATG = 2, ...

In-frame, upstream

GCT = 2, TTT = 0, ATG = 0, ...

299 HSU27655.1 CAT U27655 Homo sapiens

CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80

CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160

GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240

CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
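
A Python sketch of how such counts could be generated for the sample cDNA. It assumes a window of 100 bases on each side of the candidate ATG, counts overlapping occurrences in the any-frame case, and excludes the candidate ATG itself. The function name, its parameters and the 0-based index `TIS = 92` are illustrative choices, not part of the lecture.

```python
from collections import Counter

# The four sequence lines exactly as printed on the slide (HSU27655.1).
SEQ = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)
TIS = 92  # 0-based index of the A in the second ATG of the printed text

def count_kgrams(seq, atg, k=3, window=100, side="downstream", in_frame=True):
    """Count k-grams upstream or downstream of the ATG starting at `atg`."""
    if side == "downstream":
        region = seq[atg + 3 : atg + 3 + window]
        first = 0  # the codon right after the ATG is in frame
    else:
        region = seq[max(0, atg - window) : atg]
        # Align reading frames so that the last codon ends just before the ATG.
        first = len(region) % 3 if in_frame else 0
    step = 3 if in_frame else 1
    starts = range(first, len(region) - k + 1, step)
    return Counter(region[i : i + k] for i in starts)

in_down = count_kgrams(SEQ, TIS, side="downstream", in_frame=True)
any_down = count_kgrams(SEQ, TIS, side="downstream", in_frame=False)
in_up = count_kgrams(SEQ, TIS, side="upstream", in_frame=True)
```

On the sequence as printed, this reproduces the GCT, TTT and ATG counts listed above for all three settings. Each (k-gram, frame, direction) combination then becomes one feature of the candidate ATG.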


An Example File Resulting From Feature Generation


Too Many Signals


For each value of k, there are $4^k \times 3 \times 2$ k-grams

If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

This is too many for most machine learning algorithms
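
The count is easy to verify with a few lines of Python. The function name and the `variants` parameter (standing for the factor 3 × 2 in the formula) are illustrative.

```python
def num_kgram_features(k, alphabet_size=4, variants=3 * 2):
    """Number of k-gram features for one k: 4^k k-grams times 3 * 2 variants."""
    return alphabet_size ** k * variants

counts = [num_kgram_features(k) for k in range(1, 6)]
total = sum(counts)
```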


Signal Selection (Basic Idea)


Choose a signal with low intra-class distance

Choose a signal with high inter-class distance


Signal Selection (e.g., t-statistics)
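
The formula from this slide is not preserved in the text. As a stand-in, the sketch below uses one common form, the unequal-variance (Welch) t-statistic, which is large when the class means are far apart (high inter-class distance) and the within-class variances are small (low intra-class distance). Whether the lecture used exactly this variant is an assumption, and the function name and toy data are illustrative.

```python
import math
from statistics import mean, variance

def t_statistic(values_a, values_b):
    """|mean_a - mean_b| / sqrt(var_a / n_a + var_b / n_b) for one signal.

    Uses sample variances; each class needs at least two values and the
    two variances must not both be zero.
    """
    n_a, n_b = len(values_a), len(values_b)
    standard_error = math.sqrt(variance(values_a) / n_a + variance(values_b) / n_b)
    return abs(mean(values_a) - mean(values_b)) / standard_error

# Toy values of one signal in the two classes (made up for illustration).
t_separated = t_statistic([1, 2, 3], [4, 5, 6])
t_overlapping = t_statistic([1, 2, 3], [2, 3, 4])
```

Signals would be ranked by this score, and the highest-scoring ones kept.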


Signal Selection (e.g., χ²)
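
The body of this slide is not preserved in the text. The sketch below shows the standard χ² statistic on a contingency table of signal value versus class, which is one usual way to score a signal. How the lecture discretised signal values is not known, so the table layout, function name and toy counts are assumptions.

```python
def chi_square(table):
    """Chi-square statistic of a contingency table.

    table[i][j] = number of samples with signal value i and class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic

# Toy tables: rows = signal absent / present, columns = non-TIS / TIS.
uninformative = chi_square([[10, 10], [10, 10]])
informative = chi_square([[20, 0], [0, 20]])
```

A signal whose values are independent of the class scores 0, while a signal that separates the classes perfectly scores highest.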


Signal Selection (e.g., CFS)

Instead of scoring individual signals, how about scoring a group of signals as a whole?

CFS

Correlation-based Feature Selection

A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
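
The slide gives no formula, but the usual CFS merit of a group of k signals is k · r_cf / sqrt(k + k(k−1) · r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. A small Python sketch, with illustrative names and made-up correlation values:

```python
import math

def cfs_merit(class_correlations, mean_inter_correlation):
    """CFS merit of a group of signals.

    class_correlations: correlation of each signal in the group with the class.
    mean_inter_correlation: average correlation between pairs of signals.
    """
    k = len(class_correlations)
    mean_class_correlation = sum(class_correlations) / k
    return (k * mean_class_correlation) / math.sqrt(
        k + k * (k - 1) * mean_inter_correlation
    )

# Two groups equally correlated with the class (toy numbers).
merit_uncorrelated = cfs_merit([0.5, 0.5], 0.0)  # signals independent of each other
merit_redundant = cfs_merit([0.5, 0.5], 1.0)     # signals duplicate each other
```

The group of mutually uncorrelated signals gets the higher merit, as the slide describes.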


Sample k-grams Selected by CFS

Position −3: Kozak consensus

In-frame upstream ATG: leaky scanning

In-frame downstream TAA, TAG, TGA: stop codon

In-frame downstream CTG, GAC, GAG, and GCC: codon bias?


Signal Integration


kNN

Given a test sample, find the k training samples that are most similar to it. Let the majority class win

SVM

Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes

Naïve Bayes, ANN, C4.5, ...
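
A minimal Python sketch of the kNN rule described above. Euclidean distance is assumed as the similarity measure, and the function name, feature vectors and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(training_samples, query, k):
    """Return the majority class among the k training samples nearest to query.

    training_samples: list of (feature_vector, label) pairs.
    Distance is Euclidean (an assumption of this sketch).
    """
    nearest = sorted(training_samples, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two features per candidate ATG.
training = [
    ((0.0, 0.0), "non-TIS"),
    ((0.0, 1.0), "non-TIS"),
    ((1.0, 0.0), "non-TIS"),
    ((5.0, 5.0), "TIS"),
    ((5.0, 6.0), "TIS"),
]

label_near_origin = knn_predict(training, (0.2, 0.1), k=3)
label_far = knn_predict(training, (5.0, 5.5), k=2)
```

Choosing an odd k avoids tied votes when there are two classes.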


Illustration of kNN (k=8)

[Figure: a neighborhood containing 5 samples of one class and 3 of the other]

Image credit: Zaki

Typical “distance” measure: [formula not preserved]


Using WEKA for TIS Prediction


Results (3-fold cross-validation)

* Using top 20 χ²-selected features from amino-acid features


Validation Results (on Chr X and Chr 21)


Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

[Figure: results for ATGpr vs. our method]


Technique Comparisons


Pedersen & Nielsen [ISMB’97]
85% accuracy
Neural network
No explicit features

Zien [Bioinformatics’00]
88% accuracy
SVM + kernel engineering
No explicit features

Hatzigeorgiou [Bioinformatics’02]
94% accuracy (with scanning rule)
Multiple neural networks
No explicit features

Our approach
89% accuracy (94% with scanning rule)
Explicit feature generation
Explicit feature selection
Use any machine learning method without any form of complicated tuning


Recognition of Transcription Start Sites

An introduction to the world’s best TSS recognition system

A heavy tuning approach


Transcription Start Site


Structure of Dragon Promoter Finder

-200 to +50 window size

Model selected based on desired sensitivity


Each model has two submodels based on GC content

GC-rich submodel

GC-poor submodel

$(C+G) = \dfrac{\#C + \#G}{\text{Window Size}}$
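
A short Python sketch of the GC-content calculation and the choice between the two submodels. The cutoff of 0.5 is a hypothetical value, since the slide does not say where the GC-rich/GC-poor boundary lies, and the function names are illustrative.

```python
def gc_content(window):
    """(#C + #G) / window size for one sequence window."""
    window = window.upper()
    return (window.count("C") + window.count("G")) / len(window)

def choose_submodel(window, cutoff=0.5):
    """Pick the submodel by GC content; the cutoff is an assumed value."""
    return "GC-rich" if gc_content(window) >= cutoff else "GC-poor"

example_gc = gc_content("GCGCAT")
rich = choose_submodel("GGCC")
poor = choose_submodel("ATAT")
```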


Data Analysis Within Submodel

k-gram (k = 5) positional weight matrix

$s_p$, $s_e$, $s_i$


Promoter, Exon, Intron Sensors


These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)

They are calculated as s below using promoter, exon, intron data respectively

[Formula for s not preserved; its annotations were:]

Pentamer at the i-th position in the input

j-th pentamer at the i-th position in the training window

Frequency of the j-th pentamer at the i-th position in the training window

Window size


Data Preprocessing & ANN

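Tuning parameters

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

$\mathrm{net} = \sum_i s_i \cdot w_i$

Output: $\tanh(\mathrm{net})$, compared against a tuned threshold

Simple feedforward ANN trained by the Bayesian regularisation method

[Figure labels: $s_{IE}$, $s_I$, $s_E$, $w_i$]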
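
The unit computation on this slide (a weighted sum passed through tanh, then compared with a threshold) can be sketched in Python as below. The signal values, weights and threshold are made-up numbers. In the actual system they come from the sensors, from training and from tuning respectively, and the function names are illustrative.

```python
import math

def tanh_from_definition(x):
    """tanh(x) written out as (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def ann_output(signals, weights):
    """tanh(net), where net is the sum of s_i * w_i."""
    net = sum(s * w for s, w in zip(signals, weights))
    return math.tanh(net)

def predict_tss(signals, weights, threshold):
    """Report a TSS when the output exceeds the tuned threshold."""
    return ann_output(signals, weights) > threshold

# Toy inputs: three signals and three weights (illustrative values only).
output = ann_output([0.4, 0.3, -0.1], [1.0, 0.5, 2.0])
decision = predict_tss([0.4, 0.3, -0.1], [1.0, 0.5, 2.0], threshold=0.2)
```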


Accuracy Comparisons

without C+G submodels

with C+G submodels


Notes


References (TIS Recognition)


A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997

H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139--168, 2003

A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799--807, 2000

A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343--350, 2002


References (TSS Recognition)


V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323--332, 2003

J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861--878, 1997

A. G. Pedersen et al., “The biology of eukaryotic promoter prediction --- a review”, Computer & Chemistry 23:191--207, 1999

M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599--606, 2000