What is bioinformatics? - Center for Biological Sequence Analysis


Introduction to Pattern Recognition

Prediction in Bioinformatics

What do we want to predict?
    Features from sequence
    Data mining

How can we predict?
    Homology / Alignment
    Pattern Recognition / Statistical Methods / Machine Learning

What is prediction?
    Generalization / Overfitting
    Preventing overfitting: Homology reduction

How do we measure prediction?
    Performance measures
    Threshold selection

Henrik Nielsen
Center for Biological Sequence Analysis
Technical University of Denmark



Sequence → structure → function

Prediction from DNA sequence


Protein-coding genes


transcription factor binding sites


transcription start/stop


translation start/stop


splicing: donor/acceptor sites



Non-coding RNA


tRNAs


rRNAs


miRNAs


General features


Structure (curvature/bending)


Binding (histones etc.)


Prediction from amino acid sequence

Folding / structure


Post-Translational Modifications


Attachment: phosphorylation, glycosylation, lipid attachment


Cleavage: signal peptides, propeptides, transit peptides


Sorting: secretion, import into various organelles, insertion into membranes


Interactions


Function


Enzyme activity


Transport


Receptors


Structural components


etc...


Protein sorting in eukaryotes


Proteins belong in different organelles of the cell, and some even have their function outside the cell


Günter Blobel was awarded the 1999 Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Data: UniProt annotation of protein sorting

Annotations relevant for protein sorting are found in:


the CC (comments) lines

cross-references (DR lines) to GO (Gene Ontology)

the FT (feature table) lines


ID   INS_HUMAN               Reviewed;         110 AA.
AC   P01308;
...
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
...
CC   -!- SUBCELLULAR LOCATION: Secreted.
...
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT   SIGNAL        1     24


3 types of non-experimental qualifiers in the CC and FT lines:

Potential: Predicted by sequence analysis methods

Probable: Inconclusive experimental evidence

By similarity: Predicted by alignment to proteins with known location
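As an illustration of how such annotations can be extracted in practice, here is a minimal Python sketch that pulls the SUBCELLULAR LOCATION comment out of a UniProt flat-file record and flags the non-experimental qualifiers listed above. The function name and the simple line-joining logic are illustrative choices, not part of any official parser; a real parser would have to handle the full CC block grammar.

NON_EXPERIMENTAL = ("Potential", "Probable", "By similarity")

def subcellular_location(record_lines):
    """Return the SUBCELLULAR LOCATION comment text, or None."""
    topic = "CC   -!- SUBCELLULAR LOCATION:"
    collecting = False
    parts = []
    for line in record_lines:
        if line.startswith(topic):
            collecting = True
            parts.append(line[len(topic):].strip())
        elif collecting and line.startswith("CC       "):
            parts.append(line[len("CC"):].strip())  # continuation line
        elif collecting:
            break  # a new topic or line type ends the comment
    return " ".join(parts) if parts else None

record = [
    "ID   INS_HUMAN               Reviewed;         110 AA.",
    "AC   P01308;",
    "CC   -!- SUBCELLULAR LOCATION: Secreted.",
    "FT   SIGNAL        1     24",
]

loc = subcellular_location(record)
print(loc)                                       # "Secreted."
print(any(q in loc for q in NON_EXPERIMENTAL))   # False: no qualifier present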

Problems in database parsing

Extreme example:


A4_HUMAN, Alzheimer disease amyloid protein

CC   -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC       protein. Note=Cell surface protein that rapidly becomes
CC       internalized via clathrin-coated pits. During maturation, the
CC       immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC       to the Golgi complex where complete maturation occurs (O-
CC       glycosylated and sulfated). After alpha-secretase cleavage,
CC       soluble APP is released into the extracellular space and the C-
CC       terminal is internalized to endosomes and lysosomes. Some APP
CC       accumulates in secretory transport vesicles leaving the late Golgi
CC       compartment and returns to the cell surface. Gamma-CTF(59) peptide
CC       is located to both the cytoplasm and nuclei of neurons. It can be
CC       translocated to the nucleus through association with Fe65. Beta-
CC       APP42 associates with FRPL1 at the cell surface and the complex is
CC       then rapidly internalized. APP sorts to the basolateral surface in
CC       epithelial cells. During neuronal differentiation, the Thr-743
CC       phosphorylated form is located mainly in growth cones, moderately
CC       in neurites and sparingly in the cell body. Casein kinase
CC       phosphorylation can occur either at the cell surface or within a
CC       post-Golgi compartment.
...
DR   GO; GO:0009986; C:cell surface; IDA:UniProtKB.
DR   GO; GO:0005576; C:extracellular region; TAS:ProtInc.
DR   GO; GO:0005887; C:integral to plasma membrane; TAS:ProtInc.

Prediction methods


Homology / Alignment


Simple pattern recognition



Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence.
Pattern: [KRHQSA]-[DENQ]-E-L>
(see the sketch after this list)

Statistical methods

Weight matrices: calculate amino acid probabilities

Other examples: Regression, variance analysis, clustering

Machine learning

Like statistical methods, but parameters are estimated by iterative training rather than direct calculation

Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
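To make the first two approaches concrete, here is a minimal Python sketch: the PROSITE pattern above translated into a regular expression (in PROSITE syntax, "-" separates positions and ">" anchors the match to the C-terminus), plus a toy weight matrix scored as a sum of per-position log-odds values. The matrix numbers are invented for illustration; real weight matrices are estimated from counts in aligned sites.

import re

# PROSITE PS00014 (ER_TARGET): [KRHQSA]-[DENQ]-E-L>
# "-" separates positions; ">" anchors the match to the C-terminus.
ER_TARGET = re.compile(r"[KRHQSA][DENQ]EL$")

print(bool(ER_TARGET.search("MLLSVPLLLGLLGLAVAKDEL")))  # True: ends in KDEL
print(bool(ER_TARGET.search("MLLSVPLLLGLLGLAVA")))      # False

# Toy weight matrix for the same 4-position site (log-odds scores;
# the numbers are invented). Unlisted residues score 0 (background).
wm = [
    {"K": 1.2, "R": 0.9, "H": 0.3, "Q": 0.2, "S": 0.2, "A": 0.1},
    {"D": 1.1, "E": 0.8, "N": 0.4, "Q": 0.4},
    {"E": 1.5},
    {"L": 1.4},
]

def wm_score(site):
    """Sum of per-position log-odds scores for a 4-residue site."""
    return sum(column.get(aa, 0.0) for column, aa in zip(wm, site))

print(wm_score("KDEL"))  # 5.2; compare against a chosen threshold

Unlike the pattern, which either matches or not, the weight matrix returns a graded score that can be thresholded, which is where threshold selection (see the outline) comes in.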


Prediction of subcellular localisation from sequence


Homology: threshold 30%-70% identity

Sorting signals (“zip codes”)

N-terminal: secretory (ER) signal peptides, mitochondrial & chloroplast transit peptides.

C-terminal: peroxisomal targeting signal 1, ER-retention signal.

internal: Nuclear localisation signals, nuclear export signals.

Global properties (see the sketch after this list)

amino acid composition, aa pair composition

composition in limited regions

predicted structure

physico-chemical parameters


Combined approaches
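As a sketch of the “global properties” idea referenced above, the snippet below computes amino acid composition and amino acid pair (dipeptide) composition as fixed-length feature vectors, the kind of input that the composition-based predictors listed later feed into a classifier. The alphabet ordering and normalisation are illustrative choices.

from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(seq):
    """20-dimensional amino acid composition (fractions)."""
    n = len(seq)
    return [seq.count(a) / n for a in AA]

def dipeptide_composition(seq):
    """400-dimensional amino acid pair composition (fractions)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return [pairs.count(a + b) / n for a, b in product(AA, repeat=2)]

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(len(aa_composition(seq)), len(dipeptide_composition(seq)))  # 20 400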

Signal-based prediction

Signal peptides

von Heijne 1983, 1986 [WM]

SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]

Mitochondrial & chloroplast transit peptides

Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]

ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]

iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]

Protein Prowler* (Hawkins & Bodén 2006) [NN]

* = also includes signal peptides

Nuclear localisation signals

PredictNLS (Cokol et al. 2000) [regex]

NucPred (Heddad et al. 2004) [regex, GA]

Composition-based prediction

Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]

ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]

Chou and Elrod 1998 [12 categories; covariant discriminant]

NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]

SubLoc (Hua and Sun 2001) [4 categories; SVM]

PLOC (Park and Kanehisa 2003) [12 categories; SVM]

LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles]

BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]

Pro:

does not require knowledge of signals

works even if N-terminus is wrong

Con:

cannot identify isoform differences

A simple statistical method: Linear regression

Observations (training data): a set of x values (input) and y values (output).

Model: y = ax + b (2 parameters, which are estimated from the training data)

Prediction: Use the model to calculate a y value for a new x value

Note: the model does not fit the observations exactly. Can we do better than this?
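A minimal sketch of this workflow using numpy's least-squares polynomial fit; the data points are invented for illustration.

import numpy as np

# Training data (invented): x values (input) and y values (output)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.3, 1.1, 2.4, 2.9, 4.2, 4.8])

# Model y = ax + b: the 2 parameters are estimated from the
# training data by least squares
a, b = np.polyfit(x, y, deg=1)

# Prediction: use the model to calculate a y value for a new x value
x_new = 6.0
print(a * x_new + b)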

Overfitting

y = ax + b
2 parameter model
Good description, poor fit

y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g
7 parameter model
Poor description, good fit

Note: It is not interesting that a model can fit its observations (training data) exactly.

To function as a prediction method, a model must be able to generalize, i.e. produce sensible output on new data.
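This can be demonstrated directly: fit both models to the same training points and compare the error on the training data with the error on held-out data (the test set method of the following slides). The data and noise level are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Invented ground truth: a noisy straight line."""
    x = rng.uniform(0, 5, n)
    return x, 2.0 * x + 1.0 + rng.normal(0, 1.0, n)

x_train, y_train = sample(7)    # few training points
x_test, y_test = sample(50)     # held-out test data

for degree in (1, 6):           # 2-parameter vs 7-parameter model
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)

# Typical outcome: the 7-parameter model has near-zero training error
# but a much larger test error than the 2-parameter model: it fits,
# but it does not generalize.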

A classification problem

How complex a model should we choose? This depends on:

The real complexity of the problem

The size of the training data set

The amount of noise in the data set

How to estimate parameters for prediction?

Model selection

Linear Regression

Quadratic Regression

Join-the-dots

The test set method

Cross Validation

Which kind of Cross Validation?

Note: Leave-one-out is also known as jack-knife
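A minimal sketch of k-fold cross-validation written out by hand with numpy so the mechanics are explicit; setting k equal to the number of examples gives leave-one-out. With scikit-learn, model_selection.KFold and LeaveOneOut produce the same splits. The data, model family and fold count are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 40)   # invented noisy data

def kfold_mse(x, y, degree, k=5):
    """Mean test-set MSE of a polynomial model over k cross-validation folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for test in folds:
        train = np.setdiff1d(idx, test)          # everything outside this fold
        coefs = np.polyfit(x[train], y[train], deg=degree)
        errors.append(np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2))
    return np.mean(errors)

# Compare model complexities; k = len(x) would give leave-one-out (jack-knife)
for degree in (1, 2, 6):
    print(degree, kfold_mse(x, y, degree))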

Problem: sequences are related


If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance


ALAKAAAAM

ALAKAAAAN

ALAKAAAAR

ALAKAAAAT

ALAKAAAAV

GMNERPILT

GILGFVFTM

TLNAWVKVV

KLNEPVLLL

AVVPFIVSV

Solution: Homology reduction

Calculate all pairwise similarities in the data set

Define a threshold for being “neighbours” (too closely related)

Calculate the number of neighbours for each example, and remove the example with the most neighbours

Repeat until no examples with neighbours are left

Alternative: Homology partitioning

keep all examples, but cluster them so that no neighbours end up in the same fold

Should be combined with weighting

The Hobohm algorithm
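Below is a sketch of the neighbour-removal procedure just described (the variant usually called Hobohm 2), assuming a user-supplied is_neighbour predicate; in practice that predicate would be an alignment-based similarity threshold like the ones discussed next. The toy rule at the bottom is invented for the 9-mer examples above.

def hobohm2(examples, is_neighbour):
    """Homology reduction: repeatedly remove the example with the most
    neighbours until no two remaining examples are neighbours.
    is_neighbour(a, b) should return True if a and b are too closely related."""
    items = list(examples)
    # Calculate all pairwise similarities and apply the threshold
    neighbours = {a: {b for b in items if b != a and is_neighbour(a, b)}
                  for a in items}
    # Remove the example with the most neighbours, repeat
    while True:
        worst = max(items, key=lambda a: len(neighbours[a]))
        if not neighbours[worst]:
            return items  # no examples with neighbours left
        items.remove(worst)
        for b in neighbours.pop(worst):
            neighbours[b].discard(worst)

# Toy usage with the 9-mers shown above: call two peptides neighbours
# if they differ at no more than one position (an invented rule).
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "GILGFVFTM", "TLNAWVKVV"]
near = lambda a, b: sum(x != y for x, y in zip(a, b)) <= 1
print(hobohm2(peptides, near))  # the ALAKAAAA* cluster collapses to one entry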

Defining a threshold for homology reduction

First approach: two sequences are too closely related if the prediction problem can be solved by alignment.

The Sander/Schneider curve: For protein structure prediction, 70% identical classification of secondary structure means prediction by alignment is possible. This corresponds to 25% identical amino acids in a local alignment > 80 positions.
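As a sketch, this first approach reduces to a simple rule; note that the published Sander/Schneider curve is length-dependent for alignments shorter than 80 positions, which this simplified version ignores.

def too_closely_related(percent_identity, alignment_length):
    """First approach (simplified): neighbours if the prediction problem
    could be solved by alignment, i.e. more than 25% identity over a
    local alignment of more than 80 positions. The length-dependent part
    of the Sander/Schneider curve for shorter alignments is omitted."""
    return alignment_length > 80 and percent_identity > 25.0

print(too_closely_related(30.0, 120))  # True: alignment-based prediction works
print(too_closely_related(30.0, 50))   # False under this simplified rule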

Defining a threshold for homology reduction

Second approach: two sequences are too closely related if their homology is statistically significant.

The Pedersen / Nielsen / Wernersson curve: Use the extreme value distribution to define the BLAST score at which the similarity is stronger than random.
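A sketch of this second approach in terms of the Karlin-Altschul formula behind BLAST statistics: the expected number of random local alignments with score at least S between sequences of lengths m and n is E = K * m * n * exp(-lambda * S), so a pair is “too closely related” when E falls below a chosen significance cut-off. The lambda, K and cut-off values below are placeholders, not calibrated constants.

import math

def blast_evalue(score, m, n, lam, K):
    """Karlin-Altschul E-value: expected number of random local
    alignments scoring >= score between sequences of lengths m and n."""
    return K * m * n * math.exp(-lam * score)

def too_closely_related(score, m, n, lam=0.27, K=0.04, cutoff=1e-3):
    """Second approach: homology is statistically significant when the
    E-value is below the cut-off. lam, K and cutoff are placeholders;
    real values depend on the scoring system and database size."""
    return blast_evalue(score, m, n, lam, K) < cutoff

print(too_closely_related(100, 350, 400))  # True: a score this high is non-random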