4 Oct 2013

Automatic methods for functional annotation of sequences

Petri Törönen

What, Why, How ???

Functional annotation of sequence (seq.)

Definition of description line

Mapping seq. to functional categories

Simple solutions are error-sensitive

Review some available tools in the exercises

Old, simple way

Do a Sequence Search (SS), like BLAST, with your sequence

Find the best match

Transfer all the info from the best match to your sequence

Everything done? Finished?
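The three steps above amount to very little code, which is part of their appeal. A minimal sketch, assuming the search results have already been parsed into (hit id, description, E-value) records; the `Hit` type, its field names and the `1e-5` cutoff are illustrative assumptions and not part of any BLAST API:

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One parsed search hit (hypothetical record layout)."""
    hit_id: str
    description: str
    evalue: float


def transfer_from_best_hit(hits: Sequence[Hit],
                           max_evalue: float = 1e-5) -> Optional[str]:
    """Return the description of the most significant hit, or None
    when no hit passes the (assumed) E-value cutoff."""
    significant = [h for h in hits if h.evalue <= max_evalue]
    if not significant:
        return None
    best = min(significant, key=lambda h: h.evalue)
    return best.description


# Made-up example hits
hits = [
    Hit("hit1", "hypothetical protein", 1e-80),
    Hit("hit2", "serine/threonine kinase", 1e-75),
]
print(transfer_from_best_hit(hits))
```

In this made-up example the best hit carries an uninformative description although the second hit is nearly as significant and far more useful, which is the kind of case where the simple approach breaks down.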

Problems


First hit is an unknown seq.

First hit is a misannotated seq.

- an increasing problem!!

No significant matches found

Strong, but only local matches => impurities in search

Impurities in query seq.


Why is manual analysis hard?

Large size of gene lists (SS result list)

False positives among observed results

Each gene can have multiple functions

- the important common theme among the genes can easily go unnoticed.

Requires detailed knowledge of genes

Varying representations for the same function in description lines

Objectivity

Gene Ontology (GO)


A controlled vocabulary of gene product roles in cells and the role associations

The roles can be applied to all organisms

Three main hierarchies (biological process, cellular component and molecular function) currently include about 19,000 classes (= roles)

- usually only a small portion of these classes is in use for any one organism (example: chloroplast-related functions are important only within plants)


www.geneontology.org

Structure of GO


GO graph:


Hierarchical structure of linked nodes

- each node represents one class that is part of its parent class

Directed Acyclic Graph (DAG)

- a tree structure where branches can also merge when going from parent nodes to child nodes.

Genes can be linked to many classes in the GO structure

[Figure: GO graph. Labels: starting node (root of hierarchical structure); less detailed classes; more detailed classes]
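The DAG idea can be illustrated with a toy graph in which one class has two parents. This is only a sketch: the class names and links below are invented for illustration and are not taken from the real GO files. Because a gene linked to a class is implicitly linked to all of its ancestors, the useful operation is collecting everything reachable by following parent links upward.

```python
from typing import Dict, List, Set

# child -> parents; a toy graph, NOT real GO content
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "transport": ["root"],
    "localization": ["root"],
    # two parents: this is where branches merge
    "protein transport": ["transport", "localization"],
}


def ancestors(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All classes reachable from `node` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("protein transport", PARENTS)))
```

The `seen` set matters in a DAG: without it, shared ancestors such as the root would be visited once per path.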

How GO helps


GO provides a terminology for presenting the known information about a gene

GO classifies genes according to their known/predicted functions

Classes represent varying levels of detail

Classifications can be used to find over-represented functions in the results



How GO helps

Look for over-represented GO classes in the gene list

Sampling without replacement answers the question:

In how many ways can 8 balls be selected from the whole data so that two of them are white and the rest are black?

We would like to ask:

What is the probability of observing, by chance, the number of class members that we have in the cluster?

The solution from statistics is sampling without replacement
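Sampling without replacement leads to the hypergeometric distribution. The sketch below counts the ways of drawing a given number of white balls and then sums the upper tail to get the probability of seeing at least that many class members by chance. The slides give only the sample size (8) and the observed count (2); the population of 20 balls with 5 white is an assumed example, and the function names are hypothetical.

```python
from math import comb


def ways(white_total: int, black_total: int,
         white_drawn: int, drawn: int) -> int:
    """Number of ways to draw `drawn` balls so that exactly
    `white_drawn` of them are white."""
    return comb(white_total, white_drawn) * comb(black_total, drawn - white_drawn)


def prob_at_least(k: int, class_size: int,
                  population: int, sample_size: int) -> float:
    """P(X >= k): probability that a random sample of `sample_size`
    genes contains at least `k` members of a class of `class_size`."""
    total = comb(population, sample_size)
    upper = min(class_size, sample_size)
    favourable = sum(
        comb(class_size, i) * comb(population - class_size, sample_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Assumed example: 20 balls, 5 white, draw 8, observe 2 white
print(ways(5, 15, 2, 8))
print(prob_at_least(2, 5, 20, 8))
```

In the gene-list setting, white balls are the genes belonging to one GO class, the draw is the cluster or result list, and a small tail probability suggests the class is over-represented.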

Methods that predict protein function

Methods that summarize the SS result list

Methods that use profile searches

Methods that use sequence features

Methods based on sequence patterns

Methods based on sequence phylogeny


SS list summarization

Consensus analysis of SS list

Do the SS

Look for repeatedly occurring descriptions/GO classes

Over-representation of GO classes (BLAST2GO)

Tools performing this:

- Our method PANNZER (Koskinen et al., unpubl.)

- BLAST2GO (http://www.blast2go.org/start_blast2go)

- ConFunc
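The consensus idea (look for classes that keep recurring in the result list) can be sketched as a simple tally. This is an illustration only and not the scoring used by PANNZER, BLAST2GO or ConFunc; the function name, the 50 % threshold and the example annotations are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def consensus_classes(hit_classes: Iterable[Set[str]],
                      min_fraction: float = 0.5) -> List[Tuple[str, float]]:
    """Return the classes seen in at least `min_fraction` of the hits,
    with the fraction of hits supporting each, most frequent first."""
    hit_list = list(hit_classes)
    counts = Counter(c for classes in hit_list for c in classes)
    n = len(hit_list)
    kept = [(c, count / n) for c, count in counts.items()
            if count / n >= min_fraction]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up annotations for four hits
hits = [
    {"kinase activity", "ATP binding"},
    {"kinase activity"},
    {"kinase activity", "membrane"},
    {"DNA binding"},
]
print(consensus_classes(hits))
```

A plain tally treats every hit equally; weighting hits by alignment score or testing for over-representation against the database background are natural refinements.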


Profile search methods

Use profile searches instead of SS

Some positions in the seq. are more conserved than others

PFAM http://pfam.sanger.ac.uk/

ConFunc http://www.sbg.bio.ic.ac.uk/~confunc/
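Why a profile captures conserved positions can be seen from a very small position-specific scoring matrix. This is a sketch under strong assumptions: ungapped sequences of equal length, a uniform background and a flat pseudocount. Real tools such as Pfam use profile hidden Markov models rather than this simple log-odds table, and the function names here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str],
                  pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2 odds of each residue versus a uniform background."""
    background = 1.0 / len(ALPHABET)
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile: List[Dict[str, float]], query: str) -> float:
    """Sum of the column scores for the residues of `query`."""
    return sum(column[aa] for column, aa in zip(profile, query))


# Toy alignment: columns 1 and 2 are fully conserved
profile = build_profile(["ACDK", "ACEK", "ACDR"])
print(score(profile, "ACDK") > score(profile, "WWWW"))
```

Conserved columns contribute large positive scores for the conserved residue and negative scores for everything else, so a match at those positions counts for more than a match at a variable position.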

ConFunc in detail

BLAST search with query seq.

Obtain a result list

Seqs in the result list are clustered into groups of seqs with similar function (same GO classes)

Each cluster is used as a seed for a profile search

Test how well the query seq. matches each profile

Use link: http://www.sbg.bio.ic.ac.uk/confunc/indextemp.cgi
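The clustering step can be pictured as grouping the result-list sequences by the GO classes they carry. The slides do not say how ConFunc forms its clusters, so the sketch below simply makes one group per GO class; the function name and the `GO:X`/`GO:Y` labels are placeholders, not real identifiers.

```python
from collections import defaultdict
from typing import Dict, List, Set


def cluster_by_go(hit_annotations: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each GO class to the hit sequences annotated with it."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for seq_id, go_classes in hit_annotations.items():
        for go_class in go_classes:
            clusters[go_class].append(seq_id)
    return dict(clusters)


# Placeholder sequence ids and class labels
annotations = {
    "s1": {"GO:X", "GO:Y"},
    "s2": {"GO:X"},
    "s3": {"GO:Y"},
}
clusters = cluster_by_go(annotations)
print(clusters["GO:X"], clusters["GO:Y"])
```

Each resulting group would then be aligned and turned into a profile, and the query scored against every profile in turn.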

Sequence feature methods

Look for sequence features

Features: secondary structure, protein domains

Compare sequences by looking at which features they have in common

Methods that do this: FACT

http://www.cibiv.at/FACT/

Limited search possibilities with FACT
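Comparing sequences by their shared features can be sketched with a set-overlap measure. The Jaccard index used here is a stand-in chosen for simplicity; the slides do not describe FACT's actual scoring, and the feature names are made up.

```python
from typing import Set


def feature_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: shared features divided by all features seen."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up feature sets for two sequences
query_features = {"kinase domain", "SH2 domain", "helix"}
hit_features = {"kinase domain", "helix", "signal peptide"}
print(feature_similarity(query_features, hit_features))
```

Two sequences can score well here without any detectable sequence similarity, which is the point of feature-based comparison.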

Sequence pattern methods

Pattern => frequently observed short motif from seq. DB

InterProScan

BioDictionary from IBM Computational Biology (http://cbcsrv.watson.ibm.com/Tpa.html)

Extraction of most of the patterns from Swiss-Prot

Linking of each pattern to keywords seen in the seqs where the pattern was found

Query seq. is linked to keywords via the patterns it has
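The pattern-to-keyword link can be sketched with regular expressions standing in for the mined motifs. Both the patterns and the keywords below are invented for illustration; they are not entries from BioDictionary or InterProScan, and the function name is hypothetical.

```python
import re
from typing import Dict, Set

# pattern -> keywords seen in the sequences where the pattern occurs
# (hypothetical entries, for illustration only)
PATTERN_KEYWORDS: Dict[str, Set[str]] = {
    "C..C": {"zinc binding"},
    "G.G..G": {"nucleotide binding"},
}


def keywords_for(query: str,
                 pattern_keywords: Dict[str, Set[str]]) -> Set[str]:
    """Collect the keywords of every pattern that occurs in `query`."""
    found: Set[str] = set()
    for pattern, keywords in pattern_keywords.items():
        if re.search(pattern, query):
            found |= keywords
    return found


print(sorted(keywords_for("MKCAACLLGAGKSGT", PATTERN_KEYWORDS)))
```

Short patterns match by chance quite often, so a real system needs some measure of how strongly each pattern is associated with its keywords.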


Phylogeny-based methods

In short: include the species tree in the annotation of the sequences.

Evolutionary distance is taken into account

Compara from ENSEMBL

http://www.ebi.ac.uk/GOA/compara_go_annotations.html

Tip for testing the tools

For testing with a purely random sequence:

http://www.bioinformatics.org/sms2/random_protein.html

For testing with a partially random sequence:

http://www.bioinformatics.org/sms2/mutate_protein.html
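Test sequences of both kinds can also be generated locally. The sketch below is a stand-in for the linked web tools and does not reproduce how they work; the function names, the uniform residue frequencies and the fixed seed are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length: int, rng: random.Random) -> str:
    """Purely random sequence with uniform residue frequencies."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate_protein(seq: str, fraction: float, rng: random.Random) -> str:
    """Replace a given fraction of positions with a different residue."""
    n_mutations = round(len(seq) * fraction)
    residues = list(seq)
    for pos in rng.sample(range(len(seq)), n_mutations):
        alternatives = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(alternatives)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the test sequences are reproducible
original = random_protein(50, rng)
mutated = mutate_protein(original, 0.2, rng)
print(sum(a != b for a, b in zip(original, mutated)))
```

A purely random sequence should get no confident annotation from a well-behaved tool, while increasing the mutated fraction of a known protein shows how quickly the predictions degrade.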