Application of Unstructured Learning in Computational Biology

powerfultennessee, Biotechnology

2 Oct 2013


Bioinformatics


The application of computer science to biological data



Tony C Smith

Department of Computer Science

University of Waikato

tcs@cs.waikato.ac.nz


The essence is prediction …





My dog is very littl_ ?



We know that letters do not occur in English at random (e.g. ‘t’ is more common than ‘x’)

We know that context changes the probability of a letter (e.g. ‘x’ is more common than ‘t’ after the sequence “I eat Weet-Bi_”)

Predicting symbols is fundamental to a wide range of important applications (e.g. encryption, compression)
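
As a rough illustration of how context changes a prediction, the sketch below counts which letter follows each four-letter context in a tiny text and then guesses the most frequent follower. The function names, the toy corpus, and the choice of four letters of context are all assumptions made for this example; a real predictor would be trained on far more text.

```python
from collections import Counter, defaultdict

def train(text, order):
    """Count which letter follows each context of `order` letters."""
    model = defaultdict(Counter)
    for i in range(order, len(text)):
        model[text[i - order:i]][text[i]] += 1
    return model

def predict(model, context, order):
    """Return the most frequent letter seen after the last `order` letters, or None."""
    counts = model.get(context[-order:])
    return counts.most_common(1)[0][0] if counts else None

# Toy training text, made up for illustration.
corpus = "my dog is very little. my cat is very little. i eat weet-bix."
model = train(corpus, order=4)

print(predict(model, "my dog is very littl", order=4))  # e
print(predict(model, "i eat weet-bi", order=4))         # x
```

A context the model has never seen gives no prediction at all, which is one reason practical systems fall back to shorter contexts.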


Prediction in bioinformatics

Predicting the location of genes in DNA

Predicting gene roles in an organism

Predicting errors in a genetic transcription

Predicting the function of proteins

Predicting diseases from molecular samples

Anything that involves “making a judgment”: a yes/no decision about whether some sample datum ‘does’ or ‘does not’ have some property.


Representation






W e e t - B i x

01010111 01100101 01100101 01110100 00101101




… to the computer, everything is binary!
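
A minimal sketch of that representation, assuming plain 8-bit ASCII codes (the helper name `to_bits` is made up for this example):

```python
def to_bits(text):
    """Return the 8-bit ASCII code of each character as a string of 0s and 1s."""
    return [format(ord(ch), "08b") for ch in text]

print(to_bits("Weet-"))
# ['01010111', '01100101', '01100101', '01110100', '00101101']
```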


01010111 01100101 01100101 01110100 00101101

01 01 10 11 00 10 01 11 11 10 11 01 00 11 01 00 00 10 11 01

A A C G T C A T T C G A T G A T T C G A



Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, like a genetic sequence
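
One way to see that the two kinds of sequence are interchangeable to a computer: four nucleotides need exactly two bits each, so any bit string can be rewritten as a string over A, C, G and T and back again. The sketch below assumes ASCII text and an arbitrary assignment of bit pairs to bases; that assignment is an assumption for this example and is not necessarily the one used on the slide.

```python
# Assumed mapping: any one-to-one assignment of the four bit pairs to bases would do.
PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}

def text_to_dna(text):
    """Write the 8-bit code of each character as four nucleotides."""
    bits = "".join(format(ord(ch), "08b") for ch in text)
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_text(dna):
    """Invert text_to_dna: four nucleotides back to one character."""
    bits = "".join(BASE_TO_PAIR[base] for base in dna)
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

print(text_to_dna("Weet-"))                 # CCCTCGCCCGCCCTCAAGTC
print(dna_to_text(text_to_dna("Weet-")))    # Weet-
```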



A genetic prediction problem


ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagc


A genetic prediction problem


ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcg
cctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgc
agctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagctgc
aatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagct
acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcg
gcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattca
cgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcat
cagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgct
tcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatgga
cgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcaga
cgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaa
aatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgac
atcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttat
tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctttt
actacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacga
cgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatatt
cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactac
gacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacg
actacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgttgcgcacccacaccagttatatagagacgaactc


A genetic prediction problem


[The same sequence, repeated many times over to fill the slide]


A genetic prediction problem





A gene encodes a protein




It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism


A genetic prediction problem




[Figure labels: untranslated region, encoding region, transcription factor, RNA]




A genetic prediction problem




untranslated region

ttgcaatcggcgctacgcttcaaaatttattatattcccggc


A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

What transcription factors bind to this gene?


Where is the transcription factor binding site?


A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues:

A binding site is often a short general pattern

E.g. CCGATNATCGG
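
In a pattern like this, N conventionally stands for "any nucleotide", which makes the search a small regular-expression problem. The sketch below is one way to do it, not a standard tool: the function names are made up, and the test sequence is a hypothetical one constructed so that it contains a match.

```python
import re

def motif_to_regex(motif):
    """Turn a motif into a regular expression; 'N' stands for any nucleotide."""
    return "".join("[ACGT]" if ch == "N" else ch for ch in motif.upper())

def find_sites(sequence, motif):
    """Return the 0-based start of every match, including overlapping ones."""
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical sequence made up for illustration; it is not from the slides.
seq = "ttaccgatcatcggaatt"
print(find_sites(seq, "CCGATNATCGG"))  # [3]
```

The lookahead `(?=...)` is there so that overlapping occurrences are all reported rather than only the first of each overlapping group.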



A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues:

The patterns are often reverse complements


E.g. CCGATNATCGG
     GGCTANTAGCC
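
The second line of the example is the complementary strand written base for base under the first; read backwards, it spells the first pattern again. A short sketch of that check follows (the function names are chosen for this example):

```python
# A pairs with T and C pairs with G; N (any base) is left unchanged.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def complement(seq):
    """Return the base-for-base complement of a sequence."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]

print(complement("CCGATNATCGG"))          # GGCTANTAGCC
print(reverse_complement("CCGATNATCGG"))  # CCGATNATCGG
```

A pattern that equals its own reverse complement reads the same on both strands of the DNA, so a search only needs to scan one strand to find it.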


A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues:

Where there is one binding site, often there is another nearby.
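
This clue can be turned into a simple filter: given the positions of candidate sites, keep the pairs that lie close together. The sketch below assumes the positions have already been found; the positions and the 50-base threshold are hypothetical values chosen only to show the idea.

```python
def nearby_pairs(positions, max_gap):
    """Return pairs of site positions that lie within max_gap bases of each other."""
    positions = sorted(positions)
    pairs = []
    for i, p in enumerate(positions):
        for q in positions[i + 1:]:
            if q - p > max_gap:
                break  # positions are sorted, so later ones are even farther away
            pairs.append((p, q))
    return pairs

# Hypothetical candidate positions and threshold.
print(nearby_pairs([12, 40, 47, 300], max_gap=50))  # [(12, 40), (12, 47), (40, 47)]
```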


A genetic prediction problem



These are exactly the kinds of properties that computer science has developed algorithms and data structures to identify quickly and efficiently, so this is just the kind of problem computer scientists should be able to solve.


proteomics

Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.

A string of amino acids makes a protein.

3 nucleotides, 4 possibilities each: 4^3 = 64 possible codons

But there are only 20 amino acids!
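
The count is easy to confirm by listing every three-letter combination of the four nucleotides:

```python
from itertools import product

# Every ordered triple drawn from the four nucleotides.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```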



proteomics

Glycine: GGA, GGC, GGG, GGT

Tyrosine: TAT, TAC

Methionine: ATG

There is quite a bit of redundancy in codons.
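
The redundancy can be seen by grouping the 64 codons by the amino acid they encode. The sketch below builds the standard genetic code from its usual compact listing (one letter per codon in T, C, A, G order, with `*` for a stop codon); the variable and function names are made up for this example, and organisms that use a variant code are not covered.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# Group codons by the amino acid (one-letter code) they encode.
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(sorted(codons_for["G"]))  # ['GGA', 'GGC', 'GGG', 'GGT']  (glycine)
print(sorted(codons_for["Y"]))  # ['TAC', 'TAT']                (tyrosine)
print(codons_for["M"])          # ['ATG']                       (methionine)
print(translate("ATGGGATAT"))   # MGY
```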


Amino Acid

[Figure labels: amine group, carboxyl group, R group]


Amino Acid

glycine

tyrosine


Artificial Intelligence

Computers do things only human brains can otherwise do

expert

expert


Artificial Intelligence

Computers do things only human brains can otherwise do

expert

system

expert


Artificial Intelligence

Computers do things only human brains can otherwise do

learning system

expert system


Machine learning

What is machine learning?

creating computer programs that get better with experience

learning how to make expert judgments

discovering previously hidden, potentially useful information (data mining)

How does it work?

the user provides the learning system with examples of the concept to be learned

an induction algorithm infers a characteristic model of the examples

the model is used to predict whether or not future novel instances are also examples

… and it does this very consistently, and very, very quickly!
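
The three steps can be sketched with a deliberately simple learner. Here the "model" is just the set of letters seen at each position of the examples, and a new instance counts as an example if every one of its letters was seen at that position. This is a toy stand-in for a real induction algorithm, not how any particular tool works; the function names and the training sequences are hypothetical, and the examples are assumed to be of equal length.

```python
def induce(examples):
    """Infer a model: the set of letters seen at each position of the examples."""
    return [set(column) for column in zip(*examples)]

def predict(model, instance):
    """Say whether a novel instance fits the model."""
    return len(instance) == len(model) and all(
        ch in allowed for ch, allowed in zip(instance, model)
    )

# Hypothetical examples of the concept to be learned, made up for illustration.
examples = ["CCGATAATCGG", "CCGATCATCGG", "CCGATGATCGG"]
model = induce(examples)

print(predict(model, "CCGATCATCGG"))  # True
print(predict(model, "CCGATTATCGG"))  # False: 'T' was never seen in the middle position
print(predict(model, "TTGATAATCGG"))  # False
```

The second prediction shows the limits of so small a training set: the learner can only generalise as far as its examples allow.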


Biotechnology

Biologists know proteins, computer scientists know machine learning

Together, they can find out a lot of hidden information about genes and proteins

Biotechnology is a multi-billion dollar industry

Biotechnology is one of the best funded areas of scientific research


The University of Waikato

Waikato University is the centre of the universe for machine learning

The Machine Learning Group is a large, globally active, well-funded research group

The WEKA workbench of ML tools is used around the world

Professors at Waikato University literally wrote the book on sequence modeling



The University of Waikato


If you’re seriously interested in machine learning, in getting involved in bioinformatics research, or indeed any other area along the leading edge of computer science, then university is the only place to be, and


Waikato wants You!