bioinformatics talk (Y10 v2) - Department of Computer Science ...

2 Oct 2013


Bioinformatics


The Prediction of Life



Tony C Smith

Department of Computer Science

University of Waikato

tcs@cs.waikato.ac.nz

Bioinformatics








Tony C Smith

Bioinformatics


Computation with biological data


Data: genes, proteins, microarrays, mass spectra, written documents, populations of organisms …

Goal: knowledge discovery



The essence is prediction …

My dog is very littl_ ?



We know that letters do not occur in English at random;
not all letters are equally common (e.g. ‘e’ is more
common than ‘x’)



We know that context changes the probability of a letter (e.g. what’s the most likely letter after the sequence “I eat Weet-Bi_”?)



Prediction is important in many applications (e.g. encryption, compression, communication, graphics, simulation … and bioinformatics!)
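The letter-guessing idea above can be sketched in a few lines of Python. This is only an illustration, not a method named in the talk: it counts which letter follows each short context in some training text and guesses the most frequent one. The toy corpus, the function names and the context length of 4 are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(text, order=4):
    """Count which letter follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts

def predict(counts, text, order=4):
    """Guess the next letter from the last `order` characters, or None if unseen."""
    followers = counts.get(text[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training text; a real model would be trained on far more prose.
corpus = "my dog is very little. my cat is very little. a little bird."
model = train(corpus, order=4)
guess = predict(model, "My dog is very littl", order=4)
unseen = predict(model, "zzzz", order=4)
```

With this toy corpus the context "ittl" has only ever been followed by "e", so that is the guess; a context the model has never seen yields no prediction at all.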


Prediction in bioinformatics

Predicting the location of genes in DNA

Predicting the function of proteins

Predicting diseases from molecular samples

Predicting population dynamics

Anything that involves “making a judgment”;
typically expressible as a yes/no decision about
some sample datum


Representation






W e e t - B i x



01010111
01100101
01100101
01110100
00101101




… to the computer, everything is binary!
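As a small check of the bit patterns on this slide, the following Python sketch converts the first five characters, "Weet-", to 8-bit ASCII codes. The function name is made up for the example.

```python
def to_bits(text):
    """Return the 8-bit ASCII code of each character as a string of 0s and 1s."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

bits = to_bits("Weet-")
for char, pattern in zip("Weet-", bits):
    print(char, pattern)
```

The five patterns printed are the five rows of binary shown on the slide.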


01010111
01100101
01100101
01110100
00101101


01 01 10 11 00 10 01 11 11 10 11 01 00 11 01 00 00 10 11 01

A A C G T C A T T C G A T G A T T C G A


Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, like a genetic sequence
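Because DNA uses only four letters, each nucleotide fits in two bits. The sketch below shows one way to do that in Python. The particular mapping (A=00, C=01, G=10, T=11) is an assumption chosen for illustration; the slide does not specify one.

```python
# Assumed mapping, chosen only for illustration.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}

def encode(seq):
    """Turn a DNA string into a string of bits, two bits per nucleotide."""
    return "".join(CODE[base] for base in seq.upper())

def decode(bits):
    """Turn a bit string back into DNA letters, reading two bits at a time."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

packed = encode("AACGTCAT")
restored = decode(packed)
```

Eight nucleotides become sixteen bits, and decoding gives back the original letters.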



A genetic prediction problem


ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagc


A genetic prediction problem


ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcg
cctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgc
agctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagctgc
aatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagct
acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcg
gcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattca
cgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcat
cagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgct
tcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatgga
cgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcaga
cgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaa
aatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgac
atcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttat
tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctttt
actacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacga
cgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatatt
cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactac
gacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacg
actacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgttgcgcacccacaccagttatatagagacgaactc


A genetic prediction problem


ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaa
tgg
acgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccag
tta
tatagagacgaactcgcatcagct
gcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatg
gac
gacatcttttactacgacggcgcctacgcatcgc
agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagt
tat
atagagacgaactcgcatcagtgc
aatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatgga
cga
catcttttactacgacggcgcctacgcatcgcag
catacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagtta
tat
agagacgaactcgcatcagtgcaat
cggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacga
cat
cttttactacgacggcgcctacgcatcgcagcat
acgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatat
aga
gacgaactcgcatcagtgcaatcg
gcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgaca
tct
tttactacgacggcgcctacgcatcgcagcatac
gacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatag
aga
cgaactcgcatcagtgcaatcggc
gctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
ttt
tactacgacggcgcctacgcatcgcagcatacga
cgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagag
acg
aactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttt
tac
tacgacggcgcctacgcatcgcagcatacgacg
cccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagac
gaa
ctcgcatcagtgcaatcggcgctac
gcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctttta
cta
cgacggcgcctacgcatcgcagcatacgacgcc
cagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacga
act
cgcatcagtgttgcgcacccacacc
agttatatagagacgaactcttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttc
gtc
gcctttattcacgctaatggacgacatcttttacta
cgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacga
tgt
tgcgcacccacaccagttatatag
agacgaactcgcatcagctgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtc
gcc
tttattcacgctaatggacgacatcttttactac
gacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgat
gtt
gcgcacccacaccagttatataga
gacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgc
ctt
tattcacgctaatggacgacatcttttactacga
cggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgt
tgc
gcacccacaccagttatatagaga
cgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
tta
ttcacgctaatggacgacatcttttactacgacg
gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttg
cgc
acccacaccagttatatagagacg
aactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
att
cacgctaatggacgacatcttttactacgacggc
gcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcg
cac
ccacaccagttatatagagacgaa
ctcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttat
tca
cgctaatggacgacatcttttactacgacggcgc
ctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgca
ccc
acaccagttatatagagacgaact
cgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattc
acg
ctaatggacgacatcttttactacgacggcgcct
acgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacc
cac
accagttatatagagacgaactcg
catcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcac
gct
aatggacgacatcttttactacgacggcgcctac
gcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcaccca
cac
cagttatatagagacgaactcgca
tcagtgttgcgcacccacaccagttatatagagacgaactc


A genetic prediction problem





A gene encodes a protein




It is a blueprint that provides biochemical
instructions on how to construct a sequence of
amino acids so as to make a working protein
that will perform some function in the organism


A genetic prediction problem




[Diagram labels: encoding region, untranslated region, transcription factor, RNA]


A genetic prediction problem




untranslated region

ttgcaatcggcgctacgcttcaaaatttattatattcccggc


A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

What transcription factors bind to this gene?


Where is the transcription factor binding site?


A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues:

A binding site is often a short general pattern

E.g. CCGATNATCGG



A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues:

The patterns are often reverse complements


E.g.

CCGATNATCGG


GGCTANTAGCC
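The second line of the example is the complementary strand of the first (A pairs with T, C with G). Read backwards, it spells the first line again, which is what "reverse complement" means here. A short Python sketch, with made-up function names:

```python
# A<->T and C<->G; N (any base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]

site = "CCGATNATCGG"
other_strand = complement(site)
is_own_reverse_complement = reverse_complement(site) == site
```

For this example the pattern is its own reverse complement, so the same pattern appears on both strands.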


A genetic prediction problem



ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues:

Where there is one binding site, often there is
another nearby.
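Once candidate sites have been located, the "another nearby" clue becomes a simple distance check. The sketch below is one possible reading of that clue, not the talk's own method; the positions and the 30-base limit are invented for illustration.

```python
def nearby_pairs(positions, max_gap):
    """Return neighbouring candidate sites that lie within `max_gap` bases."""
    ordered = sorted(positions)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= max_gap]

# Hypothetical start positions of candidate binding sites.
pairs = nearby_pairs([120, 15, 40, 300], max_gap=30)
```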


A genetic prediction problem



All of these properties are the kinds of things for
which computer science has developed
algorithms and data structures to identify quickly
and efficiently, and therefore it is exactly the
kind of problem computer scientists should be
able to solve.


proteomics

Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.

A string of amino acids makes a protein.

3 nucleotides, 4 possibilities for each, so 4^3 = 64 possible codons



But there are only 20 amino acids!
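The count of 64 is easy to confirm by listing every possible triple of the four nucleotides, as in this small Python sketch:

```python
from itertools import product

# Every ordered choice of 3 letters from A, C, G, T.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]
number_of_codons = len(codons)
```

Since 64 codons map onto only 20 amino acids, several codons must share an amino acid.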



proteomics

Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG

There is quite a bit of redundancy in codons.
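The redundancy can be shown with a lookup table. The Python sketch below contains only the seven codons listed on this slide, not the full genetic code, and the example DNA string is made up.

```python
# Partial codon table: only the codons named on the slide.
CODON_TABLE = {
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
    "TAT": "Tyr", "TAC": "Tyr",
    "ATG": "Met",
}

def translate(dna):
    """Split DNA into codons and look each one up; '?' marks codons not in the table."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore any incomplete codon at the end
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE.get(codon, "?") for codon in codons]

# Hypothetical coding sequence built from the codons above.
peptide = translate("ATGGGATACGGT")
```

Here GGA and GGT are different codons, yet both come out as glycine.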


Amino Acid

[Diagram labels: amino group, carboxyl group, R group]


Amino Acid

glycine

tyrosine


Primary structure:

MSALVSTTPSLLAGVRNVDB …..


Tertiary Structure


Secondary Structure


Signal peptide

A relatively short sequence of amino acid residues at the N-terminus of the nascent protein

typically 15-50 residues




MAGPRPSPWARLLLAALISVSLSGTLA
RCKKAPVSKKCETCVGQAALTGL …

Cleaved off as protein passes through membrane (operates like a pass key)

Knowing signal peptide helps determine protein function in the organism


How do we do it?


see any patterns?

ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaa
tgg
acgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccag
tta
tatagagacgaactcgcatcagct
gcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaatttcgcctttattcacgctaatggacgacatcttttactacgacggcgc
cta
cgcatcgcagcatacgacgcccacgcccagc
atagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactc
gca
tcagtgcaatcggcgctacgcttcaa
aatttattatagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttata
tag
agacgaactcgcatcagtgcaatcggc
gctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
ttt
tactacgacggcgcctacgcatcgcagcatacga
cgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagag
acg
aactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtaacgcatcagactctcgtcgcgttcgcgcgttcgtcgcctt
tat
tcacgctaatggacgacatcttttactacgacggc
gcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcg
cac
ccacaccagttatatagagacgaa
ctcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttat
tca
cgctaatggacgacatcttttactacgacggcgc
ctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgctac
gct
tcaaaatttattatattcccggcggcaa
tcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacg
aca
tcttttactacgacggcgcctacgcatcgcagca
tacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttata
tag
agacgaactcgcatcagtgcaatcg
gcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgaca
tct
tttactacgacggcgcctacgcatcgcagcatac
gacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatag
aga
cgaactcgcatcagtgcaatcggc
gctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
ttt
tactacgacggcgcctacgcatcgcagcatacga
cgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagag
acg
aactcgcatcagtgttgcgcaccca
caccagttatatagagacgaactcttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgc
gtt
cgtcgcctttatttattatattcccggcgcggcta
cgttcatcccagcattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcg
cat
cagacgcatacgacgacgactacgacg
acactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagctgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagc
gat
tttaaaattaacgcatcagactctcgtcgc
gttcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcg
cat
cagacgcatacgacgacgactacgacga
cactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcaggacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
gga
catcatcatatcgcagctacagcgcat
cagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagatgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcc
cag
cagcagcgattttaaaattaacgcatca
gactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
gca
gctactcatatcgcagctacagcgcatcaga
cgcatacgacgacgaagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgctt
caa
aatttattatattcccggcgcggct
acgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacga
cgc
ccagcatagtattttagaggcgaggacatca
tcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgctt
caa
aatttattatattcccggcgcggctac
gttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacg
ccc
agcatagtattttagaggcgaggacatcatc
atatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttca
aaa
gcagcgattttaaaattaacgcatc
agactctcgtcgcgttcgtcgcctttattcacgctaatggacgacgaactcgcatcagtgcaatcggccggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcg
cgt
tcgtcgcctttattcacgctaatggacgacatc
ttttactacgacggcgcctacgcatcgcagcatacgattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgc
taa
tggacgacatcttttactacgacggcgcctac
gcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcaccca
cac
cagttatatagagacgaactcgca
tcagtgttgcgcacccacaccagttatatagagacgaactcttagaggcgaggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatatcgcagctacagcgcatc
agt
tagaggcgaggacatcatcatatcgc


Local biases in residues around the cleavage site

Sequence regularities can be exploited by statistical and pattern-based models


Proteomic prediction

Language:

• letters combine to form words
• words combine to form phrases
• phrases combine to form sentences
• sentences combine to form paragraphs (and ultimately Harry Potter books)

Proteins:

• amino acids combine to form peptides
• peptides combine to form secondary motifs (e.g. α-helices and β-sheets)
• motifs combine to make proteins
• proteins combine to make toenails (and ultimately people)


Approach

Problem is stated as two-class: an amino acid is either the first residue of the mature protein or it is not

Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (desirable)
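The two-class framing means every residue in a protein gets a yes/no label. A minimal Python sketch of that labelling follows; the short sequence and the cleavage position are hypothetical, chosen only to show the shape of the data.

```python
def label_residues(sequence, first_mature_index):
    """Pair each residue with 1 if it is the first residue of the mature protein, else 0."""
    return [(residue, 1 if i == first_mature_index else 0)
            for i, residue in enumerate(sequence)]

# Hypothetical fragment; index 8 (the R) is assumed to start the mature protein.
labels = label_residues("LLFATCIARHQ", first_mature_index=8)
positives = [residue for residue, label in labels if label == 1]
```

Exactly one residue per protein is a positive example, so negative examples greatly outnumber positive ones.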


Properties of amino acids


Residue as a document

E.g. Cysteine (Cys, C)

aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes],

etc. (whatever else experimenter wants to include)
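In code, such a description is naturally a table of property names and values. The Python sketch below stores a few of the cysteine facts listed above in a dictionary and writes them out as text; the function name and the choice of which properties to include are assumptions for the example.

```python
# A subset of the cysteine properties listed on the slide.
CYSTEINE = {
    "aliphatic": "yes",
    "aromatic": "no",
    "hydrophobic": "yes",
    "contains sulphur": "yes",
    "valence": "2",
}

def as_document(properties):
    """Write the properties as 'name [value]' items separated by commas."""
    return ", ".join(f"{name} [{value}]" for name, value in properties.items())

document = as_document(CYSTEINE)
```

Adding another fact about the residue is just one more entry in the dictionary.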



Sample document



PRNUM:1. AANUM:21.

AMINO[-8]:L. ALIPH[-8]:-. AROMA[-8]:-. CBETA[-8]:-. CHARG[-8]:-. COVAL[-8]:-. HBOND[-8]:-. HPHOB[-8]:+. IONIZ[-8]:-. NITRO[-8]:1. POLAR[-8]:-. POSNG[-8]:0. SMALL[-8]:-. SULPH[-8]:-. TEENY[-8]:-. CRING[-8]:-. VALEN[-8]:2.
AMINO[-7]:L. ALIPH[-7]:-. AROMA[-7]:-. CBETA[-7]:-. CHARG[-7]:-. COVAL[-7]:-. HBOND[-7]:-. HPHOB[-7]:+. IONIZ[-7]:-. NITRO[-7]:1. POLAR[-7]:-. POSNG[-7]:0. SMALL[-7]:-. SULPH[-7]:-. TEENY[-7]:-. CRING[-7]:-. VALEN[-7]:2.
AMINO[-6]:F. ALIPH[-6]:+. AROMA[-6]:+. CBETA[-6]:-. CHARG[-6]:-. COVAL[-6]:-. HBOND[-6]:-. HPHOB[-6]:+. IONIZ[-6]:-. NITRO[-6]:1. POLAR[-6]:-. POSNG[-6]:0. SMALL[-6]:-. SULPH[-6]:-. TEENY[-6]:-. CRING[-6]:+. VALEN[-6]:2.
AMINO[-5]:A. ALIPH[-5]:-. AROMA[-5]:-. CBETA[-5]:-. CHARG[-5]:-. COVAL[-5]:-. HBOND[-5]:-. HPHOB[-5]:-. IONIZ[-5]:-. NITRO[-5]:1. POLAR[-5]:-. POSNG[-5]:0. SMALL[-5]:+. SULPH[-5]:-. TEENY[-5]:+. CRING[-5]:-. VALEN[-5]:2.
AMINO[-4]:T. ALIPH[-4]:+. AROMA[-4]:-. CBETA[-4]:+. CHARG[-4]:-. COVAL[-4]:-. HBOND[-4]:+. HPHOB[-4]:-. IONIZ[-4]:-. NITRO[-4]:1. POLAR[-4]:+. POSNG[-4]:0. SMALL[-4]:+. SULPH[-4]:-. TEENY[-4]:-. CRING[-4]:-. VALEN[-4]:2.
AMINO[-3]:C. ALIPH[-3]:+. AROMA[-3]:-. CBETA[-3]:-. CHARG[-3]:-. COVAL[-3]:+. HBOND[-3]:+. HPHOB[-3]:+. IONIZ[-3]:+. NITRO[-3]:1. POLAR[-3]:+. POSNG[-3]:-. SMALL[-3]:-. SULPH[-3]:+. TEENY[-3]:-. CRING[-3]:-. VALEN[-3]:2.
AMINO[-2]:I. ALIPH[-2]:-. AROMA[-2]:-. CBETA[-2]:+. CHARG[-2]:-. COVAL[-2]:-. HBOND[-2]:-. HPHOB[-2]:+. IONIZ[-2]:-. NITRO[-2]:1. POLAR[-2]:-. POSNG[-2]:0. SMALL[-2]:-. SULPH[-2]:-. TEENY[-2]:-. CRING[-2]:-. VALEN[-2]:2.
AMINO[-1]:A. ALIPH[-1]:-. AROMA[-1]:-. CBETA[-1]:-. CHARG[-1]:-. COVAL[-1]:-. HBOND[-1]:-. HPHOB[-1]:-. IONIZ[-1]:-. NITRO[-1]:1. POLAR[-1]:-. POSNG[-1]:0. SMALL[-1]:+. SULPH[-1]:-. TEENY[-1]:+. CRING[-1]:-. VALEN[-1]:2.
AMINO[0]:R. ALIPH[0]:+. AROMA[0]:-. CBETA[0]:-. CHARG[0]:+. COVAL[0]:-. HBOND[0]:+. HPHOB[0]:-. IONIZ[0]:+. NITRO[0]:4. POLAR[0]:+. POSNG[0]:+. SMALL[0]:-. SULPH[0]:-. TEENY[0]:-. CRING[0]:-. VALEN[0]:3.
AMINO[1]:H. ALIPH[1]:+. AROMA[1]:+. CBETA[1]:-. CHARG[1]:+. COVAL[1]:-. HBOND[1]:+. HPHOB[1]:-. IONIZ[1]:+. NITRO[1]:3. POLAR[1]:+. POSNG[1]:+. SMALL[1]:-. SULPH[1]:-. TEENY[1]:-. CRING[1]:+. VALEN[1]:3.
AMINO[2]:Q. ALIPH[2]:+. AROMA[2]:-. CBETA[2]:-. CHARG[2]:-. COVAL[2]:-. HBOND[2]:+. HPHOB[2]:-. IONIZ[2]:-. NITRO[2]:2. POLAR[2]:+. POSNG[2]:0. SMALL[2]:-. SULPH[2]:-. TEENY[2]:-. CRING[2]:-. VALEN[2]:2.
AMINO[3]:Q. ALIPH[3]:+. AROMA[3]:-. CBETA[3]:-. CHARG[3]:-. COVAL[3]:-. HBOND[3]:+. HPHOB[3]:-. IONIZ[3]:-. NITRO[3]:2. POLAR[3]:+. POSNG[3]:0. SMALL[3]:-. SULPH[3]:-. TEENY[3]:-. CRING[3]:-. VALEN[3]:2.
AMINO[4]:R. ALIPH[4]:+. AROMA[4]:-. CBETA[4]:-. CHARG[4]:+. COVAL[4]:-. HBOND[4]:+. HPHOB[4]:-. IONIZ[4]:+. NITRO[4]:4. POLAR[4]:+. POSNG[4]:+. SMALL[4]:-. SULPH[4]:-. TEENY[4]:-. CRING[4]:-. VALEN[4]:3.
AMINO[5]:Q. ALIPH[5]:+. AROMA[5]:-. CBETA[5]:-. CHARG[5]:-. COVAL[5]:-. HBOND[5]:+. HPHOB[5]:-. IONIZ[5]:-. NITRO[5]:2. POLAR[5]:+. POSNG[5]:0. SMALL[5]:-. SULPH[5]:-. TEENY[5]:-. CRING[5]:-. VALEN[5]:2.
AMINO[6]:Q. ALIPH[6]:+. AROMA[6]:-. CBETA[6]:-. CHARG[6]:-. COVAL[6]:-. HBOND[6]:+. HPHOB[6]:-. IONIZ[6]:-. NITRO[6]:2. POLAR[6]:+. POSNG[6]:0. SMALL[6]:-. SULPH[6]:-. TEENY[6]:-. CRING[6]:-. VALEN[6]:2.
AMINO[7]:Q. ALIPH[7]:+. AROMA[7]:-. CBETA[7]:-. CHARG[7]:-. COVAL[7]:-. HBOND[7]:+. HPHOB[7]:-. IONIZ[7]:-. NITRO[7]:2. POLAR[7]:+. POSNG[7]:0. SMALL[7]:-. SULPH[7]:-. TEENY[7]:-. CRING[7]:-. VALEN[7]:2.
AMINO[8]:Q. ALIPH[8]:+. AROMA[8]:-. CBETA[8]:-. CHARG[8]:-. COVAL[8]:-. HBOND[8]:+. HPHOB[8]:-. IONIZ[8]:-. NITRO[8]:2. POLAR[8]:+. POSNG[8]:0. SMALL[8]:-. SULPH[8]:-. TEENY[8]:-. CRING[8]:-. VALEN[8]:2.

MULT3:7. MULT5:4. MULT7:3. MULT9:2. 2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.
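Part of a document like this can be generated mechanically from the sequence alone. The Python sketch below builds only the AMINO[...] entries and the four n-gram entries for one residue. The reading of 2GRAM/3GRAM as the letters just before the residue, and GRAM2/GRAM3 as the letters just after it, is inferred from the sample above rather than stated in the talk. The property entries (ALIPH, AROMA, ...) and the PRNUM, AANUM and MULT entries are left out.

```python
def residue_document(sequence, index, width=8):
    """Build AMINO and n-gram tokens for the residue at `index`."""
    tokens = []
    for offset in range(-width, width + 1):
        pos = index + offset
        if 0 <= pos < len(sequence):  # skip positions that fall off either end
            tokens.append(f"AMINO[{offset}]:{sequence[pos]}.")
    # Letters immediately before the residue (assumed meaning of 2GRAM / 3GRAM).
    before2 = sequence[max(0, index - 2):index]
    before3 = sequence[max(0, index - 3):index]
    # Letters immediately after the residue (assumed meaning of GRAM2 / GRAM3).
    after2 = sequence[index + 1:index + 3]
    after3 = sequence[index + 1:index + 4]
    tokens += [f"2GRAM:{before2}.", f"GRAM2:{after2}.",
               f"3GRAM:{before3}.", f"GRAM3:{after3}."]
    return tokens

# The 17 residues of the sample document, with the R at offset 0.
window = "LLFATCIARHQQRQQQQ"
doc = residue_document(window, index=8)
```

For this window the sketch reproduces the sample's 2GRAM:IA, GRAM2:HQ, 3GRAM:CIA and GRAM3:HQQ entries.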


Artificial Intelligence

Computers do things
only human brains
can otherwise do

expert

expert


Artificial Intelligence

Computers do things
only human brains
can otherwise do

expert

system

expert


Artificial Intelligence

Computers do things
only human brains
can otherwise do

learning

system

expert

system


Machine learning

What is machine learning?

creating computer programs that get better with experience

learn how to make expert judgments

discover previously hidden, potentially useful information (data mining)

How does it work?

user provides learning system with examples of concept to be learned

induction algorithm infers a characteristic model of the examples

model is used to predict whether or not future novel instances are also examples

and it does this very consistently, and very, very quickly!
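The three steps (examples in, model inferred, predictions out) can be sketched with a very small learner. The talk does not say which induction algorithm is used; the Python below uses a naive Bayes classifier purely as a stand-in, and the four training documents are invented. It assumes at least one example of each class is supplied.

```python
import math
from collections import Counter

def induce(examples):
    """Infer a model from (tokens, label) examples, where label is True or False."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for tokens, label in examples:
        counts[label].update(tokens)
        totals[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, totals, vocab

def predict(model, tokens):
    """Return True if the tokens look more like the positive examples."""
    counts, totals, vocab = model
    scores = {}
    for label in (True, False):
        n_tokens = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for token in tokens:
            # Add-one smoothing so unseen tokens do not give a zero probability.
            score += math.log((counts[label][token] + 1) / (n_tokens + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical training documents: True means "first residue of the mature protein".
training = [
    (["AMINO[-1]:A", "AMINO[0]:R"], True),
    (["AMINO[-1]:A", "AMINO[0]:Q"], True),
    (["AMINO[-1]:L", "AMINO[0]:L"], False),
    (["AMINO[-1]:F", "AMINO[0]:L"], False),
]
model = induce(training)
novel_yes = predict(model, ["AMINO[-1]:A", "AMINO[0]:H"])
novel_no = predict(model, ["AMINO[-1]:L", "AMINO[0]:L"])
```

The first novel instance has never been seen in full, but the alanine just before it matches the positive examples, so the model judges it to be an example of the concept.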


Bioinformatics

Biologists know proteins, computer scientists know machine learning

Together, they can find hidden and potentially useful information about genes and proteins

Biotechnology is a multi-billion dollar industry

Biotechnology is one of the best funded areas of scientific research

Shortage of people educated in bioinformatics



The University of Waikato


Waikato University is ranked first in the country in computer science and in molecular, cellular, and whole-organism biology

centre of the universe for machine learning



The University of Waikato


If you’re interested in getting involved in
bioinformatics, or indeed any other area
along the leading edge of computer
science and/or biology, then …


Waikato wants You!