Bioinformatics
The application of computer science to biological data
Tony C Smith
Department of Computer Science
University of Waikato
tcs@cs.waikato.ac.nz
The essence is prediction …
My dog is very littl_ ?
We know that letters do not occur in English at random (e.g. ‘t’ is more common than ‘x’)
We know that context changes the probability of a letter (e.g. ‘x’ is more common than ‘t’ after the sequence “I eat Weet-Bi_”)
Predicting symbols is fundamental to a wide range of important applications (e.g. encryption, compression)
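As a rough illustration of how context drives prediction, the sketch below counts which character follows each short context in some training text and then guesses the most frequent follower. The function names `train` and `predict`, the context length of 3 and the tiny corpus are all assumptions made for this example, not part of the original slides.

```python
from collections import Counter, defaultdict

def train(text, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def predict(counts, context, order=3):
    """Return the most frequent next character for the context, or None if unseen."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text, purely for illustration.
corpus = "my dog is very little. my cat is very little. the little dog is little."
model = train(corpus)
print(predict(model, "my dog is very littl"))  # e
```

A longer context usually gives sharper predictions but needs far more training text before each context has been seen often enough.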
Prediction in bioinformatics
Predicting the location of genes in DNA
Predicting gene roles in an organism
Predicting errors in a genetic transcription
Predicting the function of proteins
Predicting diseases from molecular samples
Anything that involves “making a judgment”: a yes/no decision about whether some sample datum does or does not have some property.
Representation
W e e t - B i x
01010111 01100101 01100101 01110100 00101101 …
… to the computer, everything is binary!
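The bit patterns on this slide are the 8-bit ASCII codes of the characters. A minimal sketch that reproduces them follows; `to_bits` is a name chosen for this example.

```python
def to_bits(text):
    """Return the 8-bit binary ASCII code of each character in text."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

print(to_bits("Weet-"))
# ['01010111', '01100101', '01100101', '01110100', '00101101']
```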
01010111 01100101 01100101 01110100 00101101
01 01 10 11 00 10 01 11 11 10 11 01 00 11 01 00 00 10 11 01
A A C G T C A T T C G A T G A T T C G A
Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, like a genetic sequence.
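Since DNA has only four letters, each nucleotide fits in two bits. The sketch below shows one way to do this. The particular mapping (A=00, C=01, G=10, T=11) is an arbitrary assumption for illustration, and any one-to-one assignment would work equally well.

```python
# Assumed mapping, chosen only for this example.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}

def encode(seq):
    """Turn a nucleotide string into a string of bits, two per base."""
    return "".join(CODE[base] for base in seq.upper())

def decode(bits):
    """Turn a bit string back into nucleotides, reading two bits at a time."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode("acgt"))        # 00011011
print(decode("00011011"))    # ACGT
```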
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagc
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcg
cctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgc
agctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagctgc
aatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagct
acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcg
gcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattca
cgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcat
cagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgct
tcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatgga
cgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcaga
cgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaa
aatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgac
atcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttat
tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctttt
actacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacga
cgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatatt
cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactac
gacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacg
actacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgttgcgcacccacaccagttatatagagacgaactc
A genetic prediction problem
[Slide: the sequence above, repeated many more times at a smaller scale]
A genetic prediction problem
A gene encodes a protein
It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism.
A genetic prediction problem
[Figure: a gene with its untranslated region and encoding region, a transcription factor, and RNA]
A genetic prediction problem
untranslated region
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
What transcription factors bind to this gene?
Where is the transcription factor binding site?
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues:
A binding site is often a short general pattern
E.g. CCGATNATCGG
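A pattern like this can be searched for directly. The sketch below treats N as "any nucleotide" (the usual IUPAC convention) and reports every position where the pattern matches. The function names and the test sequence are made up for this example.

```python
import re

def motif_to_regex(motif):
    """Compile a motif into a regex, with N matching any of A, C, G, T."""
    return re.compile("".join("[ACGT]" if c == "N" else c for c in motif.upper()))

def find_sites(sequence, motif):
    """Return every 0-based start position where the motif matches, overlaps included."""
    pattern = motif_to_regex(motif)
    seq = sequence.upper()
    last_start = len(seq) - len(motif)
    return [i for i in range(last_start + 1) if pattern.match(seq, i)]

# Hypothetical sequence with one planted site at position 3.
print(find_sites("ttgccgatgatcggaa", "CCGATNATCGG"))  # [3]
```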
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues:
The patterns are often reverse complements
E.g. CCGATNATCGG
     GGCTANTAGCC
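The reverse complement of a sequence is obtained by swapping A with T and C with G, then reading the result backwards. A small sketch follows, with `reverse_complement` as a name chosen for this example. It shows that the example pattern reads the same on both strands.

```python
# Translation table for complementing; N (any nucleotide) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print("CCGATNATCGG".translate(COMPLEMENT))  # GGCTANTAGCC (the opposite strand)
print(reverse_complement("CCGATNATCGG"))    # CCGATNATCGG (identical to the pattern)
```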
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues:
Where there is one binding site, often there is another nearby.
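One simple way to use this clue is to take the positions of candidate sites and keep only those pairs that lie close together. In the sketch below, the threshold of 50 nucleotides and the list of positions are hypothetical values chosen purely for illustration.

```python
def nearby_pairs(positions, max_gap=50):
    """Return all pairs of site positions that lie within max_gap of each other."""
    ordered = sorted(positions)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            if right - left > max_gap:
                break  # positions are sorted, so later ones are even further away
            pairs.append((left, right))
    return pairs

print(nearby_pairs([10, 40, 200, 230, 500]))  # [(10, 40), (200, 230)]
```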
A genetic prediction problem
All of these properties are the kinds of things that computer science has developed algorithms and data structures to identify quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.
Proteomics
Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.
A string of amino acids makes a protein.
3 nucleotides, 4 possibilities each: 4^3 = 64 possible codons
But there are only 20 amino acids!
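The count of 64 can be checked by enumerating every three-letter combination of the four nucleotides:

```python
from itertools import product

# Every ordered triple drawn from the four nucleotides.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```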
Proteomics
Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG
There is quite a bit of redundancy in codons.
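The redundancy can be seen by inverting the table: several codons map to the same amino acid. The sketch below uses only the three amino acids listed on this slide, so it is a deliberately partial codon table. The `translate` helper and the input sequence are made up for illustration, and unknown codons are shown as "?".

```python
# Partial table: only the entries given on the slide.
CODONS_FOR = {
    "Glycine": ["GGA", "GGC", "GGG", "GGT"],
    "Tyrosine": ["TAT", "TAC"],
    "Methionine": ["ATG"],
}
# Invert it so each codon looks up its amino acid.
AMINO_ACID_FOR = {codon: aa for aa, codons in CODONS_FOR.items() for codon in codons}

def translate(dna):
    """Read the sequence three nucleotides at a time and look up each codon."""
    dna = dna.upper()
    return [AMINO_ACID_FOR.get(dna[i:i + 3], "?") for i in range(0, len(dna) - 2, 3)]

print(translate("atgggatatggc"))
# ['Methionine', 'Glycine', 'Tyrosine', 'Glycine']
```

Note that GGA and GGC are different codons but both come out as glycine.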
Amino Acid
[Figure: an amino acid, with its amino group, carboxyl group and R group labelled]
Amino Acid
glycine
tyrosine
Artificial Intelligence
Computers do things only human brains can otherwise do
[Figure labels: expert, expert system, learning system]
Machine learning
What is machine learning?
creating computer programs that get better with experience
learn how to make expert judgments
discover previously hidden, potentially useful information (data mining)
How does it work?
user provides learning system with examples of concept to be learned
induction algorithm infers a characteristic model of the examples
model is used to predict whether or not future novel instances are also examples, and it does this very consistently, and very, very quickly!
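The three steps above (examples, induced model, prediction) can be sketched with a very simple learner. This is only an illustration under stated assumptions: it models each position of a short sequence independently, uses add-one smoothing, and labels a new sequence by whichever model (examples or non-examples) gives it the higher likelihood. The training sequences are invented, and this is not meant to represent any particular algorithm from the slides.

```python
import math
from collections import Counter

def fit(examples, alphabet="ACGT"):
    """Induce per-position nucleotide probabilities (add-one smoothing)."""
    model = []
    for pos in range(len(examples[0])):
        counts = Counter(seq[pos] for seq in examples)
        total = len(examples) + len(alphabet)
        model.append({base: (counts[base] + 1) / total for base in alphabet})
    return model

def log_likelihood(model, seq):
    """Sum the log probability of each base at its position."""
    return sum(math.log(column[base]) for column, base in zip(model, seq))

def is_example(pos_model, neg_model, seq):
    """Predict whether seq looks more like the examples than the non-examples."""
    return log_likelihood(pos_model, seq) > log_likelihood(neg_model, seq)

# Invented training data: examples and non-examples of some concept.
positives = ["CCGAT", "CCGAA", "CCGTT", "CCGAT"]
negatives = ["ATATA", "TTGCA", "GATAC", "AATTG"]
pos_model, neg_model = fit(positives), fit(negatives)

print(is_example(pos_model, neg_model, "CCGTA"))  # True
print(is_example(pos_model, neg_model, "ATTTA"))  # False
```

Neither test sequence appears in the training data, so both predictions come from the induced model rather than from memorised examples.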
Biotechnology
Biologists know proteins, computer scientists know machine learning
Together, they can find out a lot of hidden information about genes and proteins
Biotechnology is a multi-billion dollar industry
Biotechnology is one of the best funded areas of scientific research
The University of Waikato
Waikato University is the centre of the universe for machine learning
The Machine Learning Group is a large, globally active, well-funded research group
The WEKA workbench of ML tools is used around the world
Professors at Waikato University literally wrote the book on sequence modeling
The University of Waikato
If you’re seriously interested in machine learning, in getting involved in bioinformatics research, or indeed any other area along the leading edge of computer science, then university is the only place to be, and Waikato wants You!