Bioinformatics - What and Why?
The following PowerPoint
presentation is designed to give
some background information on
bioinformatics.
This presentation is modified from information supplied by Dr.
Bruno Gaeta, with permission from eBioInformatics Pty
Ltd (c) Copyright
The need for bioinformaticians
The number of entries in databases of gene sequences is
increasing exponentially. Bioinformaticians are needed to
understand and use this information.
[Figure: GenBank growth, 1982 to 1999]
Genome sequencing projects, including the Human Genome Project,
are producing vast amounts of information. The challenge is to use
this information in a useful way.
COMPLETE/PUBLIC
Aquifex aeolicus
Pyrococcus horikoshii
Bacillus subtilis
Treponema pallidum
Borrelia burgdorferi
Helicobacter pylori
Archaeoglobus fulgidus
Methanobacterium thermo.
Escherichia coli
Mycoplasma pneumoniae
Synechocystis sp. PCC6803
Methanococcus jannaschii
Saccharomyces cerevisiae
Mycoplasma genitalium
Haemophilus influenzae
COMPLETE/PENDING PUBLICATION
Rickettsia prowazekii
Pseudomonas aeruginosa
Pyrococcus abyssi
Bacillus sp. C-125
Ureaplasma urealyticum
Pyrobaculum aerophilum
ALMOST/PUBLIC
Pyrococcus furiosus
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis CSU93
Neisseria gonorrhoeae
Neisseria meningitidis
Streptococcus pyogenes
Terry Gaasterland, Siv Andersson, Christoph Sensen
http://www.mcs.anl.gov/home/gaasterl/genomes.html
Publicly available genomes (April 1998)
“...We must hook our individual computers into the
worldwide network that gives us access to daily changes
in the databases and also makes immediate our
communications with each other. The programs that
display and analyze the material for us must be improved
- and we must learn to use them more effectively. Like the
purchased kits, they will make our life easier, but also like
the kits, we must understand enough of how they work to
use them effectively...”
Walter Gilbert (1991)
“Towards a paradigm shift in biology”
Nature News and Views
349:99
Bioinformatics impacts on all aspects of
biological research.
Promises of genomics and
bioinformatics
Medicine
Knowledge of protein structure facilitates drug design
Understanding of genomic variation allows the tailoring
of medical treatment to the individual’s genetic make-up
Genome analysis allows the targeting of genetic
diseases
The effect of a disease or of a therapeutic on RNA and
protein levels can be elucidated
The same techniques can be applied to
biotechnology, crop and livestock
improvement, etc...
What is bioinformatics?
Application of information technology to the
storage, management and analysis of
biological information
Facilitated by the use of computers
What is bioinformatics?
Sequence analysis
Geneticists/ molecular biologists analyse genome sequence
information to understand disease processes
Molecular modeling
Crystallographers/ biochemists design drugs using computer-aided
tools
Phylogeny/evolution
Geneticists obtain information about the evolution of organisms by
looking for similarities in gene sequences
Ecology and population studies
Bioinformatics is used to handle large amounts of data obtained in
population studies
Medical informatics
Personalised medicine
Sequence analysis: overview
[Flowchart: three groups of tasks]
Sequence entry: manual sequence entry; sequence database browsing; sequencing project management
Nucleotide sequence analysis (starting from a nucleotide sequence file): search databases for similar sequences; sequence comparison; multiple sequence analysis; design further experiments; restriction mapping; PCR planning; search for protein coding regions (coding / non-coding); translate into protein; search for known motifs; RNA structure prediction
Protein sequence analysis (starting from a protein sequence file): search databases for similar sequences; sequence comparison; search for known motifs; predict secondary structure; predict tertiary structure; create a multiple sequence alignment; edit the alignment; format the alignment for publication; molecular phylogeny; protein family analysis
Gene Sequencing:
Automated chemical
sequencing methods allow rapid generation of
large data banks of gene sequences
Database similarity searching: The BLAST program has been written
to allow rapid comparison of a new gene sequence with the hundreds of
thousands of gene sequences in databases
Sequences producing significant alignments: (bits) Value
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]      112  7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...   106  5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...    69  7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]       30  0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]       29  1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]       29  1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2   QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51  PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
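The "Identities" and "Gaps" figures in a report like this can be recounted from any pair of aligned lines. The following is a minimal Python sketch, not part of BLAST itself: the function name `alignment_stats` is made up for illustration, and the "Positives" count is left out because it depends on a substitution matrix. The input is the first Query/Sbjct block of the report above.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identical columns and gap columns in two aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(query, subject))
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(1 for q, s in columns if q == s and q != "-")
    # A column is a gap column when either sequence has a gap character.
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return {"length": len(columns), "identities": identities, "gaps": gaps}


# First aligned block of the report (query residues 2-50, subject 174-233).
query = ("QSVPWGISRVQAPAAHNRG" + "-" * 9 + "LTGSGVKVAVLDTGIST"
         + "-" + "HPDLNIRGG" + "-" + "ASFV")
sbjct = "EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI"

stats = alignment_stats(query, sbjct)
print(stats)
```

The totals in the report (85/259 identities, 32/259 gaps) cover the whole alignment, so this single block accounts for only part of them.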
Sequence comparison:
Gene sequences can be aligned to see similarities
between genes from different sources
768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
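One standard way to produce a pairwise alignment is global dynamic programming (the Needleman-Wunsch algorithm). The sketch below is a minimal Python illustration under simple assumptions: a single linear gap penalty and match/mismatch scores chosen for the example. It is not the program that produced the alignment above, and its scoring values are hypothetical.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two bases
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

Production tools usually charge less for extending a gap than for opening one (affine gap penalties), which is why long gaps such as the 13-base gap above appear as single blocks.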
Restriction mapping:
Gene sequences can be
analysed to find the sites that can be
cleaved by restriction enzymes
AceIII     1  CAGCTCnnnnnnn’nnn...
AluI       2  AG’CT
AlwI       1  GGATCnnnn’n_
ApoI       2  r’AATT_y
BanII      1  G_rGCy’C
BfaI       2  C’TA_G
BfiI       1  ACTGGG
BsaXI      1  ACnnnnnCTCC
BsgI       1  GTGCAGnnnnnnnnnnn...
BsiHKAI    1  G_wGCw’C
Bsp1286I   1  G_dGCh’C
BsrI       2  ACTG_Gn’
BsrFI      1  r’CCGG_y
CjeI       2  CCAnnnnnnGTnnnnnn...
CviJI      4  rG’Cy
CviRI      1  TG’CA
DdeI       2  C’TnA_G
DpnI       2  GA’TC
EcoRI      1  G’AATT_C
HinfI      2  G’AnT_C
MaeIII     1  ’GTnAC_
MnlI       1  CCTCnnnnnn_n’
MseI       2  T’TA_A
MspI       1  C’CG_G
NdeI       1  CA’TA_TG
Sau3AI     2  ’GATC_
SstI       1  G_AGCT’C
TfiI       2  G’AwT_C
Tsp45I     1  ’GTsAC_
Tsp509I    3  ’AATT_
TspRI      1  CAGTGnn’
[Figure: restriction map, position axis 50 to 250]
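Finding recognition sites is essentially a pattern search in which ambiguity letters (r, y, w, n and so on) stand for sets of bases. A minimal Python sketch follows, assuming the ’ and _ characters in the listing mark cut positions and can be dropped for the search. The helper names and the test sequence are invented for illustration, and only the given strand is scanned, which is sufficient for palindromic sites such as EcoRI.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_to_regex(site: str) -> str:
    """Translate a recognition site such as 'GANTC' into a regex."""
    return "".join(IUPAC[base] for base in site.upper())


def find_sites(dna: str, site: str) -> list:
    """Return 1-based start positions of every (possibly overlapping) match."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + site_to_regex(site) + ")")
    return [m.start() + 1 for m in pattern.finditer(dna.upper())]


# Hypothetical test sequence; sites are written without cut marks.
dna = "TTGAATTCAAGCTTGACTCAA"
for name, site in [("EcoRI", "GAATTC"), ("AluI", "AGCT"),
                   ("HinfI", "GANTC"), ("ApoI", "RAATTY")]:
    print(name, find_sites(dna, site))
```

For non-palindromic sites the reverse complement of the sequence would need to be searched as well.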
PCR Primer Design:
Oligonucleotides for use in the polymerase
chain reaction can be designed using computer-based
programs
OPTIMAL primer length                 --> 20
MINIMUM primer length                 --> 18
MAXIMUM primer length                 --> 22
OPTIMAL primer melting temperature    --> 60.000
MINIMUM acceptable melting temp       --> 57.000
MAXIMUM acceptable melting temp       --> 63.000
MINIMUM acceptable primer GC%         --> 20.000
MAXIMUM acceptable primer GC%         --> 80.000
Salt concentration (mM)               --> 50.000
DNA concentration (nM)                --> 50.000
MAX no. unknown bases (Ns) allowed    --> 0
MAX acceptable self-complementarity   --> 12
MAXIMUM 3' end self-complementarity   --> 8
GC clamp how many 3' bases            --> 0
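A few of these constraints can be checked with very little code. The Python sketch below tests length, GC content, unknown bases and melting temperature against the limits listed above. It estimates the melting temperature with the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is a rough approximation chosen for brevity; the program behind the slide also takes salt and DNA concentration, so it presumably uses a more detailed model. The function names and the example primer are made up.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature in degrees C (Wallace rule, an approximation)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def check_primer(primer: str, min_len: int = 18, max_len: int = 22,
                 min_tm: float = 57.0, max_tm: float = 63.0,
                 min_gc: float = 20.0, max_gc: float = 80.0,
                 max_ns: int = 0) -> dict:
    """Report length, GC%, Tm and whether all listed limits are met."""
    primer = primer.upper()
    length, gc, tm = len(primer), gc_percent(primer), wallace_tm(primer)
    ok = (min_len <= length <= max_len
          and min_tm <= tm <= max_tm
          and min_gc <= gc <= max_gc
          and primer.count("N") <= max_ns)
    return {"length": length, "gc_percent": gc, "tm": tm, "ok": ok}


# Hypothetical 20-base primer.
report = check_primer("ATGCATGCATGCATGCATGC")
print(report)
```

Self-complementarity and 3' end checks are omitted here; they require comparing the primer with its own reverse complement.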
Gene discovery:
Computer programs can be used to recognise the
protein-coding regions in DNA
[Figure: plot created using codon preference (GCG). Three panels, each with a vertical axis from 0.0 to 2.0, over sequence positions 0 to 4,000]
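The simplest way to look for candidate coding regions is to scan each reading frame for an open reading frame (ORF): a start codon followed in the same frame by a stop codon. The Python sketch below does this for the three forward frames only. It is a much cruder method than the codon preference statistics used for the plot, and the function name and `min_codons` parameter are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 1) -> list:
    """Return (start, end, frame) for ORFs on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # close it at the stop codon
    return orfs


# Hypothetical sequence with one short ORF: ATG AAA TAG.
orfs = find_orfs("CCATGAAATAGCC")
print(orfs)
```

A fuller search would also scan the reverse complement, giving six reading frames in total.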
RNA structure prediction:
Structural features of RNA can be predicted
[Figure: predicted RNA secondary structure. Bases in the order extracted from the drawing: GGACAGGAGGAUACCGCGGUCCUGCCGGUCCUCACUUGGACUUAGUAUCAUCAGUCUGCGCAAUAGGUAACGCGU]
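A classic textbook approach to RNA secondary structure is to find the folding with the largest number of base pairs (the Nussinov algorithm). The Python sketch below is a minimal version that only counts pairs; it assumes A-U, G-C and G-U pairs and a minimum hairpin loop of three bases. Practical prediction programs minimise free energy instead, so this illustrates the idea rather than the method behind the figure.

```python
# Allowed base pairs (Watson-Crick plus G-U wobble), an assumption of this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least min_loop
    unpaired bases enclosed by every hairpin."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                  # base i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (rna[i], rna[k]) in PAIRS:       # base i paired with base k
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


# Hypothetical hairpin: three G-C pairs closing an AAA loop.
print(max_base_pairs("GGGAAACCC"))
```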
Protein structure prediction:
Particular structural features can be recognised in protein
sequences
[Figure: protein structure prediction plot against residue position (axis marks at 50 and 100). Panels: KD Hydrophobicity, Surface Prob., Flexibility, Antigenic Index, CF Turns, CF Alpha Helices, CF Beta Sheets, GOR Alpha Helices, GOR Turns, GOR Beta Sheets, Glycosylation Sites]
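The "KD Hydrophobicity" panel is a Kyte-Doolittle hydropathy profile: each residue is given a hydropathy value and the values are averaged over a sliding window. A minimal Python sketch follows. The scale is the published Kyte-Doolittle scale; the window length of 9 and the function name are choices made for this example, not settings taken from the slide.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 9) -> list:
    """Mean hydropathy of every window of the given length.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        return []
    values = [KD_SCALE[aa] for aa in protein]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Residues 234-257 of the database sequence in the BLAST report above.
profile = hydropathy_profile("PANDEASDLNGHGTHCAGIIGSKH", window=9)
print([round(v, 2) for v in profile])
```

Peaks in such a profile suggest buried or membrane-spanning stretches, while troughs suggest surface-exposed regions.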
Protein Structure:
the 3-D structure of proteins is used to
understand protein function and
design new drugs
Multiple sequence alignment:
Sequences of proteins from different organisms can be
aligned to see similarities and differences
Alignment formatted using MacBoxshade
Phylogeny inference:
Analysis of sequences
allows evolutionary relationships to be determined
[Figure: phylogenetic tree with leaves E.coli, C.botulinum, C.cadaveris, C.butyricum, B.subtilis, B.cereus]
Phylogenetic tree constructed using the Phylip package
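Distance-based tree building starts from a matrix of pairwise distances between aligned sequences. The Python sketch below computes the simplest such measure, the p-distance (fraction of differing sites), and picks the closest pair, which is the first pair a clustering method such as UPGMA would join. It does not use the Phylip package, and the taxon names and sequences are invented toy data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites, ignoring columns that contain a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned: dict) -> dict:
    """Map each pair of names to the p-distance between their sequences."""
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(aligned, 2)}


def closest_pair(aligned: dict) -> tuple:
    """The pair of names with the smallest distance."""
    distances = distance_matrix(aligned)
    return min(distances, key=distances.get)


# Toy alignment (hypothetical data, not taken from the slide).
toy = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAACGTTT",
    "taxon_D": "TCGAACCTTT",
}
print(distance_matrix(toy))
print(closest_pair(toy))
```

Real analyses normally correct the raw distances for multiple substitutions at the same site before building the tree.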
Large scale bioinformatics:
genome projects
Mapping: identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping
Sequencing: assembling clone sequence reads into large (eventually complete) genome sequences
Gene discovery: identifying coding regions in genomic DNA by database searching and other methods
Function assignment: using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene
Data mining: searching for relationships and correlations in the information
Genome comparison: comparing different complete genomes to infer evolutionary history and genome rearrangements
Challenges in bioinformatics
Explosion of information
Need for faster, automated analysis to process large
amounts of data
Need for integration between different types of
information (sequences, literature, annotations, protein
levels, RNA levels etc…)
Need for “smarter” software to identify interesting
relationships in very large data sets
Lack of “bioinformaticians”
Software needs to be easier to access, use and
understand
Biologists need to learn about the software, its
limitations, and how to interpret its results