Introduction to bioinformatics

abalonestrawBiotechnology

Oct 2, 2013

Bioinformatics - What and Why?

The following PowerPoint presentation is designed to give
some background information on bioinformatics.

This presentation is modified from information supplied by Dr.
Bruno Gaeta, with permission from eBioInformatics Pty
Ltd (c) Copyright


The need for bioinformaticians


The number of entries in databases of gene sequences is
increasing exponentially. Bioinformaticians are needed to
understand and use this information.

[Figure: GenBank growth by year, 1982 to 1999]

Genome sequencing projects, including the Human Genome Project,
are producing vast amounts of information. The challenge is to use
this information in a useful way.

COMPLETE/PUBLIC

Aquifex aeolicus

Pyrococcus horikoshii

Bacillus subtilis

Treponema pallidum

Borrelia burgdorferi

Helicobacter pylori

Archaeoglobus fulgidus

Methanobacterium thermo.

Escherichia coli

Mycoplasma pneumoniae

Synechocystis sp. PCC6803

Methanococcus jannaschii

Saccharomyces cerevisiae

Mycoplasma genitalium

Haemophilus influenzae



COMPLETE/PENDING PUBLICATION

Rickettsia prowazekii

Pseudomonas aeruginosa

Pyrococcus abyssi

Bacillus sp. C-125

Ureaplasma urealyticum

Pyrobaculum aerophilum


ALMOST/PUBLIC

Pyrococcus furiosus

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis CSU93

Neisseria gonorrhoeae

Neisseria meningitidis

Streptococcus pyogenes

Terry Gaasterland, Siv Andersson, Christoph Sensen

http://www.mcs.anl.gov/home/gaasterl/genomes.html

Publicly available genomes (April 1998)

“...We must hook our individual computers into the
worldwide network that gives us access to daily changes
in the databases and also makes immediate our
communications with each other. The programs that
display and analyze the material for us must be improved -
and we must learn to use them more effectively. Like the
purchased kits, they will make our life easier, but also like
the kits, we must understand enough of how they work to
use them effectively…”

Walter Gilbert (1991)
“Towards a paradigm shift in biology”
Nature News and Views 349:99


Bioinformatics impacts on all aspects of
biological research.

Promises of genomics and
bioinformatics


Medicine


Knowledge of protein structure facilitates drug design


Understanding of genomic variation allows the tailoring
of medical treatment to the individual’s genetic make-up


Genome analysis allows the targeting of genetic
diseases


The effect of a disease or of a therapeutic on RNA and
protein levels can be elucidated


The same techniques can be applied to
biotechnology, crop and livestock
improvement, etc...


What is bioinformatics?


Application of information technology to the
storage, management and analysis of
biological information


Facilitated by the use of computers

What is bioinformatics?


Sequence analysis


Geneticists/ molecular biologists analyse genome sequence
information to understand disease processes


Molecular modeling


Crystallographers/ biochemists design drugs using
computer-aided tools


Phylogeny/evolution


Geneticists obtain information about the evolution of organisms by
looking for similarities in gene sequences


Ecology and population studies


Bioinformatics is used to handle large amounts of data obtained in
population studies


Medical informatics


Personalised medicine

Sequence analysis: overview

[Flowchart; box labels as extracted, grouped by area. The connecting arrows are not recoverable.]

Sequence entry: Manual sequence entry; Sequence database browsing; Sequencing project management

Nucleotide sequence analysis (starting from a nucleotide sequence file): Search databases for similar sequences; Sequence comparison; Multiple sequence analysis; Design further experiments (Restriction mapping, PCR planning); Search for protein coding regions (branches labelled "coding" and "non-coding"); Translate into protein; Search for known motifs; RNA structure prediction

Protein sequence analysis (starting from a protein sequence file): Search databases for similar sequences; Sequence comparison; Search for known motifs; Predict secondary structure; Predict tertiary structure; Create a multiple sequence alignment; Edit the alignment; Format the alignment for publication; Molecular phylogeny; Protein family analysis

Gene Sequencing:
Automated chemical sequencing methods allow rapid generation of
large data banks of gene sequences

Database similarity searching: The BLAST program was written
to allow rapid comparison of a new gene sequence with the hundreds of
thousands of gene sequences in databases

Sequences producing significant alignments:                      Score (bits)  E value

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]      112   7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...   106   5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...    69   7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]       30   0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]       29   1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]       29   1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478

Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2   QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51  PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
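The "Identities" and "Gaps" figures in a BLAST report are simple column counts over the aligned rows. The following is a minimal sketch, not BLAST's own code: the function name `alignment_stats` is hypothetical, and it only counts identical columns and gap columns ("Positives" would additionally need a substitution matrix, which is left out here). The two strings are the first block of the Query/Sbjct alignment shown on the slide.

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in one pair of aligned rows.

    Both rows must have the same length; '-' marks a gap.
    Returns (identities, gaps, alignment_length).
    """
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(query, subject))
    identities = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return identities, gaps, len(columns)


# First 60-column block of the alignment from the slide (Query 2-50, Sbjct 174-233).
QUERY = "QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV"
SBJCT = "EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI"

if __name__ == "__main__":
    ident, gaps, length = alignment_stats(QUERY, SBJCT)
    print(f"Identities = {ident}/{length}, Gaps = {gaps}/{length}")
```

The report on the slide gives totals over the whole 259-column alignment; the sketch above only covers the one block that is visible.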

Sequence comparison:

Gene sequences can be aligned to see similarities
between genes from different sources

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
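An alignment like the one above can be produced by dynamic programming. The sketch below uses the classic Needleman-Wunsch global alignment with a simple linear gap penalty; it is an illustration only, and neither the scoring values nor the function name `global_align` come from the slides (the program that produced the slide's alignment is not named). Gaps are written as "." and matches as "|" to mirror the display above.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, top_row, match_row, bottom_row).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, mid, bottom = [], [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            mid.append("|" if a[i - 1] == b[j - 1] else " ")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append(".")
            mid.append(" ")
            i -= 1
        else:
            top.append(".")
            bottom.append(b[j - 1])
            mid.append(" ")
            j -= 1
    return (score[-1][-1], "".join(reversed(top)),
            "".join(reversed(mid)), "".join(reversed(bottom)))


if __name__ == "__main__":
    total, top, mid, bottom = global_align("ACGT", "AGT")
    print(top, mid, bottom, f"score = {total}", sep="\n")
```

Real alignment programs typically use affine gap penalties (a separate cost for opening and extending a gap), which is why long gaps such as the 13-base one above are tolerated more readily than this linear scheme would allow.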


Restriction mapping:
Genes can be analysed to detect sequences that can be
cleaved by restriction enzymes

AceIII     1   CAGCTCnnnnnnn’nnn...
AluI       2   AG’CT
AlwI       1   GGATCnnnn’n_
ApoI       2   r’AATT_y
BanII      1   G_rGCy’C
BfaI       2   C’TA_G
BfiI       1   ACTGGG
BsaXI      1   ACnnnnnCTCC
BsgI       1   GTGCAGnnnnnnnnnnn...
BsiHKAI    1   G_wGCw’C
Bsp1286I   1   G_dGCh’C
BsrI       2   ACTG_Gn’
BsrFI      1   r’CCGG_y
CjeI       2   CCAnnnnnnGTnnnnnn...
CviJI      4   rG’Cy
CviRI      1   TG’CA
DdeI       2   C’TnA_G
DpnI       2   GA’TC
EcoRI      1   G’AATT_C
HinfI      2   G’AnT_C
MaeIII     1   ’GTnAC_
MnlI       1   CCTCnnnnnn_n’
MseI       2   T’TA_A
MspI       1   C’CG_G
NdeI       1   CA’TA_TG
Sau3AI     2   ’GATC_
SstI       1   G_AGCT’C
TfiI       2   G’AwT_C
Tsp45I     1   ’GTsAC_
Tsp509I    3   ’AATT_
TspRI      1   CAGTGnn’

[Figure: restriction map of the sequence, with position ticks at 50, 100, 150, 200 and 250]
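Finding restriction sites amounts to a pattern search in which ambiguity letters (r, y, w, n and so on) stand for sets of bases. The sketch below is one simple way to do this with regular expressions. The function names are hypothetical, the ’ and _ marks in the site strings are assumed to be cut-position markers and are simply stripped, and only the strand given is searched, so non-palindromic sites on the opposite strand would be missed.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_to_regex(site):
    """Turn a site such as "G'AATT_C" or "r'AATT_y" into a compiled regex."""
    core = site
    for mark in ("'", "\u2019", "_"):  # strip the assumed cut-position marks
        core = core.replace(mark, "")
    pattern = "".join(IUPAC[base] for base in core.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=(" + pattern + "))")


def find_sites(sequence, site):
    """Return 1-based start positions of the site on the given strand."""
    regex = site_to_regex(site)
    return [m.start() + 1 for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    dna = "AAGAATTCTTAGCTGAATTC"  # made-up test sequence, not from the slides
    print("EcoRI:", find_sites(dna, "G'AATT_C"))
    print("AluI: ", find_sites(dna, "AG'CT"))
```

Truncated sites from the table (those ending in "...") cannot be used as they stand, because the full pattern is not shown on the slide.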

PCR Primer Design:

Oligonucleotides for use in the polymerase
chain reaction can be designed using
computer-based programs

OPTIMAL primer length --> 20
MINIMUM primer length --> 18
MAXIMUM primer length --> 22
OPTIMAL primer melting temperature --> 60.000
MINIMUM acceptable melting temp --> 57.000
MAXIMUM acceptable melting temp --> 63.000
MINIMUM acceptable primer GC% --> 20.000
MAXIMUM acceptable primer GC% --> 80.000
Salt concentration (mM) --> 50.000
DNA concentration (nM) --> 50.000
MAX no. unknown bases (Ns) allowed --> 0
MAX acceptable self-complementarity --> 12
MAXIMUM 3' end self-complementarity --> 8
GC clamp how many 3' bases --> 0

Gene discovery:

Computer programs can be used to recognise the
protein coding regions in DNA

[Figure: three stacked panels; horizontal axis 0 to 4,000, vertical axes 0.0 to 2.0. Plot created using codon preference (GCG)]
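The simplest way to look for candidate coding regions is to scan each reading frame for open reading frames (ORFs): a start codon followed by a run of codons up to the next stop codon. The sketch below does this for the three forward frames. It is a deliberately simpler technique than the codon preference statistics used for the plot on the slide, and the function name and the `min_codons` threshold are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ATG...stop open reading frames in the three forward frames.

    Returns a list of (start, end, frame) with 1-based inclusive coordinates;
    the end coordinate includes the stop codon. min_codons counts the codons
    from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the last stop opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs


if __name__ == "__main__":
    # Made-up test sequence: ATG AAA TTT GGG TAA embedded in frame 3.
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A fuller gene finder would also scan the reverse complement and, as in the slide's plot, score each frame for how closely its codon usage matches that of known genes.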

RNA structure prediction:
Structural features of RNA can be predicted

[Figure: predicted RNA secondary structure drawn base by base; the layout is not recoverable from the extracted text]
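One classic, easily stated approach to RNA secondary structure is to find the folding with the largest number of nested base pairs (a Nussinov-style dynamic program). The sketch below implements that idea only; it is not the method behind the slide's figure, which is not named, and practical tools minimise free energy instead. The function name, the inclusion of G-U wobble pairs and the minimum hairpin loop of 3 bases are assumptions made for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, Nussinov-style.

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    # Made-up hairpin: a GGG stem closing an AAA loop.
    print(max_base_pairs("GGGAAAUCC"))
```

Recovering the actual pairs would need a traceback through the `best` table, which is omitted to keep the sketch short.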

Protein structure prediction:
Particular structural features can be recognised in protein
sequences

[Figure: plot with one track each for KD Hydrophobicity, Surface Prob., Flexibility, Antigenic Index, CF Turns, CF Alpha Helices, CF Beta Sheets, GOR Alpha Helices, GOR Turns, GOR Beta Sheets and Glycosylation Sites; position axis ticks at 50 and 100]
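The "KD Hydrophobicity" track in a plot like this is a sliding-window average of the Kyte-Doolittle hydropathy value of each residue. The sketch below computes such a profile. The scale values are the published Kyte-Doolittle ones; the function name and the window size of 7 are assumptions made for illustration, since the slide does not state the window that was used.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Average Kyte-Doolittle value over each window of residues.

    Returns one value per window position (len(protein) - window + 1 values);
    an empty list if the sequence is shorter than the window.
    """
    values = [KD[residue] for residue in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Made-up peptide: a hydrophobic stretch flanked by charged residues.
    for position, value in enumerate(hydropathy_profile("KKDELLIVVALLGGDEKK"), 1):
        print(position, round(value, 2))
```

Residues outside the 20 standard amino acids (such as the X characters in masked BLAST output) are not in the table and would raise a KeyError, so they need to be handled before calling the function.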

Protein Structure:
the 3-D structure of proteins is used to understand protein
function and design new drugs

Multiple sequence alignment:
Sequences of proteins from different organisms can be
aligned to see similarities and differences

Alignment formatted using MacBoxshade

Phylogeny inference:
Analysis of sequences
allows evolutionary relationships to be determined

E.coli

C.botulinum

C.cadaveris

C.butyricum

B.subtilis

B.cereus

Phylogenetic tree constructed using the Phylip package
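Distance-based tree building starts from a table of pairwise differences between aligned sequences. The sketch below computes the simplest such measure, the proportion of differing sites (p-distance), and picks the closest pair, which is the first grouping a clustering method would make. It illustrates the idea only and is not the Phylip package's implementation; the function names and the example sequences are made up.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-' or '.') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [
        (x, y) for x, y in zip(a.upper(), b.upper())
        if x not in "-." and y not in "-."
    ]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(1 for x, y in compared if x != y) / len(compared)


def closest_pair(aligned):
    """Return the names of the two sequences with the smallest p-distance."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


if __name__ == "__main__":
    # Made-up aligned sequences, for illustration only.
    seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA"}
    for x, y in combinations(sorted(seqs), 2):
        print(x, y, p_distance(seqs[x], seqs[y]))
    print("closest:", closest_pair(seqs))
```

Tree-building programs usually correct such raw distances for multiple substitutions at the same site before clustering.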

Large scale bioinformatics:
genome projects


Mapping

Identifying the location of
clones and markers on the
chromosome by genetic
linkage analysis and physical
mapping


Sequencing

Assembling clone sequence
reads into large (eventually
complete) genome sequences


Gene discovery

Identifying coding regions in
genomic DNA by database
searching and other methods


Function assignment

Using database searches,
pattern searches, protein
family analysis and structure
prediction to assign a function
to each predicted gene

Data mining

Searching for relationships and
correlations in the information


Genome comparison

Comparing different complete
genomes to infer evolutionary
history and genome
rearrangements

Challenges in bioinformatics


Explosion of information


Need for faster, automated analysis to process large
amounts of data


Need for integration between different types of
information (sequences, literature, annotations, protein
levels, RNA levels etc…)


Need for “smarter” software to identify interesting
relationships in very large data sets


Lack of “bioinformaticians”


Software needs to be easier to access, use and
understand


Biologists need to learn about the software, its
limitations, and how to interpret its results