Bioinformatics: Basics

underlingbuddhaBiotechnology

Oct 2, 2013

Evolution by natural selection


DNA -> RNA -> Protein
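The DNA -> RNA -> Protein flow can be illustrated with a small sketch. It assumes the input is a coding-strand DNA sequence, uses the standard genetic code, and stops at the first stop codon; the function names `transcribe` and `translate` are illustrative and not taken from the slides.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(rna) - 2, 3):
        # The lookup table is keyed on DNA letters, so map U back to T.
        codon = rna[pos:pos + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTAA")))
```

Real sequences also need handling for ambiguous bases, alternative genetic codes, and reading frames, which this sketch leaves out.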


Structure

Function


What is bioinformatics?

Why is there bioinformatics?

Databases, the reagents

What does bioinformatics do?


Bioinformatics is about understanding how life works.

It is a hypothesis-driven science.


Bioinformatics is about integrating biological themes together with the help of computer tools and biological databases, and gaining new knowledge from this.

Acquisition, curation, and analysis of biological data

Hypothesis

- Lots of new sequences being added
- Automated sequencers
- Genome projects
- EST sequencing
- Microarray studies
- Proteomics
- Metagenomics
- WGS
- Patterns in datasets that can only be analyzed using computers



- Genome information
- DNA sequence
- Gene expression
- Protein expression
- Protein structure
- Genome mapping
- Metabolic networks
- Regulatory networks
- Trait mapping
- Gene function analysis
- Scientific literature

"Biology is mere stamp-collecting"


1951 (Sanger & Tupper) - 30 AAs of β-chain bovine insulin
1965 (Holley) - nucleotide sequence of a yeast alanine tRNA
1970s (various) - various protein sequencing methods
1972 (Dayhoff) - "Atlas of Protein Sequence and Structure"
1977 (Sanger, Maxam & Gilbert) - DNA sequencing
1980s (Brenner and various others) - automated sequencing
1980s - community databases
1987-92 - genome sequencing projects
1992 (Venter) - Expressed Sequence Tags and patents
1998 - C. elegans complete genome
2001 - Human genome
Present - widespread use of automated sequencers


[Figure: No. of databases per year, 1999-2011. Source: Nucleic Acids Research, Database issue]

Category                                 No. of databases   Share
DNA Sequence Databases                   134                9%
RNA Sequence Databases                   72                 5%
Protein Sequence Databases               189                13%
Structure Databases                      126                9%
Genomics Databases                       270                19%
Metabolic and Signaling Pathways         119                8%
Human and other Vertebrate Genomes       113                8%
Human Genes & Diseases                   129                9%
Microarray Data & Gene Expression        67                 5%
Proteomics Resources                     20                 1%
Other Molecular Biology Databases        42                 3%
Organelle Databases                      26                 2%
Plant Databases                          106                7%
Immunological Databases                  27                 2%

1. GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in Bethesda, USA
2. European Nucleotide Archive at the European Bioinformatics Institute (EBI) in Hinxton, England
3. DNA Database of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan

INTERNATIONAL NUCLEOTIDE SEQUENCE DATABASE COLLABORATION (INSDC)

Ranks           Higher taxa   Genus    Species   Lower taxa   Total
Archaea         108           127      699       199          1133
Bacteria        1144          2136     11591     11445        26316
Eukaryota       17812         55141    223387    19199        315535
Fungi           1281          3860     23888     1769         30794
Metazoa         12985         35329    102285    8925         159524
Viridiplantae   2141          13580    89422     7214         112357
Viruses         519           358      8092      68135        77104
All taxa        19608         57770    249723    99013        426110

Source: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi

Entries      Bases            Species                         Common name
19,301,988   15,881,839,899   Homo sapiens                    Human
9,296,587    9,118,049,806    Mus musculus                    Mouse
2,187,038    6,503,434,302    Rattus norvegicus               Rat
2,199,144    5,381,235,474    Bos taurus                      Cow
3,927,943    5,055,840,446    Zea mays                        Corn
3,222,429    4,793,300,236    Sus scrofa                      Pig
1,705,210    3,127,958,433    Danio rerio                     Zebra fish
228,265      1,352,948,327    Strongylocentrotus purpuratus   Sea urchin
1,343,269    1,251,053,810    Oryza sativa Japonica Group     Rice
1,770,193    1,194,842,997    Nicotiana tabacum               Tobacco
1,424,327    1,147,237,486    Xenopus (Silurana) tropicalis   Clawed frog
2,321,188    1,138,511,865    Arabidopsis thaliana            Thale cress
1,224,108    1,058,563,193    Drosophila melanogaster         Fruit fly
214,236      1,003,309,475    Pan troglodytes                 Chimpanzee
1,454,515    947,332,578      Canis lupus familiaris          Dog
662,510      915,431,680      Vitis vinifera                  Grape
811,571      896,784,038      Gallus gallus                   Chicken
1,890,146    895,052,594      Glycine max                     Soybean
82,708       828,906,407      Macaca mulatta                  Rhesus macaque
738,474      778,132,243      Solanum lycopersicum            Tomato


Year   Genome                         Number of base pairs
____________________________________________________________
1971   First published DNA sequence   12
1977   PhiX174                        5,375
1982   Lambda                         48,502
1992   Yeast Chromosome III           316,613
1995   Haemophilus influenzae         1,830,138
1996   Saccharomyces                  12,068,000
1998   C. elegans                     97,000,000
2000   D. melanogaster                120,000,000
2001   H. sapiens (draft)             2,600,000,000
2003   H. sapiens                     2,850,000,000

Growth in complete genomes

Cochrane G et al. Nucl. Acids Res. 2011;39:D15-D18

© The Author(s) 2010. Published by Oxford University Press.

Bioinformatics as an interdisciplinary science has to:

1. pick up, provide and apply the appropriate mathematical tools needed for tackling problems of systematic biology;
2. provide a suitable knowledge basis to specify the application of the developed tools;
3. develop appropriate algorithms and implement them as effective computer programs;
4. provide the required technical solutions for handling large amounts of biological data.

Use of computational search and alignment techniques to compare a new genome against known genes

Use of mathematical modeling techniques to identify common patterns, features and high-level functions

Integrated approach that combines both

Purpose                        Software
Sequence assembly              Arachne, GAP4, AMOS
Pairwise sequence comparison   FASTA, BLAST
Sequence-profile comparison    PSI-BLAST
Multiple alignment             ClustalW
Phylogenetic analysis          PAUP, Phylip
Gene identification            Genscan, GeneMarkHMM, GRAIL, Genie, Glimmer
Analysis of repetitive DNAs    RepeatMasker, RepeatFinder, RECON
Protein sequence               Pfam, ProDom, COG
'Fingerprints'/motifs          PROSITE, PRINTS, BLOCKS
Microarray data analysis       GeneTraffic, GeneSpring, GCOS, Cluster, caArray, BASE, Bioconductor
2-D gel analysis               SWISS-2DPAGE, Melanie, Flicker, PDQuest

Database/Tool            URL                                                      Use
PlantsDB                 http://mips.gsf.de/projects/plants                       Similarities and dissimilarities, specific characteristics of individual plant genomes
POGs/PlantRBP            http://plantrbp.uoregon.edu/                             For cross species comparison of genomes
AgBase                   http://www.agbase.msstate.edu                            For functional analysis of genes
TIGR Plant TA database   http://plantta.tigr.org                                  To generate a comprehensive resource of assembled and annotated gene transcripts
PathoPlant®              http://www.pathoplant.de                                 Plant pathogen interactions and signal transduction reactions
PlantGDB                 http://www.plantgdb.org/                                 Resources for comparative genomics
SGN                      http://sgn.cornell.edu                                   Solanaceae genomics network
Sputnik                  http://mips.gsf.de/proj/sputnik/                         EST clustering and annotation system
PopulusDB                http://www.populus.db.umu.se/                            Open resource for tree genomics
HARVEST                  http://harvest.ucr.edu/                                  EST database viewing software
CR-EST                   http://pgrc.ipk-gatersleben.de/cr-est/index.php          Crop EST database
VitisExpDB               http://cropdisease.ars.usda.gov/vitis_at/main-page.htm   Grape gene expression database


What is similar to my sequence?


- Searching gets harder as the databases get bigger, and quality changes
- Tools: BLAST and FASTA = time-saving heuristics (approximate methods)
- Statistics + informed judgment of the biologist
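To give a feel for why heuristics save time, here is a minimal sketch of the word-seeding idea that tools such as BLAST and FASTA build on: index the short words (k-mers) of the subject once, then look up each query word instead of comparing every position against every other. This is a simplified illustration, not the actual BLAST or FASTA algorithm; the function names and the word size `k=4` are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(subject: str, k: int) -> dict:
    """Map every k-mer in the subject sequence to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seed_hits(query: str, subject: str, k: int = 4) -> list:
    """Return (query_pos, subject_pos) pairs where an exact k-mer match occurs."""
    index = build_kmer_index(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


if __name__ == "__main__":
    print(find_seed_hits("ACGTA", "TTACGTT", k=4))
```

A real search tool would then extend each seed hit into a longer alignment and attach a statistical score, which is where the "statistics + informed judgment" step comes in.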


1. Sequence analysis

Pairwise (Global & Local)
Global: aligning sequence pairs in an end-to-end fashion
Local: aligning the most similar regions within a pair of sequences

Multiple sequence alignment (MSA)
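The difference between global and local pairwise alignment can be sketched with the two classic dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The sketch below only computes the optimal score, not the alignment itself, and the scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions rather than values from the slides.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised, so the borders are not zero.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a negative running score is reset to zero,
                # which lets the alignment start anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both sequences end to end.
    # Local: best score found anywhere in the matrix.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("GGGACGT", "ACGTCCC"))
    print(alignment_score("GGGACGT", "ACGTCCC", local=True))
```

With these two sequences the local score rewards the shared ACGT stretch, while the global score is pulled down by the unrelated flanks that must also be aligned.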



Ab initio: The gene looks like the average of many genes
Genscan, GeneMark, GRAIL…

Similarity: The gene looks like a specific known gene
Procrustes,…

Hybrid: A combination of both
Genomescan (http://genes.mit.edu/genomescan/)
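As a toy illustration of signal-based gene finding, the sketch below scans the three forward reading frames for open reading frames (ATG up to the first in-frame stop codon). Programs such as Genscan rely on much richer probabilistic models of gene structure, so this is a deliberately simplified stand-in; the function name `find_orfs` and the `min_codons` cutoff are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand.

    The length check counts the stop codon as one of the codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

A fuller version would also scan the reverse complement and, for eukaryotic genomes, account for introns, which a plain ORF scan cannot see.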


Briefings in Bioinformatics 2006. VOL 8(1) 6-21

GENERIC STEPS INVOLVED IN
EST ANALYSIS


MINING FOR SSRs

TRENDS in Biotechnology 2005 Vol.23(1) 48-55

TOOLS

1. MISA
2. WEBSAT
3. Microsatellite Repeat Finder
4. Perfect Microsatellite Repeat Finder
5. Tandem Repeats Finder
6. Repeat Finder
7. Etandem
8. Msatcommander
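The core idea behind SSR mining, finding short motifs repeated in tandem, can be sketched with a regular expression. This is not how any of the listed tools is implemented; it is a minimal example that finds perfect repeats only, and the thresholds (motif length 2 to 6, at least 4 copies) and the function name `find_ssrs` are assumptions.

```python
import re


def find_ssrs(dna: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 4) -> list:
    """Return (start, unit, repeat_count) for perfect tandem repeats."""
    # Group 1 captures the repeat unit (shortest candidate first); the
    # backreference \1 then requires at least min_repeats - 1 further copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    results = []
    for m in pattern.finditer(dna.upper()):
        unit = m.group(1)
        results.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return results


if __name__ == "__main__":
    print(find_ssrs("GGATATATATCC"))
```

Note that with these settings a long run of a single base is reported as a dinucleotide repeat (for example unit "AA"); dedicated tools treat mononucleotide runs and imperfect repeats separately.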


In silico SNP/indel identification

Source: Genes, Genomes and Genomics, Special Issue: Tree and Forest Genetics (2010)

Tools

- AutoSNP
- QualitySNP
- HaploSNPer
- MAVIANT
- PolyBayes
- SNiPpER
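In silico SNP discovery from redundant sequences (for example clustered ESTs) broadly comes down to inspecting each column of an alignment for well-supported alternative bases. The sketch below assumes the reads are already aligned to equal length with "-" for gaps, and applies a simple redundancy rule (each allele seen at least twice) to filter likely sequencing errors. The function name and the threshold are assumptions, and the listed tools use more elaborate criteria.

```python
from collections import Counter


def call_snps(aligned_reads: list, min_allele_count: int = 2) -> list:
    """Return (column, allele_counts) for columns with two or more supported alleles."""
    snps = []
    for col, column in enumerate(zip(*aligned_reads)):
        # Count only unambiguous bases; gaps and N are ignored.
        counts = Counter(base for base in column if base in "ACGT")
        supported = [b for b, n in counts.items() if n >= min_allele_count]
        if len(supported) >= 2:
            snps.append((col, dict(counts)))
    return snps


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ACTTA", "ACTTA", "ACGTC"]
    print(call_snps(reads))
```

In this example column 2 is reported (G seen three times, T twice), while the single C in the last column is treated as a probable sequencing error.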


Source: Asia Pac. J. Mol. Biol. Biotechnol., Vol. 15 (3), 2007

Database

miRBase: http://www.mirbase.org/

Tools

MiRAlign: http://bioinfo.au.tsinghua.edu.cn/miralign/
miRanda: microRNA Target Detection
Miracle: http://miracle.igib.res.in/miracle/
RegRNA: http://regrna.mbc.nctu.edu.tw/
miRTar: http://mirtar.mbc.nctu.edu.tw/
miRU: Plant microRNA Potential Target Finder
miRseek: http://220.227.138.213/mirnablast/mirnablast.php


Can we predict the function of protein
molecules from their sequence?


sequence > structure > function



Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)
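One of the simplest sequence-based predictions mentioned above, membrane-spanning segments, can be sketched with a sliding-window hydropathy profile. The values are the Kyte-Doolittle hydropathy scale; the window of 19 residues and the cutoff of 1.6 are commonly quoted settings for transmembrane segments and are used here as assumptions, as are the function names.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter code).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list:
    """Return the mean hydropathy of every window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_membrane_windows(protein: str, window: int = 19,
                               threshold: float = 1.6) -> list:
    """Return start positions of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(protein, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


if __name__ == "__main__":
    toy = "DDDD" + "L" * 19 + "KKKK"
    print(candidate_membrane_windows(toy))
```

A hydropathy plot is only a first screen; it says nothing about helix orientation or signal peptides, which dedicated predictors handle.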



Genomics 89 (2007) 36-43

COMBINED BIOINFORMATICS AND
CHEMOINFORMATICS WORKFLOW.

1. Sequence assembly.
2. Identification of target proteins.
3. BLASTp search against PDB to find homologous protein structures, to be used as templates (red) for protein homology modeling experiments.
4. Protein model structures (blue) can in turn be employed for docking. For docking experiments, ligand structures have to be converted into their 3D form.


Mapping: Identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping

Sequencing: Assembling clone sequence reads into large (eventually complete) genome sequences

Gene discovery: Identifying coding regions in genomic DNA by database searching and other methods

Function assignment: Using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene

Data mining: Searching for relationships and correlations in the information

Genome comparison: Comparing different complete genomes to infer evolutionary history and genome rearrangements

Development of automated sequencing techniques

Joining the sequences of smaller fragments

Prediction of promoters and protein coding regions
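Joining the sequences of smaller fragments can be sketched as a greedy overlap merge: repeatedly find the pair of reads with the longest suffix-prefix overlap and fuse them. This is a toy illustration under strong assumptions (error-free reads, one strand, no repeats) and is not how production assemblers such as Arachne or AMOS work; the function names and the minimum overlap of 3 are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Merge reads greedily by longest overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; return the remaining contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"]))
```

Greedy merging can collapse repeated regions incorrectly, which is one reason real assemblers use overlap or de Bruijn graphs instead.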

Global modeling of chemical reactions in microbial cells

Network of gene-groups connected through the reactions catalyzed by enzymes embedded in the gene-groups

Identifies the enzyme function of new genes by comparing with that of evolutionarily close genomes

To identify transcription factors for protein-DNA interactions there are four major approaches:

- Microarray analysis of gene expression
- Statistical analysis of promoter regions of orthologous genes
- Global analysis of frequency patterns of dimers in the intergenic region
- Biochemical modeling at the atomic level
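The dimer frequency analysis named among these approaches starts from a simple count of overlapping dinucleotides in a region. The sketch below shows only that counting step; how the frequencies are then compared across intergenic regions is not described in the slides, and the function name is an assumption.

```python
from collections import Counter


def dimer_frequencies(dna: str) -> dict:
    """Return the relative frequency of each overlapping dinucleotide."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 2] for i in range(len(dna) - 1))
    total = sum(counts.values())
    return {dimer: n / total for dimer, n in counts.items()}


if __name__ == "__main__":
    print(dimer_frequencies("ATATGC"))
```

Comparing such profiles between regions, or against the frequencies expected from single-base composition, is one way over- or under-represented patterns are flagged.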


- We have only touched small parts of the elephant
- Trial and error (intelligently) is often your best tool
- Keep up with the main databases, and you'll have a pretty good idea of what is happening and available