Introduction to bioinformatics


Sylvia B. Nagl

What is bioinformatics?


an emerging interdisciplinary research area

deals with the computational management and analysis of biological information: genes, genomes, proteins, cells, ecological systems, medical information, robots, artificial intelligence...

Relationships between sequence, 3D structure, and protein functions

Properties and evolution of genes, genomes, proteins, metabolic pathways in cells

Use of this knowledge for prediction, modelling, and design

The Core of Bioinformatics to date

TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV

“The holy grail of bioinformatics”

GCTCCTCACTGTCTGTGTTTATTC
TTTTAGCTTCTTCAGATCTTTTAG
TCTGAGGAAGCCTGGCATGTGCA
AATGAAGTTAACCTAA
...


> 500,000 genes sequenced to date

Expected number of unique protein structures: ~700-1,000

Basic concepts


conceptual foundations of bioinformatics:

evolution

protein folding

protein function

bioinformatics builds mathematical models of these processes to infer relationships between components of complex biological systems














Information processing in cells



[Diagram: nucleic acids (coding regions, regulatory sites); transcripts; proteins]

One-to-many mappings!

Context-dependence!

Global cell state

Genome activation patterns: transcriptomics

Protein population: proteomics

Organisation (cells, molecular complexes): tissue imaging, EM, X-ray, NMR

Global approaches: Toward a new Systems Biology


How does the spatial and temporal organisation of living matter give rise to biological processes?

[Diagram: Genome; Living cell; “Virtual cell”; Perturbation; Dynamic response; Biological knowledge (computerised); Sequence information; Structural information; Basic principles; Practical applications]

Global approaches: Toward a new Systems Biology


[Diagram: Bioinformatics; Mathematical modelling; Simulation; External environment; Internal environment; Metabolic net; Genetic networks; DNA; hRNA; mRNAs; proteins]

We do not know yet whether the information in the genome is sufficient to reconstruct an entire biological system.

Information on building blocks is not enough; information on their interactions is essential.

Bioinformatics in context




Genomics

Molecular evolution

Biophysics

Molecular biology

Ethical, legal, and social implications

Bioinformatics

Mathematics/computer science

Current challenges to users


Potential hurdles:

Methods are in flux and not fully developed

scattered and heterogeneous resources

Remedies: Web resources

navigation guides

integration of tools and databanks


http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html

Example 1


Sequence homology search of the genome of Plasmodium falciparum

Target identification for antimalarial drugs

The search for new antimalarial drugs

Malaria is one of the leading causes of morbidity and mortality in the tropics.

300 to 500 million estimated clinical cases and 1.5 million to 2.7 million deaths per year.

Nearly all fatal cases are caused by Plasmodium falciparum.

The parasite's resistance to conventional antimalarial drugs such as chloroquine is growing at an alarming rate.


P. falciparum has a plastid-like organelle, called the apicoplast, acquired by endosymbiosis of an alga.

Self-replicating, maternally inherited (35 kb, circular DNA).

Comparative genome analysis: Search for orthologs.

Apicoplast contains enzymes found in plant and bacterial, but not animal, metabolic pathways.

Potential target for antimalarial drugs: DOXP reductoisomerase

Jomaa et al. (1999) Science 285: 1573-1576
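
The ortholog search mentioned on this slide is often approximated by a reciprocal-best-hit rule: two genes are candidate orthologs if each is the other's top-scoring match. The sketch below assumes the similarity scores have already been produced by some homology search tool. The gene names and scores are hypothetical and are not taken from Jomaa et al.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical similarity scores (higher means more similar).
parasite_vs_plant = {
    ("pf_gene1", "plant_dxr"): 310.0,
    ("pf_gene1", "plant_x"): 45.0,
    ("pf_gene2", "plant_x"): 80.0,
}
plant_vs_parasite = {
    ("plant_dxr", "pf_gene1"): 305.0,
    ("plant_x", "pf_gene1"): 50.0,
    ("plant_x", "pf_gene2"): 48.0,
}

candidates = reciprocal_best_hits(parasite_vs_plant, plant_vs_parasite)
print(candidates)
```

Reciprocal best hits are only a heuristic for orthology. A candidate would still need checking against animal genomes to confirm that no close counterpart exists there.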

Biological databases

In 1995, the number of genes in the database started to exceed the number of papers on molecular biology and genetics in the literature! (Boguski, 1999)

The challenge

Data types

primary data: sequence (DNA, e.g. AATGCGTATAGGC; amino acid, e.g. DMPVERILEALAVE), held in a primary database

secondary data: secondary protein structure (e.g., alpha-helices, beta-strands) and “motifs” (regular expressions, blocks, profiles, fingerprints), held in a secondary db

tertiary data: tertiary protein structure (domains, folding units; atomic co-ordinates), held in a tertiary db
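
To make the "regular expressions" kind of motif concrete, here is a minimal sketch that scans a protein sequence for the N-glycosylation site pattern N-{P}-[ST]-{P}, written as a Python regular expression. The test sequence is hypothetical, and the function name `find_motif` is an assumption for illustration.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead lets overlapping sites be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for every motif hit."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical protein fragment, used only to exercise the pattern.
hits = find_motif("MKNGSALNPSTNWTQ")
print(hits)
```

A regular expression either matches or does not. Blocks, profiles, and fingerprints instead score how well a sequence fits a family, which is why secondary databases offer several motif representations.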

Primary biological databases


Nucleic acid

EMBL

GenBank

DDBJ (DNA Data Bank of Japan)

Protein

PIR

MIPS

SWISS-PROT

TrEMBL

NRL-3D

International nucleotide data banks

EMBL (Europe): EMBL, EBI

GenBank (USA): NLM, NCBI

DDBJ (Japan): NIG, CIB

International Advisory Meeting

Collaborative Meeting

TrEMBL

NRDB

GenBank file format

Swiss-Prot

SWISS-PROT file format

Other primary protein databases


TrEMBL (translated EMBL) in SWISS-PROT format

rapid access to sequence data from genome projects

computer-annotated supplement to SWISS-PROT

translations of all coding sequences (CDS) in EMBL

SP-TrEMBL

REM-TrEMBL: immunoglobulins, T-cell receptors, short fragments, synthetic and patented sequences
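
The idea of translating a coding sequence into a protein sequence can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and a CDS that starts in frame. It is not a description of how TrEMBL entries are actually produced, and the function name `translate_cds` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate an in-frame CDS up to the first stop codon.

    Codons containing unknown bases become 'X'; a trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTGGTAA"))
```

Real annotation also has to choose the right genetic code for the organism or organelle and to join exons, which this sketch leaves out.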





Other primary protein databases


The Protein Information Resource (PIR)

integrated system of protein sequence databases and derived related databases, e.g., alignment databases

rapid searching, comparison, and pattern matching of protein sequences

retrieval of descriptive, bibliographic, feature, and concurrent cross-reference information

aims to be comprehensive and consistently annotated


PIR: related databases


NRL-3D Sequence-Structure Database

produced by PIR from sequence and annotation information extracted from three-dimensional structures in the Protein Data Bank (PDB)

allows keyword and similarity searches
















PIR: related databases


PATCHX integrated with PIR

a non-redundant database of protein sequences produced by MIPS, the European branch of PIR-International

The PIR Protein Sequence Database and PATCHX together provide the most complete collection of protein sequence data currently available in the public domain.

Composite protein sequence dbs

Composite database: source databases (SP = SWISS-PROT)

NRDB: PIR, SP, PDB, GenPept

OWL: PIR, SP, GenBank, NRL-3D

MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP

SP+TrEMBL: SP, TrEMBL
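
A composite database merges several source databases and removes redundant entries. The sketch below shows one simple way to do that: sources are read in a priority order, and an entry is kept only if its exact sequence has not already been seen. The exact-identity criterion, the priority order, and all entry identifiers and sequences are assumptions for illustration. The composites named above each apply their own redundancy rules.

```python
def build_composite(sources):
    """Merge (db_name, {entry_id: sequence}) pairs given in priority order.

    Returns (db_name, entry_id, sequence) tuples, keeping only the first
    occurrence of each sequence (compared case-insensitively).
    """
    seen = set()
    composite = []
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key not in seen:
                seen.add(key)
                composite.append((db_name, entry_id, key))
    return composite


# Hypothetical toy entries; higher-priority sources come first.
sources = [
    ("SP", {"P1": "MKT", "P2": "MAW"}),
    ("PIR", {"A1": "MAW", "A2": "MGG"}),
    ("GenPept", {"G1": "mkt", "G2": "MLL"}),
]

composite = build_composite(sources)
for record in composite:
    print(record)
```

Putting the best-annotated source first means that, when two databases hold the same sequence, the composite keeps the better-annotated entry.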



OWL composite database

OWL only released every 6-8 weeks

By accession number

By database code

By text

By sequence

By title

By author

By query language

By regular expression

Direct OWL access: OWL Blast server

Two other useful sites

INFOBIOGEN - The Public Catalog of Databases

http://www.infobiogen.fr/services/dbcat/

KEGG - Kyoto Encyclopedia of Genes and Genomes

http://www.genome.ad.jp/kegg/

Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects.

Sequence Retrieval System (SRS)

Database browser that allows users to retrieve, link, and access entries from all interconnected resources.

Users can formulate queries across a range of different database types.


Guide to Protein Databases:

http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html

http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html



With thanks to Dr Roman Laskowski.