ppt - Bioinformatics

lowlytoolbox, Biotechnology

22 Oct 2013

Chapter 1

Introduction

What is bioinformatics?

Quantitation is essential in biology

Counting bacterial colonies

Counting animals in a natural environment

Counting genetic variability among plants and fruit flies led to the laws of Mendelian
inheritance

More complex quantitative tools involve predictions of human population growth or
enzyme kinetics

Very sophisticated tools may involve application of “game theory” to model behavior
and evolution

Non-linear partial differential equations to model cardiac blood flow or in situ
cytoplasm flow

None of these examples are bioinformatics

Bioinformatics relates to macromolecules

Earliest bioinformatics exercise: Margaret Dayhoff (1965) compiled the first protein
sequence database, the Atlas of Protein Sequence and Structure (now PIR)

Early 1970s: Brookhaven National Laboratory established the Protein Data Bank (PDB),
an archive of X-ray and, later, NMR structures

First sequence alignment algorithm: Needleman and Wunsch, 1970

Routine sequence comparisons and database searching

First protein structure prediction algorithm: Chou and Fasman, 1974

The 1980s saw the establishment of GenBank and the development of FASTA, followed by BLAST in 1990

Human Genome Project planned in the late 1980s and formally launched in 1990

Bioinformatics flourished and grew mainly because of the enormous volumes
of sequence data


Definition


Bioinformatics is the discipline that uses computers to store,
retrieve, manipulate and distribute information related to
biological macromolecules such as RNA, DNA and proteins


Computational biology encompasses all areas of biology that
involve computation


Goal


Better understand a living cell and how it functions at a
molecular level


Two major fields


1. Development of computational tools and databases


Software for sequence analysis


Sequence alignment, sequence database searching,
motif and pattern discovery, gene and promoter finding,
reconstruction of evolutionary relationships, genome
assembly and comparison


Software for structural analysis


Protein and nucleic acid structural analysis, comparison,
classification and prediction


Software for functional analysis


Gene expression profiling, protein-protein interaction
prediction, protein sub-cellular location prediction,
metabolic pathway reconstruction


Construction and curation of biological databases

2. Generation of biological knowledge to better understand living
systems


This often identifies new problems that require new software to
analyze


Bioinformatics is essential for basic genomic and molecular biology research


Major impact in biotechnology and biomedical sciences


Knowledge-based drug design


3D structure allows design of ligands that fit


Reduces time and cost to develop drugs


Forensic DNA analysis


Bayesian statistics and likelihood-based methods


Personalised healthcare


Agricultural biotechnology


Plant genome databases


Gene expression profiles


New crop varieties


Limitations of bioinformatics



The results are as good as the data


Errors in sequences


Hypothesis independent


Bioinformatics does not replace traditional hypothesis-driven approaches


It complements them and identifies new questions


Integrate gene expression and protein functions in the
cell


Analysis at the level of systems: systems biology


Description of a cell as a mathematical model


Predictive value


Chapter 2

Biological Databases


A database is a computerized archive used to store and
organize data so that information can be retrieved by a variety
of search criteria


A database can be thought of as a stack of record cards, where
each record card contains defined items of information, say
Name, Address, Phone Number, Birth Date, etc.


In a database, each such card is an entry, and each defined
item of information is a field


Each field of each entry contains a value (can be NULL)


Searching all entries retrieves the entries that contain a specific value
in a field


This process is called making a query


Biological databases often have higher level requirements
such as knowledge discovery, where previously unknown
relations between values are found
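The record-card picture above can be sketched in a few lines of Python. This is only an illustration: the names come from the student example used later in this chapter, the phone numbers are made up, and `query` is a hypothetical helper rather than part of any database system.

```python
# Each entry is one record card; each dictionary key is a field.
# None plays the role of a NULL value. Phone numbers are invented.
entries = [
    {"name": "Jack", "state": "Kansas", "phone": None},
    {"name": "John", "state": "Maryland", "phone": "555-0102"},
    {"name": "Jill", "state": "Maryland", "phone": "555-0103"},
]


def query(entries, field, value):
    """Scan every entry and return those whose field holds the given value."""
    return [entry for entry in entries if entry.get(field) == value]


# Making a query: which entries have "Maryland" in the state field?
maryland_names = [entry["name"] for entry in query(entries, "state", "Maryland")]
print(maryland_names)
```

Note that the query has to look at every entry, which is the same cost a flat file imposes.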


What is a database?

Different database formats


Flat file


ASCII file


Rows of comma delimited entries


The computer has to read the entire file to find all entries or
relationships


Many databases are distributed as flat files


Below is a simple ASCII data file from REBASE, a database
of restriction enzyme cleavage sites
(http://rebase.neb.com/rebase/rebase.html)


REBASE version 807 strider.807

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
REBASE, The Restriction Enzyme Database http://rebase.neb.com
Copyright (c) Dr. Richard J. Roberts, 2008. All rights reserved.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Rich Roberts Jun 30 2008



#AarI,cacctgc,4,8,

AatII,gacgt/c,

AbsI,cc/tcgagg,

AccI,gt/mkac,

Acc65I,g/gtacc,

#AceIII,cagctc,7,11,

#AciI,ccgc,-3,-1,

AclI,aa/cgtt,

#AcuI,ctgaag,16,14,

AfeI,agc/gct,

AflII,c/ttaag,
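Reading a flat file like the one above might look roughly like the following Python sketch. It assumes that "/" marks the cleavage position inside the recognition sequence and that trailing numbers are cut offsets. The excerpt does not say what the leading "#" means, so the sketch only records it as a flag. The function names are hypothetical.

```python
# A few entries copied from the REBASE excerpt above.
SAMPLE = """\
#AarI,cacctgc,4,8,
AatII,gacgt/c,
AbsI,cc/tcgagg,
#AciI,ccgc,-3,-1,
AflII,c/ttaag,
"""


def parse_line(line):
    """Split one comma-delimited entry into named fields."""
    flagged = line.startswith("#")  # meaning of '#' is not stated in the excerpt
    fields = [f for f in line.lstrip("#").strip().split(",") if f]
    name, site = fields[0], fields[1]
    return {
        "name": name,
        "site": site.replace("/", ""),
        # Assumption: '/' marks where the enzyme cuts within the site.
        "cut_index": site.index("/") if "/" in site else None,
        # Assumption: any remaining fields are integer cut offsets.
        "offsets": [int(f) for f in fields[2:]],
        "flagged": flagged,
    }


def find_by_site(text, site):
    """Read every line of the flat file and keep entries with a matching site."""
    matches = []
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = parse_line(line)
        if entry["site"] == site:
            matches.append(entry)
    return matches


entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
print(find_by_site(SAMPLE, "cttaag"))
```

The search walks the whole file for every lookup, which illustrates why flat files are simple to distribute but slow to query.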

Relational database


The name “relational database” does not refer to relations
between entries


Relation is the mathematical term for “table”


Thus a relational database is composed of tables


Each table is composed of rows (entries, or tuples) and each
row has columns (attributes) with a value in each cell


Where multiple tables share a common column, it is
possible to relate rows in different tables by combining
the rows that have identical values in that column

Student Number   Name   State
1                Jack   Kansas
2                John   Maryland
3                Jill   Washington

Rows are entries (tuples); columns are attributes

Student Number   Name   Gender   State
1                Jack   M        Kansas
2                John   M        Maryland
3                Jill   F        Maryland

Student Number   Course
1                BOC314
2                BOC334
3                BOC364

Course   Description
BOC314   Biochemistry
BOC334   Proteomics
BOC364   Bioinformatics

Query: What courses do students from Maryland take?

Query: Do females take more courses in the first or second semester?

A simple three-table relational database
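The first query can be tried with Python's built-in sqlite3 module. This is a minimal sketch: the table and column names (student, enrolment, course) are chosen here for illustration and are not given on the slide; the data are the three tables of the example.

```python
import sqlite3

# Build the three example tables in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_number INTEGER PRIMARY KEY,
                      name TEXT, gender TEXT, state TEXT);
CREATE TABLE enrolment (student_number INTEGER, course TEXT);
CREATE TABLE course (course TEXT PRIMARY KEY, description TEXT);

INSERT INTO student VALUES (1, 'Jack', 'M', 'Kansas'),
                           (2, 'John', 'M', 'Maryland'),
                           (3, 'Jill', 'F', 'Maryland');
INSERT INTO enrolment VALUES (1, 'BOC314'), (2, 'BOC334'), (3, 'BOC364');
INSERT INTO course VALUES ('BOC314', 'Biochemistry'),
                          ('BOC334', 'Proteomics'),
                          ('BOC364', 'Bioinformatics');
""")

# Query: what courses do students from Maryland take?
# The tables are combined on their shared columns.
rows = conn.execute("""
SELECT student.name, course.course, course.description
FROM student
JOIN enrolment ON enrolment.student_number = student.student_number
JOIN course ON course.course = enrolment.course
WHERE student.state = ?
ORDER BY student.student_number
""", ("Maryland",)).fetchall()

for row in rows:
    print(row)
```

The second query cannot be answered from these three tables as shown, because none of them has a semester column.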

Object oriented databases


Attributes of entries are represented as members of classes


Each member can be a member of more than one class


This gives rise to a hierarchical relationship, very much like a tree


Parent objects point to child objects, which, in turn, point to their
child objects


Thus, all students from Maryland will be pointed to by the Maryland
object


All students who do BOC364 will be pointed to by the BOC364
object


Great care must be taken when designing an object-oriented
database to ensure efficient querying
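The pointer idea can be illustrated with plain Python objects. This is a loose sketch, not a real object-oriented database: the `Student` and `Group` classes are hypothetical, and the memberships follow the student example above.

```python
from dataclasses import dataclass, field


@dataclass
class Student:
    name: str


@dataclass
class Group:
    """A parent object (a state or a course) that points to child objects."""
    label: str
    members: list = field(default_factory=list)


jack, john, jill = Student("Jack"), Student("John"), Student("Jill")

# The same Student object can be pointed to by more than one parent.
maryland = Group("Maryland", [john, jill])
boc364 = Group("BOC364", [jill])

# Follow the pointers from the Maryland object to its students.
maryland_names = [s.name for s in maryland.members]

# Students reachable from both parents (compared by object identity).
in_both = [s.name for s in maryland.members
           if any(s is m for m in boc364.members)]

print(maryland_names, in_both)
```

Queries that follow existing pointers are cheap, while a query the structure was not designed for (for example, all students by name) would need a full traversal, which is why the design needs care.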

Biological databases


Primary databases


Raw sequence or structure data


GenBank


PDB


Secondary databases


Computationally processed or curated database


SWISS-PROT


PIR


Specialized databases


For specific interest groups


FlyBase


SGD

Primary Databases

Three major databases

GenBank (http://www.ncbi.nlm.nih.gov/Genbank/)

EMBL

DDBJ

Sequences are exchanged on a daily basis

Each database is up to date (use any one)

Deposition of data is a prerequisite for publication

Secondary databases


Significant processing of original raw data


Annotation


ORFs


Functional links


SWISS-PROT


Carefully curated database


High quality


SWISS-PROT, TrEMBL and PIR combined in UniProt


Pfam: aligned protein sequences that define families


BLOCKS: motifs and patterns


DALI: protein structure comparisons to find evolutionary relationships


Specialized Databases


Often focused on a specific aspect of an organism


Curated by experts


Highly annotated and processed data


SGD


FlyBase


WormBase

Interconnection between biological databases


Need to access both primary and secondary database


Provide links between databases


Difficult to connect databases with different structures:
ASCII, Relational and Object-oriented


Common Object Request Broker Architecture (CORBA)


eXtensible Markup Language (XML)

Information retrieval

Entrez (pronounced “ahn-tray”)

Gateway that allows text-based searches of a wide variety of data




Using “Limits” in Entrez

Preview/Index

History

Clipboard

Online Mendelian Inheritance in Man

PubMed

GenBank file format

GenBank file format continued

FASTA format


The first line starts with a “>” sign, followed by any descriptive information


The sequence follows, typically with 60 or 80 characters per line
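The two rules above are enough to read and write FASTA text. The following Python sketch is a minimal reader and writer under those rules. The function names and the example record are hypothetical, and real files may need extra handling (for example comments or very large inputs).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined back into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with a ">" header line and fixed-width sequence lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Made-up record: a 160-character sequence wrapped at 60 characters per line.
example = format_fasta("seq1 hypothetical example", "ACGT" * 40, width=60)
records = parse_fasta(example)
print(example)
```

Writing and then re-reading the record returns the original header and the unwrapped sequence, so the line width affects only the layout of the file.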

Abstract Syntax Notation One (ASN.1)

Sequence retrieval system (SRS) (http://srs6.ebi.ac.uk/)

Result of SRS search