Essential Bioinformatics and Biocomputing (LSM2104 ... - BIDD

Oct 2, 2013 (3 years and 10 months ago)

Essential Bioinformatics and Biocomputing (LSM2104: Section I)



Biological Databases and Bioinformatics Software


Prof. Chen Yu Zong


Tel: 6874-6877

Email: csccyz@nus.edu.sg

http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, NUS

January 2003

Essential Bioinformatics and Biocomputing (LSM2104: Section I)


Four lectures


Part 1: Biological databases:


Lecture 2. Biological information and databases

Lecture 3. More databases, retrieval systems, and database searching


Part 2: Software:


Lecture 4. Examples of the applications of bioinformatics software and basic principles

Lecture 5. Overview of bioinformatics software


Essential Bioinformatics and
Biocomputing (LSM2104), NUS


Part 1: Biological databases

Part 1 outline:


1. Biological information and databases
   Overview and definition, types of biological databases

2. Popular databases, records, data format
   GenBank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed

3. Accessing biological databases, retrieval systems
   Entrez, SRS

4. Searching biological databases
   Data quality, coverage, redundancy, errors


Textbook: T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics.

Biological databases: chapters 3 and 4


Biological Information

Cancer as an example:

Genes: growth genes, tumor suppressor genes

Proteins: growth factors, enzymes, receptors

Pathways: cell death

Systems: immune system, blood supply

Function: role of proteins, molecular interactions


Biological Information


Nucleic acids:


DNA sequence, genes, gene products (proteins), mutation,
gene coding, distribution patterns, motifs


Genomics: genome, gene structure and expression, genetic
map, genetic disorder


RNA sequence, secondary structure, 3D structure,
interactions


Proteins:


Protein sequence, corresponding gene, secondary structure,
3D structure, function, motifs, homology, interactions


Proteomics: expression profile, proteins in disease processes
etc.


Ligands and drugs (inhibitors, activators, substrates,
metabolites)



Biological Information


Pathways:


Molecular networks, biological chain events,
regulation, feedback, kinetic data


Function:


Binding sites, interactions, molecular action
(binding, chemical reaction, etc.)


Biological effect (signaling, transport, feedback,
regulation, modification, etc.)


Functional relationship, protein families, motifs, and
homologs


Biological databases

Purpose


1. To disseminate biological data and information

2. To provide biological data in computer-readable form

3. To allow analysis of biological data



A database needs to have at minimum a specific tool for
searching and data extraction.


Web pages, books, journal articles, tables, text files, and spreadsheet
files cannot be considered as databases



Reading materials:


Baxevanis AD. The Molecular Biology Database Collection: 2002 update. Nucleic Acids Res. 2002 Jan 1;30(1):1-12.




Biological databases

Lists of biological databases



INFOBIOGEN Catalog of Databases
http://www.infobiogen.fr/services/dbcat/



Nucleic Acids Research Database Listing


http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1



These listings serve as starting points for locating biological databases.


More than 500 databases have been catalogued to date, and those in the
two listings satisfy minimal criteria for content, access, and quality.



Other sites can also serve as a starting point.




Biological databases


INFOBIOGEN Catalog of Databases




Type of database      No. of records

DNA                   87
RNA                   29
Protein               94
Genomic               58
Mapping               29
Protein structure     18
Literature            43
Miscellaneous         153

Total                 511









Biological databases - in Nucleic Acids Research

Type of database                              No. of records

Major Sequence Repositories                   7
Comparative Genomics                          7
Gene Expression                               20
Gene Identification and Structure             30
Genetic and Physical Maps                     10
Genomic Databases                             48
Intermolecular Interactions                   5
Metabolic Pathways and Cellular Regulation    12
Mutation Databases                            33
Pathology                                     8
Protein Databases                             50
Protein Sequence Motifs                       18
Proteome Resources                            7
RNA Sequences                                 26
Retrieval Systems and Database Structure      3
Structure                                     32
Transgenics                                   2
Varied Biomedical Content                     18

TOTAL                                         336
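
As a quick consistency check on the listing above, the category counts can be summed and ranked. The sketch below is only illustrative: the dictionary is transcribed from this slide, and the names `nar_counts`, `stated_total`, and `summarize` are introduced here and are not part of the original lecture.

```python
# Category counts transcribed from the Nucleic Acids Research listing above.
nar_counts = {
    "Major Sequence Repositories": 7,
    "Comparative Genomics": 7,
    "Gene Expression": 20,
    "Gene Identification and Structure": 30,
    "Genetic and Physical Maps": 10,
    "Genomic Databases": 48,
    "Intermolecular Interactions": 5,
    "Metabolic Pathways and Cellular Regulation": 12,
    "Mutation Databases": 33,
    "Pathology": 8,
    "Protein Databases": 50,
    "Protein Sequence Motifs": 18,
    "Proteome Resources": 7,
    "RNA Sequences": 26,
    "Retrieval Systems and Database Structure": 3,
    "Structure": 32,
    "Transgenics": 2,
    "Varied Biomedical Content": 18,
}
stated_total = 336  # the TOTAL row on the slide


def summarize(counts, top_n=3):
    """Return the summed count and the top_n largest categories."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return total, ranked[:top_n]


total, largest = summarize(nar_counts)
print(f"computed total: {total} (slide states {stated_total})")
for name, n in largest:
    print(f"{name}: {n} ({100 * n / total:.1f}% of the listing)")
```

The same function can be applied to the INFOBIOGEN counts to compare how the two catalogs distribute their entries.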


Literature databases


PubMed (MEDLINE)

1. It contains entries for more than 11 million
abstracts of scientific publications.


2. It enables users to do keyword searches, provides
links to a selection of full articles, and has text
mining capabilities, e.g. it provides links to related
articles and GenBank entries, among others.


3. Searching PubMed efficiently requires some skill.
For example, searching with the keyword “interleukin”
returns 108,366 matches.



PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)


PubMed Search (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)

Key Word                                No. of Entries

Cancer                                  1.45M
Cancer, Blood supply                    22K
Cancer, Blood supply, Protein           3.9K
Cancer, Blood supply, Enzyme            1.5K
Cancer, Blood supply, Enzyme, Drug      500

Cancer treatment by targeting blood supply:

Cancer growth depends on blood supply (why?) and thus requires
the growth of new blood vessels: angiogenesis

Proteins involved in angiogenesis may be potential anticancer targets

You can find some of these targets by searching PubMed

Key word “cancer angiogenesis enzyme drug” produces 856 entries
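
The narrowing effect of adding keywords can also be reproduced programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` interface, which is not mentioned in the lecture (the slides use the Entrez web form). It only builds the request URLs and does not contact the server; the helper names `build_pubmed_query` and `build_esearch_url` are hypothetical. Counts returned today will differ from the 2003 figures in the table.

```python
from urllib.parse import urlencode

# NCBI E-utilities "esearch" endpoint. Assumption of this sketch: the lecture
# itself used the Entrez web form, not this programmatic interface.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords):
    """Join keywords with AND so that each added term narrows the result set."""
    return " AND ".join(k.strip() for k in keywords if k.strip())


def build_esearch_url(keywords, retmax=0):
    """Build a URL asking PubMed how many records match all the keywords."""
    params = {
        "db": "pubmed",
        "term": build_pubmed_query(keywords),
        "retmax": retmax,  # 0: no record IDs requested, the hit count is enough
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The keyword combinations from the table above, from broad to narrow.
steps = [
    ["cancer"],
    ["cancer", "blood supply"],
    ["cancer", "blood supply", "enzyme"],
    ["cancer", "blood supply", "enzyme", "drug"],
]
for keywords in steps:
    print(build_esearch_url(keywords))
```

Fetching one of the printed URLs should return a small XML document whose count element gives the number of matching entries.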


Nucleic Acids databases

What information is in these databases:


DNA sequence, genes, gene products (proteins),
mutation, gene coding, distribution patterns, motifs


Genomics: genome, gene structure and expression,
genetic map, genetic disorder


RNA sequence, secondary structure, 3D structure,
interactions


Nucleic Acids databases

DNA databases



GenBank, EMBL, DDBJ


1. General purpose databases focusing on DNA sequences
and their properties


2. GenBank, EMBL-Bank, and DDBJ exchange data to
ensure comprehensive worldwide coverage, and
accession numbers are managed consistently among the
three centers.


Reading materials: Textbook, chapter 4


DNA databases


GenBank database (http://www.ncbi.nih.gov/Genbank/)

Contains publicly available DNA sequences from more than
100,000 organisms.

Also contains derived protein sequences, and annotations
describing biological, structural, and other relevant features.

Accessible through Entrez, NCBI’s integrated retrieval system
(studied later)

Sequence similarity search tools: BLAST (studied later)


EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/)

Contains nucleotide sequences collected from all public sources.

Accessible through Sequence Retrieval System (SRS) which
allows keyword searching (studied later)

Sequence similarity search tools: Blitz, Fasta, and BLAST
(studied later)


DNA databases: GenBank Web page




DNA databases


An example from the GenBank flat file: Human Alpha-Lactalbumin gene

This protein is a complex of 2 proteins, A and B. In the absence of the
B protein, the enzyme catalyzes the transfer of galactose from
UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).



A GenBank entry - HEADER


GenBank Entry - Links provided in the Header

MapViewer: find the gene position in chromosome

Related Sequences: other entries related to this gene (or sequence)

OMIM: link to catalog of human genes and genetic disorders

Protein: retrieve protein record from GenPept

Medline and PubMed: literature abstracts related to this gene

Taxonomy: classification of organisms

UniGene: unified gene data

UniSTS: unified sequence tagged sites, marker and mapping data

LinkOut: links to publishers, aggregators, libraries, biological databases,
sequence centers, and other Web resources

REFSEQ: reference sequence standards

Note: These links are representative. Other links may also be found in GenBank
entries.


GenBank entry - FEATURES



GenBank Entry - Links provided in the Feature section

LocusID: locus and display of genomic and mRNA sequences

MIM: link to OMIM description, other entries for this sequence

EC_number: link to the corresponding cataloged enzymes

Protein_id: retrieve protein record from GenPept

CD: conserved protein domain (SMART)

CDD: conserved protein domain (Pfam)


Biological databases: GenBank - SEQUENCE


GenBank - NOTES

The majority of GenBank entries have a form similar to our example.

When accessing the database, note the following:

Some entries are huge, containing as many as 30,000 lines
(NT_021877 Homo sapiens chromosome 1 working draft sequence segment).

Some entries have contig information instead of sequence information
(NT_021877 Homo sapiens chromosome 1 working draft sequence segment).

Some entries are derived from cDNA sequences and thus represent putative
genes/proteins. These should be used with caution
(AK007430. Mus musculus 10 d...[gi:12840976]).

Some annotations are predicted using automated analysis. These should also be
used with caution
(XM_131483 Mus musculus simi...[gi:20832685]).
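
The distinction between entries that carry sequence and entries that carry only contig information can be checked directly on the flat file. The sketch below is a simplified illustration, not a full GenBank parser. It relies on the flat-file convention that top-level keywords (LOCUS, DEFINITION, ACCESSION, ORIGIN, CONTIG) start in the first column and continuation lines are indented. The two sample records, their accessions (XX000001 to XX000004), and the function names are made up for this example.

```python
def parse_flatfile_fields(text):
    """Collect the top-level keyword fields of one GenBank-style record.

    Simplifications: indented lines (including sub-keywords and sequence
    lines) are appended to the preceding top-level keyword, and a repeated
    keyword overwrites the earlier value.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-record marker
        if not line[0].isspace():
            # A new top-level keyword such as LOCUS or ORIGIN.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:
            # Indented line: continuation of the current keyword.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


def sequence_kind(fields):
    """Report whether a record holds sequence data or only contig information."""
    if fields.get("ORIGIN"):
        return "sequence"
    if "CONTIG" in fields:
        return "contig"
    return "unknown"


# Hypothetical, abbreviated records used only to exercise the functions.
SAMPLE_WITH_SEQUENCE = """\
LOCUS       XX000001     24 bp    DNA     linear   PRI 01-JAN-2003
DEFINITION  Hypothetical example record,
            complete cds.
ACCESSION   XX000001
ORIGIN
        1 atggcgtacg ttagcctagg catg
//
"""

SAMPLE_WITH_CONTIG = """\
LOCUS       XX000002    300 bp    DNA     linear   CON 01-JAN-2003
DEFINITION  Hypothetical working draft segment.
ACCESSION   XX000002
CONTIG      join(XX000003.1:1..100,XX000004.1:1..200)
//
"""

for sample in (SAMPLE_WITH_SEQUENCE, SAMPLE_WITH_CONTIG):
    record = parse_flatfile_fields(sample)
    print(record["ACCESSION"], "->", sequence_kind(record))
```

For real work an established parser would be a better choice; this sketch is only meant to show why a script that expects sequence lines must first check what kind of entry it received.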



GenBank - Statistics

Year    Base Pairs      Sequences

1982    680338          606
1992    101008486       78608
2000    11101066288     10106023
2001    15849921438     14976310

Data size is large and increases fast
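
To put a number on "increases fast", the table above can be turned into average yearly growth factors. This is a rough sketch: the rows are transcribed from the slide, the function names are introduced here for illustration, and a constant growth rate within each interval is an assumption of the calculation rather than a claim from the lecture.

```python
# (year, base pairs, sequences) transcribed from the statistics slide above.
GENBANK_STATS = [
    (1982, 680338, 606),
    (1992, 101008486, 78608),
    (2000, 11101066288, 10106023),
    (2001, 15849921438, 14976310),
]


def annual_growth_factor(start, end):
    """Average yearly multiplication factor of base pairs between two rows."""
    (year0, bp0, _), (year1, bp1, _) = start, end
    return (bp1 / bp0) ** (1 / (year1 - year0))


def mean_entry_length(row):
    """Average number of base pairs per sequence entry in one row."""
    _, base_pairs, sequences = row
    return base_pairs / sequences


for start, end in zip(GENBANK_STATS, GENBANK_STATS[1:]):
    factor = annual_growth_factor(start, end)
    print(f"{start[0]}-{end[0]}: x{factor:.2f} per year")

overall = annual_growth_factor(GENBANK_STATS[0], GENBANK_STATS[-1])
print(f"1982-2001 overall: x{overall:.2f} per year")
print(f"mean entry length in 2001: {mean_entry_length(GENBANK_STATS[-1]):.0f} bp")
```

A yearly factor well above 1 over two decades corresponds to roughly exponential growth, which is why storage and search efficiency matter for these databases.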



Biological Databases

Database Searching


1. Databases must have methods for accessing and extracting the stored data.

2. The most basic search is keyword searching.
   A keyword can be any word that occurs somewhere in the database
   records. It can be the name of the gene or protein (e.g. lactalbumin),
   species (e.g. homo sapiens, human), a taxonomy term (e.g. primates),
   or a word from the reference title (e.g. cancer).

3. Other search keys include the entry ID number and the sequence.

4. Databases typically have hyperlinks that provide access to
   additional information related to the entry from other sources.
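
The two simplest access methods listed above, keyword search and lookup by entry ID, can be sketched over a handful of in-memory records. Everything in this example is hypothetical: the record layout, the IDs (REC0001 to REC0003), the reference titles, and the function names are invented for illustration and do not correspond to any real database schema.

```python
# Hypothetical records; real databases store far richer entries.
RECORDS = [
    {
        "id": "REC0001",
        "name": "alpha-lactalbumin",
        "species": "Homo sapiens",
        "taxonomy": ["Eukaryota", "Mammalia", "Primates"],
        "reference_title": "Example reference title on lactation",
    },
    {
        "id": "REC0002",
        "name": "alpha-lactalbumin",
        "species": "Mus musculus",
        "taxonomy": ["Eukaryota", "Mammalia", "Rodentia"],
        "reference_title": "Example reference title on milk proteins",
    },
    {
        "id": "REC0003",
        "name": "growth factor receptor",
        "species": "Homo sapiens",
        "taxonomy": ["Eukaryota", "Mammalia", "Primates"],
        "reference_title": "Example reference title on cancer",
    },
]


def record_text(record):
    """Flatten every field of a record into one lowercase search string."""
    parts = []
    for value in record.values():
        if isinstance(value, (list, tuple)):
            parts.extend(str(item) for item in value)
        else:
            parts.append(str(value))
    return " ".join(parts).lower()


def keyword_search(records, *keywords):
    """Return the records in which every keyword occurs somewhere (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [r for r in records if all(k in record_text(r) for k in wanted)]


def get_by_id(records, entry_id):
    """Return the record with the given entry ID, or None if it is absent."""
    return next((r for r in records if r["id"] == entry_id), None)


print([r["id"] for r in keyword_search(RECORDS, "lactalbumin")])
print([r["id"] for r in keyword_search(RECORDS, "primates", "cancer")])
print(get_by_id(RECORDS, "REC0002")["species"])
```

Plain substring matching like this treats all fields alike; production retrieval systems such as Entrez and SRS index fields separately so that a query can be restricted to, say, the organism or the title.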


Biological databases:
OMIM


Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/)

The OMIM database contains abstracts and texts describing
genetic disorders to support genomics efforts and clinical
genetics. It provides gene maps and known disorder maps
in tabular listing formats, and it supports keyword search.

Hamosh A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledge base
of human genes and genetic disorders. Nucleic Acids Res. 2002 30: 52-55.


Biological databases: OMIM web-page


Biological databases:
OMIM search engine


Biological databases: OMIM statistics

All Entries: 14088

Established Gene Locus: 10476

Phenotype Descriptions: 1194

Other Entries: 2418


Biological databases

Protein databases

1. SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a
curated database focusing on a high level of annotation
(sequence, function, structure, post-translational
modifications, variants, etc.) of proteins.


2. TrEMBL is a computer-annotated supplement to SWISS-PROT.


Reading materials: Textbook, chapter 3


Protein databases

What is in these databases:


Protein sequence, corresponding gene, secondary
structure, 3D structure, function, motifs, homology,
interactions


Proteomics: expression profile, proteins in disease
processes etc.


Ligands and drugs (inhibitors, activators, substrates,
metabolites)



Protein databases



SWISS-PROT

Notes:

SWISS-PROT provides high-quality annotations and
detailed information about sequence, structural, functional, and
other properties of proteins.

It provides a rich set of links to other sources of
information on SWISS-PROT entries. Unfortunately, some
of the links will not work at all times, because the
Web changes continually.

It also provides a rich set of protein analysis tools.



SWISS-PROT web-page


SWISS-PROT entry P00709


SWISS-PROT entry P00709


SWISS-PROT entry P00709


Biological databases:

Protein structure database: PDB (http://www.pdb.org)

1. More than 18,000 macromolecular structures of proteins, peptides,
viruses, protein/nucleic acid complexes, nucleic acids, and
carbohydrates.

2. Among the oldest databases: the first structure was deposited in 1972.

3. The number of newly deposited structures has been growing steadily
(3298 in 2001, and 1486 from Jan 1 to June 5, 2002).

4. Structures are determined mainly by X-ray diffraction and NMR.

5. It contains tools for keyword search, comprehensive visualization, and
information extraction, such as sequence, geometry, and structural
neighbor details.
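
The two deposition figures cover periods of different length, so comparing them requires annualizing the 2002 figure. The sketch below does this arithmetic under an assumption that is not in the lecture: depositions arrive at a constant daily rate throughout the year. The variable names are introduced here for illustration.

```python
from datetime import date

# Figures quoted on the slide above.
deposits_2001 = 3298
deposits_2002_partial = 1486
period_start = date(2002, 1, 1)
period_end = date(2002, 6, 5)

# Count both the first and the last day of the reporting period.
days_covered = (period_end - period_start).days + 1

# Assumption: a constant daily deposition rate over the whole year.
daily_rate = deposits_2002_partial / days_covered
projected_2002 = daily_rate * 365

print(f"days covered: {days_covered}")
print(f"projected 2002 deposits: {projected_2002:.0f}")
print(f"change versus 2001: {100 * (projected_2002 / deposits_2001 - 1):+.1f}%")
```

Under that assumption the projected full-year figure exceeds the 2001 total, which is consistent with the slide's statement that depositions keep growing.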


Biological databases:
PDB web-page

http://www.rcsb.org/pdb/



Biological databases:
A PDB entry

http://www.rcsb.org/pdb/



Biological databases


PDB statistics


Biological databases

Summary of Today’s lecture


Types of Biological information, data and databases



Simple data retrieval method.



Popular databases: PubMed, GenBank, SwissProt, OMIM, PDB



Statistics:


Large number of publications (MEDLINE: >12M since 1960)


Large amount of data for sequence (DNA: >14M, Protein: > 120K)


Fair amount of data for 3D structure (Protein >14K, Nucleic acid >1K)