Protein Family Classification for Functional Genomics


BIO-TRAC 25 (Proteomics: Principles and Methods)

October 10, 2003


NIH, Bethesda, MD



Zhang-Zhi Hu, M.D.

Senior Bioinformatics Scientist,

Protein Information Resource

National Biomedical Research Foundation, GUMC

Tutorial: Bioinformatics Resources

2

What is Bioinformatics?


NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2002):

Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Bioinformatics is the application of information technology to the analysis, organization and distribution of biological data in order to answer complex biological questions.

3

Bioinformatics Resources

The Molecular Biology Database Collection: An Online Compilation of Relevant Database Resources
2003 update: http://www3.oup.co.uk/nar/database/

Nucleic Acids Research Database Issues (January, annually)
(2003: http://nar.oupjournals.org/content/vol31/issue1/)

DBcat: A Catalog of > 500 Biological Databases
http://www.infobiogen.fr/services/dbcat/


4

Molecular Biology Database Collection
(http://nar.oupjournals.org/cgi/content/full/31/1/1#GKG120TB1)

5

The Molecular Biology Database Collection: 2003 update (Baxevanis, A.D.)

An online resource of 386 key databases in 18 categories

Major sequence repositories
Comparative Genomics
Gene Expression
Gene Identification and Structure
Genetic and Physical Maps
Genomic Databases
Intermolecular Interactions
Metabolic Pathways and Cellular Regulation
Mutation Databases
Pathology
Protein Sequence Motifs
Proteome Resources
Retrieval Systems and Database Structure
RNA Sequences
Structure
Transgenics
Varied Biomedical Content

6

Overview

Protein Sequence Analysis
  I. Sequence Similarity Search and Alignment
  II. Family Classification Methods
  III. Structure Prediction Methods

Molecular Biology Databases
  IV. Protein Family Databases
  V. Databases of Protein Functions
  VI. Databases of Protein Structures

Proteomic Resources
  VII. 2D-gel databases
  VIII. Proteomic analyses

7

I. Sequence Similarity Search

Find a protein sequence: text search

Based on Pair-Wise Comparisons
  BLOSUM scoring matrix
  PAM scoring matrix

Dynamic Programming Algorithms
  Global Similarity: Needleman-Wunsch (GAP/BestFit)
  Local Similarity: Smith-Waterman (SSEARCH)

Heuristic Algorithms (Sequence Database Searching)
  FASTA: Based on K-Tuples (2-Amino Acid)
  BLAST: Triples of Conserved Amino Acids
  Gapped-BLAST: Allow Gaps in Segment Pairs (NREF)
  PHI-BLAST: Pattern-Hit Initiated Search (NCBI)
  PSI-BLAST: Iterative Search (NCBI)
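
The k-tuple idea listed for FASTA above can be illustrated with a small sketch. It indexes every 2-residue word of a query and counts, for each diagonal, how many words a subject sequence shares with it. This covers only the initial word-matching step, not the FASTA program itself; the function names and the example sequences are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def ktuple_index(seq, k=2):
    """Map each k-residue word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples on each diagonal (subject position - query position)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        # Every query position holding the same word contributes one hit.
        for i in index.get(subject[j:j + k], []):
            hits[j - i] += 1
    return dict(hits)


if __name__ == "__main__":
    # Made-up sequences: the subject is the query shifted right by two residues.
    print(diagonal_hits("MKVLAAGIV", "GGMKVLAAG"))
```

For the made-up pair above, all six shared words fall on diagonal 2, so the result is `{2: 6}`. A diagonal with many hits marks a region worth aligning more carefully.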

8

Sequence Search by Text or Unique ID

Entrez (http://www.ncbi.nlm.nih.gov/Entrez/)

(http://pir.georgetown.edu/pirwww/search/textsearch.html)

9

Pair-Wise Comparisons

Scoring matrix

Global and local Similarity: Dynamic Programming (Needleman-Wunsch, Smith-Waterman)

(http://www.ebi.ac.uk/emboss/align/)
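
To make the dynamic-programming step concrete, here is a minimal sketch that computes the best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score. It assumes a flat match/mismatch score and a linear gap penalty in place of a BLOSUM or PAM matrix; those values, and the function name `align_score`, are illustrative assumptions and not what the EMBOSS tools use by default.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False gives the global (Needleman-Wunsch) score,
    local=True gives the local (Smith-Waterman) score.
    The scoring values are placeholders, not a real substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

The only differences between the two modes are the boundary initialization, the floor at zero, and where the final score is read: the bottom-right cell for global alignment, the maximum cell for local alignment.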

10

FASTA Search

(http://www.ebi.ac.uk/fasta33/)

(http://pir.georgetown.edu/pirwww/search/fasta.html)


11

Gapped-BLAST Search

(http://pir.georgetown.edu/pirwww/search/pirnref.shtml)

(http://www.ncbi.nlm.nih.gov/BLAST/)

A BLAST Result

13

PSI-BLAST Iterative Search

(http://www.ncbi.nlm.nih.gov/BLAST/)

14

PSI-BLAST

15

II. Family Classification Methods

Multiple Sequence Alignment and Phylogenetic Analysis
  ClustalW Multiple Sequence Alignment
  Alignment Editor & Phylogenetic Trees

Searches Based on Family Information
  PROSITE Pattern Search
  Motif and Profile Search
  Hidden Markov Models (HMMs)

16

Multiple Sequence Alignment


ClustalW (http://pir.georgetown.edu/pirwww/search/multaln.html)

17

Alignment Editor (Jalview)

(http://www.ebi.ac.uk/clustalw/)

18

Alignment Editor (GeneDoc)

(http://www.psc.edu/biomed/genedoc/)

19

Phylogenetic Analysis

Tree Programs: (http://evolution.genetics.washington.edu/phylip.html)

Tree Searches: (http://pauling.mbu.iisc.ernet.in/~pali/index.html)
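
Distance-based tree programs start from a matrix of pairwise distances between aligned sequences. As a rough sketch of that input step, the code below computes the p-distance (the fraction of differing residues) for every pair in a multiple alignment. The function names are hypothetical, and skipping gapped columns is one simple convention among several.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping sequence name to aligned sequence."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}
```

A matrix like this is the kind of input that neighbor-joining or other distance methods then turn into a tree; tree-building programs usually apply a correction for multiple substitutions first, which this sketch omits.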

20

Phylogenetic Trees (IGFBP Superfamily)


(Radial Tree)


(Phylogram)

21

PROSITE Pattern Search

(http://pir.georgetown.edu/pirwww/search/patmatch.html)
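
A pattern search of this kind can be sketched by translating PROSITE pattern syntax into a regular expression. The example below handles the common elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors). It is a simplified illustration with hypothetical function names, not the matching code used by the PIR or ExPASy servers, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern, searched in a made-up sequence.
    print(find_sites("N-{P}-[ST]-{P}.", "AANGSAA"))
```

With the N-glycosylation pattern `N-{P}-[ST]-{P}`, the translated expression is `N[^P][ST][^P]`, which matches `NGSA` at position 2 of the made-up sequence.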

22

Profile Search

(http://bmerc-www.bu.edu/bioinformatics/profile_request.html)
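
A profile generalizes a pattern by scoring every residue at every position of an aligned family. The sketch below builds a simple log-odds profile from an ungapped alignment and slides it along a sequence. It assumes a uniform background frequency and a fixed pseudocount, both of which are simplifications; the names and the toy alignment are hypothetical and unrelated to the server linked above.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Log-odds score per column and residue, against a uniform background."""
    length = len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the highest-scoring ungapped window.

    Assumes the sequence uses only the 20 standard residues and is at
    least as long as the profile.
    """
    width = len(profile)
    scored = [
        (sum(profile[k][sequence[i + k]] for k in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

For a toy alignment such as `["ACD", "ACD", "ACE"]`, the best window in `GGGACDGGG` starts at position 3, where the conserved residues line up with the profile columns.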

23

Hidden Markov Model Search

(http://www.sanger.ac.uk/Software/Pfam/search.shtml)

(http://smart.embl-heidelberg.de)

24

III. Structural Prediction Methods

Signal Peptide: SIGFIND, SignalP

Transmembrane Helix: TMHMM, TMAP

2D Prediction (α-helix, β-sheets, Coiled-coils): PHD, JPred

3D Modeling: Homology Modeling (Modeller, SWISS-MODEL), Threading, Ab-initio Prediction

25

Structure Prediction: A Guide

(http://speedy.embl-heidelberg.de/gtsp/flowchart2.html)

26

Protein Prediction Server

(http://www.cbs.dtu.dk/services/)

27

Signal Peptide Prediction

(http://www.stepc.gr/~synaptic/sigfind.html)

(http://www.cbs.dtu.dk/services/SignalP-2.0)

28

Transmembrane Helix

(http://www.cbs.dtu.dk/services/TMHMM/)
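
TMHMM itself is based on a hidden Markov model. As a much simpler stand-in that conveys the underlying intuition, the sketch below computes a sliding-window hydropathy average on the Kyte-Doolittle scale and flags windows hydrophobic enough to be candidate membrane-spanning segments. The window length of 19 and the cutoff of 1.6 are commonly quoted values used here as assumptions; the function names and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of each window of the given length."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows whose average hydropathy reaches the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]
```

On an artificial sequence of 19 leucines flanked by aspartates, the flagged windows cluster around the hydrophobic stretch. Real predictors add information about helix caps, loop lengths, and topology that a plain hydropathy average cannot capture.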

29

Protein Structure Prediction

(http://cmgm.stanford.edu/WWW/www_predict.html)

(http://restools.sdsc.edu/biotools/biotools9.html)

30

Structure Prediction Server

(http://cubic.bioc.columbia.edu/predictprotein/)

(http://www.compbio.dundee.ac.uk/WWW_Servers/JPred/jpred.html)

31

3D-Modelling

(http://www.salilab.org/modeller/modeller.html)

(http://www.expasy.ch/swissmod/SWISS-MODEL.html)

32

IV. Protein Family Databases

Whole Proteins
  PIR: Superfamilies and Families
  COG (Clusters of Orthologous Groups) of Complete Genomes
  ProtoNet: Automated Hierarchical Classification of Proteins

Protein Domains
  Pfam: Alignments and HMM Models of Protein Domains
  SMART: Protein Domain Families

Protein Motifs
  PROSITE: Protein Patterns and Profiles
  BLOCKS: Protein Sequence Motifs and Alignments
  PRINTS: Protein Sequence Motifs and Signatures

Integrated Family Databases
  iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
  InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART

33

Protein Clustering

(http://www.ncbi.nlm.nih.gov/COG/)

34

Protein Domains

Pfam (http://www.sanger.ac.uk/Software/Pfam/)

SMART (http://smart.embl-heidelberg.de/smart/show_motifs.pl)

35

Protein Motifs


PROSITE is a database of protein families and domains. It
consists of biologically significant sites, patterns and profiles.

(http://www.expasy.ch/prosite/)

36

Integrated Family Classification

InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)

37

V. Databases of Protein Functions

Metabolic Pathways, Enzymes, and Compounds
  Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
  KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
  LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
  EcoCyc: Encyclopedia of E. coli Genes and Metabolism
  MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
  WIT: Functional Curation and Metabolic Models
  BRENDA: Enzyme Database
  UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
  Klotho: Collection and Categorization of Biological Compounds

Cellular Regulation and Gene Networks
  EpoDB: Genes Expressed during Human Erythropoiesis
  BIND: Descriptions of interactions, molecular complexes and pathways
  DIP: Catalogs experimentally determined interactions between proteins
  RegulonDB: Escherichia coli Pathways and Regulation

38

KEGG Metabolic & Regulatory Pathways

(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00590+874)

KEGG is a suite of databases and associated software that integrates current knowledge of molecular interaction networks, of genes and proteins, and of chemical compounds and reactions.
(http://www.genome.ad.jp/kegg/kegg2.html)

39

BioCyc (EcoCyc/MetaCyc Metabolic Pathways)

The BioCyc Knowledge Library is a collection of Pathway/Genome Databases (http://biocyc.org/)

40

Protein-Protein Interactions: DIP

(http://dip.doe-mbi.ucla.edu/)

41

Protein-Protein Interaction: BIND

(http://www.bind.ca/)

42

BioCarta Cellular Pathways

(http://www.biocarta.com/index.asp)

43

VI. Databases of Protein Structures

Protein Structure and Classification
  PDB: Structures Determined by X-ray Crystallography and NMR
  CATH: Hierarchical Classification of Protein Domain Structures
  SCOP: Familial and Structural Protein Relationships
  FSSP: Protein Fold Family Database

Protein Sequence-Structure Relationship
  PIR-NRL3D: Protein Sequence-Structure Database
  PIR-RESID: Protein Structure/Post-Translational Modifications
  HSSP: Families and Alignments of Structurally-Conserved Regions

44

PDB Structure Data

(http://www.rcsb.org/pdb/)

45

PDBsum: Summary and Analysis

(http://www.biochem.ucl.ac.uk/bsm/pdbsum)

46

Protein Structural Classification

CATH: Hierarchical domain classification of protein structures
(http://www.biochem.ucl.ac.uk/bsm/cath_new/)

47

Protein Structural Classification

(http://scop.mrc-lmb.cam.ac.uk/scop/)

The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the PDB.

48

VII. Proteomic Resources


GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed genomes; SWISS-2DPAGE (http://www.expasy.org/ch2d/)

PEP: Predictions for Entire Proteomes (http://cubic.bioc.columbia.edu/pep/): Summarized analyses of protein sequences

Proteome BioKnowledge Library (http://www.proteome.com): Detailed information on human, mouse and rat proteomes

Proteome Analysis Database (http://www.ebi.ac.uk/proteome/): Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes

Expression Profiling databases: GNF (http://expression.gnf.org/cgi-bin/index.cgi, human and mouse transcriptome), SMD (http://genome-www5.stanford.edu/MicroArray/SMD/, Stanford microarray data analysis), EBI Microarray Informatics (http://www.ebi.ac.uk/microarray/index.html, managing, storing and analyzing microarray data)

49

2D-Gel Image Databases (1)

(http://gelbank.anl.gov/2dgels/index.asp)

50

2D-Gel Image Databases (2)

(http://us.expasy.org/ch2d/2d-index.html)

(http://us.expasy.org/cgi-bin/nice2dpage.pl?P06493)

51

VIII. Proteome Analysis

(http://www.ebi.ac.uk/proteome)

52

Expression Profiling


Human and Mouse Transcriptome


(http://expression.gnf.org/cgi-bin/index.cgi)

(http://genome-www.stanford.edu/serum/)

53

Lab:

Visit selected websites and analyze some protein sequences of your own choice.

A list of the bioinformatics resources in this tutorial is available at:
http://pir.georgetown.edu/~huz/bioinfo_resource.html

Try some of the following sequences for analysis:

1) well characterized proteins: PIR: A26366 (CYP17), JS0747 (Sp1)
2) less characterized proteins: PIR: A59000 (MATER); TrEMBL: Q9QY16 (GRTH)
3) hypothetical proteins: PIR: T12515, T00338, T47130; SWISS-PROT: Q9BWT7