Bioinformatics Resources and Tools on the Web: A Primer - MGH-PGA


Bioinformatics Resources and Tools
on the Web: A Primer

Joel H. Graber

Center for Advanced Biotechnology

Boston University

Outline


Introduction: What is bioinformatics?


The basics


The five sites that all biologists should know


Some examples


Using the tools in a somewhat less-than-naïve manner



Questions/comments are welcome at all points


Much of this material comes from the Boston University course:
BF527 Bioinformatic Applications (http://matrix.bu.edu/BF527/)

What is bioinformatics?

Examples of Bioinformatics


Database interfaces


GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …


Sequence alignment


BLAST, FASTA


Multiple sequence alignment


Clustal, MultAlin, DiAlign


Gene finding


Genscan, GenomeScan, GeneMark, GRAIL


Protein Domain analysis and identification


Pfam, BLOCKS, ProDom


Pattern Identification/Characterization


Gibbs Sampler, AlignACE, MEME


Protein Folding prediction


PredictProtein, SWISS-MODEL

Things to know and remember about
using web server-based tools


You are using someone else’s computer



You are (probably) getting a reduced set of
options or capacity



Servers are great for sporadic or proof-of-principle work, but for
intensive work, the software should be obtained and run locally

Five websites that all biologists
should know


NCBI (The National Center for Biotechnology Information)


http://www.ncbi.nlm.nih.gov/


EBI (The European Bioinformatics Institute)


http://www.ebi.ac.uk/


The Canadian Bioinformatics Resource


http://www.cbr.nrc.ca/


SwissProt/ExPASy (Swiss Bioinformatics Resource)


http://expasy.cbr.nrc.ca/sprot/


PDB (The Protein Databank)


http://www.rcsb.org/PDB/


NCBI (http://www.ncbi.nlm.nih.gov/)


Entrez interface to databases


Medline/OMIM


GenBank/GenPept/Structures


BLAST server(s)


Five-plus flavors of BLAST


Draft Human Genome


Much, much more…

EBI (http://www.ebi.ac.uk/)


SRS database interface


EMBL, SwissProt, and many more


Many server-based tools


ClustalW, DALI, …


SwissProt (http://expasy.cbr.nrc.ca/sprot/)


Curation!!!


Error rate in the information is greatly reduced in
comparison to most other databases.


Extensive cross-linking to other data sources


SwissProt is the ‘gold standard’ by which other databases can be
measured, and is the best place to start if you have a specific
protein to investigate

A few more resources to be aware of


Human Genome Working Draft


http://genome.ucsc.edu/


TIGR (The Institute for Genomic Research)


http://www.tigr.org/


Celera


http://www.celera.com/


(Model) Organism specific information:


Yeast: http://genome-www.stanford.edu/Saccharomyces/


Arabidopsis: http://www.tair.org/


Mouse: http://www.jax.org/


Fruitfly: http://www.fruitfly.org/


Nematode: http://www.wormbase.org/


Nucleic Acids Research Database Issue


http://nar.oupjournals.org/ (First issue every year)

Example 1: Searching a new
genome for a specific protein


Specific problem: We want to find the closest match in C. elegans
of the D. melanogaster protein NTF1, a transcription factor



First: understanding the different forms of BLAST

The different versions of BLAST
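
The standard BLAST programs differ in the type of the query, the type of the database, and which side (if any) is translated. A minimal Python sketch of that lookup follows. The table contents are the standard program definitions; the names `BLAST_PROGRAMS` and `choose_blast_program` are illustrative assumptions, not part of any BLAST distribution.

```python
# For each BLAST program: (query type, database type, what is translated).
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", "nothing"),
    "blastp":  ("protein",    "protein",    "nothing"),
    "blastx":  ("nucleotide", "protein",    "query"),
    "tblastn": ("protein",    "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}


def choose_blast_program(query_type, db_type, translate=False):
    """Return the BLAST program matching the query and database types.

    `translate` only matters for nucleotide-vs-nucleotide searches, where
    it selects tblastx (compare translations) instead of blastn.
    """
    both_nucleotide = query_type == db_type == "nucleotide"
    for name, (q, d, translated) in BLAST_PROGRAMS.items():
        if (q, d) != (query_type, db_type):
            continue
        if both_nucleotide and translate != (translated != "nothing"):
            continue
        return name
    raise ValueError(f"no BLAST program for {query_type} vs {db_type}")
```

In the NTF1 example, a protein query against C. elegans proteins corresponds to blastp, and the same protein query against C. elegans nucleotide sequence corresponds to tblastn.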

1st Step: Search the proteins


blastp is used to search for C. elegans proteins that are similar to NTF1






Two reasonable hits are found, but the hits have suspicious characteristics


besides the fact that they weren’t included in the complete genome!

2nd Step: Search the nucleotides


tblastn is used to search for translations of C. elegans nucleotide
sequences that are similar to NTF1






Now we have only one hit


How are they related?
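
tblastn compares the protein query against the nucleotide database translated in all six reading frames. The following is a minimal Python sketch of six-frame translation using the standard genetic code. The function names and the frame labels ("+1" to "-3") are illustrative assumptions; this is not BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame label to its translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Because every frame is searched, a protein can match genomic sequence even where no gene has been predicted or where a gene was predicted incorrectly.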

Conclusion: Incorrect gene
prediction/annotation


The two predicted proteins have essentially
identical annotation


The protein-protein alignments are disjoint
and consecutive on the protein


The protein-nucleotide alignment includes both
protein-protein alignments in the proper order



Why/how does this happen?
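
The alignment-based observations above (disjoint, consecutive protein-protein hits that are all contained in one protein-nucleotide hit) can be sketched as a check on query coordinates. All function names and the `slack` parameter are hypothetical, and any coordinates passed in would be the user's own; none of this reproduces the actual NTF1 results.

```python
from typing import List, Tuple

# (start, end) of an alignment on the query protein, 1-based and inclusive.
Interval = Tuple[int, int]


def disjoint_and_consecutive(hits: List[Interval]) -> bool:
    """True if no two query intervals overlap once sorted by start."""
    ordered = sorted(hits)
    return all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def spans_all(combined: Interval, hits: List[Interval], slack: int = 0) -> bool:
    """True if the combined alignment covers every hit, within `slack` residues."""
    return all(combined[0] - slack <= start and end <= combined[1] + slack
               for start, end in hits)


def looks_like_split_gene(blastp_hits: List[Interval],
                          tblastn_hit: Interval) -> bool:
    """Flag the pattern: several disjoint protein hits, one spanning tblastn hit."""
    return (len(blastp_hits) > 1
            and disjoint_and_consecutive(blastp_hits)
            and spans_all(tblastn_hit, blastp_hits))
```

A check like this only flags candidates; the annotation of the predicted proteins and an independent gene prediction still need to be examined by hand.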

Final(?) Check: Gene prediction


Genscan is the best available ab initio gene predictor


http://genes.mit.edu/GENSCAN.html



Genscan’s prediction spans both protein-protein alignments,
reinforcing our conclusion of a bad prediction


Ab initio vs. similarity vs. hybrid
models for gene finding


Ab initio: The gene looks like the average of many genes


Genscan, GeneMark, GRAIL…


Similarity: The gene looks like a specific
known gene


Procrustes, …


Hybrid: A combination of both


GenomeScan (http://genes.mit.edu/genomescan/)

A similar example: Fruitfly homolog
of mRNA localization protein VERA


Same procedure as just described


tblastn search with BLOSUM45 produces an unexpected exon










Conclusion: Incomplete (as opposed to incorrect)
annotation


We have verified the existence of the rare isoform through RT-PCR

Another example: Find all genes with
PDZ domains


Multiple methods are possible



The ‘best’ method will depend on many things


How much do you know about the domain?


Do you know the exact extent of the domain?


How many examples do you expect to find?

Some possible methods if the domain
is a known domain:


SwissProt


text search capabilities


good annotation of known domains


crosslinks to other databases (domains)



Databases of known domains:


BLOCKS (http://blocks.fhcrc.org/)


Pfam (http://pfam.wustl.edu/)


Others (ProDom, PROSITE, DOMO, …)

Determination of the nature of
conservation in a domain


For new domains, multiple alignment is your
best option


Global: ClustalW


Local: DiAlign


Hidden Markov Model: HMMER


For known domains, this work has largely
been done for you


BLOCKS


Pfam

If you have a protein and want to
search it against known domains


Search/Analysis tools


Pfam


BLOCKS


PredictProtein (http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html)


Different representations of
conserved domains


BLOCKS


Gapless regions


Often multiple blocks for one domain



Pfam


Statistical model, based on HMM


Since gaps are allowed, most domains have only
one Pfam model
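
Because a block is gapless, it can be treated as a position-specific scoring matrix and slid along a sequence one window at a time. The following is a minimal Python sketch of that idea, assuming a uniform amino acid background and a simple pseudocount. It is not the actual BLOCKS scoring scheme, and the function names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(block, pseudocount=1.0):
    """Build per-column log-odds scores from a gapless alignment.

    `block` is a list of equal-length strings, one aligned segment per row.
    """
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("a block must be gapless: all rows the same length")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    total = len(block) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(width):
        counts = Counter(row[col] for row in block)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in ALPHABET})
    return pssm


def best_window(pssm, sequence):
    """Return (score, start) of the best ungapped window (0-based start).

    Residues outside ALPHABET score 0. The sequence must be at least as
    long as the block.
    """
    width = len(pssm)
    scored = [(sum(pssm[i].get(sequence[start + i], 0.0) for i in range(width)),
               start)
              for start in range(len(sequence) - width + 1)]
    return max(scored)
```

A profile HMM, as used by Pfam, additionally models insertions and deletions, which is why one model can usually cover a domain that would need several separate blocks.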

Conclusions


We have only touched small parts of the
elephant


Trial and error (intelligently) is often your best
tool


Keep up with the main five sites, and you’ll
have a pretty good idea of what is happening
and available