NCBI_MSALecturex


2 Oct 2013


Basic concepts in bioinformatics

BLAST and MSAs

What we will do


Survey some of the tools available on the NCBI website and related sites


Use BLAST to identify a family of proteins in a single species


Construct an MSA with that family


Use the MSA to learn more about the family


Annotation verification


Functionally important regions

First, a word about databases and webservers

Overview of some resources

Use the web


No single site is best for all of your needs


Getting started:


Browse the NAR reviews of databases and webservers


http://nar.oxfordjournals.org/content/39/suppl_1


http://nar.oxfordjournals.org/content/38/suppl_2


Bookmark hub sites such as


http://bioinformatics.ca/links_directory/


http://nar.oxfordjournals.org/content/38/suppl_2

Nucleic Acids Research review

This is updated annually

Bioinformatics hub


http://bioinformatics.ca/links_directory/



This site has a partnership with NAR and, as of 2008, had over 1200 unique links to servers and tools.

NCBI is a good starting place


A very good and comprehensive site


http://www.ncbi.nlm.nih.gov/


What is available there:


Databases


Tools


Education


Downloadable material



NCBI website

Finding a gene


Entrez


The underlying software that links the various resources and allows Boolean searches
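As a rough illustration of a Boolean Entrez search, the sketch below builds (but does not send) a query URL for the NCBI E-utilities `esearch` endpoint. The search terms, the choice of field tags and the helper name `build_entrez_url` are illustrative assumptions, not part of the lecture.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_entrez_url(db, must_have, must_not_have=()):
    """Combine terms with Boolean AND / NOT and return an esearch URL."""
    term = " AND ".join(must_have)
    for excluded in must_not_have:
        term += f" NOT {excluded}"
    # urlencode percent-escapes the brackets and turns spaces into '+'.
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

# Hypothetical example terms: a protein name limited to one genus.
url = build_entrez_url(
    "protein",
    ["protein phosphatase 1[Title]", "Paramecium[Organism]"],
    must_not_have=["partial[Title]"],
)
print(url)
```

The same AND / OR / NOT syntax can be typed directly into the search box on the NCBI website.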


BLAST


The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. It compares a query against sequence databases and calculates the statistical significance of the matches.
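To make "local similarity" concrete, here is a minimal sketch of the classic Smith-Waterman local alignment score. This is not how BLAST works internally: BLAST uses faster heuristics and adds significance statistics. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Two made-up sequences that share only the short region "HEAG".
shared = local_alignment_score("XXXHEAGYYY", "ZZHEAGZZ")
unrelated = local_alignment_score("AAA", "TTT")
print(shared, unrelated)
```

The shared region scores well even though the flanking residues differ, which is the point of a local rather than a global comparison.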


BLAST

Using this information


Downloading the files


Various formats available


Finding related genes/domains


What does your gene do?


Finding taxonomic information


To select genes of interest for evolutionary comparisons


Finding structural information


To identify significant regions of biochemical interest

Ways to search


http://www.ncbi.nlm.nih.gov/


The resources available are rich

Many web-based and downloadable applications are available

You can search many databases

Searching all databases takes you to an Entrez links screen

Clicking on one will take you to a subset of the data

Protein files

Clicking on a file will take you to a GenPept entry.

GenBank file

http://www.ncbi.nlm.nih.gov/protein/125924597?report=genbank&log$=prottop&blast_rank=1&RID=Y7NSNX8N016


What can we do with the file?


You can reformat it using the Display button


You can download it to various locations

Displaying the file as a FASTA


FASTA is a useful format widely used by many bioinformatics software packages.


We can use it to conduct a BLAST search to find sequences on the basis of homology
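For readers who have not seen the format: a FASTA record is a header line starting with `>` followed by one or more lines of sequence. The sketch below parses such text in plain Python; the two records are made-up examples, and `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined; line wrapping carries no meaning.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 hypothetical example protein
MKVLAT
GGHW
>seq2 another made-up record
MKILAS
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```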

Conducting a BLAST

Types of BLAST


Basic BLAST


Search protein or nucleotide sequences against databases


Includes PSI-BLAST, PHI-BLAST, MEGABLAST, etc.


Search a combination using translated sequences


blastx, tblastn, tblastx


Specialized BLAST


Specialized databases
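The basic and translated programs differ only in what kind of sequence goes in and what gets translated before comparison. The lookup below summarizes this; the helper `choose_program` and its `compare_as_protein` flag are hypothetical names introduced for illustration.

```python
# program: (query type, database type, what is translated before comparison)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", "nothing"),
    "blastp":  ("protein", "protein", "nothing"),
    "blastx":  ("nucleotide", "protein", "query"),
    "tblastn": ("protein", "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}

def choose_program(query_type, database_type, compare_as_protein=False):
    """Pick a program name from the query and database sequence types."""
    for name, (q, d, _translated) in BLAST_PROGRAMS.items():
        if (q, d) != (query_type, database_type):
            continue
        # blastn and tblastx take the same inputs; only translation differs.
        if (q, d) == ("nucleotide", "nucleotide"):
            if compare_as_protein != (name == "tblastx"):
                continue
        return name
    raise ValueError("no matching BLAST program")

print(choose_program("protein", "nucleotide"))
```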

Using a protein file

BLAST results

Scrolling down reveals similar records

Hands on work


Using BLAST to identify a family of proteins from one species


You will review what I just covered and learn how to limit searches to a species, how to reformat files and how to download them

Getting started


Open or download the file PtPP1_sampleproteinfile.txt


This file is in the FASTA format


Open the file in a text editor program


Copy the amino acid sequence

Worksheet goals


Use a FASTA sequence file of protein phosphatase type 1 from Paramecium to identify all its family members


You will conduct a BLAST


Narrow your search to a specific species


Download all the identified files into a single concatenated file for subsequent work
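If the sequences end up in separate files, they can also be concatenated with a few lines of code. The sketch below is one possible approach, not the method used in the tutorial; the file names and contents in the demo are made up.

```python
from pathlib import Path
import tempfile

def concatenate_fasta(paths, out_path):
    """Write the contents of several FASTA files into one file.

    Returns the number of records (header lines) written.
    """
    count = 0
    with open(out_path, "w") as out:
        for path in paths:
            for line in Path(path).read_text().splitlines():
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    count += 1
                # Always end with a newline so records never run together.
                out.write(line + "\n")
    return count

# Demo with two made-up single-record files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "a.fasta")
    second = Path(tmp, "b.fasta")
    first.write_text(">demo1\nMKV\n")
    second.write_text(">demo2\nMKI")  # no trailing newline on purpose
    combined = Path(tmp, "combined.fasta")
    n_records = concatenate_fasta([first, second], combined)
    combined_text = combined.read_text()

print(n_records)
```

Counting the header lines while writing gives a quick check that the expected number of sequences made it into the combined file.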

Use the tutorial entitled Web-Based_Pt_BLAST.doc


Independent work time


Please ask questions if confused


Work in pairs if possible


Feel free to get up and move around

What you should have

A concatenated FASTA file with 12 renamed sequences



Can be downloaded from the wiki

Let's align the sequences


Why?


We can note conserved regions


Find errors or unusual features


We will construct a multiple sequence alignment using two methods

What is a multiple sequence alignment?


An alignment based on homology between three or more biological sequences


The homology can be used to


identify conserved regions,


biochemical relationships, or


to infer evolutionary relationships through phylogenetic methods
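One simple way to spot conserved regions is to score each alignment column by how often its most common residue occurs. The sketch below does this for a toy alignment; the sequences are invented, and real tools use more refined measures (for example, substitution-aware scores).

```python
from collections import Counter

def column_conservation(aligned):
    """Per column, the fraction of sequences sharing the most common residue."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]  # gaps do not count as residues
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Toy alignment of three made-up sequences ('-' marks a gap).
toy = ["MKVLA-G", "MKILAAG", "MRVLA-G"]
scores = column_conservation(toy)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns where every sequence carries the same residue are candidates for functionally important positions.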

Ways to conduct an MSA


Progressive alignment algorithms


Clustal family and T-Coffee


Iterative methods


MUSCLE


Hidden Markov models


SAM (Sequence Alignment and Modeling System)


HMMER
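Progressive methods begin by deciding which sequences are most alike, then align those first and add the rest. As a toy stand-in for that first step, the sketch below ranks pairs by shared k-mers (Jaccard similarity). The measure, the value of k and the sequences are all assumptions for illustration; the tools listed above use their own distance measures and build a full guide tree.

```python
from itertools import combinations

def kmers(seq, k=3):
    """Return the set of all overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def closest_pair(seqs, k=3):
    """Return the names of the two most similar sequences."""
    return max(
        combinations(sorted(seqs), 2),
        key=lambda pair: kmer_similarity(seqs[pair[0]], seqs[pair[1]], k),
    )

# Three made-up sequences; s1 and s2 differ only at the last residue.
toy = {"s1": "MKVLAAGHW", "s2": "MKVLAAGHF", "s3": "MRTPSSGQW"}
first_to_align = closest_pair(toy)
print(first_to_align)
```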

Examples of what you can learn


We study the evolution of phosphagen kinases


Involved in energy homeostasis: they allow for the storage of high-energy phosphates


Alignments help us identify PKs or related proteins


Sequence alignments help identify substrate specificity

[Figure: CK. Panel labels: Need energy! / Store energy! / Energy can be used when needed]

Two different types of bacterial PKs were identified.

Using BLAST and MSA we identified a novel bacterial protein (bAK)

The novel form was later shown to be a new type of protein kinase

The more traditional form has a role in development in Myxococcus xanthus

Using sequence alignments we can predict substrate specificities

This helps us track changes in substrate specificity across taxonomic groups

Examples of what you can learn


We study the evolution of phosphagen kinases


We can also use MSA to develop hypotheses


Residues involved in substrate specificity or dimerization

These key residues seem to be conserved in dimer PKs

Alignment generated using Clustal W

Green: residues involved in catalytic activity

Protozoa

Sponges

Molluscs

Annelida

Nematodes & Arthropods

Echinoderm

Chordates

Are these also dimers?

Constructing an MSA


If you were unsuccessful in creating the concatenated FASTA file of PtPP1 sequences, download the file PtPP1rename.fasta


Open or download the tutorial, Web-basedMSATutorial.docx.


Follow the instructions therein

What you should have


An MSA of all the PP1 sequences from Paramecium


Notes on anything unusual or otherwise noteworthy


First, the sequences are very highly conserved


However, some unusual features are noted

Alignment results

Note that one sequence starts later than the rest

Alignment in the middle

Note that two of the sequences have gaps

C-terminus alignment

One sequence goes on for much longer than the rest

Real differences or annotation errors?


Need to go back to the nucleotide files


Need to examine the raw reads


Doing so reveals that these are in fact annotation errors.
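The three oddities seen in the alignment (a late start, internal gaps, a long C-terminal tail) can also be flagged automatically. The sketch below is a minimal illustration on an invented alignment; the flag wording and the rule for a "long tail" are assumptions. A flag only marks a candidate: deciding whether it is an annotation error still requires going back to the nucleotide data.

```python
def flag_unusual(alignment):
    """Report late starts, internal gaps and sequences that outrun the rest.

    `alignment` maps sequence names to equal-length gapped strings.
    """
    trailing = {n: len(s) - len(s.rstrip("-")) for n, s in alignment.items()}
    flags = {}
    for name, seq in alignment.items():
        found = []
        if seq.startswith("-"):
            found.append("starts late")
        if "-" in seq.strip("-"):
            found.append("internal gap")
        others = [t for n, t in trailing.items() if n != name]
        # Only this sequence reaches the last column of the alignment.
        if trailing[name] == 0 and others and all(t > 0 for t in others):
            found.append("extends past the others")
        if found:
            flags[name] = found
    return flags

# Made-up alignment reproducing the three patterns.
toy = {
    "seqA": "MKVLAAGHWTRK----",
    "seqB": "--VLAAGHWTRK----",
    "seqC": "MKVL--GHWTRK----",
    "seqD": "MKVLAAGHWTRKQQQQ",
}
flags = flag_unusual(toy)
for name, found in flags.items():
    print(name, found)
```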

Summary


You have conducted a BLAST search


Identified and collected a subset of sequences for additional work


Conducted an MSA and drawn conclusions from it.