Introduction to Analyzing Protein Families - Pittsburgh ...


MARC: Developing Bioinformatics Programs

July 13, 2009



Alex Ropelewski

ropelews@psc.edu


Hugh Nicholas

nicholas@psc.edu


Ricardo Gonzalez Mendez

ricardo.gonzalez7@upr.edu




Analyzing Families of Sequences

The following material is the result of a curriculum development effort to provide a set of
courses to support bioinformatics efforts involving students from the biological sciences,
computer science, and mathematics departments. It was developed as part
of the NIH-funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36
GM008789). The people involved with the curriculum development effort include:



- Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski, and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University.
- Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus.
- Dr. Alade Tokuta, North Carolina Central University.
- Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez.
- Dr. Satish Bhalla, Johnson C. Smith University.


Unless otherwise specified, all the information contained within is Copyrighted © by
Carnegie Mellon University. Permission is granted to use, modify, and reproduce these
materials for teaching purposes.


The most recently updated copy of this presentation, along with supplemental teaching
materials, can be found online at:
http://marc.psc.edu/



Bioinformatics

The interdisciplinary science of using computational
approaches to analyze, classify, collect, represent,
and store biological data with the goal of
accelerating and enhancing the understanding of
DNA, RNA, and protein sequences.

Sequence Analysis

The process of applying computational methods to a
biological molecule represented as a character
string. The goal is to infer information about the
structure, function, or evolutionary history of the
sequence.

What is a Sequence?

A sequence is a way to represent a protein, DNA,
or RNA molecule as a character string.

MRLLVLAALLTVGAGQAGLNSRALWQFNGM
IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV
DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP
YTNNYSYSCSNNEITCSSENNACEAFICNC
DRNAAICFSKVPYNKEHKNLDKKNC


Phospholipase A2 - Bos taurus (Bovine).
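
Because a sequence is just a character string, ordinary string operations already yield useful summaries. The following is a minimal sketch, not part of the original slides: it stores the bovine phospholipase A2 sequence shown above as a Python string and counts its residues. The names `PLA2_BOVIN` and `composition` are chosen here for illustration only.

```python
from collections import Counter

# Bovine phospholipase A2 sequence, copied from the slide above.
PLA2_BOVIN = (
    "MRLLVLAALLTVGAGQAGLNSRALWQFNGM"
    "IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV"
    "DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP"
    "YTNNYSYSCSNNEITCSSENNACEAFICNC"
    "DRNAAICFSKVPYNKEHKNLDKKNC"
)


def composition(sequence):
    """Return a Counter mapping each residue letter to its number of occurrences."""
    return Counter(sequence)


counts = composition(PLA2_BOVIN)
print(len(PLA2_BOVIN))        # number of residues in the sequence
print(counts.most_common(3))  # the three most frequent residues
```

The same approach works unchanged for DNA or RNA strings, since only the alphabet differs.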

Representing Proteins

A - Alanine
R - Arginine
N - Asparagine
D - Aspartic acid
C - Cysteine
E - Glutamic acid
Q - Glutamine
G - Glycine
H - Histidine
I - Isoleucine
L - Leucine
K - Lysine
M - Methionine
F - Phenylalanine
P - Proline
S - Serine
T - Threonine
W - Tryptophan
Y - Tyrosine
V - Valine
B - Asparagine or aspartic acid
Z - Glutamine or glutamic acid
J - Leucine or Isoleucine
X - Any Amino Acid
U - Selenocysteine
O - Pyrrolysine
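
The one-letter codes above can be captured in a small lookup structure, which is handy for checking that a string really is a protein sequence before analyzing it. This is only a sketch under stated assumptions: the helper names `invalid_letters` and `possible_residues` are hypothetical, and `X` is expanded to the 20 standard residues only.

```python
# One-letter codes taken from the table above.
STANDARD = set("ARNDCEQGHILKMFPSTWYV")
AMBIGUOUS = {
    "B": {"N", "D"},  # asparagine or aspartic acid
    "Z": {"Q", "E"},  # glutamine or glutamic acid
    "J": {"L", "I"},  # leucine or isoleucine
    "X": STANDARD,    # any amino acid (assumption: the 20 standard ones)
}
RARE = set("UO")      # selenocysteine and pyrrolysine


def invalid_letters(sequence):
    """Return the sorted characters in sequence that are not valid protein codes."""
    allowed = STANDARD | set(AMBIGUOUS) | RARE
    return sorted(set(sequence.upper()) - allowed)


def possible_residues(letter):
    """Return the set of residues that a one-letter code can stand for."""
    letter = letter.upper()
    return set(AMBIGUOUS.get(letter, {letter}))


print(invalid_letters("MRLLVB"))  # every letter is a known code
print(possible_residues("B"))     # the two residues B may represent
```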


Image from Wikipedia Commons: http://en.wikipedia.org/wiki/File:Oxytocin.jpg

[Figure: oxytocin structure with residues labeled G, L, P, C, N, Q, I, Y, C]


Why study families of sequences?

Members of a family share a common function and structure and are related
through evolution.

[Figure: Aldehyde Dehydrogenase Family Members]

The Goal

CURATED FAMILY:
- All related sequences sharing a common function (Homologous Sequences)
- All substantial motifs
- Evolutionary history
- Structural information
- Experimental information

The Process

[Workflow diagram with the following components: CURATED DATASET, Classification Libraries, Sequence Libraries, Hidden Markov Model, Structural Libraries, Profile & PSSM, Local Patterns, Multiple Sequence Alignment, Evolutionary Analysis, Homology Modeling, Initial Query]

The Toolkit

Smith-Waterman, Clustalw, MSA, Meme, Profile-ss, Pfam, PDB, Fasta, Needleman-Wunsch, GenBank, EMBL, UniProt, Blast, Mast, Probcons, T-Coffee, hmmer, Phylip, Figtree, Notung, Python, BioPython, Genedoc


The Project

- Part I: Submit three candidate families for your course project.
- Part II: Collect an initial set of sequences.
- Part III: Generate a multiple sequence alignment, identify patterns and motifs and use them to improve the quality of your alignment, and identify additional distantly related family members.
- Part IV: Integrate the sequence analysis results with the structure, function, and evolution of the family.
- Part V: Write a draft paper or research grant and develop an oral presentation for a conference.


Part I


Part I - Selecting a Query

Learning Objectives:
- Teach students how to select an appropriate subject for experimentation.
- Teach students how to use PubMed:
  - Find reviews, background information, and prior work to understand what is known about the subject.
- Teach students how to concisely summarize and properly cite prior research works.

PubMed

- URL: http://www.pubmed.gov/
- National Library of Medicine’s database of articles published in biomedical journals
- Currently contains over 18 million citations, dating from 1948
- About 90% of records are English-language sources or have English abstracts
- About 80% of the citations include the published abstract
- About 5,200 journals
- Some links to full-text articles at participating publishers’ web sites


Data in PubMed

- Title of the journal article
- Names of the authors
- Abstract published with the article
- MeSH (Medical Subject Headings) tags
- Journal source
- First author affiliation
- Language of the article
- Publication type (review, letter, etc.)

Simple PubMed Search

[Screenshot callouts: Enter Search Term; Click Go; Search Results]

Basic PubMed Search

[Screenshot callouts: Go to advanced search page; Search Database Selection; Click on tab for all articles; Click on tab for review articles; Select to sort results; Display Format; Page through results; Select to save or email results]

PubMed Feature Tabs:
- Limits: Limit search to certain dates, languages, etc.
- Preview: Allows viewing and selecting of search fields
- History: Log of recent searches
- Clipboard: Allows items to be temporarily saved
- Details: Shows how PubMed ran the search


PubMed Boolean Logic

[Venn diagrams over three sets of articles (Salmonella, Eggs, Hamburger), one diagram per expression:]
- Salmonella and Eggs
- Salmonella or Eggs
- Salmonella not Eggs
- Salmonella and Eggs and Hamburger
- Salmonella and Eggs or Hamburger
- Salmonella and (Eggs or Hamburger)
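
The six Boolean expressions map directly onto set operations, which makes the effect of parentheses easy to see. The sketch below uses Python sets of made-up article IDs (the numbers are hypothetical, not real PubMed records) and assumes that an unparenthesized query is read left to right, so "Salmonella and Eggs or Hamburger" is treated as "(Salmonella and Eggs) or Hamburger".

```python
# Hypothetical article IDs indexed under each search term.
salmonella = {1, 2, 3, 4}
eggs = {3, 4, 5, 6}
hamburger = {2, 4, 6, 7}

and_eggs = salmonella & eggs                  # Salmonella and Eggs
or_eggs = salmonella | eggs                   # Salmonella or Eggs
not_eggs = salmonella - eggs                  # Salmonella not Eggs
all_three = salmonella & eggs & hamburger     # Salmonella and Eggs and Hamburger
left_to_right = (salmonella & eggs) | hamburger  # Salmonella and Eggs or Hamburger
grouped = salmonella & (eggs | hamburger)        # Salmonella and (Eggs or Hamburger)

for name, result in [
    ("and", and_eggs), ("or", or_eggs), ("not", not_eggs),
    ("and/and", all_three), ("and/or", left_to_right), ("and/(or)", grouped),
]:
    print(name, sorted(result))
```

With these toy sets the last two queries return different articles, which is the reason to add parentheses whenever "and" and "or" are mixed.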

PubMed Advanced Search

[Screenshot: PubMed Advanced Search page]

PubMed Limits

[Screenshot: PubMed Limits page]


MeSH tags

- Medical Subject Headings
- Controlled vocabulary/key word system
- Used to help locate appropriate articles
- Articles in PubMed usually have between 5 and 15 MeSH tags associated with them.
- MeSH Tutorial at: http://www.nlm.nih.gov/bsd/disted/mesh

MeSH Search

1) Select MeSH
2) Enter Search Term
3) Click Go
4) Select MeSH Term
5) Select Search Box

[Screenshot callouts: Search Box; Click Search PubMed]

[Screenshot callouts: Click tab to see all articles; Click tab to see review articles]

Multiple MeSH Terms
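
A search that combines several MeSH terms can also be written as a single query string using PubMed's `[MeSH Terms]` field tag. The sketch below is an illustration rather than part of the workshop material: it builds such a query and the matching NCBI E-utilities `esearch` URL without contacting the server. The helper names and the two example headings are chosen here for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_query(terms, operator="AND"):
    """Join MeSH headings into one query, tagging each with [MeSH Terms]."""
    tagged = ['"{}"[MeSH Terms]'.format(term) for term in terms]
    return " {} ".format(operator).join(tagged)


def esearch_url(query, retmax=20):
    """Build (but do not fetch) an E-utilities search URL for PubMed."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Example headings, picked only to illustrate the query syntax.
query = mesh_query(["Aldehyde Dehydrogenase", "Evolution, Molecular"])
print(query)
print(esearch_url(query))
```

The printed query can be pasted into the PubMed search box as is; the URL form is useful when the same search needs to be repeated from a script.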


Part II


Part II - Libraries

Learning Objectives:
- Be able to search major libraries of biomolecules to collect sequences of interest
- Understand information contained in the major sequence, structure, and classification libraries
- Understand searching methods and their limitations
- Understand the effect of search parameters
- Be able to select appropriate methods and parameters for a variety of sequences

Searching Sequence Libraries - Results

[Screenshot]

Sequence Libraries - Results

[Screenshot]

Part III


Part III - Multiple Alignment

Learning Objectives:
- Be able to construct a biologically correct alignment for a family of sequences
- Understand what makes an alignment biologically correct
- Be able to construct and refine multiple sequence alignments
- Be able to create abstract representations of multiple alignments and search databases with them
- Be able to tie local patterns (motifs) found back to the biology of the sequences
- Understand the methods used to abstract an alignment and the advantages and disadvantages of commonly used methods
- Understand the effect of search parameters
- Be able to select appropriate methods and parameters for a variety of sequences

Multiple Sequence Alignment
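
One way to see what an "abstract representation" of an alignment means is to reduce each column to residue counts and then to log-odds scores, the idea behind a profile or PSSM. The sketch below uses a hypothetical five-column toy alignment and makes simplifying assumptions (a uniform background frequency of 0.05 and a pseudocount of 1); tools such as those in the toolkit use more refined weighting.

```python
import math
from collections import Counter

# Hypothetical toy alignment: rows have equal length, '-' marks a gap.
ALIGNMENT = [
    "GCYCG",
    "GCYCG",
    "GCFCA",
    "ACY-G",
]


def column_counts(alignment):
    """Count the residues in each alignment column, ignoring gaps."""
    return [Counter(c for c in column if c != "-") for column in zip(*alignment)]


def log_odds_profile(alignment, background=0.05, pseudocount=1.0):
    """Convert column counts to log2 odds scores against a uniform background."""
    profile = []
    for counts in column_counts(alignment):
        total = sum(counts.values()) + 20 * pseudocount
        profile.append({
            residue: math.log2((n + pseudocount) / total / background)
            for residue, n in counts.items()
        })
    return profile


def consensus(alignment):
    """Most common residue per column (assumes no column is all gaps)."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


print(consensus(ALIGNMENT))
for position, scores in enumerate(log_odds_profile(ALIGNMENT), start=1):
    print(position, {r: round(s, 2) for r, s in scores.items()})
```

Fully conserved columns receive the highest scores, which is why such positions dominate a database search made with the profile.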

Similarity of Amino Acids

Aliphatic Amino Acids (I, V, L):
Valine - Val - V
Leucine - Leu - L
Isoleucine - Ile - I
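
Amino acid similarity matters when comparing aligned sequences: swapping one aliphatic residue for another is usually a milder change than an arbitrary substitution. As a rough sketch, the code below classifies aligned positions using only the aliphatic group named above; the function names and the two short test sequences are hypothetical, and a real analysis would use a full substitution matrix.

```python
ALIPHATIC = set("IVL")  # the similarity group named on the slide


def is_conservative(a, b, group=ALIPHATIC):
    """True if a and b are identical or both belong to the same similarity group."""
    return a == b or (a in group and b in group)


def count_substitutions(seq1, seq2):
    """Classify each aligned position as identical, conservative, or other."""
    tally = {"identical": 0, "conservative": 0, "other": 0}
    for a, b in zip(seq1, seq2):
        if a == b:
            tally["identical"] += 1
        elif is_conservative(a, b):
            tally["conservative"] += 1
        else:
            tally["other"] += 1
    return tally


# Two hypothetical aligned fragments of equal length.
print(count_substitutions("MRLLV", "MKILV"))
```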

Understanding Motifs

Functional Residues

[Figure labels: Substrate Binding; NAD Binding]

Part IV


Part IV - Structure and Phylogeny

Learning Objectives:
- Understand and integrate the sequence analysis results with the structure and function of the protein family
- Understand the evolutionary patterns of genes and species
- Integrate evolutionary information with structural information to understand how the function has evolved within the protein family
- Predict or design experiments to be carried out in vitro:
  - Design drugs
  - Mutate the proteins
  - Mutate regulatory areas within the genome to change expression

Integrating Alignment, Motifs & Structure

[Figure labels: Active Site; Conserved Asn Binds Substrate; Catalytic Thiol (Cys)]

Evolutionary Relationships


Part V - Prepare Work for Publication

Learning Objectives:
- Teach students how to concisely summarize and properly cite relevant prior research works
- Teach students how to concisely summarize their research works
- Teach students to revise papers based on reviewers’ comments
- Teach students how to write a research grant
- Teach students how to prepare and give an oral research presentation


Workshop Projects

Biologists:
- You will be working through the same five-step project that your students will during your class.
- By the time you leave here, you should have a good start on a research publication or grant, or have ideas for in vitro experiments.


Computer Scientists:
- Take your favorite string matching algorithm and apply it to biological sequence data.
- Compare your algorithm’s performance with some of the algorithms discussed in this workshop in terms of speed, selectivity, or sensitivity.
- Feel free to use a parallel algorithm.
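
As a possible baseline for the comparison, here is a compact sketch of Smith-Waterman local alignment scoring, one of the algorithms listed in the toolkit. It is a simplified illustration and not the workshop's implementation: it returns only the best score, uses a linear gap penalty, and the match, mismatch, and gap values are arbitrary choices, whereas protein searches normally use a substitution matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# A short motif found exactly inside a longer sequence scores 3 matches x 2.
print(smith_waterman_score("GGCYCG", "CYC"))
```

Keeping only two rows of the matrix holds memory use proportional to the shorter dimension, which helps when timing the function against other string matching approaches.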