MARC: Developing Bioinformatics Programs
July 13, 2009
Alex Ropelewski
ropelews@psc.edu
Hugh Nicholas
nicholas@psc.edu
Ricardo Gonzalez Mendez
ricardo.gonzalez7@upr.edu
Analyzing Families of Sequences
The following material is the result of a curriculum development effort to provide a set of
courses to support bioinformatics efforts involving students from the biological sciences,
computer science, and mathematics departments. The materials have been developed as part
of the NIH-funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36
GM008789). The people involved with the curriculum development effort include:
• Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University.
• Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus.
• Dr. Alade Tokuta, North Carolina Central University.
• Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez.
• Dr. Satish Bhalla, Johnson C. Smith University.
Unless otherwise specified, all the information contained within is copyrighted © by
Carnegie Mellon University. Permission is granted to use, modify, and reproduce these
materials for teaching purposes.
The most recent updated copy of this presentation along with supplemental teaching
materials can be found online at:
http://marc.psc.edu/
The interdisciplinary science of using computational
approaches to analyze, classify, collect, represent
and store biological data with the goal of
accelerating and enhancing the understanding of
DNA, RNA and Protein sequences.
Bioinformatics
Process of applying computational methods to a
biological molecule represented as a character
string. The goal is to infer information about the
structure, function, or evolutionary history of the
sequence.
Sequence Analysis
A sequence is a way to represent a protein, DNA,
or RNA molecule as a character string.
What is a Sequence?
MRLLVLAALLTVGAGQAGLNSRALWQFNGM
IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV
DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP
YTNNYSYSCSNNEITCSSENNACEAFICNC
DRNAAICFSKVPYNKEHKNLDKKNC
Phospholipase A2 - Bos taurus (Bovine).
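Because a sequence is just a character string, ordinary string operations already say something about it. The following is a minimal Python sketch, not part of the original slides, that stores the bovine phospholipase A2 sequence shown above and tallies its residues. The names `SEQUENCE` and `composition` are illustrative choices.

```python
from collections import Counter

# Bovine phospholipase A2, copied from the slide above and joined into one string.
SEQUENCE = (
    "MRLLVLAALLTVGAGQAGLNSRALWQFNGM"
    "IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV"
    "DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP"
    "YTNNYSYSCSNNEITCSSENNACEAFICNC"
    "DRNAAICFSKVPYNKEHKNLDKKNC"
)


def composition(seq):
    """Return a Counter mapping each one-letter residue code to its count."""
    return Counter(seq)


length = len(SEQUENCE)
counts = composition(SEQUENCE)

if __name__ == "__main__":
    print("Length:", length)
    # List residues from most to least frequent.
    for residue, n in counts.most_common():
        print(residue, n)
```

A tally like this is only a first look at a sequence; the alignment and motif tools discussed later work on the same string representation.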
Representing Proteins
• A - Alanine
• R - Arginine
• N - Asparagine
• D - Aspartic acid
• C - Cysteine
• E - Glutamic acid
• Q - Glutamine
• G - Glycine
• H - Histidine
• I - Isoleucine
• L - Leucine
• K - Lysine
• M - Methionine
• F - Phenylalanine
• P - Proline
• S - Serine
• T - Threonine
• W - Tryptophan
• Y - Tyrosine
• V - Valine
• B - Asparagine or aspartic acid
• Z - Glutamine or glutamic acid
• J - Leucine or Isoleucine
• X - Any Amino Acid
• U - Selenocysteine
• O - Pyrrolysine
Image from Wikipedia Commons: http://en.wikipedia.org/wiki/File:Oxytocin.jpg
[Figure: oxytocin structure, residues labeled G, L, P, C, N, Q, I, Y, C]
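The one-letter codes listed above map naturally onto a lookup table. The sketch below is an illustration, not part of the original slides. It spells out a peptide residue by residue, using the nine oxytocin residues from the figure, written here in N- to C-terminal order (CYIQNCPLG). The names `AMINO_ACIDS` and `spell_out` are hypothetical.

```python
# One-letter amino acid codes, including the ambiguity codes B, Z, J, and X
# and the rare residues U and O, as listed on the slide.
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "E": "Glutamic acid", "Q": "Glutamine", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
    "B": "Asparagine or aspartic acid", "Z": "Glutamine or glutamic acid",
    "J": "Leucine or Isoleucine", "X": "Any Amino Acid",
    "U": "Selenocysteine", "O": "Pyrrolysine",
}


def spell_out(seq):
    """Translate a string of one-letter codes into a list of residue names.

    Raises KeyError if the string contains a character that is not a known code.
    """
    return [AMINO_ACIDS[residue] for residue in seq.upper()]


OXYTOCIN = "CYIQNCPLG"

if __name__ == "__main__":
    print(" - ".join(spell_out(OXYTOCIN)))
```

Raising an error on unknown characters is a deliberate choice in this sketch: a stray digit or gap symbol usually means the input needs cleaning before analysis.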
Families share a common function and structure, and are related
through evolution
Why study families of sequences?
Aldehyde Dehydrogenase Family Members
The Goal
CURATED FAMILY:
• All related sequences sharing a common function (Homologous Sequences)
• All substantial motifs
• Evolutionary history
• Structural information
• Experimental information
The Process
[Process diagram. Components: Initial Query; Sequence Libraries; Structural Libraries; Classification Libraries; Multiple Sequence Alignment; Local Patterns; Profile & PSSM; Hidden Markov Model; Evolutionary Analysis; Homology Modeling; Curated Dataset]
The Toolkit
Smith-Waterman
ClustalW
MSA
MEME
Profile-ss
Pfam
PDB
FASTA
Needleman-Wunsch
GenBank
EMBL
UniProt
BLAST
MAST
ProbCons
T-Coffee
HMMER
PHYLIP
FigTree
Notung
Python
Biopython
GeneDoc
Part I: Submit three candidate families for your course project.
Part II: Collect an initial set of sequences.
Part III: Generate a multiple sequence alignment, identify patterns and motifs and use them to improve the quality of your alignment, and identify additional distantly related family members.
Part IV: Integrate the sequence analysis results with the structure, function, and evolution of the family.
Part V: Write a draft paper or research grant, and develop an oral presentation for a conference.
The Project
Part I
Learning Objectives:
Teach students to select an appropriate subject for experimentation.
Teach students how to use PubMed: find reviews, background information, and prior work to understand what is known about the subject.
Teach students how to concisely summarize and properly cite prior research works.
Part I – Selecting Query
URL: http://www.pubmed.gov/
National Library of Medicine’s database of articles published in biomedical journals
Currently contains over 18 million citations, dating from 1948
About 90% of records are English-language sources or have English abstracts
About 80% of the citations include the published abstract
About 5,200 journals
Some links to full-text articles at participating publishers’ web sites
PubMed
Title of the journal article
Names of the authors
Abstract published with the article
MeSH (Medical Subject Headings) tags
Journal source
First author affiliation
Language of the article
Publication type (review, letter, etc.)
Data in PubMed
Simple PubMed Search
[Screenshot labels: Click Go; Enter Search Term; Search Results]
Basic PubMed Search
[Screenshot labels: Go to advanced search page; Search Database Selection; Click on tab for all articles; Click on tab for review articles; Select to Sort results; Display Format; Page through results; Select to save or email results]
PubMed Feature Tabs:
• Limits: Limit search to certain dates, languages, etc.
• Preview: Allows viewing and selecting of search fields
• History: Log of recent searches
• Clipboard: Allows items to be temporarily saved
• Details: Shows how PubMed ran the search
PubMed Boolean Logic
[Venn diagrams over the sets Salmonella, Eggs, and Hamburger, illustrating each of the following searches:]
Salmonella and Eggs
Salmonella or Eggs
Salmonella not Eggs
Salmonella and Eggs and Hamburger
Salmonella and Eggs or Hamburger
Salmonella and (Eggs or Hamburger)
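The Boolean searches above behave like set operations on article identifiers. The sketch below is an illustration, not part of the original slides. It models each search term as a small set of made-up article numbers and evaluates the six expressions. The unparenthesized search is grouped left to right here, which is how PubMed is understood to process Boolean operators; treat that grouping as an assumption and check the Details tab when it matters.

```python
# Hypothetical article IDs matching each search term (made up for illustration).
salmonella = {1, 2, 3, 4}
eggs = {3, 4, 5, 6}
hamburger = {1, 4, 6, 7}

results = {
    "Salmonella and Eggs": salmonella & eggs,
    "Salmonella or Eggs": salmonella | eggs,
    "Salmonella not Eggs": salmonella - eggs,
    "Salmonella and Eggs and Hamburger": salmonella & eggs & hamburger,
    # Assumed left-to-right grouping: (Salmonella and Eggs) or Hamburger.
    "Salmonella and Eggs or Hamburger": (salmonella & eggs) | hamburger,
    # Parentheses change the grouping, and with it the result.
    "Salmonella and (Eggs or Hamburger)": salmonella & (eggs | hamburger),
}

if __name__ == "__main__":
    for query, ids in results.items():
        print(f"{query}: {sorted(ids)}")
```

The last two entries return different sets, which is the reason to add parentheses whenever a search mixes "and" with "or".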
PubMed Advanced Search
PubMed Limits
Medical Subject Headings
Controlled vocabulary/keyword system
Used to help locate appropriate articles
Articles in PubMed usually have between 5 and 15 MeSH tags associated with them.
MeSH Tutorial at: http://www.nlm.nih.gov/bsd/disted/mesh
MeSH tags
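MeSH-restricted searches can also be written as plain query strings. The sketch below is an illustration, not part of the original slides. It assembles a query with PubMed's `[MeSH Terms]` and `[Publication Type]` field tags and builds a request URL for NCBI's E-utilities `esearch` service without sending it. The helper names and the example heading "Aldehyde Dehydrogenase" are assumptions chosen to match the family used in this workshop.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_query(mesh_terms, reviews_only=False):
    """Combine MeSH headings with AND; optionally restrict to review articles."""
    parts = [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    if reviews_only:
        parts.append("review[Publication Type]")
    return " AND ".join(parts)


def esearch_url(query, retmax=20):
    """Build (but do not send) an esearch request URL for the PubMed database."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    query = mesh_query(["Aldehyde Dehydrogenase"], reviews_only=True)
    print(query)
    print(esearch_url(query))
```

Building the URL separately from sending it makes the query easy to inspect, much like the Details tab in the web interface.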
MeSH Search
1) Select MeSH
2) Enter Search Term
3) Click Go
4) Select MeSH Term
5) Select Search Box
[Screenshot labels: Click Search PubMed; Search Box]
[Screenshot labels: Click tab to see all articles; Click tab to see review articles]
Multiple MeSH Terms
Part II
Learning Objective:
Be able to search major libraries of biomolecules to collect sequences of interest
Understand information contained in the major sequence, structure and classification libraries
Understand searching methods and their limitations
Understand the effect of search parameters
Be able to select appropriate methods and parameters for a variety of sequences
Part II - Libraries
Searching Sequence Libraries – Results
Sequence Libraries – Results
Part III
Learning Objective:
Be able to construct a biologically correct alignment for a family of sequences
Understand what makes an alignment biologically correct
Be able to construct and refine multiple sequence alignments
Be able to create abstract representations of multiple alignments and search databases with them
Be able to tie local patterns (motifs) found back to the biology of the sequences
Understand the methods used to abstract an alignment and the advantages and disadvantages of commonly used methods
Understand the effect of search parameters
Be able to select appropriate methods and parameters for a variety of sequences
Part III – Multiple Alignment
Multiple Sequence Alignment
Aliphatic Amino Acids (I, V, L)
Valine – Val – V
Leucine – Leu – L
Isoleucine – Ile – I
Similarity of Amino Acids
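Treating I, V, and L as interchangeable matters when reading the columns of an alignment. The sketch below is an illustration, not part of the original slides. It labels each column of a made-up three-sequence alignment as identical, similar, or variable, using only the aliphatic group named above. The toy alignment and the names `ALIPHATIC` and `classify_column` are hypothetical, and real scoring schemes recognize many more similarity groups.

```python
# The aliphatic residues named on the slide.
ALIPHATIC = frozenset("IVL")


def classify_column(column):
    """Label one alignment column given as a string of residues."""
    residues = set(column)
    if len(residues) == 1:
        return "identical"
    if residues <= ALIPHATIC:
        # Different residues, but all drawn from the aliphatic group.
        return "similar (aliphatic)"
    return "variable"


# Hypothetical toy alignment: three aligned sequences of equal length.
alignment = ["MKVLA", "MKILA", "MRLLG"]

# zip(*alignment) walks the alignment column by column.
columns = ["".join(col) for col in zip(*alignment)]
labels = [classify_column(col) for col in columns]

if __name__ == "__main__":
    for col, label in zip(columns, labels):
        print(col, label)
```

The third column (V, I, L) is flagged as similar rather than variable, which is the distinction a percent-identity count alone would miss.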
Understanding Motifs
Functional Residues
[Figure labels: Substrate Binding; NAD Binding]
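A local pattern can be searched for as a regular expression over the sequence string. The sketch below is an illustration, not part of the original slides. It scans the bovine phospholipase A2 sequence shown earlier in this deck for a made-up pattern: two cysteines, any two residues, then histidine and aspartic acid. The pattern is hypothetical and was not taken from a motif database or from the aldehyde dehydrogenase figures.

```python
import re

# Bovine phospholipase A2, as shown earlier in this deck.
SEQUENCE = (
    "MRLLVLAALLTVGAGQAGLNSRALWQFNGM"
    "IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV"
    "DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP"
    "YTNNYSYSCSNNEITCSSENNACEAFICNC"
    "DRNAAICFSKVPYNKEHKNLDKKNC"
)

# Hypothetical illustrative pattern: C, C, any two residues, H, D.
PATTERN = re.compile(r"CC.{2}HD")


def find_motif(pattern, seq):
    """Return (1-based start position, matched text) for each non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(seq)]


hits = find_motif(PATTERN, SEQUENCE)

if __name__ == "__main__":
    for position, text in hits:
        print(position, text)
```

Positions are reported 1-based, following the usual convention for numbering residues, so they can be compared with alignment and structure annotations.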
Part IV
Learning Objective:
Understand and integrate the sequence analysis results with the structure and function of the protein family
Understand the evolutionary patterns of genes and species
Integrate evolutionary information with structural information to understand how the function has evolved within the protein family
Predict or design experiments to be carried out in vitro
Design drugs
Mutate the proteins
Mutate regulatory areas within the genome to change expression
Part IV – Structure and Phylogeny
Integrating Alignment, Motifs & Structure
[Figure label: Active Site]
[Figure label: Conserved Asn Binds Substrate]
[Figure label: Catalytic Thiol (Cys)]
Evolutionary Relationships
Learning Objectives:
Teach students to concisely summarize and properly cite relevant prior research works
Teach students to concisely summarize their research works
Teach students to revise papers based on reviewers’ comments
Teach students how to write a research grant
Teach students how to prepare and give an oral research presentation
Part V – Prepare Work for Publication
Biologists:
You will be working through the same five-step
project that your students will work through during your class.
By the time you leave here, you should
have a good start on a research publication or
grant, or have ideas for in vitro experiments.
Workshop Projects
Computer Scientists:
Take your favorite string matching algorithm
and apply it to biological sequence data.
Compare your algorithm’s performance with
some of the algorithms discussed in this
workshop in terms of speed, selectivity, or
sensitivity.
Feel free to use a parallel algorithm.
Workshop Projects