Bioinformatics - Whitehead Institute for Biomedical Research


Oct 2, 2013


Practical Bioinformatics Tools For Understanding Evolution

Robert Latek, PhD

Bioinformatics and Research Computing

Whitehead Institute for Biomedical Research

WIBR Bioinformatics, © Whitehead Institute 2004


Aims


Examine Techniques For Describing
Evolutionary Relationships


Learn To Apply Bioinformatics Tools
To Study Evolution



Question, Interrupt, Discuss, Suggest


Bioinformatics?


Definition


Integration of computational and biological methods to
promote biological discovery


Combination of Biology, Statistics, CS, Clinical Research


Purpose


Predict, Decipher, Visualize


Methodology


Data Mining and Comparisons



Data Visualization

MSRKGPRAEVCADCS
APDPGWASISRGVLVC
DECCSVHRLGRHISIV
KHLRHSAWPPTLLQM
VHTLASNGANSIWEHS
LLDPAQVQSGRRKAN

G. Bell


Bioinformatics :-)


Biological Comparisons (Evolutionary Analysis)


How closely/distantly related are two populations?


Gene Function Prediction


How and why does Gene X function/malfunction?


Pharmaceutical Design & In Silico Testing



Bioinformatics@WI


Bioinformatics and Research Computing


Collaboration, Consultation, Education
in Bioinformatics and Graphics


Provide hardware, commercial/custom
software tools, training, and
bioinformatics expertise

Decipher

Predict


Discussion Map


Relationships Among Groups Of Genes


Comparing Sequences


Building Sequence Families



Sequence Conservation During Evolution


Aligning Multiple Sequences



Evolutionary Diagrams


Tracing The Descent From Common Ancestors


Growing Phylogenetic Trees


Evolutionary Analysis


Definition


The use of phylogeny to reveal relationships
among sets of genes


Purpose


To utilize information about common ancestors to
predict gene function and regulation


Methodology


Compare properties between genes/organisms
and identify commonalities and differences


Organization of genes into evolutionary
diagrams


Sequence by sequence comparisons


Sequence-Based Comparisons


Identify sequences within an organism that are related
to each other and/or across different species


Within: Fetal and adult hemoglobin


Across: Human and chimpanzee hemoglobin


Generate an evolutionary history of related genes


Locate insertions, deletions, and substitutions that
have occurred during evolution

CREATE   CREASE   -RELAPSE   GREASER

[Ancestor]   [Progenitors]

Key: (C) Cysteine, (R) Arginine, (E) Glutamate, (A) Alanine, (T) Threonine,
(S) Serine, (L) Leucine, (P) Proline, (G) Glycine
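
As a rough illustration of locating substitutions, insertions, and deletions between two related sequences, the sketch below compares the slide's toy words with Python's standard `difflib` module. `difflib` is a general-purpose text differ, not a biological alignment tool, and the helper name `describe_changes` is made up for this example.

```python
from difflib import SequenceMatcher


def describe_changes(ancestor, descendant):
    """List the edits that turn `ancestor` into `descendant`.

    Uses difflib's longest-matching-block heuristic, which is only a
    stand-in for a real sequence alignment.
    """
    changes = []
    matcher = SequenceMatcher(None, ancestor, descendant, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            changes.append(("substitution", ancestor[i1:i2], descendant[j1:j2]))
        elif tag == "insert":
            changes.append(("insertion", "", descendant[j1:j2]))
        elif tag == "delete":
            changes.append(("deletion", ancestor[i1:i2], ""))
    return changes


for old, new in [("CREATE", "CREASE"), ("CREASE", "GREASER")]:
    print(old, "->", new, describe_changes(old, new))
```

For real protein or DNA data, a dedicated alignment program with a substitution matrix would normally replace this heuristic.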


Homology & Similarity


Homology


Conserved sequences arising from a common
ancestor


Orthologs: homologous genes that share a
common ancestor in the absence of any gene
duplication (Mouse and Human Hemoglobin)


Paralogs: genes related through gene duplication
(one gene is a copy of another - Fetal and Adult
Hemoglobin)



Similarity


Genes that share common sequences but are not
necessarily related



Sequences As Modules


Proteins are derived from a limited
number of basic building blocks
(Modules)



Evolution has shuffled these modules
giving rise to a diverse repertoire of
protein sequences



As a result, proteins can share a global
relationship or a local relationship
specific to a particular DOMAIN

Global

Local


Sequence Domains

Modules Define Functional/Structural Domains


Sequence Families


Definition


Group of sequences that share a common function
and/or structure, that are potentially derived from a
common ancestor (set of homologous sequences)



Building A Family


Domains are used to group different sequences into
common families


Defining A Sequence Family

Family A

Family B

Family D

Family E

Family C


Sequence Family Resources


Search and Browse Family Databases


PFAM


http://pfam.wustl.edu/

>src

MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTS
PQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWY
FGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVA
YYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSP
EAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNY
VHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVP
YPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL


Sequence Family Resources


NCBI Family Database Resources


Conserved Domain Database


http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd


Conserved Domain Architecture Retrieval Tool


http://www.ncbi.nlm.nih.gov/BLAST/



Discussion Map


Relationships Among Groups Of Genes


Comparing Sequences


Building Sequence Families



Sequence Conservation During Evolution


Aligning Multiple Sequences




Evolutionary Diagrams


Tracing The Descent From Common Ancestors


Growing Phylogenetic Trees



Multiple Sequence Alignments


Place residues in columns that
are derived from a common
ancestral residue


Identify Matches, Mismatches, and Gaps


MSA can reveal sequence
patterns


Demonstration of homology
between >2 sequences


Identification of functionally
important sites


Protein function prediction


Structure prediction


          123456789
SeqA      CRE-A-TE-
SeqB      CRE-A-SE-
SeqC      -RELAPSE-
SeqD      GRE-A-SER

Unaligned: CREATE, CREASE, GREASER, RELAPSE
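
To make the idea of matches, mismatches, and gaps concrete, here is a minimal sketch that labels each column of the toy four-sequence alignment from this slide. The function name `classify_columns` and the rule that any column containing a gap counts as a "gap" column are assumptions made for the example.

```python
# Toy alignment from the slide; each row has nine columns.
ALIGNMENT = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "-RELAPSE-",
    "SeqD": "GRE-A-SER",
}


def classify_columns(rows):
    """Label every alignment column as 'match', 'mismatch', or 'gap'."""
    labels = []
    for column in zip(*rows):
        residues = {c for c in column if c != "-"}
        if "-" in column:
            labels.append("gap")        # at least one sequence has a gap here
        elif len(residues) == 1:
            labels.append("match")      # every sequence shares the residue
        else:
            labels.append("mismatch")   # residues differ, no gaps
    return labels


for position, label in enumerate(classify_columns(ALIGNMENT.values()), start=1):
    print(position, label)
```

Columns labelled "match" (positions 2, 3, 5, and 8 here) are the conserved positions that an MSA is meant to expose.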


Global vs. Local Alignments


Global


Search for alignments, matching over
entire sequences


Local


Examine regions of sequence for
conserved segments


Both Consider: Matches, Mismatches,
Gaps
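
The difference between the two alignment styles can be sketched with the classic dynamic-programming recurrences: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The scoring values below (match +1, mismatch -1, gap -1) are illustrative assumptions, not values from the slides, and only the best score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of `a` with all of `b`."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over any pair of sub-segments of `a` and `b`."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(global_score("CREATE", "CREASE"))     # whole-sequence comparison
print(local_score("AAACREAT", "CREATGGG"))  # finds the shared CREAT segment
```

The only structural difference is the floor at zero in the local version, which lets an alignment restart wherever a conserved segment begins.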


Global Sequence Alignments

Yeast Prion-Like Proteins


How To Make A Global MSA


On The Web


http://pir.georgetown.edu/pirwww/search/multaln.html









On Your Computer


ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/



MSA Example Sequences

>KSYK_HUMAN

FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH


>ZA70_HUMAN

WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL


>KSYK_PIG

WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY


>MATK_HUMAN

WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY


>CSK_CHICK

WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY


>CRKL_HUMAN

WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY


>YES_XIPHE

WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY


>FGR_HUMAN

WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY


>SRC_RSVP

WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY

Standard FASTA Sequence Format
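
For readers who want to load sequences like these programmatically, the following is a minimal FASTA reader sketch. It assumes the usual convention that a record starts with a `>` header line and that sequence lines may wrap; the function name `parse_fasta` is made up here, and established libraries offer more robust parsers.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = """>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH

>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```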


MSA Example Result

YES_XIPHE   WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN   WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP    WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN  WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK   WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN  WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN  WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG    WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN  FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE


:: . : :: : * :*:* * : * : ** :




YES_XIPHE   DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN   DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP    YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN  -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK   -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN  PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN  KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG    KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN  LNGTYAIAGGRTHASPADLCHYH


* . : .


Discussion Map


Relationships Among Groups Of Genes


Comparing Sequences


Building Sequence Families



Sequence Conservation During Evolution


Aligning Multiple Sequences



Evolutionary Diagrams


Tracing The Descent From Common Ancestors


Growing Phylogenetic Trees




Phylogenetic Trees


A Graph Representing The
Evolutionary History Of Sequences


Relationship of sequences to one
another (How everything is connected)


Dissect the order of appearance of
insertions, deletions, and mutations



Identify Related Sequences, Predict
Function, Observe Epidemiology
(Analyze changes in viral strains)

A

B

C

D

Simple

Tree


Tree Shapes

Rooted

Unrooted

Branches intersect at Nodes

Leaves are the topmost branches

[Figure: rooted and unrooted example trees with leaves A, B, C, D]


Tree Characteristics


Tree Properties


Clade: all the descendants of a common
ancestor represented by a node


Distance: number of changes that have taken
place along a branch


Tree Types


Cladogram: shows the branching order of
nodes


Phylogram: shows branching order and
distances


[Figure: Phylogram of A, B, C, D with branch lengths .035, .009, .057, .044, .012, .016]
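
The two tree properties above, clade and distance, can be sketched on a small rooted tree. The topology below and the way the six branch lengths are assigned to branches are assumptions for illustration only; the slide's figure did not survive extraction, so this is not a reconstruction of it. The names `PARENT`, `ancestors`, `distance`, and `clade` are hypothetical.

```python
# child -> (parent, branch length). Topology and length placement are ASSUMED.
PARENT = {
    "A": ("N1", 0.035),
    "B": ("N1", 0.009),
    "C": ("N2", 0.057),
    "D": ("N2", 0.044),
    "N1": ("root", 0.012),
    "N2": ("root", 0.016),
}
LEAVES = ["A", "B", "C", "D"]


def ancestors(node):
    """Map `node` and each of its ancestors to the path length from `node`."""
    seen = {node: 0.0}
    total = 0.0
    while node in PARENT:
        node, length = PARENT[node]
        total += length
        seen[node] = total
    return seen


def distance(x, y):
    """Sum of branch lengths on the path joining leaves x and y."""
    up_x, up_y = ancestors(x), ancestors(y)
    # The closest shared ancestor gives the shortest combined path.
    return min(up_x[n] + up_y[n] for n in up_x if n in up_y)


def clade(node):
    """All leaves descended from `node`."""
    return sorted(leaf for leaf in LEAVES if node in ancestors(leaf))


print(clade("N1"), round(distance("A", "B"), 3))
print(clade("root"), round(distance("A", "C"), 3))
```

A cladogram would keep only the `PARENT` relationships; a phylogram also uses the branch lengths, which is what `distance` adds up.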


Tree Building Methods


Group Most Common Sequences


Find the tree that changes one sequence into all of the
others by the least number of steps



Sequences with the smallest number of differences have the
shortest distance between them and are called “related taxa”
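
A minimal sketch of the distance idea: count the differing columns between every pair of aligned sequences, then pick the pair with the fewest differences as the first candidates to group. It reuses the toy CREATE/CREASE/RELAPSE/GREASER alignment from earlier in the deck; treating a gap opposite a residue as one difference is an assumption, and real tree-building programs use more careful distance corrections.

```python
from itertools import combinations

# Toy alignment from the multiple sequence alignment slide.
ALIGNMENT = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "-RELAPSE-",
    "SeqD": "GRE-A-SER",
}


def count_differences(row_a, row_b):
    """Number of columns where two aligned rows differ (gap vs residue counts)."""
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


def pairwise_distances(alignment):
    """Difference counts for every unordered pair of sequences."""
    return {
        (a, b): count_differences(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


distances = pairwise_distances(ALIGNMENT)
closest_pair = min(distances, key=distances.get)

for pair, diff in distances.items():
    print(pair, diff)
print("closest:", closest_pair)
```

Repeating this step, each time merging the closest pair into a group, is the basic loop behind distance-based clustering methods.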


Tree Building Methods

[Figure: stepwise grouping of sequences A, B, C, D, E, F into a tree, steps 1 to 5]


Example Evolutionary Trees


Tree Of Life


http://tolweb.org/tree/phylogeny.html



Theory Of Human Evolution At The SI


http://www.mnh.si.edu/anthro/humanorigins/ha/a_tree.html

Anthropological and Archeological


How To Build A Tree


Create Alignment


http://pir.georgetown.edu/pirwww/search/multaln.html


Create Tree


http://www.genebee.msu.su/services/phtree_reduced.html


Draw Tree


http://iubio.bio.indiana.edu/treeapp/treeprint-form.html


Sequence Based


MSA and Tree Relationship


“The optimal alignment of several sequences can
be thought of as minimizing the number of
mutational steps in an evolutionary tree for which
the sequences are the leaves” (Mount, 2001)

SeqA      CRE-A-TE-
SeqB      CRE-A-SE-
SeqC      -RELAPSE-
SeqD      GRE-A-SER

Tree nodes: CREATE, CREASE, GREASE

Branch labels: T to S, C to G, +R, +L +P, -G
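
One way to count "mutational steps" between two sequences is the edit (Levenshtein) distance, where each substitution, insertion, or deletion costs one step. The sketch below applies it to word pairs read off the slide's branch labels; which words are joined by which branch is an interpretation of those labels, not something the extracted text states explicitly, and the function name `edit_distance` is chosen for this example.

```python
def edit_distance(a, b):
    """Minimum number of single-letter substitutions, insertions, or deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Assumed branch pairings, based on the labels "T to S", "C to G", "+R", "+L +P", "-G".
branches = [
    ("CREATE", "CREASE"),
    ("CREASE", "GREASE"),
    ("GREASE", "GREASER"),
    ("GREASE", "RELAPSE"),
]

for parent, child in branches:
    print(parent, "->", child, edit_distance(parent, child))
```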


Summary Review


Relationships Among Groups Of Genes


Comparing Sequences


Building Sequence Families



Sequence Conservation During Evolution


Aligning Multiple Sequences



Evolutionary Diagrams


Tracing The Descent From Common Ancestors


Growing Phylogenetic Trees



References


Bioinformatics: Sequence and Genome
Analysis. David W. Mount. CSHL Press, 2001.


Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins. Andreas D.
Baxevanis and B.F. Francis Ouellette. Wiley
Interscience, 2001.


Bioinformatics: Sequence, Structure, and
Databanks. Des Higgins and Willie Taylor.
Oxford University Press, 2000.