Bioinformatics, Part 2 - Dr. Christopher King




Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207; http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute of Washington University.

Glossary

Genome



The complete set of genetic information for an organism. In humans, each diploid cell carries it on 46 chromosomes (23 pairs).

Homologous



With regard to amino acids, homologous amino acids have similar chemical properties and sizes. For example, glutamate can be considered homologous to aspartate because both residues have similar sizes and both residues contain a carboxylic acid side chain.

Sequence alignment

A sequence alignment is a way of arranging DNA, RNA, or protein sequences so as to identify regions that are similar.

Multiple sequence alignment

A sequence alignment of three or more biological sequences.

Conserved


Describes a position in a multiple sequence alignment at which the amino acid residues are identical in every sequence of the alignment.

Conservative residue change


Describes a position in a multiple sequence alignment at which the amino acid residues differ but are homologous.

ClustalW


A program for making multiple sequence alignments.

www.ebi.ac.uk/clustalw/index.html

ExPASy


Expert Protein Analysis System - us.expasy.org/

A server maintained by the Swiss Institute of Bioinformatics. Home of SWISS-PROT, an extensive, manually annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site for free download.

FASTA


Fast Alignment Search Tool - All (since it works on both nucleotide and amino acid sequences). Associated with this software is a way of formatting a nucleic acid or protein sequence. It is important because many bioinformatics programs require that the sequence be in FASTA format. The FASTA format has a title line for each sequence that begins with a “>” followed by any needed text to name the sequence. The end of the title line is signified by a paragraph mark (hit the return key). Bioinformatics programs will know that the title line isn’t part of the sequence if you have it formatted correctly. The sequence itself does NOT have any returns, spaces, or formatting of any kind. The sequence is given in one-letter code. An example of a protein in correct FASTA format is shown below:

>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM
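To make the FASTA rules above concrete, here is a minimal Python sketch of how a program might separate title lines from sequences. The function name parse_fasta is made up for this illustration and is not part of any tool named in this handout. It tolerates wrapped lines and stray spaces, which many programs accept even though the cleanest input has none.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (title, sequence) pairs."""
    records = []
    title = None
    pieces = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(pieces)))
            title = line[1:].strip()
            pieces = []
        elif title is not None:
            # Remove stray spaces and join lines that were wrapped on paste.
            pieces.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(pieces)))
    return records


# The K-Ras example from the glossary, wrapped here only to keep lines short.
kras_text = """>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
"""

for title, sequence in parse_fasta(kras_text):
    print(title, len(sequence))
```

Anything that appears before the first “>” line is ignored, which is one reason a missing title line tends to cause trouble in real programs.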

Sequence Manipulation Suite


bioinformatics.org/sms/

A website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.

Procedure

NCBI Gene

1. Go (again) to the NCBI homepage: http://www.ncbi.nlm.nih.gov

2. Search in the “Gene” database for Homo sapiens PTGS2. Click on the “PTGS2” entry. The section NCBI Reference Sequences (RefSeq) gives RefSeq accession numbers for the mRNA sequence of Homo sapiens prostaglandin G/H synthase 2 precursor. (The number starts with NM_.) Write one of them here __________________.

3. Open the RefSeq entry by clicking on that number (first link in the section), then click on “FASTA” (near the top of the page). Copy the nucleotide sequence (including the title line designated by the “>” symbol) and paste it into a text or Word document.

4. Save the file as PTGS2rna.doc (or .txt) on your desktop. Review the entry for FASTA in the Glossary: understanding the FASTA format will help in working with the bioinformatics programs.

5. The amino acid sequence is conveniently obtained by first clicking on the “RefSeq Protein Product” link, which is in the second column of the page, then selecting the FASTA format again. Follow the steps given above to save the amino acid sequence in FASTA format as a document called PTGS2prot.doc.
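The same FASTA download can also be scripted instead of copied by hand. The sketch below is an optional alternative and not part of the original procedure. It assumes NCBI's E-utilities efetch service, with the parameters db, id, rettype and retmode. The function names and the output file name PTGS2rna.txt are choices made for this example, and the accession is left for you to supply from step 2.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def refseq_fasta_url(accession, db="nuccore"):
    """Build the efetch URL that returns one record as FASTA text."""
    accession = accession.strip()
    # Step 2 notes that the mRNA accession starts with NM_.
    if db == "nuccore" and not accession.startswith("NM_"):
        raise ValueError("an mRNA RefSeq accession should start with NM_")
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def download_fasta(accession, path, db="nuccore"):
    """Fetch the record and save it as plain text (needs network access)."""
    with urlopen(refseq_fasta_url(accession, db)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


# Example call, using the accession you wrote down in step 2:
# download_fasta(your_accession, "PTGS2rna.txt")
```

Saving as plain text rather than a Word document avoids hidden formatting, which is the main thing that breaks FASTA input.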

Swiss-Prot Entry

1. Go to the ExPASy website (http://us.expasy.org/). Under Databases select “UniProtKB” (a protein knowledgebase). At the top of the page, click “Fields >>” (to the right of the search box). For the first field, select “Protein Name”, and enter, for the “Term”, Phospholipase C gamma 1. Click “Add & Search”, then click “Fields” again, and for the field, “Organisms”, use the term “Homo sapiens”. Click “Add & Search” again. Select the one entry that has been reviewed (the gold star).

2. What is the accession number of this protein?


3. Click on the accession number. Write at least three alternate names for this protein.


4. In which two areas of the cell is this protein found? (Under “cellular component”)


5. What is its cofactor (needed for the enzyme to function)?


6. What are the amino acid length and the molar mass in daltons of PLC gamma 1 isoform 1 (under “Sequences”)?


7. Return to the home page of the ExPASy Proteomics Server; select the SWISS-2DPAGE database. Enter the accession number in the search box. Has anyone reported 2-D gel electrophoresis data?
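As a cross-check on question 6, the length and an approximate molar mass can be computed directly from a one-letter protein sequence. The sketch below sums standard average residue masses and adds one water for the free ends of the chain. The function name is invented for this example, and the result ignores any post-translational modifications, so expect it to be close to, not necessarily identical to, the database value.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water for the free amino and carboxyl ends


def length_and_mass(sequence):
    """Return (number of residues, approximate molar mass in daltons)."""
    residues = "".join(sequence.upper().split())
    if not residues:
        raise ValueError("empty sequence")
    unknown = sorted(set(residues) - set(RESIDUE_MASS))
    if unknown:
        raise ValueError(f"unrecognized residue codes: {unknown}")
    mass = sum(RESIDUE_MASS[aa] for aa in residues) + WATER
    return len(residues), mass


# Paste the sequence only (no title line) between the quotes.
length, mass = length_and_mass("MTEYK")
print(length, round(mass, 1))
```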


Sequence Manipulation

1. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).

2. Under the menu entry “DNA Analysis”, click on “Translate”.

3. Clear the data entry box by clicking on “Clear”.

4. Copy the mRNA sequence in FASTA format from your file (PTGS2rna.doc) and paste it into the data entry box on the Sequence Manipulation website.

5. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.

6. When the Output window opens with your results, copy and paste the sequence into a Word document and save it as “translate.doc” on your desktop.

7. Compare the sequence in the “translate.doc” file with the sequence in PTGS2prot.doc. What are the first residues that are the same in the sequences? Do the sequences look like they are the same? (Note: protein sequences should start with a methionine, M.)
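The translation in steps 5 to 7 can be sketched in a few lines of Python. This is an illustration of the idea, not the Sequence Manipulation Suite's own code. It assumes the standard genetic code and takes “Reading Frame 3, direct” to mean reading the given strand starting at its third base. The names translate and first_methionine are made up for this example.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence, frame=1):
    """Translate a DNA or mRNA sequence (no title line) in frame 1, 2 or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    # Drop whitespace, use upper case, and treat RNA 'U' as DNA 'T'.
    seq = "".join(sequence.upper().split()).replace("U", "T")
    protein = []
    for start in range(frame - 1, len(seq) - 2, 3):
        # Codons with unexpected letters are shown as 'X'; '*' marks a stop.
        protein.append(CODON_TABLE.get(seq[start:start + 3], "X"))
    return "".join(protein)


def first_methionine(protein):
    """Return the protein from its first M onward ('' if there is none)."""
    position = protein.find("M")
    return protein[position:] if position >= 0 else ""


# A short invented sequence, only to show the effect of the reading frame.
demo = "GGATGACCGAATAA"
for frame in (1, 2, 3):
    print(frame, translate(demo, frame))
```

Trimming the output with first_methionine makes the comparison in step 7 easier, because the saved protein sequence begins at the start codon while the mRNA also contains untranslated bases before it.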







Multiple Sequence Alignment with ClustalW

1. Go to the ClustalW2 website, http://www.ebi.ac.uk/Tools/clustalw2/index.html.

2. The following are 6 FASTA formatted sequences of PTGS2 from different organisms. Copy and paste all of the FASTA formatted sequences into the data entry box.

>dog [Canis familiaris]

MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCSTPEFLTRIKLYLKPTPNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSWEAFSNLSYYTRALPPVPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDHKRGPAFTKGLGHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHVPEHLQFAVGQEVFGLVPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPELLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGLTQFVESFSRQIAGRVAGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAAGLEALYGDIDAMELYPALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKIINTASIQSLICNNVKGCPFTAFSVQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL


>cow [Bos taurus]

MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCTTPEFLTRIKLLLKPTPNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSWEAFSNLSYYTRALPPVPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTDFERGPAFTKGKNHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHVPEHLKFAVGQEVFGLVPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPELLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGLTQFVESFTRQRAGRVAGGRNLPVAVEKVSKASIDQSREMKYQSFNEYRKRFLVKPYESFEELTGEKEMAAELEALYGDIDAMEFYPALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKIINTASIQSLICSNVKGCPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL




>mouse [Mus musculus]

MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCTTPEFLTRIKLLLKPTPNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSWEAFSNLSYYTRALPPVADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQFFKTDHKRGPGFTRGLGHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHIPENLQFAVGQEVFGLVPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPELLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGLTQFVESFTRQIAGRVAGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAAELKALYSDIDVMELYPALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKIINTASIQSLICNNVKGCPFTSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL


>Rabbit

MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCSTPEFLTRIKLLLKPTPDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSWEAFSNLSYYTRALPPVADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDLKRGPAFTKGLGHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHIPAHLQFAVGQEVFGLVPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPELLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGLTQFVESFTRQIAGRVAGGRNVPPAVQKVAKASIDQSRQMKYQSLNEYRKRFLLKPYESFEELTGEKEMAAELEALYGDIDAVELYPALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIVNTASIQSLICNNVKGCPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL



>pig [Sus scrofa]

MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCTTPEFLTRIKLFLKPTPNTVHYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSWEAFSNLSYYTRALPPVPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDQKRGPAFTKGQGHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHTPEHLRFAVGHEVFGLVPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPELLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGITQFVESFSRQIAGRVAGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAAELEALYGDIDAMELYPALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKIINTASIQSLICNNVKGCPFTSFSVQDPQLAKTVTINASSSHSGLDDINPTVLLKERSTEL


>coral [Gersemia fruticosa]

MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYGVNCEKPNWSTWFKALIAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHDYATMEAHYNRSYFARTLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMFFAQHFTHEFFKTIYHSPAFTWGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYPPNTPEDQKFALGHPFYSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIEDYVNHLANYNLKLTYNPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVKHGMSSFVDSMSKGLCGQMSHHNHGAYTLDVAVEVIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKMSAELQEVYGDVNAVDLYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFDMVKTASLEKLFCQNIAGECPLVTFTVPDDIARETRKVLEARDEL


For alignment select “Full”; for output format, select “aln w/numbers” so that particular residues (amino acids) in the alignment can be found; for the Output order select “input”. Click the “Run” button located in the lower right.


3. View the output - the SCORES table:

SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
===================================================
1     dog    604      2     cow    604      90
1     dog    604      3     mouse  604      89


Note that each pair of sequences is compared separately; dog to cow, for example. You would expect a higher SCORE (right column; the similarity of the two protein sequences) between two mammals than between the mouse and the coral. What is the similarity score for the mouse and coral sequences? ________
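To see where a pairwise score can come from, the sketch below computes percent identity for two sequences that have already been aligned to the same length. It assumes that positions where either sequence has a gap are left out of the count. The exact rule ClustalW uses may differ, so treat the result as an approximation of the SCORE column. The function name is invented for this example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared positions at which two aligned sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gap positions are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# First 20 residues of the dog and mouse sequences listed in step 2.
dog = "MLARALVLCAALAVVRAANP"
mouse = "MLFRAVLLCAALGLSQAANP"
print(round(percent_identity(dog, mouse), 1))
```

A score from such a short stretch is only illustrative; the table reports the value over the full-length alignment.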


View the cladogram at the bottom of the page. (To learn more about cladograms go to en.wikipedia.org/wiki/Cladogram.) Switch to the phylogram view. Which two species are most similar, based on this view? (Or can one even tell?)





Now for the most important part of this ClustalW analysis: an amino acid by amino acid comparison of the same protein from different species. Go a little way down the web page and find ALIGNMENT. A button labeled 'Show Colors' will be displayed in the Alignment section of the results page. If you press this button the alignment will be shown in color according to the table below. (This option only works when you have chosen ALN or GCG as the output format.)

Residues   Color     Property
AVFPMILW   Red       Small: small or hydrophobic; includes aromatic except Tyr
DE         Blue      Acidic
RHK        Magenta   Basic
STYHCNGQ   Green     Hydroxyl + Amine + Basic - Q
Others     Gray



CONSENSUS SYMBOLS:

An alignment will display by default the following symbols denoting the degree of conservation observed in each column:

Symbol    Meaning
*         The residues in that column are identical in all sequences in the alignment.
:         Conserved substitutions are present, according to the COLOR table above.
.         Semi-conserved substitutions are present.
(space)   ?
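The first two rows of the symbol table can be expressed as a short Python sketch, which may help when checking a column by hand. It follows this handout's description only: “*” for identical residues and “:” when every residue in the column falls within a single group of the COLOR table. The “.” symbol is not reproduced because the handout does not define which substitutions count as semi-conserved. Returning a space for any column that contains a gap is an assumption, and the function names are invented for this example.

```python
# Residue groups copied from the COLOR table (H appears in two groups there).
COLOR_GROUPS = {
    "red": set("AVFPMILW"),
    "blue": set("DE"),
    "magenta": set("RHK"),
    "green": set("STYHCNGQ"),
}


def consensus_symbol(column, gap="-"):
    """Return '*', ':' or ' ' for one alignment column (a string of residues)."""
    residues = set(column.upper())
    if not residues:
        raise ValueError("empty column")
    if gap in residues:
        return " "  # assumption: a column with a gap gets no symbol
    if len(residues) == 1:
        return "*"  # identical in every sequence
    if any(residues <= group for group in COLOR_GROUPS.values()):
        return ":"  # all residues share one color group
    return " "      # neither rule applies


def consensus_line(aligned_sequences):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(
        consensus_symbol("".join(column)) for column in zip(*aligned_sequences)
    )


# First six residues of the dog and mouse sequences from step 2.
print(repr(consensus_line(["MLARAL", "MLFRAV"])))
```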





Figure 1. A Venn diagram showing the relationship of the 20 naturally occurring amino acids to some physicochemical properties. Exarchos et al. BMC Bioinformatics, 2009, 10:113 (Creative Commons Attribution License)

Copy the alignment of amino acids in various species and paste it into a Word document. To make this file readable, do the following things:

a) Go to “Page Set-up” under “File” and change the page orientation to landscape.

b) Select all text and change to “Courier” font, size 10. Courier is the best font for alignments because all the letters are the same width, so the columns stay lined up. This is one of the major secrets of working with FASTA sequences.

c) Save this file to the desktop as “ClustalW.doc” and print it (send the file to yourself by email or place it on a floppy or flash drive). Place a printed copy in your lab notebook.

4. Review the alignment. What does the presence of a space under a column in the alignment indicate about the relation of the residues?





5. Find the longest string of conserved (defined in glossary) residues (watch out for strings at the ends of rows). How many residues does it contain?
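One way to check your answer to question 5 is to join the symbol lines from every block of the alignment and look for the longest unbroken run of “*”. The sketch below does that. It assumes you have copied each symbol line at exactly the width of its alignment block, including any trailing spaces, since a dropped space would wrongly merge two runs. The function name is invented for this example.

```python
import re


def longest_conserved_run(symbol_rows):
    """Length of the longest run of '*' across consecutive symbol rows."""
    # Joining the rows lets a run continue from the end of one row
    # into the start of the next, as the question warns.
    joined = "".join(symbol_rows)
    runs = re.findall(r"\*+", joined)
    return max((len(run) for run in runs), default=0)


# Two short invented rows: the run of 3 at the end of the first row
# continues as 2 more at the start of the second row.
rows = ["**:*.***", "**. *"]
print(longest_conserved_run(rows))
```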