Working with protein sequences : a bioinformatics tutorial


Oct 1, 2013



Bioinformatics 2




A. Retrieving protein sequences


Connect to UniProt (Universal Protein Resource) :
http://www.uniprot.org/

Search in the Protein Knowledgebase (UniProtKB)
for “human calcitonin” (query)


How many results are given ?

Look at the first 10. Why were they selected ?

Modify your query to get only human genes. A simple possibility is to use the restrictions :

- Restrict term "human" to organism
- Restrict term "calcitonin" to protein family

How has the query been modified ?

How many protein sequences do you get ? Select them in the result table. A green banner
opens at the bottom of your screen.



More information on text search can be found by clicking on “Documentation/help”, then “text
search” :

http://www.uniprot.org/help/text-search



Click on “retrieve” in the green banner. You are switching from the “search” to the “retrieve”
tab. Note that the UniProt identifiers are now written in the query frame. Several formats are
proposed to open or download.
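
For scripted retrieval, the same records can usually be fetched by accession instead of through the green banner. The sketch below assumes the address pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta` of the UniProt REST service, which is not mentioned in this tutorial and may change. The helper names `fasta_url` and `fetch_fasta` are hypothetical.

```python
from urllib.request import urlopen

# Assumed base address of the UniProt REST service (not given in this tutorial).
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def fasta_url(accession):
    """Build the assumed download address for one UniProt accession."""
    return f"{UNIPROT_REST}/{accession.strip()}.fasta"


def fetch_fasta(accession, timeout=30):
    """Download one FASTA record as text (needs network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# The four accessions listed in the results section of this tutorial.
ACCESSIONS = ["P01258", "P10997", "P06881", "P10092"]
urls = [fasta_url(acc) for acc in ACCESSIONS]
for url in urls:
    print(url)
```

Calling `fetch_fasta("P01258")` would then return the record as a string, provided the assumed address is still valid.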


Open the data in the FASTA format. This is the standard format to exchange protein
sequences. Each record starts with an information line that begins with a ‘>’. The characters
on the following lines are interpreted as amino acids.
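
To make the format concrete, here is a minimal sketch of a FASTA reader. It is a simplified parser written for illustration, not the one UniProt uses, and the function name `parse_fasta` is hypothetical. The example record is P10997 as listed in the results section.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are concatenated until the next '>' line.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Record for P10997 as listed in the results section of this tutorial.
example = """>P10997
MGILKLQVFLIVLSVALNHLKATPIESHQVEKRKCNTATCATQRLANFLVHSSNNFGAIL
SSTNVGSNTYGKRNAVEVLKREPLNYLPL
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```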


Save the sequences. You can either download the file and open it with a text editor such as
Word or copy the sequences from the window and paste them into Word. Use a non-proportional
font such as “Courier” to visually maintain the alignment.



B. Aligning protein sequences


Click on “align” in the green banner. You are switching from the “search” to the “align” tab.
Note that the protein sequences in FASTA format are now written in the query frame.


Scroll down to the ClustalW results. Clustal is a multiple alignment program that can run in
terminal-like (ClustalW) or graphical mode (ClustalX). Clustal results can also be displayed
in a Java Applet Window (JalView) that will be used later.


Identify the amino acids that are strictly conserved in the four sequences (marked by a *).
Using the color coding displayed in the “amino acid properties” menu, what are their
physico-chemical characteristics ? Save the alignment as a PDF file using the “Print” button.
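
The meaning of the * marker can be sketched in a few lines: a column gets a * only when every sequence carries the same residue there. The example uses the first 20 residues of the four precursors listed in the results section and treats them as an ungapped alignment, which is an assumption made for illustration (the real ClustalW alignment may place gaps differently). The function name `conservation_line` is hypothetical.

```python
def conservation_line(aligned):
    """Return a marker line with '*' where all rows share the same residue."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column made only of gaps is not counted as conserved.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# First 20 residues of each precursor, assumed here to align without gaps.
block = [
    "MGFQKFSPFLALSILVLLQA",  # P01258
    "MGILKLQVFLIVLSVALNHL",  # P10997
    "MGFQKFSPFLALSILVLLQA",  # P06881
    "MGFRKFSPFLALSILVLYQA",  # P10092
]

marker = conservation_line(block)
conserved_positions = [i + 1 for i, m in enumerate(marker) if m == "*"]
for row in block:
    print(row)
print(marker)
```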


How many amino acids are strictly conserved ? Making simple assumptions, estimate the
probability that four 100 amino acid long protein sequences have such a percentage of identity.




Color the sequences according to “Sequence annotation” : click on ‘peptide’, ‘propeptide’ and
‘signals’. These proteins are indeed precursors of secreted peptides. The “signal sequence”
targets the protein to the secretory pathway and is cleaved in the endoplasmic reticulum. The
“propeptide” sequence is cleaved when the protein is secreted. It often prevents the peptide
activity from being unleashed prematurely. How many peptides are produced from the P01258
protein ? What are the characteristics of the amino acids that surround the secreted peptides ?
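
One way to explore the last question is to look for pairs of basic residues (K or R) in the precursor. The sketch below scans the P01258 sequence listed in the results section. Treating every K/R pair as a candidate processing site is an assumption made for illustration, not a rule stated in this tutorial, and the function name `dibasic_positions` is hypothetical.

```python
BASIC = set("KR")


def dibasic_positions(sequence):
    """Return 1-based positions of residues that start a pair of basic residues."""
    return [
        i + 1
        for i in range(len(sequence) - 1)
        if sequence[i] in BASIC and sequence[i + 1] in BASIC
    ]


# P01258 as listed in the results section of this tutorial.
P01258 = (
    "MGFQKFSPFLALSILVLLQAGSLHAAPFRSALESSPADPATLSEDEARLLLAALVQDYVQ"
    "MKASELEQEQEREGSSLDSPRSKRCGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAPGKKR"
    "DMSSDLERDHRPHVSMPQNAN"
)

for pos in dibasic_positions(P01258):
    # Show each pair with a few residues of context on both sides.
    start = max(pos - 4, 0)
    print(pos, P01258[start:pos + 4])
```

Comparing the reported positions with the ‘peptide’ and ‘propeptide’ annotations in the alignment view shows whether the basic pairs sit at the peptide boundaries.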


Scroll down to the ClustalW tree. Save the tree (cladogram).


Start Jalview (top of the page). Using the online documentation (“Help” button), explain what
conservation, quality and consensus are.


In the online documentation, look at the different color schemes available. By default, the Clustal
color scheme applies. Color the conserved amino acids using the ‘above identity threshold’
button in the ‘Color’ menu. Alternatively, color the amino acids according to the Blosum62
conservation score. Save the alignment.


In the “calculate” menu, calculate a tree by the Neighbour Joining method using Blosum62 as
the scoring matrix. In the “View” menu, show the “distances”. Save the tree (phylogram).
Compare this tree to the previous one.
Which are the most similar protein sequences ? Which is the most divergent one ?


A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny.
The branch lengths are proportional to the amount of inferred evolutionary change. A
cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the
branches are of equal length. Therefore, cladograms show common ancestry, but do not
indicate the amount of evolutionary "time" separating taxa.



For more information on Clustal, go to
http://www.clustal.org/#Documentation

An advanced tutorial is provided.

Jalview documentation is provided at :
http://www.jalview.org/help.html



In the Jalview window, focus on the secreted peptide sequence (85-121) using the “edit”
menu (select the zones to hide on the first line). Compare the tree obtained for this region to
the previous one. Which are the most similar peptide sequences ? Which is the most divergent one ?



C. Exploring a protein sequence file


Retrieve P01258 (CALC_HUMAN) by clicking on the accession number. On the top right,
click on the ‘text’ button to see the structure of the associated text file.


What are the functions of the two peptides derived from this protein ?


What are the post-translational modifications of the calcitonin peptide ? Is any variant
of the human calcitonin peptide known ?

When was the human calcitonin protein sequenced ?




Scroll down to the “Cross references” table and open the DQ080435 genomic DNA file. How
long is the calcitonin gene in the human genome ? Does it fit with the protein length ?


On which chromosome is the calcitonin gene ?


In the “Cross references” table, find the Pfam domain present in the calcitonin protein. How
many protein sequences were used for the seed alignment ? How many protein sequences are
now used in the full alignment for the hidden Markov model (HMM) ? How many NCBI sequences
have a score higher than the threshold ?


Which part of the protein 3D structure is available at the PDB database ?


D. Comparing protein sequences with BLAST


Go back to the P01258 (CALC_HUMAN) page. Switch from the “search” to the “blast” tab
and click on the “Blast” button. Look at the options chosen for the search. Select :

Database : …Human

Threshold : 0.1

Matrix : Blosum62

Filtering : filter low complexity regions

And run Blast. Click on the colored line to see the alignment. Store the accession number of
the human analogues found.

How many sequences are found (hits) ? Why is IAPP (P10997) absent ?

Why are two more sequences present (C9JS72 and D3DQX4) ?


Change the threshold to 1 and run Blast again. What do you think about the new sequence
found ?


Go to the P10997 (IAPP_HUMAN) page. Run “Blast” with the same parameters. How many
sequences are found (hits) ? Store the accession number of the human analogues found.


Note that Blasting from P10997 and from P01258 gives common hits. Go to one of these
common hits and run Blast again. Store the accession number of the human analogues found.

Select and align these sequences. Which part(s) of the protein are the most conserved ? Why ?



In Jalview, restrict the alignment to the secreted peptide part and calculate a tree by the
Neighbour Joining method, using Blosum62 as a scoring matrix. How many calcitonin
peptides are likely to be present in humans ?


Look at the functions of CALCA and CALCB on their UniProt pages. Do they validate your
conclusion ?




Working with protein sequences : results


Sequences in FASTA format



>P01258
MGFQKFSPFLALSILVLLQAGSLHAAPFRSALESSPADPATLSEDEARLLLAALVQDYVQ
MKASELEQEQEREGSSLDSPRSKRCGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAPGKKR
DMSSDLERDHRPHVSMPQNAN
>P10997
MGILKLQVFLIVLSVALNHLKATPIESHQVEKRKCNTATCATQRLANFLVHSSNNFGAIL
SSTNVGSNTYGKRNAVEVLKREPLNYLPL
>P06881
MGFQKFSPFLALSILVLLQAGSLHAAPFRSALESSPADPATLSEDEARLLLAALVQDYVQ
MKASELEQEQEREGSRIIAQKRACDTATCVTHRLAGLLSRSGGVVKNNFVPTNVGSKAFG
RRRRDLQA
>P10092
MGFRKFSPFLALSILVLYQAGSLQAAPFRSALESSPDPATLSKEDARLLLAALVQDYVQM
KASELKQEQETQGSSSAAQKRACNTATCVTHRLAGLLSRSGGMVKSNFVPTNVGSKAFGR
RRRDLQA



Clustal alignment




Green : hydrophobic amino acids

Red : negatively charged amino acids

Blue : positively charged amino acids


16 aa are conserved (+ initial M). The probability that 1 amino acid is conserved at a given
position among 4 protein sequences is taken as $1/(20 \times 20 \times 20 \times 20) \approx 10^{-5}$.
There are $C_{100}^{16}$ ways to have 16 aa conserved. The probability that 16 aa are conserved
by chance is therefore:

$$C_{100}^{16} \times [10^{-5}]^{16} \approx \exp(100\ln 100 - 16\ln 16 - 84\ln 84) \times 10^{-80} \approx 10^{19} \times 10^{-80} = 10^{-61}$$

using $\ln(n!) \approx n\ln(n) - n$
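
The estimate above can be checked numerically. The sketch below works with base-10 logarithms and compares Stirling's shortcut with the exact binomial coefficient. The per-position probability of 10^-5 is the simplifying value used in the text; counting a match on any of the 20 amino acids would give 1/20^3 instead, and the parameter `p` lets you try both. The function names are hypothetical.

```python
import math


def log10_binomial_exact(n, k):
    """log10 of the exact number of ways to choose k positions among n."""
    return math.log10(math.comb(n, k))


def log10_binomial_stirling(n, k):
    """Same quantity using ln(n!) ~ n ln(n) - n (the -n terms cancel out)."""
    ln_c = n * math.log(n) - k * math.log(k) - (n - k) * math.log(n - k)
    return ln_c / math.log(10)


def log10_chance_identity(n, k, p, stirling=False):
    """log10 of C(n, k) * p**k, the rough chance of k conserved positions."""
    if stirling:
        log_c = log10_binomial_stirling(n, k)
    else:
        log_c = log10_binomial_exact(n, k)
    return log_c + k * math.log10(p)


# Values from the text: 100 positions, 16 conserved, p ~ 10^-5 per position.
print(round(log10_binomial_stirling(100, 16), 2))
print(round(log10_binomial_exact(100, 16), 2))
print(round(log10_chance_identity(100, 16, 1e-5, stirling=True), 1))
print(round(log10_chance_identity(100, 16, 1e-5), 1))
```

With these inputs Stirling's shortcut puts the exponent near -61 and the exact binomial coefficient puts it near -62. Either way the probability is negligible, which is the point of the exercise.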



Clustal tree (cladogram)





Running JalView


Conservation : this numerical index measures the number of physico-chemical
properties conserved for each column of the alignment.

Quality : the quality score is inversely proportional to the average cost of all pairs of
mutations observed in a particular column of the alignment.

Consensus : the consensus displayed below the alignment is the percentage of the modal residue per
column. At each position along the sequence, the consensus sequence displays the most abundant
amino acid.
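
The consensus definition can be sketched as follows: for each column, take the most frequent residue and the percentage of rows that carry it. This is a simplified illustration, not Jalview's implementation: gaps are treated like any other character, and ties are resolved in favour of the residue seen first. The data are the first 20 residues of the four precursors from the results section, assumed here to align without gaps.

```python
from collections import Counter


def consensus(aligned):
    """Return the consensus string and the modal-residue percentage per column."""
    letters = []
    percentages = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        letters.append(residue)
        percentages.append(100.0 * count / len(column))
    return "".join(letters), percentages


# First 20 residues of each precursor, assumed here to align without gaps.
block = [
    "MGFQKFSPFLALSILVLLQA",  # P01258
    "MGILKLQVFLIVLSVALNHL",  # P10997
    "MGFQKFSPFLALSILVLLQA",  # P06881
    "MGFRKFSPFLALSILVLYQA",  # P10092
]

consensus_sequence, modal_percent = consensus(block)
print(consensus_sequence)
print(modal_percent)
```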




Clustal tree (phylogram)




Peptide alignment



Peptide tree





Genomic DNA




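length : 4000 bp

This is much longer than the 3 x 141 = 423 nucleotides needed to encode the 141 amino acid
protein : the gene contains introns.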


PDB 2JXZ calcitonin peptide structure





Pfam entry PF00214

Seed alignment : 13

Full alignment : 148

NCBI sequences with score > threshold : 985


BLAST

109 hits

6 hits in Homo sapiens.



The initial search was performed as a text search. The protein files were searched for the
words “human” and “calcitonin”. In the Blast search, protein sequences are compared to the
query sequence.


Blast from P01258 (CALC_HUMAN)

P01258

P01258-2

C9JS72

P06881

P10092

D3DQX4


With threshold 1


ankyrin : a very long and repetitive protein. The alignment is not significant.


Blast from P10997 (IAPP_HUMAN)


P10997

P10092

D3DQX4

P06881


Blast from P06881 (CALCA_HUMAN)

P06881

P10092

D3DQX4

P01258-2

P01258

C9JS72

P10997
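
The three hit lists above can be compared with simple set operations. The sketch below copies the accession numbers from the lists and reports the hits shared by the first two searches, plus the full set collected over the three searches. The variable names are hypothetical.

```python
# Hit lists copied from the three Blast searches above.
hits = {
    "P01258": ["P01258", "P01258-2", "C9JS72", "P06881", "P10092", "D3DQX4"],
    "P10997": ["P10997", "P10092", "D3DQX4", "P06881"],
    "P06881": ["P06881", "P10092", "D3DQX4", "P01258-2", "P01258", "C9JS72", "P10997"],
}

# Hits found both when starting from calcitonin and when starting from IAPP.
common = set(hits["P01258"]) & set(hits["P10997"])

# Every accession seen in at least one search.
family = set().union(*hits.values())

# Accessions that the search from P01258 alone would have missed.
missed_from_calcitonin = family - set(hits["P01258"])

print(sorted(common))
print(sorted(family))
print(sorted(missed_from_calcitonin))
```

Starting the search again from a common hit is what brings the calcitonin and IAPP hit lists together into one set.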


The signal peptide and the propeptide sequences are well conserved. This is important for the
protein processing. For instance, it is likely that IAPP and calcitonins are imported in the ER
in a similar manner. The peptide is also likely to be activated by similar enzymes. The
secreted peptides are different, which makes sense.




Alignment of


P01258

P01258-2

C9JS72

P06881

P10092

D3DQX4

P10997


Whole protein tree :


Peptide tree :



Again, there is a discrepancy between the whole protein and the secreted peptide phylogeny.
As a protein, IAPP is different from calcitonins, but the secreted IAPP is close to CALCA and
CALCB. This is confirmed by the peptide alignment.
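
The closeness of the secreted peptides can be illustrated by counting identical positions. The three 37-residue segments below are taken from the precursor sequences in the results section. Their boundaries (right after a K/R pair and just before a G followed by basic residues) are my reading of those sequences, not values given in this tutorial. Calcitonin is left out because it has a different length and would need a gapped alignment.

```python
from itertools import combinations


def identical_positions(a, b):
    """Count identical residues between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


# Assumed peptide segments cut out of the precursors in the results section.
peptides = {
    "CALCA (P06881)": "ACDTATCVTHRLAGLLSRSGGVVKNNFVPTNVGSKAF",
    "CALCB (P10092)": "ACNTATCVTHRLAGLLSRSGGMVKSNFVPTNVGSKAF",
    "IAPP (P10997)": "KCNTATCATQRLANFLVHSSNNFGAILSSTNVGSNTY",
}

for (name_a, seq_a), (name_b, seq_b) in combinations(peptides.items(), 2):
    same = identical_positions(seq_a, seq_b)
    print(f"{name_a} vs {name_b}: {same}/{len(seq_a)} identical")
```

CALCA and CALCB differ at only three positions, while IAPP shares its N-terminal and C-terminal stretches with both.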




Groups labelled on the tree : calcitonins, IAPP, CGRP



Peptide alignment :



Two groups are apparent : the CALC sequences and the IAPP + CALCA/B sequences,
characterized by the N-terminus : CGNLSTC and XCXTATC, respectively. The IAPP is
slightly different from the three others.
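
The two N-terminal signatures can be written as regular expressions, with X read as "any residue". The sketch below classifies the first ten residues of each secreted peptide. Those starts are cut from the precursor sequences in the results section right after the K/R pair, which is an assumption about where each peptide begins. The group labels and the function name `classify` are hypothetical.

```python
import re

CALC_START = re.compile(r"^CGNLSTC")
# 'X' in the text stands for any residue, written '.' in the pattern.
IAPP_CGRP_START = re.compile(r"^.C.TATC")


def classify(peptide):
    """Assign a peptide to a group according to its N-terminal signature."""
    if CALC_START.match(peptide):
        return "CALC"
    if IAPP_CGRP_START.match(peptide):
        return "IAPP + CALCA/B"
    return "unclassified"


# Assumed first ten residues of each secreted peptide.
peptide_starts = {
    "P01258": "CGNLSTCMLG",
    "P06881": "ACDTATCVTH",
    "P10092": "ACNTATCVTH",
    "P10997": "KCNTATCATQ",
}

groups = {acc: classify(start) for acc, start in peptide_starts.items()}
for acc, group in groups.items():
    print(acc, group)
```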


The CALCA and CALCB are indeed peptide hormones that induce vasodilatation. Their
function is therefore distinct from that of CALC, despite their name !


In conclusion, there are three peptide classes, which is clearly visible on the tree and on the
alignment.


1. Calcitonins (CALC) : Calcitonin causes a rapid but short-lived drop in the level of calcium
and phosphate in blood by promoting the incorporation of those ions in the bones.

2. Calcitonin Gene-Related Peptide (CGRP) induces vasodilation. It dilates a variety of vessels
including the coronary, cerebral and systemic vasculature. Its abundance in the CNS also
points toward a neurotransmitter or neuromodulator role. It also elevates platelet cAMP.

3. Islet Amyloid PolyPeptide (IAPP or Amylin) selectively inhibits insulin-stimulated glucose
utilization and glycogen deposition in muscle, while not affecting adipocyte glucose
metabolism. IAPP is the peptide subunit of amyloid found in pancreatic islets of type 2
diabetic patients and in insulinomas.


There are therefore 3 bona fide human calcitonins.