Working with protein sequences : a bioinformatics tutorial
A. Retrieving protein sequences
Connect to UniProt (Universal Protein Resource) :
Search the Protein Knowledgebase (UniProtKB)
for “human calcitonin” (query)
How many results are given ?
Look at the first 10. Why were they selected ?
Modify your query to get only human genes. A simple possibility is to use the restrictions :
Restrict term "human" to organism
Restrict term "calcitonin" to protein family
How has the query been modified ?
How many protein sequences do you get ? Select them in the result table. A green banner
opens at the bottom of your screen.
More information on text search can be found by clicking on “Documentation/help”, then “text
Click on “retrieve” in the green banner. You are switching from the “search” to the “retrieve”
tab. Note that the UniProt identifiers are now written in the query frame. Several formats are
offered for opening or downloading the data.
Open the data in the FASTA format. This is the standard format to exchange protein
sequences. The information line begins with a ‘>’. The rest of the
characters are interpreted as
Save the sequences. You can either download the file and open it with a text editor such as
Word, or copy the sequences from the window and paste them into Word. Use a non-proportional
(fixed-width) font such as “Courier” to visually maintain the alignment.
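The FASTA layout described above (an information line starting with ‘>’, followed by sequence lines) can also be read by a short program. The following is a minimal sketch in Python, not a full parser. The function name parse_fasta and the two toy records are invented for illustration and are not real UniProt entries.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new information line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next '>' line.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy records invented for illustration; NOT real UniProt entries.
EXAMPLE = """>toy1 hypothetical record
MKTAYIAK
QRQISFVK
>toy2 another hypothetical record
MKTAYLAK
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(EXAMPLE):
        print(header, len(sequence))
```

To use it on the file saved in this step, read the file contents into a string and pass it to parse_fasta.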
B. Aligning protein sequences
Click on “align” in the green banner. You are switching from the “search” to the “align” tab.
Note that the protein sequences in FASTA format are now written in the query frame.
Scroll down to the ClustalW results. Clustal is a multiple alignment program that can run in
command-line (ClustalW) or graphical mode (ClustalX). Clustal results can also be displayed
in a Java Applet Window (JalView) that will be used later.
Identify the amino acids that are strictly conserved in the four sequences (marked by a *).
Using the color coding displayed in the “amino acid properties” menu, what are their
physico-chemical characteristics ? Save the alignment as a PDF file using the “Print” button.
How many amino acids are strictly conserved ? Making simple assumptions, estimate the
probability that four 100 amino acid long protein sequences have such a percentage of identity.
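The ‘*’ marks under a Clustal alignment flag the columns where every sequence carries the same residue. As a rough sketch of that rule (not Clustal's own code), the Python below recomputes the marks from aligned sequences of equal length. The function names and the toy alignment are hypothetical, and gaps are assumed to be written as ‘-’.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where all sequences share one residue."""
    if not aligned or len({len(seq) for seq in aligned}) != 1:
        raise ValueError("expected a non-empty list of equal-length aligned sequences")
    columns = []
    for index, column in enumerate(zip(*aligned)):
        residues = set(column)
        # Strictly conserved: a single residue type and no gap in the column.
        if len(residues) == 1 and "-" not in residues:
            columns.append(index)
    return columns


def star_line(aligned):
    """Build the marker line: '*' under conserved columns, ' ' elsewhere."""
    conserved = set(conserved_columns(aligned))
    return "".join("*" if i in conserved else " " for i in range(len(aligned[0])))


# Toy alignment invented for illustration; NOT the calcitonin sequences.
TOY_ALIGNMENT = ["MKT-AYI", "MKTLAYV", "MRT-AYI", "MKTLAFI"]

if __name__ == "__main__":
    for seq in TOY_ALIGNMENT:
        print(seq)
    print(star_line(TOY_ALIGNMENT))
```

Counting the ‘*’ characters and dividing by the alignment length gives the percentage of strictly conserved positions asked for above.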
Color the sequences according to “Sequence annotation” : click on ‘peptide’, ‘propeptide’ and
‘signals’. These proteins are indeed precursors of secreted peptides. The “signal sequence”
targets the protein to the secretory pathway and is cleaved in the endoplasmic reticulum. The
“propeptide” sequence is cleaved when the protein is secreted. It often prevents the peptide
activity from being unleashed prematurely. How many peptides are produced from the P01258
protein ? What are the characteristics of the amino acids that surround the secreted peptides ?
Scroll down to the ClustalW tree. Save the tree.
Start Jalview (top of the page). Using the online documentation (“Help” button), explain what
conservation, quality and consensus are.
In the online documentation, look at the different color schemes available. By default, the Clustal
color scheme applies. Color the conserved amino acids using the ‘above identity threshold’
button in the ‘Color’ menu. Alternatively, color the amino acids according to the Blosum62
conservation score. Save the alignment.
In the “calculate” menu, calculate a tree by the Neighbour Joining method using Blosum62 as
a scoring matrix. In the “View” menu, show the “distances”. Save the tree (phylogram).
Compare this tree to the previous one.
Which are the most similar protein sequences ? Which is the most divergent one ?
A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny.
The branch lengths are proportional to the amount of inferred evolutionary change. A
cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the
branches are of equal length. Therefore, cladograms show common ancestry, but do not
indicate the amount of evolutionary "time" separating taxa.
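The difference between the two kinds of tree can be made concrete with the Newick text format, which many tree programs can export. This is a small sketch, assuming the tree is available as a Newick string. The tree and its labels A to D are hypothetical, and the regular expression only handles plain numeric branch lengths.

```python
import re

# Matches a branch length such as ":0.05" or ":1e-3" in a Newick string.
BRANCH_LENGTH = re.compile(r":([0-9.eE+-]+)")


def branch_lengths(newick):
    """Return the branch lengths of a phylogram, in order of appearance."""
    return [float(value) for value in BRANCH_LENGTH.findall(newick)]


def to_cladogram(newick):
    """Drop all branch lengths, keeping only the branching order (topology)."""
    return BRANCH_LENGTH.sub("", newick)


# Hypothetical phylogram with four leaves; the labels are placeholders.
PHYLOGRAM = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"

if __name__ == "__main__":
    print(branch_lengths(PHYLOGRAM))
    print(to_cladogram(PHYLOGRAM))
```

The cladogram string keeps who groups with whom, while the amount of change along each branch is lost. This is why the two trees saved in this section can share the same shape and still look different.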
For more information on Clustal, go to
An advanced tutorial is provided.
Jalview documentation is provided at :
In the Jalview window, focus on the secreted peptide sequen
121) using the “edit”
menu (select the zones to hide on the first line). Compare this tree to the previous one.
Which are the most similar peptide sequences ? Which is the most divergent one ?
C. Exploring a protein sequence file
Open the P01258 (CALC_HUMAN) entry by clicking on the accession number. On the top right,
click on the ‘text’ button to see the structure of the associated text file.
What are the functions of the two peptides derived from this protein ?
What are the post-translational modifications of the calcitonin peptide ? Is any variant known
of the human calcitonin peptide ?
When was the human calcitonin protein sequenced ?
Scroll down to the “Cross references” table and open the DQ080435 genomic DNA file. How
long is the calcitonin gene in the human genome ? Does it fit with the protein length ?
On which chromosome is the calcitonin gene ?
In the “Cross references” table, find the Pfam domain present in the calcitonin protein. How
many protein sequences were used for the seed alignment ? How many protein sequences are
now used as a full alignment for the hidden Markov model (HMM) ? How many NCBI sequences
have a score higher than the threshold ?
Which part of the protein 3D structure is available at the PDB database ?
D. Comparing protein sequences with BLAST
Go back to the P01258 (CALC_HUMAN) page. Switch from the “search” to the “blast” tab
and click on the “Blast” button. Look at the options chosen for the search. Select :
Database : …Human
Threshold : 0.1
Matrix : Blosum62
Filtering : filter low complexity regions
And run Blast. Click on the colored line to see the alignment. Store the accession number of
the human analogues found.
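The effect of the threshold can be pictured as a simple cutoff on the hit list. The sketch below assumes that the “Threshold” option is an E-value cutoff (hits with an E-value above it are discarded). The accession names and E-values are invented placeholders, not results of the search above.

```python
def filter_hits(hits, threshold):
    """Keep the accessions whose E-value is at or below the threshold.

    hits is a list of (accession, e_value) pairs; best hits come first.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    return [accession for accession, e_value in ranked if e_value <= threshold]


# Placeholder hits invented for illustration; NOT real Blast output.
TOY_HITS = [
    ("HIT_C", 0.5),
    ("HIT_A", 1e-50),
    ("HIT_D", 4.0),
    ("HIT_B", 3e-4),
]

if __name__ == "__main__":
    print(filter_hits(TOY_HITS, 0.1))
    print(filter_hits(TOY_HITS, 1))
```

Raising the threshold can only add hits to the list, and the extra ones are the least significant, so they deserve a closer look at the alignment itself.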
How many sequences are found (hits) ? Why is IAPP (P10997) absent ?
Why are two more sequences present (C9JS72 and D3DQX4) ?
Change the threshold to 1 and run Blast again. What do you think about the new sequence ?
Go to the
page. Run “Blast” with the same parameters. How many
sequences are found (hits) ? Store the accession number of the human analogues found.
Note that Blasting from P10997 and from P01258 gives common hits. Go to one of these
common hits and run Blast again. Store the accession number of the human analogues found.
Select and align these sequences. Which part(s) of the protein are the most conserved ? Why ?
In Jalview, restrict the alignment to the secreted peptide part and calculate a tree by the
Neighbour Joining method, using Blosum62 as a scoring matrix. How many calcitonin
peptides are likely to be present in humans ?
Look at the CALCA and CALCB function in their UniProt page. Does it validate your
Working with protein sequences : results
Sequences in FASTA format
Green : hydrophobic amino acids
Red : negatively charged amino acids
Blue : positively charged amino acids
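16 aa are conserved (+ initial M). The probability that 1 amino acid is conserved at a given
position among 4 protein sequences is $1/(20 \times 20 \times 20 \times 20) \approx 10^{-5}$. There are $C_{100}^{16}$
ways to have 16 aa conserved. The probability that 16 aa are conserved by chance is therefore :
$$P \approx C_{100}^{16} \times (10^{-5})^{16} \approx e^{100\ln 100 - 16\ln 16 - 84\ln 84} \times 10^{-80}$$
using $\ln(n!) \approx n\ln n - n$.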
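The estimate above can be checked numerically. The sketch below is one possible reading of the calculation, not a definitive model: it multiplies the number of ways to choose the conserved positions by the per-column probability raised to the number of conserved columns, and it ignores the non-conserved columns, as the text does. The function names are hypothetical. Note, as an added remark, that 1/20^4 is the chance that all four sequences carry one specified residue; if any residue may be the shared one, the per-column probability becomes 1/20^3, which the specific_residue flag lets you explore.

```python
import math


def stirling_ln_binomial(n, k):
    """ln C(n, k) with the crude approximation ln(n!) ~ n ln n - n (0 < k < n)."""
    # The "- n" terms cancel out: -n + k + (n - k) = 0.
    return n * math.log(n) - k * math.log(k) - (n - k) * math.log(n - k)


def log10_chance_conservation(length, n_conserved, n_seqs,
                              alphabet=20, specific_residue=True):
    """log10 of C(length, n_conserved) * p ** n_conserved."""
    # p = 1/alphabet**n_seqs, as in the text (one specified residue everywhere).
    per_column = alphabet ** (-n_seqs)
    if not specific_residue:
        # Assumption added here: any of the residues may be the shared one.
        per_column *= alphabet
    ways = math.comb(length, n_conserved)
    return math.log10(ways) + n_conserved * math.log10(per_column)


if __name__ == "__main__":
    print("Stirling ln C(100,16):", stirling_ln_binomial(100, 16))
    print("log10 P, as in the text:", log10_chance_conservation(100, 16, 4))
    print("log10 P, any shared residue:",
          log10_chance_conservation(100, 16, 4, specific_residue=False))
```

Whichever variant is used, the result is many orders of magnitude below anything that could occur by chance, which is the point of the exercise.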
Clustal tree (cladogram)
Conservation : this numerical index measures the number of physico-chemical
properties conserved for each column of the alignment.
Quality : the quality score is inversely proportional to the average cost of all pairs of
mutations observed in a particular column of the alignment.
Consensus : the consensus displayed below the alignment is the percentage of the modal residue per
column. At each position along the sequence, the consensus is the most abundant residue.
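The consensus definition above (modal residue and its percentage per column) is simple enough to recompute by hand. The following is a minimal sketch of that definition only, not of Jalview's implementation. The function name and the toy alignment are hypothetical, and gap handling and ties are deliberately left out.

```python
from collections import Counter


def consensus(aligned):
    """Return one (modal_residue, percentage) pair per alignment column."""
    result = []
    for column in zip(*aligned):
        # most_common(1) gives the most frequent residue and its count.
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, 100.0 * count / len(column)))
    return result


# Toy gap-free alignment invented for illustration; NOT the calcitonin sequences.
TOY_ALIGNMENT = ["MKTAYI", "MKTAYV", "MRTAYI", "MKTAFI"]

if __name__ == "__main__":
    for position, (residue, percent) in enumerate(consensus(TOY_ALIGNMENT), start=1):
        print(position, residue, percent)
```

Columns at 100 % are the strictly conserved ones; the lower the percentage, the more variable the position.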
Clustal tree (phylogram)
Gene length : 4000 bp
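To compare this length with the protein length, recall that each amino acid is encoded by three nucleotides, plus one stop codon. The sketch below computes the minimal coding length under that rule. The function name is hypothetical, and the value of 141 residues for the precursor is an assumption to be checked against the UniProt entry.

```python
def min_coding_length_bp(n_residues, include_stop=True):
    """Minimal coding sequence length in bp: 3 nucleotides per codon."""
    codons = n_residues + (1 if include_stop else 0)
    return 3 * codons


# Assumed precursor length (check the "Sequence" section of the UniProt entry).
ASSUMED_PROTEIN_LENGTH = 141
GENE_LENGTH_BP = 4000  # value given in these results

if __name__ == "__main__":
    coding = min_coding_length_bp(ASSUMED_PROTEIN_LENGTH)
    print("coding length (bp):", coding)
    print("fraction of the gene:", coding / GENE_LENGTH_BP)
```

Under these assumptions the coding part is only a small fraction of the gene; the remainder is non-coding sequence such as introns and untranslated regions.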
PDB 2JXZ calcitonin peptide structure
Pfam entry PF00214
Seed alignment : 13
Full alignment : 148
NCBI sequences with score > threshold : 985
6 hits in
The initial search was performed as a text search. The protein files were searched for the
words “human” and “calcitonin”. In the blast search, protein sequences are compared to t
Blast from P01258 (CALC_HUMAN)
With threshold 1
ankyrin : a very long and repetitive protein. The alignment is not
Blast from P06881 (CALCA_HUMAN)
The signal peptide and the propeptide sequences are well conserved. This is important for
protein processing. For instance, it is likely that IAPP and calcitonins are imported into the ER
in a similar manner. The peptide is also likely to be activated by similar enzymes. The
secreted peptides are different, which makes sense.
Whole protein tree :
Peptide tree :
Again, there is a discrepancy between the whole protein and the secreted peptide phylogeny.
As a protein, IAPP is different from calcitonins, but the secreted IAPP is close to CALCA and
CALCB. This is confirmed by the peptide alignment.
Peptide alignment :
Two groups are apparent : the CALC sequences and the IAPP + CALCA/B sequences,
characterized by the N-terminus : CGNLSTC and XCXTATC, respectively. The IAPP is
slightly different from the three others.
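The two N-terminal signatures can be written as regular expressions, reading X as “any residue”. This is a small sketch under that assumption. The group labels, the function name and the test fragments (a motif followed by a placeholder tail) are hypothetical and are not the real peptide sequences.

```python
import re

# Motifs quoted from the text; "X" is read as "any residue" (regex ".").
CALC_MOTIF = re.compile(r"^CGNLSTC")
CGRP_IAPP_MOTIF = re.compile(r"^.C.TATC")


def classify_peptide(sequence):
    """Assign a peptide to a group from its N-terminal motif."""
    if CALC_MOTIF.match(sequence):
        return "CALC group"
    if CGRP_IAPP_MOTIF.match(sequence):
        return "IAPP + CALCA/B group"
    return "unclassified"


# Hypothetical fragments: a motif from the text plus a placeholder tail.
TOY_PEPTIDES = ["CGNLSTC" + "AAAA", "ACDTATC" + "AAAA", "MKTAYIAK"]

if __name__ == "__main__":
    for peptide in TOY_PEPTIDES:
        print(peptide, "->", classify_peptide(peptide))
```

Pasting the secreted peptide sequences from the alignment into TOY_PEPTIDES shows which group each one falls into.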
The CALCA and CALCB are indeed peptide hormones that induce vasodilatation. Their
function is therefore distinct from that of CALC, despite their name !
In conclusion, there are three peptide classes, which is clearly visible on the tree and on the
1. Calcitonins (CALC) :
Calcitonin causes a rapid but short-lived drop in the level of calcium
and phosphate in blood by promoting the incorporation of those ions in the bones.
2. Calcitonin Gene-Related Peptide (CGRP) induces vasodilation. It dilates a variety of vessels
including the coronary, cerebral and systemic vasculature. Its abundance in the CNS also
points toward a neurotransmitter or neuromodulator role. It also elevates platelet cAMP.
3. Islet Amyloid PolyPeptide (IAPP or Amylin)
selectively inhibits insulin-stimulated glucose
utilization and glycogen deposition in muscle, while not affecting adipocyte glucose
metabolism. IAPP is the peptide subunit of amyloid found in pancreatic islets of type 2
diabetic patients and in insulinomas.
There are therefore 3