Sequence comparisons and databases
This assignment is heavily based on a previous one made by Bengt Persson, LiU
BLAST and FASTA
Xeroderma Pigmentosum (XP) is a heterogeneous group of genetically determined skin disorders due
to sensitivity to ultraviolet light. The sun-exposed areas of the skin have a strong tendency to
develop tumours. The median age of onset of the first skin neoplasm in these patients is 8 years, as
compared to 50 years for sporadic skin tumour cases. The causes are different genetic defects of the
DNA repair system. The cell uses nucleotide excision repair to remove so-called bulky lesions, such as
UV-induced DNA damage. Several enzymes are involved in this type of DNA repair. In XP patients,
mutations have been found in at least seven different genes coding for such enzymes.
Links: For more information about the disease Xeroderma Pigmentosum, search OMIM
for number 278700, corresponding to the XPA gene.
BLAST is a local alignment tool found at the NCBI website. You may find the Query
tutorial, BLAST tutorial and More information useful. Also read the overview and FAQs at the
web site (http://www.ncbi.nlm.nih.gov/BLAST/).
If you have problems using BLAST, first look at the Frequently Asked Questions. I also
recommend that you read the chapter about pairwise alignment techniques in the course
literature. Please check the following things:
In some of the BLAST search modes (and also other programs you will use in this course) you
must load the sequence in the FASTA format.
Choose the correct database, usually nr (nr = non-redundant).
If the search is taking a very long time, you can use the option to get your results by e-mail. This
does not work with BLAST 2 Sequences.
Please take your time in getting acquainted with BLAST. You will have great use of it in this course,
and most likely in your future work!
Now, when you have gone through the general introduction to
BLAST, you are ready to answer some questions. Please use your own words to formulate
your answers.
What is the difference between global and local alignment?
Describe the FASTA format.
blastn is used to compare a DNA or RNA sequence against a database of nucleotide sequences.
Describe the functions of the other modes (or programs) you can choose.
What is an E-value?
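As a rough illustration of the global/local distinction asked about above, the sketch below scores two sequences with a simple dynamic programming scheme. It is not how BLAST or FASTA work internally (both use faster heuristics), and the scoring values (match +1, mismatch -1, gap -1) and the function name alignment_score are assumptions chosen only for this example.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score of a and b under a simple linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, only the best-matching segments count.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the very start as well.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


long_seq, short_seq = "AAAGGGTTT", "GGG"
print("global:", alignment_score(long_seq, short_seq, local=False))
print("local: ", alignment_score(long_seq, short_seq, local=True))
```

With these toy sequences the short one matches a segment of the long one perfectly, so the local score is high, while the global score is pulled down by the gaps needed to cover the rest of the long sequence.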
There are two main tools for sequence comparisons: BLAST and FASTA. Make sequence
comparisons using both tools with the following sequence as query sequence. In both cases, make a
search against the Swissprot database. Report the top 5 hits using each tool.
MNDLSGKTVI ITGGARGLGA EAARQAVAAG ARVVLADVLD EEGAATAREL
VTIEEDWQRV VAYAREEFGS VDGLVNNAGI STGMFLETES VERFRKVVDI NLTGVFIGMK
TVIPAMKDAG GGSIVNISSA AGLMGLALTS SYGASKWGVR GLSKLAAVEL GTDRIRVNSV
HPGMTYTPMT AETGIRQGEG NYPNTPMGRV GNEPGEIAGA VVKLLSDTSS YVTGAELAVD
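The query above is printed in blocks of ten residues, whereas the search forms expect FASTA input. A minimal sketch of the conversion follows; the function name to_fasta and the header text query_1 are made up for this example, and only the first printed line of the query is used as input.

```python
def to_fasta(header: str, raw_sequence: str, width: int = 60) -> str:
    """Return a FASTA record: a '>' header line followed by wrapped sequence lines."""
    # Keep letters only, which drops spaces, line breaks and any position counters.
    residues = "".join(ch for ch in raw_sequence if ch.isalpha()).upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([f">{header}"] + lines)


# First line of the query sequence as printed in the handout.
raw = "MNDLSGKTVI ITGGARGLGA EAARQAVAAG ARVVLADVLD EEGAATAREL"
print(to_fasta("query_1", raw))
```

The resulting text can be pasted directly into a BLAST or FASTA query box.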
Now, we shall look at the difference between searching with the complete sequence and searching with only a
short segment of the sequence, i.e. the difference between global and local alignment. The test case is
the sequence below, which you will compare with all sequences in the Swissprot database.
MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60
QQCVIMAENR KSSIIIRMRD VVLFEKKVYL SECKTGNGKN YRGTMSKTKN 120
GITCQKWSST SPHRPRFSPA THPSEGLEEN YCRNPDNDPQ GPWCYTTDPE KRYDYCDILE 180
CEEECMHCSG ENYDGKISKT MSGLECQAWD SQSPHAHGYI PSKFPNKNLK KNYCRNPDRE 240
LRPWCFTTDP NKRWELCDIP RCTTPPPSSG PTYQCLKGTG ENYRGNVAVT VS
AQTPHTHNRT PENFPCKNLD ENYCRNPDGK RAPWCHTTNS QVRWEYCKIP SCDSSPVSTE 360
QLAPTAPPEL TPVVQDCYHG DGQSYRGTSS TTTTGKKCQS WSSMTPHRHQ KTPENYPNAG 420
LTMNYCRNPD ADKGPWCFTT DPSVRWEYCN LKKCSGTEAS VVAPPPVVLL PDVETPSEED 480
CMFGNGKGYR GKRATTVTGT PCQDWAAQEP HRHSIFTPET NPRAGLEKNY CRNPDGDVGG 540
PWCYTTNPRK LYDYCDVPQC AAPSFDCGKP QVEPKKCPGR VVGGCVAHPH SWPWQVSLRT 600
RFGMHFCGGT LISPEWVLTA AHCLEKSPRP SSYKVILGAH QEVNLEPHVQ EIEVSRLFLE 660
PTRKDIALLK LSSPAVITDK VIPACLPSPN YVVADRTECF ITGWGETQGT FGAGLLKEAQ 720
N RYEFLNGRVQ STELCAGHLA GGTDSCQGDS GGPLVCFEKD KYILQGVTSW 780
GLGCARPNKP GVYVRVSRFV TWIEGVMRNN 810
a) Start by doing a search using the complete sequence. Store these results (e.g. in "emacs",
"notepad" or any word processor) so that you can compare
them with the results in b) and c) below.
b) Do a search using only the segment between positions 121 and 240 as the query sequence.
c) Finally, do a third search using only the segment 601
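Cutting out a segment by hand is error-prone, so a small helper can be useful. The sketch below removes the spaces and position counters from the printed sequence and extracts residues by 1-based, inclusive positions (so 121 to 240 means 120 residues). The function names are invented for this example. It is worth checking that the cleaned length agrees with the printed counters before relying on the positions.

```python
def clean_sequence(numbered_text: str) -> str:
    """Strip spaces, line breaks and position counters, keeping residue letters only."""
    return "".join(ch for ch in numbered_text if ch.isalpha()).upper()


def segment(sequence: str, start: int, end: int) -> str:
    """Return residues start..end, counted from 1 and inclusive at both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("positions must satisfy 1 <= start <= end <= len(sequence)")
    return sequence[start - 1:end]


# First numbered line of the test sequence as printed in the handout.
line = "MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60"
seq = clean_sequence(line)
print(len(seq))
print(segment(seq, 1, 10))
```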
Look at the top 20 results from these three comparisons.
Are they identical or not?
Which are at the top?
Are any sequences found in one search that were not found in another?
If so, could you find an explanation for this?
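If the top hits from each search are saved as lists of identifiers, the comparison asked for above can be sketched as follows. The identifiers HIT_A to HIT_D are placeholders, not real database entries, and the function name compare_hits is made up for this example.

```python
def compare_hits(full_hits, segment_hits):
    """Summarise how two ranked hit lists overlap."""
    full, seg = set(full_hits), set(segment_hits)
    return {
        "same_order": list(full_hits) == list(segment_hits),
        "shared": sorted(full & seg),
        "only_in_full": sorted(full - seg),
        "only_in_segment": sorted(seg - full),
    }


# Placeholder identifiers standing in for the saved top hits of two searches.
full_search = ["HIT_A", "HIT_B", "HIT_C"]
segment_search = ["HIT_B", "HIT_A", "HIT_D"]
summary = compare_hits(full_search, segment_search)
for key, value in summary.items():
    print(key, value)
```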
Because of the high rate of data production and the need for researchers to have rapid access to new
data, public databases have become the major medium through which genome sequence data are
published. Public databases and the data services that support them are important resources in
bioinformatics, and will soon be essential sources of information for all the molecular biosciences.
Ensembl and Genbank are two major nucleotide databases, whereas Uniprot (Swiss-Prot) is the
database for protein sequences.
BLAST can also be used to search for hits in a specific database. We will use a BLAST example to
get acquainted with two very important databases: Ensembl for genome/DNA information and
UniProt (Swiss-Prot) for protein information.
Ensembl is a database of fully sequenced eukaryotic species. Click "View full list of all Ensembl
species" and look at all the little pictures of species. Remember that human was the first vertebrate to
be fully sequenced and a first draft was released as late as
How many genomes (species) are available through Ensembl right now? For fun, check this in
a year or so and compare!
Click "BLAST/BLAT" at the top and search Ensembl for the following protein sequence (a naturally
occurring human fusion protein).
Hint: search against the peptide database. Make sure you search the human genome.
The fusion protein itself is not present in the regular genome, but you get many high-scoring hits to
two parts. It seems that there has been a fusion of genetic elements from two different
chromosomes.
Which chromosomes have been fused?
At what residue number has the fusion occurred?
There are many hits with almost identical scores. By clicking on the results, looking at the gene
entries etc., can you figure out why?
There are also a number of hits with lower, but still highly significant, scores to genes on other
chromosomes.
Hint: Start at the top, you don't need to look at all of them to come to a conclusion.
UniProt is a high-quality protein database. Its manually curated part, Swiss-Prot, is continuously
extended to contain the experimental results and conclusions of scientific publications.
Click BLAST and run BLAST against the UniProt database.
What disease is the fusion protein associated with?
What is the function and subcellular location of the proteins that have generated this fusion?
What domains do these proteins contain?
Abnormal fusion of different chromosomes is called "translocation" and is the cause of a number of
diseases.
Sequence Retrieval System
Biological databases are built by different teams, in different locations, for different purposes, and
using different data models and supporting database management systems. However, biological
data are more valuable when interconnected than when isolated. The popularity of integrated
database systems reflects the need for querying interrelated datasets, rather than isolated databases. The
advantages of physical integration are that queries can be executed rapidly because all data are
located in one place, and the user sees a homogeneous, integrated data source.
An advantage of using SRS for accessing a database is that more precise search criteria may be
used than those built into the search engine of the database itself.
Use SRS (Sequence Retrieval System) to solve problems 1-19 through 1-26. Mark the
database/databases you wish to include in your search, then use the standard query form.
“All text” is a good search criterion to start out with, but not a very specific one. Be sure to try
more specific ones as well, so that you discover the differences. Note that species names are
entered in their Latin forms. Don’t forget to “plug in” all the databases / the specific
database you want to include in your searches.
Search Swissprot to find how many entries there are for the following organisms. What search
word and criterion did you use?
Accession no. P20153 leads to
How long is the coding sequence for cytochrome b in woolly mammoth?
Use SRS to find in Swissprot all protein sequences of human hydroxysteroid dehydrogenases.
Hydroxysteroid dehydrogenases are enzymes participating in the metabolism of steroids.
Dehydrogenases catalyse oxidation reactions.
Some hydroxysteroid dehydrogenases have a prefix, like 17beta-. Thus, you should use
a wildcard before the word hydroxysteroid in order to catch all sequences.
Describe how you search for hydroxysteroid dehydrogenases in Swissprot. How many did you
find?
For which of these Swissprot sequences are the three-dimensional structures known (in the
PDB database)?
Hint: Use the link function in SRS.
N-terminal acetylation is a common post-translational modification of proteins. Describe how
you search for all acetylated human proteins in Swissprot. How many did you find? (Should be in the
You could start with an "All text" search in order to find at least one protein that is acetylated.
By examining the Swissprot entry, you will find out how the N-terminal modification acetylation is
annotated in the Swissprot database.
Questions and solutions should be emailed to Jan by February 28 at the latest.