Assignment 1 - Log in to PING PONG

Assignment 1 - Sequence comparisons and databases


This assignment is heavily based on a previous one made by Bengt Persson, LiU


BLAST and FASTA


Xeroderma Pigmentosum (XP) is a heterogeneous group of genetically determined skin disorders caused by unusual sensitivity to ultraviolet light. The sun-exposed areas of the skin have a strong tendency to develop tumors. The median age of onset of the first skin neoplasm in these patients is 8 years, compared with 50 years for sporadic skin tumor cases. The causes are different genetic defects of the DNA repair system. The cell uses nucleotide excision repair to remove so-called bulky lesions, typically UV-induced DNA damage. Several enzymes are involved in this type of DNA repair. In XP patients, mutations have been found in at least seven different genes coding for such enzymes.


Links: For more information about the disease Xeroderma Pigmentosum, search the OMIM database for number 278700, corresponding to the XPA gene.


BLAST is a local alignment tool found at the NCBI website. You may find the Query tutorial, BLAST tutorial and More information useful. Also read the overview and FAQs at the BLAST web site (http://www.ncbi.nlm.nih.gov/BLAST/).


Hints: If you have problems using BLAST, first look at the Frequently Asked Questions. I also recommend that you read the chapter about pairwise alignment techniques in the course literature, “Understanding Bioinformatics” by Zvelebil and Baum. Please check the following things:



- In some of the BLAST search modes (and in other programs you will use in this course), you must load the sequence in FASTA format.

- Choose the correct database, usually nr (nr = non-redundant).

- If the search is taking a very long time, you can use the option to get your results by e-mail. This does not work with BLAST 2 sequences.
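
As a rough illustration of the first point, the sketch below turns a sequence printed in blocks of ten residues into a FASTA record: a header line starting with ">" followed by the residues. The function name `to_fasta`, the header text and the default line width of 60 are assumptions made for this example; the search servers do not require them.

```python
import re


def to_fasta(header: str, raw_sequence: str, width: int = 60) -> str:
    """Return a FASTA record built from a loosely formatted sequence."""
    # Keep letters only: this drops spaces, line breaks and position numbers.
    residues = re.sub(r"[^A-Za-z]", "", raw_sequence).upper()
    # Wrap the residues into lines of a fixed width.
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">" + header] + lines)


# First two blocks of the query sequence given further down in this assignment.
# The header text is a placeholder chosen for this sketch.
example = to_fasta("query (hypothetical header)", "MNDLSGKTVI ITGGARGLGA", width=10)
print(example)
```

Pasting the printed record into a search form is usually safer than pasting the raw blocks, since some programs may not ignore the embedded spaces or numbers.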


Please take your time getting acquainted with BLAST. You will have great use of it in this course, and most likely in your future work!

Now that you have gone through the general introduction to BLAST, you are ready to answer some questions. Please use your own words to formulate your answers.


1-1. What is the difference between global and local alignment?

1-2. Describe the FASTA format.

1-3. blastn is used to compare a DNA or RNA sequence against a database of nucleotide sequences. Describe the functions of the other modes (or programs) you can choose.

1-4. What is an E-value?


There are two main tools for sequence comparisons: BLAST and FASTA. Make sequence comparisons using both tools with the following sequence as the query sequence. In both cases, search against the Swissprot database. Report the top 5 hits using:

1-5. FASTA (http://www.ebi.ac.uk/fasta3/)

1-6. BLAST (http://www.ncbi.nlm.nih.gov/BLAST)


MNDLSGKTVI ITGGARGLGA EAARQAVAAG ARVVLADVLD EEGAATAREL GDAARYQHLD
VTIEEDWQRV VAYAREEFGS VDGLVNNAGI STGMFLETES VERFRKVVDI NLTGVFIGMK
TVIPAMKDAG GGSIVNISSA AGLMGLALTS SYGASKWGVR GLSKLAAVEL GTDRIRVNSV
HPGMTYTPMT AETGIRQGEG NYPNTPMGRV GNEPGEIAGA VVKLLSDTSS YVTGAELAVD
GGWTTGPTVK YVMGQ


Now we shall look at the difference between searching with the complete sequence and searching with only a short segment of the sequence, i.e. the difference between global and local alignment. The test case is human plasminogen (below), whose sequence you will compare with all sequences in the Swissprot database using FASTA (http://www.ebi.ac.uk/fasta3).


MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60
CRAFQYHSKE QQCVIMAENR KSSIIIRMRD VVLFEKKVYL SECKTGNGKN YRGTMSKTKN 120
GITCQKWSST SPHRPRFSPA THPSEGLEEN YCRNPDNDPQ GPWCYTTDPE KRYDYCDILE 180
CEEECMHCSG ENYDGKISKT MSGLECQAWD SQSPHAHGYI PSKFPNKNLK KNYCRNPDRE 240
LRPWCFTTDP NKRWELCDIP RCTTPPPSSG PTYQCLKGTG ENYRGNVAVT VSGHTCQHWS 300
AQTPHTHNRT PENFPCKNLD ENYCRNPDGK RAPWCHTTNS QVRWEYCKIP SCDSSPVSTE 360
QLAPTAPPEL TPVVQDCYHG DGQSYRGTSS TTTTGKKCQS WSSMTPHRHQ KTPENYPNAG 420
LTMNYCRNPD ADKGPWCFTT DPSVRWEYCN LKKCSGTEAS VVAPPPVVLL PDVETPSEED 480
CMFGNGKGYR GKRATTVTGT PCQDWAAQEP HRHSIFTPET NPRAGLEKNY CRNPDGDVGG 540
PWCYTTNPRK LYDYCDVPQC AAPSFDCGKP QVEPKKCPGR VVGGCVAHPH SWPWQVSLRT 600
RFGMHFCGGT LISPEWVLTA AHCLEKSPRP SSYKVILGAH QEVNLEPHVQ EIEVSRLFLE 660
PTRKDIALLK LSSPAVITDK VIPACLPSPN YVVADRTECF ITGWGETQGT FGAGLLKEAQ 720
LPVIENKVCN RYEFLNGRVQ STELCAGHLA GGTDSCQGDS GGPLVCFEKD KYILQGVTSW 780
GLGCARPNKP GVYVRVSRFV TWIEGVMRNN 810


a) Start by doing a search using the complete sequence. Store these results (e.g. in "emacs", "notepad" or any word processor) so that you can compare them with the results in b) and c) below.

b) Do a search using only the segment between positions 121 and 240 as the query sequence.

c) Finally, do a third search using only the segment 601-780.
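
Cutting out the segments by hand is error-prone, so a small script can help. The sketch below is one possible way to do it, assuming that positions are counted from 1 and that both end positions are included. The names `clean` and `segment` are made up for this example, and only the first 240 residues of the plasminogen sequence above are pasted in to keep it short.

```python
import re

# First four lines (residues 1-240) of the plasminogen sequence above,
# pasted together with the position numbers at the end of each line.
RAW = """
MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60
CRAFQYHSKE QQCVIMAENR KSSIIIRMRD VVLFEKKVYL SECKTGNGKN YRGTMSKTKN 120
GITCQKWSST SPHRPRFSPA THPSEGLEEN YCRNPDNDPQ GPWCYTTDPE KRYDYCDILE 180
CEEECMHCSG ENYDGKISKT MSGLECQAWD SQSPHAHGYI PSKFPNKNLK KNYCRNPDRE 240
"""


def clean(raw: str) -> str:
    """Remove spaces, line breaks and position numbers, keeping residues only."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def segment(sequence: str, start: int, end: int) -> str:
    """Return residues start..end, counted from 1 and including both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("segment lies outside the sequence")
    # Python slices count from 0 and exclude the end index.
    return sequence[start - 1:end]


sequence = clean(RAW)
query_b = segment(sequence, 121, 240)  # the segment asked for in b)
print(len(query_b), query_b[:10])
```

For c), the complete sequence would have to be pasted into `RAW` before calling `segment(sequence, 601, 780)`.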

Look at the top 20 results from these comparisons (a, b and c above).


1-7. Are they identical or not?

1-8. Which are at the top?

1-9. Are any sequences found in b or c that were not found in a?

1-10. If so, can you find an explanation for this?
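
One way to approach the comparison is to save the identifiers of the top 20 hits from each search and compare the lists. The sketch below shows the idea with made-up identifiers (HIT_1, HIT_2, ...), which are placeholders and not real search results. The function name `compare_hits` is likewise an assumption for this example.

```python
def compare_hits(full_hits, segment_hits):
    """Compare two ranked hit lists from a full-length and a segment search."""
    full_set, segment_set = set(full_hits), set(segment_hits)
    return {
        # Same hits in the same order?
        "identical": list(full_hits) == list(segment_hits),
        # Hits reported by both searches.
        "shared": sorted(full_set & segment_set),
        # Hits that appear only when searching with the segment.
        "only_in_segment": sorted(segment_set - full_set),
    }


# Placeholder identifiers; replace them with the hits you stored in a) and b).
hits_a = ["HIT_1", "HIT_2", "HIT_3"]
hits_b = ["HIT_1", "HIT_4"]

result = compare_hits(hits_a, hits_b)
print(result)
```

The script only reports the differences; explaining them is still up to you.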


Database information

Because of the high rate of data production and the need for researchers to have rapid access to new data, public databases have become the major medium through which genome sequence data are published. Public databases and the data services that support them are important resources in bioinformatics, and will soon be essential sources of information for all the molecular biosciences. Ensembl and Genbank are two major nucleotide databases, whereas Uniprot (Swiss-Prot) is the major database for protein sequences.


BLAST can also be used to search for hits in a specific database. We will use a BLAST example to get acquainted with two very important databases: Ensembl for genome/DNA information and UniProt/Swiss-Prot for protein information.


Ensembl is a database of fully sequenced eukaryotic species. Click "View full list of all Ensembl species" and look at all the little pictures of species. Remember that human was the first vertebrate to be fully sequenced, and a first draft was released as late as 2000.


1-11. How many genomes (species) are available through Ensembl right now? For fun, check this in a year or so and compare!


Click "BLAST/BLAT" at the top and search Ensembl for the following protein sequence (a naturally occurring human fusion protein):


>Fusion protein

MEPAPARSPRPQQDPARPQEPTMPPPETPSEGRQPSPSPSPTERAPASEE
EFQFLRCQQCQAEAKCPKLLPCLHTLCSGCLEASGMQCPICQAPWPLGAD
TPALDNVFFESLQRRLSVYRQIVDAQAVCTRCKESADFWCFECEQLLCAK
CFEAHQWFLKHEARPLAELRNQSVREFLDGTRKTNNIFCSNPNHRTPTLT
SIYCRGCSKPLCCSCALLDSSHSELKCDISAEIQQRQEELDAMTQALQEQ
DSAFGAVHAQMHAAVGQLGRARAETEELIRERVRQVVAHVRAQERELLEA
VDARYQRDYEEMASRLGRLDAVLQRIRTGSALVQRMKCYASDQEVLDMHG
FLRQALCRLRQEEPQSLQAAVRTDGFDEFKVRLQDLSSCITQGKAIETQS
SSSEEIVPSPPSPPPLPRIYKPCFVCQDKSSGYHYGVSACEGCKGFFRRS
IQKNMVYTCHRDKNCIINKVTRNRCQYCRLQKCFEVGMSKESVRNDRNKK
KKEVPKPECSESYTLTPEVGELIEKVRKAHQETFPALCQLGKYTTNNSSE
QRVSLDIDLWDKFSELSTKCIIKTVEFAKQLPGFTTLTIADQITLLKAAC
LDILILRICTRYTPEQDTMTFSDGLTLNRTQMHNAGFGPLTDLVFAFANQ
LLPLEMDDAETGLLSAICLICGDRQDLEQPDRVDMLQEPLLEALKVYVRK
RRPSRPHMFPKMLMKITDLRSISAKGAERVITLKMEIPGSMPPLIQEMLE
NSEGLDTLSGQPGGGGRDGGGLAPPPGSCSPSLSPSSNRSSPATHSP


Hint: search against peptide database. Make sure you search the human genome.


The fusion protein itself is not present in the regular genome, but you get many high-scoring hits to its two parts. It seems that there has been a fusion of genetic elements from two different chromosomes.


1-12. Which chromosomes have been fused?

1-13. At what residue number has the fusion occurred?

1-14. There are many hits with almost identical scores. By clicking on the results, looking at the gene entries etc., can you figure out why?

1-15. There are also a number of hits with lower, but still highly significant, scores to genes on other chromosomes. Why?

Hint: Start at the top; you don't need to look at all of them to come to a conclusion.


UniProt is a high-quality protein database. Its manually curated part, Swiss-Prot, is continuously extended to contain the experimental results and conclusions of scientific publications.


Click BLAST and run BLAST against the UniProt database.


1-16. What disease is the fusion protein associated with?

1-17. What are the function and subcellular location of the proteins that have generated this fusion protein?

1-18. What domains do these proteins contain?

Abnormal fusion of different chromosomes is called "translocation" and is the cause of a number of diseases.


SRS - Sequence Retrieval System


Biological databases are built by different teams, in different locations, for different purposes, and using different data models and supporting database-management systems. However, biological databases are more valuable when interconnected than when isolated. The popularity of these services indicates the need for querying interrelated datasets, rather than isolated databases. The advantages of physical integration are that queries can be executed rapidly because all data are located in one place, and the user sees a homogeneous, integrated data source. Another advantage of using SRS for accessing a database is that more precise search criteria may be used than those built into the search engine of the database itself.


Use SRS (Sequence Retrieval System) to solve problems 1-19 through 1-26. Mark the database/databases you wish to include in your search, then use the standard query form.


Hints: “all text” is a good search criterion to start out with, but not a very specific one. Be sure to use more specific ones as well, so that you discover the differences. Note that species names are preferably entered in their Latin forms. Don’t forget to “plug in” all the databases / the specific database you want to include in your searches.

Search Swissprot to find how many entries there are for the following organisms. What search word and criterion did you use?

1-19. Caenorhabditis elegans?

1-20. Arabidopsis thaliana?

Accession no. P20153 leads to

1-21. which organism?

1-22. what protein?

1-23. How long is the coding sequence for cytochrome b in woolly mammoth?


Use SRS to find in Swissprot all protein sequences of human hydroxysteroid dehydrogenases. Hydroxysteroid dehydrogenases are enzymes participating in the metabolism of steroids. Dehydrogenases catalyse oxidation reactions.

Hint: Some hydroxysteroid dehydrogenases have a prefix, like 17beta-. Thus, you should use a wildcard (*) before the word hydroxysteroid in order to catch all sequences.
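
To see why the wildcard matters, the sketch below imitates the idea with Python's `fnmatch` module, matching a pattern against each word of a description. This is only an analogy: SRS applies its own matching rules, which may differ, and the description strings and the function name `matches` are made up for the example.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, description: str) -> bool:
    """True if any whitespace-separated word of the description fits the pattern."""
    return any(
        fnmatchcase(word.lower(), pattern.lower())
        for word in description.split()
    )


# Hypothetical description strings, not taken from Swissprot.
names = [
    "17beta-hydroxysteroid dehydrogenase",
    "hydroxysteroid dehydrogenase",
    "alcohol dehydrogenase",
]

without_wildcard = [n for n in names if matches("hydroxysteroid", n)]
with_wildcard = [n for n in names if matches("*hydroxysteroid", n)]

print(without_wildcard)
print(with_wildcard)
```

In this toy model, the prefixed name is found only when the pattern starts with the wildcard.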


1-24. Describe how you search for hydroxysteroid dehydrogenases in Swissprot. How many did you find?

1-25. For which of these Swissprot sequences are the three-dimensional structures known (in the PDB database)?

Hint: Use the link function in SRS.

1-26. N-terminal acetylation is a common post-translational modification of proteins. Describe how you search for all acetylated human proteins in Swissprot. How many did you find? (Should be in the range of hundreds.)

Hint: You could start with an "All text" search in order to find at least one protein that is acetylated. By examining the Swissprot entry, you will find out how the N-terminal acetylation modification is encoded in the Swissprot database.


Questions and solutions should be emailed to Jan-Olov Höög (Jan-Olov.Hoog@ki.se) by February 28th at the latest.