BIO 224 Laboratory


Dr. Tom Peavy







Assignment 3
(due Wednesday, Sept 29, by midnight)





1. Perform a blastp search using a highly conserved human protein as a query (chordin, NP_003732.2). Search the Reference Proteins (RefSeq) database for all organisms using the default parameters.


A) What are the default parameters with regard to the scoring matrix, expect threshold, gap costs, compositional adjustments, and filter/masking? What does each of these mean with respect to how it affects the BLAST search if you change the setting?



B) After performing the search, display the BLAST "Search Summary" results and copy and paste the "Database" and the "Results Statistics" tables into this document.

C) What is the length of the sequence used in the search (actual length)? What was the effective length of the query sequence used in the search? Why do they differ?

D) How many sequences in the database were examined? What was the effective length of the database? What was the effective search space? How was the effective search space determined? How is the search space relevant to the BLAST program (meaning how is it used; hint: think about the way the program searches for hits)?
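The arithmetic behind the Search Summary numbers can be sketched in a few lines. This is only an illustration, assuming the usual relationships: the effective query length is the actual length minus a length adjustment, the effective database length is the total database length minus the number of sequences times that adjustment, the effective search space is the product of the two, and E = search space x 2^(-bit score). The function names and all numbers are made-up placeholders, not values from a real search.

```python
def effective_search_space(query_len, db_len, num_seqs, length_adjustment):
    """Return (effective query length, effective database length, search space)."""
    # Each sequence loses `length_adjustment` residues to edge effects.
    eff_query = max(query_len - length_adjustment, 1)
    eff_db = max(db_len - num_seqs * length_adjustment, 1)
    return eff_query, eff_db, eff_query * eff_db


def expect_value(bit_score, search_space):
    """Expected number of chance hits scoring at least `bit_score`."""
    return search_space * 2.0 ** (-bit_score)


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
eff_q, eff_db, space = effective_search_space(
    query_len=1000, db_len=5_000_000, num_seqs=10_000, length_adjustment=100
)
print("effective query length:", eff_q)
print("effective database length:", eff_db)
print("effective search space:", space)
print("E-value for a 50-bit hit:", expect_value(50.0, space))
```

Because E scales linearly with the search space, the same bit score becomes less significant as the database grows.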

E) Examining the graphical and alignment displays on the first page of the BLAST result, what species and protein had the highest score and best (lowest) E-value (exclude the match to the same protein, meaning human chordin)? List the accession number.

F) How can a hit receive an E value of 0.0 in the output and yet not be the identical gene that you used to search (meaning a protein other than human chordin having an E value of 0.0)?
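One way to explore this question is to look at how quickly E shrinks as the bit score grows. The sketch below assumes E = search space x 2^(-bit score) and further assumes that the report prints any E-value below roughly 1e-180 as 0.0; that display cutoff, the search-space figure, and the function names are assumptions for illustration, not values taken from the assignment.

```python
REPORT_CUTOFF = 1e-180  # assumed display cutoff, not an official constant
SEARCH_SPACE = 1e12     # hypothetical effective search space


def expect_value(bit_score, search_space):
    """Expected number of chance hits scoring at least `bit_score`."""
    return search_space * 2.0 ** (-bit_score)


def displayed(e_value, cutoff=REPORT_CUTOFF):
    """Format an E-value the way a report with a display cutoff might."""
    return "0.0" if e_value < cutoff else f"{e_value:.0e}"


for bits in (100, 400, 700, 1000):
    print(bits, displayed(expect_value(bits, SEARCH_SPACE)))
```

Under these assumptions, every sufficiently long, high-scoring alignment is displayed the same way, whether or not it is the query itself.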

G) List the conserved domains found in the chordin protein (examine the drop-down link for “Show Conserved Domains” and click on the domains).

H) Examine the rest of the hits. Describe, in broad strokes, what you suspect you are seeing in the output of the first 100 sequences. At what point do you suspect the hits are no longer orthologous? Which ones seem to be paralogues? Which ones seem to share only a structural region or a domain?






2. Perform a similar blastp search using the human chordin sequence and the Reference Proteins database, but this time search only “Arthropoda”.

A) Answer the following questions:

i) How many different species of Drosophila got hits? (note: use the taxonomy report)


ii) Which 4 proteins (and from which species) have the highest scores and E-values, listed in order? (Record them in the table found in question C, in the row for the correct BLOSUM matrix.)



B) Using the above Arthropoda BLAST search, at what score and E value do you suspect that the alignment is not for a homologous protein (meaning a non-chordin-related protein, although it is likely to share significant structural similarities such as domains)? Provide your reasoning.


C) Next, fill in the table by repeating the search (same query, same database, same limitation to Arthropoda) using the three scoring matrices listed below (note: the total number of BLAST hits is listed above the visually colored alignment distributions).





Matrix     | Total # BLAST hits | Top 4 scores (list abbrev. sequence name & bit score) | E value
BLOSUM45   |                    | a)                                                    | a)
           |                    | b)                                                    | b)
           |                    | c)                                                    | c)
           |                    | d)                                                    | d)
BLOSUM62   |                    | a)                                                    | a)
           |                    | b)                                                    | b)
           |                    | c)                                                    | c)
           |                    | d)                                                    | d)
BLOSUM80   |                    | a)                                                    | a)
           |                    | b)                                                    | b)
           |                    | c)                                                    | c)
           |                    | d)                                                    | d)
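If each search is also saved in tabular form, the top rows for the table can be pulled out with a short script. This is a minimal sketch that assumes the standard 12-column tabular layout (query id, subject id, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The function name is illustrative, and the sample rows and subject names are invented placeholders, not real search results.

```python
import csv
import io

# Invented placeholder rows in the assumed 12-column tabular layout.
SAMPLE_ROWS = [
    ["q", "subjA", "48.1", "950", "420", "18", "1", "950", "3", "940", "1e-160", "510"],
    ["q", "subjB", "40.2", "930", "480", "25", "5", "948", "1", "935", "2e-120", "395"],
    ["q", "subjA", "35.0", "120", "70", "2", "700", "820", "10", "130", "3e-10", "60.1"],
    ["q", "subjC", "33.3", "300", "180", "9", "100", "400", "50", "350", "5e-30", "120"],
    ["q", "subjD", "30.0", "200", "130", "6", "650", "850", "20", "220", "1e-12", "68.2"],
    ["q", "subjE", "28.5", "90", "60", "3", "700", "790", "15", "105", "0.002", "38.5"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS)


def top_hits(tabular_text, n=4):
    """Return the n best subjects as (name, bit score, E-value) tuples."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        subject, evalue, bits = row[1], float(row[10]), float(row[11])
        # A subject can appear in several rows; keep its best-scoring one.
        if subject not in best or bits > best[subject][0]:
            best[subject] = (bits, evalue)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(name, bits, evalue) for name, (bits, evalue) in ranked[:n]]


for label, (name, bits, evalue) in zip("abcd", top_hits(SAMPLE)):
    print(f"{label}) {name}  bit score={bits}  E={evalue}")
```

Running the same function on the output of each of the three searches gives one set of a) to d) entries per matrix.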



D) Were the same proteins identified as the top 4 most closely related sequences in each of the searches?
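The comparison can be made explicit by treating the top four hits from each search as a set and as an ordered list. The sketch below is only an illustration; the subject names are hypothetical stand-ins for whatever was recorded in the table, and the function name is not part of any BLAST tool.

```python
# Hypothetical top-4 lists, one per scoring matrix, in rank order.
top4 = {
    "BLOSUM45": ["subjA", "subjB", "subjC", "subjD"],
    "BLOSUM62": ["subjA", "subjB", "subjD", "subjC"],
    "BLOSUM80": ["subjA", "subjB", "subjC", "subjE"],
}


def compare_top_hits(results):
    """Return (hits shared by all searches, same members?, same order?)."""
    orders = list(results.values())
    sets = [set(order) for order in orders]
    shared = set.intersection(*sets)
    same_members = all(s == sets[0] for s in sets)
    same_order = all(order == orders[0] for order in orders)
    return shared, same_members, same_order


shared, same_members, same_order = compare_top_hits(top4)
print("shared by all matrices:", sorted(shared))
print("identical membership:", same_members)
print("identical ranking:", same_order)
```

Separating membership from ranking helps distinguish a matrix that merely reorders the same proteins from one that brings different proteins into the top four.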



E) What was the effect of changing the scoring matrix on the total number of hits, the best scores, and their E values? What might explain the differences between the scoring matrices? (hint: think about the relationship of the scoring matrices in terms of matches; which matrices give the highest scores for exact matches and highly conserved substitutions?)



3. In general, what different search strategies might you use when studying a highly conserved protein versus a poorly conserved protein when searching for homologs in another species? (think about the various matrices and databases)