Bioinformatics Practical 4: - School of Science and Technology

moredwarfBiotechnology

Oct 1, 2013 (4 years and 13 days ago)

127 views

Bioinformatics Practical 4:

BLAST in practice


Karen Marshall


Phone: 6773 2264


Room: #33 in the homestead building, School of Rural Science and Agriculture


Email: kmarsha2@metz.une.edu.au


The aim of this practical is to exami
ne how the BLAST parameters (such as the word
length and scoring system) affect the alignment outcome
.
Understanding BLAST
parameters is particularly relevant when doing alignments when the expected level of
homology is around 70
-
85%, such as sheep or cat
tle sequences to human sequences.
This
is because the

default BLAST parameters in many programs

(e.g. NCBI BLAST) are
often

optimized for high hom
ology alignments
. Use of
the default ‘high homology
parameters’ for a lower

homology alignment could

result
in many ‘hits’ being missed.


In this practical
only nucleic acid alignments are co
nsidered. Specifically we will


Part 1



Simulate

a set of sequences at different PAM (point accepted mutation) distances,
which correspond to different levels of sequence ho
mology



Perform alignments using the above sequences and
various

B
LAST parameters,
using the
BLASTN

software

(version 2.2.8
)

Part 2



Perform
BLASTS

of human and mouse interferon
-
gamma to the human genomic
sequence

surrounding this gene, applying the concept
s learnt in part 1


For
assessment

you are required to submit the
BLAST
N

ou
t
put
s

from

either

the
simulated sequences

(Part 1)

or

the real sequences (Part 2)
,
and provide a summary /
discussion of up to 500 words
.


Please ensure the BLAST
N

outputs are
conc
atenated into
one file, and sufficiently annotated (e.g. brief description at start of each original file)
.

Email files to
kmarsha2@metz.une.edu.au.




BACKGROUND


Markov mutational models and PAM distances.

The PAM (point accepted mutation) model of mo
lecular evolution was
originally
developed
from observations of residue replacements in closely related proteins (Dayhoff
et al. 1978, see
http://www.ncbi.nlm.nih.gov/BLAST/tutorial
/Altschul
-
1.html
). One
“PAM” corresponds to an average change in 1% of all amino acids present. A Markov
mutational model is used, assuming the mutation events are random and independent.
The same model was applied to DNA by States et al. (1991, Meth
ods in Enzymology, 3:
66
-
70).
Quoting from this paper:
“A matrix M of probabilities for substituting base i by
base j after any given amount of evolution can be calculated by successive iteration of a
reference mutation matrix: M
n
=(M
1
)
n
.
M
1

is a matrix r
eflecting 99% sequence
conservation and one point accepted mutation (1 PAM) per 100 bases. M
n

then
represents the substitution probabilities after n PAMs. To model the case for which all
base substitutions are equally likely, the diagonal elements of M
1

are all 0.99, whilst all
off
-
diagonal elements are 0.00333”.


Nucleic acid scoring schemes.

Scoring matrices for nucleic acid sequence comparisons
were determined by States et
al. (1991
) for different PAM distances under either a
uniform or biased mutatio
nal model.
Key figures from this paper are reproduced in
Figure 1 below.
Under the uniform mutation model it was assumed that mutation events
were random and independent. Under the biased mutation model a transition (G ↔ A
and C ↔ T) was assumed to be t
hree
-
times more likely than a transversion (G ↔ C and
A ↔ T). The efficiency of these scoring schemes was examined in their paper.
Conclusions were a) “to achieve optimal sensitivity it is necessary to use scores relevant
to the specific question being a
sked” i.e. use a scoring scheme relevant to the expected
level of sequence homology and b) “in noncoding regions, the use of scores based on a
biased as opposed to uniform mutational model may substantially improve the search
sensitivity” i.e. a scores mat
rix that can simplify to a value for a match and a value for a
mismatch may not be optimal.





Figure 1
. Key figures from States et al. (1991, Methods in Enzymology, 3: 66
-
70)



Expected levels of sequence homologies
.

The level of homology between two
sequences is usually not known prior to performing a BLAST. However,

an idea of t
he
level of homology can be gained from the literature. Expected homology levels will
differ depending on the species being compare (higher for e.g. human to mouse than
huma
n to cattle), and on the sequence type (coding, intronic, 5’ or 3’ UTRs, promoter,
intergenic etc.
, see Figure 2
)
.













Figure 2.

Homology in different parts of the gene, human to mouse.

From …



BLASTN.
In this practical
BLASTN
(2.2.8
: Alts
chul et al., 1997, Nucleic Acids Res.
25:3389
-
3402)
will be used on our
own computers
. Key features of BLASTN
(reproduced with permission from McEwan 2004 “Bioinformatics for quantitative
geneticists” course notes) are given below:




A suffix tree is form
ed of the locations of all short “words” of sequence in one of the files.
Typically this is the larger file and is termed the database. This look up table allows the user to
quickly find perfect matches of a certain length. These are the “seeds” from which

the alignment is
extended to form a high scoring pair (HSP) which is retained if it exceeds some threshold typically
the threshold expectation value used. This reduces the search space dramatically and the reduction
is greater the longer the seed length.



These HSPs are then joined together if the combined score including gap penalties is significantly
greater when they are combined.



It has features such as “
DUST
” which guard against false matches due to low complexity
sequence.



The seed lengths and their
format depend on the level of homology. The default of 11 (and even
higher for Megablast) is not suited for low homology matches. The minimum of 7 dramatically
increases execution time but is required when homology drops below ~82%.



The scoring regime has
to be altered for low homology comparisons by default it is optimized for
near perfect matches. The “magic” options

-
W 7
-
r 17
-
q
-
21
-
f 280
-
G 29
-
E 22
-
X 240

are the most sensitive available. However these
options mean the E values are now only approxim
ate.



More recently BLAST has introduced discontiguous seeds. These are seeds where only a pattern
of bases has to match. The theory suggests that these seeds are more sensitive to low homology
matches while retaining the benefits of speed observed with lon
ger seed lengths.



The most often options used are:

o

-
W

is the seed word length

o


r

reward for a match default =1

o


q

penalty for a mismatch default =3

o


G

cost to open a gap

o


E

cost to extend a gap

o


F

often use the magic option

F “m D”
. This prevents using l
ow complexity or
softmasked sequence to initiate a seed match but allows the match to be extended through
the region.

o


e

to set the threshold expectation
. Note this is the threshold for the HSP BEFORE gaps
are included!

o


m

to specify the output options the

most used are:




m 3

this prints out all the query and aligns matches below it with dots for
identities



-
m 4

is the same as

m 3 but prints matching section in full great for cutting and
pasting regions for primer design



-
m 8

results in a tab delimited fi
le that can be loaded into excel. This is the
format to use if you want the results to be displayed in a genome browser like
UCSC.

o


U

allows soft masking with lower case if the
T

option

is chosen, very useful.




Using BLASTN you can search with more than

one sequence at a time. Steps for a
BLASTN search are:


1.

Format one file into a database with the command

Formatdb

i

dbfile.txt

p F

o
T

This file can contain multiple sequences (in FASTA format), but the formatdb
command will fail if there are duplicat
ed sequences. Check the format.log file to see
if the process was successful.


2.

Perform a BLAST with the command

blastall

p blastn


d dbfile.txt

i

comp.txt

o out.txt

where

p is followed by the program (BLASTN)

d the name of the file
formatted in step

1,
-
i the other (smaller) sequence file, and

o the output file.

Other options can also be included e.g.

r for match score etc.




THE PRACTICAL: PART 1


1.

Examine the spreadsheet markov.xls which shows an initial DNA sequence,
and corresponding sequences

at PAM distances of 1


100.
The simulation
can be rerun by Ctrl
-
Alt
-
F9. Ensure you understand the simulation, an
d note
the consequences of the mutational

process (
saturation
, forward and backward
mutation).


2.

Extract the original sequence plus sequences
at PAM distances of 10, 20, 30,
40 and 50 for five different
replicas
. Include these in a single file
‘markov.seq’
in FASTA format ready for BLAST.

e.g.

>Ori_seq


AGATTCACTGGTGTGGCAA ….

>Rep1_Pam10

AGATTCACTGGTGTGGCAA ….

(if you don’t have the scripting

skills to do this quickly, use
the
‘markov.seq’

file
provided
).


3.

Prepare to BLAST the original sequence back to this file

a.

Create a file containing just the original sequence ‘markov.ori’

b.

Format the file ‘markov.seq’ using the formatdb command (see above)


4.

Perform a bla
st using the command below, and examine the output

blastall
-
p blastn
-
d markov.seq
-
i markov.ori
-
m 3
-
o markov
1
.out

Note that this uses
default parameters
for all but the output
.

Use
blastall


to
view default parameters; also also given in

Appendix 1.


5.

Perform blast, using increasingly more optimized parameters for this
simulation, and examine the output.

a.

remove ‘softmasking’

blastall
-
p blastn
-
d markov.seq
-
i markov.ori
-
m 3
-
F "m D"


-
o markov.out





b.

reduce wordlength

blastall
-
p blastn
-
d markov.seq
-
i markov.ori
-
m 3
-
F "m D"
-
W 7

-
o markov.out


c.

use ‘sensitive’ parameters

blastall
-
p blastn
-
d markov.seq
-
i markov.ori
-
m 3
-
F "m D"
-
W 7


-
r 17
-
q
-
21
-
f 280
-
G 29
-
E 22
-
X 240
-
o markov.out


d.

change the expectation value

blastall
-
p blastn
-
d markov.seq
-
i markov.ori
-
m 3
-
F "m D"
-
W 7

-
r 17
-
q
-
21
-
f 280
-
G 29
-
E 22
-
X 240
-
e 1e
-
2
-
o markov.out




THE PRACTICAL: PART 2


In this part of the practical you

will align the mRNA sequences of human and cattle
interferon
-
ga
mma (INFG) against the human genomic sequence surrounding this gene.
Two files are available
:




INFG_refseq.txt

containing the reference sequences for human (NM_000619)
and
cattle


(NM_174086)

INFG mRNA
. These sequences are in FASTA format
and were obtain
ed from the NCBI website using Batch Entrez



hs_
chr12_subseq.txt

containing human genomic sequence surrounding the INFG
gene (
chr12:66,589,493
-
67,085,092

~ ½ Mb) . This sequence is in FASTA
format with repeats masked to lower case, and was obtained from t
he USCS
‘golden path’ website.



1.

BLAST using parameters that will favour the alignment of the human INFG
mRNA sequence to the human genomic

sequence
. You should be able to identify
the 4 exons (the output switch

m9 provides a useful table).






The exon

/ intron report from NCBI ‘AceView’ for NM_000619 is as follows:


I
n

variant

Length

& DNA

Coordinates

on gene

Supporting

clone (s)

Exon

1


243bp

1 to 243

M29383

Intron [gt
-
ag]


1242bp

244 to 1485

M29383

and 32 others

Exo
n

2


69bp

1486 to 1554

NM_000619

and 32 others

Intron [gt
-
ag]


95bp

1555 to 1649

NM_000619

and 33 others

Exon

3


183bp

1650 to 1832

NM_000619

and 24 others

Intron [gt
-
ag]


2425bp

1833 to 4257

NM_000619

and 24 others

Exon

4


725bp

4258 to 4982




2.

BLAST using parameters that will favour the alignment of cattle (bos Taurus)
INFG mRNA sequence to the human genomic sequence
. The expected homology
level
for human

coding region to
cattle coding region
is around 85%
.


REFERENCES


An excellent BLAST tut
orial, written by one of the developers of BLAST, can be found
at
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul
-
1.html
. This is strongely
recommended as further reading, an
d has
a comprehensive reference list
.


States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid
database searches using application
-
specific scoring matrices."
Methods in Enzymology

3:66
-
70.


McEwan (2004) “Bioinformatics

for quantitative geneticists” course notes. Found on
-
line at
http://www
-
personal.une.edu.au/~jvanderw/aabc_materials2004.htm#ModuleC





Appendix 1.

blastall 2.2.7

arguments.



-
p Program Name [String]


-
d Database [String]


default = nr


-
i Query File [File In]


default = stdin


-
e Expectation value (E) [Real]


default = 10.0


-
m alignment view options:

0 = pairwise,

1 = query
-
anchored showing i
dentities,

2 = query
-
anchored no identities,

3 = flat query
-
anchored, show identities,

4 = flat query
-
anchored, no identities,

5 = query
-
anchored no identities and blunt ends,

6 = flat query
-
anchored, no identities and blunt ends,

7 = XML Blast output,

8 =

tabular,

9 tabular with comment lines

10 ASN, text

11 ASN, binary [Integer]


default = 0


-
o BLAST report Output File [File Out] Optional


default = stdout


-
F Filter query sequence (DUST with blastn, SEG with others)
[String]


default = T


-
G Cost to open a gap (zero invokes default behavior) [Integer]


default = 0


-
E Cost to extend a gap (zero invokes default behavior) [Integer]


default = 0


-
X X dropoff value for gapped alignment (in bits) (zero invokes
default behavior)



blastn 30, megablast 20, tblastx 0, all others 15 [Integer]


default = 0


-
I Show GI's in deflines [T/F]


default = F


-
q Penalty for a nucleotide mismatch (blastn only) [Integer]


default =
-
3


-
r Reward for a nucleotide match (blastn on
ly) [Integer]


default = 1


-
v Number of database sequences to show one
-
line descriptions for
(V) [Integer]


default = 500


-
b Number of database sequence to show alignments for (B) [Integer]


default = 250


-
f Threshold for extending hits,
default if zero


blastp 11, blastn 0, blastx 12, tblastn 13


tblastx 13, megablast 0 [Integer]


default = 0


-
g Perfom gapped alignment (not available with tblastx) [T/F]


default = T


-
Q Query Genetic code to use [Integer]


default
= 1


-
D DB Genetic code (for tblast[nx] only) [Integer]


default = 1


-
a Number of processors to use [Integer]


default = 1


-
O SeqAlign file [File Out] Optional


-
J Believe the query defline [T/F]


default = F


-
M Matrix [String]


d
efault = BLOSUM62


-
W Word size, default if zero (blastn 11, megablast 28, all others
3) [Integer]


default = 0


-
z Effective length of the database (use zero for the real size)
[Real]


default = 0


-
K Number of best hits from a region to keep
(off by default, if
used a value of 100 is recommended) [Integer]


default = 0


-
P 0 for multiple hit, 1 for single hit [Integer]


default = 0


-
Y Effective length of the search space (use zero for the real size)
[Real]


default = 0


-
S Quer
y strands to search against database (for blast[nx], and
tblastx)


3 is both, 1 is top, 2 is bottom [Integer]


default = 3


-
T Produce HTML output [T/F]


default = F


-
l Restrict search of database to list of GI's [String] Optional


-
U U
se lower case filtering of FASTA sequence [T/F] Optional


default = F


-
y X dropoff value for ungapped extensions in bits (0.0 invokes
default behavior)


blastn 20, megablast 10, all others 7 [Real]


default = 0.0


-
Z X dropoff value for fi
nal gapped alignment in bits (0.0 invokes
default behavior)


blastn/megablast 50, tblastx 0, all others 25 [Integer]


default = 0


-
R PSI
-
TBLASTN checkpoint file [File In] Optional


-
n MegaBlast search [T/F]


default = F


-
L Location on query sequence [String] Optional


-
A Multiple Hits window size, default if zero (blastn/megablast 0,
all others 40 [Integer]


default = 0


-
w Frame shift penalty (OOF algorithm for blastx) [Integer]


default = 0


-
t Length of

the largest intron allowed in tblastn for linking HSPs
(0 disables linking) [Integer]


default = 0


-
B Number of concatenated queries, for blastn and tblastn [Integer]
Optional


default = 0