Serial BLAST searching

lambblueearthBiotechnology

Sep 29, 2013 (4 years and 3 months ago)

111 views

BIOINFORMATIC
S
Vol.19 no.12 2003,pages 1492–1496
DOI:10.1093/bioinformatics/btg199
Serial BLAST searching
Ian Korf
The Wellcome Trust Sanger Institute,Hinxton,Cambridgeshire,CB10 1SA,UK
Received on October 27,2002;revised on February 11,2003;accepted on February 27,2003
ABSTRACT
Motivation:The translating BLAST algorithms are pow-
erful tools for Þnding protein-coding genes because they
identify amino acid similarities in nucleotide sequences.
Unfortunately,these kinds of searches are computationally
intensive and often represent bottlenecks in sequence
analysis pipelines.Tuning parameters for speed can
make the searches much faster,but one risks losing
low-scoring alignments.However,high scoring alignments
are relatively resistant to such changes in parameters,and
this fact makes it possible to use a serial strategy where a
fast,insensitive search is used to pre-screen a database
for similar sequences,and a slow,sensitive search is used
to produce the sequence alignments.
Results:Serial BLAST searches improve both the speed
and sensitivity.
Contact:ik1@sanger.ac.uk
INTRODUCTION
The BLAST programs (Altschul et al.,1990,1997;
Gish,http://blast.wustl.edu) are popular tools for Þnding
similarities between biological sequences.The translating
forms of the algorithms,BLASTX,TBLASTN,and
TBLASTX,are particularly useful as gene Þnding tools
because they Þnd amino acid similarities in nucleotide
sequences.The BLAST algorithms employ heuristics
at various stages of the algorithm,and the choice of
parameters for these is very important.Despite this fact,
the default parameters are generally used.While BLASTN
is tuned for speed,and therefore low sensitivity,the
other BLAST programs are tuned for high sensitivity and
are therefore comparatively slow.The result of this is
that sequence analysis pipelines that include translating
BLAST algorithms may spend a majority of their time
executing slow BLAST searches.For this reason,many
researchers omit translating BLAST searches altogether,
but in so doing lose a valuable source of information.
The BLAST algorithm is divided into three stages:
seeding,extension,and evaluation.In the seeding step,
the query sequence is split into words of size W.The
various words are used as alignment seeds if when they are
compared to database sequences,their score is above some
threshold,T.In the second step,extension,alignments
are generated from the seeds.The critical parameter
controlling extension is called X.Low values of X cause
alignments to terminate after only a few mismatches have
been found,while high values of X allow alignments to
continue through dissimilar regions.In the evaluation step,
individual alignments are compared to the S2 threshold
and combinations of alignments are compared to the S
threshold (these score-based parameters have statistical
counterparts called E2 and E).The combination of W,
T,X,S2,and S,along with the scoring matrix and gap
penalties determine the speed and sensitivity of BLAST
searches.
BLAST searches can easily be tuned for greater speed
at a cost of some sensitivity.The danger is that one
may lose small or divergent exons.But importantly,the
larger,more conserved exons should still be found with
relatively insensitive parameters.This behavior naturally
leads to a hierarchical strategy in which an insensitive
search is used to pre-screen a database in advance of a
sensitive search that produces the Þnal alignments.This
idea is not novel;it is common to use BLAST to collect
sequences prior to further analyses.However,it is not a
general practice to perform serial BLAST searches,and
in this paper I investigate how this approach impacts
speed and sensitivity.I Þnd that the speed of BLASTX
and TBLASTX can be increased at least 10-fold without
appreciably affecting sensitivity,and it is even possible to
increase both speed and sensitivity.
METHODS
In principal,a serial searching strategy can be applied to
many programs,but I chose BLAST because it is the most
common sequence search tool in use.There are several
ßavors of BLAST,and I chose WU-BLAST because it has
a particularly ßexible parameter set.I was most interested
in altering word size,and while the NCBI version limits
word size to 2 or 3,word size is unlimited in WU-BLAST.
The experiments employed WU-BLAST version
2.0MP-WashU [15-May-2002].All searches included the
following base parameters:B =10000000 V =10000000
Þlter = seg+xnu hspmax = 0.The B and V parameters
prevent the report formbeing truncated,the Þlter removes
low complexity sequence,and hspmax prevents the num-
1492
Published by Oxford University Press
by guest on September 29, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on September 29, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on September 29, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on September 29, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on September 29, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
Serial BLAST searching
ber of alignments from being truncated.Those searches
using the 2-hit algorithm used hitdist = 30 for BLASTX
and hitdist =40 for TBLASTX.The TBLASTX searches
also include:nogap altscore = Ô* any-999Õ altscore=
Ôany * -999Õ.The altscore commands set the scores of
stop codons to -999,which combined with the nogap
option prevents stop codons fromappearing in alignments
(without nogap,stop codons can be skipped over by
paying a gap costs).
The query sequence used for the BLASTX experiments
was the complete genomic sequence of Saccharomyces
cerevisiae chromosome I (GenBank accession no.
NC
001133).The amino acid database was a compre-
hensive set of proteins from GenBank totaling 815 301
sequences and 251 650 683 letters.The query sequences
for the TBLASTX experiments were 500 Caenorhabditis
elegans genomic sequences each containing a single
protein-coding gene and 100 bp of ßanking sequence.
The genes were randomly chosen from among the col-
lection of more than 3000 genes that are conÞrmed by
transcripts (http://www.wormbase.org;speciÞc sequences
available from the author on request).The Caenorhab-
ditis briggsae genome used was version cb25.agp8
(http://www.sagner.ac.uk/Projects/C
briggsae).Contigs
were broken up into sequences at most 20 kb long with
2 kb overlaps.
For processing BLAST reports I used the BPlite.pmPerl
module fromthe Bioperl project (Stajich et al.,2002,http:
//www.biopoerl.org) as well as a few custom Perl scripts.
All experiments were conducted on a Sun Microsystems
V880 with four 750 MHz CPUs and 8 Gb RAM.
RESULTS
Word size and neighborhood word score
An effective way to speed up BLAST is to reduce the
number of alignment seeds.The most obvious way to do
this is to raise the neighborhood word score threshold,T.
For example,if using the WU-BLAST default parameters
(W=3 T=12) with BLOSUM62 scoring matrix,the
following pairs of words would all generate alignment
seeds because they all have a score of 12.The non-
matching pairs would not generate seeds if the score was
raised higher,but matching pairs always generate seeds.
Query word:AIL QEK LWP
Database word:AIL QER MWD
By setting T to a high value,such as 999,there will be
no neighborhood words,and this greatly reduces the num-
ber of seeds.Increasing W and scaling T appropriately can
also be used to reduce the number of seeds.For example,
W=4 T=16 has the same scale of 4 score units per amino
acid as the default.The reason why higher word sizes re-
duce the number of seeds is that the expected score of ran-
dompairs of amino acids is always negative,and therefore
adding another amino acid is more likely to decrease the
score than to increase it.
Table 1 shows how various combinations of W and T
affect the speed and sensitivity of a BLASTX search of
Saccharomyces cerevisiae chromosome I against a com-
prehensive amino acid database.As expected,strategies
that reduce the number of seeds increase speed at the cost
of sensitivity.For example,raising T from the default 12
to 15 makes BLASTXrun 2.7 times faster,but only 72.4%
of the alignments are found.Requiring matching words
gains a little more speed (3.9×) at the cost of more sensi-
tivity (61.1%).Scaling up to W=4 T=16 makes BLASTX
both faster (3.4×) and more sensitive (75.1%) than just
increasing T from 12 to 15,but this does not hold for
W=5 T=20 where speed is much greater (8.7×) and sen-
sitivity is much lower (53.4%).
The key to the serial strategy is apparent in the fast,
insensitive searches,such as W=5 T=20.Even though
this loses about half the alignments,nearly all (96.7%)
of the sequences are still identiÞed in the search.It
is important to note that these experiments use the
default BLOSUM62 scoring matrix,which was built from
sequences that transitively share 62%amino acid identity.
This matrix is not expected to perform well on distant
similarities,and it is the distant similarities that are
most likely to be lost under insensitive parameters.It is
therefore reasonable to focus on those alignments whose
percent identity is above some threshold,for example
40%.As expected,a greater fraction of alignments and
sequences are found when employing such a criterion,and
at this level,even highly insensitive parameters that miss
most alignments still recover nearly all of the sequences.
Two-hit alignment initiation
In modern versions of BLAST one can restrict alignment
initiation to regions containing two word hits that meet
or exceed T.The hits must be within some distance of
each other and also have no gaps between them.In WU-
BLAST,the 2-hit algorithmis off by default and turned on
with the hitdist parameter.Note that using 2-hit initiation
does not reduce the number of seeds,but it can have a
large impact on speed because many of the seeds will not
be used to initiate alignments.
As shown in Table 1,simply adding the 2-hit algorithm
to the WU-BLAST default W=3 T=12 gives approxi-
mately a 4-fold boost in speed.With various combina-
tions of W and T,it is possible to make BLASTX run
hundreds of times faster.There is,of course,a sensitivity
trade-off,but for reasonably good matches,almost all se-
quences are identiÞed.For example,using 2-hit initiation
and W=4 T=20,BLASTX runs about 130 times faster
but misses only 4% of the sequences at the 40% identity
threshold.
1493
I.Korf
Second stage BLAST
The second stage in the serial strategy is to align the
query against the database hits found in the Þrst stage
using sensitive parameters.This turns out to be a relatively
simple task since recent versions of WU-BLAST allow
one to index BLAST databases with xdformat and retrieve
individual sequences by accession number with xdget.The
procedure to parse accession numbers from a BLAST
report and retrieve them from the BLAST database is
easily automated with a script.
In the best-case scenario,the Þrst stage BLASTX
recovers all the sequences found using the default param-
eters.This turns out to be 17 541 sequences comprising
10 591 160 amino acids or 2.1% of the sequences and
4.2%of the amino acids in the entire database.Retrieving
these from the BLAST database with xdget,formatting
these with xdformat,and searching this subset database
with BLASTX using default parameters required 71.4,
1.4,and 2918 seconds respectively,or 4.6% of the time
required to search the entire database with default param-
eters.None of the higher speed parameter sets recovered
all 17 541 sequences,so a cost of 4.6% is a worst-case
estimate for the second stage.Therefore,in a serial search
using W=4 T=16 and 2-hit initiation,both stages cost
roughly 1/20 of the default search and the overall effect
is a 10-fold improvement in speed without much loss in
sensitivity.
Gene-Þnding with TBLASTX
As more and more genome sequences become available,
cross-species genome comparisons will become increas-
ingly common.There are already a few gene-Þnding
methods that rely on TBLASTX alignments as part
of their input (Roest Crollius et al.,2000;Wiehe et
al.,2001) and I expect others are in development.In
order to determine how the serial approach impacts the
performance of TBLASTX,I performed an experiment in
which 500 C.elegans genes were used as queries against
the C.briggsae draft genome.Sensitivity and speciÞcity
were calculated at the nucleotide level with respect to the
actual genes,as is typically done when evaluating the
performance of gene prediction programs (Burset and
Guigo,1996).
Table 2 shows the results of the experiment.As ex-
pected,the serial strategy greatly improves the speed of
the search.Even when employing W=5 T=999,which
speeds up the search by 60-fold,the sensitivity (94.2%)
is nearly the same as the default parameters (95.0%).
However,this is not the case when using a single BLAST
search with W=5 T=999 where the sensitivity drops
to 88.7%.In addition to speed,the serial strategy can
also increase sensitivity because the second stage has a
smaller search space.For example,using the parameters
W=3 T=14 hitdist = 40 in a serial search results in
slightly more sensitivity and about twice the speed of a
default WU-BLAST search (W=3 T=12).
In general,the speciÞcity of the TBLASTX searches
is quite low.Most of this is due to weak alignments that
do not appear to be coding sequence but are nevertheless
picked up by TBLASTX.However,there are also some
clear examples of missed exons and missed genes within
introns (data not shown).
DISCUSSION
The experiments in this paper demonstrate that a serial
strategy can greatly improve speed at almost no cost to
sensitivity.Various values of Tand W were tested using
both 1-hit and 2-hit alignment initiation.The parameters
X,S2,and S also affect BLAST performance,but I
have not included these experiments here because the
parameters have comparatively small effects,and the
default values work quite well.
One of the fundamental principles behind BLAST,and
the reason it is so fast is that it restricts alignments to
regions containing seeds.The serial strategy takes this
same approach at a higher level,requiring a reasonably
good alignment from an insensitive search as a sequence
seed rather than an alignment seed.As the experiments
demonstrate,decoupling searching and alignment in this
way is a useful strategy,but it only makes sense if one
expects the alignments in the Þrst and second stages to
be different.There would be no point in using a serial
strategy if the S.cerevisiae chromosome was replaced with
a 100 bp fragment since it is unlikely that additional
alignments would be found in the second stage.Similarly,
it is probably not a good idea to use serial searches with
genomic sequencing reads unless introns are expected to
be very short.
Sequences that are too long can also be problematic.
In the BLASTX experiment I used a 230 kb sequence
containing over 100 genes.This query identiÞed about 2%
of the known proteins to date,and such a large fraction
points to a problem with long sequences.To illustrate
this more clearly,we can imagine ßipping the experiment
around and searching the individual proteins against the
chromosome using TBLASTN.Some proteins will not
be similar to any region of the chromosome and will be
discarded after the Þrst search.Those that do match will be
used to search the entire chromosome in the second stage.
Clearly,the proteins should only be aligned in the vicinity
of where they were found in the Þrst search.One can
approximate this behavior by segmenting large sequences,
and this approach was used for the C.briggsae contigs (see
Methods).In general,segmenting improves both speed
and sensitivity,but there is some risk that alignments will
be lost if they are separated from their sequence seeds
(data not shown).A better solution would allow one to
1494
Serial BLAST searching
Table 1.How seeding parameters affect the speed and sensitivity of BLASTX
All alignments ￿40%Identity
W T Word hits speed Alignment sensitivity Sequence sensitivity Alignment sensitivity Sequence sensitivity
3 12 1 1.0 100.0 100.0 100.0 100.0
3 15 1 2.7 72.4 98.3 83.5 99.8
3 18 1 3.8 62.8 97.6 74.8 99.8
3 999 1 3.9 61.1 97.2 71.9 99.8
4 16 1 3.4 75.1 98.6 88.1 99.9
4 20 1 13.7 38.0 90.4 50.9 99.5
4 999 1 27.9 23.8 84.2 30.4 98.8
5 20 1 8.7 53.4 96.7 70.3 99.9
5 25 1 42.0 20.5 80.2 27.5 99.0
5 999 1 124.3 10.5 64.8 13.1 96.6
3 12 2 3.9 56.7 97.6 68.7 99.6
3 15 2 12.9 24.1 90.5 29.2 99.3
3 18 2 22.0 18.7 87.2 20.6 98.8
3 999 2 29.6 17.6 86.0 18.8 98.5
4 16 2 19.4 25.5 91.9 31.7 99.2
4 20 2 129.6 8.1 64.8 9.1 95.9
4 999 2 266.3 5.1 47.3 7.4 93.6
5 20 2 51.6 13.7 82.0 13.0 98.7
5 25 2 337.7 4.0 38.8 6.7 90.9
5 999 2 620.6 2.4 23.0 5.9 84.6
The word size (W),neighborhood word threshold score (T),and 2-hit algorithmcontrol how alignments are seeded and are the primary means of controlling
the speed and sensitivity of BLAST searches.The heavy horizontal line splits the table into 1-hit and 2-hit parameter sets.The columns under ÔAll
alignmentsÕ measure BLAST sensitivity with respect to the default parameters,where ÔalignmentÕ corresponds to the number of HSPs and ÔsequenceÕ
corresponds to the number of database matches.The columns under Ô￿40%IdentityÕ consider only those alignments and sequences whose highest scoring
HSP is ￿40%identical to the query.
deÞne an alignment neighborhood around sequence seeds,
but such an application does not yet exist.
Before extrapolating the BLASTX experiments to more
general cases,one must consider that the S.cerevisiae
genome contains few introns,so the majority of true
matches are expected to be long.In genomes with shorter
exons,one expects lower scoring alignments.Conse-
quently,in these genomes,it may be more difÞcult to Þnd
a signiÞcant alignment in the Þrst stage.For this reason,
it is probably a good idea not to try too hard to optimize
the speed of the Þrst search.Additionally,since the
second search will always take a little time to complete,
over-tuning the Þrst stage may not gain much overall
speed.At the risk of over-generalizing,I recommend
using W=4 T=16 hitdist =40.Note that even though
W=5 T=25 appears to be a good choice for the Þrst
stage,I do not recommend setting W=5 with small
values of T since the amount of memory required to store
the neighborhood words can become excessive (it was
approximately 500 MB for the BLASTX experiments).
If one requires matching words with T=999,then the
memory footprint is much lower and even larger word
sizes can be accommodated.
In the TBLASTX experiment,it is somewhat surprising
that even highly insensitive parameters achieve such good
performance.The C.elegans and C.briggsae genomes
diverged approximately 100 million years ago (Coghlan
and Wolfe,2002),and with more distant genomes it
may pay to be more cautious.The Þgures for sensitivity
probably understate the true sensitivity of the approach
because the C.briggsae genome is not yet Þnished and
gapped alignments were not used.
One of the advantages to the serial strategy that was
not explored here is that it allows one to search with
parameters that would normally be considered foolish.
For example,if one is interested in similarities that may
contain simple sequence repeats,the second stage search
can be performed without complexity Þlters.Or,if one
is looking to include very low scoring exons,one might
choose very sensitive parameters in the second stage
with lower values of W,T,S,and S2.One might also
allow gaps in TBLASTX to increase sensitivity though
this would probably increase the number of non-coding
similarities.
Note that this study employed WU-BLAST,and the
command line parameters listed here will not work with
NCBI-BLAST.NCBI-BLAST users will need to make the
following modiÞcations.
Since word size is limited to 3,the only way to
reduce seeds is to increase T.This is set with the
1495
I.Korf
Table 2.TBLASTX searches of C.elegans versus C.briggsae
Search 1 Search 2
W T Word hits SN SP SN SP Total speed
3 12 1 95.0 47.9 95.5 46.4 1.0
3 14 2 94.5 49.7 95.4 46.9 2.3
4 16 2 91.8 63.0 94.9 48.5 11.1
5 25 1 90.6 66.9 94.5 50.0 23.6
4 20 2 88.1 81.1 94.1 50.8 38.7
5 999 1 88.7 75.2 94.2 51.5 60.8
The sensitivity (SN) and speciÞcity (SP) of the serial search approach were
evaluated with respect to known genes using TBLASTX.Various values for
word size (W),neighborhood word threshold score (T),and number of
word hits were selected.The columns under ÔSearch 1Õ correspond to a
typical TBLASTX search.The columns under ÔSearch 2Õ and the ÔTotal
SpeedÕ refer to the combined,serial strategy.Note that the sensitivity of
Search 2 does not drop appreciably even at higher speeds.
−f command line option,for example:−f 16.
Minimizing the 2-hit distance with −A4 reduces
the number of extensions,but increases speed only
marginally.
The program to retrieve individual sequences is
called fastacmd,but rather than create a second
stage database,one has the option of creating an
alias database containing a list of speciÞc identiÞers.
This is more efÞcient than creating and destroying
a second stage database.See the NCBI-BLAST
documentation for more information.
There is no way to change stop codon scores from
the command line,edit the scoring matrix instead.
TBLASTX does not allow gaps,so there is no need
for a nogap option.However,if you want to turn
off gaps in BLASTX or TBLASTN,this is done via
−gF.
ACKNOWLEDGEMENTS
The author is supported by a grant from the National
Human Genome Research Institute (HG-00064-01).
REFERENCES
Altschul,S.F.,Gish,W.,Miller,W.,Myers,E.W.and Lipman,D.J.
(1990) Basic local alignment search tool.J.Mol.Biol.,215,403Ð
410.
Altschul,S.F.,Madden,T.L.,Schaffer,A.A.,Zhang,J.,Zhang,Z.,
Miller,W.and Lipman,D.J.(1997) Gapped BLAST and PSI-
BLAST:a new generation of protein database search programs.
Nucleic Acids Res.,25,3389Ð3402.
Burset,M.and Guigo,R.(1996) Evaluation of gene structure predic-
tion programs.Genomics,34,353Ð367.
Coghlan,A.and Wolfe,K.H.(2002) Fourfold faster rate of genome
rearrangement in nematodes than in Drosophila.Genome Res.,
12,857Ð867.
Roest Crollius,H.,Jaillon,O.,Bernot,A.,Dasilva,C.,Bouneau,L.,
Fischer,C.,Fizames,C.,Wincker,P.,Brottier,P.,Quetier,F.et al.
(2000) Estimate of human gene number provided by genome-
wide analysis using Tetraodon nigroviridis DNA sequence.Nat.
Genet.,25,235Ð238.
Stajich,J.E.,Block,D.,Boulez,K.,Brenner,S.E.,Chervitz,S.A.,
Dagdigian,C.,Fuellen,G.,Gilbert,J.G.,Korf,I.,Lapp,H.et al.
(2002) The Bioperl toolkit:perl modules for the life sciences.
Genome Res.,12,1611Ð1618.
Wiehe,T.,Gebauer-Jung,S.,Mitchell-Olds,T.and Guigo,R.(2001)
SGP-1:prediction and validation of homologous genes based on
sequence alignments.Genome Res.,11,1574Ð1583.
1496