
Basic Local Alignment Search Tool
from
Advanced Biocomputing, LLC
Description
AB-BLAST 3.0 is a powerful
software package for gene and protein identification,
using sensitive, selective and rapid similarity searches of
protein and nucleotide sequence databases.
The feature list for AB-BLAST is long and continues to expand,
while performance is improved.
Much of this is outlined
below.
A complete suite of BLAST search programs
(blastp, blastn, blastx, tblastn and tblastx)
is provided in the package,
along with several database management and support programs that
include nrdb, patdb, xdformat, xdget,
seg, dust and xnu.
AB-BLAST has been built to be the most trusted database search tool
in your software toolbox,
doing what you tell it, reporting precisely what it’s doing
— even telling you what it could not do because
of specific parameter restrictions you might wish to change —
and able to handle even your biggest jobs with aplomb.
AB-BLAST uses the most rigorous, sensitive implementation of the BLAST algorithm available,
yet it often runs faster than the rest.
Users of other BLAST implementations have typically suffered through a series
of expensive and time-consuming rewrites, beta releases, format changes,
and a bewildering new set of program parameters, options, and behaviors every few years.
Meanwhile AB-BLAST was built from scratch to offer consistently superior performance
and flexibility — flexibility that has ensured near-absolute backward compatibility
with every AB-BLAST release for over 15 years.
The unique and useful combination of features, sensitivity, speed and reliability
of AB-BLAST
has been achieved through the use of advanced algorithms,
painstaking software coding,
extensive performance benchmarking
and error checking,
and a superior design that anticipates future needs,
all guided by decades of domain knowledge and a unified vision.
This helps AB-BLAST users maintain pace with the latest technology
with every new release.
By now it should be clear that
AB-BLAST is neither a re-hashed nor “Mac-ified”
version of NCBI BLAST —
although AB-BLAST is in many ways easier to use.
AB-BLAST shares essentially no code with NCBI BLAST,
except for some portions that both packages
copied from the public domain ungapped BLAST 1.4 (W. Gish, 1994, unpublished).
More information about the lineage and history
of AB-BLAST development
is available here.
Key Features
Some of the key features of AB-BLAST are described below.
-
AB-BLAST is the premier gapped BLAST with statistics.
AB-BLAST is derived from the original gapped BLAST with statistics
(W. Gish, 1996, unpublished).
Gapped alignment routines are available and used by default
in all AB-BLAST search modes
(BLASTP, BLASTN, TBLASTN, BLASTX and TBLASTX),
with purely ungapped alignments available as an option.
-
Faster and More Sensitive.
AB-BLAST is up to twice as fast as NCBI BLAST in all search modes, while being more sensitive.
With the same parameter settings,
the classical 1-hit BLAST algorithm used by default in AB-BLAST
is faster and more sensitive than the 1-hit algorithm used optionally by NCBI BLAST.
Even when compared at the same sensitivity level,
the AB-BLAST 1-hit algorithm is nearly as fast
and uses much less memory than the 2-hit algorithm described by
Altschul et al. (1997).
For users who desire maximum speed,
an AB-specific implementation of the 2-hit algorithm
is available in all search modes (including BLASTN) that is faster,
more sensitive and more memory-efficient than the NCBI 2-hit implementation.
(See the
hitdist
option.)
-
Multiple Local Alignments with Joint Evaluation.
Virtually all database search programs will find sequence similarities
(locally optimal alignments or approximations thereof)
that are by themselves statistically insignificant and thus are not reported,
but AB-BLAST can identify alignments
— none of which may be statistically significant on their own —
that are statistically significant as a group and therefore do get reported.
Alignments are clustered into “consistent” sets
under a variety of user-configurable constraints,
including maximum allowable separation distance and maximum allowable overlap.
This combination of sensitivity and selectivity
— used by default in all AB-BLAST search modes
and not available anywhere else —
increases the biological relevance of the results.
This feature is essential for finding:
-
all exons in a multi-exon gene sequence, not just high-scoring exons;
-
all complete or partial copies of a repetitive element
in a genomic sequence, not just high-scoring ones;
and
-
multiple, discrete domains of similarity between sequences,
not just the highest-scoring domain.
-
More Sensitive/Selective than Smith-Waterman.
The combination of well-chosen heuristics and statistics in AB-BLAST
can be more sensitive and selective than
the full dynamic programming approach of the classical
Smith-Waterman (1981) algorithm,
which reports only the single highest scoring alignment between two sequences,
as well as other approaches or BLAST implementations that may
identify multiple regions of local similarity but then only evaluate the alignments
in isolation for their statistical significance.
-
Full Smith-Waterman Option.
With the
postsw
option,
a full Smith-Waterman alignment is performed between pairs of query-subject
sequences that are already scheduled to be reported by BLASTP.
The Smith-Waterman alignments are combined with
the heuristic BLAST results, any redundancy between them is removed, and the statistics are recomputed.
In addition to providing alignments guaranteed to be optimal,
this post-processing can significantly improve the P-values and relative ranking of database hits,
often while increasing the execution time only marginally.
-
Choice of Statistical Methods.
AB-BLAST uses
“Sum” statistics
(Karlin and Altschul, 1993)
by default in all search modes,
with Poisson statistics available as an option
(poissonp).
Sum statistics and Poisson statistics involve
joint probability calculations on sets of one or more alignments.
To evaluate the significance of individual alignments (or alignment scores),
simple
Karlin-Altschul (1990)
statistics
are also available with the
kap
option.
-
BLASTN Flexibility.
Unique to AB-BLAST are these features of the BLASTN search mode:
-
Nucleotide scoring matrices.
AB-BLASTN supports fully-specified scoring matrices,
not just simple match/mismatch scoring systems.
This allows transitions to be scored differently from transversions,
and allows positive G-A substitution scores in the design of siRNAs (small interfering RNAs),
where G-U base pairing is allowed.
Scoring matrices can also be tailored to improve the design of PCR primers
or applied to areas of research where a simple match/mismatch scoring system
cannot adequately discriminate.
Contrary to
W. Miller (2001),
scoring matrices were first supported by the NCBI ungapped BLASTN version 1.4
(Gish, W., 1994, unpublished;
see http://blast.advbiocomp.com/blast-1.4).
Support for nucleotide scoring matrices was indeed dropped
by the NCBI when its blastall program was released in 1997,
but this feature was maintained continuously in all WU versions of BLASTN
since the migration to Washington University in St. Louis in 1994,
continuously through the introduction of the original gapped BLAST (WU-BLAST) in 1996,
and on through to today with AB-BLASTN.
-
Flexible Word Lengths.
AB-BLASTN supports BLAST word lengths as short as 1
(re: the
W
parameter).
-
Nucleotide Neighborhood Words.
Nucleotide neighborhood words are supported by AB-BLASTN
using the standard neighborhood word score threshold parameter,
T.
Using neighborhood words, nucleotide sequence similarity can be detected
even in the absence of any identical residues between two sequences.
Users are cautioned, however, that careless use of the
T
parameter can result in crushing amounts
of memory being requested by BLASTN.
For this reason,
T
should likely be used only in conjunction with very short word lengths.
-
Consistently Accurate Statistics with BLASTN.
Since the release of the first gapped BLAST with statistics in 1996
(W. Gish, unpublished),
the statistical significance of gapped alignment scores
in all search modes — including BLASTN —
has been evaluated using appropriately pre-computed “gapped” values
for the statistical parameters λ, K and H,
rather than the often very different values for these parameters that are computed
at run-time for evaluating ungapped alignment scores.
If precomputed values are not available for the specific
combination of scoring system and gap penalties requested by the user,
a prominent warning has always been issued.
In contrast, NCBI-BLASTN only relatively recently began
using precomputed gapped values for λ, K and H
and warning users when appropriate values are unavailable.
-
Virtual Gene Structures.
Linkage information describing “consistent” groups or chains of local alignments (HSPs)
is provided by AB-BLAST
when the
topcomboN
or
links
options are used.
This facility can help with construction of overall gene structures
from what might otherwise be a barrage of local alignments
scattered throughout a 2-dimensional search space.
The
hspsepQmax,
hspsepSmax,
olfraction,
olmax,
golfraction
and
golmax
parameters can also help ensure the reported structures are more biologically relevant.
-
Ease of Database Management.
AB-BLAST supports the eXtended Database Format (XDF),
a power user’s dream
for working with peptide and nucleotide sequences.
Both the NCBI BLAST 2.0 database format and the NCBI implementation
of the BLAST search algorithm were originally restricted to sequences under 16 Mbp
in length,
whereas human genome contigs exceeded 25 Mbp in the last millennium
(Hattori et al., 2000)
and extended to hundreds of megabases many years ago.
In contrast, XDF databases, which were introduced in 1999,
have the facility to accurately store individual sequences
of up to 1 Gbp (1 billion bp) in length with ambiguity codes intact.
Other BLAST software, such as the NCBI’s,
limits database files to 2 gigabytes each,
whereas from its inception XDF database files
could be of virtually unlimited size —
provided of course that the host operating system and file system
support such “large files” (as most modern operating systems and file systems do).
To support XDF databases,
the database formatting tool named xdformat
is provided with AB-BLAST.
Among other distinct capabilities
and advantages to using XDF and xdformat are:
A reverse chronological list of changes to the AB-BLAST software
is available in the file named HISTORY
that comes bundled
with the software.
When possible,
any bugs that have been found have typically been fixed within 24 hours
of their being reported.
Please send
us
bug reports, questions, or suggestions.
Licensing
Full information about licensing of AB-BLAST is provided
here.
Manifest
The AB-BLAST 3.0 package
includes the following data analysis and utility programs:
-
blasta — the unified database search program, which provides
blastp, blastn, blastx, tblastn, and tblastx
search functionality.
-
xdformat — the recommended program for rapidly converting sequences from
FASTA format into the native XDF format read by blasta. The program
can also append new sequences to an existing database,
automatically roll back on errors,
provide flexible indexing and verification services,
and dump data back into FASTA format.
-
xdget — a flexible tool for retrieving sequences
(or segments thereof)
from an indexed XDF database;
retrieved sequences are
optionally reverse-complemented and translated
in the case of nucleotide sequences.
xdformat and xdget are actually one and the same program
to help ensure their mutual compatibility during upgrades.
-
nrdb — a tool for rapidly removing trivial redundancy
(i.e., duplicate sequences)
from one or more input files in FASTA format.
A simple hash table is used,
combined with data compression techniques to allow larger nucleotide
sequence data sets to be manipulated in memory.
-
patdb — a tool much like nrdb for rapidly removing trivial redundancy
from one or more input files in FASTA format,
but with the option
of identifying sequences that are perfect substrings of others.
A Patricia tree is used by the program,
possibly combined with one or more finite state automata.
This tool with its substring removal option can be usefully applied to protein sequences,
which often differ in their inclusion of the initiator methionine;
or in mapping short read sequences onto a genome.
For merely removing trivial redundancy,
nrdb may be more practically applied
to nucleotide sequences than patdb,
because nrdb uses nucleotide data compression techniques
that are not employed by patdb.
-
ab-blastall — a PERL script for converting an NCBI blastall command line
into a roughly equivalent blasta command line
and then invoking blasta.
The output is still in AB-BLAST format.
This is primarily intended as a technology demonstration tool
but may also assist users in their migration
from NCBI BLAST to the more accurate AB-BLAST.
For benchmarking of BLAST,
careful tweaking of parameters may be required, but even with great care,
benchmarking for speed can still be confounded by inaccuracies in NCBI BLAST.
-
ab-formatdb — a PERL script
for converting an NCBI formatdb command line
into the equivalent xdformat command line and then invoking xdformat.
This is primarily intended as a technology demonstration tool but may also
assist users in their migration from NCBI BLAST to AB-BLAST.
-
pam — a program to compute amino acid substitution scoring matrices
having arbitrary scales, using the Dayhoff PAM model.
-
pressdb.real — the legacy pressdb program for any rare users
who may be reliant on the NCBI BLAST 1.4 database format for nucleotide sequences.
-
setdb.real — the legacy setdb program for any rare users
who may be reliant on the NCBI BLAST 1.4 database format
for amino acid sequences.
-
gb2fasta — a parser to extract nucleotide sequences from GenBank flat files
into FASTA format.
-
gt2fasta — a parser to extract amino acid sequences from CDS features
in GenBank flat files and output them in FASTA format.
-
sp2fasta — a parser to extract protein or nucleotide sequences from
EMBL, TrEMBL, or Swiss-Prot database files and output them in FASTA format.
-
pir2fasta — a parser to extract protein sequences from NBRF PIR database
files and output them in FASTA format.
-
seg — a low-complexity filter for protein and nucleotide sequences
(Wootton and Federhen, 1993;
Wootton and Federhen, 1996).
The program identifies low compositional complexity regions.
-
dust — a low-complexity filter for nucleotide sequences
(Hancock and Armstrong, 1994;
Tatusov and Lipman, unpublished).
-
xnu — a low-complexity filter for protein sequences
(Claverie and States, 1993).
The program identifies short-periodicity repeats.
-
sysblast.sample — a sample configuration file that system
administrators may wish to modify and install as /etc/sysblast.
Parameter settings in this file can be used to:
limit the number of threads employed by each BLAST process;
change the default number of threads employed per process;
alter the “nice” value for BLAST processes;
limit the amount of memory utilized by each BLAST process.
AB-BLAST Command Line Options and Parameters
A complete list of command line options and parameters
for modifying the behavior of the AB-BLAST search programs
is available
here.
Comparable AB/NCBI BLAST Parameters
A brief comparison of some of the most important
parameters for controlling sensitivity, selectivity and speed
of AB-BLAST and NCBI BLAST
is available
here.
Environment Variables
AB-BLAST can utilize the settings
of a few environment variables
to adapt its behavior to different computing environments:
BLASTDB, BLASTFILTER and BLASTMAT.
To allow for triple AB/WU/NCBI BLAST installations,
AB-BLAST also supports the environment variables
ABBLASTDB, ABBLASTFILTER and ABBLASTMAT,
as well as
WUBLASTDB, WUBLASTFILTER and WUBLASTMAT.
Settings of the AB versions of these variables take precedence over all others,
and WU variable settings take precedence
over the corresponding base name variables.
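The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the lookup order described in the text (AB first, then WU, then the base name); the function name `resolve_setting` is hypothetical, and treating an empty value the same as an unset one is an assumption.

```python
import os


def resolve_setting(base_name, environ=None):
    """Return the effective value of BLASTDB, BLASTFILTER or BLASTMAT.

    Lookup order follows the text: AB<name>, then WU<name>, then <name>.
    Returns None when none of the three variables is set.
    """
    env = os.environ if environ is None else environ
    for prefix in ("AB", "WU", ""):
        value = env.get(prefix + base_name)
        if value:  # assumption: an empty value counts as "not set"
            return value
    return None
```

For example, with both BLASTDB and WUBLASTDB set, `resolve_setting("BLASTDB")` returns the WUBLASTDB value; setting ABBLASTDB as well would override both.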
In AB-BLAST, the BLASTDB (or ABBLASTDB) environment variable
can be a list of one or more directory names in which the programs
are to look for database files.
In UNIX parlance, such an environment variable might be called a path
for the database files.
Directory names should be delimited from one another by a colon
(“:”) and listed in the order that they should be searched.
If the BLASTDB environment variable is not set, the programs use a default
path of .:/usr/ncbi/blast/db, such that the programs first look in the
current working directory (“.”) for the requested database
and then look in the /usr/ncbi/blast/db directory.
For backward compatibility with
programs that expect BLASTDB to be a single directory specification and
not a path, if the user has set a value for BLASTDB but omitted the current
working directory,
AB-BLAST will still look for database files
in the current working directory as a last resort.
This usage is unchanged from NCBI/WU BLAST version 1.4 (1994),
except multiple directories could be specified with the BLASTDB
variable beginning with WU-BLAST 2.0 ca. 1997.
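The search order described above might be modeled as follows. This is a minimal sketch, not the programs' actual code: the function names are hypothetical, and the sketch looks for a single file literally named `name`, whereas a real database consists of several files whose names and extensions are not covered here.

```python
import os

DEFAULT_PATH = ".:/usr/ncbi/blast/db"  # default stated in the text


def database_search_dirs(blastdb=None):
    """List the directories to search, in order.

    An unset BLASTDB falls back to the default path; a set BLASTDB that
    omits the current working directory gets "." appended as a last resort.
    """
    if not blastdb:
        blastdb = DEFAULT_PATH
    dirs = [d for d in blastdb.split(":") if d]
    if "." not in dirs:
        dirs.append(".")
    return dirs


def find_database(name, blastdb=None, exists=os.path.exists):
    """Return the first existing candidate path for `name`, or None."""
    for directory in database_search_dirs(blastdb):
        candidate = os.path.join(directory, name)
        if exists(candidate):
            return candidate
    return None
```

The `exists` argument is there only so the lookup can be exercised without touching the file system.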
The BLASTFILTER (or ABBLASTFILTER) environment variable
can be set to the directory containing the sequence filter programs,
such as
seg and
xnu.
The default directory for the filter programs is /usr/ncbi/blast/filter.
This usage is unchanged from NCBI/WU BLAST version 1.4.
The BLASTMAT (or ABBLASTMAT)
environment variable can be set to the parent
directory for all scoring matrix files.
The default directory for these files is /usr/ncbi/blast/matrix,
beneath which nt and aa subdirectories are expected,
for storing scoring matrix
files for nucleotide and amino acid alphabets, respectively.
This usage is unchanged from NCBI/WU BLAST version 1.4.
For more information about environment variables, see the
Installation instructions.
Filters and Masks
AB-BLAST provides a highly flexible means
of applying both “hard” and “soft” masks to a query sequence,
of supporting alternative, user-defined filter programs
and non-standard parameters to the standard filters.
The filter
(for hard masking) and
wordmask
(for soft masking)
command line options provide the basic interface.
Multiple specifications of each type are acceptable
on the BLAST command line.
Furthermore, individual filter and wordmask specifications may
consist of entire pipelines of commands.
For example, three filters are used in succession by this pipeline:
filter="myfilter1 | myfilter2 | myfilter3 -x5 -"
The first two filters in this case expect to read their input from UN*X
standard input (also known as stdin),
whereas myfilter3 apparently needs to be told
to read data from stdin,
using the usual “-” or hyphen argument.
The standard output (stdout) from myfilter1
will be read via stdin by myfilter2,
which in turn processes the query
before handing its results to myfilter3;
finally, myfilter3 reports its results to stdout,
which the BLAST program itself reads to obtain the fully masked sequence.
The final output from the filter pipeline is expected by the BLAST
program to be in FASTA format.
Instead of running all 3 filters in the above example as part of one
pipeline, they could instead be specified as three separate filter options
like this:
filter=myfilter1 filter=myfilter2 filter="myfilter3 -x5 -"
The same choice of running as a pipeline or running separately is available
for wordmasks, too.
Naturally, the two approaches can also be combined on the same command line.
An advantage to using the pipeline approach is that all 3 filters
in the example above may complete a little bit faster,
because much of the I/O overhead is avoided.
Furthermore,
when used in the pipeline,
there is no requirement that the output from myfilter1 and myfilter2
actually be in FASTA format.
Those two programs could potentially pass any information between
themselves and to myfilter3.
The only absolute requirements are that the first filter in the pipeline,
myfilter1, must read FASTA data from stdin,
and the last filter in the pipeline, myfilter3,
must output FASTA data (that is also of the same length
as the query!) to stdout.
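To make the two requirements concrete, here is a minimal sketch of a user-defined filter written in Python. It is purely illustrative: the masking rule (replace runs of six or more identical letters with X), the function names and the defaults are all assumptions, not part of AB-BLAST. What it does honor is the contract stated above: FASTA in on stdin, FASTA of identical length out on stdout.

```python
import re
import sys


def mask_runs(seq, min_run=6, mask_char="X"):
    """Replace each run of >= min_run identical letters with mask_char.

    The output always has the same length as the input.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


def filter_fasta(text, **kwargs):
    """Mask sequence lines of FASTA text; header lines pass through.

    Simplification: each line is masked on its own, so a run that spans
    a line break is not detected.
    """
    out = []
    for line in text.splitlines():
        out.append(line if line.startswith(">") else mask_runs(line, **kwargs))
    return "\n".join(out) + "\n"


# Like myfilter3 in the text, this sketch reads stdin only when given "-".
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    sys.stdout.write(filter_fasta(sys.stdin.read()))
```

Saved under a hypothetical name such as myfilter.py, it could be tried from a shell with `python3 myfilter.py - < query.fasta` before being referenced in a filter specification.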
It should be noted that with some filter programs,
passing the query sequence sequentially through
a pipeline of filters may yield
a different result than processing the query independently with each filter
and OR-ing the results.
The script seg+xnu included in the filter/ directory provides
an example with which to test this.
Specifying filter=seg+xnu on the BLAST command line
invokes a seg and xnu pipeline that is built into the search programs;
whereas specifying filter="seg+xnu -"
causes the seg+xnu script to be invoked on the query, which independently
executes seg and xnu,
then logically “ORs” the results with the pmerge utility program.
(The echofilter option can be used to see the results of filtering displayed
in search program output.)
The built-in seg+xnu pipeline is historically the way these two filters have
been invoked together,
but the somewhat slower method employed
by the seg+xnu script with pmerge may be more desirable.
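The logical OR of two independently masked copies of a query can be sketched as below. This is not the pmerge program, whose implementation is not described here; it is a hypothetical stand-in that assumes a position counts as masked whenever a filter's output differs from the original query at that position.

```python
def or_masks(original, masked_a, masked_b, mask_char="X"):
    """Combine two independently masked copies of `original`.

    A position is masked in the result if either copy altered it;
    otherwise the original residue is kept.
    """
    if not (len(original) == len(masked_a) == len(masked_b)):
        raise ValueError("all three sequences must have the same length")
    merged = []
    for orig, a, b in zip(original, masked_a, masked_b):
        merged.append(mask_char if (a != orig or b != orig) else orig)
    return "".join(merged)
```

In a sequential pipeline, by contrast, the second filter sees the first filter's mask characters instead of the original residues, which is one reason the two approaches can give different results.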
Precomputed Statistical Parameters
Nucleotide Scoring Systems
Precomputed values for λ, K and H
are available for BLASTN searches
with the following match,mismatch
(M,N)
scoring systems,
using the sets of gap penalties
{Q,R}:
Precomputed Nucleotide Scoring Systems
M | N | {Q,R}
---|---|---
+1 | −3 | {3,3} {3,2} {3,1} {7,2}
+1 | −2 | {2,2} {2,1} {1,1}
+3 | −5 | {10,5} {6,3} {5,5}
+4 | −5 | {10,5}
+1 | −1 | {3,1} {2,1}
+5 | −4 | {20,10} {10,10}
+5 | −11 | {22,22} {22,11} {12,2} {11,11}
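For scripted pipelines, it may be convenient to check a planned BLASTN scoring system against the table above before launching a search. The sketch below simply transcribes the table into a Python dictionary; the names `PRECOMPUTED_BLASTN` and `has_precomputed` are hypothetical, and the programs' built-in tables remain the authority.

```python
# (M, N) -> set of (Q, R) pairs, transcribed from the table above.
PRECOMPUTED_BLASTN = {
    (1, -3): {(3, 3), (3, 2), (3, 1), (7, 2)},
    (1, -2): {(2, 2), (2, 1), (1, 1)},
    (3, -5): {(10, 5), (6, 3), (5, 5)},
    (4, -5): {(10, 5)},
    (1, -1): {(3, 1), (2, 1)},
    (5, -4): {(20, 10), (10, 10)},
    (5, -11): {(22, 22), (22, 11), (12, 2), (11, 11)},
}


def has_precomputed(m, n, q, r):
    """True if the table lists gapped parameters for M, N, Q and R."""
    return (q, r) in PRECOMPUTED_BLASTN.get((m, n), set())
```

A combination for which `has_precomputed` returns False is one where, according to the table, a warning about unavailable precomputed values should be expected.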
Precomputed values are also available for a Purine-Pyrimidine scoring matrix
named “pupy”:
Protein Scoring Systems
Precomputed values for λ, K and H
are available for protein-level searches
(BLASTP, BLASTX, TBLASTN and TBLASTX)
with the following scoring matrix and
gap penalty combinations (or gap penalty ranges for R) {Q, R}:
BLOSUM50
Q | R
---|---
16 | 1–4
15 | 1–4, 6, 8
14 | 1–5, 8
13 | 1–5, 8
12 | 2–5, 7
11 | 2–4, 6, 8
10 | 2–6, 8
9 | 3–5, 7
8 | 4–8
7 | 6, 7
BLOSUM55
Q | R
---|---
16 | 1–4
15 | 1–4, 6, 8
14 | 1–5, 7
13 | 2–5, 8
12 | 2–5, 8
11 | 2–6, 8
10 | 3–6, 9
9 | 3–5, 7
8 | 4–8
7 | 7
BLOSUM62
Q | R
---|---
12 | 1–3
11 | 1–3
10 | 1–4
9 | 1–5
8 | 2–7
7 | 2–6
6 | 3–5
5 | 5
BLOSUM80
Q | R
---|---
12 | 2–12
11 | 2–11
10 | 2–10
9 | 3–9
8 | 4–8
7 | 5–7
PAM40
Q | R
---|---
12 | 1, 2, 6
11 | 1, 2, 7
10 | 1–3, 7
9 | 1–3, 6
8 | 1–4
7 | 1–4
6 | 2–5
5 | 2–5
4 | 3, 4
PAM120
Q | R
---|---
12 | 1, 2, 4
11 | 1–3
10 | 1–3, 5
9 | 1–3, 5
8 | 1–4, 6
7 | 2–4, 6
6 | 2–5
5 | 3–5
PAM250
Q | R
---|---
16 | 1–4
15 | 1–5
14 | 1–6
13 | 1–6
12 | 2–7
11 | 2–7
10 | 3–8
9 | 3–7
8 | 5–7
7 | 7
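As background on why the choice of λ and K matters, the standard Karlin-Altschul relation for a single alignment score S gives an expected number of chance hits E = K·m·n·e^(−λS), where m and n are the query and database lengths. The sketch below evaluates that relation. The parameter values in the usage note are made up for illustration (they are not AB-BLAST's precomputed values), the lengths are used without any edge-effect correction, and the Sum statistics that AB-BLAST applies by default to groups of alignments are not modeled.

```python
import math


def expected_hits(score, lam, k, query_len, db_len):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def bit_score(score, lam, k):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)
```

With a hypothetical "gapped" pair (λ = 0.27, K = 0.04) and a hypothetical "ungapped" pair (λ = 0.32, K = 0.13), the same raw score of 100 receives a much smaller E from the ungapped pair, showing how the wrong parameter set can make a gapped alignment look more significant than it is.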
Bugs
AB-BLAST is certainly not bug free, but historically
bugs have been fixed typically within 24 hours of their being reported.
The currently known bugs are:
-
The scale of the included BLOSUM80 scoring matrix is 1/3 bit, rather than
the 1/2 bit scale used otherwise for BLOSUM60 and above (BLOSUM60, 62, 70, 90, and 100).
This anomaly—which goes all the way back to NCBI BLAST 1.3 in 1993—may
be corrected (along with revised gapped lambda, K and H parameters) in a future release.
If you think you might be experiencing the effects of a bug,
please contact
us.
AB-BLAST exhibits a few other behaviors worth mentioning here,
because they could trip up or confuse even the most knowledgeable of BLAST users.
Any unexpected behavior might rightfully be construed as being a bug,
so the following information is provided here in the Bugs section to help avoid the unexpected.
If you should encounter problems or confusing areas
other than those described below,
or if you have questions or suggestions for improvement,
please send them to
us.
-
Due to support added for the amino acid codes J and O,
XDF protein sequence databases produced with the AB- version of xdformat
are not readable by programs in the old WU-BLAST package.
For an XDF protein sequence database named “foo”
created by WU-xdformat,
the AB-xdformat command:
xdformat -p -i foo
will report the alphabet name as
“NCBIstdaa(1)” (NCBIstdaa version 1).
The larger amino acid alphabet normally used
by the AB- version of xdformat
is named “NCBIstdaa(2)” (NCBIstdaa version 2).
WU-BLAST only uses the version 1 alphabet,
whereas AB-BLAST creates and updates databases using version 2,
can read databases in either alphabet,
and can read combinations of the two alphabets in virtual databases.
N.B.: If a protein database created by WU-xdformat
is updated using AB-xdformat,
the alphabet is silently updated to NCBIstdaa version 2, which will render
the database unreadable by programs in the WU-BLAST package.
This warning does not currently apply to nucleotide sequence databases,
because no change has thus far been necessary in the nucleotide alphabet
used by AB-BLAST.
-
AB-BLAST 3.0 establishes a new default scoring system for BLASTN.
Much confusion
— and at times much consternation —
has been caused by the default WU-BLASTN scoring system
being quite different from the default nucleotide scoring system
used by the NCBI blastall program.
For the sake of long-range compatibility and consistency,
the WU-BLASTN default +5/−4 (match/mismatch) scoring system,
which dates back to the earliest incarnations of BLASTN at the NCBI
in 1989,
had been left unchanged.
The NCBI changed its default nucleotide scoring system to +1/−3 upon introduction
of blastall in 1997.
The +1/−3 scoring system selects for nearly identical sequences.
The newer default scoring system is more consistent
with the BLASTN default word length of 11, which also selects
for nearly identical sequences.
With the transition of WU-BLAST to AB-BLAST,
the decision was made to eliminate the inconsistency
between the default scoring system and default word length,
by making the default nucleotide scoring system used by AB-BLAST
be the same as that used by NCBI blastall
(i.e., match and mismatch scores M=1, N=−3;
and gap penalties Q=7, R=2).
NOTE: The default amino acid scoring system used in all other search modes remains unchanged
in AB-BLAST and is somewhat different from the NCBI amino acid scoring system.
Another major difference in default behavior
between WU- and NCBI-BLAST,
whether or not to filter query sequences for low-complexity regions,
also remains unchanged in AB-BLAST.
Namely,
just as WU-BLAST does not filter query sequences by default,
neither does AB-BLAST filter sequences by default.
-
The amino acid codes
U (selenocysteine or Sec)
and
O (pyrrolysine or Pyl)
are acceptable in query and database sequences,
but the scoring matrices distributed with AB-BLAST do not specify scores
for these letters.
By default these letters are scored the same as alignment with an X (unknown residue)
would be scored,
except for their self-alignment scores
(i.e., U with U and O with O)
which are set to 0 by default.
If more meaningful scores are known, alternative scores for these letters
can be set explicitly in the amino acid scoring matrices.
-
The only accepted way to specify an alternative scoring matrix file
is to refer to the file by name (e.g., matrix=BLOSUM55)
and for the file to reside in the current working directory
or for the path to the file to be
listed in the BLASTMAT environment variable.
If both a path and file name to a scoring matrix file are specified,
such as in matrix=/usr/local/blast/matrix/aa/BLOSUM62
or matrix=aa/blosum62,
the search programs will claim not to be able to find the file even though
it may indeed exist and be readable.
This is a security measure that may allow managers
of network- or web-based search services to expose
the command line to users without opening up access to potentially any file
on the server,
when the mere knowledge that a file exists might be considered a breach of security.
-
The gap penalty parameters Q and R of AB-BLAST
are similar to the parameters G and E of NCBI Gapped BLAST
but differ from them in one important respect.
While the two extension penalties R (AB-BLAST) and E (NCBI BLAST) are analogous,
Q (AB-BLAST) is analogous to the sum of G and E
with NCBI BLAST.
In other words, where Q
represents the total penalty for a gap of length 1,
NCBI Gapped BLAST computes this penalty as G + E.
-
The default sort order for reporting database hits
is by increasing E-value (most-to-least significant ordering),
but for a given database hit,
the alignments or HSPs with that sequence are sorted
primarily by query strand, secondarily by the database (“subject”) strand,
and only then by E-value.
For example,
if any alignments of a given database sequence are to the minus strand of the query,
they will be reported after any alignments to the plus strand,
even if alignments to the plus strand are less significant.
In a TBLASTX search,
in which both the query and subject are translated nucleotide sequences,
for each strand of the query,
hits to the plus strand of the subject will be reported
before any hits to its minus strand.
Consequently, identifying the HSP ascribed with the greatest statistical significance
may require many lower-significance alignments to be parsed first.
Naturally, this consideration is not an issue for BLASTP searches,
where only one “strand” of query and subjects is searched.
-
On those rare computing platforms today that do not support “large” files
(files >2 GB in size),
users will be unable to search nucleotide sequence databases
larger than about 8 billion nucleotides
or protein sequence databases larger than about 2 billion amino acids.
Migrating to a contemporary 32-bit operating system,
or to a 64-bit computing platform that provides “large file” support,
is sufficient to break through the “2 GB barrier”.
-
The statistical significance of gapped alignment scores is computed using
values for λ, K and H
obtained from built-in,
precomputed tables.
(The values for λ, K and H used to assess the
significance of ungapped alignment scores are still computed at run time,
as is practical.)
These parameter values are
determined by the scoring matrix and gap penalties being used.
Precomputed values are necessarily not available for
all scoring matrix and gap penalty combinations, though;
and the precomputed values may not be well-suited
to an unusual residue composition of the query or database sequences.
In cases when precomputed values are unavailable,
the programs issue a relevant WARNING message and proceed to evaluate gapped alignment scores
using values for λ, K and H
that are likely to be incorrect:
the values computed at run-time for ungapped alignments.
In such cases, the reported significance estimates may be highly inaccurate
and will be biased towards being overly significant.
If the user knows more accurate parameter values for their situation,
however, the gapK, gapL and gapH command line options
can be used to set them.
-
Selecting an alternative scoring matrix does not alter
the gap penalties (Q and R)
from their default values.
Leaving gap penalties at their default values when choosing an alternative
scoring matrix
can not only result in alignments with undesirable gap characteristics but
can create a situation in which
the programs do not have precomputed values in their built-in tables
for λ, K and H.
Worst-case, the end result can be that the alignments represent horribly inaccurate
mappings between the query and subject sequences and the P-values ascribed
to the alignments are horribly inaccurate as well.
(Actually, a worst-case scenario might be when the alignments and statistics
are bad but not bad enough to be noticed by the user, who then proceeds to use
the results—both false positives and false negatives—as though they were meaningful.)
As described earlier, a WARNING message will be displayed when precomputed
values are not available,
but nevertheless the search will go on
and the alignments and statistics may be anywhere from
slightly to horribly misleading.
-
The
hspsepqmax
and
hspsepsmax
parameters are measures of distance
in residues along the sequences in the specific form in which they are
actually compared.
For instance, in a BLASTX search (conceptually translated nucleotide
query compared against a protein sequence database),
hspsepqmax
refers to a distance measured in amino acid residues, not the underlying
nucleotides in the query.
-
ASN.1
formatted output is not available from AB-BLAST.
XML and tab-delimited output formats are recommended instead.
(See the
mformat
parameter.)
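To illustrate why evaluating gapped scores with ungapped parameters biases results towards being overly significant, the following sketch applies the standard Karlin-Altschul expectation, E = K·m·n·exp(−λS), to one score using two parameter sets. The λ and K values are illustrative numbers of a magnitude typical for protein scoring systems; they are assumptions for this example, not values read from AB-BLAST's built-in tables. The sketch also ignores refinements such as effective lengths and Sum statistics.

```shell
#!/bin/sh
# Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)
# m = query length, n = database length, S = raw alignment score.
evalue() {
    # usage: evalue LAMBDA K M N SCORE
    awk -v l="$1" -v k="$2" -v m="$3" -v n="$4" -v s="$5" \
        'BEGIN { printf "%.3g\n", k * m * n * exp(-l * s) }'
}

# Illustrative inputs only (assumed for this sketch).
M=300          # query length in residues
N=100000000    # database length in residues
S=100          # raw score of a gapped alignment

echo "E with ungapped-style parameters: $(evalue 0.318 0.134 $M $N $S)"
echo "E with gapped-style parameters:   $(evalue 0.267 0.041 $M $N $S)"
```

With these inputs the ungapped-style parameters yield an expectation roughly fifty-fold smaller than the gapped-style parameters do, because the larger λ dominates the exponential term. Supplying better estimates through gapK, gapL and gapH avoids this bias.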
Supported Platforms for Standard & Enterprise Editions
The computing platforms currently supported for
AB-BLAST Standard Edition and Enterprise Edition are listed below.
(The list of platforms supported for AB-BLAST Personal Edition is much
shorter).
Software for computing platforms other than those listed here
may be available upon request, but additional charges may apply.
- Linux kernel versions 2.4 and 2.6 for 32-bit i786 (Pentium4) and 64-bit X64
- Apple Mac OS X 10.4+ on 32-bit X86 and X64
- Apple Mac OS X 10.4+ on 32-bit PowerPC G4 and 64-bit PowerPC G5
- Sun Solaris 10 on 64-bit X64
The list of supported platforms is subject to change without notice.
Multiple processors
(multithreading or parallel processing)
are effectively and efficiently
supported by AB-BLAST on all of the above platforms.
AB-BLAST also supports large files
(files greater than 2 GB in size),
when the host operating system and file system support large files.
Installation
To install AB-BLAST,
the first step is to download the UN*X tar archive of executable files appropriate
for your computing platform
from the Advanced Biocomputing, LLC website.
To locate the software,
licensed users will have received a confidential URL via e-mail.
Please note that scoring matrix files and documentation,
which are not generally platform-specific,
are nevertheless included in each package.
No databases are included, however.
Unpack the tar archive in a new, empty directory.
For convenience, precompiled and optimized versions
of the low-complexity sequence filters
(e.g., seg,
xnu,
and
dust)
are included (see the filter/ subdirectory that gets created),
along with two sequence redundancy removal programs
nrdb
and patdb.
Users of Mac OS X 10.6 (“Snow Leopard”) Only
To ensure proper, complete unpacking of tar archives on normal, case-insensitive HFS+ file systems,
use the Terminal app to execute the command:
gnutar zxf <archive.tar.gz>
where <archive.tar.gz> is replaced with the name of the AB-BLAST archive you downloaded.
The use of gnutar is needed to avoid a bug in the version of tar currently distributed with Snow Leopard
(at least up to version 10.6.1) that involves the treatment of hard-linked files.
The executable programs from the tar archive
must reside in a directory listed in the PATH
environment variable.
Either add the newly created directory to the PATH
or move the executables
into an existing directory already listed in the PATH.
(Plenty of information about inspecting and setting environment variables,
and about the PATH
environment variable in particular,
can be found with any web search engine
using the query “PATH environment variable”.)
If the software is installed in a directory that was already listed
in the PATH,
it may be necessary to exit the currently open shell and open a new one
in order for the shell to recognize the existence
of the newly installed programs.
Note that the files
blastp, blastn, blastx, tblastn and tblastx
are actually “hard links” to the same executable program,
blasta,
that encodes the integrated capabilities of all 5 search methods.
If desired, the links can be renamed, as long as the original names appear as substrings
within the new names.
Alphabetic case is unimportant.
For instance, a link named ab-blastp will still invoke blasta
in its blastp operational mode.
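The naming rule can be sketched as a small shell function. This is only an illustration of a case-insensitive substring test, not the logic inside blasta; blast_mode is a hypothetical helper name. Testing the longer names first is an assumption made here because blastn and blastx are themselves substrings of tblastn and tblastx.

```shell
#!/bin/sh
# Hypothetical helper: guess which search mode a link name would select,
# using a case-insensitive substring test on the file name.
blast_mode() {
    name=$(basename "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *tblastn*) echo tblastn ;;   # longer names are tested first
        *tblastx*) echo tblastx ;;
        *blastp*)  echo blastp ;;
        *blastn*)  echo blastn ;;
        *blastx*)  echo blastx ;;
        *)         echo unknown; return 1 ;;
    esac
}

blast_mode /usr/local/bin/ab-blastp   # prints: blastp
blast_mode MyTBLASTN                  # prints: tblastn
```

A renamed link would be created in the usual way, for example with ln blasta ab-blastp in the installation directory.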
A Note to Mac OS X Users
AB-BLAST software is intended to be invoked via a CLI (command line interface).
Programs will need to be invoked either using the Terminal application
(located in the /Applications/Utilities folder)
or from within a script or other application provided by a third party.
The programs bundled with AB-BLAST are not themselves intended to be double-clicked to execute.
A Note About File Permissions and File Copying
The AB-BLAST package is copyrighted and only available under license.
To help ensure users of the software
do not unintentionally copy or distribute it,
it is recommended that all copies of the binary files be maintained
with execute-only permissions.
As delivered in the software archives from Advanced Biocomputing, LLC,
execute-only permissions have already been set,
but if you copy the binary files,
these permissions may be altered and thus allow other users
to copy the software in an unauthorized manner.
Restoration of execute-only permissions to an executable program file
can be accomplished by running the command:
chmod a-rw,a+x filename
where filename is the name of the executable file.
If you already had AB-BLAST (or WU-BLAST) installed (with BLAST-able databases),
your installation or update of AB-BLAST is essentially complete.
If you did not have AB-BLAST or WU-BLAST installed,
read on...
Unpacking the tar archive creates a matrix/ subdirectory containing scoring
matrix files.
Wherever this directory ultimately resides,
the ABBLASTMAT (BLASTMAT) environment variable should be set to point there.
In the absence of this environment variable being set,
AB-BLAST programs first look for scoring matrix files
in any matrix/
subdirectory
of the directory in which the search programs themselves reside
and then in the /usr/ncbi/blast/matrix
directory.
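The lookup order described above can be sketched as follows. This is a simplified illustration, not the programs' actual code. matrix_dir is a hypothetical helper, /opt/ab-blast is a placeholder installation path, and giving ABBLASTMAT precedence over BLASTMAT is an assumption based on the stated preference for the newer variable names.

```shell
#!/bin/sh
# Sketch of the documented search order for scoring matrix files.
# $1 = directory in which the search programs reside.
matrix_dir() {
    bindir=$1
    # An environment variable, when set, decides the location outright.
    if [ -n "${ABBLASTMAT:-}" ]; then echo "$ABBLASTMAT"; return 0; fi
    if [ -n "${BLASTMAT:-}" ]; then echo "$BLASTMAT"; return 0; fi
    # Otherwise fall back to the two default locations, in order.
    for d in "$bindir/matrix" /usr/ncbi/blast/matrix; do
        [ -d "$d" ] && { echo "$d"; return 0; }
    done
    return 1
}

# Placeholder path; prints the directory that would be used, if any.
matrix_dir /opt/ab-blast || echo "no matrix directory found"
```

In practice, a line such as export ABBLASTMAT=/opt/ab-blast/matrix in a shell startup file makes the location explicit and independent of where the executables live.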
Low-complexity sequence filters or masking programs —
e.g.,
seg,
xnu
and
dust —
are now included in the tar archives described here.
The bundled versions of these programs are precompiled and optimized.
While these filter programs are not required for running the search programs,
they can enormously reduce the amount of garbage output produced,
memory used, and search time taken.
Hence, it is highly recommended that these programs be made available to users.
Whatever directory you install the filter programs in,
the BLASTFILTER (or ABBLASTFILTER) environment variable should be set to point there.
In the absence of this environment variable being set,
the programs look for masking programs in any filter/
subdirectory of the directory in which the search programs themselves reside
and then in /usr/ncbi/blast/filter.
NOTE: unlike NCBI BLAST,
the AB-BLAST search programs do not employ sequence filtering by default.
This behavior might change in the future, though.
In case the search programs are updated on your system without warning
and you wish to guarantee in an automated analysis pipeline
that no filtering will ever be performed,
just specify filter=none
on the command line.
The databases themselves are obviously not included with the software.
Once the source databases have been downloaded from any of many Internet sites,
the database files are typically uncompressed and processed into FASTA format,
if they are not in FASTA format already.
Included in the tar archives are several utility programs for converting
textual database files into FASTA format, among them gb2fasta and gt2fasta.
The NCBI software
Toolbox also contains some relevant parsers.
One of these is
asn2fsa,
which converts both nucleotide and peptide sequences in
GenBank ASN.1 format
into FASTA format files.
The asn2ff
parser,
which converts GenBank ASN.1 data into other flat file formats,
may also come in handy, especially if you are inclined to parse GenBank
into FASTA using your own routines
or to use the gb2fasta and gt2fasta programs mentioned above.
All of the above parsers can read from standard input (sometimes signified by a single dash, “-”),
so their input files can be maintained on disk in compressed format and dynamically zcat-ed or gunzip-ed
directly into the parsers, thus saving the time and storage required for the uncompressed data.
Because a dash is often used to signify the start of each command line option,
if a dash is needed to specify standard input for the required input file name argument,
some of these programs require that a double-dash (--)
be specified on the command line before the single-dash.
This double-dash signifies the end of the command line options and the
start of the required arguments.
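The streaming idiom and the double-dash convention can be tried with standard tools alone. In this sketch, cat stands in for one of the parsers, decompress_to is a hypothetical helper, and the tiny FASTA file is made up for the demonstration. Whether a particular parser needs the double-dash should be checked against its own usage message.

```shell
#!/bin/sh
# Decompress an archive on the fly and pipe it into another command.
# usage: decompress_to ARCHIVE.gz COMMAND [ARGS...]
decompress_to() {
    archive=$1
    shift
    gunzip -c "$archive" | "$@"
}

# Build a small compressed FASTA file to stand in for a real download.
tmp=$(mktemp -d)
printf '>seq1 demo\nACGTACGT\n' | gzip -c > "$tmp/demo.fa.gz"

# "--" ends option parsing, so the following "-" is read as the
# input-file argument (standard input) rather than as an option.
decompress_to "$tmp/demo.fa.gz" cat -- -

rm -rf "$tmp"
```

The uncompressed data never touches the disk, which is the saving in time and storage mentioned above.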
Once a source database is in FASTA format,
the xdformat program
should be used to convert it into “blastable” format.
Concise usage instructions for xdformat (and xdget)
can be obtained by invoking each program without any command line arguments.
By default,
xdformat produces 3 output files whose names are derived from the name
of the FASTA input file.
The 3 output files have distinct file name extensions and
together comprise the blastable database.
If sequence identifiers are optionally indexed during database creation,
the blastable database will consist of a total of 4 output files.
Databases formatted by xdformat
retain full ambiguity code information within the blastable database files.
By default, if any unrecognized amino acid or nucleotide codes are encountered
or if the FASTA input file should otherwise appear corrupt,
xdformat will emit an error message and halt.
In such cases, if the blastable database was to be newly created,
xdformat will remove the blastable database files before halting.
If new sequences were being appended to an existing blastable database when the error arose,
the blastable database will be rolled back to its original state prior to the update attempt
with none of the new sequences appended.
While formatting the database,
the xdformat program can optionally (-I option)
index the sequence identifiers
for later identifier-based retrieval with the xdget program.
XDF databases that were formatted without an identifier index
can have an index created post hoc by xdformat
with its -X option.
It may be of interest to note
for the purposes of their maintenance
that xdformat and xdget are actually one-and-the-same
program file,
merely invoked under the two different names to obtain the two
different program behaviors.
This helps ensure that the index created with xdformat
will be compatible with xdget.
See the file "FAQ-Indexing.html" for more details on identifier indexing.
For compatibility with legacy BLAST installations,
the xdformat program can function
in a setdb- and pressdb-compatibility mode,
wherein its behavior is similar to that of setdb and pressdb.
In its compatibility mode, a similar command line structure is used
and the output files produced have the same names as those produced by setdb and pressdb.
Compatibility mode is invoked
when xdformat is renamed or has links pointing
to it named setdb and pressdb.
While the files produced in compatibility mode have the same file names as those
produced by the original setdb and pressdb programs
(setdb.real and pressdb.real),
the content of these files is always XDF.
Versions of the BLASTA search program dated on or after 1999-12-14
are able to work with the more-capable XDF databases.
Note that two XDF databases —
one protein and one nucleotide —
can be created with the exact same name and exist in the exact same directory,
because the 3-letter extensions of XDF database file names are distinct
for protein sequence databases and nucleotide sequence databases.
If xdformat and the legacy setdb and pressdb programs
have all been used to create databases with the same name that reside in the same directory,
the BLAST search programs will preferentially search the databases
created with xdformat, which have the standard XDF database
file name extensions.
Using the -t option to xdformat,
a descriptive name or title can be assigned to a database that will appear in BLAST search output.
The title of an existing database can be changed after its creation,
by appending an empty FASTA database and specifying the -t option with the desired new title.
For example,
xdformat -n -a mydb -t "Fancy New Title" /dev/null
The blastable database files can be placed anywhere,
but for convenience the BLASTDB environment variable
should include their directory location.
If the BLASTDB environment variable is not set,
the programs look for databases by default
in /usr/ncbi/blast/db
and in the current working
directory.
If the old pressdb program (instead of xdformat)
is used to create the blastable database,
the associated nucleotide sequence FASTA file must be located
in the same directory as the three output files from pressdb,
if the BLAST search programs are to find the FASTA file.
It may sometimes be useful to maintain the FASTA files
in a separate directory — even on another disk partition — and
provide UNIX soft links in the BLASTDB directory that point to the
real location of the FASTA files.
In addition, on systems where NCBI BLAST will not be in use,
blastable databases can be maintained in multiple directories listed
in the BLASTDB environment variable,
with each directory name delimited from the next by a colon (:),
just as directory names are often delimited
in the PATH environment variable.
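A colon-delimited BLASTDB can be walked the same way a shell walks PATH. The sketch below is an illustration only: find_in_blastdb is a hypothetical helper, the directory names are placeholders, and the fallback list used when BLASTDB is unset simply mirrors the two default locations named above.

```shell
#!/bin/sh
# Look for a file in each directory of a colon-delimited BLASTDB list.
# usage: find_in_blastdb FILENAME
find_in_blastdb() {
    name=$1
    old_ifs=$IFS
    IFS=:      # split the list on colons
    set -f     # no filename globbing while the list is split
    for dir in ${BLASTDB:-/usr/ncbi/blast/db:.}; do
        if [ -e "$dir/$name" ]; then
            IFS=$old_ifs; set +f
            echo "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs; set +f
    return 1
}

# Placeholder directories for illustration.
BLASTDB=/data/blast/db:/home/me/blastdb
find_in_blastdb mydb.fa || echo "mydb.fa not found in $BLASTDB"
```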
On multi-processor computer systems,
the search programs will employ
as many CPUs as are installed;
when more than about 4 CPUs are used,
this default behavior can cause the efficiency of hardware utilization
to be quite low,
compared to running individual single-threaded BLAST jobs
on each CPU.
Memory use also increases linearly with the number of CPUs or threads
employed.
One way to govern the number of processors employed is
to wrap the search programs in a shell script that sets a lower
number of CPUs via the cpus=# command line option.
Another, simpler approach to changing the default number of CPUs
for all users follows below,
for implementation by BLAST system managers possessing
“root” or “SuperUser” privileges.
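The wrapper approach might look like the sketch below. REAL_BLAST, BLAST_CPUS and the path /opt/ab-blast/blastp are names invented for this illustration. The sketch also assumes that a cpus= setting appended after the user's own arguments is the one that takes effect, which should be verified before relying on it.

```shell
#!/bin/sh
# Hypothetical wrapper: run a search program with a capped CPU count.
# REAL_BLAST  = path to the real executable (placeholder default below)
# BLAST_CPUS  = number of CPUs to request (defaults to 2 in this sketch)
run_blast() {
    "${REAL_BLAST:-/opt/ab-blast/blastp}" "$@" cpus="${BLAST_CPUS:-2}"
}

# In a wrapper script installed ahead of the real program in the PATH,
# the last line would simply be:
#   run_blast "$@"
```

As noted below for /etc/sysblast, a wrapper of this kind is only a convenience: any user who can execute the real program directly can bypass it.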
Distributions of AB-BLAST
include a sample file named sysblast.sample,
that illustrates the system-wide configuration parameters
that can be established to govern the execution of BLAST jobs
and, thereby, provide a more productive, trouble-free level of service.
When the sysblast file is installed under the name /etc/sysblast,
all BLAST jobs executed on a given computer system can be made
subject to the parameters:
- cpusmax=<n>: a hard limit on the number of CPUs or threads employed
by each BLAST job;
it is possible to prohibit BLAST searches entirely
on a given computer by configuring a negative value for cpusmax;
- cpus=<n>: the default number of CPUs or threads employed per BLAST job;
- nice=<n>: a “nice” value for altering the priority of BLAST processes.
As is standard for UNIX operating systems:
- positive nice values correspond to lower priority;
- only the root user can run at negative nice values (higher priority);
- any nice value set in /etc/sysblast is added to the current nice value of a BLAST process.
- memmax=<n>: the maximum amount of memory that may be
allocated by any single BLAST job.
The interpretation and recommended usage of memmax are:
-
memmax is expressed in units of bytes,
with optional modifiers
k (kilobytes), m (megabytes), and g (gigabytes).
-
It is almost certainly a bad idea to set memmax to a value that is
greater than the actual amount of memory (silicon RAM) installed
in the computer;
-
If memmax=0, the effective limit is “unlimited”, or the natural
upper limit for a process executing under the given operating system;
-
Values of memmax < 0 are ignored, in which case
the standard UNIX datasize resource limit set by the
user’s command shell governs BLAST memory usage instead.
The sysblast file is only effective when installed in the /etc
directory.
The /etc
directory generally resides locally to any given computer system,
so parameter settings can be tailored to each computer,
even if the BLAST software is maintained on a shared disk partition.
The /etc
directory should only be writable by “root”.
Unlike the shell script wrapper approach described above,
the limits set in /etc/sysblast typically cannot be circumvented
by normal (non-root) users of a computer system.
See the comments included in the sample sysblast file for further details.
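The rules for interpreting memmax can be summarized in a short function. This is a sketch of the rules as stated above, not of the parser inside AB-BLAST. memmax_bytes is a hypothetical helper, and treating k, m and g as multiples of 1024 is an assumption; the sample sysblast file is the authority on the exact syntax.

```shell
#!/bin/sh
# Interpret a memmax setting such as "2g", "512m", "0" or "-1".
# Assumption: k, m and g are 1024-based multipliers.
memmax_bytes() {
    v=$1
    case "$v" in
        -*) echo "ignored" ;;      # shell datasize limit governs instead
        0)  echo "unlimited" ;;    # natural upper limit of the OS
        *k) echo $(( ${v%k} * 1024 )) ;;
        *m) echo $(( ${v%m} * 1024 * 1024 )) ;;
        *g) echo $(( ${v%g} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$v" ;;           # plain number: already in bytes
    esac
}

memmax_bytes 512m   # prints: 536870912
memmax_bytes 0      # prints: unlimited
```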
Differences between AB-BLAST and WU-BLAST
Apart from bug fixes, the most outward differences in usage and appearance
of AB-BLAST and WU-BLAST include:
- The default scoring system for AB-BLASTN is
match/mismatch scores M=+1 N=−3 with gap penalties Q=7 R=2;
whereas WU-BLASTN uses
M=+5 N=−4 with gap penalties Q=10 R=10 by default.
-
In all search modes, the default value for the gapped alignment drop-off score
gapX
is ≈50% higher for AB-BLAST,
which will tend to make the AB-BLAST search programs
slightly more sensitive and just slightly slower.
-
AB-BLAST
supports an expanded amino acid alphabet,
compared to the amino acid alphabet used by WU-BLAST.
Programs in the WU-BLAST package are consequently unable to search or modify
protein sequence databases that were created or modified by AB-xdformat.
Once a protein sequence database created with WU-xdformat
has been modified by AB-xdformat,
it can no longer be searched or modified by any of the WU-BLAST programs.
Databases created by WU-xdformat can be searched
and modified by the AB-BLAST programs.
At least for the time being,
the AB-BLAST search programs can also search virtual databases
that are a combination of databases created
with WU-xdformat and AB-xdformat.
No difference currently exists between the nucleotide alphabets used
by AB-BLAST and WU-BLAST
or the ability of programs in either package
to search/modify nucleotide sequence databases created/modified
by programs in the other package.
-
The bundled BLOSUM30 and BLOSUM35 scoring matrices
have been re-scaled to provide better precision.
-
The bundled amino acid scoring matrices
— and the matrices output by the pam program —
now contain a J row and a J column.
These matrices are incompatible with WU-BLAST,
which does not support the letter J and will report a FATAL error
when reading the files.
The AB-BLAST amino acid scoring matrices are slightly different
from the matrices distributed by the NCBI,
which also indicate scores for the letter J,
but at the time of this writing the NCBI matrices are cross-compatible with AB-BLAST.
-
AB-BLAST supports the amino acid letter code O (“oh”)
normally used to represent Pyrrolysine (Pyl),
whereas WU-BLAST does not.
The letter O may appear in query sequences, database sequences, scoring matrix files
and with command line parameters such as the
altscore
option,
but the scoring matrices bundled with AB-BLAST do not actually utilize this letter.
The default score for aligning any other letter
with O is the same score as for aligning with X,
whereas the O self-alignment score defaults to zero (0).
-
The AB-BLAST 3.0 search programs support a new
compat2.0
option to obtain roughly equivalent parameter
settings to those used by WU-BLAST 2.0.
- The analog to wu-blastall is named ab-blastall.
- The analog to wu-formatdb is named ab-formatdb.
-
AB-BLAST programs preferentially use settings
of the new environment variables
ABBLASTMAT, ABBLASTDB and ABBLASTFILTER.
See the section on
Environment Variables for important details.
When upgrading from WU-BLAST to AB-BLAST,
due to the support for the letter J in AB-BLAST,
it is important to ensure that the AB-BLAST search programs
use the bundled scoring matrices
rather than the old matrices that were distributed with WU-BLAST,
because of the latter matrices’ lack of support for the letter J.
- The maximum allowable value for the
dbslice
parameter has been increased.
-
The sp2fasta
program parses input more reliably.
- Each release of AB-BLAST is generally distributed in 3 “Editions”
— Personal, Standard and Enterprise —
which differ from each other in the degree of parallelism they support,
whereas WU-BLAST was distributed in a single version.
- The first few lines of output, including the program declaration line
and copyright notice, are different.
See
Citing BLAST for examples of the program declaration line
from AB-BLAST.
-
Programs in the AB-BLAST package that use the UNIX standard
getopt() function to parse the command line
will now uniformly across all computing platforms
produce “POSIXLY_CORRECT” behavior.
(N.B. The BLAST search programs do not use getopt(),
but most other programs in the package,
including xdformat and xdget, do).
This means some command lines that are acceptable to WU-BLAST
on some computing platforms (usually Linux)
may be rejected by AB-BLAST and need to be restructured.
This can happen if all parameters and flags are not specified
before (to the left of) the required command line arguments.
-
Better thread management under energy conservation conditions.
-
Better memory management under Mac OS X.
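The POSIX ordering rule mentioned in the list above, under which options must precede the required arguments, can be demonstrated with the shell's own getopts builtin, which likewise stops parsing at the first non-option argument. The parser below is a toy: parse_args is a hypothetical function and its -n and -o options are placeholders, not a description of any AB-BLAST program's option set.

```shell
#!/bin/sh
# Toy option parser built on the POSIX getopts builtin, which stops
# at the first argument that is not an option.
parse_args() {
    OPTIND=1
    nflag=0
    out=
    while getopts "no:" opt; do
        case "$opt" in
            n) nflag=1 ;;
            o) out=$OPTARG ;;
            *) return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    echo "n=$nflag out=$out args=$*"
}

parse_args -n -o mydb mydb.fa   # prints: n=1 out=mydb args=mydb.fa
parse_args mydb.fa -n -o mydb   # prints: n=0 out= args=mydb.fa -n -o mydb
```

In the second call the options are never recognized because they follow the file name. A program that validates its arguments would typically reject such a command line, so the remedy is to move all options to the left of the required arguments.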
Citing BLAST
Citations or acknowledgments of AB-BLAST usage are greatly appreciated,
as are any personal accounts of how the software is being used
that you might wish to share.
When URLs are acceptable, please cite with:
Gish, W. (1996-2009) http://blast.advbiocomp.com
When URLs are not acceptable, please use:
Gish, W. (unpublished).
In scientific communications,
it is important to report
both the program name and the specific version used.
In the case of AB-BLAST,
the version is a combination of the version number, edition (Personal, Standard, or Enterprise),
release date,
target platform,
and build date.
The release date is the first (left-most) date displayed on the first line
of output and corresponds to the completion date of the source code.
The build date is the second date reported
and corresponds to the date and time the executables were built for the indicated target platform.
Both dates are reported in
ISO 8601 format.
For example, consider this introductory line of output
from AB-BLAST 3.0 Standard Edition:
BLASTN 3.0SE-AB [2009-05-29] [sol10-x64-ILPF64 2009-05-30T01:25:46]
Here the program name is BLASTN, the software version is “3.0SE”
from “AB” (Advanced Biocomputing, LLC),
the release date is May 29, 2009,
and the build date of the 64-bit Solaris 10 X64 binary
is May 30, 2009, at 1:25 AM.
“ILPF64” in the target platform description indicates
integers (I), long integers (L), memory pointers (P), and file pointers (F) were all compiled with 64-bit precision.
The first line of output from AB-BLAST Personal Edition
substitutes the letters “PE” for SE,
as shown in this example:
BLASTP 3.0PE-AB [2009-09-27] [linux26-x64-ILPF64 2009-09-27T18:03:31]
The first line of output from AB-BLAST Enterprise Edition
substitutes the letters “EE” for SE:
TBLASTX 3.0EE-AB [2009-09-27] [linux26-x64-ILPF64 2009-09-27T18:03:31]
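When recording versions automatically, for instance in a pipeline log, the declaration line can be split into its parts. The pattern below is inferred solely from the three example lines shown here and may need adjusting for other releases; parse_declaration is a hypothetical helper, and the -E flag assumes a sed that supports extended regular expressions.

```shell
#!/bin/sh
# Split an AB-BLAST program declaration line into labelled fields.
# The pattern is inferred from the example lines in this section.
parse_declaration() {
    printf '%s\n' "$1" | sed -n -E \
        's/^([A-Z]+) ([0-9.]+)(PE|SE|EE)-AB \[([0-9-]+)\] \[([^ ]+) ([^]]+)\]$/program=\1 version=\2 edition=\3 release=\4 platform=\5 build=\6/p'
}

parse_declaration 'BLASTN 3.0SE-AB [2009-05-29] [sol10-x64-ILPF64 2009-05-30T01:25:46]'
# prints: program=BLASTN version=3.0 edition=SE release=2009-05-29 platform=sol10-x64-ILPF64 build=2009-05-30T01:25:46
```

A line that does not match the pattern produces no output, which makes unexpected formats easy to detect.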
Historical Notes
- The original description of the
BLAST algorithm
was published by
Altschul et al. (1990).
In addition to the algorithm itself,
BLASTP and BLASTN functionality are described,
without referring to the programs by name.
BLASTX-like functionality is briefly mentioned as being in progress
(again not by name),
but TBLASTN was actually the third BLAST search mode implemented.
Statistical significance of the ungapped alignments found by the programs
was assessed using
“Karlin-Altschul” statistics
— sometimes also referred to as
“Karlin-Dembo-Altschul” statistics,
due to the major contribution of
Amir Dembo.
-
In December 1989,
prior to the development of the World Wide Web,
the NCBI
Experimental BLAST Network Service
was opened to the public.
The BLAST network service provided fast, convenient client-server access
from anywhere on the Internet
to the very latest versions of the recently parallelized BLAST search programs
running on powerful 8–16 processor Silicon Graphics servers at the NCBI.
The BLAST servers searched against a comprehensive set of public sequence databases
that were updated daily.
Users could access the BLAST servers transparently using a UN*X command line client
that was invoked just like the BLAST application programs themselves,
or via a graphical client named HyperBLAST (J.M. Cherry, 1990, unpublished)
created with
HyperCard.
At about this time, the
“nr”
(quasi-non-redundant) protein and nucleotide sequence databases
were also established (W. Gish, unpublished).
The nr database — protein and nucleotide —
quickly became the standard database searched with BLAST,
and users could often do so in a matter of just a few seconds.
The experimental BLAST service was ultimately discontinued
a decade later, in March 2000.
Experience gained from providing a service that could
arbitrate many simultaneous and diverse requests for BLAST
helped Gish design a more flexible and robust network service
architecture known as the NCBI “Dispatcher”,
which was then largely implemented by others at the NCBI
(principally Jonathan Epstein) and went into operation ca. 1995.
At the request of NCBI management,
the experimental BLAST service was never published
and remains W. Gish (unpublished).
Awareness of the service nevertheless spread quickly by word-of-mouth,
as was the case for the later WU BLAST.
-
The BLASTX program first appeared in the release of
BLAST version 1.1 in July 1990.
The program was later described and evaluated by
Gish and States (1993).
The BLAST3
program
(Altschul and Lipman, 1990) was also folded into the BLAST 1.1 release
and parallelized.
The use of Poisson statistics
to evaluate the joint probability of multiple HSPs from a given (query,subject)
sequence pair,
as had been suggested by
Karlin and Altschul (1990),
was also first featured in BLAST 1.1.
-
The BLASTC program,
a specialized version of BLASTX that considered codon usage information
in addition to sequence similarity
(States and Gish, 1994),
appeared only once,
in the
BLAST 1.3
distribution.
The BLAST 1.3 distribution was also the last to include the BLAST3
program.
-
BLAST 1.4
(W. Gish, 1994, unpublished)
was the first version to use
Karlin and Altschul (1993)
“Sum” statistics
to evaluate the joint probability of finding multiple HSPs between a given pair of sequences.
Sum statistics were found to be more practical in a biological context
than the Poisson statistics utilized by default in BLAST 1.3.
-
The TBLASTX program first appeared in BLAST 1.4
and remains attributable to W. Gish (1994, unpublished).
-
All five of the supported BLAST programs in BLAST 1.4
(BLASTP, BLASTN, TBLASTN,
BLASTX and TBLASTX)
were for the first time coded
using a standard API (application programming interface)
to a generalized BLAST function library.
This function library made maintenance and improvements to the five core programs easier
and aided the development of more specialized BLAST applications,
such as Entrez sequence neighboring tools and specialized EST analysis tools.
-
The first release of WU BLAST was numbered 1.4,
which was virtually identical to the public domain NCBI BLAST 1.4,
save for a few bug fixes.
The WU BLAST Archives (original URL http://blast.wustl.edu)
first appeared on the Internet in 1995,
to provide continuity of support for the work Warren Gish began at the NCBI,
as well as to provide a central resource where the community could find
BLAST-related software, information and earlier versions.
-
In late 1994
at the invitation of Warren Gish,
who had recently moved to Washington University in St. Louis,
Stephen Altschul and he engaged in a collaboration to test
Gish’s hypotheses that:
- Sum statistics
(Karlin and Altschul, 1993)
allowed the precise evaluation of multiple ungapped alignment scores,
using the analytically computed ungapped parameters
λu, Ku and Hu.
Extreme Value statistics
— analogous to the statistics for ungapped alignment scores published by
Karlin and Altschul (1990)
—
had been shown empirically to be good estimators
of the statistical significance of individual gapped alignments
from Smith-Waterman comparisons,
using empirical estimates for the gapped parameters
λg and Kg
(Collins and Coulson, 1990;
Mott, 1992;
Waterman and Vingron, 1994).
It stood to reason that
Sum statistics might be extended empirically to evaluating
multiple gapped alignment scores, using empirically estimated parameters
λg, Kg and Hg;
-
while good estimates for
λg, Kg and Hg could be
computed through lengthy (expensive) Monte Carlo simulations
for a specific scoring system and particular pair of sequences,
fixed estimates for these parameters precomputed
for sequences of “average” composition
would work well enough as to be of practical use
in comprehensive database searches;
and
-
for the search algorithm itself,
multiple, locally optimal gapped alignments between two sequences
could be approximated by a two-stage BLAST implementation
that would remain fast, yet be far more sensitive than ungapped BLAST,
produce more-easily interpreted alignments,
and yield alignment scores suitable for evaluation
with the expanded role proposed for Sum statistics.
If the effort panned out as hoped,
the new gapped BLAST method
would in some cases be more sensitive and selective
than even the standard Smith-Waterman algorithm,
due to the newer method’s ability to find multiple gapped alignments
between a pair of sequences
and to evaluate their significance jointly with Sum statistics.
While Altschul set to work empirically testing Sum statistics
on gapped alignment scores,
Gish focused on the alignment problem.
Early results from their work appeared in
Altschul and Gish (1996)
and provided much of the foundation for WU BLAST 2.0
and later NCBI blastall.
-
The first complete implementation of gapped BLAST
(BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX)
with statistical significance estimates (both Poisson and Sum statistics)
was publicly released as WU BLAST version 2.0d1
(W. Gish, unpublished),
in time for presentation
at the Cold Spring Harbor conference on Genome Mapping and Sequencing
in May 1996.
-
The NCBI published its BLAST version 2 or “Gapped BLAST”,
including a description of a new 2-hit ungapped BLAST algorithm
and the PSI-BLAST algorithm,
in
Altschul et al. (1997),
in September 1997.
All search modes, except BLASTN, used the new 2-hit
algorithm by default.
Within days of their publication,
a faster, more sensitive 2-hit algorithm
was deployed in WU-BLAST 2.0.
-
The NCBI published a description of PHI-BLAST in
Zhang et al. (1998).
-
In late 2008,
rights to WU-BLAST were acquired from
Washington University in St. Louis
by the author, Warren R. Gish.
The rights to license the software to the community were acquired by
Advanced Biocomputing, LLC in 2009.
References
Altschul, SF, and W Gish (1996).
Local alignment statistics.
ed. R. Doolittle.
Methods Enzymol. 266:460–80.
Altschul, SF, and DJ Lipman (1990).
Protein database searches for multiple alignments.
Proc. Natl. Acad. Sci. USA 87:5509–13.
Altschul, SF, Gish, W, Miller, W, Myers, EW, and DJ Lipman (1990).
Basic local alignment search tool.
J. Mol. Biol. 215:403–10.
Altschul, SF, Madden, TL, Schäffer, AA, Zhang, J, Zhang, Z, Miller, W, and DJ Lipman (1997).
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucl. Acids Res. 25(17):3389–402.
Claverie, JM, and DJ States (1993).
Information enhancement methods for large scale sequence analysis.
Computers in Chemistry 17:191–201.
Collins, JF, and AF Coulson (1990).
Significance of protein sequence similarities.
Methods Enzymol. 183:474–7.
Dembo, A, and S Karlin (1991).
Strong limit theorems of empirical functionals for large exceedances
of partial sums of i.i.d. variables.
Ann. Probab. 19:1737–55.
Dembo, A, and S Karlin (1992).
Limit distributions of maximal segmental score among Markov dependent
partial sums.
Adv. Appl. Probab. 24:113–40.
Gish, W, and DJ States (1993).
Identification of protein coding regions by database similarity search.
Nat. Genet. 3:266–72.
Hancock, JM, and JS Armstrong (1994).
SIMPLE34: an improved and enhanced implementation
for VAX and Sun computers of the SIMPLE algorithm
for analysis of clustered repetitive motifs in nucleotide sequences.
Comput. Appl. Biosci. 10:67–70.
Karlin, S, and SF Altschul (1990).
Methods for assessing the statistical significance of molecular sequence
features by using general scoring schemes.
Proc. Natl. Acad. Sci. USA 87:2264–8.
Karlin, S, and SF Altschul (1993).
Applications and statistics for multiple high-scoring segments
in molecular sequences.
Proc. Natl. Acad. Sci. USA 90:5873–7.
Karlin, S, Dembo, A, and T Kawabata (1990).
Statistical composition of high scoring segments from molecular sequences.
Ann. Stat. 18:571–81.
Mott, RF (1992).
Maximum-likelihood estimation of the statistical distribution of Smith-Waterman
local sequence similarity scores.
Bull. Math. Biol. 54:59–75.
Smith, TF, and MS Waterman (1981).
Identification of common molecular subsequences.
J. Mol. Biol. 147:195–7.
States, DJ, and W Gish (1994).
Combined use of sequence similarity and codon bias for coding region
identification.
J. Comp. Biol. 1:39–50.
Waterman, MS, and M Vingron (1994).
Rapid and accurate estimates of statistical significance for sequence data base searches.
Proc. Natl. Acad. Sci. USA 91:4625–8.
Wootton, JC, and S Federhen (1993).
Statistics of local complexity in amino acid sequences
and sequence databases.
Computers in Chemistry 17:149–63.
Wootton, JC, and S Federhen (1996).
Analysis of compositionally biased regions in sequence databases.
ed. R. Doolittle.
Methods Enzymol. 266:554–71.
Zhang, Z, Schäffer, AA, Miller, W, Madden, TL, Lipman, DJ, Koonin, EV,
and SF Altschul (1998).
Protein sequence similarity searches using patterns as seeds.
Nucl. Acids Res. 26:3986–90.
Last updated: 2009-11-21