We tested BLAT, by James Kent, with the aim of comparing this
alignment package against the alternative that we are proposing.
However, we found that BLAT is not directly comparable to the
weighted linear regression approach (see explanation below). In fact,
the test alignments that we ran using BLAT have convinced us that
our weighted linear regression naturally complements BLAT (in the
same way that it complements MUMmer) when the goal is to find a
contig's best candidate region of alignment to a reference genome.

We tested BLAT by aligning three long contigs (100Kb, 20Kb and 10Kb)
to a 25Mb chromosome, where the contigs do align to the chromosome
(with a certain amount of insertions, deletions and substitutions)
around the region from 440Kb to 540Kb. This is real genomic data
(not simulated), where the contigs and the chromosome are from two
different species of rice. It was not possible to tune the
parameters of BLAT to extend all the seed matches into one single
region of homology for each contig. BLAT is able to produce all the
seed matches at a very high speed (even faster than MUMmer), but
when the extensions are attempted, the program is not able to clump
the multiple matches into one big alignment region. In this regard,
the output of BLAT is indeed suitable to be used as the input to the
weighted linear regression algorithm, but the two cannot be compared
side by side. They simply work at different stages of the pipeline.

We now describe our work with BLAT in detail. BLAT works in two
stages: search, where regions of the two sequences that are likely
to be homologous are detected, and alignment, where these regions
are examined in more detail to determine whether they really
constitute a region of homology that is not due to chance. BLAT
starts by
building up an index of nonoverlapping K-mers (K is input by the
user, default is 11) and their positions in the reference genome.
BLAT then looks up each overlapping K-mer of the query sequence in
the index. In this way, BLAT builds a list of hits where the query
and the target match. These hits are not necessarily perfect
matches; the program allows for mismatches in the seeds through
parameter oneOff. Each hit contains a reference position and a query
position. The hit list is split into buckets of 64k hits each, based
on the reference position. Each bucket is sorted on the diagonal
(reference minus query positions). Hits that are within the gap
limit (also an input to the program) and have the same diagonal
coordinate are bundled together into proto-clumps. However, small
insertions and deletions within the homologous area are permitted by
allowing matches to be clumped if they are near each other rather
than identical on the diagonal coordinate. Nonetheless, James Kent
argues that when substitutions are allowed in finding the initial
seeds, insertions and deletions cannot be accommodated. Hits within
proto-clumps are then sorted along the reference coordinate and put
into real clumps if they are within a window limit on that same
coordinate.
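
The search stage described above can be sketched in a few lines of
Python. This is only an illustration of the idea, not BLAT's actual
code: the function names, the toy sequences and the tile size of 4
are our own assumptions, exact matches only are considered (no
oneOff), the bucketing step is omitted, and the gap limit is applied
as a simple bound on the difference between neighbouring diagonals.

```python
from collections import defaultdict

def build_index(reference, k):
    """Index the NON-overlapping k-mers of the reference by start position."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, k):
        index[reference[pos:pos + k]].append(pos)
    return index

def find_hits(index, query, k):
    """Look up every OVERLAPPING k-mer of the query.

    Returns a list of (reference_position, query_position) hits.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], ()):
            hits.append((rpos, qpos))
    return hits

def proto_clumps(hits, max_gap):
    """Sort hits on the diagonal (reference minus query position) and
    bundle neighbours whose diagonals differ by at most max_gap.
    Simplified assumption: BLAT's own gap test is more involved."""
    ordered = sorted(hits, key=lambda h: (h[0] - h[1], h[0]))
    clumps = []
    for hit in ordered:
        diag = hit[0] - hit[1]
        if clumps:
            last = clumps[-1][-1]
            if diag - (last[0] - last[1]) <= max_gap:
                clumps[-1].append(hit)
                continue
        clumps.append([hit])
    return clumps

def split_by_window(clump, window):
    """Sort one proto-clump along the reference coordinate and split it
    wherever two consecutive hits are more than `window` apart."""
    ordered = sorted(clump)
    out = [[ordered[0]]]
    for hit in ordered[1:]:
        if hit[0] - out[-1][-1][0] <= window:
            out[-1].append(hit)
        else:
            out.append([hit])
    return out

# Toy data (hypothetical): a 24-base reference and two queries.
reference = "ACGTGCATTAGCCGATAGGCTTAA"
exact_query = "GCATTAGCCGATAGGC"        # reference[4:20], no indels
indel_query = "GCATTAGCTTTTTCGATAGGC"   # same, with a 5-base insertion

index = build_index(reference, 4)
exact_hits = find_hits(index, exact_query, 4)
indel_hits = find_hits(index, indel_query, 4)

print(len(proto_clumps(exact_hits, 2)))  # 1: all hits share one diagonal
print(len(proto_clumps(indel_hits, 2)))  # 2: the insertion shifts the diagonal by 5
print(len(proto_clumps(indel_hits, 5)))  # 1: a wider gap limit rejoins them
```

In this toy case a 5-base insertion is enough to split the hits into
two proto-clumps under a gap limit of 2, which is the behaviour we
observed at a much larger scale with the rice contigs.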

Clumps with fewer than a minimum number of hits are discarded (this
number is given by the user, default is 2), and the rest are used to
define regions of the reference which are homologous to the query
sequence. Clumps which are within 300 bases or 100 amino acids in
the reference are merged together, and five hundred additional bases
are added on each side to form the final homologous region. Note
that these numbers are fixed, so it makes sense that when long contigs
are aligned to the reference, it is hard to generate one single
region of homology: there are seeds that are separated by more than
300 bases.
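
The filtering, merging and padding steps above can be illustrated
with the following sketch. It is a simplified reading of the
description, not BLAT's implementation: the function name, the
representation of a clump as a list of (reference position, query
position) hits, and the example coordinates are all hypothetical.
Only the nucleotide case (300 bases, 500 bases of padding) is shown.

```python
def homologous_regions(clumps, k=11, min_match=2, merge_dist=300, pad=500):
    """Turn clumps of (ref_pos, query_pos) hits into padded reference regions.

    k          -- tile size, used to compute where the last hit ends
    min_match  -- clumps with fewer hits than this are discarded
    merge_dist -- clumps at most this many bases apart are merged
    pad        -- bases added on each side of every merged region
    """
    # Reference span covered by each clump that has enough hits.
    spans = sorted(
        (min(r for r, _ in clump), max(r for r, _ in clump) + k)
        for clump in clumps
        if len(clump) >= min_match
    )

    # Merge spans separated by no more than merge_dist bases.
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= merge_dist:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Pad each merged region, without running past the reference start.
    return [(max(0, start - pad), end + pad) for start, end in merged]

# Hypothetical clumps: the first two are 250 bases apart and merge; the
# third is 1600 bases further on and stays separate; the last one has a
# single hit and is discarded.
clumps = [
    [(1000, 0), (1089, 89)],
    [(1350, 340), (1389, 379)],
    [(3000, 2000), (3089, 2089)],
    [(5000, 10)],
]
print(homologous_regions(clumps))  # [(500, 1900), (2500, 3600)]
```

Because merge_dist is a fixed constant rather than a parameter in
BLAT, any pair of neighbouring clumps separated by more than 300
bases necessarily yields two distinct regions, however long the
contig is.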

BLAT accomplishes its goal of speeding up the alignment by passing
on to the alignment stage as few matches as possible from the ones
found in the search stage. This is done by being very restrictive
about offsets from the diagonal. The consequence is that, for long
contigs such as the ones we experimented with, where insertions and
deletions are unavoidable, it is impossible for BLAT to generate one
single homology region per contig.

Figure default.jpg shows the results after running BLAT with the
defaults. Let us focus our analysis on the longest contig, the one
that aligns from 420Kb to 510Kb (roughly). As mentioned before,
BLAT's output consists of a series of local alignments which, when
seen in a dot plot, show a strong homology region with respect to
the reference. Our goal is to tune BLAT so that the output is one
single line indicating the start and end coordinates of this region.

There is one parameter to BLAT, called stepSize, defined as the
spacing between exact matches (default is 11). We expected that
increasing this parameter would allow more space between seeds and,
as a consequence, clump more seeds together. That was not the case.
When the parameter is increased to 100, 200 and 300, no alignments
are joined into single alignments (see stepSize100.jpg,
stepSize200.jpg and stepSize300.jpg), and some very small matches
start to disappear. When stepSize is 400 (stepSize400.jpg), all
alignments are lost. We could not determine the meaning of this
parameter from the contents of James Kent's paper, but it was clear
that this is not an argument that can help with our goal.

Another parameter, minMatch, is defined as the number of seeds or
matches that must exist in a clump in order for the matches to be
joined. The default is 2, and increasing it would decrease the
number of matches reported by BLAT, which is not the goal. Hence
this parameter was never varied.

Parameter maxGap defines the size of the maximum gap allowed between
seeds for them to be clumped. The default is 2. When this parameter
was increased to 20 (maxGap20.jpg), some of the matches started to
be joined. However, when maxGap was increased to 60 and 100 (see
corresponding figures), the difference was not substantial, and many
matches remained separated. The parameter cannot be increased beyond
100, which suggests that BLAT was not designed with long alignments
in mind.

We also experimented with other parameters, such as oneOff, which
permits mismatches in the initial seeds; minIdentity, which sets the
minimum sequence identity as a percentage; and extendThroughN, which
allows extension of alignments through large blocks of N's. See for
example figure maxGap100minId50extend.jpg. There is virtually no
difference from the result obtained using the defaults.
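
For reference, the parameter settings discussed above can be written
down as BLAT command lines. The helper below only builds the
argument lists; it does not run BLAT. The file names chromosome.fa,
contigs.fa and the .psl outputs are placeholders of our own, and the
combination used for the last run (maxGap=100, minIdentity=50,
extendThroughN) is our reading of the figure name
maxGap100minId50extend.jpg rather than a recorded command.

```python
def blat_command(database, query, output, **options):
    """Build the argument list for one BLAT run.

    An option whose value is True becomes a bare flag (-name); any
    other value is written as -name=value, the form BLAT expects.
    """
    args = ["blat", database, query]
    for name, value in options.items():
        if value is True:
            args.append(f"-{name}")
        else:
            args.append(f"-{name}={value}")
    args.append(output)
    return args

# Placeholder input files (hypothetical names).
DATABASE = "chromosome.fa"
QUERY = "contigs.fa"

runs = {"default": {}}
for step in (100, 200, 300, 400):
    runs[f"stepSize{step}"] = {"stepSize": step}
for gap in (20, 60, 100):
    runs[f"maxGap{gap}"] = {"maxGap": gap}
runs["maxGap100minId50extend"] = {
    "maxGap": 100,
    "minIdentity": 50,
    "extendThroughN": True,
}

for label, options in runs.items():
    print(" ".join(blat_command(DATABASE, QUERY, f"{label}.psl", **options)))
```

Each printed line can be pasted into a shell, or passed as a list to
subprocess.run, once the placeholder file names are replaced with
real ones.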