We tested the program BLAT, by James Kent, in order to compare this alignment package with the alternative that we are proposing. We found, however, that BLAT is not directly comparable to the weighted linear regression approach (see the explanation below). In fact, our test alignments with BLAT have convinced us that weighted linear regression naturally complements BLAT (in the same way that it complements MUMmer) when the goal is to find a contig's best candidate region of alignment to a reference genome.

We tested BLAT by aligning three long contigs (100Kb, 20Kb and 10Kb) to a 25Mb chromosome. The contigs do align to the chromosome, with a certain number of insertions, deletions and substitutions, around the region from 440Kb to 540Kb. This is real genomic data (not simulated); the contigs and the chromosome come from two different species of rice. It was not possible to tune the parameters of BLAT so that all the seed matches were extended into one single region of homology for each contig. BLAT produces all the seed matches at very high speed (even faster than MUMmer), but when the extensions are attempted, the program is not able to clump the multiple matches into one large alignment region. In this regard, the output of BLAT is indeed suitable as input to the weighted linear regression algorithm, but the two cannot be compared side by side: they work at different stages of the pipeline. We now describe in detail the results of our work with BLAT.

BLAT works in two stages: search, where regions of the two sequences that are likely to be homologous are detected, and alignment, where these regions are examined in more detail to determine whether they really constitute a genuine (non-coincidental) region of homology. BLAT starts by building an index of non-overlapping K-mers (K is set by the user; the default is 11) and their positions in the reference genome.
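The indexing step can be illustrated with a minimal Python sketch. This is not BLAT's actual implementation; the function name build_tile_index and its arguments are hypothetical, and the toy reference string is made up for illustration. It only shows what "an index of non-overlapping K-mers and their positions" means.

```python
from collections import defaultdict


def build_tile_index(reference, tile_size=11, step_size=None):
    """Map each K-mer (tile) of the reference to the positions where it starts.

    Hypothetical helper for illustration. When step_size is None it defaults
    to tile_size, so consecutive tiles do not overlap.
    """
    if step_size is None:
        step_size = tile_size
    index = defaultdict(list)
    # Walk the reference in jumps of step_size, keeping only full-length tiles.
    for pos in range(0, len(reference) - tile_size + 1, step_size):
        index[reference[pos:pos + tile_size]].append(pos)
    return index


# Toy example with K = 4: tiles start at positions 0, 4 and 8.
index = build_tile_index("ACGTACGTACGTAA", tile_size=4)
```

Because only every K-th position of the reference is indexed, the index holds roughly 1/K as many entries as there are positions, which is part of what makes the search stage fast.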
BLAT then looks up each overlapping K-mer of the query sequence in the index. In this way, BLAT builds a list of hits where the query and the target match. These hits are not necessarily perfect matches; the program allows for mismatches in the seeds through the parameter oneOff. Each hit consists of a reference position and a query position. The hit list is split into buckets of 64k each, based on the reference position. Each bucket is sorted on the diagonal (reference position minus query position). Hits that are within the gap limit (also an input to the program) and have the same diagonal coordinate are bundled together into proto-clumps. Small insertions and deletions within the homologous area are nonetheless permitted, because matches may be clumped if they are near each other on the diagonal coordinate rather than identical. James Kent notes, however, that when substitutions are allowed in finding the initial seeds, insertions and deletions cannot be accommodated. Hits within proto-clumps are then sorted along the reference coordinate and put into real clumps if they are within a window limit on the reference coordinate. Clumps with fewer than a minimum number of hits are discarded (this number is given by the user; the default is 2), and the rest are used to define regions of the reference that are homologous to the query sequence.

Clumps that are within 300 bases or 100 amino acids of each other in the reference are merged, and 500 additional bases are added on each side to form the final homologous region. Note that these numbers are fixed, so it makes sense that a single region of homology is hard to obtain when long contigs are aligned to the reference: there are seeds separated by more than 300 bases. BLAT achieves its speed by passing on to the alignment stage as few as possible of the matches found in the search stage. It does this by being very restrictive in the offsets from the diagonal that it allows.
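The search stage described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not BLAT's code: the function names are hypothetical, the bucketing by reference position is omitted, and the value of the window limit (here 1000) is an assumption, since it is not given above. The merge distance of 300 bases and the padding of 500 bases are the fixed values mentioned in the text.

```python
def find_hits(index, query, tile_size=11):
    """Look up every overlapping tile of the query; return (ref_pos, query_pos) hits."""
    return [(r, q)
            for q in range(len(query) - tile_size + 1)
            for r in index.get(query[q:q + tile_size], [])]


def split_runs(items, key, limit):
    """Sort items by key and split them wherever consecutive keys differ by more than limit."""
    runs, current = [], []
    for item in sorted(items, key=key):
        if current and key(item) - key(current[-1]) > limit:
            runs.append(current)
            current = []
        current.append(item)
    if current:
        runs.append(current)
    return runs


def homologous_regions(hits, tile_size=11, max_gap=2, window=1000,
                       min_match=2, merge_dist=300, pad=500):
    """Turn (ref_pos, query_pos) hits into padded reference regions.

    window is an assumed value; merge_dist and pad follow the fixed
    numbers quoted in the text.
    """
    clumps = []
    # Proto-clumps: hits whose diagonals (ref - query) are within max_gap.
    for proto in split_runs(hits, lambda h: h[0] - h[1], max_gap):
        # Real clumps: hits of a proto-clump that are close along the reference.
        for clump in split_runs(proto, lambda h: h[0], window):
            if len(clump) >= min_match:
                clumps.append((clump[0][0], clump[-1][0] + tile_size))
    # Merge clumps that lie within merge_dist bases of each other.
    regions = []
    for start, end in sorted(clumps):
        if regions and start - regions[-1][1] <= merge_dist:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    # Add the padding on each side.
    return [(max(0, s - pad), e + pad) for s, e in regions]


# Three hits on diagonal 1000, then two hits on diagonal 1050
# (as if 50 bases had been inserted in the reference), 567 bases further on.
hits = [(1000, 0), (1011, 11), (1022, 22), (1600, 550), (1611, 561)]
regions_default = homologous_regions(hits)                 # two separate regions
regions_relaxed = homologous_regions(hits, merge_dist=600)  # one merged region
```

In this toy case the two clumps are 567 bases apart in the reference, so the fixed 300-base merge distance keeps them apart; only a hypothetical, larger merge distance joins them into one region.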
The consequence is that, for long contigs such as the ones we experimented with, where insertions and deletions are unavoidable, BLAT cannot generate one single homology region per contig. Figure default.jpg shows the results of running BLAT with the defaults. Let us focus our analysis on the longest contig, the one that aligns from roughly 420Kb to 510Kb. As mentioned before, BLAT's output consists of a series of local alignments which, when seen in a dot plot, show a strong homology region with respect to the reference. Our goal is to tune BLAT so that the output is one single line indicating the start and end coordinates of this region.

BLAT has a parameter called stepSize, defined as the spacing between exact matches (default is 11). We assumed that increasing this parameter would allow more space between seeds and, as a consequence, cause more seeds to be clumped together. That was not the case. When the parameter is increased to 100, 200 and 300, no alignments are joined into single alignments (see stepSize100.jpg, stepSize200.jpg and stepSize300.jpg), and some very small matches start to disappear. When stepSize is 400 (stepSize400.jpg), all alignments are lost. We could not determine the meaning of this parameter from James Kent's paper, but it was clear that it cannot help with our goal.

Another parameter, minMatch, is defined as the number of seeds or matches that must exist in a clump for the matches to be joined. The default is 2, and increasing it would reduce the number of matches reported by BLAT, which is not our goal. Hence this parameter was never varied.

The parameter maxGap defines the maximum gap allowed between seeds for them to be clumped. The default is 2. When this parameter was increased to 20 (maxGap20.jpg), some of the matches started to be joined.
However, when maxGap was increased to 60 and 100 (see the corresponding figures), the difference was not substantial, and many matches remained separate. The parameter cannot be increased beyond 100, which shows that BLAT was not designed with long alignments in mind.

We also experimented with other parameters: oneOff, which permits mismatches in the initial seeds; minIdentity, which sets the minimum sequence identity as a percentage; and extendThroughN, which allows alignments to be extended through large blocks of N's. See, for example, figure maxGap100minId50extend.jpg. There is virtually no difference from the result obtained using the defaults.
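For reference, the parameter variations described above could be scripted roughly as follows. This is a sketch under assumptions: the helper blat_command and the file names chromosome.fa, contigs.fa and runN.psl are hypothetical, and the last combination is inferred from the figure name maxGap100minId50extend.jpg rather than stated explicitly. The option names themselves (stepSize, maxGap, minIdentity, extendThroughN) are the BLAT parameters discussed in the text.

```python
def blat_command(database, query, output, **options):
    """Build a BLAT argument list (hypothetical helper).

    Each option becomes -name=value; an option set to True becomes a bare -name flag.
    """
    args = ["blat", database, query]
    for name, value in options.items():
        args.append(f"-{name}" if value is True else f"-{name}={value}")
    args.append(output)
    return args


# Parameter settings mentioned in the text, starting with the defaults.
sweep = [
    {},
    {"stepSize": 100}, {"stepSize": 200}, {"stepSize": 300}, {"stepSize": 400},
    {"maxGap": 20}, {"maxGap": 60}, {"maxGap": 100},
    # Assumed from the figure name maxGap100minId50extend.jpg.
    {"maxGap": 100, "minIdentity": 50, "extendThroughN": True},
]

# File names are placeholders, not the actual data files.
commands = [
    blat_command("chromosome.fa", "contigs.fa", f"run{i}.psl", **opts)
    for i, opts in enumerate(sweep)
]

for cmd in commands:
    print(" ".join(cmd))
    # To execute, assuming blat is installed and on the PATH:
    # subprocess.run(cmd, check=True)
```

Each resulting output file can then be drawn as a dot plot and compared with the run that uses the defaults.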