CASA: Benchmark server for sequence alignment accuracy

Extraction of pairing lists

The family/superfamily pairs are not evenly distributed, either across sequence identity ranges or across families/superfamilies. Since the benchmark parameters are doubly averaged, first over all pairs within a given family/superfamily and then over all families/superfamilies, the benchmark input data set needs to be balanced among the families/superfamilies. To do this, we first compute, for each sequence identity range, the average number of pairs (AVG) that the families/superfamilies contribute to that range. For example, if M families contribute to a given sequence identity bin (say 0-4%), with the i-th family contributing N_i pairs, then

                   AVG = average(N_1, N_2, ..., N_M) = (N_1 + N_2 + ... + N_M) / M

Once AVG has been computed for every sequence identity bin, we go through the families/superfamilies in round-robin fashion, picking one pair from each family/superfamily per round. Running 0.5*AVG, AVG and 2*AVG rounds produces datasets 1, 2 and 3, respectively. These datasets are restricted to sequence identities less than or equal to 30%. Dataset 5 is extracted in the same manner, but with sequence identities going up to 100%.
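The round-robin selection within a single sequence identity bin can be sketched roughly as follows. This is only an illustration, not the server's actual code: the function name balanced_pairs, the input layout (a dictionary mapping each family/superfamily to its list of pairs in the bin), the rounding of the number of rounds up to a whole number, the alphabetical visiting order, and the rule that a family with no pairs left is simply skipped are all assumptions made for the example.

```python
import math


def balanced_pairs(pairs_by_family, factor):
    """Pick pairs round-robin from the families of one sequence identity bin.

    pairs_by_family: dict mapping a family/superfamily name to the list of
                     pairs it contributes to this bin (hypothetical layout).
    factor:          multiplier applied to AVG, e.g. 0.5, 1 or 2.
    """
    # Only families that actually contribute pairs count towards M.
    counts = [len(pairs) for pairs in pairs_by_family.values() if pairs]
    if not counts:
        return []

    avg = sum(counts) / len(counts)      # AVG = (N_1 + ... + N_M) / M
    rounds = math.ceil(factor * avg)     # assumption: round up to whole rounds

    selected = []
    for r in range(rounds):
        # One pass over all families per round, in a fixed (sorted) order.
        for family in sorted(pairs_by_family):
            pairs = pairs_by_family[family]
            if r < len(pairs):           # assumption: skip exhausted families
                selected.append(pairs[r])
    return selected


# Toy bin with three families contributing 4, 1 and 1 pairs, so AVG = 2.
toy_bin = {
    "famA": ["a1", "a2", "a3", "a4"],
    "famB": ["b1"],
    "famC": ["c1"],
}
dataset1_bin = balanced_pairs(toy_bin, 0.5)  # 1 round
dataset2_bin = balanced_pairs(toy_bin, 1)    # 2 rounds
dataset3_bin = balanced_pairs(toy_bin, 2)    # 4 rounds
```

In this toy bin, the large family famA contributes one pair out of three with a single round, and only gains weight as the number of rounds grows, which is the balancing effect the procedure aims for.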

Dataset 6 contains all the pairs we have, and dataset 4 contains all pairs with sequence identity less than or equal to 30%.

This website was designed and developed by Robel Kahsay, Ph.D.