             Extraction of pairing lists

            

          The family/superfamily pairs are unevenly distributed, both across sequence identity ranges and across families/superfamilies. Since the benchmark parameters are doubly averaged, first over all pairs within a given family/superfamily and then over all families/superfamilies, the benchmark input data set needs to be balanced among the families/superfamilies. To do this, we first compute the average number of pairs (AVG) that the families/superfamilies contribute to a given sequence identity range. For example, if M families contribute to a given sequence identity bin (say 0-4%), with the ith family contributing Ni pairs, then

                   AVG = average(N1, N2, ..., NM) = (N1 + N2 + ... + NM) / M
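The AVG computation can be illustrated with a minimal Python sketch. It is not the code used to build the benchmark. The 5-point bin width (0-4%, 5-9%, ...), the input layout of (family, identity) tuples, and the names `bin_index` and `average_pairs_per_bin` are all assumptions made for illustration.

```python
from collections import defaultdict

BIN_WIDTH = 5  # assumed bin width in percentage points (0-4%, 5-9%, ...)


def bin_index(identity_percent):
    """Map a sequence identity (0-100) to a bin index.

    Assumption: 100% identity is folded into the last bin.
    """
    last_bin = 100 // BIN_WIDTH - 1
    return min(int(identity_percent // BIN_WIDTH), last_bin)


def average_pairs_per_bin(pairs):
    """Return {bin index: AVG} for an iterable of (family, identity) pairs.

    For each bin, AVG is the mean of N_i over the M families that
    contribute at least one pair to that bin.
    """
    counts = defaultdict(lambda: defaultdict(int))  # bin -> family -> N_i
    for family, identity in pairs:
        counts[bin_index(identity)][family] += 1
    return {
        b: sum(per_family.values()) / len(per_family)
        for b, per_family in counts.items()
    }


# Hypothetical example: family A contributes 3 pairs and family B 1 pair
# to the 0-4% bin, so AVG for that bin is (3 + 1) / 2.
example_pairs = [("A", 1.0), ("A", 2.0), ("A", 3.5), ("B", 4.0), ("B", 12.0)]
example_avg = average_pairs_per_bin(example_pairs)
```

Only families that actually contribute to a bin are counted in M here, which matches the wording of the definition; a family with no pairs in a bin does not pull that bin's average down.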

Once AVG has been computed for every sequence identity bin, we pick pairs in round-robin fashion, taking one pair from every family/superfamily in each round. Running 0.5*AVG, AVG and 2*AVG rounds produces datasets 1, 2 and 3, respectively. These datasets are restricted to sequence identities less than or equal to 30%. Dataset 5 is extracted in the same manner, with sequence identities up to 100%.
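The round-robin selection for a single sequence identity bin can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the original extraction code: pairs are taken in their stored order, a family that has run out of pairs is skipped in later rounds, and a fractional number of rounds is rounded to the nearest integer with a minimum of one. The names `round_robin_select` and `build_dataset` are hypothetical.

```python
def round_robin_select(pairs_by_family, rounds):
    """Pick one pair per family per round, for the given number of rounds.

    Assumption: a family with no pairs left is skipped in that round.
    """
    selected = []
    for r in range(rounds):
        for family in sorted(pairs_by_family):
            family_pairs = pairs_by_family[family]
            if r < len(family_pairs):
                selected.append(family_pairs[r])
    return selected


def build_dataset(pairs_by_family, scale):
    """Select pairs for one bin using scale * AVG rounds.

    scale is 0.5, 1.0 or 2.0 for datasets 1, 2 and 3 respectively.
    Assumption: the number of rounds is rounded, and is at least 1.
    """
    avg = sum(len(p) for p in pairs_by_family.values()) / len(pairs_by_family)
    rounds = max(1, round(scale * avg))
    return round_robin_select(pairs_by_family, rounds)


# Hypothetical bin with three families contributing 6, 2 and 4 pairs (AVG = 4).
example_bin = {
    "A": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
}
dataset_1 = build_dataset(example_bin, 0.5)  # 2 rounds
dataset_2 = build_dataset(example_bin, 1.0)  # 4 rounds
dataset_3 = build_dataset(example_bin, 2.0)  # 8 rounds
```

Capping the number of rounds limits how many pairs a large family can contribute, so no single family dominates a bin, while small families still contribute everything they have.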

         Dataset 6 contains all available pairs, and dataset 4 contains all pairs with sequence identity less than or equal to 30%.

            

This website was designed and developed by Robel Kahsay, Ph.D.