CASA: Benchmark server for sequence alignment accuracy

Introduction

Sequence alignment programs are used routinely to detect remote homologies for the purpose of fold assignment and comparative modeling. The alignment quality of these methods is crucial for accurate structure prediction, yet there has been no systematic utility for assessing the quality of newly developed methods. Here we present a server that allows users to evaluate the alignments produced by their own sequence alignment method. The sequence alignments are judged by comparison with the corresponding structural alignments produced with the CE program [1]. The sequences to be aligned correspond to domain assignments defined in the SCOP database (version 1.48) [2] and were derived from SEQRES records in the PDB entries. We provide a list of sequences and several suggested lists of alignments to be prepared from the sequence list by the user. The server also compares the accuracy of the user's alignments with the accuracy of precompiled alignments from PSI-BLAST, ClustalW, BLAST, FASTA and FSSP. Click here to see how these alignments were compiled.

Assessment criteria

     Once the user submits sequence alignments, the server prepares a quality assessment report that can be customized by the user. The three quality metrics used are those described by Sauder, Arthur, and Dunbrack [3]. For a pair of aligned sequences, we define:

            n_A      the length of the user's alignment (in number of aligned pairs)

            n_S      the length of the corresponding structure alignment from CE

             n_C      the number of aligned pairs in common between the user's alignment and the CE alignment

The three metrics are:

            Q_M =   n_C/n_A

            Q_D =  n_C/n_S

            Q_C =  n_C/(n_S + n_A - n_C)

       The first is called the "modeler's view" since it measures how accurate the user's sequence alignment is per residue of the user's alignment. The second is called the "developer's view" since it measures how accurate the user's sequence alignment is per residue of the structure alignment. A high value of Q_M combined with a low value of Q_D indicates an accurate alignment that is shorter than is possible given the similarity of the actual protein structures. A low value of Q_M combined with a comparatively high value of Q_D indicates an alignment that is too long: it aligns dissimilar regions of the two proteins but apparently aligns the similar regions correctly. The combined accuracy parameter, Q_C, is the total number of correct matches divided by the total number of positions that are aligned in either the structural alignment or the alignment being evaluated.
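The definitions above can be illustrated with a short Python sketch. It is not the server's implementation: the function names aligned_pairs and quality are hypothetical, and the sketch assumes that both alignments are given as two gapped rows of equal length that start at the first residue of each sequence, so that residue indices are comparable between the user's alignment and the structure alignment.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows.

    Assumption: both rows start at residue 0 of their sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in sequence A and sequence B
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def quality(user_pairs, structure_pairs):
    """Compute (Q_M, Q_D, Q_C) from two sets of aligned residue pairs."""
    n_a = len(user_pairs)                    # length of the user's alignment
    n_s = len(structure_pairs)               # length of the structure alignment
    n_c = len(user_pairs & structure_pairs)  # pairs common to both
    union = n_s + n_a - n_c
    # Empty alignments are scored as 0.0 here; this is a choice of the sketch.
    q_m = n_c / n_a if n_a else 0.0
    q_d = n_c / n_s if n_s else 0.0
    q_c = n_c / union if union else 0.0
    return q_m, q_d, q_c


# Toy example with made-up sequences (not taken from the CASA data set).
user = aligned_pairs("ACDE-F", "AC-EGF")
structure = aligned_pairs("ACDEF", "ACEGF")
print(quality(user, structure))  # (0.75, 0.6, 0.5)
```

In this toy case n_A = 4, n_S = 5 and n_C = 3, so Q_M = 3/4, Q_D = 3/5 and Q_C = 3/(5 + 4 - 3) = 1/2.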

Using CASA

Follow these steps to evaluate a sequence alignment method:

1) Download a Unix archive file containing the FASTA sequences available for alignment

2) Download a list of sequence pairs to be aligned from this page

3) Make the alignments with your program

4) Make an archive file of your alignments. Unix users will need to tar and gzip their alignment files. Windows users should make sure that the alignment file names are not altered upon zipping. Whether your program uses multiple sequence alignments or pairwise alignments, each submitted alignment should contain only two sequences. If d13as_.fasta (QUERY) and d13sr_.fasta (HIT) are the two aligned sequences, the resulting alignment must be in a file named d13as_+d13sr_.xxx, where xxx is any three-letter file extension. Moreover, the alignment file must be in FASTA format. A typical alignment file in FASTA format looks like this:

> d3pmga2 A:191-303 3.73.1.1.1 Phosphoglucomutase, first 3 domains {Rabbit (Oryctolagus cuniculus)}

MLRNIFDFNALKELLSGPNRLKIRIDAMHGVVGPYVKKILCEELGAPANSAVNCVPLEDFGGHHPDPNLT YAADLVETMKSGEHDFGAAFDGDGDRNMILGKHGFF

> d3pmga3 A:304-420 3.73.1.1.1 Phosphoglucomutase, first 3 domains {Rabbit (Oryctolagus cuniculus)}

DSVAVIAANIFSIPYFQQTGVRGFARSM--PTSGALDRV-ANATKIA---LY-------------ET-PT GWKFFGNLMDASKLSLCGEESFGTGSDHIREKDGLW

The colored text in the first line of the FASTA sequence is optional. If the program under consideration does local alignments, the user can submit alignments of the fragments only. Once you have produced *.xxx alignment files for the sequence pairs downloaded, tar and gzip them all.

5) Submit your archive (*.tar.gz or *.zip) file.

6) Once you have results from the CASA server, you may publish them independently as long as you cite the paper by Sauder et al. [3] and the paper by Kahsay et al. [4], or you may hold on to them for future use.
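Steps 4 and 5 above can be sketched in Python with the standard tarfile module. This is only an illustration under stated assumptions: the helper names alignment_filename, write_alignment and make_archive are hypothetical, the extension "aln" is an arbitrary three-letter choice, and the archive is assumed to hold the alignment files without any directory prefix (the page does not say whether the server requires a flat archive).

```python
import tarfile
from pathlib import Path


def alignment_filename(query_id, hit_id, ext="aln"):
    """Build the QUERY+HIT.xxx file name, e.g. d13as_+d13sr_.aln."""
    if len(ext) != 3:
        raise ValueError("extension must be three letters")
    return f"{query_id}+{hit_id}.{ext}"


def write_alignment(directory, query_id, hit_id, query_row, hit_row, ext="aln"):
    """Write one pairwise alignment (two gapped rows) in FASTA format."""
    if len(query_row) != len(hit_row):
        raise ValueError("aligned rows must have equal length")
    path = Path(directory) / alignment_filename(query_id, hit_id, ext)
    path.write_text(f"> {query_id}\n{query_row}\n> {hit_id}\n{hit_row}\n")
    return path


def make_archive(paths, archive_path):
    """Tar and gzip the alignment files, storing them under their bare names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in paths:
            # arcname drops the directory part (assumed flat archive layout).
            tar.add(p, arcname=Path(p).name)
    return Path(archive_path)
```

For example, calling write_alignment(".", "d13as_", "d13sr_", row_q, row_h) for each sequence pair and then make_archive(paths, "alignments.tar.gz") would produce a single *.tar.gz file suitable for step 5; the archive name is likewise only a placeholder.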

We welcome feedback on this web server. Please contact Roland Dunbrack at RL_Dunbrack@fccc.edu or Robel Kahsay at kahsay@capsl.udel.edu.

Resolved issues

1) CASA now accepts alignments archived on Windows (*.zip files).

References

[1]

Shindyalov IN, Bourne PE.
Protein structure alignment by incremental combinatorial extension (CE) of the optimal path.
Protein Eng. 1998; 11:739-747.

[2]

Hubbard TJ, Ailey B, Brenner SE, Murzin AG, Chothia C.
SCOP: a structural classification of proteins database.
Nucleic Acids Res. 1999; 27:254-256.

[3]

Sauder JM, Arthur JW, Dunbrack RL Jr.
Large-scale comparison of protein sequence alignment algorithms with structure alignments.
Proteins. 2000; 40:6-22.

[4]

Kahsay R, Dongre N, Guang G, Wang G, Dunbrack RL Jr.
CASA: a server for the critical assessment of sequence alignment accuracy.
Bioinformatics. 2002; 18:496-497.

This website was designed and developed by Robel Kahsay, Ph.D.