QC-COMP: A server for comparison of profile hidden Markov models for protein families

Comparison approach

As the number of available protein sequences increases in a much faster pace than their structure and functions can be experimentally determined, the development of more sensitive and accurate sequence analysis algorithms is increasingly important. Use of homologies, implied by sequence similarity, remains a major basis for computationally predicting structure and functions as well as inferring evolutionary relationships among different protein families. Despite the tremendous number of methods devised, the task of detecting distant homologies, those that share sequence similarity lower than 30%, and predicting structural alignment is still quite a challenge. It is known that both the detection ability, in terms of sensitivity and specificity, and prediction accuracy of structural alignment of a method can be enhanced when incorporating the information shared by a group of proteins which typically belong in a functional family. Methods that use profile information, generally known as sequence-profile and profile-profile comparison methods depending on what is compared on each side, are more sensitive in identifying remote relationships. The profile-profile methods, that compare two protein sequences by including profile information for both of the sequences are shown to be the most powerful approaches for the purpose of remote homolog detection.

Most of these profile-profile methods start defining a scoring scheme to measure the "distance" between columns of the two profiles. The similarity between two profiles is then measured as the minimal total "distance" when aligning columns in one profile to columns in the other profile and such an alignment can be found by using the same dynamic programming technique as used in Smith-Waterman local alignment for a pair of sequences. Since a column in the profiles is a probability distribution, the "distance" between a pair of columns is interpreted as a measure of difference between probability distributions. Because there is hardly a standard scoring scheme to measure probability distribution differences, various approaches exist and that is where most of these profile-profile methods differ from one another. A recent paper by Mittelman et al [15] compared several scoring schemes, including their own scheme called COMPASS, which was reported to be the best among those being compared.

Because of the rigorous and established theory of the hidden Markov models and the models' success in profiling protein families, it is desirable to develop practical methods to compare protein families represented as hidden Markov models. A profile hidden Markov model for a protein family is a probability distribution over an infinite space of all possible amino acid strings, which assigns higher probability for the members of the family. A conceptual approach to compare two hidden Markov models M₁ and M₂ is to consider them as vectors in the infinite space of all sequences and define a metric that measures the similarity or distance between the vectors. If we choose the components of these infinite dimension vectors to be the z-scores of all sequences, the models can be then represented as

M₁ = { z¹(x₁), z¹(x₂), z¹(x₃), . . . z¹(x_inf) }
M₂ = { z²(x₁), z²(x₂), z²(x₃), . . . z²(x_inf) }

where {x₁, x₂, x₃, . . . x_inf} is an infinite set of all sequences in an arbitrarily preset order. Although represented as vectors, the Euclidean distance between vectors is not well defined in this space and is not readily computable in practice due to the vectors' infinite size. To circumvent the infinite size problem, we instead focus on comparing two vectors just by their most significant components, where the corresponding sequence receives a maximal z score. Let us assume that M₁ peaks at z¹(x_i) and M₂ peaks at z²(x_j) -- x_i and x_j are called consensus sequences for M₁ and M₂ respectively. We consider two models highly similar if their consensus sequences are identical, or if not identical, the consensus of one model shall also be very probable in the other model. Specifically, we compare profile hidden Markov models by defining a similarity score given as,

Symmetric z-score₁₂	=	-1/2 * [ M₁ . e_c₂	+	M₂ . e_c₁ ]
	=	-1/2 * [ z¹(c₂)	+	z²(c₁) ]

where e_c1 (e_c2 ) is the unit vector in the direction of consensus sequence c1 (c2) of model M₁ (M₂). One problem with using consensus sequences to compare profile hidden Markov models is that the task of finding exact consensus sequence for a given HMM is an NP-hard problem [14]. As a surrogate to the exact consensus sequence, we use a quasi consensus sequence which is generated, following the approach by HMMER[3], by calculationg the state-occupancy probability of the match and insert states of a given HMM node and emit the maximally likely residues from these states if the state occupancy probability is >= 0.5.

Using the QC-COMP server

Our server uses the SUPERFAMILY[13] database which is a hidden Markov model representation of all proteins of known structure in the superfamily level according to the SCOP[9] classification. We also created a database of consensus sequences containing the quasi-consensus sequences for all of the models and scored (using SAM[4]'s hmmscore with -sw 3 option and consider the negative of the score) this database using each model in the SUPERFAMILY database. This results in a distribution of scores for each model from which z-scores can be calculated. Thus, zⁱ(c_j) means the z-score of consensus c_j according to the distribution for model i.

At this point, the user is expected to submit a query profile hidden Markov model either in SAM or HMMER formats along with the seed sequence for the profile. The server will then generate a quasi-consensus sequence for the query HMM and use all the models in the SUPERFAMILY database to score this query consensus. The score of this query consensus by i^th model is then transformed into a z-score according to the distribution of the model, zⁱ(c_q). Then we do comparison in the other direction in which the query HMM is used to score the consensus database from which a ditribution of the query HMM is computed to transform the scores of the consensus database into z-scores, z^q(c_i). The similarity score between the query model and the i^th model is then,

Symmetric z-score_iq	=	M_i . e_{c_q}	+	M_q . e_{c_i}
	=	zⁱ(c_q)	+	z^q(c_i)

The significance of symmetric z-score is estimated using empirical fit function from false positive rate (p-value) of all-vs-all comparison of all the models in the SUPERFAMILY database.

y = a₀ + a₁.e^(-a₂.x) + a₃.e^(-a₄.x) + a₅.e^(-a₆.x)

a₀ = -0.00667;
a₁ = 0.04856;
a₂ = 0.05642;
a₃ = 0.07311;
a₄ = 0.23910;
a₅ = 6.03009;
a₆ = 1.58653;

Chi-square : 3.50398

Correlation coefficient : 0.99982

RMS per cent error : 0.06471

Theil U coefficient : 0.01196

Extracting alignment between query and hit sequences

Alignments between the seed sequences of the two profile HMMs being compared can be found by aligning quasi consensus sequences to models and use transitivity. To illustrate this, let's assume that there are two models M₁ and M₂ generated using seed sequences S₁ and S₂ and let C₁ and C₂ be quasi consensus sequences for M₁ and M₂ respectively. To get an alignment between S₁ and S₂, we first align C₁ to model M₂ using a standard sequence-model alignment procedure, e.g., Viterbi algorithm. Such alignment aligns each residue of C₁ to a match (or insert) state of model M₂ and gives a multiple alignment that includes C₁ and S₂ from which pair-wise alignment between C₁ and S₂ can extracted. We can also align C₁ to model M₁ and extract pair-wise alignment between C₁ and S₁. From these pair of pair-wise alignments (S₁-C₁ and C₁-S₂) we can induce alignment between S₁ and S₂ using C₁ as a hinge. Another alignment between S₁ and S₂ can be found in the other direction by using C₂ as a hinge. We then combine these two alignments into a final alignment according to the following two schemes. The first scheme (called "and" selection) is to take segments that are common in both alignments. The final alignment thus obtained misses out segments where the two alignments disagree. Two variations to this "and"-selection are suggested on which alignment of these uncommon segments shall be picked to patch the final alignment. In the first variation ("and_1"selection), for a region where the two alignments disagree, we choose the alignment that contains more substitution pairs, as opposed to deletion/insertion pairs. As a refinement of "and_1"selection, the second variation ("and_2" selection) calculates a BLOSUM score for each of the two alignments at an uncommon segment and picks the alignment that has a larger score.

From our benchmark experiments in assessing the accuracy of alignments, we find that use of the consensus sequences as intermediates gives less accurate alignments if the symmetric z-score between the two models is higher than 22. This suggests that the use of consensus based approach should be only when models are "significantly" distant. Therefore, our server decides wether to use consensus sequence as intermediates for creating alingmnets based on the symmetric z-score value for the hits.

References

[1]   S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25 (1997), pp. 3389-3402.

[2]   A.A. Schaffer, Y.I. Wolf, C.P. Ponting, E.V. Koonin, L. Aravind and S.F. Altschul, IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15 (1999), pp. 1000-1011

[3]   S.R. Eddy, Profile hidden Markov models. Bioinformatics 14 (1998), pp. 755-763

[4]   A. Krogh, M. Brown, I.S. Mian, K. Sjolander and D. Haussler, Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235 (1994), pp. 1501-1531

[5]   R.E. Durbin, A. Krogh, G. Mitchison and S. Eddy. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK (1999).

[6]   G. Yona and M. Levitt, Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315 (2002), pp. 1257-1275.

[7]   T.F. Smith and M.S. Waterman, Identification of common molecular subsequences. J. Mol. Biol. 147 (1981), pp. 195-197.

[8]   J.M. Sauder, J.W. Arthur and R.L. Dunbrack, Jr, Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Struct. Funct. Genet. 40 (2000), pp. 6-22

[9]   A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247 (1995), pp. 536-540.

[10]   Ruslan Sadreyev, Nick Grishin, COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. (2003) 326, pp. 317-336.

[11]   Xiaobing Shi, David J. States, Sensitive detection of distant protein relationships using hidden Markov model alignment.

[12]   Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A., Durbin, R. (1998) Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res., 26, 320-2

[13]   Julian Gough, Kevin Karplus, Richard Hughey, Cyrus Chothia, Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure, J. Mol. Biol. (2001) 313, pp. 903-919.

[14]   R.B. Lyngso, C.N.S Pedersen, H. Nielsen, Metrics and similarity measures for hidden Markov models, in: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB), 1999, pp. 178-186.

[15]   David Mittelman, Ruslan Sadreyev, and Nick Grishin, Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19(2003)1531-1539.

[16]   M. Gribskov and N. Robinson, Use of receiver operating characteristic analysis to evaluate sequence matching. Computers and Chemistry, 10(1996)25-33.

This website is designed and developed by Robel Kahsay, Ph.D.