Protein Classification and Functional Prediction

**Problem **

Proteins are linear seqeuences of amino acids that perform essential functions in organisms. To determine the function of a protein, expensive experimental procedures are required. With better sequencing techniques, it becomes evident that computational prediction/classfication of an unknown protein’s function will help drive the labor and cost down. .

Protein remote homology detection involves detecting functionally related proteins with very little sequence identity. The task is challanging due to two major reasons: first, the length of the sequences vary and second, remotely homologous proteins need not share great sequence identity. Some share sequence identity as little as 30%.

**Motivation**

Previous studies [1,2,3] show that the following features are sufficient to describe superfamilies in the Immunoglobulin-like beta-sandwich fold:

• a set of critical positions, and

• the preferred amino acid residues in these positions, and

• the distance between each neighboring pair of critical positions.

Our goal is to find computational methods, with good classification performance, that are able to recover these critical positions as well as the preferred residues.

**Our approaches **

We attempt to recover these critical positions by employing different approaches. In our earlier work, we used a generative model; At present, we use a discriminative model.

**The generative approach **

We employ a class of models, known as duration-explicit hidden Markov models, in an attempt to recover the critical positions. We call our models sparse profile hidden markov models and show the structure in Fig. 1. The structure of our model resembles that of a profile hidden Markov model. However, the crucial difference between our model and the profile hidden Markov model is that, while the number of symbols an insertion state (diamond shape) emits is represented by a geometric distribution, in our model, the number of symbols an insertion state emits can be explicitly modeled by any arbitrary distribution. Moreover, since we only care about the critical positions, we use the background distribution to represent the emission probabilities in the insertion states in order to express our indifference in the residue composition between critical positions. Our match states attempt to capture the critical positions.

Our approach of focusing on the critical positions and modeling the emission probabilities of the insertion states as background probabilities results in extremely sparse models. Our publication [5] indicates that approximately m/4 critical positions are sufficient to describe a superfamily with m positions. Fig. 2 and 3 show the performance, in terms of recall (red curve) and precision (blue curves) of the sparse profile HMM on two different superfamilies: Beta-Galactosidase and Cadherin, as a function of number of critical positions (horizontal axis). It can be seen that the performance of the sparse profile HMM starts showing severe degradation once the number of critical positions falls lower than 25%.

Our main contribution to this project is that our results are consistent with previous established studies [1,2,3], that a set of critical positions, the preferred symbols on these positions and the distance between each neighboring positions are sufficient to describe the superfamilies in the Immunoglobulin-like beta-sandwich fold. Moreover, our models are extremely sparse and the main advantage of a sparse model is that fewer parameters need to be estimated and thus the model is less prone to overfitting. Our results indicate that a 1:4 compression ratio can be achieved.

**The discriminative approach **

Though in most problem domains a generative model performs quite well, it does come with some disadvantages. Denote X as the sequence (example) and Y as the label of the sequence. First, one needs to assume a parametric form P(X) and when the parametric assumption is invalid, one risks injecting bias into the model. Second, building a generative model for a superfamily requires only positive sequences, sequences that are known to belong to this superfamily; no negative sequences participated in the process of estimation. As a result, if one picks the important features indicated by the built generative model, one cannot guarantee that these features do not appear in the negative examples.

Discriminative models do not suffer from the shortcomings listed above. First, no parametric form of P(X) is required. Second, it focuses on building the decision boundary between the positive and negative classes (meaing both positive and negative examples participate in the estimation process) and thus it guarantees that the important features only appear in one class but not the other.

However, use of discriminative models for protein remote homology detection presents a big challange: proteins are of variable length but the discriminative model requires fixed-length features. To tackle the problem, Jaakkola et al. [6,7] proposed to use a hyprid model: a generative model, trained using positive sequences only acts as a feature extractor, extracting features from positive and negative sequences and the features are fed to to a discriminative model to perform classification. The features are the gradient of the log likelihood of the sequence with respect to the model parameters. On the other hand, Leslie et al. [8,9] proposed to represent each sequence by the frequency of all k-long substrings in a |\Sigma|^k-dimensional spectrum, where \Sigma represents the alphabet set, and use these high dimensional features to perform classification.

We follow the same formalism proposed by Jaakkola et al. In our hybrid setting, we use a profile HMM, a generative model, as the feature extractor and the discriminative model we use is the logistic regression model. The generative models are estimated from positive sequences only and the features that we propose to use are the sufficient statistics of the symbols in each match states. These features we propose attempt to capture the critical positions and the corresponding preferred symbols. Finally, we place a Normal or Laplacian prior centered at 0, on the parameter of the logistic regression model to enforce sparsity.

We compare our model and features with some state-of-the-art algorithms in Fig. 4 and 5. We use the ROC-50 scores to evaluate the methods. The ROC-50 score is the normalized area under the ROC curve up to the first 50 false positives. Such criteria has been reported to be more effective when comparing methods on highly imbalanced test sets, in which examples in one class, in our case, the negative examples, are far greater than the examples in the other class. In Fig. 4, the horizontal axis corresponds to the ROC-50 score and the vertical access denotes the number of experiments, out of 54, achieving such or better ROC-50 score. The subfigure on the right panel is a more detailed view in the area of high ROC50 score. The figures indicate that both logistic regression models (with Normal and Laplacian priors) achieve comparable performance with the mismatch(5,1) kernel. However, we noted that, out of 54 families, there are 480 features on average. The logistic regression model with Laplacian prior selects 43 features per family on average, which shows more than 90% reduction of features.

In Fig. 5, we show the pairwise performance comparison on different methods. In the left panel, we compare the mismatch(5,1) kernel (vertical axis) with the logistic regression model with Normal prior. Points falling above the diagonal correspond to an experiment in which the logistic model with Normal prior performs better than the mismatch(5,1) kernel. There are 30 and 22 points above and below the diagonal, respectively. The two-sided p-value of the sign test is 0.33, which indicates that the performance of the two models are comparable. In the left panel, we compare the effect of different priors on the logistic regression model. The priors are Normal prior (horizontal axis) and Laplacian prior (vertical axis). There are 25 and 26 points above and below the diagonal, respectively. The two-sided p-value of the sign test is approximately 1, suggesting that although the Laplacian prior only selects 10% of the features, this sparse set of features play important role in the classification task.

We show the recovered critical positions and the preferred symbols in Fig. 6 and 7. The model (table) shown in Fig. 6 corresponds to the scorpion toxin-like superfamily and the tested family is Plant defensins. Out of 188 features, the logistic regression model with Laplacian prior picks 19 features around 12 positions. The ROC50 score for this experiment is 1. We also show the corresponding HMM-logo obtained from PFAM. The height of each symbol at each position corresponds to its importance suggested by the profile HMM, estimated from positive sequences only. The leading numbers in the table denotes the positions in our hmm while the number in parentheses denotes the positions in the HMM-logo. The slight variation occurs because we use a slightly different huristic to determine the structure of the profile HMM. We can see that the HMM-logo from PFAM suggests that 8 positions with Cysteine (C) residues play important role in this superfamily. However, only 5 such positions (circled in red) are captured by our model. Upon investigation, it becomes clear that the reason that the other three positions with Cysteine residues are not selected because such alignment has also been observed in the negative sequences. The model shown in Fig. 7 corresponds to the same superfamily, with a different tested family. The logistic regression model with Laplacian prior selects 14 out of 116 features and the ROC50 score of this experiment is .91. In this experiment, all 6 positions with Cysteine residues indicated by the HMM-logo are captured by our model, suggesting that such pattern is characteristic and unique to this family.

Our main contribution to this project is that we present a set of features that are biologically intuitive and, when combined with the model we choose, yields interpretable results. The performance of our feature and model is comparable to the state-of-the-art algorithms, SVM-Fisher and string kernels, but our features and model offers biologically intuitive interpretation. With the use of a sparsity-enforcing prior such as Laplace, our models produces consistent results suggested in [1,2,3], that a sparse set of positions and the preferred residues might be sufficient to describe a superfamily, and more importantly, our model shows promising results to recover such positions and preferred residues.

**References**

- A. E. Kister, A. V. Finkelstein and I. M. Gelfand. Common features in structures and sequences of sandwich-like proteins. PNAS, vol. 99, no. 22, pp. 14137-14141,2002.
- B. Reva, A. Kister, S. Topiol, and I. Gelfand. Determining the roles of different chain fragments in recognition of immunoglobulin fold. Protein Eng., vol. 15, no. 1, pp. 13-19, 2002.
- A. E. Kister, M. A. Roytberg, C. Chothia, J. M. Vasiliev, and I. M. Gelfand. The sequence determinants of cadherin molecules. Protein Sci., vol. 10, no. 9, pp. 1801-1810, 2001.
- S. Eddy. Profile hidden Markov models. Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.
- P. Huang and V. Pavlovic. Protein Homology Detection using Sparse Profile Hidden Markov Models (Poster). ISMB 2006, Detroit, MI.
- T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher Kernel method to detect remote protein homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1999, pp. 149-158.
- T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, vol. 7, no. 1-2, 2000, pp. 95-114.
- C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for svm protein classification. Pacific Symposium on Biocomputing, 2002, pp. 566-575.
- C. S. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for svm protein classification. NIPS, 2002, pp. 1417-1424.