In this paper we evaluate the performance of machine-learning methods in predicting the bonding state of cysteines from protein sequences. This task is the first step in identifying disulfide bonds in proteins. We score the performance of three approaches: 1) Hidden Support Vector Machines (HSVM), which integrate SVM predictions with a Hidden Markov Model; 2) SVM-HMM, which discriminatively trains models that are isomorphic to a kth-order hidden Markov model; 3) Grammatical-Restrained Hidden Conditional Random Fields (GRHCRF), which we recently introduced. To evaluate present and future methods we built a new non-redundant dataset. We report performance using indices based on per-cysteine and per-protein scores. Furthermore, we evaluate two encoding schemes, based on the sequence profile and on the position-specific scoring matrix (PSSM), both computed with the PSI-BLAST program. The sequence profile is a matrix that reports the residue frequencies at each sequence position as internally scored by the PSI-BLAST program, while the PSSM modulates these frequencies with respect to the reference substitution score matrix (BLOSUM62). The evaluation is carried out with different sizes of the local cysteine environment and with different Markov models. Our results show that all the methods perform better when the evolutionary information is encoded with the PSSM than with the sequence profile. Finally, among the methods, GRHCRFs perform slightly better than the others, achieving a per-protein accuracy of 87% with a correlation coefficient of 0.73. Considering that our dataset does not contain trivial cases (proteins with only one cysteine), the accuracy achieved is among the best reported so far.
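The per-cysteine and per-protein indices mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: labels are coded 1 (bonded) and 0 (free), the correlation coefficient is taken to be the Matthews correlation coefficient, and a protein counts as correct only when every one of its cysteines is predicted correctly. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt


def per_cysteine_scores(proteins):
    """Return (accuracy, MCC) pooled over all cysteines.

    `proteins` is a list of (true_labels, predicted_labels) pairs,
    one pair per protein, with 1 = bonded and 0 = free (assumed coding).
    """
    tp = tn = fp = fn = 0
    for true, pred in proteins:
        for t, p in zip(true, pred):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 0:
                tn += 1
            elif t == 0 and p == 1:
                fp += 1
            else:
                fn += 1
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


def per_protein_accuracy(proteins):
    """Fraction of proteins whose cysteines are all predicted correctly
    (assumed definition of the per-protein score)."""
    correct = sum(1 for true, pred in proteins if list(true) == list(pred))
    return correct / len(proteins)


# Toy data (invented for illustration): three proteins, nine cysteines.
toy = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # fully correct
    ([1, 1], [1, 0]),              # one bonded cysteine missed
    ([0, 0, 0], [0, 0, 0]),        # fully correct
]
acc, mcc = per_cysteine_scores(toy)
q_prot = per_protein_accuracy(toy)
```

Under these assumptions a single wrong cysteine makes the whole protein count as an error, so the per-protein score is the stricter of the two indices.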
Savojardo C., Fariselli P., Martelli P.L., Shukla P., Casadio R. (2010). Prediction of cysteine bonding state with machine-learning methods. s.l : s.n.
Prediction of cysteine bonding state with machine-learning methods
SAVOJARDO, CASTRENSE;FARISELLI, PIERO;MARTELLI, PIER LUIGI;SHUKLA, PRIYANK;CASADIO, RITA
2010