Prediction of cysteine bonding state with machine-learning methods

SAVOJARDO, CASTRENSE; FARISELLI, PIERO; MARTELLI, PIER LUIGI; SHUKLA, PRIYANK; CASADIO, RITA
2010

Abstract

In this paper we evaluate the performance of machine-learning methods in predicting the bonding state of cysteines from protein sequences. This task is the first step towards identifying disulfide bonds in proteins. We score the performance of three approaches: 1) Hidden Support Vector Machines (HSVM), which integrate SVM predictions with a Hidden Markov Model; 2) SVM-HMM, which discriminatively trains models that are isomorphic to a kth-order hidden Markov model; 3) Grammatical-Restrained Hidden Conditional Random Fields (GRHCRF), which we recently introduced. To evaluate present (and future) methods we built a new non-redundant dataset. We report performance using indices based on per-cysteine and per-protein scores. Furthermore, we evaluate two encoding schemes, based on the sequence profile and on the position-specific scoring matrix (PSSM), as computed with the PSI-BLAST program. The evaluation is carried out with different sizes of the local cysteine environment and with different Markov models. Our results show that all methods perform better when the evolutionary information is encoded with the PSSM than with the sequence profile. Finally, among the different methods GRHCRF appears to perform slightly better than the others, achieving a per-protein accuracy of 87% with a correlation coefficient of 0.73. Considering that our dataset does not contain trivial cases (proteins with only one cysteine), the accuracy achieved is among the best reported so far.
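
As an illustration of the per-cysteine and per-protein indices mentioned above, the following minimal Python sketch computes a per-cysteine accuracy, a Matthews correlation coefficient, and a per-protein accuracy (the fraction of proteins in which every cysteine is assigned correctly). The abstract does not give the exact formulas, so these standard definitions are an assumption; the function names and the toy labels are hypothetical and are not data from the paper.

```python
from math import sqrt


def per_cysteine_scores(true_states, pred_states):
    """Return (accuracy, MCC) over all cysteines.

    Each argument is a list of proteins; each protein is a list of
    0/1 labels, one per cysteine (1 = bonded, 0 = free).
    Assumption: standard accuracy and Matthews correlation coefficient.
    """
    tp = tn = fp = fn = 0
    for t_prot, p_prot in zip(true_states, pred_states):
        for t, p in zip(t_prot, p_prot):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 0:
                tn += 1
            elif t == 0 and p == 1:
                fp += 1
            else:
                fn += 1
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


def per_protein_accuracy(true_states, pred_states):
    """Fraction of proteins whose cysteines are all predicted correctly."""
    correct = sum(
        1 for t, p in zip(true_states, pred_states) if list(t) == list(p)
    )
    return correct / len(true_states)


# Toy labels for three hypothetical proteins (not data from the paper).
true_states = [[1, 1, 0], [0, 0], [1, 1, 1, 1]]
pred_states = [[1, 1, 0], [0, 1], [1, 1, 1, 1]]

q2, mcc = per_cysteine_scores(true_states, pred_states)
qprot = per_protein_accuracy(true_states, pred_states)
print(f"per-cysteine accuracy: {q2:.3f}, MCC: {mcc:.3f}")
print(f"per-protein accuracy: {qprot:.3f}")
```

Under this definition a single wrong cysteine makes the whole protein count as an error, so the per-protein score is the stricter of the two.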
Proceedings CIBB2010, pp. 1-10
Use this identifier to cite or link to this document: http://hdl.handle.net/11585/100540