Genetic programming for creating Chou's pseudoamino acid based features for submitochondria localization

NANNI, LORIS;LUMINI, ALESSANDRA
2008

Abstract

Given a novel protein, it is important to know whether it is a DNA-binding protein, since DNA-binding proteins play a fundamental role in the regulation of gene expression. In this work, we propose a parallel fusion between a classifier trained on features extracted from the gene ontology database and a classifier trained on the dipeptide composition of the protein. The classifiers used are the support vector machine (SVM) and the 1-nearest neighbour. The Matthews correlation coefficient obtained by our fusion method is ≈0.97 under jackknife cross-validation. This outperforms the best result reported in the literature on the same dataset (0.924), which was obtained by an SVM trained only on Chou's pseudo amino acid based features. We also report the area under the ROC curve (AUC): the fusion reaches an AUC of 0.995. In particular, we stress that our fusion obtains a 5% false negative rate with a 0% false positive rate. The Matthews correlation coefficient obtained using the single best GO number is only 0.7211, hence the gene ontology database cannot be used as a simple lookup table. Finally, we test the complementarity of the two feature extraction methods using the Q-statistic. We obtain a value of 0.58, which means that the features extracted from the gene ontology database and the features extracted from the amino acid sequence are partially independent, and that their parallel fusion deserves further study.
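As an illustration of three quantities named in the abstract, the sketch below computes a dipeptide composition vector, the Matthews correlation coefficient, and the pairwise Q-statistic (Yule's Q) between two classifiers, using their standard textbook definitions. It is not the authors' implementation: the function names, the 20-letter amino acid alphabet, and the choice to skip pairs containing non-standard residues are all assumptions made for this example.

```python
from itertools import product
from math import sqrt

# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def dipeptide_composition(sequence):
    """Return the 400 relative frequencies of adjacent residue pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def matthews_cc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def q_statistic(y_true, pred_a, pred_b):
    """Yule's Q between two classifiers, computed from their correct/wrong agreement."""
    n11 = n00 = n10 = n01 = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == t), (b == t)
        if a_ok and b_ok:
            n11 += 1  # both correct
        elif not a_ok and not b_ok:
            n00 += 1  # both wrong
        elif a_ok:
            n10 += 1  # only the first classifier is correct
        else:
            n01 += 1  # only the second classifier is correct
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0
```

Q ranges from -1 to 1. It is 0 when the two classifiers' errors are statistically independent and 1 when they err on exactly the same patterns, so lower values indicate more complementary classifiers.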
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/63179