An empirical study on the matrix-based protein representations and their combination with sequence-based approaches / Loris Nanni; Alessandra Lumini; Sheryl Brahnam. - In: AMINO ACIDS. - ISSN 0939-4451. - PRINT. - 44:4 (2013), pp. 887-901. [10.1007/s00726-012-1416-6]
An empirical study on the matrix-based protein representations and their combination with sequence-based approaches
LUMINI, ALESSANDRA;
2013
Abstract
Many domains have a stake in the development of reliable systems for automatic protein classification. Of particular interest in recent studies of automatic protein classification is the exploration of new methods for extracting features from a protein that enhance classification for specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e., classification problems). In this paper we evaluate several feature extraction approaches for representing proteins with the aim of sequence-based protein classification. Several protein representations are evaluated, starting from: the position specific scoring matrix (PSSM) of the protein; the amino-acid sequence; and a matrix representation of the protein, of dimension (length of the protein)×20, obtained by using substitution matrices to represent each amino acid as a vector. A valuable result is that a texture descriptor can be extracted from the PSSM protein representation that improves on the performance of standard descriptors based on the PSSM representation. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVMs. Although different stand-alone descriptors work well on some datasets (but not on others), we have discovered that fusion among classifiers trained using different descriptors obtains good performance across all the tested datasets. The MATLAB code and datasets used in this paper are available at bias.csr.unibo.it\nanni\PSSM.rar.
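The (length of the protein)×20 matrix representation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code. The function names and the column order are hypothetical, and the substitution matrix is a toy placeholder (1 on the diagonal, 0 elsewhere) standing in for a real substitution matrix such as BLOSUM or PAM.

```python
# The 20 standard amino acids; this column order is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_substitution_matrix(match=1, mismatch=0):
    """Placeholder 20x20 matrix (NOT BLOSUM/PAM): `match` on the diagonal, `mismatch` elsewhere."""
    return {
        a: [match if a == b else mismatch for b in AMINO_ACIDS]
        for a in AMINO_ACIDS
    }


def sequence_to_matrix(sequence, substitution):
    """Replace each residue by its 20-value substitution row: a len(sequence) x 20 matrix."""
    rows = []
    for residue in sequence.upper():
        if residue not in substitution:
            raise ValueError(f"unsupported residue: {residue!r}")
        rows.append(list(substitution[residue]))
    return rows


matrix = sequence_to_matrix("MKV", toy_substitution_matrix())
print(len(matrix), len(matrix[0]))  # 3 20
```

With a real substitution matrix in place of the placeholder, each row would hold the substitution scores of that residue against all 20 amino acids, so the protein becomes a two-dimensional array from which matrix-based descriptors can be computed.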