Prediction of subcellular localization in eukaryotes at the basis of large scale genome annotation.

Pierleoni, Andrea; Martelli, Pier Luigi; Fariselli, Piero; Casadio, Rita

In this work we present an integrated platform for large-scale eukaryotic genome annotation based on the prediction of subcellular localization, GPI-anchor prediction and membrane protein discrimination into inner and outer classes. Large-scale proteomic projects have determined a huge number of amino acid sequences whose functions are, for the most part, still unknown. In eukaryotes, compartmentalization plays a major role in intracellular biochemical pathways. However, determining subcellular localization with experimental high-throughput procedures is a difficult task, and computational procedures are needed. We developed BaCelLo (1), a predictor for five classes of subcellular localization (secretory pathway, cytoplasm, nucleus, mitochondrion and chloroplast) that is based on different SVMs organized in a decision tree. The system exploits the information derived from the amino acid sequence and the evolutionary information contained in alignment profiles. It analyzes the composition of the whole sequence and the compositions of both the N- and C-termini. The training set is curated in order to avoid redundancy. For the first time, a balancing procedure is introduced in order to mitigate the effect of biased training sets. Three kingdom-specific predictors are implemented, for animals, plants and fungi, respectively. When proteins from animals and fungi are distributed into four classes, the accuracy of BaCelLo reaches 74% and 76%, respectively; a score of 67% is obtained when proteins from plants are distributed into five classes. BaCelLo outperforms the other presently available methods for the same task and gives more balanced accuracy and coverage values for each class. BaCelLo is also described in Nature Protocols, in the Bioinformatics section (2). BaCelLo can be accessed at http://www.biocomp.unibo.it/bacello/.
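As an illustration of the composition-based input described above, the following minimal Python sketch computes residue frequencies for the whole sequence and for its N- and C-terminal regions. It is not the BaCelLo implementation: the function names, the 20-residue window and the plain concatenation of the three vectors are assumptions made for this example, and the evolutionary information from alignment profiles is not modelled.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def composition_features(seq, window=20):
    """Concatenate whole-sequence, N-terminal and C-terminal compositions.

    `window` (hypothetical value, must be positive) is the number of residues
    taken from each end of the sequence.
    """
    n_terminus = seq[:window]
    c_terminus = seq[-window:]
    return composition(seq) + composition(n_terminus) + composition(c_terminus)
```

A vector of this kind (60 values per protein under these assumptions) could be passed to any standard classifier; how BaCelLo actually encodes and combines its inputs is described in reference (1).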
BaCelLo is currently being integrated into a workflow which will allow GO functional integration, prediction of GPI-anchors and discrimination between inner and outer membrane proteins. The workflow will be tested on large-scale genome annotation. With a suite of machine-learning-based methods developed in house (BaCelLo, SpepLip (3) and ENSEMBLE (4)), we have built eSLDB (eukaryotic Subcellular Localization DataBase) (5), an online database collecting the annotations of subcellular localization of eukaryotic proteomes. So far five proteomes have been processed and stored: Homo sapiens, Mus musculus, Caenorhabditis elegans, Saccharomyces cerevisiae and Arabidopsis thaliana. For each sequence, the database lists the localization obtained with three different approaches: 1) experimentally determined (when available); 2) homology based (when possible); 3) predicted. All the data are available at the website and can be searched by sequence, by protein code and/or by protein description. Furthermore, more complex searches can be performed by combining different search fields and keys. All the data contained in the database can be freely downloaded in flat file format. The database is available at: http://gpcr.biocomp.unibo.it/esldb/.
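The three kinds of annotation listed for each sequence, and the combination of search fields, can be pictured with the short Python sketch below. It does not reflect the actual eSLDB schema or query interface: the class, its field names and the `search` helper are hypothetical and serve only to show one record carrying up to three independent localization annotations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalizationEntry:
    """Hypothetical record: one protein with up to three annotations."""
    protein_code: str
    description: str
    experimental: Optional[str] = None    # present only when available
    homology_based: Optional[str] = None  # present only when possible
    predicted: Optional[str] = None       # output of the predictors

    def annotations(self):
        """Return the annotations that are present, keyed by approach."""
        candidates = {
            "experimental": self.experimental,
            "homology_based": self.homology_based,
            "predicted": self.predicted,
        }
        return {k: v for k, v in candidates.items() if v is not None}


def search(entries, protein_code=None, keyword=None):
    """Combine optional filters; an entry must satisfy every filter given."""
    results = []
    for entry in entries:
        if protein_code is not None and entry.protein_code != protein_code:
            continue
        if keyword is not None and keyword.lower() not in entry.description.lower():
            continue
        results.append(entry)
    return results
```

Keeping the three annotations in separate fields, rather than merging them into one value, lets a user see at a glance whether a localization is experimentally supported or only predicted.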

Pierleoni A., Martelli P.L., Fariselli P., Casadio R. (2007). Prediction of subcellular localization in eukaryotes at the basis of large scale genome annotation. s.l : s.n.