Subcellular localization using fluorescence imagery: Utilizing ensemble classification with diverse feature extraction strategies and data balancing

Tahir, Muhammad; Khan, Asifullah; Majid, Abdul; Lumini, Alessandra

doi:10.1016/j.asoc.2013.06.027

Protein subcellular localization plays a vital role in understanding how proteins behave under different circumstances, and successfully predicting protein locations can help assess the effectiveness of various drugs. It is therefore important to develop a prediction system that decides protein localization reliably and accurately. However, the main problem in developing a reliable, high-throughput prediction system is imbalanced data, which greatly degrades prediction performance. To remedy this problem, we oversample the data with the Synthetic Minority Oversampling Technique (SMOTE). Further, different feature extraction strategies and ensemble classification techniques are assessed for their contribution toward solving the challenging problem of subcellular localization. After applying SMOTE data balancing, a remarkable improvement is observed in the performance of the random forest and rotation forest ensemble classifiers on the CHOM, CHOA and VeroA datasets. It is anticipated that the proposed model may be helpful to the research community in functional and structural proteomics as well as in drug discovery.
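The SMOTE oversampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `balance`, the default `k=5`, and the choice to pick base samples at random (rather than iterating over every minority sample) are assumptions made for illustration. Feature extraction and the random forest and rotation forest classifiers are not shown.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample every smaller class up to the size of the largest class."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(list(features))
    target = max(len(samples) for samples in by_class.values())
    X_out, y_out = [list(f) for f in X], list(y)
    for label, samples in by_class.items():
        missing = target - len(samples)
        if missing > 0:
            X_out.extend(smote(samples, missing, k=k, seed=seed))
            y_out.extend([label] * missing)
    return X_out, y_out


# Toy data: class 0 has four samples, class 1 only two.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X, y, k=1)
```

In this sketch, balancing would be applied to the training split only, so that synthetic samples never leak into the data used to evaluate a classifier.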

Muhammad Tahir, Asifullah Khan, Abdul Majid, Alessandra Lumini (2013). Subcellular localization using fluorescence imagery: Utilizing ensemble classification with diverse feature extraction strategies and data balancing. APPLIED SOFT COMPUTING, 13, 4231-4243 [10.1016/j.asoc.2013.06.027].