Demurtas, P., Bertozzi, J., Di Silvestro, I., Carlin, K., Ghetti, A., Krause, B., et al. (2025). Gene selection for prediction of transcriptome signal based on a machine learning approach. Discover Applied Sciences, 7(11) [10.1007/s42452-025-07841-1].
Gene selection for prediction of transcriptome signal based on a machine learning approach
Perini, Giovanni; Zanchetta, Ferdinando
2025
Abstract
Background: In recent years, RNA-seq technology has gained widespread use in diverse research and clinical applications. Alongside this expansion, machine learning techniques have enabled accurate reconstruction of full transcriptomic signals from a considerably reduced set of highly informative genes (e.g., S1500+). Results: We employ machine learning methods, specifically XGBoost (eXtreme Gradient Boosting), a decision-tree-based approach, to perform RNA-seq and transcriptomic analyses across multiple tissues. Our goal is to identify a small subset of expressed genes that can capture the complete tissue-specific transcriptomic profile. Using public GTEx (Genotype-Tissue Expression) data, we analyze each tissue separately and find that the top 500 genes per tissue (ranked by XGBoost feature importance) are sufficient to provide transcriptomic signatures that match the performance of state-of-the-art gene sets (e.g., S1500+). To further validate our approach, we apply it to neuronal tissues by comparing samples from individuals with neuropathic pain versus those without pain. In dorsal root ganglia (DRG) RNA-seq data from patients experiencing varying levels of pain, our method suggests EGR1 as a factor in radicular/neuropathic pain, thereby opening avenues for the development of therapies that may alleviate pain by targeting the EGR1 pathway. Conclusions: We demonstrate how to train and apply the XGBoost algorithm to select a small gene set that can approximate the full transcriptomic signal with varying accuracy depending on tissue type, generally achieving performance comparable to S1500+ gene sets in GTEx data. This method focuses on one tissue at a time: a different list of genes is selected for each tissue and ranked according to each gene's importance in reconstructing the transcriptomic signal. This ranking also aids in highlighting specific genes that may be critical in predicting tissue-specific pathologies.
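The ranking-and-selection idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline and does not use the XGBoost library: it swaps in a small pure-Python gradient boosting of one-split trees (stumps) with gain-based feature importance, run on synthetic data. The gene names, the target signal, the helper functions, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def best_stump(X, residuals):
    """Return (gain, feature, threshold, left_mean, right_mean) for the
    single split that most reduces the squared error of the residuals."""
    n = len(X)
    total = sum(residuals)
    best = (0.0, None, None, 0.0, 0.0)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(n - 1):
            i, nxt = order[pos], order[pos + 1]
            left_sum += residuals[i]
            if X[i][j] == X[nxt][j]:
                continue  # cannot split between identical values
            n_left, n_right = pos + 1, n - pos - 1
            right_sum = total - left_sum
            # Reduction in sum of squared errors obtained by this split.
            gain = (left_sum ** 2 / n_left + right_sum ** 2 / n_right
                    - total ** 2 / n)
            if gain > best[0]:
                threshold = (X[i][j] + X[nxt][j]) / 2
                best = (gain, j, threshold,
                        left_sum / n_left, right_sum / n_right)
    return best


def boosted_importance(X, y, n_rounds=30, learning_rate=0.3):
    """Fit boosted stumps to y and accumulate the split gain per feature."""
    pred = [sum(y) / len(y)] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        gain, j, thr, left_mean, right_mean = best_stump(X, residuals)
        if j is None:
            break  # no split improves the fit any further
        importance[j] += gain
        pred = [p + learning_rate * (left_mean if row[j] <= thr else right_mean)
                for p, row in zip(pred, X)]
    return importance


def top_k_genes(importance, gene_names, k):
    """Rank genes by accumulated gain and keep the k most important."""
    ranked = sorted(zip(gene_names, importance),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Synthetic example (assumption): 200 samples, 6 hypothetical genes, and a
# target signal that depends only on gene_0 and gene_3 plus small noise.
rng = random.Random(0)
gene_names = [f"gene_{i}" for i in range(6)]
X = [[rng.gauss(0, 1) for _ in gene_names] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0, 0.1) for row in X]

importance = boosted_importance(X, y)
selected = top_k_genes(importance, gene_names, k=2)
print(selected)
```

In a per-tissue analysis of the kind the abstract describes, `k` would be 500 and the ranking would come from the trained XGBoost model's feature importances rather than from this simplified stand-in.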


