Predicting gene expression level in E. coli from mRNA sequence information

Zhao, L.; Abedpour, N.; Blum, C.; Kolkhof, P.; Beller, M.; Kollmann, M.; Capriotti, E.

doi:10.1109/CIBCB.2019.8791456

The accurate characterization of the translational mechanism is crucial for enhancing our understanding of the relationship between genotype and phenotype. In particular, predicting the impact of genetic variants on gene expression will make it possible to optimize specific pathways and functions for engineering new biological systems. In this context, the development of accurate methods for predicting the translation efficiency and/or protein expression from the nucleotide sequence is a key challenge in computational biology. In this work we present PGExpress, a new regression method for predicting the log2-fold-change of the translation efficiency of an mRNA sequence in E. coli. The PGExpress algorithm takes as input 12 features corresponding to the predicted RNA secondary structure and anti-Shine-Dalgarno hybridization free energies. The method was trained on a set of 1,772 sequence variants (WT-High) of 137 essential E. coli genes. For each gene, we considered 13 sequence variants of the first 33 nucleotides, encoding the same amino acids, followed by the superfolder GFP. Each gene variant is represented by sequence blocks that include the Ribosome Binding Site (RBS), the first 33 nucleotides of the coding region (C33), the remaining part of the coding region (CC), and their combinations. Our gradient-boosting-based tool (PGExpress) was trained using a 10-fold gene-based cross-validation procedure on the WT-High dataset. In this test PGExpress achieved a correlation coefficient of 0.60, with a Root Mean Square Error (RMSE) of 1.3. When the regression task is cast as a classification problem, PGExpress reached an overall accuracy of 0.74, a Matthews correlation coefficient of 0.48, and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.81. In the regression task, PGExpress outperforms RBSCalculator in predicting the log2-fold-change of the translational efficiency and its variation on the WT-High dataset.
Finally, we validated our method by performing in-house experiments on five newly generated mRNA sequence variants. The predicted expression levels of the new variants agree with our experimental results in E. coli.
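The evaluation protocol described in the abstract (gene-based folds, then correlation, RMSE, and a Matthews correlation coefficient once the regression output is binarized) can be illustrated with a minimal standard-library sketch. This is not the PGExpress implementation. The function names, the round-robin fold assignment, and the classification threshold of 0.0 on the log2-fold-change are all assumptions made for illustration; the abstract does not state how the classes were defined.

```python
import math
import random


def gene_based_folds(genes, n_folds=10, seed=0):
    """Assign a fold index to every sample so that all variants of a
    gene land in the same fold (hypothetical round-robin assignment)."""
    unique = sorted(set(genes))
    random.Random(seed).shuffle(unique)
    fold_of_gene = {g: i % n_folds for i, g in enumerate(unique)}
    return [fold_of_gene[g] for g in genes]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mcc(y_true, y_pred, threshold=0.0):
    """Matthews correlation coefficient after binarizing both the
    measured and predicted values at `threshold` (an assumed cut-off)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        t_pos, p_pos = t > threshold, p > threshold
        if t_pos and p_pos:
            tp += 1
        elif not t_pos and not p_pos:
            tn += 1
        elif p_pos:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy layout: 20 hypothetical genes with 13 variants each.
genes = [f"gene{i // 13}" for i in range(13 * 20)]
folds = gene_based_folds(genes)
```

Splitting by gene rather than by variant keeps near-identical sequences of the same gene out of both the training and the test partition, which is the point of a gene-based cross-validation.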

Zhao L., Abedpour N., Blum C., Kolkhof P., Beller M., Kollmann M., et al. (2019). Predicting gene expression level in E. coli from mRNA sequence information. Institute of Electrical and Electronics Engineers Inc. [10.1109/CIBCB.2019.8791456].



Year: 2019
Volume title: 2019 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2019
First page: 1
Last page: 8
DOI: https://dx.doi.org/10.1109/CIBCB.2019.8791456
Publication type: 4.01 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/738254
