The incomplete determination of the mRNA 5' end sequence may lead to the incorrect assignment of the first AUG codon and to errors in the prediction of the encoded protein product. Due to the significance of the mouse as a model organism in biomedical research, we performed a systematic identification of coding regions at the 5' end of all known mouse mRNAs, using an automated expressed sequence tag (EST)-based approach which we have previously described. By parsing almost 4 million BLAT alignments we found 351 mouse loci, out of 20,221 analyzed, in which an extension of the mRNA 5' coding region was identified. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing for Apc2 and Mknk2 cDNAs. We also generated a list of 16,330 mouse mRNAs where the presence of an in-frame stop codon upstream of the known start codon indicates completeness of the coding sequence at 5' end in the current form. Systematic searches in the main mouse genome databases and genome browsers showed that 82 % of our results are original and have not been identified by their annotation pipelines. Moreover, the same information is not easily derivable from RNA-Seq data, due to short sequence length and laboriousness in building full-length transcript structures. In conclusion, our results improve the determination of full-length 5' coding sequences and might be useful in order to reduce errors when studying mouse gene structure and function in biomedical research.
Allison Piovesan, Maria Caracausi, Maria Chiara Pelleri, Lorenza Vitale, Silvia Martini, Chiara Bassani, et al. (2014). Improving mRNA 5′ coding sequence determination in the mouse genome. MAMMALIAN GENOME, 25, 149-159 [10.1007/s00335-013-9498-3].
Improving mRNA 5′ coding sequence determination in the mouse genome
PIOVESAN, ALLISON;CARACAUSI, MARIA;PELLERI, MARIA CHIARA;VITALE, LORENZA;MARTINI, SILVIA;CASADEI, RAFFAELLA;STRIPPOLI, PIERLUIGI
2014
Abstract
The incomplete determination of the mRNA 5' end sequence may lead to the incorrect assignment of the first AUG codon and to errors in the prediction of the encoded protein product. Due to the significance of the mouse as a model organism in biomedical research, we performed a systematic identification of coding regions at the 5' end of all known mouse mRNAs, using an automated expressed sequence tag (EST)-based approach which we have previously described. By parsing almost 4 million BLAT alignments we found 351 mouse loci, out of 20,221 analyzed, in which an extension of the mRNA 5' coding region was identified. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing for Apc2 and Mknk2 cDNAs. We also generated a list of 16,330 mouse mRNAs where the presence of an in-frame stop codon upstream of the known start codon indicates completeness of the coding sequence at 5' end in the current form. Systematic searches in the main mouse genome databases and genome browsers showed that 82 % of our results are original and have not been identified by their annotation pipelines. Moreover, the same information is not easily derivable from RNA-Seq data, due to short sequence length and laboriousness in building full-length transcript structures. In conclusion, our results improve the determination of full-length 5' coding sequences and might be useful in order to reduce errors when studying mouse gene structure and function in biomedical research.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.