Genome-scale analysis of human mRNA 5′ coding sequences based on an expressed sequence tag (EST) database

Piovesan, Allison; Casadei, Raffaella; Vitale, Lorenza; Facchin, Federica; Pelleri, Maria Chiara; Canaider, Silvia; Bianconi, Eva; Frabetti, Flavia; Strippoli, Pierluigi

The term "5′ end mRNA artifact" refers to the incorrect assignment of the first AUG codon in an mRNA, due to the incomplete determination of its 5′ end sequence (Casadei et al., 2003). Since the 1970s, the amino acid sequence of gene products has been routinely deduced from the nucleotide sequence of the corresponding cloned cDNA (DNA complementary to mRNA), according to the rules for recognition of the start codon (first-AUG rule, optimal sequence context) and the genetic code (Kozak, 2002). All standard methods for cDNA cloning are potentially unable to clone the 5′ region of the mRNA effectively, because reverse transcriptase may fail to extend the first-strand cDNA along the full length of the mRNA template toward its 5′ end (Sambrook, 2001). Identifying a more complete mRNA 5′ end could reveal an additional upstream AUG, in-frame with the previously determined one, thus extending the predicted amino-terminal sequence of the product and avoiding subsequent relevant errors in the experimental study of the corresponding cDNA (Casadei et al., 2003). In the last few years, the continuous incorporation of information from individual and large-scale cDNA sequencing projects, including those specifically designed to characterize the mRNA 5′ end (Carninci et al., 1996; Suzuki et al., 2000; Porcel et al., 2004), has steadily improved the completeness of mRNA reference sequences (e.g., RefSeq) and of the corresponding protein coding sequences. However, genome browsers do not appear to systematically extract useful information from the ever-increasing quantity of EST (expressed sequence tag) data. To date, EST data remain invaluable because they may provide significantly longer continuous RNA sequences than the very short fragments typically deposited in current high-throughput nucleotide sequencing databases.
We previously used individual EST-based gene model refinement by classic in silico sequence analysis to revise the mRNA sequence of 109 human chromosome 21 protein-coding genes (Casadei et al., 2003). The success of this approach encouraged us to develop software ("5'_ORF_Extender") to automate the steps previously performed manually, and to apply it to the Danio rerio (zebrafish) genome (Frabetti et al., 2007). In the present work, we present a modified strategy able to analyze the much more numerous human sequences. First, we fully revised the software algorithm to use pre-computed coordinates from UCSC-downloaded genome alignment data for RefSeqs and ESTs, together with specific UCSC-downloaded EST sequence entries. Second, we adopted an original quality filter that tests whether each EST candidate carrying sequence information potentially useful for extending a known mRNA is attributed to the same locus as that mRNA by an updated, complete and embedded version of UniGene. Lastly, we automated data summarization for an analyzed genome. Following these improvements, and after parsing more than 7 million BLAT alignments, 5'_ORF_Extender 2.0 recognized a total of 477 loci, out of the 18,665 human loci represented in the mRNA reference set, as bona fide candidates for extension. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing of the GNB2L1 (guanine nucleotide binding protein (G protein), beta polypeptide 2-like 1), QARS (glutaminyl-tRNA synthetase) and TDP2 (tyrosyl-DNA phosphodiesterase 2) cDNAs, and the consequences for functional studies of these loci are discussed. In addition, we generated a list of 20,775 human mRNAs in which the presence of an in-frame stop codon upstream of the known start codon indicates that the coding sequence is complete at its 5′ end in its current form.

References:
R. Casadei et al., mRNA 5′ region sequence incompleteness: a potential source of systematic errors in translation initiation codon assignment in human mRNAs, Gene 321 (2003) 185–193.
M. Kozak, Pushing the limits of the scanning mechanism for initiation of translation, Gene 299 (2002) 1–34.
P. Carninci et al., High-efficiency full-length cDNA cloning by biotinylated CAP trapper, Genomics 37 (1996) 327–336.
F. Frabetti et al., Systematic analysis of mRNA 5′ coding sequence incompleteness in Danio rerio: an automated EST-based approach, Biol. Direct 2 (2007) 34.
