Protein structure prediction in the genomic era: annotation-facilitated remote homology detection

Piovesan, Damiano; Casadio, Rita

As a result of large sequencing projects, data banks of protein sequences and structures are growing rapidly. The number of sequences is, however, orders of magnitude larger than the number of structures known at atomic resolution, despite efforts to accelerate protein structure determination. Tools have been developed to bridge the gap between sequence and protein 3D structure, based on the notion that information can be retrieved from the databases and that knowledge-based methods can help in approaching a solution to the protein folding problem. The problem of computing a protein 3D structure from its sequence is currently classified into three levels of difficulty: easy to solve; difficult, albeit with a putative solution; and “ab initio”, and therefore very difficult. The level depends on the sequence identity between the target sequence and proteins already solved at atomic detail in the Protein Data Bank. When a template with a high level of sequence identity to the target exists, the problem can be routinely solved by assigning the atomic coordinates of the template to the target, using different optimisation procedures. However, when sequence identity falls in the twilight region (≤30% sequence identity), different heuristic procedures may help in finding putative folds for the target. The process may or may not lead to a successful solution, depending on the assumptions and strategies adopted, including alignments among predicted features. Finally, “ab initio” methods (based on first principles) are still under development and far from being useful when searching for a putative model. This work describes a recent non-hierarchical clustering procedure that was implemented with the specific purpose of fully exploiting the present knowledge in the databases of sequences, structures and functions.
This procedure largely increases the number of sequences that can be annotated by transfer of annotation in a set of 988 genomes, including Homo sapiens. When distantly related sequences from different genomes coexist in a given cluster, the procedure allows a safe transfer of annotation for both structure and function, independently of the level of sequence identity. In some specific cases of functional annotation, human sequences can be safely modelled on prokaryotic templates. In computing these models, machine learning approaches to sequence analysis help in constraining the optimal alignment of the distantly related sequences. Our analysis addresses the problem of structural transfer among distantly related proteins and allows solutions that increase by some 6000 the number of structural models available for the human proteome.
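
The three difficulty levels described above can be illustrated with a minimal sketch. It is not the clustering procedure of this work: it only shows how a target might be sorted into a modelling regime from its best template identity. The function names, the regime labels, and the identity definition (identical residues over aligned columns with no gap in either sequence) are assumptions made for illustration; the 30% bound is the twilight-region value quoted above.

```python
from typing import Optional

# Upper bound of the "twilight region" quoted in the abstract (percent identity).
TWILIGHT_MAX_IDENTITY = 30.0


def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: this is only one of several identity definitions in use.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def modelling_regime(best_identity: Optional[float]) -> str:
    """Map the best template identity to a hypothetical regime label.

    None means that no template was found at all.
    """
    if best_identity is None:
        return "ab initio"
    if best_identity <= TWILIGHT_MAX_IDENTITY:
        return "twilight"
    return "template-based"


if __name__ == "__main__":
    identity = percent_identity("ACDE-G", "ACDQFG")
    print(identity, modelling_regime(identity))
```

In this sketch, a target with no detectable template falls into the "ab initio" regime, while a target in the twilight region would be a candidate for the heuristic, annotation-facilitated procedures discussed above.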

Piovesan D., Casadio R. (2010). Protein structure prediction in the genomic era: annotation-facilitated remote homology detection. s.l : s.n.