Selective pressure at the codon level improves the prediction of disease related protein mutations in human
CAPRIOTTI, EMIDIO; CASADIO, RITA
2008
Abstract
Predicting the functional impact of protein variation is one of the most challenging problems in bioinformatics, with direct implications for biomedicine. A rapidly growing number of genome-scale studies provide large amounts of experimental data, allowing rigorous statistical approaches to be applied to predict whether a given single point mutation affects human health. Until now, existing methods have limited their source data to either protein or gene information. In this work we take advantage of both, focusing on protein evolutionary information by using estimated selective pressures at the codon level. We introduce a new method called SeqProfCod (an acronym for sequence, profile and codon information) to predict the likelihood that a given protein variant is associated with human disease. We also demonstrate that the majority of human mutations associated with disease are under strong purifying selection (ω < 0.1). Our method therefore relies on three sources of information: protein sequence, multiple protein sequence alignments and the estimated selective pressure at the codon level. SeqProfCod has been benchmarked on a large dataset of 8,987 single point mutations from 1,434 human proteins in SWISS-PROT. It achieves 82% overall accuracy and a correlation coefficient of 0.59, demonstrating the synergistic effect of the three sources of information. The results of a large-scale application of SeqProfCod to all annotated point mutations in SWISS-PROT, available for download at http://bioinfo.cipf.es/sgu/services/SeqProfCod/, could be used to support clinical studies.
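The two quantitative notions used in the abstract, the selective pressure ω and the benchmark scores, can be illustrated with a minimal Python sketch. It is not the SeqProfCod implementation. It assumes that ω is the usual ratio of nonsynonymous to synonymous substitution rates (dN/dS) and that the reported correlation coefficient is the Matthews correlation coefficient; neither is stated explicitly in the abstract. The function names `selection_regime` and `accuracy_and_mcc` are hypothetical, and the counts in the usage note are invented for illustration, not taken from the paper.

```python
from math import sqrt


def selection_regime(omega: float, strong_cutoff: float = 0.1) -> str:
    """Label a codon by its estimated omega (assumed here to be dN/dS).

    The 0.1 cutoff for strong purifying selection comes from the abstract;
    the remaining labels follow the conventional reading of dN/dS.
    """
    if omega < 0:
        raise ValueError("omega must be non-negative")
    if omega < strong_cutoff:
        return "strong purifying"
    if omega < 1.0:
        return "purifying"
    if omega == 1.0:
        return "neutral"
    return "positive"


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float]:
    """Overall accuracy and Matthews correlation from a 2x2 confusion matrix.

    Assumption: the abstract's "correlation coefficient" is the Matthews
    correlation coefficient, which is common for binary predictors.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, the coefficient is set to 0 when any marginal is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc
```

With invented counts, `accuracy_and_mcc(40, 40, 10, 10)` gives an accuracy of 0.8 and a correlation of 0.6, and `selection_regime(0.05)` returns `"strong purifying"`.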