E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants

Manfredi, Matteo;Savojardo, Castrense;Martelli, Pier Luigi;Casadio, Rita

2022

Abstract

Motivation: Massive DNA sequencing technologies are revealing a huge number of human single-nucleotide polymorphisms that occur in protein-coding regions and may change the encoded protein sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage these new encoding schemes for an efficient annotation of protein variants.

Results: E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, predicts whether the variation is related to disease. The method adopts an input encoding based entirely on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101 146 human protein single amino acid variants in 13 661 proteins, derived from public resources. When tested on a blind set comprising 10 266 variants, our method compares well with recent approaches reported in the literature for the same task, reaching a Matthews Correlation Coefficient of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets.
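The Matthews Correlation Coefficient (MCC) quoted in the abstract is a standard score for binary classifiers, computed from the four cells of a confusion matrix. The sketch below only illustrates how the score is defined. The function name and all counts are hypothetical and are not taken from the paper or its datasets.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for binary confusion-matrix counts.

    tp/tn: correctly predicted positive/negative examples
    fp/fn: wrongly predicted positive/negative examples
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to show the range of the score.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)    # every prediction correct
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)   # no better than guessing
mixed = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)     # mostly correct

print(perfect, chance, round(mixed, 3))
```

The score ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction). Unlike plain accuracy it uses all four cells of the confusion matrix, which makes it a common choice when the two classes are unbalanced.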

Year: 2022
Journal: BIOINFORMATICS
DOI: https://dx.doi.org/10.1093/bioinformatics/btac678
Citation: Manfredi, M., Savojardo, C., Martelli, P.L., Casadio, R. (2022). E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants. BIOINFORMATICS, 38(23), 5168-5174 [10.1093/bioinformatics/btac678].
All authors: Manfredi, Matteo; Savojardo, Castrense; Martelli, Pier Luigi; Casadio, Rita
Appears in types: 1.01 Journal article

Files in this item:

File	Description	Type	License	Size	Format
btac678.pdf (open access)	Manuscript	Publisher's version (PDF) / Version of Record	Open access license: Creative Commons Attribution (CC BY)	509.53 kB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/917133
