Models that map DNA and protein sequences into deep embeddings have been recently developed. While their ability to improve prediction in downstream tasks has been demonstrated, clear advantages and disadvantages of embedding types, and different means of applying them, are not yet available. In this paper we compare five different models (one for DNA, four for proteins) and different embedding aggregation methods with respect to their ability to preserve evolutionary and functional information, using a hierarchical tree approach. Specifically, we introduce a novel procedure that builds hierarchical clustering trees to assess the relative position of sequences in the embedding latent space, compared to the phylogenetic and functional similarities between sequences. The methods are benchmarked on five different datasets from various organisms. The ESM protein language model and DNABert emerge as best performers in different settings.
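
The tree-based comparison described above can be illustrated with a minimal sketch. The abstract does not specify the aggregation method, distance, linkage, or agreement score, so everything here is an assumption chosen for illustration: mean pooling of per-token vectors, Euclidean distance, average-linkage (UPGMA) clustering, and the Pearson correlation between the cophenetic distances of the two trees. All function names are hypothetical, and this is not the authors' procedure.

```python
from itertools import combinations
from math import sqrt


def mean_pool(token_vectors):
    """Aggregate per-token vectors into one fixed-length sequence vector (assumed aggregation)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cophenetic_distances(dist):
    """Average-linkage clustering on a square distance matrix.

    Returns {(i, j): height at which leaves i < j are first merged}.
    """
    clusters = {i: [i] for i in range(len(dist))}

    def average(a, b):
        # Mean leaf-to-leaf distance between clusters a and b.
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    coph = {}
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: average(*p))
        height = average(a, b)
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        clusters[a] = clusters[a] + clusters.pop(b)
    return coph


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (undefined if either is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def tree_agreement(sequence_vectors, reference_dist):
    """Correlate merge heights of the embedding tree with those of a reference tree."""
    emb_dist = [[euclidean(u, v) for v in sequence_vectors] for u in sequence_vectors]
    emb = cophenetic_distances(emb_dist)
    ref = cophenetic_distances(reference_dist)
    pairs = sorted(emb)
    return pearson([emb[p] for p in pairs], [ref[p] for p in pairs])
```

Here `reference_dist` stands for a square matrix of phylogenetic or functional distances between the same sequences, in the same order as `sequence_vectors`. A score near 1 would indicate that sequences grouped together in the embedding space are also grouped together in the reference tree; other tree-comparison measures could be substituted for the cophenetic correlation.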

Tolloso, M., Galfre, S.G., Pavone, A., Podda, M., Sirbu, A., Priami, C. (2024). How Much Do DNA and Protein Deep Embeddings Preserve Biological Information? [10.1007/978-3-031-71671-3_15].

How Much Do DNA and Protein Deep Embeddings Preserve Biological Information?

Sirbu, Alina
2024

International Conference on Computational Methods in Systems Biology, 2024, pp. 209-225
Tolloso, Matteo; Galfre, Silvia Giulia; Pavone, Arianna; Podda, Marco; Sirbu, Alina; Priami, Corrado

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1008320