Motivation: methyLImp, a method we recently introduced for missing value estimation in DNA methylation data, has demonstrated competitive imputation performance compared with existing general-purpose approaches. However, its running time was considerably long, making it infeasible for large datasets with numerous missing values.

Results: methyLImp2 makes possible computations that were previously infeasible. We achieved this by introducing two important modifications that significantly reduce the original running time without sacrificing prediction performance. First, we implemented a chromosome-wise parallel version of methyLImp. This parallelization reduced the runtime by several 10-fold in our experiments. Second, to handle large datasets, we introduced a mini-batch approach that uses only a subset of the samples for the imputation. This further reduces the running time from days to hours or even minutes on large datasets.

Availability and implementation: The R package methyLImp2 is under review for Bioconductor. It is currently freely available on GitHub at https://github.com/annaplaksienko/methyLImp2.
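The two speed-ups named in the Results (chromosome-wise parallelism and mini-batching over samples) can be illustrated with a minimal Python sketch. This is not the methyLImp2 implementation, which is an R package: the function names `impute_chromosome` and `impute_by_chromosome`, the dictionary layout of the data, and the `batch_size` parameter are all hypothetical, and the per-probe mean over the mini-batch is a deliberately simple stand-in for the actual imputation model. Threads are used only to keep the example self-contained; a real CPU-bound speed-up in Python would likely need separate processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def impute_chromosome(rows, batch_size, seed=0):
    """Fill None entries in one chromosome's samples x probes table.

    Stand-in model (an assumption, not the methyLImp2 method): each missing
    value is replaced by the probe mean over a random mini-batch of the
    samples that are observed at that probe.
    """
    rng = random.Random(seed)
    filled = [list(r) for r in rows]  # copy so the input is not mutated
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            continue  # no information for this probe; leave it missing
        # Mini-batch: use at most batch_size observed samples.
        if len(observed) > batch_size:
            observed = rng.sample(observed, batch_size)
        estimate = mean(observed)
        for r in filled:
            if r[j] is None:
                r[j] = estimate
    return filled


def impute_by_chromosome(data, batch_size=2, workers=2):
    """Impute each chromosome independently, one task per chromosome.

    data: dict mapping chromosome name -> samples x probes list of lists.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            chrom: pool.submit(impute_chromosome, rows, batch_size)
            for chrom, rows in data.items()
        }
        return {chrom: fut.result() for chrom, fut in futures.items()}
```

Because each chromosome is handled as an independent task, the work splits naturally across workers, and capping the number of samples per probe bounds the cost of each estimate regardless of how many samples the dataset holds.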

Anna Plaksienko, P.D.L. (2024). methyLImp2.

methyLImp2

Pietro Di Lena (second author); Christine Nardini (penultimate author); 2024

Anna Plaksienko, Pietro Di Lena, Christine Nardini, Claudia Angelini

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/962589