Anna Plaksienko, P.D.L. (2024). methyLImp2.
methyLImp2
Pietro Di Lena (second author); Christine Nardini (penultimate author)
2024
Abstract
Motivation: methyLImp, a method we recently introduced for missing value estimation in DNA methylation data, has demonstrated competitive imputation performance compared to existing general-purpose approaches. However, its running time was considerably long, which made imputation infeasible for large datasets with numerous missing values.

Results: methyLImp2 made possible computations that were previously infeasible. We achieved this by introducing two important modifications that significantly reduce the original running time without sacrificing prediction performance. First, we implemented a chromosome-wise parallel version of methyLImp. This parallelization reduced the runtime by several 10-fold in our experiments. Then, to handle large datasets, we introduced a mini-batch approach that uses only a subset of the samples for the imputation. This further reduces the running time from days to hours or even minutes on large datasets.

Availability and implementation: The R package methyLImp2 is under review for Bioconductor. It is currently freely available on GitHub at https://github.com/annaplaksienko/methyLImp2.
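The two modifications described in the abstract, chromosome-wise parallelization and mini-batch sampling, can be illustrated with a minimal sketch. This is not the methyLImp2 implementation or its API: the function names, the data layout (a dict of probe name to per-sample values, with None marking a missing value) and the mean-based imputer are all assumptions made for illustration, and the mean is only a stand-in for the actual imputation model. The sketch uses a thread pool to show the dispatch structure; a process pool would be the likelier choice for real CPU-bound work.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def split_by_chromosome(probe_chrom):
    """Group probe names by chromosome label (hypothetical annotation dict)."""
    groups = {}
    for probe, chrom in probe_chrom.items():
        groups.setdefault(chrom, []).append(probe)
    return groups


def impute_chromosome(data, probes, n_samples, batch_size, seed):
    """Fill missing values (None) for the probes of one chromosome.

    Stand-in imputer: each gap is filled with the probe's mean over a random
    mini-batch of samples. This is NOT the methyLImp model, only a placeholder
    that shows where a subset of samples would be used.
    """
    rng = random.Random(seed)
    batch = rng.sample(range(n_samples), min(batch_size, n_samples))
    result = {}
    for probe in probes:
        values = data[probe]
        observed = [values[i] for i in batch if values[i] is not None]
        if not observed:
            # The mini-batch held no observed value: fall back to all samples.
            observed = [v for v in values if v is not None]
        fill = fmean(observed) if observed else None
        result[probe] = [fill if v is None else v for v in values]
    return result


def impute(data, probe_chrom, batch_size=2, workers=4, seed=0):
    """Impute each chromosome as an independent task, then merge the results."""
    if not data:
        return {}
    n_samples = len(next(iter(data.values())))
    groups = split_by_chromosome(probe_chrom)
    imputed = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(impute_chromosome, data, probes, n_samples,
                        batch_size, seed + k)
            for k, probes in enumerate(groups.values())
        ]
        for future in futures:
            imputed.update(future.result())
    return imputed
```

Because each chromosome is handled as a separate task, the work per task shrinks with the number of probes on that chromosome, and the mini-batch size bounds how many samples each task has to read when filling a gap.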