Data lakes are storage repositories that contain large amounts of data in its native format; either structured ssemi-structured or unstructured, to be used when needed. Data lakes are open to a wide range of use cases such as carrying out advanced analytics, extracting knowledge patterns, etc. However, simply dumping all the data into a data lake would only lead to a so-called data swamp. To prevent such a situation, enterprises can adopt best practices among which to build and maintain metadata. In recent years there has been a growing body of research about managing metadata in data lake environments. Existing research efforts deal separately with different activities such as metadata modeling, metadata capture and extraction, metadata usage, etc. Nevertheless, despite its importance, a global view about the research landscape about metadata management for data lakes is still missing. This survey congre- gates different facets of metadata management in data lakes and presents a global view along with the technological impli- cations and the required features for building successful meta- data management systems. Besides, this survey summarizes and discusses research gaps, open problems and main chal- lenges facing both industrialists and academics. This survey pertains to the broader field of Big Data and especially to the data platforms that manage enterprise big data assets. Furthermore, considering the parallels between data lakes and digital libraries regarding their dependence on metadata for content management, this study could offer valuable insights to the digital library community, offering them a technological outlook on metadata management.
Boukraaa, D., Balab, M., Rizzi, S. (2024). Metadata Management in Data Lake Environments: a Survey. JOURNAL OF LIBRARY METADATA, 24(4), 215-274 [10.1080/19386389.2024.2359310].
Metadata Management in Data Lake Environments: a Survey
Stefano Rizzi
2024
Abstract
Data lakes are storage repositories that contain large amounts of data in its native format; either structured ssemi-structured or unstructured, to be used when needed. Data lakes are open to a wide range of use cases such as carrying out advanced analytics, extracting knowledge patterns, etc. However, simply dumping all the data into a data lake would only lead to a so-called data swamp. To prevent such a situation, enterprises can adopt best practices among which to build and maintain metadata. In recent years there has been a growing body of research about managing metadata in data lake environments. Existing research efforts deal separately with different activities such as metadata modeling, metadata capture and extraction, metadata usage, etc. Nevertheless, despite its importance, a global view about the research landscape about metadata management for data lakes is still missing. This survey congre- gates different facets of metadata management in data lakes and presents a global view along with the technological impli- cations and the required features for building successful meta- data management systems. Besides, this survey summarizes and discusses research gaps, open problems and main chal- lenges facing both industrialists and academics. This survey pertains to the broader field of Big Data and especially to the data platforms that manage enterprise big data assets. Furthermore, considering the parallels between data lakes and digital libraries regarding their dependence on metadata for content management, this study could offer valuable insights to the digital library community, offering them a technological outlook on metadata management.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


