Datasheets for Digital Cultural Heritage Datasets

Alkemade, Henk; Claeyssens, Steven; Colavizza, Giovanni; Freire, Nuno; Lehmann, Jörg; Neudecker, Clemens; Osti, Giulia; van Strien, Daniel

doi:10.5334/johd.124

Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent to provide context and information on provenance, purposes, composition, the collection process, recommended uses, and societal biases reflected in training datasets. This approach fits well with practices and procedures established in GLAM institutions, such as establishing collections’ descriptions. However, digital cultural heritage datasets are marked by specific characteristics. They are often the product of multiple layers of selection; they may have been created for purposes other than establishing a statistical sample according to a specific research question; they change over time and are heterogeneous. Punctuated by a series of recommendations for creating datasheets for digital cultural heritage, the paper addresses the scope and characteristics of digital cultural heritage datasets; possible metrics and measures; and lessons from concepts similar to datasheets and from established workflows in the cultural heritage sector. The paper includes a proposal for a datasheet template adapted for use in cultural heritage institutions, which incorporates information on the motivation and selection criteria, digitisation pipeline, data provenance, the use of linked open data, and version information.

Alkemade, H., Claeyssens, S., Colavizza, G., Freire, N., Lehmann, J., Neudecker, C., Osti, G., & van Strien, D. (2023). Datasheets for Digital Cultural Heritage Datasets. Journal of Open Humanities Data, 9(17), 1-11 [10.5334/johd.124].