In document-oriented databases, schema is a soft concept and the documents in a collection can be stored using different local schemata. This gives designers and implementers augmented flexibility; however, it requires an extra effort to understand the rules that drove the use of alternative schemata when sets of documents with different ---and possibly conflicting--- schemata are to be analyzed or integrated. In this paper we propose a technique, called schema profiling, to explain the schema variants within a collection in document-oriented databases by capturing the hidden rules explaining the use of these variants. We express these rules in the form of a decision tree (schema profile). Consistently with the requirements we elicited from real users, we aim at creating explicative, precise, and concise schema profiles. The algorithm we adopt to this end is inspired by the well-known C4.5 classification algorithm and builds on two original features: the coupling of value-based and schema-based conditions within schema profiles, and the introduction of a novel measure of entropy to assess the quality of a schema profile. A set of experimental tests made on both synthetic and real datasets demonstrates the effectiveness and efficiency of our approach.

Enrico Gallinucci, M.G. (2018). Schema profiling of document-oriented databases. INFORMATION SYSTEMS, 75, 13-25 [10.1016/j.is.2018.02.007].

Schema profiling of document-oriented databases

Enrico Gallinucci;Matteo Golfarelli;Stefano Rizzi
2018

Abstract

In document-oriented databases, schema is a soft concept and the documents in a collection can be stored using different local schemata. This gives designers and implementers augmented flexibility; however, it requires an extra effort to understand the rules that drove the use of alternative schemata when sets of documents with different ---and possibly conflicting--- schemata are to be analyzed or integrated. In this paper we propose a technique, called schema profiling, to explain the schema variants within a collection in document-oriented databases by capturing the hidden rules explaining the use of these variants. We express these rules in the form of a decision tree (schema profile). Consistently with the requirements we elicited from real users, we aim at creating explicative, precise, and concise schema profiles. The algorithm we adopt to this end is inspired by the well-known C4.5 classification algorithm and builds on two original features: the coupling of value-based and schema-based conditions within schema profiles, and the introduction of a novel measure of entropy to assess the quality of a schema profile. A set of experimental tests made on both synthetic and real datasets demonstrates the effectiveness and efficiency of our approach.
2018
Enrico Gallinucci, M.G. (2018). Schema profiling of document-oriented databases. INFORMATION SYSTEMS, 75, 13-25 [10.1016/j.is.2018.02.007].
Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi
File in questo prodotto:
File Dimensione Formato  
is18-Profiling-postprint.pdf

accesso aperto

Tipo: Postprint
Licenza: Licenza per Accesso Aperto. Creative Commons Attribuzione - Non commerciale - Non opere derivate (CCBYNCND)
Dimensione 1.02 MB
Formato Adobe PDF
1.02 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/629348
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 62
  • ???jsp.display-item.citation.isi??? 41
social impact