KIParla is a large, modular corpus of spontaneous spoken Italian originally transcribed in ELAN using Jefferson-style conventions. While this representation preserves fine-grained interactional detail and time alignment, it limits interoperability, large-scale querying, and computational reuse. This paper presents the design and implementation of a pseudo-tokenized, verticalized pivot format developed to support validation, maintenance, and infrastructural integration without sacrificing descriptive richness. The proposed format makes explicit the analytical units implicit in Jefferson transcription—transcription units, spans, and tokens—and enforces well-formedness constraints at character, span, and unit levels. Overlap, the most complex relational phenomenon, is resolved through a graph-based algorithm that derives temporal overlap events from alignment data and deterministically matches them to textual spans. Each token is represented as a structured record enriched with lexical, prosodic, interactional, and alignment features, anchored through explicit character offsets. The vertical format functions as a maintained pivot representation from which alternative formats, including ELAN files and UD-compatible treebank representations, can be reproducibly derived. This architecture enables large-scale lemmatization, part-of-speech tagging, syntactic annotation, and cross-layer querying, while supporting version-controlled, DevOps-inspired workflows for sustainable corpus growth. The KIParla pivot format thus reconciles interaction-oriented transcription practices with computational standards and provides a model for reuse-oriented spoken-language data engineering.
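As a rough illustration of the overlap step described above, the following minimal Python sketch derives overlap events from time-aligned units by building a graph and taking its connected components. It is not the authors' implementation: the `Unit` record, its field names, and the functions `overlap_graph` and `overlap_events` are hypothetical, and the sketch covers only the temporal side, not the matching of events to textual spans.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Unit:
    """Hypothetical transcription unit: one speaker's time-aligned stretch of talk."""
    uid: str
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap_graph(units):
    """Link two units when different speakers share a non-empty time interval."""
    graph = {u.uid: set() for u in units}
    for a, b in combinations(units, 2):
        shared = min(a.end, b.end) - max(a.start, b.start)
        if a.speaker != b.speaker and shared > 0:
            graph[a.uid].add(b.uid)
            graph[b.uid].add(a.uid)
    return graph


def overlap_events(units):
    """Return connected components of the overlap graph, in a deterministic order."""
    graph = overlap_graph(units)
    seen, events = set(), []
    for uid in sorted(graph):
        # Skip units already assigned to an event and units with no overlap at all.
        if uid in seen or not graph[uid]:
            continue
        stack, component = [uid], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        events.append(sorted(component))
    return events


units = [
    Unit("u1", "A", 0.0, 2.0),
    Unit("u2", "B", 1.5, 3.0),
    Unit("u3", "C", 2.8, 4.0),
    Unit("u4", "A", 5.0, 6.0),
]
print(overlap_events(units))  # [['u1', 'u2', 'u3']]
```

In this toy input, u1 and u3 never overlap directly but end up in the same event because both overlap u2. Whether chained overlaps should be grouped this way or split into pairwise events is a design choice the abstract does not specify.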
Pannitto, L., Mauri, C. (2026). Reuse by Design: A Pivot-Based Architecture for the KIParla Corpus of Spoken Italian. Journal of Open Humanities Data, 12, 1-17 [10.5334/johd.527].
Reuse by Design: A Pivot-Based Architecture for the KIParla Corpus of Spoken Italian
Pannitto, Ludovica; Mauri, Caterina
2026



