Reuse by Design: A Pivot-Based Architecture for the KIParla Corpus of Spoken Italian

Pannitto, Ludovica; Mauri, Caterina
2026

Abstract

KIParla is a large, modular corpus of spontaneous spoken Italian originally transcribed in ELAN using Jefferson-style conventions. While this representation preserves fine-grained interactional detail and time alignment, it limits interoperability, large-scale querying, and computational reuse. This paper presents the design and implementation of a pseudo-tokenized, verticalized pivot format developed to support validation, maintenance, and infrastructural integration without sacrificing descriptive richness. The proposed format makes explicit the analytical units implicit in Jefferson transcription—transcription units, spans, and tokens—and enforces well-formedness constraints at character, span, and unit levels. Overlap, the most complex relational phenomenon, is resolved through a graph-based algorithm that derives temporal overlap events from alignment data and deterministically matches them to textual spans. Each token is represented as a structured record enriched with lexical, prosodic, interactional, and alignment features, anchored through explicit character offsets. The vertical format functions as a maintained pivot representation from which alternative formats, including ELAN files and UD-compatible treebank representations, can be reproducibly derived. This architecture enables large-scale lemmatization, part-of-speech tagging, syntactic annotation, and cross-layer querying, while supporting version-controlled, DevOps-inspired workflows for sustainable corpus growth. The KIParla pivot format thus reconciles interaction-oriented transcription practices with computational standards and provides a model for reuse-oriented spoken-language data engineering.
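The overlap step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` record, the function names, the sample data, and the pairing rule (the k-th bracketed span of a unit is paired with the k-th overlap event involving that unit) are all assumptions made for illustration. The only conventions taken as given are that units carry time alignment and that Jefferson-style transcripts mark overlapped talk with square brackets.

```python
import re
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Unit:
    """Hypothetical transcription unit: one time-aligned stretch of talk."""
    uid: str
    speaker: str
    start: float  # seconds
    end: float
    text: str     # overlapped stretches marked with [ ... ]


def overlap_events(units):
    """Return (uid_a, uid_b, onset, offset) for every pair of units by
    different speakers whose time intervals intersect, sorted by onset."""
    events = []
    for a, b in combinations(units, 2):
        onset, offset = max(a.start, b.start), min(a.end, b.end)
        if a.speaker != b.speaker and onset < offset:
            events.append((a.uid, b.uid, onset, offset))
    return sorted(events, key=lambda ev: (ev[2], ev[3], ev[0], ev[1]))


BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def bracketed_spans(text):
    """Character offsets (start, end) and content of each [ ... ] span."""
    return [(m.start(1), m.end(1), m.group(1)) for m in BRACKETED.finditer(text)]


def match_spans(units, events):
    """Pair the k-th bracketed span of each unit with the k-th event that
    involves it (assumed rule); count mismatches are reported, not guessed."""
    matches, problems = {}, []
    for unit in units:
        own = [i for i, ev in enumerate(events) if unit.uid in ev[:2]]
        spans = bracketed_spans(unit.text)
        if len(own) != len(spans):
            problems.append((unit.uid, len(spans), len(own)))
            continue
        matches[unit.uid] = [
            (start, end, content, i) for (start, end, content), i in zip(spans, own)
        ]
    return matches, problems


# Invented sample data, for illustration only.
units = [
    Unit("u1", "A", 0.0, 2.0, "ciao [come stai]"),
    Unit("u2", "B", 1.2, 3.0, "[bene grazie] e tu"),
    Unit("u3", "A", 5.0, 6.0, "va bene"),
]
events = overlap_events(units)
matches, problems = match_spans(units, events)
```

In this sketch a unit whose number of bracketed spans differs from the number of temporal overlaps it takes part in ends up in `problems` instead of being matched. This mirrors, in a simplified way, the idea of enforcing well-formedness before deriving other formats.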
Pannitto, L., Mauri, C. (2026). Reuse by Design: A Pivot-Based Architecture for the KIParla Corpus of Spoken Italian. JOURNAL OF OPEN HUMANITIES DATA, 12, 1-17 [10.5334/johd.527].
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1069383