Towards markup support for full GODDAGs and beyond: the EARMARK approach

DI IORIO, ANGELO; PERONI, SILVIO; VITALI, FABIO
2009

Abstract

One of the most evident tenets of the literature on overlapping markup is that the view of documents as trees (as dictated by meta-markup languages such as SGML and XML) is a simplification that sometimes fails and requires corrections. These corrections have been proposed at the markup level (e.g., milestones, segmentation), at the meta-markup level (e.g., LMNL, TexMecs, XCONCUR) or at the level of the abstract model (e.g., GODDAG). Unfortunately, full GODDAGs cannot be linearized in general, and so a restricted version of GODDAG, r-GODDAG, has been proposed that is guaranteed to be linearizable (in TexMecs) and still allows many useful features beyond trees. In this paper we argue that the problem of linearizing more-than-hierarchical structures lies essentially in the embedding of markup within content, and that no such problem arises with an appropriate stand-off approach, which is able to represent full GODDAGs without restrictions. This makes it possible to handle interesting markup features that can be described with GODDAGs but not with r-GODDAGs, such as non-contiguous elements and virtual elements. In addition, we discuss whether a specific constraint of full GODDAGs is really necessary once all residual hopes of embeddability are given up, and we propose a minimal extension to GODDAG, ingeniously called "extended GODDAG" (e-GODDAG), that, by removing the requirement for names in non-terminal nodes, adds support for further interesting markup features such as content repetitions. Admittedly, e-GODDAGs are even less embeddable than full GODDAGs, but they are handled just as easily with stand-off markup. We further propose a meta-syntax for non-embedded markup, called EARMARK, that can be used for stand-off annotations of textual content and that naturally represents e-GODDAGs with fully W3C-compliant technologies.
EARMARK is based on an ontologically precise definition of markup that instantiates the markup of a text document as an OWL document. Through appropriate OWL and SWRL characterizations it can define structures such as trees, r-GODDAGs, full GODDAGs and e-GODDAGs, and it can be used to generate validity constraints (including co-constraints) and to verify adherence to content model patterns. As mentioned, embedding a full EARMARK document is not possible in general, but steps can be taken in that direction: just as segmentation and fragmentation are strategies for embedding an r-GODDAG-specific feature such as overlapping elements in a strictly hierarchical language, a number of strategies exist for embedding GODDAG and e-GODDAG features in less expressive syntaxes. In the final part of the paper we discuss our wish to provide, at the metalanguage level, a series of embedding strategies for the non-hierarchical features of EARMARK, i.e. a number of language-independent mechanisms that express e-GODDAG structures in XML (as well as in TexMecs and in LMNL) and that can be recognized as such (i.e., as strategies, as tricks) by tools and readers alike, especially for further uses of such documents.
2009
Balisage Series on Markup Technologies
unpaginated
A. Di Iorio; S. Peroni; F. Vitali

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/87836