Towards the quantification of the semantic information encoded in written language

Montemurro, M. A.; Zanette, D. H.

doi:10.1142/S0219525910002530

Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, of roughly a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contribution to the overall information is larger are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information. © 2010 World Scientific Publishing Company.
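As an illustration of the kind of calculation the abstract describes, the following minimal sketch estimates how much information the identity of a word carries about the part of the text in which it occurs, and compares that value with the same quantity for a randomly shuffled copy of the text. It is not the authors' estimator, which the abstract does not specify: the function names, the partition into equal-length parts, and the shuffle baseline are all assumptions made here for illustration.

```python
import math
import random
from collections import Counter


def segment_word_information(words, num_parts):
    """Mutual information (in bits) between word identity and part index.

    Hypothetical helper: the text is cut into `num_parts` equal-length
    parts, and any leftover words at the end are ignored.
    """
    part_len = len(words) // num_parts
    if part_len == 0:
        raise ValueError("num_parts is larger than the number of words")
    used = words[: part_len * num_parts]
    total = len(used)
    word_counts = Counter(used)
    p_part = part_len / total  # every part has the same length

    info = 0.0
    for j in range(num_parts):
        part_counts = Counter(used[j * part_len:(j + 1) * part_len])
        for word, count in part_counts.items():
            p_joint = count / total
            p_word = word_counts[word] / total
            info += p_joint * math.log2(p_joint / (p_word * p_part))
    return info


def excess_information(words, num_parts, shuffles=20, seed=0):
    """Observed information minus the average over shuffled copies.

    Assumption: shuffling destroys word order but keeps word frequencies,
    so the difference isolates the contribution of long-range ordering.
    """
    rng = random.Random(seed)
    observed = segment_word_information(words, num_parts)
    pool = list(words)
    baseline = 0.0
    for _ in range(shuffles):
        rng.shuffle(pool)
        baseline += segment_word_information(pool, num_parts)
    return observed - baseline / shuffles


# Toy text with two topical domains: "cat" in the first half, "dog" in the second.
toy_text = ["the", "cat"] * 50 + ["the", "dog"] * 50
print(segment_word_information(toy_text, 2))
print(excess_information(toy_text, 2))
```

In the toy text, the topical words "cat" and "dog" account for all of the information, while the evenly spread word "the" contributes nothing. Under these assumptions, repeating the calculation for several values of `num_parts` would be one way to look for a segment size at which the excess information peaks.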

Montemurro M.A., Zanette D.H. (2010). Towards the quantification of the semantic information encoded in written language. ADVANCES IN COMPLEX SYSTEMS, 13(2), 135-153 [10.1142/S0219525910002530].


Montemurro M. A. (member of the Collaboration Group)

2010



Year: 2010
Journal: ADVANCES IN COMPLEX SYSTEMS
DOI: https://dx.doi.org/10.1142/S0219525910002530
Citation: Montemurro M.A., Zanette D.H. (2010). Towards the quantification of the semantic information encoded in written language. ADVANCES IN COMPLEX SYSTEMS, 13(2), 135-153 [10.1142/S0219525910002530].
All authors: Montemurro M.A.; Zanette D.H.
Appears in types: 1.01 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/770505
