quantification of the semantic information encoded in written language

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

http://arxiv.org/abs/0907.1558 - Towards the quantification of the semantic information encoded in written language

Abstract: "Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information is larger, are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information."


More accessible version:

http://bit.ly/12AQvD - Statistics could help decode ancient scripts

"A STATISTICAL method that picks out the most significant words in a book could help scholars decode ancient texts like the Voynich manuscript - or even messages from aliens...

... For a more detailed analysis, the team calculated the "entropy" of each word, a measure of how evenly distributed it is, in both the original text and in a scrambled version in which the words appeared in a random stream. From the difference between the two entropies multiplied by the frequency of the word, the team generated that word's "information value" in the text."
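The procedure described above is straightforward to sketch in code: split the text into equal parts, measure how evenly a word spreads over those parts (its entropy), do the same for a shuffled copy of the text, and weight the entropy difference by the word's frequency. This is a minimal illustration, not the authors' implementation; the number of parts (`num_parts`), the shuffle seed, and the toy text below are my own assumptions.

```python
import math
import random


def word_entropy(text_words, word, num_parts):
    """Entropy (in bits) of a word's distribution across equal parts of the text.

    A word spread evenly over all parts has maximal entropy; a word
    concentrated in a few parts has low entropy.
    """
    part_len = len(text_words) // num_parts
    counts = [text_words[i * part_len:(i + 1) * part_len].count(word)
              for i in range(num_parts)]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


def information_value(text_words, word, num_parts=8, seed=0):
    """Frequency-weighted entropy gap between a shuffled and the original text.

    Topic words cluster in the original text (low entropy) but spread out
    once the text is shuffled (high entropy), so the gap is large; function
    words are already spread out, so the gap is near zero.
    """
    original = word_entropy(text_words, word, num_parts)
    shuffled = list(text_words)
    random.Random(seed).shuffle(shuffled)
    randomized = word_entropy(shuffled, word, num_parts)
    freq = text_words.count(word) / len(text_words)
    return freq * (randomized - original)


# Toy example: "topic" is confined to the first half of the text,
# "the" appears uniformly throughout, so "topic" should score higher.
text = ["topic", "the"] * 10 + ["filler", "the"] * 10
print(information_value(text, "topic", num_parts=4))
print(information_value(text, "the", num_parts=4))
```

Ranking a text's vocabulary by this score, rather than by raw frequency, is what pushes topic words to the top of the list while demoting grammatical words that are frequent everywhere.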

I am very excited about this 'meta' analysis: it's an evolution of the perceptual framework previously occupied by 'frequency lists' and the like, which I use as virtual reference points. It gives me a few idea-seeds on possible applications to language learning. Need to let them gestate, though.

Edit - Tentative thoughts on potential uses: identifying the most informative segments of a text (domains of perhaps 1,000-3,000 words) and using them for condensed contextual reading; gleaning from these domains the most significant themes, which could then be used to organize and customize material according to preference. Also, using 'information value' rather than simple word frequency to prioritize vocabulary learning, making the items to be studied more relevant and the studying more efficient.

Last edited by nest0r (2009 August 18, 6:35 am)
