This is especially interesting to me, in light of the comparison of scripted dialogue to 'unscripted' spoken corpora.
The Use of Film Subtitles to Estimate Word Frequencies
Abstract:
We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than text writing. We compiled a corpus of 52 million French words, coming from a variety of films. Frequency measures based on this corpus compared well to other spoken and written frequency measures, and explained variance in lexical decision times in addition to what is accounted for by the available French written frequency measures.
Asriel
Member
From: 東京
Registered: 2008-02-26
Posts: 1343
nest0r, you know I love the effort you put into research, and it truly is amazing. Perhaps you could set up a website/blog? Perhaps you already have one...
Anyway, I had thought of something similar to this a few days ago, when I realized how easy it is to rip subtitles from DVDs. Unfortunately, that results in idx/sub files, which are hard to pull usable text from, since they store the subtitles as images rather than text.
I'm thinking that something like this could be done relatively simply, even just using the subtitles over at d-addicts, etc. Especially for someone more talented at coding than me. Set up a Japanese parser*, run the subtitle text through, and you have your numbers.
edit: *I don't remember the name of that Japanese parser, but it's not Mezbup, because that's a member here. It's like... Mecbap, Mecchap, Hubcap, or something along those lines...
Last edited by Asriel (2010 April 03, 11:48 pm)
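For what it's worth, the subtitle-counting workflow described above can be sketched in a few lines of Python. This is just a toy sketch: it strips cue numbers, timestamps, and markup from .srt-style subtitle data and counts tokens with a naive regex tokenizer. The sample data is made up, and for Japanese you would replace the regex tokenizer with a proper morphological analyzer, since Japanese text has no spaces between words.

```python
import re
from collections import Counter

def srt_to_text(srt: str) -> str:
    """Strip cue indices, timestamps, and markup from .srt subtitle data."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # skip blank lines and cue numbers
        if "-->" in line:
            continue  # skip timestamp lines like 00:00:01,000 --> 00:00:03,000
        kept.append(re.sub(r"<[^>]+>", "", line))  # drop <i>...</i> style tags
    return " ".join(kept)

def word_frequencies(text: str) -> Counter:
    """Naive tokenizer for spaced languages; a Japanese pipeline would
    segment the cleaned text with a morphological analyzer instead."""
    return Counter(re.findall(r"[\w']+", text.lower()))

# Hypothetical sample; in practice you would read many .srt files from disk.
sample = """1
00:00:01,000 --> 00:00:03,000
<i>Hello there.</i>

2
00:00:04,000 --> 00:00:06,000
Hello again, world.
"""

freqs = word_frequencies(srt_to_text(sample))
```

From there it's just a matter of looping over a whole folder of subtitle files and merging the counters, which is essentially what the paper did at a 52-million-word scale.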
I don't know, I didn't look too hard, because I was thinking they might be worried about copyright? Even though at that point it would just be words, perhaps they were iffy about having the paper and the list so close together. But Boris New participated, and they seem to also be involved with Lexique, so.
Last edited by nest0r (2010 April 04, 1:01 pm)