The Use of Film Subtitles to Estimate Word Frequencies

#1
This is especially interesting to me, in light of the comparison of scripted dialogue to 'unscripted' spoken corpora.

The Use of Film Subtitles to Estimate Word Frequencies

Abstract:

We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than text writing. We compiled a corpus of 52 million French words, coming from a variety of films. Frequency measures based on this corpus compared well to other spoken and written frequency measures, and explained variance in lexical decision times in addition to what is accounted for by the available French written frequency measures.
Reply
#2
nest0r, you know I love the effort you put into research, and it truly is amazing. Perhaps you could set up a website/blog? Perhaps you already have one...

Anyway, I had thought of something similar to this a few days ago, when I realized how easy it is to rip subtitles from DVDs. Unfortunately, that results in idx/sub files, which are image-based and hard to pull usable text from.

I'm thinking that something like this could be done relatively simply, even just using the subtitles over at d-addicts, etc. Especially for someone more talented at coding than me. Set up a Japanese parser*, run the subtitles through, and have numbers.

edit: *I don't remember the name of that Japanese Parser, but it's not Mezbup, because that's a member here. It's like..Mecbap, Mecchap, Hubcap, or something along those lines...
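The pipeline above (grab subtitle files, strip the timing info, tokenize, count) is simple enough to sketch. Here's a minimal, hypothetical version in Python for plain-text .srt files; it just splits on a Latin-letter regex, so for Japanese you'd swap that step for a real tokenizer (MeCab or similar). The `srt_to_text` and `word_frequencies` names are mine, not from any existing tool.

```python
import re
from collections import Counter

def srt_to_text(srt: str) -> str:
    """Strip cue numbers, timestamps, and markup from an .srt subtitle file."""
    lines = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue  # skip blank lines, cue indices, and timing lines
        lines.append(re.sub(r"<[^>]+>", "", line))  # drop <i>...</i>-style tags
    return " ".join(lines)

def word_frequencies(texts):
    """Count word occurrences across subtitle texts (naive Latin tokenization)."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in re.findall(r"[a-zA-Z']+", text))
    return counts

sample = """1
00:00:01,000 --> 00:00:03,000
<i>Hello there.</i>

2
00:00:04,000 --> 00:00:06,000
Hello again, friend."""

freqs = word_frequencies([srt_to_text(sample)])
print(freqs.most_common(3))  # "hello" counted twice
```

Run that over a whole folder of subtitle files instead of one sample string and you have a rough frequency list; the hard part, as noted, is the tokenizer for languages without spaces.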
Edited: 2010-04-03, 11:48 pm
Reply
#3
Asriel Wrote:set up Mezbup
できた! ("Done!")
Reply
#4
My posts in this forum = nest0r's blog. ;p

FooSoft has a nice site set up, and has even done some light novel analysis. Can't remember if they've tinkered with subs or not, but yes, there's so much potential out there for multimedia corpora/generating resources from native materials to keep some folks from falling into the binary trap (learning materials vs. native, etc.).
Reply
#5
Asriel Wrote:edit: *I don't remember the name of that Japanese Parser, but it's not Mezbup, because that's a member here. It's like..Mecbap, Mecchap, Hubcap, or something along those lines...
MeCab.
Reply
#6
How did I miss this? This thread is full of kittens and frogs! Nest0r, once again, great link, but you forgot to give me a thinly veiled hint to this. Do you know if there's an available frequency list from this corpus study?
Reply
#7
I don't know, I didn't look too hard, because I was thinking they might be worried about copyright? Even though at that point it would just be words, perhaps they were iffy about having the paper and the list so close together. But Boris New participated, and he also seems to be involved with Lexique, so there's hope.
Edited: 2010-04-04, 1:01 pm
Reply
#8
Actually, the newest Lexique version seems to include them. You can download it for free: http://www.lexique.org/telLexique.php
Edited: 2010-04-04, 1:06 pm
Reply