The Use of Film Subtitles to Estimate Word Frequencies

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

This is especially interesting to me, in light of the comparison of scripted dialogue to 'unscripted' spoken corpora.

The Use of Film Subtitles to Estimate Word Frequencies

Abstract:

We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than text writing. We compiled a corpus of 52 million French words, coming from a variety of films. Frequency measures based on this corpus compared well to other spoken and written frequency measures, and explained variance in lexical decision times in addition to what is accounted for by the available French written frequency measures.

Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

nest0r, you know I love the effort you put into research, and it truly is amazing. Perhaps you could set up a website/blog? Perhaps you already have one...

Anyway, I had thought of something similar to this a few days ago, when I realized how easy it is to rip subtitles from DVDs. Unfortunately, that results in idx/sub files, which are hard to pull usable text from, since they store the subtitles as images rather than text (you'd need OCR to get anything out of them).

I'm thinking that something like this could be done relatively simply, even just using the subtitles over at d-addicts, etc., especially by someone more talented at coding than me. Set up a Japanese parser*, run the subtitle text through it, and out come the numbers.

edit: *I don't remember the name of that Japanese Parser, but it's not Mezbup, because that's a member here. It's like..Mecbap, Mecchap, Hubcap, or something along those lines...

Last edited by Asriel (2010 April 03, 11:48 pm)

mezbup Member
From: sausage lip Registered: 2008-09-18 Posts: 1681 Website

Asriel wrote:

set up Mezbup

Done! (できた!)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

My posts in this forum = nest0r's blog. ;p

FooSoft has a nice site set up, even did some light novel analysis. Can't remember if they've tinkered with subs or not, but yes, there's so much potential out there for multimedia corpora/generating resources from native materials, to keep some folks from falling into the binary trap (learning materials vs. native, etc.).

Reply #5 - 2010 April 04, 1:44 am
iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

Asriel wrote:

edit: *I don't remember the name of that Japanese Parser, but it's not Mezbup, because that's a member here. It's like..Mecbap, Mecchap, Hubcap, or something along those lines...

MeCab.
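
And since it came up: here's a minimal sketch of the pipeline Asriel described, assuming mecab-python3 (plus a dictionary) is installed and you already have a folder of plain-text UTF-8 .srt files, e.g. ripped from d-addicts. The file layout and names are hypothetical, so adjust to taste.

```python
# Count word frequencies in Japanese subtitle files with MeCab.
import collections
import glob
import re

import MeCab  # pip install mecab-python3 (plus a dictionary, e.g. unidic-lite)

tagger = MeCab.Tagger()
freq = collections.Counter()

for path in glob.glob("subs/*.srt"):  # hypothetical directory of subtitle files
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip .srt cue numbers and timestamps; keep only dialogue lines.
            if not line or line.isdigit() or "-->" in line:
                continue
            line = re.sub(r"<[^>]+>", "", line)  # drop formatting tags
            # tagger.parse() returns one "surface\tfeatures" line per token.
            for token in tagger.parse(line).splitlines():
                if token == "EOS":
                    continue
                surface, _, features = token.partition("\t")
                pos = features.split(",")[0]
                # Ignore punctuation (記号 in IPADIC, 補助記号 in UniDic).
                if pos not in ("記号", "補助記号"):
                    freq[surface] += 1

for word, count in freq.most_common(50):
    print(count, word)
```

Counting surface forms like this lumps conjugations separately; if you want dictionary forms instead, pull the lemma field out of the features string (its position depends on which dictionary you use).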

Reply #6 - 2010 April 04, 7:56 am
Nemotoad Member
Registered: 2010-03-17 Posts: 66

How did I miss this? This thread is full of kittens and frogs! Nest0r, once again, great link, but you forgot to give me a thinly veiled hint about this. Do you know if there's an available frequency list from this corpus study? :)

Reply #7 - 2010 April 04, 1:00 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

I don't know, I didn't look too hard, because I was thinking they might be worried about copyright? Even though at that point it would just be words, perhaps they were iffy about having the paper and the list so close together. But Boris New participated, and they also seem to be involved with Lexique, so...

Last edited by nest0r (2010 April 04, 1:01 pm)

Reply #8 - 2010 April 04, 1:05 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Actually, the newest Lexique version seems to have them. You can download it for free: http://www.lexique.org/telLexique.php
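
For anyone who grabs it, a quick sketch for pulling the subtitle-based frequencies out of the table. This assumes it unpacks to a tab-separated text file; the exact file name, encoding, and column headers vary by Lexique version (the subtitle frequency column is called something like "freqfilms"), so check the included README.

```python
import csv

# Hypothetical filename; older Lexique releases may be Latin-1 rather than UTF-8.
with open("Lexique.txt", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# Find the orthography and subtitle-frequency columns by name fragment,
# since the exact headers differ across Lexique versions.
ortho_col = next(c for c in rows[0] if "ortho" in c)
film_col = next(c for c in rows[0] if "freqfilms" in c)

# Twenty most frequent words according to the film-subtitle corpus.
for row in sorted(rows, key=lambda r: float(r[film_col] or 0), reverse=True)[:20]:
    print(row[film_col], row[ortho_col])
```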

Last edited by nest0r (2010 April 04, 1:06 pm)
