
Word frequency list, what corpus is ideal?

#1
Ok, looking at the word frequency list generated from the "innocent novel corpus", I see that many infrequent words rank high on the list because they get used over and over in the same novel.
So, I wonder how big a corpus must be to dilute those words, so that a couple of novels won't misrepresent the entire corpus.
Just think of 才人: it's not a frequent word, but it's high on that list because it's the name of the Zero no Tsukaima main character.

Also, what different kinds of sources could I use to get a relatively agnostic frequency list?

I've just downloaded the Wikipedia dump, but obviously, as an encyclopedia, it uses a certain selection of words.
I'm downloading a bunch of news websites with WinHTTrack, and I plan to use those too.
I'm thinking of using generalist websites like The Huffington Post, which covers all sorts of topics, and maybe others. Any suggestions?
#2
(2017-04-15, 12:03 pm)cophnia61 Wrote: Ok, looking at the word frequency list generated from the "innocent novel corpus", I see that many infrequent words rank high on the list because they get used over and over in the same novel.
So, I wonder how big a corpus must be to dilute those words, so that a couple of novels won't misrepresent the entire corpus.
Just think of 才人: it's not a frequent word, but it's high on that list because it's the name of the Zero no Tsukaima main character.

Also, what different kinds of sources could I use to get a relatively agnostic frequency list?

I've just downloaded the Wikipedia dump, but obviously, as an encyclopedia, it uses a certain selection of words.
I'm downloading a bunch of news websites with WinHTTrack, and I plan to use those too.
I'm thinking of using generalist websites like The Huffington Post, which covers all sorts of topics, and maybe others. Any suggestions?

Here is what I use:

https://www.amazon.com/Frequency-Diction...f+japanese
#3
(2017-04-15, 12:03 pm)cophnia61 Wrote: Ok, looking at the word frequency list generated from the "innocent novel corpus", I see that many infrequent words rank high on the list because they get used over and over in the same novel.
So, I wonder how big a corpus must be to dilute those words, so that a couple of novels won't misrepresent the entire corpus.
Just think of 才人: it's not a frequent word, but it's high on that list because it's the name of the Zero no Tsukaima main character.

Also, what different kinds of sources could I use to get a relatively agnostic frequency list?

I've just downloaded the Wikipedia dump, but obviously, as an encyclopedia, it uses a certain selection of words.
I'm downloading a bunch of news websites with WinHTTrack, and I plan to use those too.
I'm thinking of using generalist websites like The Huffington Post, which covers all sorts of topics, and maybe others. Any suggestions?

I recently made a corpus from a huge collection of anime and drama subtitles. I used such a large collection of different sources that I don't think there are many infrequent words influenced by just one or two shows. However, there are several names that get pushed up near the top.
A bigger problem, I feel, is that a lot of words can get misrepresented because of limitations in the software that generates the list. For example, a common name might not get recognized by the software, and SOME of the characters in the name happen to form an infrequent word, which then becomes a frequent word according to the software. In other situations, when the software tries to de-conjugate a word to get the dictionary form, it might end up with the wrong word. Some word endings can also end up getting mis-attributed.
In any case, I have found that any results have to be heavily vetted to ensure that the "words" being reported are 1) actually words, and 2) actually common.

As for how to make an agnostic list, I would say that's easier said than done. There are some things like the Balanced Corpus of Contemporary Written Japanese, but that is only going to cover written Japanese. If you ignore spoken Japanese, you are basically ignoring half of the language (well, not really, but it's still going to miss stuff). Plus, I find that corpus is weighted really heavily towards science and politics rather than "real life".
Edited: 2017-04-15, 4:50 pm
#4
(2017-04-15, 4:49 pm)Zarxrax Wrote: I recently made a corpus from a huge collection of anime and drama subtitles. I used such a large collection of different sources that I don't think there are many infrequent words influenced by just one or two shows. However, there are several names that get pushed up near the top.
A bigger problem, I feel, is that a lot of words can get misrepresented because of limitations in the software that generates the list. For example, a common name might not get recognized by the software, and SOME of the characters in the name happen to form an infrequent word, which then becomes a frequent word according to the software. In other situations, when the software tries to de-conjugate a word to get the dictionary form, it might end up with the wrong word. Some word endings can also end up getting mis-attributed.
In any case, I have found that any results have to be heavily vetted to ensure that the "words" being reported are 1) actually words, and 2) actually common.

As for how to make an agnostic list, I would say that's easier said than done. There are some things like the Balanced Corpus of Contemporary Written Japanese, but that is only going to cover written Japanese. If you ignore spoken Japanese, you are basically ignoring half of the language (well, not really, but it's still going to miss stuff). Plus, I find that corpus is weighted really heavily towards science and politics rather than "real life".

Thank you for the reply!
I'm interested in this Balanced Corpus of Contemporary Written Japanese. Could you explain what it is and how to acquire it? Is it an actual collection of texts, or a frequency list obtained from those texts?
#5
(2017-04-15, 5:08 pm)cophnia61 Wrote:
Thank you for the reply!
I'm interested in this Balanced Corpus of Contemporary Written Japanese. Could you explain what it is and how to acquire it? Is it an actual collection of texts, or a frequency list obtained from those texts?

This page has some basic info: http://pj.ninjal.ac.jp/corpus_center/bccwj/en/
Different variations of the frequency list can be obtained here: http://pj.ninjal.ac.jp/corpus_center/bcc...-list.html

I believe the book that phil321 referenced was based on an early version of this corpus.
#6
WOW, THANKS!!!! It's good enough for me! You saved me a lot of time!
#7
This is basically a fundamental problem with corpus linguistics. The only way to get around it is to have shitloads of text, like tens or hundreds of gigabytes. Even the BCCWJ isn't perfect: it has a lot of words that were very frequent for a very short period of time but then fell back to being less common once whatever societal event they were attached to passed by.

Another method is to basically normalize each "series" in the corpus so that no single series has too much of an effect on the overall frequency list, but analyzers generally don't give you that kind of flexibility. I'm considering doing it for mine, but the way I designed the UI doesn't really give me anywhere to put that kind of functionality. You still need a massive corpus if you do this, say a couple of gigabytes, but it's more manageable.
Edited: 2017-04-19, 2:56 pm
#8
What do you mean by "normalize"?
#9
Normalize would be to take a cross-section of text from different types of written material and put them in different buckets. When you create your master frequency list, you make sure that each bucket contributes equally, so that you don't get a list biased towards newspaper words or anime words.

That said, the Balanced Corpus of Contemporary Written Japanese that Zarxrax linked is "balanced" in exactly that way. It is composed of Yahoo! Answers and blog data, textbooks, magazines, genre fiction, etc., and then balanced/normalized. See this link for their methodology.
Edited: 2017-04-19, 4:35 pm
#10
Yeah, I was talking about balancing on a finer level and making a frequency list biased towards the "register" you want to get into. So you would make a frequency list based mostly on entertainment media, but you would normalize each contributor to that list so that a story that's really long, or that uses a single name all the time, doesn't contribute too much to the frequency of that term.
#11
Ah, sorry, I speed-read your post and missed a few bits. So you are using "entertainment" as your universe and individual series as your buckets.

Do you have a simple system for doing this? It seems like it would require a lot of elbow grease to do by hand.
Edited: 2017-04-19, 5:08 pm
#12
I'm planning on supporting it in my text analyzer. Until then, you could use individual frequency lists per series, using a spreadsheet to normalize each series to 1M total occurrences or something. It wouldn't be too hard, just pasting a formula into each spreadsheet and making things line up correctly.
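In Python, the same normalization could look something like this. A minimal sketch only: the file names and the one-"word<TAB>count"-per-line input format are assumptions, not the output of any particular tool.

from collections import Counter
import csv

TARGET = 1_000_000  # scale every series to the same total occurrences

def load_counts(path):
    # assumed input: one "word<TAB>count" pair per line
    with open(path, encoding="utf-8") as f:
        return Counter({w: int(c) for w, c in csv.reader(f, delimiter="\t")})

master = Counter()
for path in ["series_a.tsv", "series_b.tsv", "series_c.tsv"]:  # hypothetical files
    counts = load_counts(path)
    total = sum(counts.values())
    for word, count in counts.items():
        # each series contributes exactly TARGET occurrences to the master list
        master[word] += count * TARGET / total

for word, freq in master.most_common(20):
    print(word, round(freq))

Since every series then carries equal weight, a name that dominates one long series can never outrank a word that shows up moderately often across all of them.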
#13
Making everything line up would be the hard part if the lists were long. It would require a bazillion VLOOKUPs. I could do it in Python and SQL in an evening, maaaybe.
#14
Another thing:

http://link.springer.com/article/10.1007...013-9261-0

Quote:In the course of the SUW analysis of the BCCWJ, about 1 % of the corpus is analyzed with special care. This subset is named BCCWJ-Core. As far as the texts in this subset are concerned, the results of the automatic SUW analysis are checked and corrected manually so that the average accuracy becomes higher than 99 % even when they are evaluated with the third criterion mentioned above. The high-accuracy POS data thus obtained was utilized as the learning data for the training of the MeCab + UniDic analysis system.

What do they mean by "training of the MeCab + UniDic analysis system"?
Can MeCab be trained? What do they mean?
#15
UniDic is a dictionary for MeCab. MeCab dictionaries are formatted a certain way, with every possible surface form of every possible morpheme present as a search key. MeCab dictionary entries also have weights for how likely they are to occur in a non-synchronous (i.e. you don't know where the segmentations are) Markov chain, which is something /like/ training, and is what they're referring to.

Kuromoji uses MeCab dictionaries but has some extra heuristics for longish-distance morphological analysis that it feeds to its Viterbi engine.
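To make the dictionary part concrete, here is a minimal sketch, assuming the mecab-python3 bindings with the unidic-lite dictionary installed. The dictionary rows in the comments are illustrative only, with invented context IDs and costs, not real UniDic entries.

# A MeCab seed-dictionary entry is one CSV row per surface form:
#   surface,left-context-id,right-context-id,word-cost,features...
# For illustration (IDs and costs invented):
#   結婚,1285,1285,3500,名詞,普通名詞,サ変可能,...
#   する,610,610,1200,動詞,非自立可能,...
# "Training" on BCCWJ-Core means estimating those word costs, plus the
# connection costs between context IDs, from the hand-corrected data.

import MeCab  # pip install mecab-python3 unidic-lite

tagger = MeCab.Tagger()
print(tagger.parse("結婚する"))
# MeCab runs a Viterbi search over the lattice of dictionary entries and
# returns the segmentation with the lowest summed word + connection cost.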
Edited: 2017-04-20, 7:35 am
#16
(2017-04-19, 2:55 pm)wareya Wrote: Even the BCCWJ isn't perfect: it has a lot of words that were very frequent for a very short period of time but then fell back to being less common once whatever societal event they were attached to passed by.

I'm going through it now and I see a lot of imbalance towards words about the economy. :/
#17
(2017-04-23, 9:01 am)cophnia61 Wrote: I'm going through it now and I see a lot of imbalance towards words about the economy. :/

I'm just curious... why are you trying to make your own frequency list when there are published frequency lists available (e.g., Routledge)? Is it because you need more than 5,000 words?
#18
(2017-04-23, 9:01 am)phil321 Wrote: I'm just curious... why are you trying to make your own frequency list when there are published frequency lists available (e.g., Routledge)? Is it because you need more than 5,000 words?


I'm going to start a new deck where I will review words phonetically to help me with listening comprehension.
In truth, I've already tried it outside of Anki, and it is helping me so much that I decided to take a more structured approach to review.
Unlike when I started studying Japanese and mined words from light novels, I now already know the majority of those words. So it would be impractical to mine words again and add them to my new deck as I encounter them, when I could just go through an existing list of words.

Which is exactly what you're saying xD But yes, one of the problems is that I need a deck bigger than 5,000 words, and the other is that there isn't really a truly balanced list to go through. So my first thought was to make a customized list with Kuromoji, feeding it my light novels. That way I would review almost exactly the words I already know, because most of the words I've learned came from those light novels.

But then I decided to study for the JLPT, and I started to read news articles, and there are in fact many words I don't know. So, considering that I'm about to start this new Anki deck, why not use a more balanced list instead of a list based only on fiction? So I thought, why not make one myself (yeah, why??? LOL), but then Zarxrax suggested the BCCWJ, which is way better than any list I could ever make.

So I picked the BCCWJ list, which seems like one of the most "professional" ones out there, and in the end I will stick with it. But I won't lie, I think it's a little unbalanced towards news/economics words, but it's ok ._.

Maybe I'll use my light-novel-based list to unsuspend words inside the BCCWJ list and review them first, and then I'll go through the left-out words and study them.


PS: I needed a big list to feed to a utility I made, which picks a definition for every word so that I don't need to make each note by hand. This way I have a huge list of words + readings + Japanese definitions in frequency order (according to the BCCWJ) ready for use.

PS2: I gather that Routledge's A Frequency Dictionary of Japanese is based on the BCCWJ list.
Edited: 2017-04-23, 12:51 pm
#19
A Frequency Dictionary of Japanese uses the BCCWJ, as in the corpus itself, as one of its data sources, but not the "normal" word list generated from it.
Edited: 2017-04-23, 1:08 pm
#20
Yes, the Frequency Dictionary combines the BCCWJ with the Corpus of Spontaneous Japanese (CSJ), so it gets a blend of spoken and written sources.

Unfortunately I think the CSJ requires a fee for the raw data, though it has a free online lookup tool. (You can find it by browsing around on the same site that has the BCCWJ.)
Edited: 2017-04-24, 3:02 am
#21
Ok, so there are two BCCWJ lists, one with "short" words and the other with "long" words.
The short list treats things like the 店 in 喫茶店 as a suffix, so in that list you will find 喫茶 but not 喫茶店.
Likewise, the 新 in 新体操 is treated as a prefix, so you will find 体操 but not 新体操.
There are a lot more examples.

So, as is said in the preface of A Frequency Dictionary of Japanese, for practical purposes the "long" list is more suitable.

One problem with the long list is that it treats words like 結婚 and 結婚する as different words, so you will find both in the list.
A Frequency Dictionary of Japanese solves this by giving a single entry for both cases, like:

結婚(する)


So, I checked what Kuromoji does with those words instead, and it seems to treat them with the "prefix" and "suffix" logic, like the short-words version of the BCCWJ list.

I've parsed a little text file with wareya's utility, and in fact, for a text like:

新体操、喫茶店。

it gives me:

1 喫茶 キッサ キッサ 0,1 漢 名詞 普通名詞 一般 *
1 体操 タイソウ タイソー 0 漢 名詞 普通名詞 サ変可能 *
1 新 シン シン * 漢 接頭辞 * * *
1 店 テン テン * 漢 接尾辞 名詞的 一般 *
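(As far as I can tell, the columns are: occurrence count, lemma, katakana reading, pronunciation, pitch-accent position(s), word origin (漢 = Sino-Japanese), and part-of-speech tags.)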


So, I wonder: is it so bad to use the "short words" version of that list? Or, similarly, is it bad to use a Kuromoji-generated list?

I don't see any problem with words like 体操 or 喫茶, as once you know them it's straightforward to learn words like 新体操 and 喫茶店 in the wild.

What do you think about that? Especially those who use or plan to use Kuromoji / the Unnamed text analyzer.
Edited: 2017-05-10, 12:35 pm
#22
Kuromoji basically does whatever the dictionary inside it is designed to do. If a segment for 結婚する exists, and is cheaper than the segments 結婚 and する, then it'll see 結婚する instead of 結婚+する. If 結婚する exists but is more expensive than 結婚+する, 結婚+する will be used instead. And 結婚+する will be used if 結婚する doesn't exist.
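As a toy sketch of that cost comparison (all numbers invented; a real analyzer also adds connection costs between adjacent morphemes and searches a whole lattice of candidates):

# Toy model of lattice segmentation: every candidate morpheme has a
# cost, and the path with the lowest total cost wins. Costs invented.
costs = {"結婚する": 5200, "結婚": 3500, "する": 1200}

single = costs["結婚する"]             # 5200
split = costs["結婚"] + costs["する"]  # 4700

print("結婚する" if single < split else "結婚 + する")
# -> 結婚 + する, because the split path is cheaper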

I disagree with AFDoJ's decision that "long" words are preferable. When 研究者 shows up, a frequency list should definitely identify 研究 and 者. 研究者 doesn't have a unique meaning; it's the sum of its parts. As a language learner, you want to learn units of meaning first, collocations second. The only exception is when those collocations are themselves a unit of meaning, like in 新聞, which should definitely be counted as 新聞 and not 新+聞. So in this case, I think my analyzer is making a reasonable segmentation out of 新体操、喫茶店。

Kuromoji-ipadic gives a way to basically coerce it into picking up on shorter segments, but kuromoji-unidic, which my analyzer uses because it has a kana accent version, doesn't expose that. If it did I would definitely make an option to prefer shorter segments and turn it on by default. I might want to fork kuromoji anyway, so we'll see how it goes.
Edited: 2017-05-10, 1:36 pm
#23
(2017-05-10, 1:35 pm)wareya Wrote: Kuromoji basically does whatever the dictionary inside it is designed to do. If a segment for 結婚する exists, and is cheaper than the segments 結婚 and する, then it'll see 結婚する instead of 結婚+する. If 結婚する exists but is more expensive than 結婚+する, 結婚+する will be used instead. And 結婚+する will be used if 結婚する doesn't exist.

I disagree with AFDoJ's decision that "long" words are preferable. When 研究者 shows up, a frequency list should definitely identify 研究 and 者. 研究者 doesn't have a unique meaning; it's the sum of its parts. As a language learner, you want to learn units of meaning first, collocations second. The only exception is when those collocations are themselves a unit of meaning, like in 新聞, which should definitely be counted as 新聞 and not 新+聞. So in this case, I think my analyzer is making a reasonable segmentation out of 新体操、喫茶店。

Kuromoji-ipadic gives a way to basically coerce it into picking up on shorter segments, but kuromoji-unidic, which my analyzer uses because it has a kana accent version, doesn't expose that. If it did I would definitely make an option to prefer shorter segments and turn it on by default. I might want to fork kuromoji anyway, so we'll see how it goes.

I think your reasoning is flawless. In fact, in the beginning I went with the short version because it seemed the most obvious thing to do, but then I read the AFDoJ introduction and it conditioned me to choose the long version, and then the issues I described showed up.

Btw, even though I've already said it: thank you for your work on your analyzer, I think it's priceless!

What do you mean by "pitch dictionary"? I've read it in another post of yours, but I don't understand what you mean.
I mean, I know what pitch accent is, but not how an analyzer can use it to its own advantage. Could you give me an example?
#24
(2017-05-10, 2:44 pm)cophnia61 Wrote: I think your reasoning is flawless. In fact, in the beginning I went with the short version because it seemed the most obvious thing to do, but then I read the AFDoJ introduction and it conditioned me to choose the long version, and then the issues I described showed up.

Btw, even though I've already said it: thank you for your work on your analyzer, I think it's priceless!

What do you mean by "pitch dictionary"? I've read it in another post of yours, but I don't understand what you mean.
I mean, I know what pitch accent is, but not how an analyzer can use it to its own advantage. Could you give me an example?

Personally, I favor long words. While it's true that it's useful to learn the short prefixes and suffixes and the like, if you have studied RTK, then you have basically learned most of them already. I would disagree that a word like 研究者 does not have a unique meaning. If your goal is simply to be able to figure out a word from the sum of its parts the first time you encounter it, then that's fine. But for the purposes of actually using the language, you need to learn this word separately. How else could you actually use it? If you apply the reverse logic and try to "build" a word from parts, you may easily make up some nonsense like 研究人.

Plus, there are a lot of words where the combinations are not at all obvious. I was surprised to find that the Kuromoji text analyzer didn't pick up 自転車. Can you really figure that word out from the meanings of 自転 and 車?
#25
Well, what Zarxrax says is true too :/
I wonder if there is a way to adjust how Kuromoji works.
I would like to treat most words with the "long" logic, unless the extra part is "suru" or some prefix/suffix like "o, go, tachi, san, sama".
Is it possible? ._.