
Word frequency list, what corpus is ideal?

#1
Ok, looking at the word frequency list generated from the "innocent novel corpus", I see that many infrequent words rank high in the list because they are used repeatedly within a single novel.
So I wonder: how big must a corpus be to dilute those words, so that a couple of novels won't misrepresent the entire corpus?
Just think of 才人: it's not a frequent word, but it's high in that list because it's the name of the Zero no Tsukaima main character.

Also, what different kinds of sources could I use to get a relatively agnostic frequency list?

I've just downloaded the Wikipedia dump, but obviously, as an encyclopedia, it uses a certain selection of words.
I'm downloading a bunch of news websites with WinHTTrack, and I plan to use those too.
I'm thinking of using generalist websites, like The Huffington Post, which covers all sorts of topics, and maybe others. Any suggestions?
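
One way to quantify the problem before deciding how much more text you need: track document frequency alongside raw frequency, and flag words whose occurrences come from only a handful of novels. A minimal sketch in Python, assuming you already have one token list per novel from whatever tokenizer you use (the 20% threshold is an arbitrary illustration):

Code:
from collections import Counter

def flag_concentrated(novels, top_n=100):
    """novels: a list of token lists, one per novel."""
    total = Counter()     # raw frequency over the whole corpus
    doc_freq = Counter()  # number of novels each word appears in
    for tokens in novels:
        total.update(tokens)
        doc_freq.update(set(tokens))
    for word, n in total.most_common(top_n):
        if doc_freq[word] / len(novels) < 0.2:  # in <20% of novels
            print(f"{word}: {n} occurrences, but only in "
                  f"{doc_freq[word]}/{len(novels)} novels")

A name like 才人 would show a high raw count but near-zero dispersion, while genuinely common words turn up in almost every novel.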
#2
(2017-04-15, 12:03 pm)cophnia61 Wrote: Also, what different kinds of sources could I use to get a relatively agnostic frequency list? […]

Here is what I use (the Routledge A Frequency Dictionary of Japanese):

https://www.amazon.com/Frequency-Diction...f+japanese
#3
(2017-04-15, 12:03 pm)cophnia61 Wrote: I wonder: how big must a corpus be to dilute those words, so that a couple of novels won't misrepresent the entire corpus? […]

I recently made a corpus from a huge collection of anime and drama subtitles. I used such a large collection of different sources that I don't think many infrequent words are influenced by just one or two shows. Several names do still get pushed up near the top, though.
A bigger problem, I feel, is that a lot of words can get misrepresented because of limitations in the software that generates the list. For example, a common name might not be recognized by the software, and some of the characters in the name happen to form an infrequent word, which the software now reports as a frequent word. In other cases, when it tries to de-conjugate a word to get the dictionary form, it can end up with the wrong word. Some word endings can also get mis-attributed.
In any case, I have found that any results have to be heavily vetted to ensure that the "words" being reported are 1) actually words, and 2) actually common.

As for how to make an agnostic list, I would say that's easier said than done. There are some things like the Balanced Corpus of Contemporary Written Japanese, but that only covers written Japanese. If you ignore spoken Japanese, you are basically ignoring half of the language (well, not really, but it's still going to miss things). Plus, I find that corpus is weighted heavily towards science and politics rather than "real life".
Edited: 2017-04-15, 4:50 pm
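
To make that vetting concrete, here is a minimal sketch of a frequency counter that drops proper nouns and symbols at tokenization time. It assumes the fugashi binding for MeCab with a UniDic dictionary (pip install fugashi unidic-lite); the POS field names are UniDic's, and the input file names are hypothetical:

Code:
from collections import Counter
from fugashi import Tagger

tagger = Tagger()  # picks up the installed UniDic dictionary
counts = Counter()

def count_words(text):
    for word in tagger(text):
        f = word.feature
        if f.pos1 == '名詞' and f.pos2 == '固有名詞':
            continue  # skip proper nouns (character names, places)
        if f.pos1 in ('補助記号', '記号'):
            continue  # skip punctuation and symbols
        counts[f.lemma or word.surface] += 1  # count dictionary forms

for path in ['show1.txt', 'show2.txt']:  # hypothetical subtitle files
    with open(path, encoding='utf-8') as fh:
        count_words(fh.read())

for lemma, n in counts.most_common(50):
    print(n, lemma)

This only filters names the dictionary already knows; the mis-segmentation and de-conjugation errors described above still have to be caught by eyeballing the output.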
#4
(2017-04-15, 4:49 pm)Zarxrax Wrote: There are some things like the Balanced Corpus of Contemporary Written Japanese, but that only covers written Japanese. […]

Thank you for the reply!
I'm interested in this Balanced Corpus of Contemporary Written Japanese; could you explain to me what it is and how to acquire it? Is it an actual collection of texts, or a frequency list obtained from those texts?
#5
(2017-04-15, 5:08 pm)cophnia61 Wrote: I'm interested in this Balanced Corpus of Contemporary Written Japanese; could you explain to me what it is and how to acquire it? […]

This page has some basic info: http://pj.ninjal.ac.jp/corpus_center/bccwj/en/
Different variations of the frequency list can be obtained here: http://pj.ninjal.ac.jp/corpus_center/bcc...-list.html

I believe the book that phil321 referenced was based on an early version of this corpus.
#6
WOW, THANKS!!!! It's good enough for me! You saved me a lot of time!
#7
This is basically a fundamental problem with corpus linguistics. The only way to get around it is to have shitloads of text, like tens or hundreds of gigabytes. Even the BCCWJ isn't perfect: it has a lot of words that were very frequent for a very short period of time but then fell back into being less common once whatever societal event they're attached to passed by.

Another method is to normalize each "series" in the corpus so that no single series has too much effect on the overall frequency list, but analyzers generally don't give you that kind of flexibility. I'm considering doing it for mine, but the way I designed the UI doesn't really give me anywhere to put that kind of functionality. You still need a massive corpus if you do this, say a couple of gigabytes, but it's more manageable.
Edited: 2017-04-19, 2:56 pm
#8
What do you mean by "normalize"?
#9
Normalizing would be to take a cross-section of text from different types of written material and put them in different buckets. When you create your master frequency list, you make sure that each bucket contributes equally, so that you don't get a list biased towards newspaper words or anime words.

That said, the Balanced Corpus of Contemporary Written Japanese that Zarxrax linked is "balanced" in exactly that way. It is composed of Yahoo! data, blogs, textbooks, magazines, genre fiction, etc., and then balanced/normalized. See this link for their methodology.
Edited: 2017-04-19, 4:35 pm
#10
Yeah, I was talking about balancing on a finer level and making a frequency list biased towards the "register" you want to get into. So you would make a frequency list based mostly on entertainment media, but you would normalize each contributor to that list so that a story that's really long or uses a single name all the time doesn't contribute too much to the frequency of that term.
#11
Ah, sorry, I speed-read your post and missed a few bits. So you are using "entertainment" as your universe and individual series as your buckets.

Do you have a simple system for doing this? It seems like it would require a lot of elbow grease to do by hand.
Edited: 2017-04-19, 5:08 pm
#12
I'm planning on supporting it in my text analyzer. Until then, you could use individual frequency lists per series, using a spreadsheet to normalize each series to 1M total occurrences or something. It wouldn't be too hard: just paste a formula into each spreadsheet and make things line up correctly.
#13
Making everything line up would be the hard part if the lists were long. It would require a bazillion VLOOKUPs. I could do it in Python and SQL in an evening, maaaybe.
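
For what it's worth, plain Python handles the lining-up without any VLOOKUPs, because a Counter does the join implicitly. A minimal sketch, assuming each per-series list is a two-column word,count CSV with no header (the file names are hypothetical):

Code:
import csv
from collections import Counter

TARGET = 1_000_000  # normalize every series to 1M total occurrences

def load_list(path):
    counts = Counter()
    with open(path, encoding='utf-8', newline='') as fh:
        for word, count in csv.reader(fh):
            counts[word] += int(count)
    return counts

merged = Counter()
for path in ['series_a.csv', 'series_b.csv', 'series_c.csv']:
    counts = load_list(path)
    scale = TARGET / sum(counts.values())
    for word, n in counts.items():
        merged[word] += n * scale  # each series contributes equally

for word, n in merged.most_common(20):
    print(f"{word}\t{n:.0f}")

Scaling each series to the same total is exactly the 1M-occurrences normalization described above; only the relative weights between series matter.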
#14
Another thing:

http://link.springer.com/article/10.1007...013-9261-0

Quote:In the course of the SUW analysis of the BCCWJ, about 1 % of the corpus is analyzed with special care. This subset is named BCCWJ-Core. As far as the texts in this subset are concerned, the results of the automatic SUW analysis are checked and corrected manually so that the average accuracy becomes higher than 99 % even when they are evaluated with the third criterion mentioned above. The high-accuracy POS data thus obtained was utilized as the learning data for the training of the MeCab + UniDic analysis system.

What do they mean by "training of the MeCab + UniDic analysis system"?
Can MeCab be trained? What do they mean?
#15
UniDic is a dictionary for MeCab. MeCab dictionaries are formatted a certain way, with every possible surface form of every possible morpheme present as a search key. MeCab dictionary entries also have weights for how likely they are to occur in a non-synchronous (i.e., you don't know where the segmentations are) Markov chain, which is something /like/ training, and is what they're referring to.

Kuromoji uses MeCab dictionaries but has some extra heuristics for longish-distance morphological analysis that it feeds to its Viterbi engine.
Edited: 2017-04-20, 7:35 am
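
Those dictionary weights are easy to watch in action on the classic ambiguous sentence below: MeCab builds a lattice of every dictionary entry that matches the input, and the Viterbi search picks the lowest-cost path through it. A minimal sketch, again assuming fugashi with a UniDic dictionary:

Code:
from fugashi import Tagger

tagger = Tagger()

# "Sumomo mo momo mo momo no uchi" -- plums and peaches are both
# kinds of peach. Nearly every substring matches some dictionary
# entry, so the path costs alone decide the segmentation.
for word in tagger('すもももももももものうち'):
    print(word.surface, word.feature.pos1)
# Expected: すもも / も / もも / も / もも / の / うち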
#16
(2017-04-19, 2:55 pm)wareya Wrote: Even the BCCWJ isn't perfect: it has a lot of words that were very frequent for a very short period of time but then fell back into being less common once whatever societal event they're attached to passed by.

I'm going through it now, and I see a lot of imbalance towards words about the economy. :/
#17
(2017-04-23, 9:01 am)cophnia61 Wrote: I'm going through it now, and I see a lot of imbalance towards words about the economy. :/

I'm just curious... why are you trying to make your own frequency list when there are published (e.g., Routledge) frequency lists available? Is it because you need >5,000 words?
#18
(2017-04-23, 9:01 am)phil321 Wrote: I'm just curious... why are you trying to make your own frequency list when there are published (e.g., Routledge) frequency lists available? Is it because you need >5,000 words?


I'm going to start a new deck where I will review words phonetically to help me with listening comprehension.
In truth, I've already tried it outside of Anki, and it helped me so much that I decided to take a more structured approach to reviewing.
Unlike when I started studying Japanese and mined words from light novels, I now already know the majority of those words. So it would be impractical to mine words again and add them to my new deck as I encounter them, when I could just go through an existing list of words.

Which is exactly what you're saying xD But yes, one of the problems is that I need a deck bigger than 5,000 words, and the other is that there isn't really a truly balanced list to go through. So my first thought was to make a customized list with Kuromoji, feeding it my light novels. That way I would review almost exactly the words I already know, because most of the words I learned came from those light novels.

But then I decided to study for the JLPT, and I started reading news articles, and there are in fact many words I don't know. So, considering that I'm about to start this new Anki deck, why not use a more balanced list instead of one based only on fiction? So I thought, why not make one myself (yeah, why??? LOL), but then Zarxrax suggested the BCCWJ, which is way better than any list I could ever make.

So I picked the BCCWJ list, which seems like one of the most "professional" out there, and in the end I will stick with it. But I won't lie, I think it's a little unbalanced towards news/economics words, but it's OK ._.

Maybe I'll use my light-novel-based list to unsuspend words inside the BCCWJ list and review them first, and then I'll go through the left-out words and study them.

PS: I needed a big list to feed to a utility I made, which picks a definition for every word so that I don't have to make each note by hand. This way I have a huge list of words + readings + Japanese definitions in frequency order (according to the BCCWJ), ready for use.

PS2: As I understand it, Routledge's A Frequency Dictionary of Japanese is based on the BCCWJ list.
Edited: 2017-04-23, 12:51 pm
#19
A Frequency Dictionary of Japanese uses the BCCWJ, as in the corpus itself, as one of its data sources, but not the "normal" word list generated from it.
Edited: 2017-04-23, 1:08 pm
#20
Yes, the Frequency Dictionary combines the BCCWJ with the Corpus of Spontaneous Japanese (CSJ) so it gets a blend of spoken and written sources.

Unfortunately I think the CSJ requires a fee for the raw data, though it has a free online lookup tool. (You can find it by browsing around on the same site that has the BCCWJ.)
Edited: 2017-04-24, 3:02 am