Word frequency list, what corpus is ideal?

(2017-05-10, 3:44 pm)Zarxrax Wrote: But for purposes of actually using the language, you need to learn this word separately. How else could you actually use it? If you try the reverse logic to "build" a word from parts then you may easily make up some nonsense like 研究人.
That's an issue with not knowing the language well enough in general, not an issue with not knowing some specific word.

研究者 is a perfect example of a simple compound. Registering every ~者 word independently, instead of learning that ~者 is the normal suffix for an X-er, actually hurts how accurately the constituent parts of such simple compounds get counted. Let's say that 研究 without the 者 and 研究者 as a whole are equally common. If you track them as independent entries, then 研究 will sit far lower on a frequency list than it should. A frequency list shouldn't count both 研究 and 研究者 when it reads the string "研究者である"; counting 研究 and 者 as separate tokens is what keeps 研究 from showing up further down the list than it should.
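
To make the counting argument concrete, here's a rough sketch in Python. The corpus is a made-up, hand-segmented toy (three sentences), and `count_tokens` and `merge_suffix` are hypothetical helper names I'm using for illustration, not part of any analyzer. It just compares the count 研究 gets when 者 is its own token against the count it gets when 研究者 is glued together as a single entry.

```python
from collections import Counter

# Made-up toy corpus, already segmented by hand. 者 is a separate suffix token.
corpus = [
    ["研究", "者", "で", "ある"],
    ["研究", "を", "する"],
    ["研究", "者", "に", "なる"],
]

def count_tokens(sentences):
    """Count every token across all sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts

def merge_suffix(tokens, suffix="者"):
    """Glue a suffix token onto the token before it (compound counted as one unit)."""
    merged = []
    for tok in tokens:
        if tok == suffix and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

# 者 counted separately: 研究 gets credit for all three sentences.
split_counts = count_tokens(corpus)

# 研究者 counted as its own entry: 研究 only gets credit for one sentence.
merged_counts = count_tokens(merge_suffix(s) for s in corpus)

print(split_counts["研究"], merged_counts["研究"], merged_counts["研究者"])  # 3 1 2
```

With the compound treated as its own entry, 研究 drops from 3 occurrences to 1 in this toy data, which is the "far lower than it should be" effect.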

It's important to know that 研究者 is the normal way of saying "a 研究er", since there are other ways of constructing it, but that's less important than making sure 研究 is accounted for properly in a frequency list. IMO, simple idiomatic/grammatical things like that are easier to pick up from reading or listening than from flashcards.

(2017-05-10, 3:44 pm)Zarxrax Wrote: Plus there are a lot of words where the combinations are not at all obvious. I was surprised to find that the kuromoji text analyzer didn't pick up 自転車. Can you really figure that out from the meanings of  自転 and 車?
自転車 is definitely a non-obvious compound, and should be treated as its own unit of meaning. This is a problem with unidic. For some reason, it has bogus weights for 自転 and 自転車. This isn't a problem with kuromoji preferring shorter segments in general, just that it likes 自転 in particular too much. In other words, it's a problem with corpus linguistics in general: you need to know roughly how common things are in the first place, and if you're wrong, you'll continue to make weird decisions about how common things are in the future.

The recommended kuromoji dictionary is ipadic, which handles 自転車 properly. It's possible that unidic's weights were generated in a way that kuromoji isn't ideal for. Kuromoji runs a Viterbi search over a lattice of candidate tokens to find the most likely token sequence based on the weights of those tokens, and it's possible that how it works internally is different enough from mecab that it causes problems in some situations like this. I don't know. If mecab also segments 自転車 wrong with unidic, though, then I would safely put the blame on unidic. I use unidic despite ipadic being recommended because ipadic doesn't have a pronunciation/kana accent version, which is necessary to disambiguate certain words.
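
For anyone who wants to see how the weights decide a segmentation, here's a minimal sketch of a lowest-cost (Viterbi-style) search in Python. It is heavily simplified: real analyzers like kuromoji and mecab also use connection costs between adjacent tokens, which I'm leaving out, and every cost below is invented for illustration rather than taken from ipadic or unidic. `segment` and the three toy lexicons are hypothetical names.

```python
def segment(text, word_costs):
    """Return the segmentation of text with the lowest total word cost."""
    INF = float("inf")
    best = [INF] * (len(text) + 1)  # best[i]: cheapest cost to cover text[:i]
    back = [0] * (len(text) + 1)    # back[i]: start index of the last token in that path
    best[0] = 0
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in word_costs and best[start] + word_costs[word] < best[end]:
                best[end] = best[start] + word_costs[word]
                back[end] = start
    if best[-1] == INF:
        raise ValueError("text cannot be segmented with this lexicon")
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = len(text)
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Invented costs (lower = more likely). Not real dictionary values.
with_entry = {"自": 8, "転": 8, "車": 4, "自転": 5, "自転車": 6}
without_entry = {"自": 8, "転": 8, "車": 4, "自転": 5}   # no 自転車 entry at all
skewed = {"自転": 1, "車": 1, "自転車": 6}               # 自転 weighted too cheaply

print(segment("自転車", with_entry))     # ['自転車']
print(segment("自転車", without_entry))  # ['自転', '車']
print(segment("自転車", skewed))         # ['自転', '車']
```

Either a missing entry or an overly cheap 自転 is enough to make the split win in this toy version, so the search itself can be working fine and still give the wrong answer.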

additional edit: Apparently unidic doesn't have an entry for 自転車 alone at all. Go figure.
Edited: 2017-05-11, 4:41 am
#27 uses ipadic? Because it splits 一人 as

一    名詞,数,*,*    一    イチ    イチ
人    名詞,接尾,助数詞,*    人    ニン    ニン

while your analyzer recognizes it as

1    一人    ヒトリ    ヒトリ    2    和    名詞    普通名詞    副詞可能    *

So I suppose there is no perfect dictionary; each one has its pros and cons :/


I've tried and Unidic is the only one that recognizes 一人 as ヒトリ :O
Edited: 2017-05-11, 6:56 am