Are those words or MeCab mistakes?

#1
From the "Innocent Novel Analysis" word frequency list:

26567 智 0,00726494 75,81582950
22824 隊 0,00624139 76,87965952
21946 吉 0,00600130 77,13671578
21528 宗 0,00588699 77,27945098
21116 筈 0,00577433 77,43104812
20978 等 0,00573659 77,47711530
20148 里 0,00550962 77,75181120

Are those kanji true words from that "innocent novel corpus"? Or are they parts of proper names that MeCab has split off by mistake? Is there a fast way to check this, for example with regular expressions? Or, based on your intuition, what do you think? I'm strongly inclined to think they come from proper names, since I don't think words like "ri" can be that frequent, but just to be sure...
#2
As someone who has never used these tools, what do the columns mean?
#3
It may be because 里 does not mean just 'RI', but also 'sato', which is a common word that means 'village'.
#4
aldebrn Wrote:As someone who has never used these tools, what do the columns mean?
From the JapaneseTextAnalysisTool readme:

Quote:Format:
Field 1: Number of times word was encountered
Field 2: Word
Field 3: Percentage (Field 1 / Total number of words)
Field 4: Cumulative percentage

Report is sorted from most frequent word to least frequent word.
What puzzles me is that a word like 里 appears within the group of words that covers about 77% of all word occurrences... I searched the novel texts and 里 often appears in proper names, so that is probably the reason, but I'm not sure...
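
A minimal sketch of reading that four-field format in Python, assuming the fields are separated by whitespace and the decimals use a comma, as in the lines quoted in this thread. The names FreqEntry and parse_line are made up for illustration and are not part of the tool.

```python
from typing import NamedTuple


class FreqEntry(NamedTuple):
    occurrences: int   # Field 1: number of times the word was encountered
    word: str          # Field 2: the word itself
    percent: float     # Field 3: share of all words
    cumulative: float  # Field 4: cumulative percentage


def parse_line(line: str) -> FreqEntry:
    # split() with no argument handles both spaces and tabs.
    occurrences, word, percent, cumulative = line.split()
    return FreqEntry(
        int(occurrences),
        word,
        float(percent.replace(",", ".")),     # "0,00726494" -> 0.00726494
        float(cumulative.replace(",", ".")),
    )


sample = """26567 智 0,00726494 75,81582950
20148 里 0,00550962 77,75181120"""

entries = [parse_line(line) for line in sample.splitlines()]
for entry in entries:
    print(entry.word, entry.occurrences, entry.cumulative)
```

Once the lines are parsed like this, filtering by cumulative percentage or looking up a single word is a one-liner.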
#5
DrJones Wrote:It may be because 里 does not mean just 'RI', but also 'sato', which is a common word that means 'village'.
Maybe, I also thought about that, but Jisho doesn't tag it as a "common word". Obviously that doesn't mean much, especially considering we are talking about (mostly fantasy?) novels.

EDIT: I've found a regex to check this, I'll try it and see :)
Edited: 2014-11-13, 8:52 am
#6
They're all words but some are usually found in compounds.
#7
Sorry to be difficult. I'm having a hard time finding the innocent corpus, i can has link? Tanks!!!
#8
aldebrn Wrote:Sorry to be difficult. I'm having a hard time finding the innocent corpus, i can has link? Tanks!!!
I don't know if it's legal to talk about it, but I found it on this forum, so... :)

a totally innocent thread about 日本語 books

I found out about it in this thread:

Book Club

And here is the tool to do text analysis:

cb's Japanese Text Analysis Tool
#9
Extra-sorry to be dense, but what were you planning on doing with regexp? Search the corpus for those kanji and get a list of their usage to examine?
#10
cophnia61 Wrote:
DrJones Wrote:It may be because 里 does not mean just 'RI', but also 'sato', which is a common word that means 'village'.
Maybe, I also thought about that, but Jisho doesn't tag it as a "common word". Obviously that doesn't mean much, especially considering we are talking about (mostly fantasy?) novels.
I wouldn't trust Jisho's "common word" tag. Rikaisama gives the frequency as 1371 for さと, which is as high as one would expect for such a word. e.g. 都 at 1682.
#11
aldebrn Wrote:Extra-sorry to be dense, but what were you planning on doing with regexp? Search the corpus for those kanji and get a list of their usage to examine?
Yes, I tried this in Notepad++ with the regex [ぁ-ゔ]里[ぁ-ゔ], to see whether that kanji appears by itself and not in compounds, but it doesn't seem to work :/
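
One possible reason the pattern finds little: [ぁ-ゔ]里[ぁ-ゔ] demands hiragana on both sides, so it skips 里 at the start or end of a line and next to punctuation or katakana. A hedged alternative is to ask only that no kanji touches it, using lookarounds. The sketch below is Python rather than Notepad++, and its kanji class is a simplified assumption (CJK Unified Ideographs plus 々), not a complete Han range. The sample lines are made up.

```python
import re

# Simplified kanji class: U+4E00..U+9FFF plus the repetition mark 々 (U+3005).
# This is an assumption for illustration, not the full set of Han characters.
KANJI = "[\u4e00-\u9fff\u3005]"


def standalone_pattern(kanji: str) -> "re.Pattern":
    # Match `kanji` only when the characters directly before and after it
    # are not kanji (start/end of line also counts as "not kanji").
    return re.compile(f"(?<!{KANJI}){re.escape(kanji)}(?!{KANJI})")


pattern = standalone_pattern("里")
lines = ["里に帰る。", "千里の道", "山の里、川の音", "万里子さん"]
hits = [line for line in lines if pattern.search(line)]
print(hits)
```

Here only the first and third lines are kept, because in the other two 里 sits right after another kanji.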
#12
RawToast Wrote:
cophnia61 Wrote:
DrJones Wrote:It may be because 里 does not mean just 'RI', but also 'sato', which is a common word that means 'village'.
Maybe, I also thought about that, but Jisho doesn't tag it as a "common word". Obviously that doesn't mean much, especially considering we are talking about (mostly fantasy?) novels.
I wouldn't trust Jisho's "common word" tag. Rikaisama gives the frequency as 1371 for さと, which is as high as one would expect for such a word. e.g. 都 at 1682.
Thank you Raw, you are right! But for now I'm going to ignore those words; they are too specific, so I'll study them later when I encounter them in reading.

I'm doing Core10000, and since it uses an old list built from newspapers, I thought it would be better to study from a list based on novels, so I'm using that particular list as a guide. For the most part the two lists (Core and the novel list) are similar, but many of the words Core puts first are not so common in novels, and some are not even there (things like financial and political words). So instead of studying all the words Core gives me, I'm going to prioritize the words that are actually frequent in novels. But there's no reason to be too strict about it, especially with words like "sorcerer" and the like. If they are used that much in novels, I'll study them when I see them for the first time; for now there are so many generic words I don't know that I think it's better to be selective.
Edited: 2014-11-13, 9:49 am
#13
(Note, edited since of course I found bugs in my tiny script.)

Not sure if this will benefit @cophnia61 but I took Segment 1 of the innocent corpus and searched for sentences containing those seven characters used as strictly non-compound stand-alone kanji: https://gist.github.com/fasiha/745c4c7101434480365a

I think this is what you were trying to get with Notepad++. I made heavy use of this regexp group which is what the XRegExp library uses to match CJK kanji, and excludes kana, punctuation, Latin, etc.: [⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〻㐀-䶵一-鿌豈-舘並-龎].

You can do the exact same thing except with "compounds": make a list of sentences where these seven kanji are used with other kanji next to them (before and/or after). That list is much, much longer than this one, and you'll want to feed that into MeCab to see how often it splits up "compounds" containing these kanji into stand-alone kanji and if it does so correctly. Let me know and I can send you a link to the compounds compilation.
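
For anyone who wants to reproduce the two lists, here is a rough Python sketch of the idea (it is not the script behind the gist). It assumes a naive sentence split on 。！？ and a simplified kanji class in place of the fuller XRegExp one; the sample text is made up.

```python
import re
from collections import Counter

KANJI = "[\u4e00-\u9fff\u3005]"  # simplified stand-in for the fuller kanji class
TARGETS = "智隊吉宗筈等里"


def split_sentences(text):
    # Naive split right after 。！？; quoted dialogue would need more care.
    return [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]


def classify(text):
    """Count sentences where each target kanji is stand-alone or in a compound."""
    standalone, compound = Counter(), Counter()
    for sentence in split_sentences(text):
        for k in TARGETS:
            # Compound: another kanji directly before and/or after the target.
            if re.search(f"{KANJI}{k}|{k}{KANJI}", sentence):
                compound[k] += 1
            # Stand-alone: no kanji on either side of the target.
            if re.search(f"(?<!{KANJI}){k}(?!{KANJI})", sentence):
                standalone[k] += 1
    return standalone, compound


text = "部隊は里に着いた。隊が動く筈だ。千里の道も一歩から。"
standalone, compound = classify(text)
print(dict(standalone), dict(compound))
```

A sentence can land in both counters for the same kanji if it contains it once alone and once in a compound, which seems like the right behaviour for building the two sentence lists.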
Edited: 2014-11-13, 1:25 pm
#14
cophnia61 Wrote:From the "Innocent Novel Analysis" word frequency list:

26567 智 0,00726494 75,81582950
22824 隊 0,00624139 76,87965952
21946 吉 0,00600130 77,13671578
21528 宗 0,00588699 77,27945098
21116 筈 0,00577433 77,43104812
20978 等 0,00573659 77,47711530
20148 里 0,00550962 77,75181120

Are those kanji true words from that "innocent novel corpus"? Or are they parts of proper names that MeCab has split off by mistake? Is there a fast way to check this, for example with regular expressions? Or, based on your intuition, what do you think? I'm strongly inclined to think they come from proper names, since I don't think words like "ri" can be that frequent, but just to be sure...
Out of these, only 筈, 等, and 里 are words by themselves (in the modern language). 隊 is used mostly in military contexts, so if the corpus has sci-fi or other military-related books it may show up a lot in those books. 吉, 智, and 里 are probably on the list because of their use in proper names.
#15
yudantaiteki Wrote:Out of these, only 筈, 等, and 里 are words by themselves (in the modern language). 隊 is used mostly in military contexts, so if the corpus has sci-fi or other military-related books it may show up a lot in those books. 吉, 智, and 里 are probably on the list because of their use in proper names.
Can you take a look at this list of sentences (from Segment 01 of the innocent corpus, 512 files) where MeCab splits these kanji up into their own words? https://gist.github.com/fasiha/a6907a3319a062d2a9c4

I of course believe what you say about 筈等里 being the only non-compound words in today's language, but I'm having a hard time understanding what this might mean. (Likely it doesn't mean anything at all.) 隊 is found in ~11K sentences, then 等 at 5K and 里 at 2K. The rest less often (though 宗 is also found in ~2K sentences).

But that's overall. Roughly a fifth to a quarter of the time, MeCab will parse these sentences and say those kanji ought to stand by themselves: 隊, 2750/10872 sentences; 等, 1037/5038 sentences; and 里, 479/2142 sentences. It seems to do better with 宗, 吉, 智, and 筈: for each of these, the number of sentences for which it declares kanji to be stand-alone words is <100 (roughly 1 out of 10).
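
For reference, the shares behind those three fractions can be worked out in a few lines of Python. The counts are the ones given in this post; the variable names are only illustrative.

```python
# (sentences where MeCab made the kanji its own word, all sentences containing it)
counts = {"隊": (2750, 10872), "等": (1037, 5038), "里": (479, 2142)}

shares = {kanji: alone / total for kanji, (alone, total) in counts.items()}
for kanji, share in shares.items():
    print(f"{kanji}: {share:.1%}")
# 隊: 25.3%
# 等: 20.6%
# 里: 22.4%
```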

I guess my concrete questions here are: are the ~3000 sentences with 隊 as its own word *wrong*? Are most of the times that it makes 里 and 等 their own words *correct*? There are other parsers and other dictionaries/training data than MeCab & IPADIC, e.g., Jdepp, but I want to get a better sense of how often MeCab gets what kind of text right/wrong first. Thanks for the interesting discussions everyone.
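
One way to spot-check individual sentences is to look at the part of speech MeCab assigns when it makes a kanji its own token. MeCab's default output puts one token per line as the surface form, a tab, then comma-separated features, and closes each sentence with EOS; with IPADIC the first two features are the part of speech and its subtype (名詞,固有名詞 marks a proper noun). The sketch below only parses such output as text. The sample string is hand-written for illustration, not captured from a real run, and parse_mecab is a hypothetical helper name.

```python
def parse_mecab(output):
    """Turn MeCab's default text output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and sentence terminators
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens


# Hand-written sample in the IPADIC feature layout (an assumption, not real output).
sample = (
    "里\t名詞,一般,*,*,*,*,里,サト,サト\n"
    "に\t助詞,格助詞,一般,*,*,*,に,ニ,ニ\n"
    "帰る\t動詞,自立,*,*,五段・ラ行,基本形,帰る,カエル,カエル\n"
    "EOS\n"
)

tokens = parse_mecab(sample)
# Keep only tokens whose surface is exactly the kanji of interest.
alone = [(surface, feats[0], feats[1]) for surface, feats in tokens if surface == "里"]
print(alone)
```

Tallying the second feature over all stand-alone hits would show how often a kanji is tagged as a proper noun versus a general noun or suffix.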
#16
All of the examples of 智 I checked could be replaced by 知, and they're all compounds.
隊 is mostly in compounds, but also alone at least once and as a counter a few times.
Lots of -等 as in -ら. Also some 等官, whatever that is. Didn't see a single など.
里 and 露里 are used as units of distance (all in the same book?). Never seen either of them before. There was also 里娘, a name.
Edited: 2014-11-14, 4:50 am
#17
aldebrn Wrote:
yudantaiteki Wrote:Out of these, only 筈, 等, and 里 are words by themselves (in the modern language). 隊 is used mostly in military contexts, so if the corpus has sci-fi or other military-related books it may show up a lot in those books. 吉, 智, and 里 are probably on the list because of their use in proper names.
Can you take a look at this list of sentences (from Segment 01 of the innocent corpus, 512 files) where MeCab splits these kanji up into their own words? https://gist.github.com/fasiha/a6907a3319a062d2a9c4

I of course believe what you say about 筈等里 being the only non-compound words in today's language, but I'm having a hard time understanding what this might mean. (Likely it doesn't mean anything at all.) 隊 is found in ~11K sentences, then 等 at 5K and 里 at 2K. The rest less often (though 宗 is also found in ~2K sentences).

But that's overall. Roughly a fifth to a quarter of the time, MeCab will parse these sentences and say those kanji ought to stand by themselves: 隊, 2750/10872 sentences; 等, 1037/5038 sentences; and 里, 479/2142 sentences. It seems to do better with 宗, 吉, 智, and 筈: for each of these, the number of sentences for which it declares kanji to be stand-alone words is <100 (roughly 1 out of 10).

I guess my concrete questions here are: are the ~3000 sentences with 隊 as its own word *wrong*? Are most of the times that it makes 里 and 等 their own words *correct*? There are other parsers and other dictionaries/training data than MeCab & IPADIC, e.g., Jdepp, but I want to get a better sense of how often MeCab gets what kind of text right/wrong first. Thanks for the interesting discussions everyone.
Most of what you linked to are compound words; if the program is telling you those are standalone words it's wrong. I was, however, incorrect about 隊 - it looks like it can be used by itself to just mean an army or set of troops.

Vempele: It's something like 三等官; I think the 等 there means "level" or "grade"?
#18
Vempele Wrote:All of the examples of 智 I checked could be replaced by 知, and they're all compounds.
隊 is mostly in compounds, but also alone at least once and as a counter a few times.
Lots of -等 as in -ら. ...
里 and 露里 are used as units of distance (all in the same book?). Never seen either of them before. There was also 里娘, a name.
Thanks!!!
yudantaiteki Wrote:Most of what you linked to are compound words; if the program is telling you those are standalone words it's wrong.
Thanks! Yes indeed, these are sentences where each of these seven kanji is parsed as stand-alone (there should be spaces in the text, though they're hard to see). Thanks for clarifying that. I will look into other dictionaries and parsers to see if there's something better than the MeCab/IPADIC combination I used.
Edited: 2014-11-14, 8:36 am
#19
Thank you all, particularly aldebrn, for your help!

Here is the list I'm using. I've covered it up to 80%, and for the most part those are common words also found in Core 10000, but with a different priority (words that Core puts among the most frequent don't appear until late in the novel frequency list, and vice versa). So in the end I just suspend words in Core 10000, but follow the novel frequency list order.
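
The reordering can be sketched in Python roughly as follows, assuming both lists are plain lists of words and the novel list is already sorted from most to least frequent. The function name and the tiny sample lists are made up for illustration.

```python
def prioritize(core_words, novel_ranked):
    """Reorder Core words by novel frequency and report what each list lacks."""
    # Position in the novel list = priority (0 is the most frequent word).
    rank = {word: i for i, word in enumerate(novel_ranked)}
    core_set = set(core_words)

    in_both = sorted((w for w in core_words if w in rank), key=rank.__getitem__)
    core_only = [w for w in core_words if w not in rank]       # e.g. newspaper vocabulary
    novel_only = [w for w in novel_ranked if w not in core_set]  # candidates to add
    return in_both, core_only, novel_only


core_words = ["経済", "むしろ", "言う", "政府"]
novel_ranked = ["言う", "けれど", "むしろ", "斬る"]

in_both, core_only, novel_only = prioritize(core_words, novel_ranked)
print(in_both)     # study these first, in novel order
print(core_only)   # postpone these
print(novel_only)  # not in Core, consider adding
```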

Most of the words not found in Core are:

1) hiragana words, a bunch of them, like conjunctions etc. Maybe they are used in literary style but not in newspapers?

2) words like 警部, 斬る, 江戸.

As for 2), those are very few: I went from 0% to 79% of the novel frequency list and ended up adding only 15 new kanji compounds, while the others were already in Core 10000.

As for 1), from 0% to 78% I found the following words that are not present in Core 10000 (I didn't check them all, so I'm not sure whether they are all true words):

90317 けれど 0,02469785 66,09529207
82298 こそ 0,02250500 66,82698427
81543 わし 0,02229854 66,84928281
77866 なり 0,02129304 67,24128128
76125 とも 0,02081695 67,45242376
75459 まい 0,02063482 67,55586094
72033 いける 0,01969796 68,01986540
64289 しかも 0,01758030 68,87128300
59459 がる 0,01625950 69,48067191
48619 あるいは 0,01329523 71,02959788
48461 いえる 0,01325202 71,06942066
44041 つて 0,01204334 71,69093987
42981 それも 0,01175347 71,88111474
36474 とともに 0,00997409 73,27073372
36204 において 0,00990025 73,32040644
35353 かく 0,00966754 73,53517688
34986 こうして 0,00956718 73,63136662
34727 つぶやく 0,00949636 73,68847328
34513 むしろ 0,00943784 73,72629244
34502 さすが 0,00943483 73,73572727
33885 ごと 0,00926610 73,95067163
31633 それとも 0,00865028 74,46094399
30027 それだけ 0,00821111 74,79055397
27865 なお 0,00761989 75,41494804
27200 それら 0,00743804 75,60305666
24804 すら 0,00678284 76,33250760
24606 そりゃ 0,00672869 76,41353714
24461 それにしても 0,00668904 76,44034198
23950 ほら 0,00654931 76,59252076
23133 ずつ 0,00632589 76,79804616
22671 こうした 0,00619955 76,92313269
22637 そこから 0,00619026 76,93552140
22549 ものか 0,00616619 76,96640896
22432 そうした 0,00613420 77,00328278
22030 やら 0,00602427 77,10665515
21749 それより 0,00594743 77,21439956
20486 それでは 0,00560205 77,61857291
19861 ほんの 0,00543114 77,86105908

So from this I can say that, at least up to 80% coverage, the novel frequency list shares a great many words with Core 10000. But in the end that is only 1800 words. Fewer if you consider that things like 僕、ぼく、ボク are treated as three different words, and that things like adverbial forms of adjectives, or some conjugations, are treated as words per se.
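
A rough Python sketch of folding such spelling variants together before counting. Katakana can be mapped to hiragana by a fixed code point offset, but a kanji spelling like 僕 needs a reading lookup, so READINGS here is a hypothetical hand-made mapping, and the counts are invented for illustration rather than taken from the list.

```python
from collections import Counter

# Hypothetical hand-made mapping from kanji spellings to readings.
READINGS = {"僕": "ぼく"}


def kata_to_hira(text):
    # Katakana ァ..ヶ (U+30A1..U+30F6) sit exactly 0x60 above hiragana ぁ..ゖ.
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in text
    )


def merge_variants(freq):
    merged = Counter()
    for word, count in freq.items():
        key = kata_to_hira(READINGS.get(word, word))
        merged[key] += count
    return merged


freq = {"僕": 100, "ぼく": 40, "ボク": 10, "里": 5}  # invented counts
merged = merge_variants(freq)
print(dict(merged))
```

Note that this also folds every katakana loanword into hiragana, which is harmless for counting but could occasionally merge words that are really distinct.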

I wonder what happens in the next 6000 words or so, from 80% to 90%. How many of those words are also in Core 10000?

But one thing is sure: if your main goal is to read novels, it's better, at least for the first 2000 - 3000 words or so, not to follow the Core 10000 order. Otherwise, by the time you reach 2000 words in Core 10000, you'll find yourself with many words that were maybe common in 90's newspapers but are not so common in novels.

Another thing I wonder is at what coverage words begin to be so specific that it's better to set computer-made lists aside and just use the books themselves as your list, i.e. adding new words as you encounter them in reading.

I think from 0% to 90%, if you study them from a list, you can be sure your time is well invested. But maybe from 90% up to 95% and beyond, although those are still common words you'll need to know sooner or later, it's better to study them as you encounter them, so that you learn first the words you actually need to continue reading what you're reading. Take words like "私立" as in "private investigator": it is listed at 93% in the novel frequency list, but it's useless to study it in advance from a list if you're going to read only fantasy novels for the next six months. Or vice versa: it's very useful if you're going to read "death note another note", but in that case it's pointless to study words like "使い魔", which is very common in fantasy novels but rarely used in thriller novels.

So what do you think? At what percentage of coverage do words become so specific that studying them from a list is no longer justified? Up to 80%, I've seen that they are all very useful, generic words you'll surely encounter whatever you read.
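
To get a feel for how quickly the list grows per extra point of coverage, one can count how many of the most frequent words are needed to reach a given share of all occurrences. A minimal Python sketch, using invented counts rather than the real list:

```python
from itertools import accumulate


def words_needed(counts, target):
    """Number of top-ranked words needed to cover `target` (0..1) of all occurrences."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    for rank, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return rank
    return len(ordered)


counts = [50, 21, 11, 9, 5, 2, 1, 1]  # invented occurrence counts, 100 in total
for target in (0.80, 0.90, 0.95):
    print(target, words_needed(counts, target))
```

Run on the real list, the jump in this number between 80%, 90% and 95% would show where each extra point of coverage starts to cost thousands of words.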
Edited: 2014-11-14, 9:47 am
#20
Some of those are in Core6k, like つぶやく, むしろ, さすが, なお. Some are multiple words (or words plus particles). Most are grammar points.
Edited: 2014-11-14, 10:04 am