
unnamed japanese text analyzer

#51
(2017-05-11, 10:47 pm)phil321 Wrote: Here's how they translated "the Welcome Wagon Lady":  歓迎ワゴンのレヂイは.  The kanji  歓迎 are glossed with katakana furigana as follows:  ウェルカム.  So even though 歓迎 which means "welcome" is pronounced "kangei" it is in this case glossed as "werukamu".  Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
That doesn't leave Japanese readers much worse off than many English readers -- I had to go look the phrase up on Wikipedia. Using katakana is at least a clue that it's a proper noun...
Reply
#52
(2017-05-12, 4:10 pm)pm215 Wrote:
(2017-05-11, 10:47 pm)phil321 Wrote: Here's how they translated "the Welcome Wagon Lady":  歓迎ワゴンのレヂイは.  The kanji  歓迎 are glossed with katakana furigana as follows:  ウェルカム.  So even though 歓迎 which means "welcome" is pronounced "kangei" it is in this case glossed as "werukamu".  Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
That doesn't leave Japanese readers much worse off than many English readers -- I had to go look the phrase up on Wikipedia. Using katakana is at least a clue that it's a proper noun...

"Welcome wagon" is a term that native English speakers (at least in North America) would know without checking in a dictionary.  Obviously Ira Levin, the author of the book, wouldn't have used the term "The Welcome Wagon Lady" in the first sentence of his bestselling book if the typical reader didn't know what it meant off the top of their head.
Edited: 2017-05-12, 4:20 pm
Reply
#53
My point was that I'm a native English speaker (UK, in this case, but the same would apply to any non-US/Canadian readers) and it wasn't something I knew. Wikipedia says they stopped in 1998 so some younger US readers might also miss the reference. (References do go out of date with time and distance -- Austen expected her readers to know what a Rumford stove was, for instance.)
Edited: 2017-05-12, 4:32 pm
Reply
#54
For what it's worth, I don't know the term "welcome wagon" either; it just sounds like it might be a term.
Reply
#55
Analyzer now supports user dictionaries. Entries must be in a pretty specific format; the example dictionary uses 自転車. In the future, the names list will be provided as a user dictionary instead of being baked in.

The normalizer removes outliers instead of trying to adjust the skew of the distribution. Frequency lists follow a Pareto distribution, which there's no generic way to linearize, so removing outliers is the only approach that doesn't distort the final distribution of the data (not even a geometric mean would be correct). Removing outliers is related to taking the median; with only three or four input files, the two give the same result.
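
To illustrate the idea, here's a minimal Python sketch of outlier-trimmed aggregation; it is not the analyzer's actual code, and all names in it are made up:

Code:
# Combine per-file frequencies for each word by dropping the highest
# and lowest sample before averaging. With three or four input files
# this collapses to the median, as noted above.
def aggregate(freq_lists):
    words = set().union(*freq_lists)
    combined = {}
    for word in words:
        samples = sorted(f.get(word, 0) for f in freq_lists)
        trimmed = samples[1:-1] if len(samples) > 2 else samples
        combined[word] = sum(trimmed) / len(trimmed)
    return combined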

https://github.com/wareya/analyzer/releases/tag/alpha1
Edited: 2017-05-17, 11:45 am
Reply
#56
Some important (generic) terms from A Frequency Dictionary of Japanese that the analyzer wasn't picking up on, in user dictionary format:

Code:
自転車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジテンシャ,自転車,自転車,ジテンシャ,自転車,ジテンシャ,漢,*,*,*,*,ジテンシャ,ジテンシャ,ジテンシャ,ジテンシャ,*,*,2,*,*
飛行機,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ヒコウキ,飛行機,飛行機,ヒコーキ,飛行機,ヒコーキ,漢,*,*,*,*,ヒコウキ,ヒコウキ,ヒコウキ,ヒコウキ,*,*,2,*,*
幼稚園,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ヨウチエン,幼稚園,幼稚園,ヨーチエン,幼稚園,ヨーチエン,漢,*,*,*,*,ヨウチエン,ヨウチエン,ヨウチエン,ヨウチエン,*,*,3,*,*
自動車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジドウシャ,自動車,自動車,ジドーシャ,自動車,ジドーシャ,漢,*,*,*,*,ジドウシャ,ジドウシャ,ジドウシャ,ジドウシャ,*,*,"2,0",*,*
図書館,5142,5142,8000,名詞,普通名詞,一般,*,*,*,トショカン,図書館,図書館,トショカン,図書館,トショカン,漢,*,*,*,*,トショカン,トショカン,トショカン,トショカン,*,*,2,*,*
冷蔵庫,5142,5142,8000,名詞,普通名詞,一般,*,*,*,レイゾウコ,冷蔵庫,冷蔵庫,レーゾーコ,冷蔵庫,レーゾーコ,漢,*,*,*,*,レイゾウコ,レイゾウコ,レイゾウコ,レイゾウコ,*,*,3,*,*
違和感,5142,5142,8000,名詞,普通名詞,一般,*,*,*,イワカン,違和感,違和感,イワカン,違和感,イワカン,漢,*,*,*,*,イワカン,イワカン,イワカン,イワカン,*,*,2,*,*
不動産,5142,5142,8000,名詞,普通名詞,一般,*,*,*,フドウサン,不動産,不動産,フドーサン,不動産,フドーサン,漢,*,*,*,*,フドウサン,フドウサン,フドウサン,フドウサン,*,*,"2,0",*,*
法案,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ホウアン,法案,法案,ホーアン,法案,ホーアン,漢,*,*,*,*,ホウアン,ホウアン,ホウアン,ホウアン,*,*,0,*,*
無人島,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ムジントウ,無人島,無人島,ムジントー,無人島,ムジントー,漢,*,*,*,*,ムジントウ,ムジントウ,ムジントウ,ムジントウ,*,*,0,*,*
価値観,5142,5142,8000,名詞,普通名詞,一般,*,*,*,カチカン,価値観,価値観,カチカン,価値観,カチカン,漢,*,*,*,*,カチカン,カチカン,カチカン,カチカン,*,*,"3,2",*,*
市内,5142,5142,8000,名詞,普通名詞,一般,*,*,*,シナイ,市内,市内,シナイ,市内,シナイ,漢,*,*,*,*,シナイ,シナイ,シナイ,シナイ,*,*,1,*,*
地下鉄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,チカテツ,地下鉄,地下鉄,チカテツ,地下鉄,チカテツ,漢,*,*,*,*,チカテツ,チカテツ,チカテツ,チカテツ,*,*,0,*,*
看護婦,5142,5142,8000,名詞,普通名詞,一般,*,*,*,カンゴフ,看護婦,看護婦,カンゴフ,看護婦,カンゴフ,漢,*,*,*,*,カンゴフ,カンゴフ,カンゴフ,カンゴフ,*,*,3,*,*
好奇心,5142,5142,8000,名詞,普通名詞,一般,*,*,*,コウキシン,好奇心,好奇心,コーキシン,好奇心,コーキシン,漢,*,*,*,*,コウキシン,コウキシン,コウキシン,コウキシン,*,*,3,*,*
いいかげん,5159,5159,8000,名詞,普通名詞,形状詞可能,*,*,*,イイカゲン,いい加減,いいかげん,イイカゲン,いいかげん,イイカゲン,混,*,*,*,*,イイカゲン,イイカゲン,イイカゲン,イイカゲン,*,*,0,*,*
いい加減,5159,5159,8000,名詞,普通名詞,形状詞可能,*,*,*,イイカゲン,いい加減,いい加減,イイカゲン,いい加減,イイカゲン,混,*,*,*,*,イイカゲン,イイカゲン,イイカゲン,イイカゲン,*,*,0,*,*
見た目,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,見た目,ミタメ,見た目,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
みため,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,みため,ミタメ,みため,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
みた目,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,みた目,ミタメ,みた目,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
委員会,5142,5142,8000,名詞,普通名詞,一般,*,*,*,イインカイ,委員会,委員会,イインカイ,委員会,イインカイ,漢,*,*,*,*,イインカイ,イインカイ,イインカイ,イインカイ,*,*,"2,0",*,*
遊園地,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ユウエンチ,遊園地,遊園地,ユーエンチ,遊園地,ユーエンチ,漢,*,*,*,*,ユウエンチ,ユウエンチ,ユウエンチ,ユウエンチ,*,*,3,*,*
新幹線,5142,5142,8000,名詞,普通名詞,一般,*,*,*,シンカンセン,新幹線,新幹線,シンカンセン,新幹線,シンカンセン,漢,*,*,*,*,シンカンセン,シンカンセン,シンカンセン,シンカンセン,*,*,3,*,*

Pitch accent data copied from rikai.

There's more it didn't pick up on, but most of those were spelling differences, and a lot of them are questionable. The only objective ones I left out so far were a couple of prefecture names, abbreviations, and phrases that are more societally oriented.

This will be the default user dictionary in the next release, whenever that happens. The next major feature is the ability to treat some entries as if they were other, different entries, to automatically account for spelling variations. This, too, will use an explicit mapping loaded from a file; a sketch of the idea is below.
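
A rough Python sketch of what that folding could look like (hypothetical; neither the mapping file format nor these function names are final):

Code:
# Fold counts for variant spellings into a canonical entry, using an
# explicit variant -> canonical mapping loaded from a CSV file.
import csv

def load_mapping(path):  # rows like: 御飯,ご飯
    with open(path, encoding='utf-8') as f:
        return {variant: canonical for variant, canonical in csv.reader(f)}

def fold_variants(freqs, mapping):
    folded = {}
    for word, count in freqs.items():
        canonical = mapping.get(word, word)
        folded[canonical] = folded.get(canonical, 0) + count
    return folded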

Here's how this user dictionary affects Katahane (except the effect of 自転車): https://pastebin.com/raw/w89rcF81
Edited: 2017-05-18, 6:01 pm
Reply
#57
Filters are now loaded from a file (userfilters.csv). These are basically blacklist filters; I haven't implemented rewrite filters yet. The new included filters are more reasonable for unidic than the old ones, which blocked a bunch of non-grammatical verbs because they were meant for output from ipadic.

Also, the command line interface can now use the user dictionary (enabled by default), and the command line flags were changed because of the removal of the built-in particle/number/name filters.
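
The filtering itself amounts to something like the following sketch (the one-entry-per-line format here is an assumption, not necessarily what userfilters.csv actually looks like):

Code:
# Drop tokens whose surface form is listed in the filter file.
def load_filters(path='userfilters.csv'):
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def apply_filters(tokens, blacklist):
    return [t for t in tokens if t not in blacklist]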

https://github.com/wareya/analyzer/releases/tag/alpha2

[Image: 1aJMsgQ.png]
Edited: 2017-05-28, 6:18 am
Reply
#58
Alpha3 lets you analyze frequency by word or lexeme instead of just the given spelling.

https://github.com/wareya/analyzer/releases/tag/alpha3

Example:

Code:
20096    和    動詞    非自立可能    *    サ行変格    為る    スル    する    スル    スル    0    20092    する    スル    スル    *    3    仕る    スル    スル    0    1
12880    和    動詞    非自立可能    *    上一段-ア行    居る    イル    いる    イル    イル    0    12823    居る    イル    イル    0    57
...
9843    和    動詞    一般    *    五段-ワア行    言う    イウ    言う    イウ    イウ    0    4292    いう    ユウ    ユー    *    4029    いう    イウ    イウ    0    792    言う    ユウ    ユー    0    714    ゆう    ユウ    ユー    *    14    謂う    イウ    イウ    0    1    ゆう    ユウ    ユー    0    1
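
Roughly, the by-lexeme mode aggregates as in the sketch below (not the actual implementation): each output line above is a lemma's total count followed by per-spelling breakdowns.

Code:
# Sum counts over all surface spellings sharing a lemma, keeping the
# per-spelling breakdown for display.
from collections import defaultdict

def count_by_lexeme(tokens):  # tokens: iterable of (lemma, spelling) pairs
    counts = defaultdict(lambda: defaultdict(int))
    for lemma, spelling in tokens:
        counts[lemma][spelling] += 1
    return counts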

[Image: GlbLGea.png]
Edited: 2017-07-22, 8:23 am
Reply
#59
@wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" option not only allowed me to sort by order of appearance, but I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for each word.

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very functional both to study individual words (with the sentence as reference) and then to read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was to split multi-sentence paragraphs into individual "sentences", using "。?、!" as markers of sorts. This left the majority of lines under 50 characters.
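
The split itself only takes a few lines; here's a Python sketch of that step (not the exact script used):

Code:
# Break multi-sentence paragraphs after 。 ? 、 ! so that most of the
# resulting lines come out under 50 characters.
import re

def split_paragraph(paragraph):
    parts = re.split(r'(?<=[。?、!])', paragraph)  # Python 3.7+
    return [p for p in parts if p.strip()]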

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: if you could somehow get the program to do the above in an automated fashion, that would be great.
Reply
#60
(2017-07-22, 7:51 pm)Nukemarine Wrote: @wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" option not only allowed me to sort by order of appearance, but I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for each word.

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very functional both to study individual words (with the sentence as reference) and then to read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was to split multi-sentence paragraphs into individual "sentences", using "。?、!" as markers of sorts. This left the majority of lines under 50 characters.

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: if you could somehow get the program to do the above in an automated fashion, that would be great.
I think you did an interesting exercise before, comparing HP vocab with Core2k. At that time your list had 7000 entries, and now it has only 5400. That is a huge difference. Is that all because of the improved word detection/analysis in the analyzer?
Reply
#61
(2017-07-23, 1:46 pm)Matthias Wrote:
(2017-07-22, 7:51 pm)Nukemarine Wrote: @wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" option not only allowed me to sort by order of appearance, but I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for each word.

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very functional both to study individual words (with the sentence as reference) and then to read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was to split multi-sentence paragraphs into individual "sentences", using "。?、!" as markers of sorts. This left the majority of lines under 50 characters.

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: if you could somehow get the program to do the above in an automated fashion, that would be great.
I think you did an interesting exercise before, comparing HP vocab with Core2k. At that time your list had 7000 entries, and now it has only 5400. That is a huge difference. Is that all because of the improved word detection/analysis in the analyzer?
I filter out entries that appear in scans of Core 2k and Tae Kim sentences. The unfiltered list is still 7000 or more, iirc. Also, of the 5400 there are still a lot of terms that are useless for anything outside the book series.

That said, I'm tempted to officially study either this or Norwegian Wood in a way similar to when I studied the transcripts for episodes 1-4 of Zettai Kareshi. In my mind, the first 4000 lines of text (out of 15,000 total) seems reasonable. Like the transcripts, it's not really about memorizing words but about understanding the sentences the words appear in. Basically: study ~500 vocabulary/lines, then read the text up to that point out loud at the end. It takes time, but the end result is amazing.
Reply
#62
(2017-07-23, 8:24 pm)Nukemarine Wrote: I filter out entries that appear in scans of Core 2k and Tae Kim sentences. The unfiltered list is still 7000 or more, iirc. Also, of the 5400 there are still a lot of terms that are useless for anything outside the book series.

That said, I'm tempted to officially study either this or Norwegian Wood in a way similar to when I studied the transcripts for episodes 1-4 of Zettai Kareshi. In my mind, the first 4000 lines of text (out of 15,000 total) seems reasonable. Like the transcripts, it's not really about memorizing words but about understanding the sentences the words appear in. Basically: study ~500 vocabulary/lines, then read the text up to that point out loud at the end. It takes time, but the end result is amazing.
OK, that means no significant difference from the first analysis. So the detection/recognition quality is probably not much changed.

If you filter out the terms which do not have an English definition (e.g. names), you have 4,900 terms [of which the given translation quite often seems mispicked]. Not surprisingly, there are more terms in the first chapter than in the last chapter, and the average frequency goes down too:

Chapter   Terms   Avg. frequency
1         618     7.2
2         357     4.4
3         418     3.4
4         351     4.2
5         530     3.2
6         360     4.0
7         329     3.0
8         201     1.9
9         291     2.2
10        210     1.7
11        158     1.8
12        222     1.8
13        125     1.6
14        150     1.4
15        198     1.5
16        235     1.3
17        143     1.1
Total     4,896   3.3

I think you are right that there are quite a few terms which are useless outside the book series. Conversely, that also means that Core 10k leaves a lot uncovered.


Did you do the Zettai Kareshi transcript analysis for the complete 1-4 corpus (not one by one)? If so could you link that too? It would be interesting to see how different the vocab is compared to the HP analysis.
Reply
#63
(2017-07-25, 3:46 pm)Matthias Wrote:
(2017-07-23, 8:24 pm)Nukemarine Wrote: I filter out entries that appear in scans of Core 2k and Tae Kim sentences. The unfiltered list is still 7000 or more, iirc. Also, of the 5400 there are still a lot of terms that are useless for anything outside the book series.

That said, I'm tempted to officially study either this or Norwegian Wood in a way similar to when I studied the transcripts for episodes 1-4 of Zettai Kareshi. In my mind, the first 4000 lines of text (out of 15,000 total) seems reasonable. Like the transcripts, it's not really about memorizing words but about understanding the sentences the words appear in. Basically: study ~500 vocabulary/lines, then read the text up to that point out loud at the end. It takes time, but the end result is amazing.
OK, that means no significant difference from the first analysis. So the detection/recognition quality is probably not much changed.

If you filter out the terms which do not have an English definition (e.g. names), you have 4,900 terms [of which the given translation quite often seems mispicked]. Not surprisingly, there are more terms in the first chapter than in the last chapter, and the average frequency goes down too:

*snip*

Did you do the Zettai Kareshi transcript analysis for the complete 1-4 corpus (not one by one)? If so could you link that too? It would be interesting to see how different the vocab is compared to the HP analysis.

I did the scans one by one for simplicity's sake, but the distribution appears similar. Frequent terms naturally appear early on, just because of the nature of language. Personally, I'm not going to take it too deep. Instead, this is more a byproduct of my own studies, which include Japanese dramas and books. Basically, I'm going to do with books what I did with dramas (with a bit more use of text-to-speech on single sentences).

I'm just glad that wareya's program is working out so well. The addition of an example sentence from the source made all the difference in turning vocabulary study into a fun endeavor. It's a shame no one is using my Memrise course, though it is a 10-year-old drama. That I will be doing Murakami's "Norwegian Wood" might change that, though.
Reply
#64
https://github.com/wareya/analyzer/releases/tag/alpha4

Added functionality to make it easier to use the analyzer for mining. Written by someone else. Detailed in https://github.com/wareya/analyzer/pull/1

[Image: 1W6yYdl.png]
Reply
#65
Hi wareya,

I was the one who submitted the code for the sentence improvements. I've started experimenting with n+1 sentence mining; I'm not sure how it will fit into the UI with so many options already, but we'll see how it goes. I've started a thread specific to that here, so as not to detract from this thread too much.
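
For anyone unfamiliar, the "n+1" idea is to keep only sentences with exactly one unknown word. A sketch of it in Python (tokenize here is a stand-in, not an actual function of the analyzer):

Code:
# Yield sentences containing exactly one word outside the learner's
# known set, along with that unknown word.
def n_plus_one(sentences, known_words, tokenize):
    for sentence in sentences:
        unknown = {w for w in tokenize(sentence) if w not in known_words}
        if len(unknown) == 1:
            yield sentence, unknown.pop()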
Reply
#66
Updated the user dictionary. This fixes the numerical "word type" IDs for a bunch of Sino-Japanese nouns, which made kuromoji sometimes skip them before (for some reason...), and adds family-member names, e.g. お父・おとう, because the plain forms (父・ちち) were overriding them; unidic isn't good enough to distinguish them in context.

This is an update to alpha4 since no code changed, just a file.
Edited: 2017-08-28, 1:32 am
Reply
#67
I have seen lots of praise for this text analyzer, and after trying it myself I would like to add this:

The analyzer is amazingly easy to use!

It could indeed be a really useful tool for comparing vocab content like Nukemarine has described. Two issues:
1. The analysis shows e.g. 自動 and 車, or 自衛 and 隊, as separate words (I remember a post about 自転車 which was probably about addressing this issue). Yomichan, by contrast, always looks for the longest possible word (and then lists the components of the compound below it).
2. "Pronounciation" is in katakana. And most of the vocab data I have use hiragana - independent of being from decks (e.g. Core) or from dictionaries. That means that "Pronounciation" can't be used as second criterion for unambigious matching (e.g. for 角).
Reply
#68
Thanks for the feedback.

Quote:1. The analysis shows e.g. 自動 and 車, or 自衛 and 隊, as separate words (I remember a post about 自転車 which was probably about addressing this issue). Yomichan, by contrast, always looks for the longest possible word (and then lists the components of the compound below it).
This is caused by the lexeme dictionary the tokenization library uses. If you enable the user dictionary it'll look for the words in the user dictionary file too.

Quote:2. "Pronounciation" is in katakana. And most of the vocab data I have use hiragana - independent of being from decks (e.g. Core) or from dictionaries. That means that "Pronounciation" can't be used as second criterion for unambigious matching (e.g. for 角).
Same. And this is super low-hanging fruit; I'm not sure I'll fix it.
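
The conversion itself is nearly a one-liner, since katakana and hiragana are offset by 0x60 in Unicode. A sketch (not code from the analyzer):

Code:
# Map the katakana block (ァ..ヶ) onto hiragana by subtracting the
# 0x60 code point offset; everything else passes through unchanged.
def kata_to_hira(text):
    return ''.join(chr(ord(c) - 0x60) if 'ァ' <= c <= 'ヶ' else c
                   for c in text)

print(kata_to_hira('ワタシ'))  # -> わたし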
Reply
#69
(2017-11-27, 2:49 am)wareya Wrote:
Quote:1. The analysis shows e.g. 自動 and 車, or 自衛 and 隊, as separate words ...
This is caused by the lexeme dictionary the tokenization library uses. If you enable the user dictionary it'll look for the words in the user dictionary file too.

Quote:2. "Pronounciation" is in katakana....
Same. And this is super low-hanging fruit; I'm not sure I'll fix it.
Thanks! Got it, the solution is all in the user dictionary. There are 45 entries and there seem to be 30 columns, but the file is unreadable (to me) and I do not know how to change or expand the data:
私,1,1,5000,代名詞,*,*,*,*,*,ワタシ,私,私,ワタシ,私,ワタシ,和,*,*,*,*,ワタシ,ワタシ,ワタシ,ワタシ,*,*,0,*,*

Is there a way to get more entries in? Individually, like e.g. the above 自衛隊, as a long word list, or even as a complete dictionary?
Reply
#70
Make sure you're using a text editor that supports UTF-8, like Notepad++, Geany, Kate, or any of the other modern notepad alternatives on Windows.

The format of the user dictionary is strange; it's a side effect of how the tokenization library works internally. You definitely can't import a whole dictionary, since it uses pre-conjugated stems for verbs and adjectives, but adding nouns it isn't catching is fine.

Quote:自転車,5146,5146,5000,名詞,普通名詞,一般,*,*,*,ジテンシャ,自転車,自転車,ジテンシャ,自転車,ジテンシャ,漢,*,*,*,*,ジテンシャ,ジテンシャ,ジテンシャ,ジテンシャ,*,*,2,*,*

5146 is the internal ID of the word type "名詞,普通名詞,一般,*,*,* [...] 漢 [...]". You'll need to find a copy of unidic (it's open source) and check its lex.csv for examples of other kinds of words, if you want to add a kind of word that isn't in the user dictionary. The word type you assign to a word influences how likely it is to be detected correctly in certain contexts.

The 5000 is the cost of the token: the higher it is, the less likely the token is to be detected instead of its constituent parts. The number 2 near the end is a pitch accent code.
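
Putting that together, here's a hypothetical helper for generating noun rows in this format, with the field layout copied from the 自転車 example above. It reuses the kana reading for every reading/pronunciation field, so words whose pronunciation differs from the kana spelling (e.g. ヒコーキ vs. ヒコウキ) would need those fields adjusted:

Code:
# Build a user-dictionary noun row like the 自転車 example.
def noun_row(surface, kana, pitch, type_id=5146, cost=5000):
    fields = [surface, str(type_id), str(type_id), str(cost),
              '名詞', '普通名詞', '一般', '*', '*', '*',
              kana, surface, surface, kana, surface, kana,
              '漢', '*', '*', '*', '*',
              kana, kana, kana, kana, '*', '*', str(pitch), '*', '*']
    return ','.join(fields)

print(noun_row('自転車', 'ジテンシャ', 2))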
Edited: 2017-11-28, 1:54 am
Reply
#71
(2017-11-28, 1:53 am)wareya Wrote: ... You definitely can't import a whole dictionary, since it uses pre-conjugated stems for verbs and adjectives, but adding nouns it isn't catching is fine.

Quote:自転車,5146,5146,5000,名詞,普通名詞,一般,*,*,*,ジテンシャ,自転車,自転車,ジテンシャ,自転車,ジテンシャ,漢,*,*,*,*,ジテンシャ,ジテンシャ,ジテンシャ,ジテンシャ,*,*,2,*,*

5146 is the internal ID of the word type "名詞,普通名詞,一般,*,*,* [...] 漢 [...]". You'll need to find a copy of unidic (it's open source) ....

The 5000 is the cost of the token. ...
Thanks! For the nouns, the numbers are rather straightforward in your user dictionary:
count / word / type ID / cost / POS / origin:
1 私 1 5000 代名詞 * * 和 (really 1?)
20 おとう 5142 5000 名詞 普通名詞 一般 和 (お/御)
21 自転車 5146 5000 名詞 普通名詞 一般 漢 (=> みた目?)
1 いい加減 5159 5000 名詞 普通名詞 形状詞可能 混 (3 variants)

That is probably because not many cases are covered there. Many more are needed, even at a very basic level, like 外国人.

I guess it does not matter much whether one uses 5142, 5146, or 5159 as the word type ID. And the cost is always 5000.

Nouns alone are probably not sufficient, as verbs also get combined a lot: 思い止まる, 書き取る, 持って来る, 心掛ける, 仕上げる, ...

But it is of course problematic if you can only get the -る form, and therefore it would probably not make sense to find out word type IDs for verbs (unidic is quite intimidating for me).

The idea of matching a known corpus against a new corpus is not that easy; e.g. you have to wade through lots of single-character analysis results like 語, 学, 校, ...

Also, if you have an entry for 多い, you will not get a match with 多く.

The same goes for different spellings: with 大変 vs. たいへん you don't get a match. Converting from katakana to hiragana would be necessary to have a second criterion in that case (I hope you can pick this low-hanging fruit).

All in all, the matching rate is therefore much lower than hoped for.
Reply
#72
I ran a vocab list through the analyzer and got quite a few entries where the analyzer picks out only a portion of the entry:
お辞儀; 八日; 博物館; お祖母さん; 二十歳; 掲示板; 消防車; 八つ; 入場券; 留学生; お祝い; 改札口; 航空便; 幾つ; 年賀状; 駐車場; 事務室; 一日; 万年筆; 消防署; お祖父さん; 文房具; 交通費; ようこそ; 行き; 横断歩道; 誰か; 味噌汁; 高等学校; 再来年; ごちそうする; 美容院; 一昨日; 思いがけない; 蛍光灯; 男らしい; どうも; 三日月; 慣用句; すっと; 忘年会; 比較的; どきっと; 最先端; 徹底的; 一遍に; 真四角; 地下道; 不公平; 大蔵省; 本格的; お目に掛かる; 積極的; くだらない; とっくに; 不器用; 婆さん; いつのまにか; 乗車券; 対等; 典型的; 領事館; ひらりと; にわかに; ぴたりと; やって来る; 共産主義; 時間割り; 放射能; 間に合わせる; お仕舞い; 長方形; 今に; 可愛がる; 言い出す; 市役所; 不自由; 洗面器; 海水浴; 恋する; 届け; 抽象的; 絶えず; 次第に; 正方形; 無生物; もしかすると; 年月日; 百貨店; 調味料; 誠に; 裁判所; 未成年; 筆記試験; 刑務所; 一斉に; 半導体; 論理的; 小売店; 硬さ; お世辞; 乗り遅れる; ワンピース; 仮に; 旅客機; もうけ; 非常識; 蛋白質; 覚え; 一時; この度; 消極的; きれ; 耳鼻科; おっしゃる; お中元; 切れ; 立ち入り禁止; 筆記用具; 使用人; 留守番; 貴重品; 待合室; 合理的; 公務員; 生年月日; 一周; 何と; 文化財; おだてる; 神様; 展覧会; やたらに; 水族館; とっさに; 人文科学; 有りのまま; もうける; 文学者; 出迎え; ごくんと; お詫び; 知らず知らず; 一向に; 派出所; 人差し指; 社会科学; お年玉; 小麦粉; 共通語; 滑走路; 乗用車; 容疑者; 弟子入り; 請求書; 陸上部; 一晩中; 世界中; 欠かせない; 目玉焼き; 主人公; 歌合戦; 身に付く; 料理人; 常に; 要するに; 応援団; 産婦人科; 小児科; もうじき; 借り; お歳暮; 蛍光ペン; ガラス戸; 見物人; 頻りに; 生意気; 思い切って; 寒暖計; 停留所; 定休日; ひょっとしたら; 客観的; よそう; 狭心症; 悪酔いする; 手が空く; 脳梗塞; 二日酔い; 慰安婦; 共犯者; 最終日; 宅配便; 凝固する; 創世記; 接着剤; 土下座; 熱気球; 肩甲骨; 不祥事; 懐中電灯; 雪合戦; 既婚者; 志願する; 血友病; 身に覚えがない; 舶来品; 花粉症; 計理士; 認知症; 更衣室; 一挙に; 不意に; 警視庁; 望遠鏡; 表彰状; 配偶者; 後始末; 好都合; 博覧会; 参政権; 最大限; 都市ガス; 税務署; 核兵器; 漠然と; 副作用; 模擬試験; 退廃的; 受動的; 偏差値; 海産物; 五分; 劣等感; 披露宴; 網羅する; 覚醒剤; お節料理; 冠婚葬祭; プロ野球; 二重; 予備校; 先進国; ご馳走; 芸能人; 当事者; お喋り; イスラム教; 咄嗟に; 録音テープ; お金; お休み; お気に入り; 長; 頂き; 時には; 認め; そのもの; 必需品; 致命的; 排気ガス; 座談会; 顕微鏡; 引き続き; 取締役; 如何にも; 単行本; 勿体ない; 反作用; 方程式; 封建的; 大多数; 相対的; 不透明; 禁欲的; 画期的; 爪楊枝; カトリック教徒; 過渡期; 一気に; 刻一刻と; お節介; 共和国; 診療所; 遠距離; 助教授; 医薬品; 何時も; 無条件; 修学旅行; 是非とも; 利己主義; 運動場; 運良く; 四捨五入; 飲食店; 同窓会; ご褒美; 要注意; 有意義; 潜水艦; 一等; 乾電池; 歌謡曲; 無期限; 近距離; 堂々と; 下半身; 互いに; 合理化; 小文字; 公民館; 交響曲; 似顔絵; 何ら; 十字架; 上半身; 物足りない; 不利益; 漢方薬; お守り; 小数点; 精一杯; 人間性; 一重; 立候補; 引き締め; 機関車; 自営業; 折り畳み; ざっと; 少なからず; 理学部; 自発的; 思う存分; 無性に; 加害者; 何分; 悪しからず; 食中毒; 光熱費; 一概に; 感受性; 直に; 優越感; 普段着; 一円; 爺さん; 通俗的; お化け; 危うく; 人見知り; 先天的; 立て続けに; 優等生; 母国語; 有機物; 先入観; 有権者; 貸し切り; 水蒸気; 行き過ぎ; 後天的; ぐるっと; 見出す; 初任給; 不本意; 能動的; 人並み; 早々と; 流行歌; 軍国主義; 断じて; 潔く; 無造作; 別問題; 楕円形; 乗組員; 運転士; ひしひしと; 呆然と; 超特急; 積み立て; おでこ; 代議士; 輪ゴム; 果てしない; 張本人; それとなく; 報われる; いやに; 可燃物; 発起人; ちらりと; 資本家; 程なく; 終着駅; 打算的; 適齢期; お代わり; 口々に; 絶え間なく; 不時着; 無機物; 画用紙; 何気無く; 遺失物; ふわりと; 日増しに; 浮浪者; 横文字; がらりと; 楽天的; 間一髪; 行き渡る; ありありと; 自惚れ; お手伝いさん; 一心に; 歴然と; 密入国; でかでかと; にやりと; 不燃物; 虚栄心; 根掘り葉掘り; 巨視的; 願わくは; ぽかんと; ウィークデー; 瀬戸物; お神興; がらんと; 早引き; やり切れない; からりと; 手際良く; 止めどなく; だらだらする; がぶりと; 排他主義; がたんと; 分からず屋; 引っ切り無しに; お共; 待ち兼ねる; 悪循環; 機動隊; 楽観的; 敷布団; 力一杯; 不条理; 過半数; 終止符; 暴風雨; 常用漢字; 換気扇; 高校生; 父さん; 見方; 具体的; 大使館; 教科書; 大学院; 交通事故; めったに; 週刊誌; 飛行場; 広さ; 誕生日; 天気予報; お嬢さん; 腕時計; お礼; 図書室; おじさん; 扇風機; キャッシュカード; 免許証; お釣り; お菓子; 兄さん; 姉さん; 連れて来る; 郵便屋さん; お父さん; お兄さん; 一緒に; お母さん; お姉さん; 一杯; 二十日; お願い; お手洗い; 本当に; 多く; これら; 非常に; 急に; 中学校; 私たち; 高さ; 運転手; 久しぶり; 郵便局; 救急車; ジェット機; 喫茶店; 皆さん; 小学生; 動物園; おしゃべり; 一生懸命; つまらない; ガソリンスタンド; びっくりする; 電話帳; 記念日; アイスクリーム; テープレコーダー; バス停; 歯医者; お宅; ラッシュアワー; 二階; 目覚まし時計; 歯ブラシ; たばこ屋; トイレットペーパー; ボーイフレンド; カレーライス; カセットテープ; あんなに; こんなに; レインコート; エアメール; シャープペンシル; ウェートレス; 絶対に; そんなに; 似ている; 始めに; お巡りさん; 従業員; 消費者; 植民地; 奨学金; 形容詞; 外来語; 衆議院; 受話器; 飲料水; 王様; 聞き取り; 西洋人; 口げんか; お坊さん; 水平線; 暑中見舞い; 一段と; 後片付け; 御無沙汰; 生き生きと; がくんと; きらりと; 順々に; 違いない; 全面的; 同級生; たまらない; 都立; 前もって; とんでもない; 代名詞; 地平線; 申し訳ない; 日用品; 歩行者; ぱっと; 真っ最中; 割に; 男性的; はかり; 腹一杯; 面倒臭い; ぴょんと; 外務省; いけない; 原子力; 相変わらず; 新聞社; 県立; さっと; ご存じ; 参議院; 衣食住; 三角形; 新学期; 上等; 高気圧; 座布団; 意地悪; 現住所; 五十音; ゴールデンウィーク; 自然に; 全速力; 労働者; 専門家; 次々に; 不十分; 第一; 低気圧; 我が国; 日本酒; 保育園; 百科事典; 真夜中; もったいない; ほうれん草; どんなに; ばからしい; 坊ちゃん; 叔父さん; 伯父さん; つくり; めんつ; おいで; くれぐれも; じかに; おみくじ; むやみに; さっそうと; うるうどし; もっともらしい; にわかあめ

That list was longer than expected, but it is basically just Core. And there is not even an entry for 外国人 in Core...
Edited: 2017-11-30, 9:20 pm
Reply
#73
Yeah, that's basically the blind spot of tokenization-based text analysis: you're totally reliant on what the token dictionary contains. It also has problems reading お父さん correctly, because 父《とう》 and 父《ちち》 are, to it, identical in every way but pronunciation.

I strongly recommend focusing on adding the most important words, things like 自転車 (not a simple compound word, but still missing for some reason), rather than things like 外国人. I did this with A Frequency Dictionary of Japanese for some of the user dictionary entries.
Edited: 2017-11-30, 10:34 pm
Reply