
unnamed japanese text analyzer

#51
(2017-05-11, 10:47 pm)phil321 Wrote: Here's how they translated "the Welcome Wagon Lady": 歓迎ワゴンのレヂイは. The kanji 歓迎 are glossed with katakana furigana as follows: ウェルカム. So even though 歓迎, which means "welcome", is pronounced "kangei", it is in this case glossed as "werukamu". Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
That doesn't leave Japanese readers much worse off than many English readers -- I had to go look the phrase up on Wikipedia. Using katakana is at least a clue that it's a proper noun...
#52
(2017-05-12, 4:10 pm)pm215 Wrote:
(2017-05-11, 10:47 pm)phil321 Wrote: Here's how they translated "the Welcome Wagon Lady": 歓迎ワゴンのレヂイは. The kanji 歓迎 are glossed with katakana furigana as follows: ウェルカム. So even though 歓迎, which means "welcome", is pronounced "kangei", it is in this case glossed as "werukamu". Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
That doesn't leave Japanese readers much worse off than many English readers -- I had to go look the phrase up on Wikipedia. Using katakana is at least a clue that it's a proper noun...

"Welcome wagon" is a term that native English speakers (at least in North America) would know without checking in a dictionary.  Obviously Ira Levin, the author of the book, wouldn't have used the term "The Welcome Wagon Lady" in the first sentence of his bestselling book if the typical reader didn't know what it meant off the top of their head.
Edited: 2017-05-12, 4:20 pm
#53
My point was that I'm a native English speaker (UK, in this case, but the same would apply to any non-US/Canadian readers) and it wasn't something I knew. Wikipedia says they stopped in 1998 so some younger US readers might also miss the reference. (References do go out of date with time and distance -- Austen expected her readers to know what a Rumford stove was, for instance.)
Edited: 2017-05-12, 4:32 pm
#54
For what it's worth, I don't know the term "welcome wagon" either; it just sounds like it might be a term.
#55
The analyzer now supports user dictionaries. Entries must be in a fairly specific format. The example dictionary contains an entry for 自転車. In the future, the names list will be provided as a user dictionary instead of being baked in.

The normalizer removes outliers instead of trying to adjust the skew of the distribution. Frequency lists follow a Pareto distribution, and there's no generic way to linearize that, so removing outliers is the only approach that doesn't distort the final distribution of the data (not even a geometric mean would be correct). Removing outliers is closely related to taking the median; with only three or four input files, the result is the same as the median.
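
To illustrate the point about the median, here is a minimal sketch. It assumes the normalizer drops the single highest and single lowest value for each word across the input files and averages what is left. The function name `trimmed_mean` and the sample numbers are made up for illustration and are not taken from the analyzer's source.

```python
from statistics import mean, median

def trimmed_mean(values, trim=1):
    """Average the values after dropping the `trim` smallest and largest."""
    ordered = sorted(values)
    # Only trim when something would be left over afterwards.
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return mean(ordered)

# Hypothetical per-file frequencies of one word (not real data).
three_files = [5, 100, 7]
four_files = [1, 4, 6, 100]

# With three or four inputs, trimming one value from each end leaves
# exactly the values the median is computed from.
assert trimmed_mean(three_files) == median(three_files)
assert trimmed_mean(four_files) == median(four_files)
```

With five or more input files the two diverge, because the trimmed mean still averages several middle values while the median looks at only one or two.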

https://github.com/wareya/analyzer/releases/tag/alpha1
Edited: 2017-05-17, 11:45 am
#56
Some important (generic) terms from A Frequency Dictionary of Japanese that the analyzer wasn't picking up on, in user dictionary format:

Code:
自転車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジテンシャ,自転車,自転車,ジテンシャ,自転車,ジテンシャ,漢,*,*,*,*,ジテンシャ,ジテンシャ,ジテンシャ,ジテンシャ,*,*,2,*,*
飛行機,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ヒコウキ,飛行機,飛行機,ヒコーキ,飛行機,ヒコーキ,漢,*,*,*,*,ヒコウキ,ヒコウキ,ヒコウキ,ヒコウキ,*,*,2,*,*
幼稚園,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ヨウチエン,幼稚園,幼稚園,ヨーチエン,幼稚園,ヨーチエン,漢,*,*,*,*,ヨウチエン,ヨウチエン,ヨウチエン,ヨウチエン,*,*,3,*,*
自動車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジドウシャ,自動車,自動車,ジドーシャ,自動車,ジドーシャ,漢,*,*,*,*,ジドウシャ,ジドウシャ,ジドウシャ,ジドウシャ,*,*,"2,0",*,*
図書館,5142,5142,8000,名詞,普通名詞,一般,*,*,*,トショカン,図書館,図書館,トショカン,図書館,トショカン,漢,*,*,*,*,トショカン,トショカン,トショカン,トショカン,*,*,2,*,*
冷蔵庫,5142,5142,8000,名詞,普通名詞,一般,*,*,*,レイゾウコ,冷蔵庫,冷蔵庫,レーゾーコ,冷蔵庫,レーゾーコ,漢,*,*,*,*,レイゾウコ,レイゾウコ,レイゾウコ,レイゾウコ,*,*,3,*,*
違和感,5142,5142,8000,名詞,普通名詞,一般,*,*,*,イワカン,違和感,違和感,イワカン,違和感,イワカン,漢,*,*,*,*,イワカン,イワカン,イワカン,イワカン,*,*,2,*,*
不動産,5142,5142,8000,名詞,普通名詞,一般,*,*,*,フドウサン,不動産,不動産,フドーサン,不動産,フドーサン,漢,*,*,*,*,フドウサン,フドウサン,フドウサン,フドウサン,*,*,"2,0",*,*
法案,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ホウアン,法案,法案,ホーアン,法案,ホーアン,漢,*,*,*,*,ホウアン,ホウアン,ホウアン,ホウアン,*,*,0,*,*
無人島,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ムジントウ,無人島,無人島,ムジントー,無人島,ムジントー,漢,*,*,*,*,ムジントウ,ムジントウ,ムジントウ,ムジントウ,*,*,0,*,*
価値観,5142,5142,8000,名詞,普通名詞,一般,*,*,*,カチカン,価値観,価値観,カチカン,価値観,カチカン,漢,*,*,*,*,カチカン,カチカン,カチカン,カチカン,*,*,"3,2",*,*
市内,5142,5142,8000,名詞,普通名詞,一般,*,*,*,シナイ,市内,市内,シナイ,市内,シナイ,漢,*,*,*,*,シナイ,シナイ,シナイ,シナイ,*,*,1,*,*
地下鉄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,チカテツ,地下鉄,地下鉄,チカテツ,地下鉄,チカテツ,漢,*,*,*,*,チカテツ,チカテツ,チカテツ,チカテツ,*,*,0,*,*
看護婦,5142,5142,8000,名詞,普通名詞,一般,*,*,*,カンゴフ,看護婦,看護婦,カンゴフ,看護婦,カンゴフ,漢,*,*,*,*,カンゴフ,カンゴフ,カンゴフ,カンゴフ,*,*,3,*,*
好奇心,5142,5142,8000,名詞,普通名詞,一般,*,*,*,コウキシン,好奇心,好奇心,コーキシン,好奇心,コーキシン,漢,*,*,*,*,コウキシン,コウキシン,コウキシン,コウキシン,*,*,3,*,*
いいかげん,5159,5159,8000,名詞,普通名詞,形状詞可能,*,*,*,イイカゲン,いい加減,いいかげん,イイカゲン,いいかげん,イイカゲン,混,*,*,*,*,イイカゲン,イイカゲン,イイカゲン,イイカゲン,*,*,0,*,*
いい加減,5159,5159,8000,名詞,普通名詞,形状詞可能,*,*,*,イイカゲン,いい加減,いい加減,イイカゲン,いい加減,イイカゲン,混,*,*,*,*,イイカゲン,イイカゲン,イイカゲン,イイカゲン,*,*,0,*,*
見た目,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,見た目,ミタメ,見た目,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
みため,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,みため,ミタメ,みため,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
みた目,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,みた目,ミタメ,みた目,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
委員会,5142,5142,8000,名詞,普通名詞,一般,*,*,*,イインカイ,委員会,委員会,イインカイ,委員会,イインカイ,漢,*,*,*,*,イインカイ,イインカイ,イインカイ,イインカイ,*,*,"2,0",*,*
遊園地,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ユウエンチ,遊園地,遊園地,ユーエンチ,遊園地,ユーエンチ,漢,*,*,*,*,ユウエンチ,ユウエンチ,ユウエンチ,ユウエンチ,*,*,3,*,*
新幹線,5142,5142,8000,名詞,普通名詞,一般,*,*,*,シンカンセン,新幹線,新幹線,シンカンセン,新幹線,シンカンセン,漢,*,*,*,*,シンカンセン,シンカンセン,シンカンセン,シンカンセン,*,*,3,*,*

Pitch accent data copied from rikai.
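
For anyone editing these entries by hand, a small sketch of reading one back with Python's `csv` module may help. Reading the first four columns as surface form, left context ID, right context ID and cost follows the usual MeCab-style dictionary layout and is an assumption here, as is the position of the pitch accent column, which is inferred from the quoted `"2,0"` values above. The helper name `parse_entries` is made up.

```python
import csv
import io

# One entry copied from the list above.
ENTRY = (
    '自動車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジドウシャ,自動車,自動車,'
    'ジドーシャ,自動車,ジドーシャ,漢,*,*,*,*,ジドウシャ,ジドウシャ,ジドウシャ,'
    'ジドウシャ,*,*,"2,0",*,*\n'
)

def parse_entries(text):
    """Parse dictionary text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]

row = parse_entries(ENTRY)[0]

surface = row[0]
# Assumed meaning of columns 2-4: left context ID, right context ID, cost.
left_id, right_id, cost = (int(x) for x in row[1:4])
# Column 28 (index 27) appears to hold the pitch accent; the quotes keep
# the comma in "2,0" from splitting it into two fields.
accent = row[27]

print(surface, left_id, right_id, cost, accent, len(row))
```

A plain `line.split(",")` would produce 31 fields for this entry instead of 30, which is why a real CSV parser is the safer choice.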

There are more that it didn't pick up on, but most of them were spelling differences and a lot of them are questionable. The only objective ones I've left out so far are a couple of prefecture names, abbreviations, and phrases that are more societally oriented.

This will be the default user dictionary in the next release, whenever that happens. The next major feature is the ability to treat some entries as if they were other entries, to automatically account for spelling variations. This, too, will use an explicit mapping loaded from a file.
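
A rough sketch of what such a mapping could do to the counts. The feature isn't implemented yet, so everything here is hypothetical: the two-column tab-separated layout, the names `load_mapping` and `merge_variants`, and the counts.

```python
import csv
import io
from collections import Counter

# Hypothetical mapping file: variant spelling, then the entry to count it as.
MAPPING_TSV = "みため\t見た目\nみた目\t見た目\nいいかげん\tいい加減\n"

def load_mapping(text):
    """Read 'variant<TAB>target' rows into a dict, ignoring malformed rows."""
    mapping = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) == 2:
            mapping[row[0]] = row[1]
    return mapping

def merge_variants(counts, mapping):
    """Add each variant's count to its target entry; others pass through."""
    merged = Counter()
    for spelling, n in counts.items():
        merged[mapping.get(spelling, spelling)] += n
    return merged

# Made-up counts for illustration.
counts = Counter({"見た目": 5, "みため": 2, "みた目": 1, "市内": 3})
merged = merge_variants(counts, load_mapping(MAPPING_TSV))
print(merged)
```

Keeping the mapping in a plain file means it can be edited without touching the program, the same way the user dictionary can.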

Here's how this user dictionary affects Katahane (except the effect of 自転車): https://pastebin.com/raw/w89rcF81
Edited: 2017-05-18, 6:01 pm
#57
Filters are now loaded from a file (userfilters.csv). These are basically blacklist filters; I haven't implemented rewrite filters yet. The newly included filters are more reasonable for UniDic than the old ones, which were meant for IPAdic output and blocked a bunch of non-grammatical verbs.

Also, the command line interface can now use the user dictionary (enabled by default), and the command line flags were changed because of the removal of the built-in particle/number/name filters.

https://github.com/wareya/analyzer/releases/tag/alpha2

[Image: 1aJMsgQ.png]
Edited: 2017-05-28, 6:18 am
#58
Alpha3 lets you analyze frequency by word or lexeme instead of just the given spelling.

https://github.com/wareya/analyzer/releases/tag/alpha3

Example:

Code:
20096    和    動詞    非自立可能    *    サ行変格    為る    スル    する    スル    スル    0    20092    する    スル    スル    *    3    仕る    スル    スル    0    1
12880    和    動詞    非自立可能    *    上一段-ア行    居る    イル    いる    イル    イル    0    12823    居る    イル    イル    0    57
...
9843    和    動詞    一般    *    五段-ワア行    言う    イウ    言う    イウ    イウ    0    4292    いう    ユウ    ユー    *    4029    いう    イウ    イウ    0    792    言う    ユウ    ユー    0    714    ゆう    ユウ    ユー    *    14    謂う    イウ    イウ    0    1    ゆう    ユウ    ユー    0    1

[Image: GlbLGea.png]
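
A quick way to read the example output: each line seems to be an 8-field lexeme header followed by 5-field groups, one per spelling, with the header's leading number being the sum of the per-spelling counts. That grouping is inferred from the three example lines only, not from the source, and the sketch below assumes the fields are whitespace-separated. `parse_lexeme_line` is a made-up name.

```python
# Second line of the example output above.
LINE = (
    "12880    和    動詞    非自立可能    *    上一段-ア行    居る    イル    "
    "いる    イル    イル    0    12823    居る    イル    イル    0    57"
)

HEADER_FIELDS = 8    # total count, origin, POS columns, lemma, lemma reading
SPELLING_FIELDS = 5  # spelling, two kana columns, accent, count

def parse_lexeme_line(line):
    """Split one output line into (total, lemma, [(spelling, count), ...])."""
    fields = line.split()
    total = int(fields[0])
    lemma = fields[6]
    rest = fields[HEADER_FIELDS:]
    spellings = []
    for i in range(0, len(rest), SPELLING_FIELDS):
        # Raises ValueError if the line doesn't fit the assumed grouping.
        spelling, _kana1, _kana2, _accent, count = rest[i:i + SPELLING_FIELDS]
        spellings.append((spelling, int(count)))
    return total, lemma, spellings

total, lemma, spellings = parse_lexeme_line(LINE)
# The lexeme total is the sum over its spellings: 12823 + 57 == 12880.
assert total == sum(count for _, count in spellings)
print(lemma, total, spellings)
```

The same check holds for the other two example lines (20092 + 3 + 1 = 20096, and the seven 言う spellings add up to 9843).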
Edited: Today, 8:23 am
#59
@wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" not only allowed me to sort by order of appearance, but I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for the word.  

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very practical to study the individual words first (with the sentence as reference) and then read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was split up multi-sentence paragraphs into individual "sentences", using "。?、!" as markers of sorts. This brought the majority of lines under 50 characters.
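
For reference, the splitting step can be approximated in a few lines. This is only a sketch of the idea, not the exact process used for the spreadsheet; the full-width ？ and ！ are added to the marker set as an assumption, and `split_line` is a made-up name.

```python
import re

# Markers from the description above, plus full-width ？ and ！ (assumed).
MARKERS = "。?、!？！"

# A piece is a run of non-marker characters followed by any markers after it.
_PIECE = re.compile(f"[^{re.escape(MARKERS)}]+[{re.escape(MARKERS)}]*")

def split_line(text):
    """Split a paragraph into short pieces, keeping each trailing marker."""
    pieces = (m.group().strip() for m in _PIECE.finditer(text))
    return [p for p in pieces if p]

print(split_line("今日は、いい天気だ。本当?"))
```

One known rough edge: a closing 」 that follows a marker ends up at the start of the next piece, so quoted dialogue may need a manual pass.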

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: it would be great if you could somehow get the program to do the above in an automated fashion.