unnamed japanese text analyzer

#51
(2017-05-11, 10:47 pm)phil321 Wrote: Here's how they translated "the Welcome Wagon Lady": 歓迎ワゴンのレヂイは. The kanji 歓迎 are glossed with katakana furigana as ウェルカム. So even though 歓迎, which means "welcome", is pronounced "kangei", it is in this case glossed as "werukamu". Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
That doesn't leave Japanese readers much worse off than many English readers -- I had to go look the phrase up on Wikipedia. Using katakana is at least a clue that it's a proper noun...
#52
(2017-05-12, 4:10 pm)pm215 Wrote:
(2017-05-11, 10:47 pm)phil321 Wrote: Here's how they translated "the Welcome Wagon Lady": 歓迎ワゴンのレヂイは. The kanji 歓迎 are glossed with katakana furigana as ウェルカム. So even though 歓迎, which means "welcome", is pronounced "kangei", it is in this case glossed as "werukamu". Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
That doesn't leave Japanese readers much worse off than many English readers -- I had to go look the phrase up on Wikipedia. Using katakana is at least a clue that it's a proper noun...

"Welcome wagon" is a term that native English speakers (at least in North America) would know without checking in a dictionary.  Obviously Ira Levin, the author of the book, wouldn't have used the term "The Welcome Wagon Lady" in the first sentence of his bestselling book if the typical reader didn't know what it meant off the top of their head.
Edited: 2017-05-12, 4:20 pm
#53
My point was that I'm a native English speaker (UK, in this case, but the same would apply to any non-US/Canadian readers) and it wasn't something I knew. Wikipedia says they stopped in 1998 so some younger US readers might also miss the reference. (References do go out of date with time and distance -- Austen expected her readers to know what a Rumford stove was, for instance.)
Edited: 2017-05-12, 4:32 pm
#54
For what it's worth, I don't know the term "welcome wagon" either; it just sounds like it might be a term.
#55
The analyzer now supports user dictionaries. Entries must be in a pretty specific format; the example dictionary uses 自転車. In the future, the names list will be provided as a user dictionary instead of being baked in.

The normalizer removes outliers instead of trying to adjust the skew of the distribution. Frequency lists follow a Pareto distribution, which there's no generic way to linearize, so removing outliers is the only approach that doesn't distort the final distribution of the data (not even a geometric mean would be correct). Removing outliers is closely related to taking the median; with only three or four input files, the two give the same result.
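
For the curious, here's a minimal sketch of that idea in Python (my illustration, not the analyzer's actual code): to combine one word's per-file frequencies, drop the highest and lowest values and average the rest, which for three or four files is exactly the median.

Code:
# Sketch of outlier-removal normalization across several frequency lists.
def combine(per_file_counts):
    vals = sorted(per_file_counts)
    if len(vals) > 2:
        vals = vals[1:-1]  # drop the lowest and highest value
    return sum(vals) / len(vals)

# One {word: frequency} dict per input file (made-up numbers)
freq_lists = [
    {"自転車": 120, "飛行機": 30},
    {"自転車": 90,  "飛行機": 500},  # 飛行機 spikes in this file
    {"自転車": 100, "飛行機": 40},
]

words = set().union(*freq_lists)
combined = {w: combine([f.get(w, 0) for f in freq_lists]) for w in words}
print(combined)  # the outlier count of 500 never skews 飛行機's result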

https://github.com/wareya/analyzer/releases/tag/alpha1
Edited: 2017-05-17, 11:45 am
#56
Some important (generic) terms from A Frequency Dictionary of Japanese that the analyzer wasn't picking up on, in user dictionary format:

Code:
自転車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジテンシャ,自転車,自転車,ジテンシャ,自転車,ジテンシャ,漢,*,*,*,*,ジテンシャ,ジテンシャ,ジテンシャ,ジテンシャ,*,*,2,*,*
飛行機,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ヒコウキ,飛行機,飛行機,ヒコーキ,飛行機,ヒコーキ,漢,*,*,*,*,ヒコウキ,ヒコウキ,ヒコウキ,ヒコウキ,*,*,2,*,*
幼稚園,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ヨウチエン,幼稚園,幼稚園,ヨーチエン,幼稚園,ヨーチエン,漢,*,*,*,*,ヨウチエン,ヨウチエン,ヨウチエン,ヨウチエン,*,*,3,*,*
自動車,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ジドウシャ,自動車,自動車,ジドーシャ,自動車,ジドーシャ,漢,*,*,*,*,ジドウシャ,ジドウシャ,ジドウシャ,ジドウシャ,*,*,"2,0",*,*
図書館,5142,5142,8000,名詞,普通名詞,一般,*,*,*,トショカン,図書館,図書館,トショカン,図書館,トショカン,漢,*,*,*,*,トショカン,トショカン,トショカン,トショカン,*,*,2,*,*
冷蔵庫,5142,5142,8000,名詞,普通名詞,一般,*,*,*,レイゾウコ,冷蔵庫,冷蔵庫,レーゾーコ,冷蔵庫,レーゾーコ,漢,*,*,*,*,レイゾウコ,レイゾウコ,レイゾウコ,レイゾウコ,*,*,3,*,*
違和感,5142,5142,8000,名詞,普通名詞,一般,*,*,*,イワカン,違和感,違和感,イワカン,違和感,イワカン,漢,*,*,*,*,イワカン,イワカン,イワカン,イワカン,*,*,2,*,*
不動産,5142,5142,8000,名詞,普通名詞,一般,*,*,*,フドウサン,不動産,不動産,フドーサン,不動産,フドーサン,漢,*,*,*,*,フドウサン,フドウサン,フドウサン,フドウサン,*,*,"2,0",*,*
法案,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ホウアン,法案,法案,ホーアン,法案,ホーアン,漢,*,*,*,*,ホウアン,ホウアン,ホウアン,ホウアン,*,*,0,*,*
無人島,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ムジントウ,無人島,無人島,ムジントー,無人島,ムジントー,漢,*,*,*,*,ムジントウ,ムジントウ,ムジントウ,ムジントウ,*,*,0,*,*
価値観,5142,5142,8000,名詞,普通名詞,一般,*,*,*,カチカン,価値観,価値観,カチカン,価値観,カチカン,漢,*,*,*,*,カチカン,カチカン,カチカン,カチカン,*,*,"3,2",*,*
市内,5142,5142,8000,名詞,普通名詞,一般,*,*,*,シナイ,市内,市内,シナイ,市内,シナイ,漢,*,*,*,*,シナイ,シナイ,シナイ,シナイ,*,*,1,*,*
地下鉄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,チカテツ,地下鉄,地下鉄,チカテツ,地下鉄,チカテツ,漢,*,*,*,*,チカテツ,チカテツ,チカテツ,チカテツ,*,*,0,*,*
看護婦,5142,5142,8000,名詞,普通名詞,一般,*,*,*,カンゴフ,看護婦,看護婦,カンゴフ,看護婦,カンゴフ,漢,*,*,*,*,カンゴフ,カンゴフ,カンゴフ,カンゴフ,*,*,3,*,*
好奇心,5142,5142,8000,名詞,普通名詞,一般,*,*,*,コウキシン,好奇心,好奇心,コーキシン,好奇心,コーキシン,漢,*,*,*,*,コウキシン,コウキシン,コウキシン,コウキシン,*,*,3,*,*
いいかげん,5159,5159,8000,名詞,普通名詞,形状詞可能,*,*,*,イイカゲン,いい加減,いいかげん,イイカゲン,いいかげん,イイカゲン,混,*,*,*,*,イイカゲン,イイカゲン,イイカゲン,イイカゲン,*,*,0,*,*
いい加減,5159,5159,8000,名詞,普通名詞,形状詞可能,*,*,*,イイカゲン,いい加減,いい加減,イイカゲン,いい加減,イイカゲン,混,*,*,*,*,イイカゲン,イイカゲン,イイカゲン,イイカゲン,*,*,0,*,*
見た目,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,見た目,ミタメ,見た目,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
みため,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,みため,ミタメ,みため,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
みた目,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ミタメ,見た目,みた目,ミタメ,みた目,ミタメ,漢,*,*,*,*,ミタメ,ミタメ,ミタメ,ミタメ,*,*,1,*,*
委員会,5142,5142,8000,名詞,普通名詞,一般,*,*,*,イインカイ,委員会,委員会,イインカイ,委員会,イインカイ,漢,*,*,*,*,イインカイ,イインカイ,イインカイ,イインカイ,*,*,"2,0",*,*
遊園地,5142,5142,8000,名詞,普通名詞,一般,*,*,*,ユウエンチ,遊園地,遊園地,ユーエンチ,遊園地,ユーエンチ,漢,*,*,*,*,ユウエンチ,ユウエンチ,ユウエンチ,ユウエンチ,*,*,3,*,*
新幹線,5142,5142,8000,名詞,普通名詞,一般,*,*,*,シンカンセン,新幹線,新幹線,シンカンセン,新幹線,シンカンセン,漢,*,*,*,*,シンカンセン,シンカンセン,シンカンセン,シンカンセン,*,*,3,*,*

Pitch accent data copied from rikai.

There's more it didn't pick up on, but most of those were spelling differences, and a lot of them are questionable. The only objective ones I left out so far were a couple of prefecture names, abbreviations, and phrases that are more societally oriented.

This will be the default user dictionary in the next release, whenever that happens. The next major feature is the ability to treat some entries as other, different entries, to automatically account for spelling variations. This, too, will use an explicit mapping loaded from a file; see the sketch below.
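
A rough sketch of how such a mapping could work (hypothetical file name and format, not a committed design): each row maps a variant spelling to its canonical entry, and counts are folded together after analysis.

Code:
# Sketch of merging spelling variants via a mapping file.
# Hypothetical format for "variants.csv": one "variant,canonical"
# pair per row, e.g. "みため,見た目".
import csv
from collections import Counter

def load_aliases(path):
    with open(path, encoding="utf-8") as f:
        return {variant: canonical for variant, canonical in csv.reader(f)}

def merge_variants(counts, aliases):
    merged = Counter()
    for word, n in counts.items():
        merged[aliases.get(word, word)] += n  # fold variant into canonical form
    return merged

counts = Counter({"見た目": 12, "みため": 3, "みた目": 1})
aliases = {"みため": "見た目", "みた目": "見た目"}
print(merge_variants(counts, aliases))  # Counter({'見た目': 16})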

Here's how this user dictionary affects Katahane (excluding the effect of 自転車): https://pastebin.com/raw/w89rcF81
Edited: 2017-05-18, 6:01 pm
#57
Filters are now loaded from a file (userfilters.csv). These are basically blacklist filters; I haven't implemented rewrite filters yet. The new included filters are more reasonable for unidic than the old ones, which were meant for output from ipadic and blocked a bunch of non-grammatical verbs.
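
In spirit, a blacklist filter just drops matching tokens before counting. A minimal sketch, assuming one pattern per row of the CSV (the real userfilters.csv format may differ):

Code:
# Sketch of loading blacklist filters and applying them to tokens.
import csv

def load_filters(path):
    # assume each row's first column is a part of speech to drop, e.g. 助詞
    with open(path, encoding="utf-8") as f:
        return {row[0] for row in csv.reader(f) if row}

tokens = [("猫", "名詞"), ("が", "助詞"), ("鳴く", "動詞")]  # (surface, POS)
blacklist = {"助詞"}  # what load_filters("userfilters.csv") might return
kept = [t for t in tokens if t[1] not in blacklist]
print(kept)  # the particle が is filtered out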

Also, the command line interface can now use the user dictionary (enabled by default), and the command line flags have changed because the built-in particle/number/name filters were removed.

https://github.com/wareya/analyzer/releases/tag/alpha2

[Image: 1aJMsgQ.png]
Edited: 2017-05-28, 6:18 am
#58
Alpha3 lets you analyze frequency by word or lexeme instead of just the given spelling.

https://github.com/wareya/analyzer/releases/tag/alpha3

Example:

Code:
20096    和    動詞    非自立可能    *    サ行変格    為る    スル    する    スル    スル    0    20092    する    スル    スル    *    3    仕る    スル    スル    0    1
12880    和    動詞    非自立可能    *    上一段-ア行    居る    イル    いる    イル    イル    0    12823    居る    イル    イル    0    57
...
9843    和    動詞    一般    *    五段-ワア行    言う    イウ    言う    イウ    イウ    0    4292    いう    ユウ    ユー    *    4029    いう    イウ    イウ    0    792    言う    ユウ    ユー    0    714    ゆう    ユウ    ユー    *    14    謂う    イウ    イウ    0    1    ゆう    ユウ    ユー    0    1
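
The grouping itself amounts to summing per-spelling counts under a shared lemma key. A toy sketch (counts loosely transcribed from the 言う line above, structure simplified):

Code:
# Sketch of rolling per-spelling counts up into one lexeme count.
from collections import defaultdict

# (surface spelling, count) pairs sharing the lemma 言う; repeated
# spellings correspond to different readings in the real output
spellings = [("いう", 4292), ("いう", 4029), ("言う", 792),
             ("ゆう", 714), ("謂う", 14), ("言う", 1), ("ゆう", 1)]

by_lexeme = defaultdict(int)
for _spelling, count in spellings:
    by_lexeme["言う"] += count  # every spelling feeds the same lexeme

print(dict(by_lexeme))  # {'言う': 9843}, the lexeme total above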

[Image: GlbLGea.png]
Edited: 2017-07-22, 8:23 am
#59
@wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" option not only allowed me to sort by order of appearance; I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for each word.

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very functional to both study individual words (with the sentence as reference) and then read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was split up multisentence paragraphs into individual "sentences" using "。?、!" as markers of sorts. This left the majority of lines under 50 characters.
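
A rough sketch of that splitting step in Python (illustrative only; the real work was done in a spreadsheet): break paragraphs after each of 。？、！ while keeping the mark attached to its line.

Code:
# Sketch of splitting multisentence paragraphs into short lines.
import re

def split_lines(paragraph):
    # split after 。？、！ (the lookbehind keeps the punctuation) and
    # drop any empty leftovers
    return [p for p in re.split(r"(?<=[。？、！])", paragraph) if p.strip()]

text = "ハリーは走った、とても速く！ドアが開いた。誰だ？"
for line in split_lines(text):
    print(line)
# ハリーは走った、
# とても速く！
# ドアが開いた。
# 誰だ？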

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: if you could somehow get the program to do the above in an automated fashion, that would be great.
#60
(2017-07-22, 7:51 pm)Nukemarine Wrote: @wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" option not only allowed me to sort by order of appearance; I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for each word.

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very functional to both study individual words (with the sentence as reference) and then read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was split up multisentence paragraphs into individual "sentences" using "。?、!" as markers of sorts. This left the majority of lines under 50 characters.

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: if you could somehow get the program to do the above in an automated fashion, that would be great.
I think you did an interesting exercise before, comparing HP vocab with Core 2k. At that time your list had 7000 entries and now it has only 5400. That is a huge difference. Is that all because of the improved word detection/analysis in the analyzer?
#61
(2017-07-23, 1:46 pm)Matthias Wrote:
(2017-07-22, 7:51 pm)Nukemarine Wrote: @wareya,

I wanted to show you a second thing I was able to do with your program. The "show line number" option not only allowed me to sort by order of appearance; I then took the spreadsheet and had it include the actual line of text. That was on top of using EPWING2Anki to include the basic definition for each word.

Here's the updated Harry Potter Book 1 spreadsheet. It turns out to be very functional to both study individual words (with the sentence as reference) and then read the actual page or chapter. I decided to make this Harry Potter one after successfully studying 60 pages of drama transcripts the same way. The only major thing I did with the Harry Potter text was split up multisentence paragraphs into individual "sentences" using "。?、!" as markers of sorts. This left the majority of lines under 50 characters.

Anyway, thanks again. I seriously could not have done this without your program. If you're looking for more suggestions: if you could somehow get the program to do the above in an automated fashion, that would be great.
I think you did an interesting exercise before, comparing HP vocab with Core 2k. At that time your list had 7000 entries and now it has only 5400. That is a huge difference. Is that all because of the improved word detection/analysis in the analyzer?
I filter out entries that appear in scans of Core 2k and Tae Kim sentences. The unfiltered list is still 7000 or more, iirc. Also, of the 5400 there are still a lot of terms that are useless for anything outside the book series.
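
That kind of filtering is just a set difference. A sketch with hypothetical file names (one word per line), not my actual workflow:

Code:
# Sketch of filtering a word list against already-known terms.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

known = load_words("core2k_words.txt") | load_words("taekim_words.txt")
hp_terms = load_words("harry_potter_terms.txt")

new_terms = hp_terms - known  # keep only words neither source covers
print(len(new_terms), "terms left to study")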

That said, I'm tempted to officially study either this or Norwegian Wood in a way similar to when I studied the transcripts for episodes 1-4 of Zettai Kareshi. In my mind, the first 4000 lines of text (out of 15,000 total) seem reasonable. Like the transcripts, it's not really about memorizing words but understanding the sentences the words appear in. Basically, study ~500 vocabulary/lines, then read the text up to that point out loud at the end. It takes time, but the end result is amazing.
#62
(2017-07-23, 8:24 pm)Nukemarine Wrote: I filter out entries that appear in scans of Core 2k and Tae Kim sentences. The unfiltered list is still 7000 or more, iirc. Also, of the 5400 there are still a lot of terms that are useless for anything outside the book series.

That said, I'm tempted to officially study either this or Norwegian Wood in a way similar to when I studied the transcripts for episodes 1-4 of Zettai Kareshi. In my mind, the first 4000 lines of text (out of 15,000 total) seem reasonable. Like the transcripts, it's not really about memorizing words but understanding the sentences the words appear in. Basically, study ~500 vocabulary/lines, then read the text up to that point out loud at the end. It takes time, but the end result is amazing.
OK, that means no significant difference from the first analysis. So the detection/recognition quality probably hasn't changed much.

If you filter out the terms which do not have an English definition (e.g. names), you have about 4,900 terms [for which the given translation quite often seems mispicked]. Not surprisingly, there are more new terms in the first chapters than in the last, and the average frequency goes down too:

Chapter   New terms   Avg. frequency
1         618         7.2
2         357         4.4
3         418         3.4
4         351         4.2
5         530         3.2
6         360         4.0
7         329         3.0
8         201         1.9
9         291         2.2
10        210         1.7
11        158         1.8
12        222         1.8
13        125         1.6
14        150         1.4
15        198         1.5
16        235         1.3
17        143         1.1
Total     4,896       3.3
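
For anyone wanting to reproduce numbers like these: a sketch, assuming a list of (chapter, word) occurrences rather than the actual spreadsheet layout:

Code:
# Sketch of counting new terms per chapter and their average
# in-book frequency, as in the table above.
from collections import Counter, defaultdict

# (chapter, word) for every token in reading order (toy data)
occurrences = [(1, "魔法"), (1, "杖"), (1, "魔法"), (2, "杖"), (2, "城")]

freq = Counter(word for _, word in occurrences)
first_seen = {}
for chapter, word in occurrences:
    first_seen.setdefault(word, chapter)  # chapter of first appearance

new_terms = defaultdict(list)
for word, chapter in first_seen.items():
    new_terms[chapter].append(word)

for chapter in sorted(new_terms):
    words = new_terms[chapter]
    avg = sum(freq[w] for w in words) / len(words)
    print(chapter, len(words), round(avg, 1))
# 1 2 2.0  (魔法 and 杖 are new in chapter 1)
# 2 1 1.0  (only 城 is new in chapter 2)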

I think you are right that there are quite a few terms which are useless outside the book series. Conversely, that also means Core 10k leaves a lot uncovered.


Did you do the Zettai Kareshi transcript analysis for the complete 1-4 corpus (not one by one)? If so, could you link that too? It would be interesting to see how the vocab differs from the HP analysis.
#63
(2017-07-25, 3:46 pm)Matthias Wrote:
(2017-07-23, 8:24 pm)Nukemarine Wrote: I filter out entries that appear in scans of Core 2k and Tae Kim sentences. The unfiltered list is still 7000 or more, iirc. Also, of the 5400 there are still a lot of terms that are useless for anything outside the book series.

That said, I'm tempted to officially study either this or Norwegian Wood in a way similar to when I studied the transcripts for episodes 1-4 of Zettai Kareshi. In my mind, the first 4000 lines of text (out of 15,000 total) seem reasonable. Like the transcripts, it's not really about memorizing words but understanding the sentences the words appear in. Basically, study ~500 vocabulary/lines, then read the text up to that point out loud at the end. It takes time, but the end result is amazing.
OK, that means no significant difference from the first analysis. So the detection/recognition quality probably hasn't changed much.

If you filter out the terms which do not have an English definition (e.g. names), you have about 4,900 terms [for which the given translation quite often seems mispicked]. Not surprisingly, there are more new terms in the first chapters than in the last, and the average frequency goes down too:

*snip*

Did you do the Zettai Kareshi transcript analysis for the complete 1-4 corpus (not one by one)? If so, could you link that too? It would be interesting to see how the vocab differs from the HP analysis.

I did the scans one by one for simplicity's sake, but the distribution appears similar. Frequent terms naturally show up early on just because of the nature of language. Personally, I'm not going to take it too deep. Instead, this is more a byproduct of my own studies as I extend them to include Japanese dramas and books. Basically, I'm going to do with books what I did with dramas (with a bit more use of text-to-speech on single sentences).

I'm just glad that wareya's program is working out so well. The addition of an example sentence from the source made all the difference in turning vocabulary study into a fun endeavor. It's a shame no one is using my Memrise course, though it is a 10-year-old drama. The fact that I will be doing Murakami's "Norwegian Wood" might change that, though.
#64
https://github.com/wareya/analyzer/releases/tag/alpha4

Added functionality to make it easier to use the analyzer for mining. Written by someone else; detailed in https://github.com/wareya/analyzer/pull/1

[Image: 1W6yYdl.png]
#65
Hi wareya,

I was the one who submitted the code for the sentence improvements. I've started experimenting with n+1 sentence mining. I'm not sure how it will fit into the UI with so many options already, but we'll see how it goes. I've started a thread specific to that here so as not to detract from this thread too much.
#66
Updated the user dictionary. This fixes the numerical "word type" IDs for a bunch of Sino-Japanese nouns, which previously made kuromoji skip them sometimes (for some reason...), and adds family-member terms, e.g. お父・おとう, because the plain forms (父・ちち) were overriding them; unidic isn't good enough to distinguish them in context.

This is an update to alpha4 since no code changed, just a file.
Edited: 2017-08-28, 1:32 am