cb's Japanese Text Analysis Tool

(2015-11-30, 6:33 pm)yogert909 Wrote:
Zarxrax Wrote: Anyone know a tool or some code I can use to take the word frequency report, and limit it to only words which contain kanji?
If you open the report in a text editor which supports regular expressions, search the following, and replace with nothing.

Search for:  ^((?![\x{4e00}-\x{9faf}]).)*\n
Replace with:

Thanks, but this doesn't seem to be working for me.

Edit: whoops, looks like the parts of speech column is throwing it off. I'll just knock that off of there and it looks like it should work then.

Edit 2: Hmm, nope, that doesn't seem to be it. Something's not working right for me. I'm trying in Notepad++, and have also tested on https://regex101.com/. Regex is Greek to me, so I don't know exactly what to look for...
Edited: 2015-12-01, 5:32 pm
Reply
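For anyone who would rather not fight regex flavors in a text editor, the same filtering can be sketched in a few lines of Python. This is only a sketch under assumptions: it treats the report as tab-separated with the word in the second column (the `word_col` and `sep` parameters are hypothetical, and the sample rows are made up), and it checks only the word column so that kanji in a parts-of-speech column cannot cause false matches.

```python
import re

# CJK Unified Ideographs: the same range the regex in this thread targets.
KANJI = re.compile(r"[\u4e00-\u9faf]")

def keep_kanji_words(lines, word_col=1, sep="\t"):
    """Keep only the lines whose word column contains at least one kanji.

    word_col and sep are assumptions about the report layout; adjust them
    to match the actual file.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) > word_col and KANJI.search(fields[word_col]):
            kept.append(line)
    return kept

# Made-up sample rows: count, word, part of speech.
sample = [
    "120\t食べる\t動詞",
    "98\tする\t動詞",
    "45\t電子音\t名詞",
    "30\tテレビ\t名詞",
]

for line in keep_kanji_words(sample):
    print(line)
```

To run it on a real report, read the file's lines with the encoding it was saved in and pass them to `keep_kanji_words`, changing `word_col` if the word sits in a different column.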
Has anyone ever gotten this to work on a mac with wine? I'd like to avoid using boot camp to download windows, but it sure sucks not being able to use tools like this one! :(
Reply
I use the tool in wine and it works great.  However the computer I'm using it on is 32bit, so I'm using version 4.2 which I believe is the last 32 bit version.  You could try 4.2 and if it doesn't work for you, then it's probably something missing in your wine package.  I know it won't work without .net installed for instance.
Reply
(2016-07-07, 4:07 pm)yogert909 Wrote: I use the tool in wine and it works great.  However the computer I'm using it on is 32bit, so I'm using version 4.2 which I believe is the last 32 bit version.  You could try 4.2 and if it doesn't work for you, then it's probably something missing in your wine package.  I know it won't work without .net installed for instance.

After investigating, my mac is 64bit. 


[Image: xQpuZFX.png]

I feel like maybe I'm just not checking the right circles or something??? And is the .NET mentioned in the image the same one you're referring to? How frustrating lol. All I want is some frequency lists! :(
Reply
Hmm, I installed wine using macports. I haven't used this installer so I'm not sure what it's doing although it appears to install mono(.net). Is this wine bottler or wineskin? I tried messing around with them a few years ago and had a lot of problems.

You might want to try installing wine using this tutorial and then install mono, which is an open-source implementation of .net. Just type: brew install mono
Reply
(2016-07-07, 5:10 pm)comatosebunny Wrote: After investigating, my mac is 64bit. 

Wine is not 64bit and most likely will never be 64bit on Mac.
Reply
Thanks Tokyostyle.  That's good to know.  So it looks like we're stuck with version 4.2 of the Japanese Text Analysis Tool.

Edit:  It looks like this was my 1,000th post.
Edited: 2016-07-07, 7:23 pm
Reply
(2016-07-07, 7:21 pm)yogert909 Wrote: So it looks like we're stuck with version 4.2 of the Japanese Text Analysis Tool.

This is a bit annoying because .NET supports mixed-mode, where it compiles to 32bit/64bit on the fly, just fine. I know cb4960 likes to use a lot of external tools for the parsing but surely those also come in 32bit flavors. If the same thing happens to subs2srs then I guess I'll have to look into it more. Right now JTAT isn't a part of my study flow so there isn't much motivation to look into a solution.
Reply
(2016-07-07, 6:46 pm)yogert909 Wrote: Hmm, I installed wine using macports.  I haven't used this installer so I'm not sure what it's doing although it appears to install mono(.net).  Is this wine bottler or wineskin?  I tried messing around with them a few years ago and had a lot of problems.

You might want to try installing wine using this tutorial and then install mono, which is an open-source implementation of .net. Just type: brew install mono

Thank you so much for taking the time to find that tutorial! I got it to work! :D
Reply
Thank you for creating this software and creating an interest in studying from self-made corpora, cb! It's heartening to see so many people finding importance in frequency-based study. I was wondering if any JTAT users would be interested in evaluating a very similar app: my dissertation project, which is based on the same principles of word frequency analysis. The main differences, as far as I recall from when I last used JTAT, are:
  • My software consumes Wikipedia articles, not user text files
  • It is a web app, and thus is highly compatible with all systems and requires no installation
  • It provides definitions, based on part-of-speech
  • It provides example sentences from within the source
  • It allows filtering out of tokens based on JLPT level
  • It filters out symbols and numbers, etc. and just keeps what I consider to be words
  • There is a quiz module to assess one's vocabulary progress
Overall, it is more calibrated towards study than linguistic text analysis. I've made it accessible online until around August 18th. Full details at this thread. A screenshot of the software is provided below. I hope people can give some input! :)

[Image: 0UhqnDY.png]
Edited: 2016-07-29, 9:39 am
Reply
About the Word Frequency Report options: how am I supposed to use them? I copy and paste everything from the report into Excel. I copy the B column (the one that has the kanji) and paste it into Notepad. I add that as an exception list but it's not working; the kanji which were supposed to be exceptions are still showing up. Do I have to format it or something?
Can someone explain step by step? (I'm a noob when it comes to these things.)
Reply
(2016-09-08, 11:13 am)Bull007 Wrote: About the Word Frequency Report options: how am I supposed to use them? I copy and paste everything from the report into Excel. I copy the B column (the one that has the kanji) and paste it into Notepad. I add that as an exception list but it's not working; the kanji which were supposed to be exceptions are still showing up. Do I have to format it or something?
Can someone explain step by step? (I'm a noob when it comes to these things.)

What's an "exception list"?
Reply
I just discovered this tool and it is absolutely fantastic.

For anyone who uses the tool regularly, I have a question. I am basically comfortable with all of the joyo (including the 2010 additions) and am working on learning more kanji beyond joyo. I would love it if it were possible to take the kanji frequency JTAT produces, and run some sort of comparison on it to flag or remove all the kanji that are in the joyo list, so I don't have to eliminate them manually. I have a spreadsheet that would allow me to easily generate a csv-type document that has all the joyo kanji, but I don't know if it is possible to do anything like what I am describing. Help?

From what I can see, it is possible to do this with a word list, but not with a kanji list. I did try it with two kanji lists and got an error.

Obviously I can just manually remove all the kanji I know from the frequency report, but when you know 2000 kanji, that takes kind of a long time... and starting from the "least frequent" end of the list feels like it would defeat the purpose of learning higher frequency kanji first. (That is, I want to start with the highest frequency kanji that are not currently in my known stock.)

***

ETA: never mind, I figured it out!

For anyone else who has the same problem: my mistake was not running my list of joyo kanji through JTAT before attempting to compare the two documents. The comparison tool in the Tools menu is meant to work with two similarly formatted frequency reports. Once I generated a kanji frequency report from my own joyo list and compared it to the frequency report from my novel, it worked like a charm. Such a great tool -- thank you cb!!!
Edited: 2017-01-09, 8:32 pm
Reply
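The same subtraction can also be done outside JTAT once the kanji and their counts have been pulled out of the report. A minimal sketch, assuming the report has already been parsed into (kanji, count) pairs; the function name and the sample data are made up for illustration, and this is not how JTAT's own comparison tool works internally.

```python
def unknown_kanji(report_rows, known):
    """Drop every kanji in `known` and sort the rest by count, highest first.

    report_rows: iterable of (kanji, count) pairs taken from a frequency report.
    known: any iterable of kanji characters, e.g. a string holding a joyo list.
    """
    known_set = set(known)
    remaining = [(kanji, count) for kanji, count in report_rows
                 if kanji not in known_set]
    # Highest-frequency unknown kanji come first, which is the study order
    # described above.
    return sorted(remaining, key=lambda row: row[1], reverse=True)

# Made-up sample data.
rows = [("日", 50), ("鬱", 7), ("本", 40), ("麒", 12)]
known = "日本"

for kanji, count in unknown_kanji(rows, known):
    print(kanji, count)
```

Because `known` can be any iterable of characters, a joyo list exported from a spreadsheet can be read into one string and passed in directly.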
Feature request:
Ability to ignore words according to their part of speech, or limit the report to certain parts of speech only.
This would allow, for instance, keeping particles out of your word list, or restricting yourself to just nouns. It would be very useful when creating lists intended for study.
Reply
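Until something like that exists in the tool, a report that includes a part-of-speech column could be filtered after the fact. A rough sketch, assuming a tab-separated report with the part of speech in the third column; `pos_col`, `sep`, the tag strings, and the sample rows are all assumptions to check against the real file.

```python
def filter_by_pos(lines, keep=None, drop=None, pos_col=2, sep="\t"):
    """Filter report lines by the value in their part-of-speech column.

    keep: if given, only rows whose part of speech is in this set survive.
    drop: rows whose part of speech is in this set are removed.
    pos_col and sep are assumptions about the report layout.
    """
    keep = set(keep) if keep is not None else None
    drop = set(drop or ())
    result = []
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) <= pos_col:
            continue  # skip malformed rows with no part-of-speech field
        pos = fields[pos_col]
        if pos in drop:
            continue
        if keep is not None and pos not in keep:
            continue
        result.append(line)
    return result

# Made-up sample rows: count, word, part of speech.
rows = ["300\tは\t助詞", "120\t食べる\t動詞", "45\t電子音\t名詞"]

print(filter_by_pos(rows, drop={"助詞"}))  # particles removed
print(filter_by_pos(rows, keep={"名詞"}))  # nouns only
```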
I wonder if anyone might have any insight on an oddity I have found in this application. It seems that I am (rarely) getting hits for words that don't exist at the frequency that it claims.

For instance, my frequency report found 2065 occurrences for 電子音, but after a thorough search of my input text, I discovered that this word actually appeared only 10 times. That's a HUGE difference. Now it's got me wondering how accurate all the other numbers are.
Reply
(2017-04-21, 9:15 pm)Zarxrax Wrote: I wonder if anyone might have any insight on an oddity I have found in this application. It seems that I am (rarely) getting hits for words that don't exist at the frequency that it claims.

For instance, my frequency report found 2065 occurrences for 電子音, but after a thorough search of my input text, I discovered that this word actually appeared only 10 times. That's a HUGE difference. Now it's got me wondering how accurate all the other numbers are.

Is it possibly counting all instances of 電子 + all instances of 音? I don't really understand how the analyzer works.

(Also, I just had to use yomichan to verify that the compound should be pronounced denshi-on, not denshi-oto. Sometimes I really hate Japanese.)
Reply
(2017-04-21, 9:27 pm)tanaquil Wrote: Is it possibly counting all instances of 電子 + all instances of 音? I don't really understand how the analyzer works.

(Also, I just had to use yomichan to verify that the compound should be pronounced denshi-on, not denshi-oto. Sometimes I really hate Japanese.)

If that were the case, 電子 should show up as a separate word on its own line from 音. That's really strange. From my understanding, the text analyser de-inflects and tokenizes words via MeCab or JParser and then simply counts the tokens.

@ Zarxrax, I wonder what would happen if you switch tokenizers - if you get different results.
Reply
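To illustrate the "tokenize, then count" idea from the post above: once the text has been split into tokens, the frequency report is essentially a tally. The sketch below uses hand-written token lists as a stand-in for real tokenizer output (it does not call MeCab or JParser) and is not JTAT's actual code.

```python
from collections import Counter

def count_tokens(tokens):
    """Tally how many times each token appears."""
    return Counter(tokens)

# Stand-ins for tokenizer output on the same sentence. Whether 電子音 is
# kept whole or split depends on the tokenizer and its dictionary.
tokens_compound = ["電子音", "が", "鳴る"]
tokens_split = ["電子", "音", "が", "鳴る"]

whole = count_tokens(tokens_compound)
split = count_tokens(tokens_split)

print(whole["電子音"])  # counted once as a single word
print(split["電子音"])  # missing keys count as zero in a Counter
print(split["電子"], split["音"])  # the parts are tallied on their own lines
```

Under this model a split compound would add to the counts of its parts rather than to the compound itself, which matches the reasoning in the post.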
(2017-04-21, 9:54 pm)yogert909 Wrote: @ Zarxrax, I wonder what would happen if you switch tokenizers - if you get different results.

Looks like re-running it is getting me different results now, and the word is showing up the correct number of times.
I should be running on the same input that I originally used, but most of the numbers are slightly different than before. My original run was a couple months ago, so there is no way to verify exactly what happened. I guess I will just chalk this up as user error on my part until I can verify it further.
Edited: 2017-04-22, 8:14 pm
Reply
Thank you so much for this amazing tool!


Suddenly the Word Frequency Report has stopped working. It generates an empty file, even with the same old files it worked with previously. The other reports work fine.


I use a small txt file (16KB) in Unicode format in Windows 10 (x64)

What can I do?
Reply
(2017-05-18, 4:44 pm)chuhamasaki Wrote: Thank you so much for this amazing tool!


Suddenly the Word Frequency Report has stopped working. It generates an empty file, even with the same old files it worked with previously. The other reports work fine.


I use a small txt file (16KB) in Unicode format in Windows 10 (x64)

What can I do?

Did you try downloading the software again in case it was broken somehow?
Reply
(2017-05-18, 4:44 pm)chuhamasaki Wrote: Thank you so much for this amazing tool!


Suddenly the Word Frequency Report has stopped working. It generates an empty file, even with the same old files it worked with previously. The other reports work fine.


I use a small txt file (16KB) in Unicode format in Windows 10 (x64)

What can I do?

Have you found a solution? I'm having the same issue: kanji frequency and readability work fine, but the Word Frequency Report just gives me a blank text file. I tried downloading the software again, and different versions as well (only 4.2 seems to work). I also tried them on a few computers with different OSes (Win 8 and 10), all of them x64, but I got nothing. I hope you can help me. Until then I'll keep using 4.2.
Reply
(2017-07-23, 8:12 am)neko_san Wrote:
(2017-05-18, 4:44 pm)chuhamasaki Wrote: Thank you so much for this amazing tool!


Suddenly the Word Frequency Report has stopped working. It generates an empty file, even with the same old files it worked with previously. The other reports work fine.


I use a small txt file (16KB) in Unicode format in Windows 10 (x64)

What can I do?

Have you found a solution? I'm having the same issue: kanji frequency and readability work fine, but the Word Frequency Report just gives me a blank text file. I tried downloading the software again, and different versions as well (only 4.2 seems to work). I also tried them on a few computers with different OSes (Win 8 and 10), all of them x64, but I got nothing. I hope you can help me. Until then I'll keep using 4.2.

Try changing the folder that the program is in to a path that does not contain spaces or Japanese characters. Maybe the input/output/wordlist files and paths too. I'm not sure of the exact cause, but I'm pretty sure it's a combination of the above.
Reply
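A quick way to check a path against the suggestion above is to look for spaces and non-ASCII characters. This is just a sketch: the exact cause of the empty report is unconfirmed, and the function name and example paths are made up.

```python
def path_problems(path):
    """List the reasons a path might trip up tools that mishandle such paths."""
    problems = []
    if " " in path:
        problems.append("contains a space")
    if not path.isascii():  # str.isascii() requires Python 3.7 or later
        problems.append("contains non-ASCII characters")
    return problems

# Hypothetical example paths.
for candidate in [r"C:\Tools\JTAT", r"C:\Program Files\JTAT", "C:\\日本語\\JTAT"]:
    print(candidate, path_problems(candidate) or "looks fine")
```

The same check can be run on the program folder as well as on the input, output, and word list paths.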