cb's Japanese Text Analysis Tool - Printable Version

+- kanji koohii FORUM (http://forum.koohii.com)
+-- Forum: Learning Japanese (http://forum.koohii.com/forum-4.html)
+--- Forum: Learning resources (http://forum.koohii.com/forum-9.html)
+--- Thread: cb's Japanese Text Analysis Tool (/thread-9459.html)
cb's Japanese Text Analysis Tool - yogert909 - 2014-08-22

Hi, Thanks! This is a great tool that I'll be sure to use very soon. I hope you don't mind me making a feature request; I imagine it could be useful. In addition to the user-based readability report, wouldn't it be great to have a list of all the unknown words/kanji so that you could study them?

cb's Japanese Text Analysis Tool - Termy - 2014-08-22

yogert909 Wrote: Hi, Thanks! This is a great tool that I'll be sure to use very soon.

You can make a list of unknown words and kanji. For example, export all of your Anki decks into txt files, analyze them (word and kanji frequency), use "combine frequency reports" under tools, and finally use "compare frequency reports" to compare your merged Anki txt file with whatever txt file you want to check for unknown words/kanji.

cb's Japanese Text Analysis Tool - cb4960 - 2014-08-22

Termy Wrote:
yogert909 Wrote: Hi, Thanks! This is a great tool that I'll be sure to use very soon.
You can make a list of unknown words and kanji. For example, export all of your Anki decks into txt files, analyze them (word and kanji frequency), use "combine frequency reports" under tools, and finally use "compare frequency reports" to compare your merged Anki txt file with whatever txt file you want to check for unknown words/kanji.

Additionally, you can take a look at Word List Duplicate Remover.

cb's Japanese Text Analysis Tool - yogert909 - 2014-08-22

Oh, thanks! I thought I probably wasn't the first person to think of this.
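The comparison Termy describes boils down to a set difference between the words in a text and the words you already know. Here is a minimal Python sketch of that idea. It is not how JTAT itself does it: the report layout (one `count<TAB>word` entry per line) and the function names `read_report` and `unknown_words` are assumptions made for illustration.

```python
def read_report(lines, word_column=1):
    """Map each word to its count.

    Assumes (hypothetically) tab-separated lines of the form 'count<TAB>word'.
    """
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= word_column or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        counts[fields[word_column]] = int(fields[0])
    return counts


def unknown_words(text_report, known_report):
    """Return (word, count) pairs found in the text but not in the known list,
    most frequent first."""
    text_counts = read_report(text_report)
    known = set(read_report(known_report))
    unknown = [(w, c) for w, c in text_counts.items() if w not in known]
    return sorted(unknown, key=lambda item: -item[1])


# Toy data standing in for a novel's report and a merged Anki report.
novel = ["12\t猫", "7\t走る", "3\t吾輩"]
anki = ["40\t猫", "15\t食べる"]
print(unknown_words(novel, anki))
```

Sorting by frequency puts the words that would pay off most in that particular text at the top of the study list.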
cb's Japanese Text Analysis Tool - Gayle - 2014-08-25

5.0 does indeed generate blank word frequency reports from some files. Switching to 4.4 worked for me.

cb's Japanese Text Analysis Tool - Termy - 2014-08-26

I could have sworn that 4.4 didn't work when I tried it, right before installing 5.0, but when I switched back to 4.4 now it works fine.

cb's Japanese Text Analysis Tool - ryuudou - 2014-09-02

This blank file thing is happening for me too. Happens selectively with 4.4 as well. Sometimes with MeCab and always with JParser.

cb's Japanese Text Analysis Tool - cb4960 - 2014-09-02

Thanks for the responses. I'll set the primary download on SourceForge to v4.4 until I have the time to investigate the issue further.

cb's Japanese Text Analysis Tool - yogert909 - 2014-09-10

I'm on a Mac and can't get v4.4 to open in Parallels (Windows says it is not a proper exe file or something). It opens in Wine, and using JParser it exports a kanji report, but the word report is blank. MeCab crashes the program. I so want to get this working. Any suggestions?

cb's Japanese Text Analysis Tool - cb4960 - 2014-09-11

yogert909 Wrote: I'm on a Mac and can't get v4.4 to open in Parallels (Windows says it is not a proper exe file or something). It opens in Wine, and using JParser it exports a kanji report, but the word report is blank. MeCab crashes the program.

Other users are having issues with v4.4 even using Windows. I've tried to reproduce the issues on a number of other computers using various forms of Windows (Windows 7 Home Premium 64-bit, Windows 7 Enterprise 64-bit, Windows 7 Ultimate 32-bit, Windows 8 64-bit, Windows XP 32-bit, Windows Vista 32-bit) without success. Getting it to work properly on a Mac might not be possible and I don't know enough about the Mac environment to provide any real help. Maybe your Wine environment is missing whatever version of the Visual Studio runtime MeCab uses.
If you have a one-time analysis that you would like to perform, feel free to send me your files and I will analyze them for you.

cb's Japanese Text Analysis Tool - aldebrn - 2014-09-11

cb4960 Wrote:
yogert909 Wrote: I'm on a mac and can't get v4.4 to open in parallels (windows says it is not a proper exe file or something)
Other users are having issues with v4.4 even using Windows. ... Getting it to work properly on a Mac might not be possible and I don't know enough about the Mac environment to provide any real help.

Windows running in a Parallels virtual machine could be a good way to look into the problem, since the virtualization layer can (hopefully) be assumed to be fixed even if your host (Mac, Win) changes. I would try making a clean install of Win7 or whatever inside VirtualBox (free, robust, works great with Mac and Win hosts) and seeing what happens with version 4.4.

Or compile MeCab with Emscripten and rewrite the tool itself in JavaScript so it all runs in a browser, the ultimate cross-platform layer.
cb's Japanese Text Analysis Tool - yogert909 - 2014-09-11

@CB Thanks for your offer and for taking the time to respond. It's strange because last night I tried installing Wine via Homebrew instead of MacPorts and I got the same MeCab crashing and blank file behavior. I'm going to try installing another version of Windows in Parallels this weekend and see if that does the trick. I might even try a dual boot if that doesn't do it. I'm obsessed with making it work!

@aldebrn I think that'll be the easiest - installing a new version of Windows. Rewriting in JavaScript is a bit outside my capabilities. I found a bash script that I might mess around with though.

cb's Japanese Text Analysis Tool - Sebastian - 2014-09-11

yogert909 Wrote: I might even try a dual boot if that doesn't do it. I'm obsessed with making it work!

Or you could Run Windows Vista/7 From a Flash Drive Without Installing.

cb's Japanese Text Analysis Tool - yogert909 - 2014-09-16

A fresh install of Windows 7 did the trick. I was able to make good use of the text analysis tool and Word List Duplicate Remover. Thanks!

cb's Japanese Text Analysis Tool - blindbox - 2015-02-23

Thanks for the software cb4960. You really saved a lot of people's time with this one. I was thinking of writing a simple script for kanji analysis myself, but you came up with full-blown software plus a GUI!

Anyway, any chance that you can add the order of appearance of kanji and words as well? I prefer to learn by the order of appearance, rather than the frequency.

EDIT: On second thought, I don't like to learn by order of appearance after all. There are 8000 words in one volume of a light novel I'm trying to read!

cb's Japanese Text Analysis Tool - cb4960 - 2015-02-24

blindbox Wrote: Thanks for the software cb4960. You really saved a lot of people's time with this one. I was thinking of writing a simple script for kanji analysis myself, but you came up with a full-blown software plus GUI!

Still, not a terrible idea...
Maybe I can add a new column containing the order in which the words/kanji were found. Maybe you can run the tool once for each new chapter so you aren't presented with 8000 words all at one time.

cb's Japanese Text Analysis Tool - Zarxrax - 2015-02-24

I think having that order would be useful.

cb's Japanese Text Analysis Tool - Zarxrax - 2015-04-04

This tool fails when it encounters certain gibberish or control characters, as can be seen in this example file:
https://mega.co.nz/#!xQEjARRY!S_GCZtd7Dso_EZbD_FHKAiaWPjMI6Fh14hecqoAtIaQ

Can this be fixed so that it will bypass the junk and still get all of the data from the Japanese?

cb's Japanese Text Analysis Tool - blindbox - 2015-05-10

Gayle Wrote: 5.0 does indeed generate blank word frequency reports from some files. Switching to 4.4 worked for me.

ryuudou Wrote: This blank file thing is happening for me too.

cb4960 Wrote: Thanks for the responses. I'll set the primary download on SourceForge to v4.4 until I have the time to investigate the issue further.

@cb4960: I managed to track down this issue. If I put JTAT in a folder whose path has no spaces, e.g. E:\JTAT\, the blank word report issue using JParser doesn't occur. However, if I put JTAT in a folder whose path contains a space, e.g. E:\J T A T\, the blank word report issue does occur.

To summarize, here are my tests.

E:\JTAT\ - no blank
E:\J T A T\ - blank word report
E:\foobar\JTAT\ - no blank
E:\foo bar\JTAT\ - blank word report
E:\foobar\J T A T\ - blank word report

This has been tested with both MeCab and JParser, with the same behavior, on JTAT 5.0.

In the meantime, others can place JTAT in a folder or subfolder without spaces anywhere in its path. The output location is not affected by this bug. That is, JTAT must be in a folder like one of the following:

E:\JTAT\
E:\foobar\JTAT\
E:\foo\bar\JTAT\

cb's Japanese Text Analysis Tool - cb4960 - 2015-05-11

Thanks for investigating this. I'll try to reproduce this later with spaces in the path. If this is the issue, it should be an easy fix.
Update: I'm unable to reproduce this. For me it works fine with or without spaces in the path.

cb's Japanese Text Analysis Tool - blindbox - 2015-05-23

Hm, I just tried it again myself, and word_freq_report.txt is still blank (both MeCab and JDICT) if my folders or subfolders have spaces. Tested on 5.1.

Maybe some sanitization in the source code is required? Maybe a DLL version mismatch issue? I'm running the software on Windows 7. I've tried to install .NET 3.5 but I can't (I also tried the workaround that tells you how to activate .NET 3.5 on Windows 7). Gonna try updating Windows.

cb's Japanese Text Analysis Tool - jcdietz03 - 2015-06-23

cb4960's tools that need novel texts to work are not very useful to me because I don't have these novel texts or know where to get them. His tools that don't need novel texts to work, namely Rikaisama and the OCR one, are very useful, I found.

cb's Japanese Text Analysis Tool - yogert909 - 2015-06-23

jcdietz03 Wrote: cb4960's tools that need novel texts to work are not very useful to me because I don't have these novel texts or know where to get them.

What tools need novel texts to work? His Japanese text analysis tool just needs a Japanese text to analyze...but no surprise there.

cb's Japanese Text Analysis Tool - wareya - 2015-07-01

Are you aware of this morpheme splitter?
http://atilika.com/en/products/kuromoji.html

It uses a statistical/context model that keeps it from doing shit like misinterpreting それ+は as the adverb ソレハ. It also handles made-up words and names elegantly.

Old demo/talk (1-2 yrs):

I'd be interested in seeing support for this because it avoids the ソレハ thing, which seems to be a problem with the tools we're currently using. Everything I've gotten out of JTAT has way too many false positives of ソレハ. I would just blacklist bad terms, but the misparse lowers the ranking of それ regardless, so I don't see a point.

Also, it generally seems to be an incredibly good splitter for Japanese.
cb's Japanese Text Analysis Tool - cb4960 - 2015-07-01

wareya Wrote: Are you aware of this morpheme splitter?

Can you provide an example sentence that demonstrates this behavior?