Is it somehow possible to see the most frequent katakana words?
EDIT: If I search with kana or kanji, (P) doesn't work.
Last edited by Yufina (2011 February 21, 10:07 am)
Yufina wrote:
Is it somehow possible to see the most frequent katakana words?
To see the most frequent katakana words, check "RegEx" and then copy-paste the following regular expression into the search box:
^[゠-ヿ]+$
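For anyone who wants the same filter outside cbJisho, here is a minimal Python 3 sketch of what that expression does (it matches strings made up entirely of characters from the katakana block, U+30A0 through U+30FF; this is just an illustration, not cbJisho's own code):

import re

katakana_only = re.compile(r'^[゠-ヿ]+$')

for word in ['コンピュータ', 'テレビ', '勉強', 'すごい']:
    print(word, bool(katakana_only.match(word)))
# コンピュータ and テレビ match; 勉強 and すごい do not.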
Yufina wrote:
EDIT: If I search with kana or kanji, (P) doesn't work.
You're right! Thanks for the bug report! JLPT doesn't work either.
Edit: Fixed the regular expression (should've tested first).
Last edited by cb4960 (2011 February 22, 10:04 pm)
Maybe you should add some kind of button for showing the most frequent katakana words?
Also, I think I found another bug: I tried to search for あまり, but it won't find the word 余り (あまり).
Hello,
I have just released version 3.0 of cbJisho.
For Windows users:
Download cbJisho v3.0 via Google Code
For Linux/Mac users:
Python 2 Version:
Download cbJisho v3.0 source code python2 via Google Code
Python 3 Version:
Download cbJisho v3.0 source code python3 via Google Code
What changed?
▲ Added ability to search via basic SQL queries. See the SQL Query Help section of the online documentation (a rough example is sketched after this list).
▲ Added ability to save the results of a search either to the clipboard or to a file. See the Saving Results to File/Clipboard section of the online documentation.
▲ Added search history. Press UP or DOWN to move through the history.
▲ In status bar, show # of total matches and # actually displayed
▲ Fixed (P) and JLPT options not working with kana/kanji search (Thanks Yufina!)
▲ Part-of-speech codes are now combined into a comma-separated list
▲ Added some interface shortcuts:
ESC Clear the search box
CTRL-L Send focus to the search box
CTRL-W Toggle the Whole word checkbox
CTRL-R Toggle the RegEx checkbox
CTRL-J Toggle all of the JLPT buttons
CTRL-P Toggle the (P) checkbox
CTRL-H Show the help page (this is new)
CTRL-Q Show the SQL help page (this is new)
▲ Updated EDICT definitions to latest version
▲ Updated novel stats because a small number of novels had the wrong encoding
▲ Increased Max Def default to 999 (Thanks Yufina! This is what prevented you from seeing 余り when you search for あまり).
▲ Made database column names shorter (for easier user SQL queries)
▲ Now uses SQL ORDER BY clause to sort results
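To give a flavor of the new SQL search, here is a minimal Python 3 sketch of the kind of frequency-sorted query it enables. The database file, table, and column names below (cbjisho.db, dict, word, reading, novel_freq) are hypothetical stand-ins, not cbJisho's actual schema; press CTRL-Q for the real column names.

import sqlite3

# Hypothetical schema: dict(word, reading, novel_freq, definition).
conn = sqlite3.connect('cbjisho.db')
rows = conn.execute(
    "SELECT word, reading, novel_freq "
    "FROM dict "
    "WHERE word LIKE '%食%' "
    "ORDER BY novel_freq DESC "
    "LIMIT 20"
).fetchall()
for word, reading, freq in rows:
    print(word, reading, freq)
conn.close()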
cb4960
お疲れさまでした。
Last edited by nest0r (2011 April 17, 6:32 pm)
Hello,
I have just released version 3.1 of cbJisho.
For Windows users:
Download cbJisho v3.1 via Google Code
For Linux users:
Python 2 Version:
Download cbJisho v3.1 source code python2 via Google Code
Python 3 Version:
Download cbJisho v3.1 source code python3 via Google Code
Note: I haven't actually tested the Python 3 version on Linux
What changed?
▲ The novel column now has frequencies for 63,215 additional words. This was done by supplementing mecab with a user dictionary.
▲ Combined dialect codes with the part-of-speech codes.
▲ Added part-of-speech quick reference to the CTRL-H help file.
cb4960
Thank you!
Some JLPT entries are wrong. Here you can find the most up-to-date JLPT vocabulary for N5 and N4: http://www.jlptstudy.com/N5/
Yufina wrote:
Thank you!
Some JLPT entries are wrong. Here you can find the most up-to-date JLPT vocabulary for N5 and N4: http://www.jlptstudy.com/N5/
http://www.jlptstudy.com/N5/ wrote:
As of 2010, there is no official vocabulary list. This list, then, is an approximate guide that is likely to match requirements for the Level N5 exam.
I don't suppose somebody knows of a more accurate/official listing of JLPT vocabulary?
I think the JLPT organisers are now not providing official vocab lists as a matter of policy, so I doubt there'll be a better list out there.
The Text Glossing section here: http://www.csse.monash.edu.au/~jwb/wwwj … ticle.html
It looks like the tools for a standalone version are already available in various forms and thus it could be doable to add this functionality to offline programs. The various forms being: jReadability (for character sequence scanning), Rikaisan or something similar (deinflection tables), and offline dictionaries.
nest0r wrote:
The Text Glossing section here: http://www.csse.monash.edu.au/~jwb/wwwj … ticle.html
It looks like the tools for a standalone version are already available in various forms and thus it could be doable to add this functionality to offline programs. The various forms being: jReadability (for character sequence scanning), Rikaisan or something similar (deinflection tables), and offline dictionaries.
Didn't read the entire article yet, but doesn't overture(?) have a tool that does that?
The section relevant to Text Glossing is just a couple of paragraphs. (It's headed in bold as Text Glossing.)
Overture's (which is great and I use constantly) is Anki only and online only (it polls the WWWJDIC site); in fact, all of the automatic glossing tools are online only. I'm a standalone/offline kind of person, so I'm always looking for that extra edge and flexibility. ;p
In fact, I'd still be whining about Rikaisan not being Stardict/Lingoes (i.e. Rikaichan getting all the good stuff) if I hadn't discovered you can use it offline (for some reason I thought you had to be online for some earlier version; perhaps I hallucinated that?), despite how mind blowingly awesome it is.
Edit: Here's the relevant text:
“... With WWWJDIC a simpler approach to segmentation has been employed in which the text is scanned to identify in turn each sequence of characters beginning with either a katakana or a kanji. The dictionary is searched using each sequence as key, and if a match is made, the sequence is skipped and the scan continues. Thus the dictionary file itself plays a major role in the segmentation of the text in parallel with the accumulation of the glosses. The technique cannot identify grammatical elements and other words written only in hiragana, however it is quite successful with gairaigo and words written using kanji.
A further element of preprocessing of text is required for inflected forms of words, as the EDICT files only carry the normal plain forms of verbs and adjectives. An inverse stemming technique previously employed in the author's JREADER program is used here, wherein each sequence comprising a kanji followed by two hiragana is treated as a potential case of an inflected word. Using a table of inflections, a list of potential dictionary form words is created and tested against the dictionary file. If a match is found, it is accepted as the appropriate gloss. The table of inflections has over 300 entries and is encoded with the type of inflection which is reported with the gloss. Although quite simple, this technique has been extensively tested with Japanese text and correctly identifies inflected forms in over 95% of cases. (In Figure iii this can be seen where 思います has been identified as an inflection of 思う.)
When preparing glosses of words in text, it is appropriate to draw on as large a lexicon as possible. For this reason, a combination of all the major files of the EDICT project is used, unlike the single word search function where users can select which glossary to use. This can introduce other problems as the inappropriate entry may be selected. For example, for the word 人々 the ひとびと entry must be selected, not the much less common にんにん. To facilitate this, a priority system is employed in which preference is given in turn to entries from:
a 12,000 entry file of more commonly used words;
the rest of the EDICT file;
the other subject-specific files;
the file of names.”
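To make the scan-and-lookup idea concrete, here is a toy Python 3 sketch of the segmentation the article describes: walk the text, and whenever a run starts with a kanji or katakana character, try progressively shorter prefixes against the dictionary and skip over the first match. The tiny dictionary below is a stand-in (WWWJDIC uses the full EDICT files), and this simplification leaves out the inflection table entirely.

def is_kanji_or_katakana(ch):
    # CJK Unified Ideographs or the katakana block
    return ('\u4e00' <= ch <= '\u9fff') or ('\u30a0' <= ch <= '\u30ff')

def gloss(text, dictionary, max_len=8):
    glosses = []
    i = 0
    while i < len(text):
        if not is_kanji_or_katakana(text[i]):
            i += 1
            continue
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                glosses.append((candidate, dictionary[candidate]))
                i += length
                break
        else:
            i += 1  # no dictionary match; move on one character
    return glosses

edict = {'人々': 'people', '日本': 'Japan'}
print(gloss('人々は日本にいます', edict))
# [('人々', 'people'), ('日本', 'Japan')]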
Last edited by nest0r (2011 June 27, 8:47 pm)
Interesting. A purely offline version is definitely possible. As you state, rikaichan already does inflection, so that's free code. Using cbJisho's frequency list, it might even be better than the rules in the article. And of course, the dictionary files are easily downloadable and free. You could parse just what you need or even entire files and create something similar to those reading-real-japanese books where the Japanese is on one side and the gloss is on the other. It could even be integrated with rikaisan as a "Gloss Mode" that highlights entire sentences and glosses them up.
Last edited by cb4960 (2011 June 27, 9:06 pm)
cb4960 wrote:
Interesting. A purely offline version is definitely possible. As you state, rikaichan already does inflection, so that's free code. Using cbJisho's frequency list, it might even be better than the rules in the article. And of course, the dictionary files are easily downloadable and free. You could parse just what you need or even entire files and create something similar to those reading-real-japanese books where the Japanese is on one side and the gloss is on the other. It could even be integrated with rikaisan as a "Gloss Mode" that highlights entire sentences and glosses them up.
Yess! ... *steeples fingers*
And then:
nest0r wrote:
... perhaps it could be compared against a user database... perhaps Anki information (seen cards, cards of a certain maturity, whatever)...
- http://forum.koohii.com/viewtopic.php?p … 33#p133033
cb4960 wrote:
... You can have a scenario where a beginner wants to know which of say 5000 novels would be a good one to read with regard to that i+1 thingy...
- http://forum.koohii.com/viewtopic.php?p … 67#p133067
And boom! We've come full circle: http://forum.koohii.com/viewtopic.php?p … 16#p149016
Mwa ha ha.
Edit: And that gloss mode sounds like it could be cool. Since you've got that delimiter information, it seems (full stops and suchlike for context sentences). And/or just using Rikaisan to manually select any swath of text?
Edit 2: Although I just noticed MorphMan has a function where you can submit a UTF-8 text and it will extract morphemes and stuff? Even after reading Overture's explanation-for-nerds, not sure how that works, but I'll play with it. Perhaps it already lets you determine the i+N of a text? If not there's still ToasterMage's scripts: http://forum.koohii.com/viewtopic.php?p … 45#p148045
Edit 3: Ohh, looks like you can determine unknowns by creating a db from a text in MorphMan's manager and finding the intersection of the texts. Ahhh. Okay. Hmm, now how to reapply those numbers to the source text, at multiple levels (sentence/paragraph/etc.). ;p Edit 4: No wait, it's better to do A-B or something like that (assuming B is the known.db and A is the db you made from a text). I get it now. ;p
Last edited by nest0r (2011 June 27, 11:24 pm)
*push*
Just a question: is it possible to exchange the dictionary file with a German equivalent? It would be great to know how to do so~
Edit: HAHA! It'd be great if you could add features like sorting by radical/kanji, as on jisho.org, to this program. That would be extremely useful, since I don't want to be online all the time I'm reading novels, and on top of that, wifi access in the garden sucks, seriously~
Last edited by Tori-kun (2011 July 03, 10:34 am)
I absolutely love the word frequency analyzer, and am using it extensively to generate flashcards for any particular corpus I might want to read later.
Is it possible to use the utility to figure out which adjective or verb conjugations are most common in a given set of documents, to help focus other parts of my study?
Thank you for all your wonderful work!
Netbrian wrote:
I absolutely love the word frequency analyzer, and am using it extensively to generate flashcards for any particular corpus I might want to read later.
Is it possible to use the utility to figure out which adjective or verb conjugations are most common in a given set of documents, to help focus other parts of my study?
Thank you for all your wonderful work!
I'm glad that you're finding it helpful. Unfortunately, it's not currently a good tool for analyzing conjugation frequency. Maybe I'll add support for this in the future (no promises though).
I haven't tried your tool yet, but I was thinking about things that might be useful in a tool like this. My thought was for each word, you could analyze the words surrounding it to determine if those words appear with higher relative frequency than in the rest of the document. This would indicate that certain words tend to appear together.
For example, say the word "broccoli" tends to appear once every 10,000 words based on the first pass through of a document. When you go through the document again and encounter the word "vegetable," you analyze the surrounding 200 words or so, and find that "broccoli" shows up twice in those 200 words. This occurs each time you encounter "vegetable," so we see that the relative frequency of also encountering "broccoli" is much higher if the word "vegetable" is nearby. For other words, like "football," the relative frequency will likely not be higher when the word "vegetable" is encountered.
Each word could then be displayed with the 5 most strongly correlated words, so you would start to get an idea of which words tend to show up together. If all that information was collected, you could begin to categorize words together into clumps and from there, the possibilities grow even further. For example, you could clump all the words into 10 categories or so, and test a random sample from each clump to see how well you know the words, and based on the result, you would know what types of words you tend to already know and which ones need further study. If each clump is also weighted by its frequency of appearance, you could potentially use that information to study in the most efficient way to battle your weaknesses.
Couple thoughts about implementation... the sample size could potentially make a big difference in how well the results turn out. For a very common word that appears every 100 words or so, analyzing the surrounding 200 words is meaningless, as once all the results are tabulated, the relative frequency of all words would be exactly the same as in the entire document. So perhaps the radius of surrounding words to check should be related to the frequency of the word itself (e.g. if a word appears every 100 words or so, only check the surrounding 10 words).
This is a work in progress in my head though, so feel free to contribute ideas.
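A rough Python 3 sketch of the windowed co-occurrence idea described above (input is assumed to be already tokenized into a list of words; this is only an illustration, not something cbJisho does):

from collections import Counter

def near_counts(tokens, target, radius=100):
    # Count how often each word appears within `radius` tokens of `target`.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - radius):i] + tokens[i + 1:i + 1 + radius]
            counts.update(window)
    return counts

def collocation_scores(tokens, target, radius=100):
    # Ratio of a word's relative frequency near `target` to its relative
    # frequency in the whole document; values well above 1 suggest the
    # two words tend to appear together.
    doc = Counter(tokens)
    near = near_counts(tokens, target, radius)
    near_total = sum(near.values()) or 1
    return {w: (near[w] / near_total) / (doc[w] / len(tokens)) for w in near}

The per-word radius idea is easy to bolt on: pick the radius as some multiple of the document length divided by the target word's own count, so that very common words get a much smaller window.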
Can you add an option to append an entry to a certain file, instead of overwriting everything?
*cough* I know, I'm lazy. But this software already does everything else I want it to do, lol...
anritsi wrote:
Can you add an option to append an entry to a certain file, instead of overwriting everything?
*cough* I know, I'm lazy. But this software already does everything else I want it to do, lol...
Sure, that should be simple enough. I can probably get to it in 2-3 weeks.
@Christopher: Thanks for cbjisho, I had been looking for a tool like that for a long time.
However, I can't make it do what I want - I am writing this to ask whether it is because I do not know the proper way, or because your tool cannot do that.
I am trying to generate lists of, say, the N (max) most frequent words including a given character, for each character in, say, Heisig 1 - X, not using any character outside that list (namely from Heisig X+1 to 3000 or so), and if possible excluding any duplicates. For instance, "一番" is found in the query for "一", but is also found in that for "番". I want it to appear only once in the results.
First, I can't coax cbjisho into saving only N words - for "一", for instance (Heisig 1), I can change the options text file to display only N, but when I press ctrl-D (or ctrl-S), it saves the whole 1900 or so results, not just the first N (is that a behavior that you intended?). Obviously, I can select the first N definitions from the screen output with the mouse, but then it becomes too labor-intensive.
Second, is there a SQL query I can use to batch query Heisig 1 - X, excluding X + 1 to 3000 or so?
御苦労様!
Last edited by louischa (2013 June 23, 11:43 pm)
Zarxrax wrote:
This looks interesting, I'll give it a try later.
Is it possible to browse words by frequency, rather than searching? (basically to obtain a list of vocab to learn)
Makes me wonder if an alternate 'Core' list could be compiled and checked against the Core 6000/10000. I remember discussions on iknow about the last few items from Core 6000 being dated terms. Here is the comment from one of the older users of iknow:
Rainer Konowski wrote:
As far as I know, the Core items have been licensed from the CJK institute (which is probably best known for "The Kanji Learner's Dictionary"). There are 10000 items in total, which can be found in the "Japanese Sensei" iphone app. There is no further information on the pages of the CJK institute, so you can only guess where it comes from.
I think that neither selection nor order of words make sense for any kind of source. For the newspaper hypothesis: One of the surprising omissions is the word "日本" (Japan), which you could easily find in any newspaper. In contrast, the Core items contain the weird word "日ソ" (Japanese-Soviet, this was later edited out by Cerego). This points back to the 80s - maybe the Core set was compiled without the help of computers, which could explain some of its shortcomings.
Some notable omissions that I remember right now: Almost all expressions, interjections, greetings (did you know that はい means yes?). Everyday words like コンビニ (kombini, convenience store), 上履き (uwabaki, indoor shoes/slippers). Place names like Tokyo, America, Mount Fuji. Japanese-culture specific things like okonomiyaki (pancake-like food), 剣道 (kendou, hit others with a wooden stick). Very common words like 小さな/大きな (small/big). Language learners also would want to see grammar patterns like "~や~など" (and so on) or "しか~ない" (only), nothing like that is in the lists. Another problem is that many example sentences are unnecessarily abstract and do not show a typical use of the word in question. I personally would prefer an order of items first by topic and secondary by frequency. That automatically gives some context and makes learning easier.
I agree with Russ that you should ignore the prescribed order and just pick the words that make sense to you and ignore the others. "The final step of the Core 6000" (if that is meant literally) actually contains many useful words. The rare words are spread all over the place.
If I was Cerego I would simply add missing words. They have a mixed Japanese-American team, so what's the problem? All it takes is some common sense and a microphone. Competitor japanesepod does that all the time.
Interestingly, that 日ソ is ranked the 35,000th most common word by Rikaisama...
danieldesu wrote:
I haven't tried your tool yet, but I was thinking about things that might be useful in a tool like this. My thought was for each word, you could analyze the words surrounding it to determine if those words appear with higher relative frequency than in the rest of the document. This would indicate that certain words tend to appear together.
What you describe is a standard computational text analysis concept known as a "collocation". I learn Japanese based on collocations as a matter of routine.
My workhorse is ALC (http://www.alc.co.jp/). You query any word/kanji, say "挙", and then you eyeball the results. Then, boom, on page 2, you notice clusters such as "挙動" and "挙動不審". These are what I memorize. I've been doing that for about a year now.
Collocations are **the** way to go for learning verbs, as typical verb objects (and the relevant particle) appear in ALC. Japanese verbs are involved in all sorts of idioms that are unpredictable/surprising to English speakers.
There is a book that I highly recommend: "Common Japanese Collocations: A Learner's Guide to Frequent Word Pairings", by Kakuko Shoji, but I tend to use it less than ALC since I dislike having to type the book expressions, and the book is much too short.
Obviously, a software tool that would do that job would be extremely useful. That said, I think that some degree of human intervention is needed to separate the grain from the chaff once some frequent pairings are found.
Relevant corpora are important when looking for collocations. For some strange reason, ALC uses many Sherlock Holmes novels (Japanese translation), and these usually provide the weirdest results. However, it is spot on most of the time.
louischa wrote:
@Christopher: Thanks for cbjisho, I had been looking for a tool like that for a long time.
However, I can't make it do what I want - I am writing this to ask whether it is because I do not know the proper way, or because your tool cannot do that.
I am trying to generate lists of, say, the N (max) most frequent words including a given character, for each character in, say, Heisig 1 - X, not using any character outside that list (namely from Heisig X+1 to 3000 or so), and if possible excluding any duplicates. For instance, "一番" is found in the query for "一", but is also found in that for "番". I want it to appear only once in the results.
First, I can't coax cbjisho into saving only N words - for "一", for instance (Heisig 1), I can change the options text file to display only N, but when I press ctrl-D (or ctrl-S), it saves the whole 1900 or so results, not just the first N (is that a behavior that you intended?). Obviously, I can select the first N definitions from the screen output with the mouse, but then it becomes too labor-intensive.
Second, is there a SQL query I can use to batch query Heisig 1 - X, excluding X + 1 to 3000 or so?
御苦労様!
cbJisho isn't really the right tool for this task. However, a while back I wrote a program for just this purpose. I'll see about polishing it up a bit and releasing it in a new thread. Here is the readme that I wrote to myself at the time:
The generated list is meant for RTK learners (non-lite). The list includes every character in RTK (3007 total). The list is sorted by the Heisig number of the kanji.
Up to 5 entries are associated with each kanji. Each entry contains 4 fields: word, reading, word frequency, and definition.
If possible, the word field will consist of only kana and characters that have already been studied by the learner. For example, 径 is Heisig number 882. The words given for this character will be limited to characters in Heisig numbers 1-882.
Words that are more frequent will be used over words that are less frequent. No duplicate words are used. All fields are separated by tabs.
And here is the file that was generated:
http://www.mediafire.com/download/6x12p … 120506.zip
I'll try to add a few more options, such as allowing a list of kanji to be specified (instead of forcing Heisig's 1-3007), allowing more than 5 entries, and allowing the output format to be specified.
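In case it helps anyone reading along, here is a rough Python 3 sketch of the selection logic the readme describes. The data structures (rtk_order, words, heisig_index) are hypothetical stand-ins, not the actual program, and the "if possible" fallback to words containing not-yet-studied kanji is left out for brevity.

def is_kana(ch):
    return '\u3040' <= ch <= '\u30ff'  # hiragana and katakana blocks

def allowed(word, max_heisig, heisig_index):
    # A word qualifies if every character is kana or a kanji already studied.
    return all(is_kana(ch) or heisig_index.get(ch, 10**9) <= max_heisig
               for ch in word)

def build_list(rtk_order, words, heisig_index, per_kanji=5):
    # rtk_order: kanji in Heisig order; words: word -> (reading, freq, definition)
    used = set()
    rows = []
    for number, kanji in enumerate(rtk_order, start=1):
        candidates = [(w, info) for w, info in words.items()
                      if kanji in w and w not in used
                      and allowed(w, number, heisig_index)]
        candidates.sort(key=lambda item: item[1][1], reverse=True)  # by frequency
        for w, (reading, freq, definition) in candidates[:per_kanji]:
            used.add(w)
            rows.append('\t'.join([w, reading, str(freq), definition]))
    return rows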
louischa wrote:
danieldesu wrote:
I haven't tried your tool yet, but I was thinking about things that might be useful in a tool like this. My thought was for each word, you could analyze the words surrounding it to determine if those words appear with higher relative frequency than in the rest of the document. This would indicate that certain words tend to appear together.
What you describe is a standard computational text analysis concept known as a "collocation". I learn Japanese based on collocations as a matter of routine.
My workhorse is ALC (http://www.alc.co.jp/). You query any word/kanji, say "挙", and then you eyeball the results. Then, boom, on page 2, you notice clusters such as "挙動" and "挙動不審". These are what I memorize. I've been doing that for about a year now.
Collocations are **the** way to go for learning verbs, as typical verb objects (and the relevant particle) appear in ALC. Japanese verbs are involved in all sorts of idioms that are unpredictable/surprising to English speakers.
There is a book that I highly recommend: "Common Japanese Collocations: A Learner's Guide to Frequent Word Pairings", by Kakuko Shoji, but I tend to use it less than ALC since I dislike having to type the book expressions, and the book is much too short.
Obviously, a software tool that would do that job would be extremely useful. That said, I think that some degree of human intervention is needed to separate the grain from the chaff once some frequent pairings are found.
Relevant corpora are important when looking for collocations. For some strange reason, ALC uses many Sherlock Holmes novels (Japanese translation), and these usually provide the weirdest results. However, it is spot on most of the time.
AntConc is another tool that one can use for collocations. Here is nest0r's explanation of its usage:
http://rtkwiki.koohii.com/wiki/Nest0r%2 … th_AntConc

