
Top Japanese kindle books sorted by difficulty

#26
Yep, I'd seen it but hadn't had a good read through it until just now. It is fascinating that they're going based on single characters instead of word boundaries - makes me feel a little less bad about rating based purely on Kanji.

The thing I'm not so sure I agree with in OBI2 is that they're basically just matching text to Japanese school grades. Learning Japanese as a second language, you're not necessarily going to learn everything in the same order as school kids do. It seems to me that a better rating for us second-language learners is an estimate of how "typical" a Japanese text is, as opposed to a prediction of which grade it would be learned in. That way, studying by moving up the typicalness scale will let you read an increasing number of texts online.

Either way, it is a good read. Thanks!
#27
zaydana wrote: I won't be able to get it working immediately, because an easiness ranking based on vocab will need a *much* larger corpus than I've currently got, but it is definitely something I want to do. Thanks for pointing them out!
cb ran his Japanese Text Analysis Tool on a corpus of 5000+ modern Japanese novels and posted the results on MegaFireShareCrap (link and description at http://forum.koohii.com/showthread.php?p...#pid167828). I reposted the resulting text files to https://gist.github.com/fasiha/779f73f802b80520db4a which you can git-clone or download via browser.

Included:
- MeCab base lemma frequency report, probably the most relevant to this project: word_freq_report.txt (columns are "frequency count, MeCab lemma, percentage, cumulative percentage, raw MeCab part-of-speech analysis")
- kanji frequency report: kanji_freq_report.txt
- readability formula applied to each file in the "corpus": formula_based_readability_report.txt
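
A rough sketch of how word_freq_report.txt could be loaded into a lemma-to-rank table, following the column order quoted above. It assumes the columns are tab-separated and that rows are already sorted from most to least frequent; neither is confirmed here, so check the actual file first. The names `WordFreq`, `read_word_freq` and `rank_table` are made up for illustration, and the sample rows are invented, not taken from the report.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class WordFreq(NamedTuple):
    count: int                 # frequency count
    lemma: str                 # MeCab base lemma
    percent: float             # share of all tokens
    cumulative_percent: float  # running total of the shares
    pos: str                   # raw MeCab part-of-speech analysis


def read_word_freq(handle: TextIO) -> Iterator[WordFreq]:
    """Yield one WordFreq per line, skipping lines with too few columns."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")  # assumption: tab-separated
        if len(fields) < 5:
            continue
        count, lemma, percent, cumulative, pos = fields[:5]
        yield WordFreq(int(count), lemma, float(percent), float(cumulative), pos)


def rank_table(handle: TextIO) -> Dict[str, int]:
    """Map each lemma to its 1-based frequency rank (first occurrence wins)."""
    ranks: Dict[str, int] = {}
    for rank, row in enumerate(read_word_freq(handle), start=1):
        ranks.setdefault(row.lemma, rank)
    return ranks


# Invented sample rows, only to show the expected shape of the input.
sample = "100\tの\t5.0\t5.0\t助詞\n50\tする\t2.5\t7.5\t動詞\n"
print(rank_table(io.StringIO(sample)))
```

With the real file, `rank_table(open("word_freq_report.txt", encoding="utf-8"))` would be the equivalent call, assuming the file is UTF-8.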

NB. Links to the full "Innocent corpus" of 5000+ novels may be found on this forum…
#28
aldebrn, this is amazing. Thanks!

I'm still thinking about how to use the words in the score, but for the moment I've rerun my algorithm using your kanji frequency report instead of building one from the tested books. I've also increased the number of Kanji I use to 7000. The results seem to make a little more sense than previously.

On each book page, I've also added color-coded kanji based on the Kyoiku Kanji, and a graph of how many Kanji are introduced over time at the bottom of each page. This should let people get a better idea of how accurate each score is.

One night I need to run my analysis on the books in the innocent corpus to see how it matches up with the two academic scores. Will post results once that is done.
Edited: 2015-03-26, 8:39 am
#29
Excellent, but please note that the data and work are entirely cb4960's!

I love the new graphics and visualization elements on Read Your Level. I know it'd be a pain in the butt, but if you made the graph interactive (using d3.js or something) so that a mouse hovering over a point would indicate what kanji the point corresponds to, that would be (a) useful and (b) super-slick :).

What color scheme did you use for the grades? I recently started using the color scheme at kanken.or.jp (code to save you two minutes) and admit that although it seems arbitrary, I'm getting used to it. I'm wondering if your colors come from somewhere like that.

I also like using the Kanken colors/levels to break down the secondary school years too (this was a suggestion by Roketzu for another tool).

I'm also really curious what results you get if you do get to run the excerpts through MeCab and analyze the frequency of *words* against cb's word_freq_report.txt, that is, vocab-based ranking instead of kanji-based ranking (or a combination of the two). I guess to be scientific you'd have a set of books whose relative difficulty you could order, and you'd test these various ranking methods against that...?
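
One way to make that comparison concrete is a rank correlation between each method's scores and the agreed difficulty order of the reference books. The sketch below uses Spearman's rank correlation, which is a choice made here rather than anything proposed in the thread, and it assumes no two books share the same score (ties are not handled). The function names and the example numbers are invented for illustration.

```python
from typing import List, Sequence


def to_ranks(values: Sequence[float]) -> List[int]:
    """Return the 1-based rank of each value (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation for two equally long, tie-free sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    rank_x, rank_y = to_ranks(xs), to_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented numbers: a human-agreed difficulty order for four books,
# and the scores two hypothetical ranking methods gave the same books.
human_order = [1, 2, 3, 4]
kanji_scores = [0.21, 0.35, 0.30, 0.60]
vocab_scores = [0.10, 0.25, 0.40, 0.55]
print(spearman(human_order, kanji_scores), spearman(human_order, vocab_scores))
```

A value near 1 means the method orders the books the way the human ranking does; whichever method scores higher on the reference set would be the better candidate.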
#30
For color scheme, I used this: http://en.wikipedia.org/wiki/Ky%C5%8Diku..._Kanji.svg

Also, the graph was already using d3 and had the popovers there but commented out. I've got them working, so you should see the kanji if you hover over a dot on the graph.

Didn't think about kanken, that sounds like a good idea too. Will add it to the list.

The problem with running a word-based ranking is that the current algorithm isn't a simple ranking. Instead, it scores each kanji by estimated "painfulness", which is based on the value of a logistic curve at the kanji's ranking (see the graph at http://readyourlevel.jamesknelson.com/ar...calculated). This works well for 2000 kanji with the inflection point at 500, but I'd have to experiment a bit to get it to work for words. I'll report back (and to my mailing list) once it is done, but I don't think it'll happen within the next week.
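
For readers who want to see the shape of such a curve, here is a minimal sketch of a logistic "painfulness" function. Only the inflection point at rank 500 comes from the post above; the steepness value, the function names, and the idea of averaging per-kanji values into one score are assumptions made for illustration, not the site's actual implementation. Rank 1 is taken to be the most frequent kanji.

```python
import math
from typing import Iterable


def painfulness(rank: int, midpoint: float = 500.0, steepness: float = 0.01) -> float:
    """Logistic curve over frequency rank: near 0 for common kanji, near 1 for rare ones.

    midpoint is the inflection point (500 in the post); steepness is a guess.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (rank - midpoint)))


def mean_painfulness(ranks: Iterable[int]) -> float:
    """Average painfulness over the kanji ranks seen in a text (an assumed aggregate)."""
    values = [painfulness(rank) for rank in ranks]
    if not values:
        return 0.0
    return sum(values) / len(values)


# Common kanji barely register, the midpoint scores exactly one half,
# and rare kanji approach the maximum of 1.
for rank in (1, 500, 2000):
    print(rank, round(painfulness(rank), 3))
```

Moving to words would mostly mean picking a different `midpoint` and `steepness`, since a word list is far longer than a 2000-kanji list.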
#31
Some more books to rank - these from my 'looks interesting but too difficult to bother with' list:

http://www.amazon.co.jp/ebook/dp/B00N7BXSC6/
http://www.amazon.co.jp/ebook/dp/B009GPM3PA
http://www.amazon.co.jp/ebook/dp/B00GJMUNB4
http://www.amazon.co.jp/ebook/dp/B00IWHT7VG
http://www.amazon.co.jp/ebook/dp/B00BMM5OV0
http://www.amazon.co.jp/ebook/dp/B00GJMUVH0
http://www.amazon.co.jp/ebook/dp/B00M9372D4
http://www.amazon.co.jp/ebook/dp/B00BWF6YQQ
http://www.amazon.co.jp/ebook/dp/B00ARCOVLU
http://www.amazon.co.jp/UFO-1-ebook/dp/B00IUAYBCA
http://www.amazon.co.jp/ebook/dp/B00AJCM55W
#32
Thanks Aikynaro, I've put them on the list for next week.

In the meantime, I've just made a rather major update:

- The difficulty rating is now based on word frequencies, not just kanji frequencies!
- You can see the number of (possibly difficult) words in the sample, instead of just the number of kanji
- Furigana are now taken into account, and you can find books with lots of furigana on the index page
- Instead of graphing the introduction of new words, you can see a graph of "frustration", which is modelled on the closeness of difficult-looking words
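
The post does not say how "frustration" is computed, so the following is only a guess at one model that fits the description: every difficult-looking word adds to a running level, and the level decays as easy words go by, so clusters of hard words produce higher peaks than the same number of hard words spread out. The half-life parameter and the function name are invented for illustration.

```python
from typing import Iterable, List


def frustration_curve(is_difficult: Iterable[bool], half_life: float = 10.0) -> List[float]:
    """Return a frustration level per word position.

    is_difficult: one flag per word, True if the word looks difficult.
    half_life: number of words after which past frustration halves (assumed).
    """
    decay = 0.5 ** (1.0 / half_life)
    level = 0.0
    curve = []
    for hard in is_difficult:
        level *= decay      # earlier frustration fades with each word read
        if hard:
            level += 1.0    # a difficult word adds one unit
        curve.append(level)
    return curve


# Three hard words in a row versus three hard words spaced ten words apart.
clustered = [True, True, True] + [False] * 27
spread = ([True] + [False] * 9) * 3
print(max(frustration_curve(clustered)), max(frustration_curve(spread)))
```

In this model the clustered sample peaks higher than the spread-out one, which is the behaviour a "closeness of difficult words" measure would be expected to show.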

I've also removed books which are too hard from the listing, since nobody seems to be interested in them, and added the following books from a list I found elsewhere on the site:

GO
涼宮ハルヒの憂鬱
失はれる物語
しあわせは子猫のかたち

Would love to hear your opinions on what the next step should be in making the site more useful!