
Google Reading Level

#1
Looks like it's for English only, and based on some mysterious criteria related to the American education system: http://www.google.com/support/forum/p/We...ad86&hl=en

http://www.google.com/support/websearch/...er=1095407

http://googleforstudents.blogspot.com/20...level.html

Still, it's an interesting idea.
Reply
#2
Maybe it's Flesch-Kincaid; that's reasonably accurate.
Reply
#3
The Google explanation makes me think that's not what it is.
Reply
#4
Quote: For Japanese, the critical factors are: sentence length, length of runs of Roman letters and symbols and of the different Japanese characters (Hiragana, Kanji and Katakana), and the ratio of tooten (comma) to kuten (period). The original formula used 10 factors; the following is based on only six.

Readability Score = -(0.12 * LS) - (1.37 * LA) + (7.4 * LH) - (23.18 * LC) - (5.4 * LK) - (4.67 * CP) + 115.79

LS = length of the sentences
LA = average number of Roman letters and symbols per run
LH = average number of Hiragana characters per run
LC = average number of Kanji characters per run
LK = average number of Katakana characters per run
CP = ratio of tooten (comma) to kuten (period)
Run = a continuous string of the same type of character
http://www.ideosity.com/SEO/SEO-Readabil...px#Hayashi

Edit: the above is unrelated to the Google thing, just thought it was interesting.
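
For anyone who wants to try the six-factor formula quoted above, here is a rough Python sketch. The quote leaves several details open, so these are all assumptions on my part: sentences end at 。, LS is the mean sentence length in characters, "Roman letters and symbols" is narrowed to ASCII letters, and character types are picked by Unicode block. The names `char_type`, `mean_run_length` and `hayashi_score` are made up for illustration, and the numbers this produces may not match the original tool.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character by Unicode block (a simplified, assumed mapping)."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "roman"  # assumption: only ASCII letters, no symbols
    return "other"      # punctuation, digits, whitespace: these break runs


def mean_run_length(text, kind):
    """Average length of the runs of `kind`; 0.0 if the text has none."""
    runs = [len(list(group))
            for key, group in groupby(text, key=char_type)
            if key == kind]
    return sum(runs) / len(runs) if runs else 0.0


def hayashi_score(text):
    """Six-factor readability score as quoted in the post (higher = easier)."""
    # Assumption: a sentence is whatever sits between two kuten.
    sentences = [s for s in text.split("。") if s.strip()]
    ls = sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0
    la = mean_run_length(text, "roman")
    lh = mean_run_length(text, "hiragana")
    lc = mean_run_length(text, "kanji")
    lk = mean_run_length(text, "katakana")
    kuten = text.count("。")
    cp = text.count("、") / kuten if kuten else 0.0  # commas per period
    return (-(0.12 * ls) - (1.37 * la) + (7.4 * lh) - (23.18 * lc)
            - (5.4 * lk) - (4.67 * cp) + 115.79)


print(round(hayashi_score("私は猫です。"), 2))
```

For the one-sentence sample, LS = 5, LH = 1.5, LC = 1 and the other factors are 0, so the score works out to -0.6 + 11.1 - 23.18 + 115.79 = 103.11.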
Edited: 2011-02-14, 11:41 pm
Reply
#5
That is interesting; there are some factors to consider there, but it seems surprisingly cognitive in orientation, geared towards processing load or somesuch, since it doesn't cover other aspects of the language. I wonder if the other 4 factors cover that?
Reply
#6
I notice that by far the most heavily weighted factor is LC, which is probably a reasonable proxy for difficulty of vocab: lots of two-kanji (or more) compounds and your LC score goes up; more 和語 (native Japanese words) and your LC goes down and your LH goes up...

Interesting that LS has such a light weighting there; I'd have guessed it would be worth more.
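
One caveat on comparing the weights directly: the factors are on different scales, since LS is measured in characters per sentence while the run averages are usually only a few characters. A quick sketch (weights copied from the formula quoted in #4; `equivalent_change` is just an illustrative helper name) shows how far each factor has to move to match a one-unit change in LC:

```python
# Coefficients copied from the six-factor formula quoted earlier in the thread.
WEIGHTS = {"LS": -0.12, "LA": -1.37, "LH": 7.4,
           "LC": -23.18, "LK": -5.4, "CP": -4.67}


def equivalent_change(factor, reference="LC"):
    """How far `factor` must move to shift the score as much as one unit of `reference`."""
    return abs(WEIGHTS[reference] / WEIGHTS[factor])


for name in WEIGHTS:
    print(name, round(equivalent_change(name), 2))
```

By this arithmetic, average sentence length has to grow by about 193 characters (23.18 / 0.12) to cost as many points as one extra kanji per run.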
Reply
#7
Hmm, hadn't thought of it that way. Reminds me of the stick thingy in the Aozora texts, actually.

I wonder if the amount of カタカナ (katakana) could indicate the amount of technical language and weight the results towards advanced?
Reply
#8
cb4960 Wrote:
Quote: For Japanese, the critical factors are: sentence length, length of runs of Roman letters and symbols and of the different Japanese characters (Hiragana, Kanji and Katakana), and the ratio of tooten (comma) to kuten (period). The original formula used 10 factors; the following is based on only six.

Readability Score = -(0.12 * LS) - (1.37 * LA) + (7.4 * LH) - (23.18 * LC) - (5.4 * LK) - (4.67 * CP) + 115.79

LS = length of the sentences
LA = average number of Roman letters and symbols per run
LH = average number of Hiragana characters per run
LC = average number of Kanji characters per run
LK = average number of Katakana characters per run
CP = ratio of tooten (comma) to kuten (period)
Run = a continuous string of the same type of character
http://www.ideosity.com/SEO/SEO-Readabil...px#Hayashi

Edit: the above is unrelated to the Google thing, just thought it was interesting.
Are there any programs that calculate this for your own files? I'm curious how it would rate some books.
Reply
#9
I found this for English: http://www.addedbytes.com/code/readability-score/
Reply
#10
Here are the Hayashi readability scores for those 5000 novels that are floating about:
http://dpaste.org/O8eM/

Many of the 164 novels that scored below 35 seem to be very short or oddly formatted and should probably be ignored.

There are also an additional 138 novels that couldn't be analyzed because my program couldn't properly determine the encoding or they had a very unusual format.
Reply
#11
Awesome! Very interesting.

By the way, there are newer methods beyond Hayashi that might be better. I was looking at correlations of Hayashi with grade levels and found this paper. It describes their method and its applicability to types of text beyond textbooks (such as web pages), and how it doesn't use sentence/word analysis or somesuch, but ‘operative characters’. They have an online tool, similar to the addedbytes tool that I linked above, here (interface has English option): http://kotoba.nuee.nagoya-u.ac.jp/sc/obi2/ - For more info in addition to the paper, there's also the ‘more information’ link (they provide a link to the program there, as well as a poster).

Main page in English: http://kotoba.nuee.nagoya-u.ac.jp/index_en.html

Edit: For the record, I only looked for the grade correlations and discovered the above after you posted the Hayashi 5000 list, so I haven't had time to do more than skim that paper and have no idea how sound and/or widely applicable their method is.
Edited: 2011-02-19, 3:38 am
Reply
#12
cb4960 Wrote:Here are the Hayashi readability scores for those 5000 novels that are floating about:
http://dpaste.org/O8eM/

Many of the 164 novels that scored below 35 seem to be very short or oddly formatted and should probably be ignored.

There are also an additional 138 novels that couldn't be analyzed because my program couldn't properly determine the encoding or they had a very unusual format.
Wow! Now THAT'S what I call great!

You've been my hero I don't know how many times already!!!

edit: Do these percents have some kind of meaning? Is it a percentage comparing a novel to the easiest and hardest novels on the list?

edit2: Now it seems that all we have to do is draw dividing lines between those ~4800 novels in that list to split the novels into 10 groups of difficulty.
And then start recommending the best stuff from each difficulty level to each other.
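
The grouping idea above could be sketched like this in Python. It assumes you already have (title, Hayashi score) pairs, for example parsed from the pasted list, and simply cuts the ranked list into 10 groups of nearly equal size. Equal-sized groups are my choice here; cutting at fixed score thresholds would be another option. `split_into_groups` is an illustrative name.

```python
def split_into_groups(scored, n_groups=10):
    """Sort (title, score) pairs from easiest to hardest and cut into n_groups slices."""
    # Higher Hayashi score = easier, so sort in descending order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    size, extra = divmod(len(ranked), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups take one additional item each.
        end = start + size + (1 if i < extra else 0)
        groups.append(ranked[start:end])
        start = end
    return groups


# Placeholder data, not real scores from the list.
sample = [(f"novel{i}", float(i)) for i in range(23)]
for level, group in enumerate(split_into_groups(sample), start=1):
    print(level, [title for title, _ in group])
```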
Edited: 2011-02-19, 3:50 am
Reply
#13
This page has an offline version of the Obi-2 tool:
http://kotoba.nuee.nagoya-u.ac.jp/sc/obi2/obi_e.html

Edit: The page says it executes on MacOS and other standard Unix environments. But since it's just ruby, it seems to run well on Windows after a couple of tweaks.

Edit2: I've set up my program to use the Obi-2 tool. It's kinda slow, so it might take a couple hours to analyze the novels.
Edited: 2011-02-19, 2:54 pm
Reply
#14
Cool, yeah that's the linked program in the more info section that I was referring to—wasn't sure if you'd be able to get it working because of the MacOS thing, but when they said Ruby and Unix I figured you could.

Reading that paper more closely, it seems like their approach is both more robust and more effective than the modified version of Tateisi's formula that Hayashi used in 1992 (which is the 6 factor version posted above [ahha! Figured out why I kept typing ‘below’ when referencing posters' comments, it's because on the edit page the direction is reversed]). It requires only 25 characters or so to be accurate and apparently can be used on anything, even stuff with incomplete sentences.

Does what they provide give the correlation coefficient, though? It looks like it just outputs the grade level (and it's been modified in Obi2 for 14 grades), but skimming through that offline code it seems like it can be modified to output other stuff.

Edit: Just read your Edit 2. Excellent! *steeples fingers*

Edit: And of course 25 characters would be the bare minimum. I think that chart showed 100-250 or something to be the most accurate baseline.
Edited: 2011-02-19, 3:07 pm
Reply
#15
cb4960 Wrote:Here are the Hayashi readability scores for those 5000 novels that are floating about:
This ranking looks quite off to me.
From a sample of ten light novels that I have read:

・ Spice and Wolf 74
・ Hanbun no Tsuki 73
・ Imouto ga Konnani Kawaii 72
・ Suzumiya Haruhi no Bousou 68
・ Toradora 67
・ Denpateki na Kanojo 67
・ Usotsuki Miikun 65
・ Zero no Tsukaima 65
・ NHK ni Youkoso 63
・ Gosick 62

Hanbun and Imouto are two of the easiest light novels out there. Spice and Wolf is hard, but it's by no means that hard; it's certainly not harder than Suzumiya or Toradora. Gosick and NHK are not that easy either; they should be around the middle.

Here's how I would rate these novels:

・ Suzumiya Haruhi no Bousou 80
・ Toradora 80
・ Denpateki na Kanojo 75
・ Spice and Wolf 75
・ Usotsuki Miikun 75
・ NHK ni Youkoso 70
・ Gosick 70
・ Imouto ga Konnani Kawaii 65
・ Hanbun no Tsuki 60
・ Zero no Tsukaima 60

Not the most scientific method, I know; but I think others who have read at least some of these novels will agree with me that the original ranking is not reliable.
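
To put a number on how far the two lists disagree, one option is a Spearman rank correlation. Below is a standard-library sketch (the `average_ranks` and `spearman` helpers are written out here rather than taken from any package). Note that the two scales run in opposite directions: a higher Hayashi score means easier, while a higher number in my list means harder, so a value near -1 would mean the two rankings agree and a value near 0 would mean they are unrelated.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Same title order in both lists: Spice and Wolf, Hanbun, Imouto, Suzumiya,
# Toradora, Denpateki, Usotsuki, Zero no Tsukaima, NHK, Gosick.
hayashi = [74, 73, 72, 68, 67, 67, 65, 65, 63, 62]
personal = [75, 60, 65, 80, 80, 75, 75, 60, 70, 70]
print(round(spearman(hayashi, personal), 3))
```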
Edited: 2011-02-19, 4:41 pm
Reply
#16
iSoron Wrote:
cb4960 Wrote:Here are the Hayashi readability scores for those 5000 novels that are floating about:
This ranking looks quite off to me.
From a sample of ten light novels that I have read:

・ Spice and Wolf 74
・ Hanbun no Tsuki 73
・ Imouto ga Konnani Kawaii 72
・ Suzumiya Haruhi no Bousou 68
・ Toradora 67
・ Denpateki na Kanojo 67
・ Usotsuki Miikun 65
・ Zero no Tsukaima 65
・ NHK ni Youkoso 63
・ Gosick 62

Hanbun and Imouto are two of the easiest light novels out there. Spice and Wolf is hard, but it's by no means that hard; it's certainly not harder than Suzumiya or Toradora. Gosick and NHK are not that easy either; they should be around the middle.

Here's how I would rate these novels:

・ Suzumiya Haruhi no Bousou 80
・ Toradora 80
・ Denpateki na Kanojo 75
・ Spice and Wolf 75
・ Usotsuki Miikun 75
・ NHK ni Youkoso 70
・ Gosick 70
・ Imouto ga Konnani Kawaii 65
・ Hanbun no Tsuki 60
・ Zero no Tsukaima 60

Not the most scientific method, I know; but I think others who have read at least some of these novels will agree with me that the original ranking is not reliable.
Keep in mind that higher readability scores imply that the work is easier to read (more readable).
Reply
#17
cb4960 Wrote:Keep in mind that higher readability scores imply that the work is easier to read (more readable).
Oops. :D But it doesn't change much; there are easy light novels in both upper and lower positions (Hanbun no Tsuki, Zero no Tsukaima), and hard light novels in both upper and lower positions (Spice and Wolf, Usotsuki Miikun).
Edited: 2011-02-19, 5:16 pm
Reply
#18
iSoron Wrote:Not the most scientific method, I know; but I think others who have read at least some of these novels will agree with me that the original ranking is not reliable.
I have to say that I ran Obi-2 on two novels I've tested before:

東野圭吾's 予知夢 (~1340 kanji)
and 1Q84 (~2300 kanji)

Both came out at a 9th-grade level :(

A 9th grader's level and 2300 unique kanji! Doesn't sound right to me.

In addition, my Japanese friend said that the 1340-kanji novel should be pretty easy compared to 1Q84.

And I'm not sure whether they check it, but sentence length, the number of commas per sentence, nested grammar (I don't really know how to explain it... one grammar structure inside another), etc. are probably also indicators of how difficult a novel is. Oh dear...
Edited: 2011-02-19, 5:12 pm
Reply
#19
That online one that nest0r mentioned seems OK. Basically, what they did was scan in a whole bunch of school textbooks by grade; for grade 13 they used university textbooks. They used that as a basis for how difficult a piece of text is. With some statistical magic, they can tell you which grade level any given text corresponds to. In the paper they said that a large amount of katakana gives an easier score. They don't know exactly why, but their theory is that Japanese school textbooks don't have that much katakana in them.

It actually doesn't matter what the score is really, as long as it is consistent. So if you put some text in and it gives you a score of X, then any more difficult text should score less than X, and any easier text more than X (with 0 being hard, 100 being easy).

Just out of curiosity I threw some stuff in:
Genki (I), 1st dialog: 1
Genki (I), last dialog: 5
HP 1, chap 8: 6
HP 7, chap 8: 6
Mainichi random articles: between 9 and 12
Linguaphone Dialog 1: 8
Linguaphone Dialog 2: 5
Linguaphone Dialog 50: 8

The Linguaphone scores are interesting, since dialog 1 is relatively simple. I think it's to do with the fact that the sentences are short, but contain a relatively large number of kanji.
Reply
#20
Do they eliminate katakana consideration from Obi2? Because they noted that the readability estimate becomes stable once they exclude them from the operative characters?

At any rate, I believe that novels won't all be consistent in their readability, i.e. one passage from 1Q84 might be shown as Grade 9 while the other is ‘14’, but we're looking at overall scores... So if you're looking for fine-grained, reader-specific stuff, might want to keep it to 200-2000 character passages.

Is 1Q84 (a friend of a friend got Grade 10 overall) not considered a book appropriately easy for Japanese high school students?
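
If anyone wants to score passages rather than whole books, a sketch like the following could cut a text into chunks to feed to whichever tool you use. It assumes sentences end at 。 and packs whole sentences greedily up to `max_chars`; a single sentence longer than the limit ends up as its own oversized passage. `split_passages` is an illustrative name, and 2000 is just the upper end of the range suggested above.

```python
import re


def split_passages(text, max_chars=2000):
    """Greedily pack whole sentences into passages of at most max_chars characters."""
    # Each match is a run of non-kuten characters plus its closing 。 if present.
    sentences = re.findall(r"[^。]+。?", text)
    passages, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            passages.append(current)
            current = ""
        current += sentence
    if current:
        passages.append(current)
    return passages


for passage in split_passages("あい。" * 5, max_chars=6):
    print(len(passage), passage)
```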
Reply
#21
nest0r Wrote:Do they eliminate katakana consideration from Obi2? Because they noted that the readability estimate becomes stable once they exclude them from the operative characters?

At any rate, I believe that novels won't all be consistent in their readability, i.e. one passage from 1Q84 might be shown as Grade 9 while the other is ‘14’, but we're looking at overall scores... So if you're looking for fine-grained, reader-specific stuff, might want to keep it to 200-2000 character passages.

Is 1Q84 (a friend of a friend got Grade 10 overall) not considered a book appropriately easy for Japanese high school students?
Well, to go into more detail: I first asked my friend to bring me 1Q84 from Japan, but then I said that I actually needed something easier than ordinary novels... something for high or middle schoolers.

Then he brought that 東野圭吾 book and said it's quite easy, for high schoolers.
This information is definitely not very certain; I just added it because it seemed to fit :D

But anyway, if more detailed information is needed I can ask him, as he has read both books.

Well, MAYBE I just have bad luck when running books through rating programs, but I'm quite certain that each time I put at least 1500 characters into the program. Too long..?
Edited: 2011-02-19, 5:53 pm
Reply
#22
Processing took more like 3.5 hours but the lists are ready!

This list is sorted by Obi-2 level
Format: Obi-2 level [tab] Hayashi score [tab] Filename
http://dpaste.org/0F45/

This list is sorted by Hayashi score
Format: Hayashi score [tab] Obi-2 level [tab] Filename
http://dpaste.org/FMwU/

Obi-2 Level
Lower scores imply that the work is more readable/easier.
The levels correspond to grade levels:
1-6: Elementary school (6 years)
7-9: Junior high school (3 years)
10-12: High school (3 years)
13: Beyond high school

Hayashi Score
Higher scores imply that the work is more readable/easier.
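
For anyone who wants to work with the first list programmatically, here is a small sketch that parses one line of the "Obi-2 level [tab] Hayashi score [tab] Filename" format and maps the level to the school stages listed above. Treating every level of 13 or more as "Beyond high school" is my assumption, and `parse_obi2_line` and `school_stage` are illustrative names; the sample line is made up.

```python
def school_stage(level):
    """Map an Obi-2 grade level to the school stage given in the post."""
    if 1 <= level <= 6:
        return "Elementary school"
    if 7 <= level <= 9:
        return "Junior high school"
    if 10 <= level <= 12:
        return "High school"
    if level >= 13:  # assumption: anything from 13 up counts as beyond high school
        return "Beyond high school"
    raise ValueError(f"unexpected level: {level}")


def parse_obi2_line(line):
    """Split 'level<TAB>score<TAB>filename' into (int, float, str)."""
    level, score, filename = line.rstrip("\n").split("\t", 2)
    return int(level), float(score), filename


level, score, filename = parse_obi2_line("9\t64.1\texample_novel.txt\n")
print(filename, school_stage(level), score)
```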
Edited: 2011-02-19, 6:55 pm
Reply
#23
nest0r Wrote:Do they eliminate katakana consideration from Obi2? Because they noted that the readability estimate becomes stable once they exclude them from the operative characters?

At any rate, I believe that novels won't all be consistent in their readability, i.e. one passage from 1Q84 might be shown as Grade 9 while the other is ‘14’, but we're looking at overall scores... So if you're looking for fine-grained, reader-specific stuff, might want to keep it to 200-2000 character passages.

Is 1Q84 (a friend of a friend got Grade 10 overall) not considered a book appropriately easy for Japanese high school students?
Not sure what they did about the katakana problem; nothing is mentioned, so probably nothing. I think if you want to analyze sentences or paragraphs the Hayashi index would work better, since it works on a sentence level.
Reply
#24
Wow, thank you for doing this.

So I wonder how accurate it is. ^_^ Judging by the titles, seems like it might line up well with grade level standards? For instance, a book of rakugo is 3, while a book on Foucault is 13.

@travis

My impression is that even with as little as 25 characters and incomplete sentences, it's more accurate than the Hayashi/Tateisi formula? I don't know, I guess I'll keep a mental note there for stuff heavy on katakana but otherwise assume Obi2 is always superior. Not that I'll be using this in any strict way.
Edited: 2011-02-19, 7:31 pm
Reply
#25
Some charts:

[Image: obi2.png]

Mean Obi-2 Level: 8.523440604


[Image: hayashi.png]

Mean Hayashi Score: 62.27695464
Median Hayashi Score: 64.09861048
Hayashi Standard Deviation: 11.2405696
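
Summary figures like these can be reproduced from a score list with Python's `statistics` module. This is only a sketch: `summarize` is an illustrative name, the sample numbers are placeholders, and it uses the population standard deviation (`pstdev`), which may or may not be the variant behind the figure above.

```python
from statistics import mean, median, pstdev


def summarize(scores):
    """Return (mean, median, population standard deviation) of a score list."""
    return mean(scores), median(scores), pstdev(scores)


# Placeholder values, not taken from the novel list.
avg, mid, spread = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(avg, mid, spread)
```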
Edited: 2011-02-19, 7:39 pm
Reply