
How many readings do you need to know?

#76
mezbup Wrote:That article was a great read, basically confirmed a lot of what we've been talking about. A lot of the points I was making are in there. The thing I was most interested to read about was at which point a learner can tackle a novel and not have the level of difficulty be overwhelming.
Yes. It also points out that there's little point trying to look up or pre-learn every word, because (a) many words appear only 1-2 times, and (b) you'd have to read 3-5 more novels to see half of those words again. Also, 2000 words (plus a few proper nouns) covers 95%, which they equate with adequate comprehension and the ability to guess meaning from context. For them, "pleasurable" is at 97-98% (5000 words). I can see some learners having a higher tolerance for unknowns (pleasure not necessarily equating with ease and complete comprehension).
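For concreteness, "coverage" here just means the share of running-text tokens that fall inside a known vocabulary. A toy sketch in Python (invented tokens, not the paper's data):

```python
from collections import Counter

def coverage(tokens, known_words):
    """Fraction of running-text tokens covered by a known vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = sum(n for w, n in counts.items() if w in known_words)
    return covered / total

# Toy text: 10 tokens, of which "rare" appears once and is unknown.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "sat", "rare"]
known = {"the", "cat", "sat", "on", "mat"}
print(round(coverage(tokens, known), 2))  # 9 of 10 tokens known -> 0.9
```

At 95% coverage, one token in twenty is unknown, which is roughly one word per sentence or two; that's why the paper treats it as the floor for guessing from context.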

Isoron: wow! That article recommends pre-learning only vocab which occurs frequently in a novel. Unfortunately, creating such a glossary requires computer analysis. Which you have done for several of them! Can we include your link in rasera's My study method thread?

Laroche: Interesting range of #. Textbooks seem easier to me than Japanese literature. I can't seem to shake this feeling that I'm only understanding stories at a most basic literal level and missing both the Big Ideas and the subtleties. I keep giving up ... Sad
#77
If we took, like, 100 light novels and compiled from those kinds of analyses a list of kanji and vocabulary that covers a percentage similar to the paper Thora posted, that could be good, especially with the breakdown into each work. Say, 5-10 volumes of 10-20 series? That's probably only half or a third of what's available via hongfire, raseru, etc.

Like an RTKLiteNovel, and a corresponding smart.fm list of words/sentences for import into Anki (or just the deck)?
Edited: 2010-02-14, 4:17 pm
#78
Thora Wrote:Isoron: wow! That article recommends pre-learning only vocab which occurs frequently in a novel. Unfortunately, creating such a glossary requires computer analysis. Which you have done for several of them! Can we include your link in rasera's My study method thread?
Sure.

BTW, I just updated the site with a more detailed breakdown of the vocab. For the curious, I'm using Mecab plus a mess of bash, python and ruby scripts.
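For anyone curious what the counting stage of such a pipeline looks like: a minimal Python-only sketch of a per-part-of-speech frequency breakdown, assuming the text has already been tokenised and tagged (iSoron's pipeline does that with MeCab; the English POS labels and the tiny sample here are placeholders, not his actual output):

```python
from collections import Counter, defaultdict

# Pretend MeCab has already tokenised and tagged the text;
# each item is (surface form, part of speech).
tagged = [
    ("俺", "pronoun"), ("は", "particle"), ("本", "noun"),
    ("を", "particle"), ("読む", "verb"), ("本", "noun"),
]

# Bucket frequency counts by part of speech.
by_pos = defaultdict(Counter)
for surface, pos in tagged:
    by_pos[pos][surface] += 1

for pos, counts in sorted(by_pos.items()):
    print(pos, counts.most_common())
```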
Edited: 2010-02-14, 6:55 pm
#79
iSoron Wrote:BTW, I just updated the site with a more detailed breakdown of the vocab. For the curious, I'm using Mecab plus a mess of bash, python and ruby scripts.
Do you have to teach it about the proper nouns (character names etc) for each source, or does it guess? (I ask because the Suzumiya analysis lists 古泉 in both 'nouns_general' and 'nouns_proper'.)
#80
Finally frequency lists based off of (light) novels instead of newspapers. Not sure what to do with them, though.
#81
yukamina Wrote:Finally frequency lists based off of (light) novels instead of newspapers. Not sure what to do with them, though.
Each list is only an individual novel. It'll be truly useful once applied to a corpus of them. As is, I think it's only useful if you're reading one of the applicable novels. I've been planning on reading NHKにようこそ! so I'm happy. Too bad I only own the book and not a txt file, though :/
#82
LaLoche Wrote:Reminds me of this article about the vocab needed to read a selection of first year general texts at university (in the Netherlands, but applicable, I think, to English) which came up with over 11,000 words!!
http://applij.oxfordjournals.org/cgi/gca...act%28s%29
There weren't enough grammar Nazis derailing threads

http://forum.koohii.com/showthread.php?p...1#pid89771

so ...

An Oxford Journal Wrote:This study aimed to answer the question of how many words of the Dutch language, and which words, and adult non-native speaker needs to know receptively in order to be able to understand first-year university reading materials In the first part of this study,
Have fun Smile
#83
There are a number of other mistakes too, but the authors are not native speakers.

More than language issues, it just seems that they are horrible typists and couldn't be bothered to proofread.
#84
Jarvik7 Wrote:
yukamina Wrote:Finally frequency lists based off of (light) novels instead of newspapers. Not sure what to do with them, though.
Each list is only an individual novel. It'll be truly useful once applied to a corpus of them. As is, I think it's only useful if you're reading one of the applicable novels. I've been planning on reading NHKにようこそ! so I'm happy. Too bad I only own the book and not a txt file, though :/
Working on it. My imaginary friend already has a .txt of NHKetc.
#85
Jarvik7 Wrote:Each list is only an individual novel. It'll be truly useful once applied to a corpus of them. As is, I think it's only useful if you're reading one of the applicable novels.
Here's a first and very rough attempt at that:

http://isoron.org/stuff/japanese/analysis/pack/INFO
http://isoron.org/stuff/japanese/analysis/pack/

Will refine it later, and check whether it covers the required 95%. I also plan on releasing the scripts, so that others can continue/check the work.

pm215 Wrote:Do you have to teach it about the proper nouns (character names etc) for each source, or does it guess? (I ask because the Suzumiya analysis lists 古泉 in both 'nouns_general' and 'nouns_proper'.)
I should have taught it about the proper nouns. Right now, it has to guess.
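For what it's worth, MeCab can be taught names via a user dictionary. A hedged sketch of generating one ipadic-style CSV entry for 古泉 (the exact field layout depends on which dictionary you compile against, so treat this as illustrative rather than as iSoron's actual setup):

```python
def userdic_entry(surface, reading, cost=1000):
    """Build one ipadic-style user-dictionary CSV line registering
    `surface` as a proper noun (person's surname).  Empty left/right
    context IDs let mecab-dict-index assign them automatically."""
    fields = [
        surface, "", "", str(cost),
        "名詞", "固有名詞", "人名", "姓", "*", "*",  # POS: noun / proper / name / surname
        surface, reading, reading,                  # base form, reading, pronunciation
    ]
    return ",".join(fields)

print(userdic_entry("古泉", "コイズミ"))
```

The resulting CSV would then be compiled with mecab-dict-index and passed to MeCab at analysis time, so 古泉 lands in 'nouns_proper' instead of being guessed.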
Edited: 2010-02-15, 8:07 am
#86
@iSoron: Could you list which novels make up your corpus in the INFO?
Edited: 2010-02-15, 8:12 am
#87
Jarvik7 Wrote:@iSoron: Could you list which novels make up your corpus in the INFO?
Done.
#88
Thanks.

It also might be useful to indicate how many times the word occurred across the entire corpus, instead of just counting how many novels it appeared in. It can give a more detailed picture of how common a word is (appeared 2 times in 11 novels vs. appeared 500 times in 10 novels).
Edited: 2010-02-15, 8:21 am
#89
Jarvik7 Wrote:It also might be useful to indicate how many times the word occurred across the entire corpus, instead of just counting how many novels it appeared in. It can give a more detailed picture of how common a word is (appeared 2 times in 11 novels vs. appeared 500 times in 10 novels).
Yep. I refrained from doing this at first because a single light novel could very easily skew the numbers. For example, 俺 appears almost 8000 times in the Suzumiya books, which is not typical at all, even for a light novel of its size.
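The difference between the two measures is easy to see on toy numbers (invented counts, apart from the roughly 8000-occurrence figure for 俺 mentioned above):

```python
from collections import Counter

# Toy corpus: per-novel word counts.  俺 is hugely skewed by one novel.
novels = {
    "novel_a": Counter({"俺": 8000, "学校": 40}),
    "novel_b": Counter({"俺": 120, "学校": 55}),
    "novel_c": Counter({"俺": 90, "学校": 60}),
}

def stats(word):
    """Document frequency (novels containing the word) vs. total count."""
    per_novel = [c[word] for c in novels.values() if word in c]
    return {"novels": len(per_novel), "total": sum(per_novel)}

print(stats("俺"))    # in 3 of 3 novels, but the total is dominated by one
print(stats("学校"))  # same document frequency, far smaller and flatter total
```

Both words appear in all three novels, so the novel count alone can't distinguish them; the totals can, but as iSoron notes, a single first-person narrator can blow the total up.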
#90
I've never done much statistical analysis using shell scripts, but you should be able to snip outliers like that. In any case, it could be in addition to the novel count.
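One simple way to snip such outliers, sketched in Python rather than shell: cap each novel's contribution at a multiple of the median before summing. The 3x factor here is an arbitrary illustrative choice, not anything from iSoron's scripts:

```python
import statistics

def robust_total(per_novel_counts):
    """Sum per-novel counts after capping each at 3x the median,
    so one wordy narrator can't dominate the corpus-wide figure."""
    med = statistics.median(per_novel_counts)
    cap = 3 * med
    return sum(min(c, cap) for c in per_novel_counts)

counts = [8000, 120, 90]     # e.g. 俺 across three novels
print(robust_total(counts))  # median 120 -> cap 360 -> 360 + 120 + 90 = 570
```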
#91
Exciting!
Edited: 2010-02-15, 3:19 pm
#92
iSoron Wrote:
Jarvik7 Wrote:It also might be useful to indicate how many times the word occurred across the entire corpus, instead of just counting how many novels it appeared in. It can give a more detailed picture of how common a word is (appeared 2 times in 11 novels vs. appeared 500 times in 10 novels).
Yep. I refrained from doing this at first because a single light novel could very easily skew the numbers. For example, 俺 appears almost 8000 times in the Suzumiya books, which is not typical at all, even for a light novel of its size.
Something raseru suggested was to avoid using more than one or (I think) a few volumes per author; that might prevent too much of a stylistic/authorial skew. From what's available in .txt form, even just 3 volumes per work/author would, I think, be at least 200 files? (For now... mwa ha ha...)
Edited: 2010-02-15, 7:16 pm
#93
Just made a few updates to the site. Added some more light novels, and tuned some parameters. Here's the current coverage. Please, share your thoughts.

About the low noun coverage, I think maybe it's better to publish one "core nouns" list (the current one can be seen as that) and several high-frequency novel-specific noun lists.

Also, the script files are now available.
Feel free to use them (and the data) in any way.

Some links:

Vocab:
- Most frequent nouns, prefixes and suffixes
- Most frequent i-adjectives, na-adjectives
- Most frequent independent, suru verbs
- Most frequent adverbs and adverbial conjunctions
- Most frequent pronouns, particles, conjunctions

Kanji:
- Most frequent kanji
- Comparison with RTK1, RTK3, Jouyou, JLPT1, and an overview

nest0r Wrote:From what's available in .txt form, even just 3 volumes per work/author would, I think, be at least 200 files? (For now... mwa ha ha...)
Unfortunately, most of the OCR'd files seem to have been very poorly revised. I'd rather have a smaller (but still not small) trustworthy corpus than a larger messy one.
Edited: 2010-02-16, 6:37 am
#94
Really? Which ones are bad? I've been going through and comparing scans vs. OCR haphazardly but haven't found any errors. It's such a pain, having to check them. ;p
Edited: 2010-02-16, 6:06 am
#95
I think that even if there were occasional OCR errors it wouldn't matter, since an identical error wouldn't occur frequently enough to skew the statistics. That's the wonderful thing about averaging large datasets.
#96
Hmm, still can't find any errors, but I have noticed something that can be confusing: the way furigana is either bracketed or even given its own lines. Takes some getting used to, I suppose.
#97
The bracketed ones are aozora-bunko format, so I assume if you open it in an appropriate reader it will look pretty.
#98
Jarvik7 Wrote:The bracketed ones are aozora-bunko format, so I assume if you open it in an appropriate reader it will look pretty.
I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?

And iSoron, dude, I don't know what you're talking about! If anything, the OCRs seem disturbingly good, as if some robot went through and hand-corrected everything page by page.
Edited: 2010-02-16, 6:54 am
#99
You could write a shell script to turn it into HTML (or just do a search and replace). Safari now supports ruby tags (in the nightlies).
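A search-and-replace along those lines, sketched in Python: convert 《…》 furigana (with the optional ｜ marker that delimits the base text) into HTML ruby tags. This only handles the two most common Aozora ruby patterns, so treat it as a starting point rather than a full converter:

```python
import re

# ｜ (optional) marks where the base text starts; 《…》 holds the furigana.
# Without ｜, the base is taken to be the kanji run just before the 《.
RUBY = re.compile(
    r"(?:｜(?P<base>[^《｜]+)|(?P<kanji>[々〆一-龯]+))《(?P<furi>[^》]+)》"
)

def aozora_to_ruby(text):
    def repl(m):
        base = m.group("base") or m.group("kanji")
        return f"<ruby>{base}<rt>{m.group('furi')}</rt></ruby>"
    return RUBY.sub(repl, text)

print(aozora_to_ruby("漢字《かんじ》を読む"))
print(aozora_to_ruby("｜超電磁砲《レールガン》"))
```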
#100
nest0r Wrote:I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?
This is what I use:

http://isoron.org/stuff/japanese/aozora/aozora_to_html
http://isoron.org/stuff/japanese/aozora/aozora_ruby.py

More scripts, sorry. Tongue

Quote:And iSoron, dude, I don't know what you're talking about! If anything, the OCRs seem disturbingly good, as if some robot went through and hand-corrected everything page by page.
I've seen some pretty bad ones, with page headers/numbering still there, for example. But well, never mind. Will recheck later.