How many readings do you need to know?

Thora Member
From: Canada Registered: 2007-02-23 Posts: 1691

mezbup wrote:

That article was a great read, basically confirmed a lot of what we've been talking about.  A lot of the points I was making are in there. The thing I was most interested to read about was at which point a learner can tackle a novel and not have the level of difficulty be overwhelming.

Yes. It also points out that there's little point trying to look up or pre-learn every word b/c (a) many words appear only 1-2 times, and (b) you'd have to read 3-5 more novels to see half of those words again. Also, 2000 words (plus a few proper nouns) covers 95%  - which they equate to adequate comprehension and ability to guess meaning from context. For them, "pleasurable" is at 97-98% (5000 words). I can see some learners having a higher tolerance for unknowns (pleasure not necessarily equating with ease and complete comprehension.)

Isoron: wow!  That article recommends pre-learning only vocab which occurs frequently in a novel. Unfortunately, creating such a glossary requires computer analysis. Which you have done for several of them! Can we include your link in rasera's My study method thread?

Laroche: Interesting range of #. Textbooks seem easier to me than Japanese literature. I can't seem to shake this feeling that I'm only understanding stories at a most basic literal level and missing both the Big Ideas and the subtleties. I keep giving up ...  sad

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

If we took like a 100 light novels, etc. and compiled from those kinds of analyses a list of kanji and vocabulary that covers a percentage similar to the paper Thora posted, that could be good, especially with the breakdown into each work. Say, 5-10 volumes of 10-20 series? That's probably only half or a third of what's available via hongfire, raseru, etc.

Like an RTKLiteNovel, and a corresponding smart.fm list of words/sentences for import into Anki (or just the deck)?

Last edited by nest0r (2010 February 14, 3:17 pm)

iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

Thora wrote:

Isoron: wow!  That article recommends pre-learning only vocab which occurs frequently in a novel. Unfortunately, creating such a glossary requires computer analysis. Which you have done for several of them! Can we include your link in rasera's My study method thread?

Sure.

BTW, I just updated the site with a more detailed breakdown of the vocab. For the curious, I'm using Mecab plus a mess of bash, python and ruby scripts.

Last edited by iSoron (2010 February 14, 5:55 pm)

pm215 Member
From: UK Registered: 2008-01-26 Posts: 1354

iSoron wrote:

BTW, I just updated the site with a more detailed breakdown of the vocab. For the curious, I'm using Mecab plus a mess of bash, python and ruby scripts.

Do you have to teach it about the proper nouns (character names etc) for each source, or does it guess? (I ask because the Suzumiya analysis lists 古泉 in both 'nouns_general' and 'nouns_proper'.)

yukamina Member
From: Canada Registered: 2006-01-09 Posts: 761

Finally frequency lists based off of (light) novels instead of newspapers. Not sure what to do with them, though.

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

yukamina wrote:

Finally frequency lists based off of (light) novels instead of newspapers. Not sure what to do with them, though.

Each list is only an individual novel. It'll be truly useful once applied to a corpus of them. As is I think it's only useful if you're reading one of the applicable novels. I've been planning on reading NHKにようこそ! so I'm happy though. Too bad I only own the book and not a txt file though hmm

kazelee Rater Mode
From: ohlrite Registered: 2008-06-18 Posts: 2132 Website

LaLoche wrote:

Reminds me of this article about the vocab needed to read a selection of first year general texts at university (in the Netherlands, but applicable, I think, to English) which came up with over 11,000 words!!
http://applij.oxfordjournals.org/cgi/gc … act%28s%29

There weren't enough grammar Nazi's derailing threads

http://forum.koohii.com/viewtopic.php?pid=93256#p93256

so ...

An Oxford Journal wrote:

This study aimed to answer the question of how many words of the Dutch language, and which words, and adult non-native speaker needs to know receptively in order to be able to understand first-year university reading materials In the first part of this study,

Have fun smile

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

There are a number of other mistakes too, but the authors are not native speakers.

More than language issues, it just seems that they are horrible typists and couldn't be bothered to proofread.

ruiner Member
Registered: 2009-08-20 Posts: 751

Jarvik7 wrote:

yukamina wrote:

Finally frequency lists based off of (light) novels instead of newspapers. Not sure what to do with them, though.

Each list is only an individual novel. It'll be truly useful once applied to a corpus of them. As is I think it's only useful if you're reading one of the applicable novels. I've been planning on reading NHKにようこそ! so I'm happy though. Too bad I only own the book and not a txt file though hmm

Working on it. My imaginary friend already has a .txt of NHKetc.

iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

Jarvik7 wrote:

Each list is only an individual novel. It'll be truly useful once applied to a corpus of them. As is I think it's only useful if you're reading one of the applicable novels.

Here's a first and very rough attempt at that:

    http://isoron.org/stuff/japanese/analysis/pack/INFO
    http://isoron.org/stuff/japanese/analysis/pack/

Will refine it later, and check whether it covers the required 95%. I also plan on releasing the scripts, so that others can continue/check the work.
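A coverage check of the kind described here could be sketched roughly as follows. This is only an illustration, not the scripts that were released: it assumes the novels have already been tokenized (for example with MeCab) into a table of word counts, and the names `coverage` and `words_needed` are made up for this example.

```python
from collections import Counter


def coverage(counts: Counter, known: set) -> float:
    """Fraction of all running words (tokens) that are in the `known` set."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in counts.items() if word in known)
    return covered / total


def words_needed(counts: Counter, target: float = 0.95) -> int:
    """How many of the most frequent words are needed to reach `target` coverage."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= target * total:
            return rank
    # Only reached for an empty table, where there is nothing to cover.
    return len(counts)


# Toy counts, invented for illustration only.
counts = Counter({"の": 50, "俺": 30, "古泉": 15, "団長": 5})
print(coverage(counts, {"の", "俺"}))
print(words_needed(counts, target=0.9))
```

Feeding it the counts for a single novel and a candidate word list would show whether that list gets anywhere near the 95% figure from the article.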

pm215 wrote:

Do you have to teach it about the proper nouns (character names etc) for each source, or does it guess? (I ask because the Suzumiya analysis lists 古泉 in both 'nouns_general' and 'nouns_proper'.)

I should have taught it about the proper nouns. Right now, it's having to guess.

Last edited by iSoron (2010 February 15, 7:07 am)

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

@iSoron: Could you list which novels make up your corpus in the INFO?

Last edited by Jarvik7 (2010 February 15, 7:12 am)

iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

Jarvik7 wrote:

@iSoron: Could you list which novels make up your corpus in the INFO?

Done.

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

Thanks.

It also might be useful to indicate how many times the word occurred across the entire corpus, instead of just counting how many novels it appeared in. It can give a more detailed picture of how common a word is (appeared 2 times in 11 novels vs appeared 500 times in 10 novels).
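The two numbers being contrasted here (how many novels a word appears in, and how often it occurs overall) can be collected in one pass. A minimal sketch, assuming each novel is already a list of tokens; the function name `corpus_stats` and the sample data are hypothetical:

```python
from collections import Counter
from typing import Mapping, Sequence, Tuple


def corpus_stats(novels: Mapping[str, Sequence[str]]) -> Tuple[Counter, Counter]:
    """Return (total occurrences per word, number of novels containing each word)."""
    total = Counter()
    novel_count = Counter()
    for tokens in novels.values():
        total.update(tokens)             # every occurrence counts
        novel_count.update(set(tokens))  # each novel counts once per word
    return total, novel_count


# Invented sample data, for illustration only.
novels = {
    "novel_a": ["俺", "俺", "本"],
    "novel_b": ["俺", "猫"],
}
total, novel_count = corpus_stats(novels)
print(total["俺"], novel_count["俺"])
```

Listing both columns side by side would separate a word that shows up twice in eleven novels from one that shows up hundreds of times in ten.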

Last edited by Jarvik7 (2010 February 15, 7:21 am)

iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

Jarvik7 wrote:

It also might be useful to indicate how many times the word occurred across the entire corpus, instead of just counting how many novels it appeared in. It can give a more detailed picture of how common a word is (appeared 2 times in 11 novels vs appeared 500 times in 10 novels).

Yep. I refrained from doing this at first because a single light novel could very easily skew the numbers. For example, 俺 appears almost 8000 times in the Suzumiya books, which is not typical at all, even for a light novel of its size.

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

I've never done much statistical analysis using shell scripts, but you should be able to snip outliers like that. In any case, it could be in addition to the novel count.
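One way to keep a single novel from dominating, sketched in Python rather than shell: convert each novel's count to a rate (occurrences per so many tokens) and take the median across novels, which ignores one extreme value. This swaps in a median for the "snip the outliers" idea and is not what the released scripts do; `median_rate` and the sample data are hypothetical.

```python
from statistics import median
from typing import Mapping, Sequence


def median_rate(novels: Mapping[str, Sequence[str]], word: str, per: int = 10_000) -> float:
    """Median, across novels, of how often `word` occurs per `per` tokens."""
    rates = [
        tokens.count(word) * per / len(tokens)
        for tokens in novels.values()
        if len(tokens) > 0  # skip empty files to avoid dividing by zero
    ]
    return median(rates) if rates else 0.0


# Invented data: one novel uses 俺 far more heavily than the other two.
novels = {
    "heavy": ["俺"] * 80 + ["x"] * 20,
    "light_1": ["俺"] * 1 + ["x"] * 99,
    "light_2": ["俺"] * 2 + ["x"] * 98,
}
print(median_rate(novels, "俺", per=100))
```

Because the rate is relative to each novel's length, long and short novels are compared on the same footing, and it can be reported alongside the novel count rather than replacing it.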

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Exciting!

Last edited by nest0r (2010 February 15, 2:19 pm)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

iSoron wrote:

Jarvik7 wrote:

It also might be useful to indicate how many times the word occurred across the entire corpus, instead of just counting how many novels it appeared in. It can give a more detailed picture of how common a word is (appeared 2 times in 11 novels vs appeared 500 times in 10 novels).

Yep. I refrained from doing this at first because a single light novel could very easily skew the numbers. For example, 俺 appears almost 8000 times in the Suzumiya books, which is not typical at all, even for a light novel of its size.

Something raseru suggested was to avoid using more than 1 or (I think) a few volumes per author; that might prevent too much of a stylistic/authorial skew. From what's available in .txt form, even just 3 volumes per work/author would, I think, be at least 200 files? (For now... mwa ha ha...)

Last edited by nest0r (2010 February 15, 6:16 pm)

iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

Just made a few updates to the site. Added some more light novels, and tuned some parameters. Here's the current coverage. Please, share your thoughts.

As for the low noun coverage, I think it may be better to publish one "core nouns" list (the current one can be seen as such) and several lists of high-frequency novel-specific nouns.

Also, the script files are now available.
Feel free to use them (and the data) in any way.

Some links:

    Vocab:
        - Most frequent nouns, prefixes and suffixes
        - Most frequent i-adjectives, na-adjectives
        - Most frequent independent, suru verbs
        - Most frequent adverbs and adverbial conjunctions
        - Most frequent pronouns, particles, conjunctions

    Kanji:
        - Most frequent kanji
        - Comparison with RTK1, RTK3, Jouyou, JLPT1, and an overview

nest0r wrote:

From what's available in .txt form, even just 3 volumes per work/author would I think be at least 200 files? (For now... mwa ha ha...)

Unfortunately, most of the OCR'd files seem to have been very poorly revised. I'd rather have a smaller (but still not small) trustworthy corpus than a larger messy one.

Last edited by iSoron (2010 February 16, 5:37 am)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Really? Which ones are bad? I've been going through and comparing scans vs. OCR haphazardly but haven't found any errors. It's such a pain, having to check them. ;p

Last edited by nest0r (2010 February 16, 5:06 am)

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

I think that even if there were occasional OCR errors it wouldn't matter, since an identical error wouldn't occur frequently enough to skew the statistics. That's the wonderful thing about averaging large datasets.

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Hmm, still can't find any errors, but I have noticed something that can be confusing: the way furigana is either bracketed or even given its own lines. Takes some getting used to, I suppose.

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

The bracketed ones are aozora-bunko format, so I assume if you open it in an appropriate reader it will look pretty.

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Jarvik7 wrote:

The bracketed ones are aozora-bunko format, so I assume if you open it in an appropriate reader it will look pretty.

I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?

And iSoron, dude, I don't know what you're talking about! If anything, the OCRs seem disturbingly good, as if some robot went through and hand-corrected everything page by page.

Last edited by nest0r (2010 February 16, 5:54 am)

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

You could write a shell script to turn it into HTML (or just do a search and replace). Safari now supports ruby tags (in the nightlies).
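The search-and-replace idea could look something like the sketch below, written in Python instead of shell. It assumes the "brackets" are the usual Aozora Bunko ruby marks, that is 漢字《よみ》 with an optional ｜ marking where the base text starts, and it ignores every other Aozora annotation. The function name `aozora_to_ruby_html` is made up for this example.

```python
import re

# Explicit form: ｜base《reading》 (the bar marks where the base text starts).
EXPLICIT = re.compile(r"｜([^｜《》]+)《([^《》]+)》")
# Implicit form: a run of kanji immediately followed by 《reading》.
IMPLICIT = re.compile(r"([\u4e00-\u9fff々〆ヶ]+)《([^《》]+)》")

RUBY = r"<ruby>\1<rt>\2</rt></ruby>"


def aozora_to_ruby_html(text: str) -> str:
    """Replace Aozora-style ruby marks with HTML <ruby> tags."""
    # Handle the explicit form first so its base text is not cut short
    # by the kanji-only pattern.
    text = EXPLICIT.sub(RUBY, text)
    return IMPLICIT.sub(RUBY, text)


print(aozora_to_ruby_html("古泉《こいずみ》が笑った"))
```

The output still needs to be wrapped in an HTML page and opened in a browser that renders ruby tags.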

Reply #100 - 2010 February 16, 6:03 am
iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

nest0r wrote:

I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?

This is what I use:

    http://isoron.org/stuff/japanese/aozora/aozora_to_html
    http://isoron.org/stuff/japanese/aozora/aozora_ruby.py

More scripts, sorry. tongue

nest0r wrote:

And iSoron, dude, I don't know what you're talking about! If anything, the OCRs seem disturbingly good, as if some robot went through and hand-corrected everything page by page.

I've seen some pretty bad ones, with page headers/numbering still there, for example. But well, never mind. Will recheck later.