Just made a few updates to the site: added some more light novels and tuned some parameters. Here's the current coverage. Please share your thoughts.
About the low noun coverage, I think it may be better to publish one "core nouns" list (the current list can be seen as one) and several lists of high-frequency, novel-specific nouns.
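To make the idea concrete, here is a rough sketch of how such a split could be computed. It is not the site's actual script. It assumes each novel has already been reduced to a list of noun tokens, and the names `split_noun_lists`, `coverage`, `min_share` and `top_n` are made up for illustration. A noun counts as "core" if it shows up in at least a given share of the novels; everything else is a candidate for that novel's own list.

```python
from collections import Counter


def split_noun_lists(novels, min_share=0.5, top_n=3):
    """Split nouns into one core list and per-novel lists.

    novels: dict mapping a title to a list of noun tokens
            (assumed to be segmented already; hypothetical input format).
    min_share: share of novels a noun must appear in to count as core.
    top_n: how many novel-specific nouns to keep per novel.
    """
    counts = {title: Counter(tokens) for title, tokens in novels.items()}

    # Total frequency across the whole corpus.
    total = Counter()
    for c in counts.values():
        total.update(c)

    # In how many novels does each noun appear at least once?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())

    threshold = min_share * len(novels)
    core = [n for n, _ in total.most_common() if doc_freq[n] >= threshold]
    core_set = set(core)

    # Per novel: its most frequent nouns that are not in the core list.
    specific = {
        title: [n for n, _ in c.most_common() if n not in core_set][:top_n]
        for title, c in counts.items()
    }
    return core, specific


def coverage(tokens, vocab):
    """Fraction of token occurrences that are found in vocab."""
    if not tokens:
        return 0.0
    vocab = set(vocab)
    return sum(1 for t in tokens if t in vocab) / len(tokens)
```

With something like this, coverage for one novel could be measured twice, once against the core list alone and once against the core list plus that novel's own list, to see how much the novel-specific list actually adds.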
Also, the script files are now available.
Feel free to use them (and the data) in any way.
Some links:
Vocab:
- Most frequent nouns, prefixes and suffixes
- Most frequent i-adjectives, na-adjectives
- Most frequent independent, suru verbs
- Most frequent adverbs and adverbial conjunctions
- Most frequent pronouns, particles, conjunctions
Kanji:
- Most frequent kanji
- Comparison with RTK1, RTK3, Jouyou, JLPT1, and an overview
nest0r wrote:
From what's available in .txt form, even just 3 volumes per work/author would I think be at least 200 files? (For now... mwa ha ha...)
Unfortunately, most of the OCR'd files seem to have been very poorly revised. I'd rather have a smaller (but still not small) trustworthy corpus than a larger messy one.
Last edited by iSoron (2010 February 16, 5:37 am)