A corpus of novel reviews which are innocent, and allegedly famous
+- kanji koohii FORUM (http://forum.koohii.com)
+-- Forum: Learning Japanese (http://forum.koohii.com/forum-4.html)
+--- Forum: Learning resources (http://forum.koohii.com/forum-9.html)
+--- Thread: A corpus of novel reviews which are innocent, and allegedly famous (/thread-13049.html)
A corpus of novel reviews which are innocent, and allegedly famous - rainmaninjapan - 2015-10-02

http://forum.koohii.com/showthread.php?tid=12773&page=2

In that thread cophnia61 compared the first 7k words of core 10k with the first 7k of the "famous list based on the innocent novel corpus" for the LN Zero no Tsukaima, and apparently the results using cb's Japanese Text Analysis Tool were 28% total coverage with core, and 91% coverage with the first 7k of the corpus of innocence. That can't possibly be right, right? I'd imagine that the 7k is superior because it draws off of innocent reviews of novels, but I can't imagine the difference is that great.

Anyway, does anyone have a link to this innocent corpus? In list form? Like, all put together and whatnot? Or a set of anki cards (or a spreadsheet to make them off of)? cophnia61 also apparently put up the list he had in pastebin, but it's unfortunately defunct.

http://forum.koohii.com/showthread.php?pid=214948#pid214948

A corpus of novel reviews which are innocent, and allegedly famous - yogert909 - 2015-10-02

I've realized that those statistics are a little bit misleading. I believe he used only the individual vocabulary words from core and compared them against the innocent corpus. The problem with this methodology is that core's vocabulary doesn't include particles and other "words" that you don't really need a vocabulary card for, but are VERY common. Basically, the most common 20-30 words in Japanese, "words" like の, だ, から, etc., make up roughly 30% of the language but aren't included in the statistic because core doesn't treat them as vocabulary words. の by itself is 9% of the Japanese language.

On the other hand, the innocent corpus is a collection of books. I'm not going to break protocol and tell you where it is, but if you ask google the right questions, he will tell you.
So what I'm sure happened was that cophnia61 ran the frequency analyser on the corpus and got a list of words that contained those frequent words that the core vocabulary list is missing. If you were to add the particles that anyone should know to the core list, you'd get a much closer race. Although I'm sure the innocent list would still win because it's tailored to that corpus.

Though not the innocent corpus specifically, you may like to browse this post which compares core and another anki deck to several corpora.

A corpus of novel reviews which are innocent, and allegedly famous - rainmaninjapan - 2015-10-02

Ah, I saw that. I'm the guy that replied to you there.

My goal is pretty much to be reading LN (and watching Taiga dramas, not as interested in manga/anime/newspapers), and seeing how there are words like 日ソ in core 6k, I kind of want to look for an alternative. So there isn't a "list" (anymore) of words or anything, and I just have to fiddle with mecab and the text analyzer for 5 hours until I get it the way I want, right?

A corpus of novel reviews which are innocent, and allegedly famous - yogert909 - 2015-10-02

There are a few other things to keep in mind even if the core list included those words. The first thing is that the innocent list is designed to be the optimal list to work with the innocent corpus. So any other list with the same number of words will score lower on that corpus than the list generated from it.

The other thing to keep in mind is that the innocent frequency list is only that good for those specific books. There's a good chance that if you were to run the same test on some similar books of the same genre, you would find the list wouldn't do quite as well as it does on the in-sample books.

I'd still expect the innocent frequency list to do better than core when it comes to Japanese fiction, but there are a number of factors that stack the deck to make the difference look much wider than it actually should be.
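The two effects described above (particles missing from a vocabulary list, and a frequency list being tested on the same text it was built from) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the text is assumed to be already split into tokens (for instance by MeCab), the token lists and the `core_like` set are made-up data, and `token_coverage` and `top_n_words` are hypothetical helper names, not anything from cb's Japanese Text Analysis Tool or the core decks.

```python
from collections import Counter


def token_coverage(tokens, known_words):
    """Fraction of running tokens (not unique words) that appear in known_words."""
    if not tokens:
        return 0.0
    known = set(known_words)
    hits = sum(1 for token in tokens if token in known)
    return hits / len(tokens)


def top_n_words(tokens, n):
    """The n most frequent words in a token list, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


# Made-up token stream standing in for a tokenized novel (assumption).
tokens = ["猫", "の", "名前", "は", "まだ", "ない", "の",
          "だ", "から", "猫", "の", "話", "だ"]

# Hypothetical stand-in for a core-style list: content words only, no particles.
core_like = {"猫", "名前", "まだ", "ない", "話"}
particles = {"の", "は", "だ", "から"}

# Effect 1: leaving particles out of the list drags token coverage down.
without_particles = token_coverage(tokens, core_like)
with_particles = token_coverage(tokens, core_like | particles)

# Effect 2: a list built from one text scores best on that same text.
other_tokens = ["犬", "の", "名前", "は", "ポチ", "だ"]  # a different made-up text
freq_list = top_n_words(tokens, 4)
in_sample = token_coverage(tokens, freq_list)
out_of_sample = token_coverage(other_tokens, freq_list)

print(f"core-like list, no particles:   {without_particles:.0%}")
print(f"core-like list, with particles: {with_particles:.0%}")
print(f"frequency list, in-sample:      {in_sample:.0%}")
print(f"frequency list, out-of-sample:  {out_of_sample:.0%}")
```

The numbers from toy data mean nothing in themselves; the point is only that both effects push in the same direction, so a fair comparison would add the common particles to both lists and score them on books that neither list was built from.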
A corpus of novel reviews which are innocent, and allegedly famous - yogert909 - 2015-10-02

rainmaninjapan Wrote: Ah, I saw that. I'm the guy that replied to you there.

If you have the books that you want to read in text files, the Japanese text analyser will process an entire folder of text files. It shouldn't take 5 hours unless the corpus is millions of pages long. I think that's an excellent way to proceed btw... making a custom frequency list tailored to the exact material you will be reading.

A corpus of novel reviews which are innocent, and allegedly famous - rainmaninjapan - 2015-10-02

どうもありがとうミスターロバート。どうも、どうも!

I was just reluctant to use more things other than anki (if it isn't completely necessary), as I already have PTSD from setting all that up. If I must, I guess I will begin another day of learning how to use these confounded tools. Thanks!

A corpus of novel reviews which are innocent, and allegedly famous - yogert909 - 2015-10-02

Japanese text analyzer is easy provided it works on your computer. It's as simple as pointing it to the text you want to analyze and where you want to save the report. Getting definitions of those words automatically might be a little harder though. There used to be an anki plugin, but it doesn't work since jdic changed its hosting. I tried changing the url in the plugin, but wasn't getting any love.

A corpus of novel reviews which are innocent, and allegedly famous - vix86 - 2015-10-02

Someone should build a corpus using web novels from http://ncode.syosetu.com

I'd be curious to see what the overlap is with other decks/corpuses. You can think of Syosetsu as being in the same category as Light Novels, as many of the really popular web novels have gone on to get publisher deals.

A corpus of novel reviews which are innocent, and allegedly famous - rainmaninjapan - 2015-10-02

Sounds like a relatively easy way to build a corpus if they're already on the web. Some of these programmer guys should get on it!
Also, Syosetsu is hideous romanization... mixing syo with tsu?

A corpus of novel reviews which are innocent, and allegedly famous - Zarxrax - 2015-10-02

Just curious, when people refer to "innocent" novels, what definition of the word innocent is being used here?

A corpus of novel reviews which are innocent, and allegedly famous - Bokusenou - 2015-10-03

It all started after this super innocent thread got made: http://forum.koohii.com/showthread.php?tid=5048