nest0r Wrote: a similar experiment.
Along the same lines as that experiment, apparently if you try to install it with Wine in Linux, it runs perfectly fine. In theory, I mean.
2010-02-14, 3:04 pm
2010-02-14, 5:09 pm
Thanks heaps raseru- this should be really helpful as I've been trying to up my reading quantity. I bought a bunch of books before I left Japan a couple months ago but I haven't really started tackling them. I think I'll try going through some of the books you uploaded first and see how that goes.
edit: Is there a way to make the format prettier in Firefox? Some of them work fine, but I opened the first book of Harry Potter in Firefox and it's all readable but it's mashed together without any spacing/page breaks. If I open it in Word all the breaks are there and it's much more readable, but I can't use the quick look-up of Rikaichan.
edit2: Ok, figured it out- it works if I open the .txt file instead of the .html file in Firefox- then all of the formatting is preserved and Rikaichan still works.
Edited: 2010-02-14, 5:48 pm
2010-02-14, 5:59 pm
Weird, both work for me. Good thing the txt was still there then I guess
The code <pre style="white-space:pre-wrap"> is what allows you to view it in a browser without it being wacky. Before I found out how to do that I had to just suck it up, and even used an add-on that does word-wrap together with a plain <pre> lol
Edited: 2010-02-14, 6:03 pm
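For anyone who only has the .txt and wants an .html version that keeps the line breaks, here is a rough sketch of the wrapping described in the post above. The function name `wrap_text_as_html` and the sample text are made up for illustration; the only part taken from the thread is the `<pre style="white-space:pre-wrap">` tag. This is probably not how the uploaded files were actually generated.

```python
import html


def wrap_text_as_html(text: str, title: str = "book") -> str:
    """Return a minimal HTML page that shows `text` with its line breaks kept."""
    # Escape <, > and & so characters in the book are not read as markup.
    body = html.escape(text)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        # pre-wrap keeps the original line breaks but still wraps long lines.
        f'<body><pre style="white-space:pre-wrap">{body}</pre></body></html>\n'
    )


if __name__ == "__main__":
    sample = "第一章\n\n　吾輩は猫である。\n名前はまだ無い。"
    print(wrap_text_as_html(sample, title="sample"))
```

Write the returned string to a file with a .html extension (UTF-8 encoded) and open it in the browser.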
2010-02-14, 7:27 pm
Ohh look, I'll need to rerun my 'simulation', as the 'Corporate' edition of the software theoretically allows for unlimited processing, rather than 50 pages max.
2010-02-14, 8:16 pm
Just wanted to include here something from another thread. The topic was “essential” kanji and vocab and what might be required for various types of reading. iSoron kindly analyzed some of raseru’s texts and provided a detailed breakdown of the kanji and the vocab in each work.
iSoron Wrote: Maybe you'll find this useful? http://isoron.org/stuff/japanese/analysis/
For the curious, I'm using Mecab plus a mess of bash, python and ruby scripts.
An article (“What Vocabulary Size is Needed to Read Unsimplified Texts for Pleasure?”) recommends pre-learning words which occur frequently in a text as a good way to improve reading fluency. (As opposed to intensive reading to learn every word or pre-learning hundreds of words which occur only once or twice.) Seems like common sense, but to actually create such vocab and frequency lists is beyond the ability of most readers. Thanks to iSoron, you have it!
2010-02-14, 8:25 pm
Nice
You guys wouldn't happen to know a way to get rid of the rubi/furigana? It screws up rikai-chan.
If only there was a way to replace (like ctrl+h) the inside as well, like "《~》"
2010-02-14, 8:32 pm
Haha, just thought of a way and it works, but it's pretty ghetto. (if there's a better way, feel free to say it)
open the .html file in a text editor
press ctrl + h
Find : 《
replace : <!--
Find : 》
replace : -->
Would be even cooler if you could just make rikai-chan skip it though because occasionally readings are supposed to be something different
Edit: Doesn't work
Rikai-chan won't skip past the invisible html
Edit: You can copy all of the text in the HTML and replace it in the txt then rikai-chan will work. Kind of a pain though
Edited: 2010-02-14, 8:55 pm
2010-02-14, 8:54 pm
raseru Wrote: Nice
You guys wouldn't happen to know a way to get rid of the rubi/furigana? It screws up rikai-chan.
If only there was a way to replace (like ctrl+h) the inside as well, like "《~》"
A good editor will let you use RegEx to replace. Even the most basic ones in Linux do it. I dunno what to recommend for Windows, though. http://www.jujusoft.com/software/edit/ maybe. I haven't tried it.
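To make the regex idea concrete, here is a minimal sketch in Python. It assumes the files use the 《reading》 style quoted above, and additionally assumes a full-width ｜ may mark where the annotated word starts (common in this kind of text file, but not confirmed for these particular books). The name `strip_ruby` is just for illustration.

```python
import re

# "｜word《reading》": drop the start marker and the reading, keep the word.
MARKED_RUBY = re.compile(r"｜([^｜《》\n]*)《[^》\n]*》")
# "word《reading》": drop only the bracketed reading.
BARE_RUBY = re.compile(r"《[^》\n]*》")


def strip_ruby(text: str) -> str:
    """Remove 《...》 readings so pop-up dictionaries see uninterrupted text."""
    text = MARKED_RUBY.sub(r"\1", text)
    return BARE_RUBY.sub("", text)


if __name__ == "__main__":
    print(strip_ruby("｜吾輩《わがはい》は猫《ねこ》である"))
```

In a regex-capable editor the equivalent is to search for `《[^》]*》` and replace with nothing. Note that this throws the readings away for good, including the occasional non-standard reading the author intended, so keep the original file around.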
2010-02-14, 8:58 pm
Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
Edited: 2010-02-14, 8:58 pm
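A rough sketch of the kanji half of that idea, using only the Python standard library. It counts kanji across a list of texts and returns the most frequent ones until they cover a chosen share of all kanji occurrences. The vocabulary half would need a morphological analyzer such as the Mecab setup mentioned earlier in the thread, which is not shown here. The names `is_kanji` and `kanji_for_coverage` and the restriction to the main CJK block are assumptions for illustration.

```python
from collections import Counter


def is_kanji(ch: str) -> bool:
    # Main CJK Unified Ideographs block only; rarer extension blocks are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_for_coverage(texts, target=0.95):
    """Most frequent kanji that together reach `target` of all kanji occurrences."""
    counts = Counter(ch for text in texts for ch in text if is_kanji(ch))
    total = sum(counts.values())
    chosen, covered = [], 0
    for kanji, n in counts.most_common():
        if covered >= target * total:
            break  # coverage goal already reached
        chosen.append(kanji)
        covered += n
    return chosen


if __name__ == "__main__":
    demo = ["日日日日本本本語語人"]
    print(kanji_for_coverage(demo, target=0.8))
```

With real books, `texts` would be the contents of each .txt file read as a string, and `target` would be set somewhere in the 0.95 to 0.98 range suggested above.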
2010-02-14, 9:00 pm
nest0r Wrote: Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
Actually, I'd rather pick the 'easiest' ones and do that with them... Getting into novels is the hard part. Once you're in, the rest is cake.
I have to admit that analysis would be a lot better than the newspaper frequency list. Who really reads newspapers any more?
2010-02-14, 9:01 pm
《 doesn't seem to work in ctrl + h in that software
nest0r Wrote: Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
One thing I noticed is authors often use the same vocabulary, so if you aim for maybe different authors as well, it might be more accurate
2010-02-14, 9:02 pm
raseru Wrote: 《 doesn't seem to work in ctrl + h in that software
nest0r Wrote: Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
One thing I noticed is authors often use the same vocabulary, so if you aim for maybe different authors as well, it might be more accurate
Hmm, maybe only 2-3 volumes per author? Enough to get a good sampling of their style across works.
Edited: 2010-02-14, 9:02 pm
2010-02-14, 9:03 pm
wccrawford Wrote:
nest0r Wrote: Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
Actually, I'd rather pick the 'easiest' ones and do that with them... Getting into novels is the hard part. Once you're in, the rest is cake.
I have to admit that analysis would be a lot better than the newspaper frequency list. Who really reads newspapers any more?
I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.
2010-02-14, 9:04 pm
yeah that'd probably be good. Just saying if like 15 of the volumes are from one guy it might be a little off, haha
2010-02-14, 9:07 pm
nest0r Wrote:
wccrawford Wrote:
nest0r Wrote: Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
Actually, I'd rather pick the 'easiest' ones and do that with them... Getting into novels is the hard part. Once you're in, the rest is cake.
I have to admit that analysis would be a lot better than the newspaper frequency list. Who really reads newspapers any more?
I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.
Newspapers have really really stupid words that shouldn't exist like 日露. I saw one kanji in a newspaper before and asked 2 Japanese highschoolers to read it and they couldn't (鬨). Even with context and me telling them how to read it, they still didn't know the word
Edited: 2010-02-14, 9:10 pm
2010-02-14, 9:11 pm
nest0r Wrote:
wccrawford Wrote:
nest0r Wrote: Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
Actually, I'd rather pick the 'easiest' ones and do that with them... Getting into novels is the hard part. Once you're in, the rest is cake.
I have to admit that analysis would be a lot better than the newspaper frequency list. Who really reads newspapers any more?
I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.
That analysis posted above shows that 1 uses 1500 kanji, 1 uses 2000, and the rest use 2500. That's fairly different, with the lowest using only 3/5 the kanji of the highest.
But just like comics, there's low-end (Yotsuba&), middle (Death Note), and high-end (Read or Die).
Someone posted a nice link to a site that had a lot of easy light novels, and even had previews of them... But I can't find it now and I lost my bookmark in a freak accident.
2010-02-14, 9:12 pm
raseru Wrote:
nest0r Wrote:
wccrawford Wrote: Actually, I'd rather pick the 'easiest' ones and do that with them... Getting into novels is the hard part. Once you're in, the rest is cake.
I have to admit that analysis would be a lot better than the newspaper frequency list. Who really reads newspapers any more?
I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.
Newspapers have really really stupid words that shouldn't exist like 日露. I saw one kanji in a newspaper before and asked 2 Japanese highschoolers to read it and they couldn't (鬨)
Oh, I was referring to the 'easiest' part of his comment. Well, I think I might already have like 50 .txt files to contribute <--- That's what I would say if I were a pirate, which I'm not. So I definitely won't be posting any Google search suggestions to avoid later.
2010-02-14, 9:14 pm
http://shop.kodansha.jp/bc2_bc/search_vi...?b=1487310 - Found it. That's a pretty easy light novel. The link was in the Harry Potter thread, btw.
2010-02-14, 9:14 pm
wccrawford Wrote: That analysis posted above shows that 1 uses 1500 kanji, 1 uses 2000, and the rest use 2500. That's fairly different, with the lowest using only 3/5 the kanji of the highest.
But just like comics, there's low-end (Yotsuba&), middle (Death Note), and high-end (Read or Die).
Someone posted a nice link to a site that had a lot of easy light novels, and even had previews of them... But I can't find it now and I lost my bookmark in a freak accident.
I see. Well as long as we preserve the breakdown of each work before doing a 'meta' analysis, it shouldn't be too much of a problem.
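One way to keep the per-work breakdown alongside a combined 'meta' count could look roughly like this. It is only a sketch: it counts kanji (not vocabulary), the names `kanji_counts` and `analyse_corpus` are hypothetical, and the titles in the demo are placeholders.

```python
from collections import Counter


def kanji_counts(text: str) -> Counter:
    # Count characters in the main CJK Unified Ideographs block only.
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


def analyse_corpus(works: dict) -> tuple:
    """Return (per-work counters keyed by title, one combined counter)."""
    per_work = {title: kanji_counts(text) for title, text in works.items()}
    combined = Counter()
    for counts in per_work.values():
        combined.update(counts)  # adds counts; per-work data stays untouched
    return per_work, combined


if __name__ == "__main__":
    per_work, combined = analyse_corpus({"work A": "日本語", "work B": "日本人"})
    for title, counts in per_work.items():
        print(title, "distinct kanji:", len(counts))
    print("combined distinct kanji:", len(combined))
```

The number of distinct kanji per work, `len(per_work[title])`, is the kind of figure being compared above (1500 vs 2000 vs 2500), so an easy book and a hard one stay distinguishable even after everything is merged.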
2010-02-14, 11:08 pm
I wonder if that OCR program is even all that accurate though. Like even if it's correct 95% of the time, 5% of the time you'll just be super confused. I think some of the txt files were possibly marketed in that format and not OCR'd to begin with
I shall... experiment with it myself, when it "arrives" on my computer
Edited: 2010-02-14, 11:09 pm
2010-02-14, 11:24 pm
raseru Wrote: I wonder if that OCR program is even all that accurate though. Like even if it's correct 95% of the time, 5% of the time you'll just be super confused. I think some of the txt files were possibly marketed in that format and not OCR'd to begin with
I shall... experiment with it myself, when it "arrives" on my computer
Seems pretty accurate to me, but for my purposes it's more trouble than it's worth to correct the errors and tweak the formatting, etc. Actually I don't know, trying more samples, seems like it makes some really dumb errors. Like converts furigana to random words or something.
Edited: 2010-02-14, 11:37 pm
2010-02-15, 3:47 am
Whatever you do, don't do a Google search for 'raw light novels txt', you might turn up an alphanumeric MU code that links to a messy batch of virus-free files. Although, I bet some immoral person could use them for corpus analysis, if they knew how to do that sort of thing.
2010-02-15, 4:33 am
Holy crap that is a lot of stories, wherever did you find all of these? -- is what I would say if I was some kind of a pirate
2010-02-15, 6:07 am
raseru Wrote: Holy crap that is a lot of stories, where ever did you find all of these? -- is what I would say if I was some kind of a pirate
Which ones? Those, or the ones that come up if your hand slips and you type 'raw light novels txt batch 2'?
Edited: 2010-02-15, 6:08 am
2010-02-15, 6:23 am
... I.. Don't think I'll ever run out of txt books to read. Ever. o.o
Edited: 2010-02-15, 6:25 am
