My study method (Study pack included)


wccrawford
Member
From: FL US
Registered: 2008-03-28
Posts: 1548

nest0r wrote:

a similar experiment.

Along the same lines as that experiment, apparently if you try to install it with Wine in Linux, it runs perfectly fine.  In theory, I mean.

captal
Member
From: San Jose
Registered: 2008-03-22
Posts: 673

Thanks heaps raseru- this should be really helpful as I've been trying to up my reading quantity. I bought a bunch of books before I left Japan a couple months ago but I haven't really started tackling them. I think I'll try going through some of the books you uploaded first and see how that goes.

edit: Is there a way to make the format prettier in Firefox? Some of them work fine, but I opened the first book of Harry Potter in Firefox and it's all readable but it's mashed together without any spacing/page breaks. If I open it in Word all the breaks are there and it's much more readable, but I can't use the quick look-up of Rikaichan.

edit2: OK, figured it out: it works if I open the .txt file instead of the .html file in Firefox. Then all of the formatting is preserved and Rikaichan still works.

Last edited by captal (2010 February 14, 4:48 pm)

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

Weird, both work for me. Good thing the txt was still there then I guess
The code <pre style="white-space:pre-wrap"> is what allows you to view it in a browser without it being wacky. Before I found out how to do that I just had to suck it up, and even used an add-on that does word-wrap together with a plain <pre> lol
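For anyone who wants to make their own browser-friendly copy of a .txt, here is a rough sketch of that wrapping step in Python. It only shows the <pre style="white-space:pre-wrap"> idea described above; the function name and the rest of the page skeleton are made up for illustration and are not how the files in the pack were actually built.

```python
import html


def wrap_text_as_html(text: str, title: str = "book") -> str:
    """Return a minimal HTML page that keeps the line breaks of `text`.

    Hypothetical helper: the only part taken from the thread is the
    <pre style="white-space:pre-wrap"> wrapper.
    """
    # Escape <, > and & so the book text cannot be mistaken for markup.
    escaped = html.escape(text)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="ja">\n<head>\n<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n</head>\n<body>\n"
        # pre-wrap keeps the original line breaks but still wraps long lines.
        f'<pre style="white-space:pre-wrap">{escaped}</pre>\n'
        "</body>\n</html>\n"
    )


sample = "第一章\n\n吾輩は猫である。<名前>はまだ無い。"
page = wrap_text_as_html(sample, title="sample")
print(page)
```

Save the result as UTF-8 with a .html extension and the spacing should survive in the browser while the text stays selectable.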

Last edited by raseru (2010 February 14, 5:03 pm)

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

Ohh look, I'll need to rerun my 'simulation', as the 'Corporate' edition of the software theoretically allows for unlimited processing, rather than 50 pages max.

Thora
Member
From: Canada
Registered: 2007-02-23
Posts: 1659

Just wanted to include here something from another thread. The topic was “essential”  kanji and vocab and what might be required for various types of reading.  iSoron kindly analyzed some of raseru’s texts and provided a detailed breakdown of the kanji and the vocab in each work.

iSoron wrote:

Maybe you'll find this useful?  http://isoron.org/stuff/japanese/analysis/

For the curious, I'm using Mecab plus a mess of bash, python and ruby scripts.

An article (“What Vocabulary Size is Needed to Read Unsimplified Texts for Pleasure?”) recommends pre-learning words which occur frequently in a text as a good way to improve reading fluency. (As opposed to intensive reading to learn every word, or pre-learning hundreds of words which occur only once or twice.) Seems like common sense, but actually creating such vocab and frequency lists is beyond the ability of most readers. Thanks to iSoron, you have it!

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

Nice

You guys wouldn't happen to know a way to get rid of the rubi/furigana? It screws up rikai-chan.
If only there was a way to find and replace (like ctrl+h) whatever is inside the brackets as well, like "《~》"

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

Haha, just thought of a way and it works, but it's pretty hacky. (If there's a better way, feel free to say so.)

Open the .html file in a text editor
Press ctrl + h
Find: 《
Replace: <!--
Find: 》
Replace: -->

It would be even cooler if you could just make rikai-chan skip it, though, because occasionally the readings are supposed to be something different.

Edit: Doesn't work :( Rikai-chan won't skip past the invisible HTML.
Edit: You can copy all of the text from the HTML page and paste it over the txt; then rikai-chan will work. Kind of a pain though.

Last edited by raseru (2010 February 14, 7:55 pm)

wccrawford
Member
From: FL US
Registered: 2008-03-28
Posts: 1548

raseru wrote:

Nice

You guys wouldn't happen to know a way to get rid of the rubi/furigana? It screws up rikai-chan.
If only there was a way to find and replace (like ctrl+h) whatever is inside the brackets as well, like "《~》"

A good editor will let you use RegEx to replace.  Even the most basic ones in Linux do it.  I dunno what to recommend for Windows, though.  http://www.jujusoft.com/software/edit/ maybe.  I haven't tried it.
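If you don't have a regex-capable editor handy, a few lines of Python can do the same replace. This is only a sketch, assuming the furigana is written as 《reading》 right after the word, and that a full-width bar ｜ sometimes marks where the annotated word starts (the usual Aozora-style convention; your files may differ). The function name is made up.

```python
import re

# Ruby with an explicit start marker: ｜base《reading》 -> base
RUBY_WITH_BAR = re.compile(r"｜([^｜《》\n]+)《[^《》\n]*》")
# Plain ruby: base《reading》 -> base (only the bracketed part is removed)
RUBY = re.compile(r"《[^《》\n]*》")


def strip_ruby(text: str) -> str:
    """Remove 《...》 readings so pop-up dictionaries see unbroken words."""
    text = RUBY_WITH_BAR.sub(r"\1", text)
    return RUBY.sub("", text)


print(strip_ruby("吾輩《わがはい》は猫《ねこ》である"))
print(strip_ruby("｜山田太郎《やまだたろう》が来た"))
```

Keep a copy of the original file, since this also throws away the cases where the author gave a word a special reading.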

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.
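A rough sketch of what that analysis could look like in Python, assuming the texts have already been split into words by something like MeCab (the tokenizing step is not shown, and the function names are made up). It picks the most frequent words until they cover a target share of all the running words, then pulls out the kanji those words use.

```python
from collections import Counter


def words_for_coverage(tokens, target=0.95):
    """Return the most frequent words that together cover `target` of `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    chosen, covered = [], 0
    for word, n in counts.most_common():
        if covered / total >= target:
            break
        chosen.append(word)
        covered += n
    return chosen


def kanji_in(words):
    """Collect kanji from the basic CJK Unified Ideographs block only."""
    return {ch for w in words for ch in w if "\u4e00" <= ch <= "\u9fff"}


# Toy input standing in for a tokenized novel.
tokens = ["の", "は", "猫", "の", "猫", "の", "学校", "は", "の", "鬨"]
common = words_for_coverage(tokens, target=0.75)
print(common, kanji_in(common))
```

With a real corpus you would set target somewhere in the 0.95 to 0.98 range mentioned above.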

Last edited by nest0r (2010 February 14, 7:58 pm)

wccrawford
Member
From: FL US
Registered: 2008-03-28
Posts: 1548

nest0r wrote:

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.

Actually, I'd rather pick the 'easiest' ones and do that with them...  Getting into novels is the hard part.  Once you're in, the rest is cake.

I have to admit that analysis would be a lot better than the newspaper frequency list.  Who really reads newspapers any more?

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

《 doesn't seem to work in ctrl + h in that software

nest0r wrote:

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.

One thing I noticed is that an author tends to reuse the same vocabulary, so if you also aim for a mix of different authors, it might be more accurate

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

raseru wrote:

《 doesn't seem to work in ctrl + h in that software

nest0r wrote:

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.

One thing I noticed is that an author tends to reuse the same vocabulary, so if you also aim for a mix of different authors, it might be more accurate

Hmm, maybe only 2-3 volumes per author? Enough to get a good sampling of their style across works.

Last edited by nest0r (2010 February 14, 8:02 pm)

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

wccrawford wrote:

nest0r wrote:

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.

Actually, I'd rather pick the 'easiest' ones and do that with them...  Getting into novels is the hard part.  Once you're in, the rest is cake.

I have to admit that analysis would be a lot better than the newspaper frequency list.  Who really reads newspapers any more?

I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

yeah that'd probably be good. Just saying if like 15 of the volumes are from one guy it might be a little off, haha

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

nest0r wrote:

wccrawford wrote:

nest0r wrote:

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.

Actually, I'd rather pick the 'easiest' ones and do that with them...  Getting into novels is the hard part.  Once you're in, the rest is cake.

I have to admit that analysis would be a lot better than the newspaper frequency list.  Who really reads newspapers any more?

I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.

Newspapers have really, really stupid words that shouldn't exist, like 日露. I once saw a kanji in a newspaper and asked two Japanese high schoolers to read it, and they couldn't (鬨). Even with context and me telling them how to read it, they still didn't know the word.

Last edited by raseru (2010 February 14, 8:10 pm)

wccrawford
Member
From: FL US
Registered: 2008-03-28
Posts: 1548

nest0r wrote:

wccrawford wrote:

nest0r wrote:

Now someone needs to take 100 .txt files of light novels, run an analysis on the vocabulary, and sort the 95-98% most common vocabulary items and their kanji. Isn't that a good idea? I bet between us all, we could get a corpus of far more than 100.

Actually, I'd rather pick the 'easiest' ones and do that with them...  Getting into novels is the hard part.  Once you're in, the rest is cake.

I have to admit that analysis would be a lot better than the newspaper frequency list.  Who really reads newspapers any more?

I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.

That analysis posted above shows that 1 uses 1500 kanji, 1 uses 2000, and the rest use 2500.  That's fairly different, with the lowest using only 3/5 the kanji of the highest.

But just like comics, there's low-end (Yotsuba&), middle (Death Note), and high-end (Read or Die).

Someone posted a nice link to a site that had a lot of easy light novels, and even had previews of them...  But I can't find it now and I lost my bookmark in a freak accident.

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

raseru wrote:

nest0r wrote:

wccrawford wrote:


Actually, I'd rather pick the 'easiest' ones and do that with them...  Getting into novels is the hard part.  Once you're in, the rest is cake.

I have to admit that analysis would be a lot better than the newspaper frequency list.  Who really reads newspapers any more?

I'm not sure what you mean. Aren't all light novels in the same basic range? I'm a n00b.

Newspapers have really, really stupid words that shouldn't exist, like 日露. I once saw a kanji in a newspaper and asked two Japanese high schoolers to read it, and they couldn't (鬨)

Oh, I was referring to the 'easiest' part of his comment. Well, I think I might already have like 50 .txt files to contribute <--- That's what I would say if I were a pirate, which I'm not. So I definitely won't be posting any Google search suggestions to avoid later.

wccrawford
Member
From: FL US
Registered: 2008-03-28
Posts: 1548

http://shop.kodansha.jp/bc2_bc/search_v … ?b=1487310 - Found it.  That's a pretty easy light novel.  The link was in the Harry Potter thread, btw.

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

wccrawford wrote:

That analysis posted above shows that 1 uses 1500 kanji, 1 uses 2000, and the rest use 2500.  That's fairly different, with the lowest using only 3/5 the kanji of the highest.

But just like comics, there's low-end (Yotsuba&), middle (Death Note), and high-end (Read or Die).

Someone posted a nice link to a site that had a lot of easy light novels, and even had previews of them...  But I can't find it now and I lost my bookmark in a freak accident.

I see. Well as long as we preserve the breakdown of each work before doing a 'meta' analysis, it shouldn't be too much of a problem.
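Something like this is what I have in mind, as a sketch only: keep one word count per work, cap how many volumes any single author contributes (the 2-3 per author idea from earlier in the thread), and only then add everything up. It assumes the works are already tokenized; the function name, the tuple layout, and the sample data are all made up.

```python
from collections import Counter, defaultdict


def build_corpus_counts(works, max_per_author=3):
    """works: iterable of (author, title, tokens) tuples.

    Returns (per_work, combined): a Counter for each kept work, plus
    their sum, so the per-work breakdown survives the meta analysis.
    """
    per_work = {}
    used = defaultdict(int)
    for author, title, tokens in works:
        if used[author] >= max_per_author:
            continue  # this author already has enough volumes in the sample
        used[author] += 1
        per_work[(author, title)] = Counter(tokens)
    combined = Counter()
    for counts in per_work.values():
        combined.update(counts)
    return per_work, combined


works = [
    ("A", "vol1", ["猫", "の", "猫"]),
    ("A", "vol2", ["の", "犬"]),
    ("A", "vol3", ["鬨"]),  # dropped when max_per_author=2
    ("B", "vol1", ["の", "猫"]),
]
per_work, combined = build_corpus_counts(works, max_per_author=2)
print(sorted(per_work))
print(combined.most_common())
```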

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

I wonder if that OCR program is even all that accurate though. Like even if it's correct 95% of the time, 5% of the time you'll just be super confused. I think some of the txt files were possibly marketed in that format and not OCR'd to begin with

I shall... experiment with it myself, when it "arrives" on my computer

Last edited by raseru (2010 February 14, 10:09 pm)

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

raseru wrote:

I wonder if that OCR program is even all that accurate though. Like even if it's correct 95% of the time, 5% of the time you'll just be super confused. I think some of the txt files were possibly marketed in that format and not OCR'd to begin with

I shall... experiment with it myself, when it "arrives" on my computer

Seems pretty accurate to me, but for my purposes it's more trouble than it's worth to correct the errors, tweak the formatting, etc. Actually, I don't know: after trying more samples, it seems to make some really dumb errors, like converting furigana into random words or something.

Last edited by nest0r (2010 February 14, 10:37 pm)

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

Whatever you do, don't do a Google search for 'raw light novels txt', you might turn up an alphanumeric MU code that links to a messy batch of virus-free files. Although, I bet some immoral person could use them for corpus analysis, if they knew how to do that sort of thing.

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

Holy crap, that is a lot of stories, wherever did you find all of these? -- is what I would say if I was some kind of a pirate

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

raseru wrote:

Holy crap, that is a lot of stories, wherever did you find all of these? -- is what I would say if I was some kind of a pirate

Which ones? Those, or the ones that come up if your hand slips and you type 'raw light novels txt batch 2'?

Last edited by nest0r (2010 February 15, 5:08 am)

raseru
Member
From: california
Registered: 2007-05-23
Posts: 159

... I.. Don't think I'll ever run out of txt books to read. Ever. o.o

Last edited by raseru (2010 February 15, 5:25 am)