Sentences galore?

cloudstrife543 Member
From: tallahassee Registered: 2008-10-26 Posts: 82

http://www.mahou.org/Kanji/

Try out this site if you are sentence mining, especially if you want to see kanji used in context. I just found it and figured I'd share it and get others' opinions. Type in a kanji and hit search, and it provides various example sentences along with the definition, stroke order, and so on.

I know people are often concerned that certain sentence-mining sources aren't 'natural Japanese', which is another reason I posted it here: I'd like your opinions once you've checked it out.

Thank you for any comments!

Smackle Member
Registered: 2008-01-16 Posts: 463

They are all from the Tanaka Corpus, which is sometimes unnatural Japanese.

GoodSirJava Member
From: USA Registered: 2006-07-17 Posts: 38

The proper term for this, in case it interests anyone, is KWIC (Key Word in Context), and it is used by lexicographers to construct dictionaries.
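For anyone curious what a KWIC lookup amounts to, here is a minimal sketch in Python. It assumes a plain list of sentences and a simple substring match; the function name `kwic` and the `width` parameter are made up for illustration, and this is not how mahou.org or WWWJDIC actually implement their search.

```python
def kwic(sentences, keyword, width=10):
    """Return (left, keyword, right) context triples for each occurrence of keyword."""
    hits = []
    if not keyword:
        return hits  # an empty keyword would match everywhere, so skip it
    for sentence in sentences:
        start = sentence.find(keyword)
        while start != -1:
            end = start + len(keyword)
            # Keep up to `width` characters of context on each side.
            left = sentence[max(0, start - width):start]
            right = sentence[end:end + width]
            hits.append((left, keyword, right))
            start = sentence.find(keyword, end)
    return hits


# Made-up sample sentences for illustration, not taken from any corpus.
samples = ["猫が好きです。", "私は猫を飼っています。"]
for left, key, right in kwic(samples, "猫", width=5):
    print(f"{left} [{key}] {right}")
```

Matching on raw characters works for a single kanji because no word segmentation is needed; a real concordancer would probably tokenize first.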

Last edited by GoodSirJava (2009 February 02, 8:15 am)

Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

If it's the Tanaka corpus, I wouldn't use it. Tons of the sentences are unnatural and many of them aren't even correct.

howtwosavealif3 Member
From: USA Registered: 2008-02-09 Posts: 889 Website

Galore? It has anime all over it. Pretty obvious it's not very good.

cloudstrife543 Member
From: tallahassee Registered: 2008-10-26 Posts: 82

Why does having anime on it make it bad? Wouldn't you sometimes get sentences from mining your favorite anime?

cloudstrife543 Member
From: tallahassee Registered: 2008-10-26 Posts: 82

Smackle wrote:

They are all from the Tanaka Corpus, which is sometimes unnatural Japanese.

Where does it tell you this information?

woodwojr Member
From: Boston Registered: 2008-05-02 Posts: 530

Dictionary Information:

#12: Japanese-English sentence pairs: "This is the big file of matched Japanese-English sentence pairs from the Tanaka corpus. It is the file as used by the WWWJDIC server."

That doesn't conclusively indicate that all of the site's sentence pairs are from the Tanaka corpus, but the site certainly includes it.

~J

Last edited by woodwojr (2009 February 02, 11:34 am)

cloudstrife543 Member
From: tallahassee Registered: 2008-10-26 Posts: 82

Smackle wrote:

They are all from the Tanaka Corpus, which is sometimes unnatural Japanese.

I found where it says that. It is the file loaded onto the WWWJDIC server, though, which they say is updated regularly to fix the sentences that were messed up, and they caution that some might still be messed up. So... I guess it's not 100% reliable.

Delina Member
From: US Registered: 2008-02-12 Posts: 102

I use mahou a lot, but never for sentence-mining. It's convenient in that it has a great cross-reference for the kanji, including its frame number in Heisig and a bunch of other dictionaries and learning resources (KO2001, etc.). Sometimes I'll use the example sentences for reference when I'm trying to write something, but because they are translated from English by American college students, I do not use them as examples of 'real' Japanese for my SRS. (If you want sentences galore, Tanuki seems to be a better bet, but does not provide English translation.)

Here is an excerpt describing the origin of these sentences:

"From inspection, it appears that many of the sentence pairs have been derived from textbooks, e.g. books used by Japanese students of English."

For the full description, see the link below:

http://www.csse.monash.edu.au/~jwb/tanakacorpus.html

You are correct that it has been modified for use in WWWJDIC, but note that the bar for removing 'bad' sentences is set pretty low: only grammatically "wrong" sentences have been removed, with no regard to whether they are "natural" Japanese sentences. Generally they are heavy on pronouns and read like something out of an old textbook:

"As described below, the Tanaka Corpus has been edited and adapted to be used within the WWWJDIC dictionary server as a set of example sentences associated with words in the dictionary. In order to adapt the corpus for this role, it has been edited as follows:

   1. an initial regularization of the punctuation of the Japanese and English sentences was carried out, then duplicate pairs were removed, reducing the original file from 210,000 pairs to 180,000 pairs;
   2. sentences which differed only by differences in orthography (e.g. kana/kanji usage, okurigana differences), numbers, proper names, minor grammatical points such as plain/polite verb usage, etc. were reduced to single representative examples;
   3. sentences where the Japanese consisted of a short Japanese statement in kana were removed;
   4. sentences with spelling errors, kana-kanji conversion errors, etc. were corrected;
   5. sentences where the English version did not match the Japanese were edited to make the two versions agree;
   6. where the sentences contain gender-specific language or words, the English portion has been tagged with [M] or [F] respectively;
   7. sentences where the Japanese was too garbled to derive a valid English equivalent were removed."
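To make step 1 of that list concrete, here is a rough Python sketch of "regularize, then drop duplicate pairs". It is only an illustration: the actual WWWJDIC editing procedure is not described in enough detail to reproduce, so the choice of Unicode NFKC folding plus whitespace cleanup as the "regularization" is my assumption, and the names `normalize` and `dedupe_pairs` are hypothetical.

```python
import unicodedata


def normalize(text):
    # Assumption: "regularization" is approximated by NFKC folding
    # (e.g. full-width "！" becomes "!") plus collapsing whitespace.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def dedupe_pairs(pairs):
    """Keep the first occurrence of each normalized (Japanese, English) pair."""
    seen = set()
    unique = []
    for ja, en in pairs:
        key = (normalize(ja), normalize(en))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Made-up sample pairs for illustration, not taken from the corpus.
pairs = [
    ("猫が好きです。", "I like cats."),
    ("猫が好きです。 ", "I  like cats."),  # same pair, stray whitespace
    ("犬が好きです。", "I like dogs."),
]
print(len(dedupe_pairs(pairs)))
```

Note that this only catches exact duplicates after normalization. The near-duplicate merging in step 2 (kana/kanji variants, plain/polite forms) would need real linguistic processing, and nothing here addresses whether a sentence sounds natural.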