kanji koohii FORUM
Source of coreXk? - Printable Version




Source of coreXk? - Ricky - 2014-05-17

Hi,

I found out about coreXk about three weeks ago. It's amazing.
But it seems to be available only as an Anki deck, or scattered across several places.

So is there a primary source or main database somewhere that contains and maintains all the images, audio files and words together?


Greets


Source of coreXk? - mc962 - 2014-05-17

I don't know about the images and audio files, but I believe Nukemarine's thread has an Excel spreadsheet of all the words at the top.

Although I'm not really sure why you'd care about finding the images and audio files separately.


Source of coreXk? - NinKenDo - 2014-05-17

They come from iKnow.co.jp/Smart.fm/iKnow.jp back when they licensed the official material under a CC license.


Source of coreXk? - lauri_ranta - 2014-05-21

An older version of the Core 6000 data from the iKnow/Smart.fm API:

https://sites.google.com/site/ankinihongo/home/kore

Newer versions of the Core 6000 data based on JSON files downloaded from iknow.jp:

http://forum.koohii.com/showthread.php?tid=9762
http://jptxt.net/core-6000.txt

"Core 10k" data extracted from the Japanese Sensei Deluxe iPhone app:

http://forum.koohii.com/showthread.php?tid=7104


Source of coreXk? - Ricky - 2014-05-24

(edit: thanks for the answers! Helpful!)

As I found a number of sentences with mistakes or unnatural Japanese, it would be nice if a community could fix things immediately instead of having a static pack of everything.

Is there any plan to make it possible to generate the core files from a bigger, accessible database? (For example, applying the Core word order to a database like tatoeba.org.)

greetings


Source of coreXk? - rich_f - 2014-05-24

You're going to find all kinds of unnatural Japanese on Tatoeba, as well as in EDICT's sample sentences, because both rely on the Tanaka Corpus, which is notorious for containing a lot of weird/unnatural stuff.

You could just take the word list from Core Xk, then run it through EPWING2Anki to get as many sentences as you want. It worked pretty well for me. The downside is that you lose the pictures and the sound, but it's quick and dirty.
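
A rough sketch of the "word list in, sentences out" idea, assuming you already have the sentence pairs as plain (Japanese, English) tuples. This is not how EPWING2Anki works internally; `pick_sentences`, `per_word`, and the sample data are made up for illustration, and plain substring matching will miss inflected forms (食べる will not match 食べました).

```python
def pick_sentences(word_list, sentence_pairs, per_word=2):
    """Map each word, in list order, to up to `per_word` pairs containing it."""
    picked = {}
    for word in word_list:
        # Naive substring test on the Japanese side of each pair (an assumption).
        hits = [pair for pair in sentence_pairs if word in pair[0]]
        picked[word] = hits[:per_word]
    return picked


# Hypothetical sample data, not taken from Core or any dictionary.
words = ["学校", "映画"]
pairs = [
    ("学校に行きます。", "I go to school."),
    ("映画を見た。", "I watched a movie."),
    ("学校は休みです。", "School is closed."),
]

for word, hits in pick_sentences(words, pairs).items():
    print(word, hits)
```

Because the result keeps the order of the word list, the frequency ordering of the original deck carries over to the generated cards.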


Source of coreXk? - learningkanji - 2014-05-24

Does core2k/6k have any/a lot of unnatural Japanese?


Source of coreXk? - NinKenDo - 2014-05-25

learningkanji Wrote:Does core2k/6k have any/a lot of unnatural Japanese?
Not relatively speaking. But yes, there is unnatural Japanese. That is to be expected in a learning material. Plenty of time to correct to more natural Japanese when you're high-intermediate to advanced. Don't sweat it too much.


Source of coreXk? - learningkanji - 2014-05-25

What would you say high-intermediate to advanced is? I'm at a little over 3000 words in core2k/6k.


Source of coreXk? - yudantaiteki - 2014-05-25

One thing to remember is that in Japanese, as with any language, you can't just take a sentence and label it "natural" or "unnatural." Sometimes it depends on the context, and there's a lot of regional and individual variation among native speakers so that some speakers might accept a sentence as natural while others don't. Also when you're specifically asked to identify a sentence as natural or unnatural, you might think it's unnatural when you may not have even noticed if a native speaker said it.

Furthermore, if a sentence is unnatural in 2014 but would have been natural in 1982 (or 1955), that doesn't necessarily mean the sentence is bad depending on exactly how the sentence is presented and what the purpose of studying it is.


Source of coreXk? - Vempele - 2014-05-25

yudantaiteki Wrote:Furthermore, if a sentence is unnatural in 2014 but would have been natural in 1982 (or 1955), that doesn't necessarily mean the sentence is bad depending on exactly how the sentence is presented and what the purpose of studying it is.
This reminds me that 日ソ (Japanese-Soviet) is in Core. I wonder how they managed to get that and インターネット in the same corpus...


Source of coreXk? - Inny Jan - 2014-05-25

yudantaiteki Wrote:One thing to remember is that in Japanese, as with any language, you can't just take a sentence and label it "natural" or "unnatural." [...]
yudantaiteki's comment reminded me of this review of "Dictionary of Basic Japanese Grammar":
Quote:That little incident made me start asking my Japanese friends about stuff I had learnt from this book, and a lot of the entries they told, they never used in normal conversation.[...]

So please be careful to use this great but dangerous book!
There were two questions (and answers) that immediately came to my mind after reading that review:
1) Is Japanese in DoBJG (at times) unnatural?
Obviously, it has been recognised as such.

2) Does it mean that DoBJG is worthless?
That couldn't be farther from the truth.


Source of coreXk? - codex - 2014-05-25

Curious about the source of the Core sentences, and wondering what the Tanaka Corpus is, I came up with this:

The Tanaka Corpus (212,000 sentence pairs) was published in 2001. It was compiled by Prof. Tanaka's university students as class assignments. The material is riddled with errors of all kinds, ranging from transcription errors to "pairs" of completely unrelated sentences.

Jim Breen edited it for duplications, spelling errors, and mismatched sentence "pairs", and incorporated the resulting 180,000 sentence pairs into JDIC as example sentences in 2003. The editing is ongoing and is now down to about 150,000 sentence pairs.

In 2006 the Corpus was incorporated into the Tatoeba project, where it is being further edited and corrected online by volunteers of varying skill levels.

----------

The previous posts in this thread seem to imply that some part (10,000 sentence pairs) of the Corpus was incorporated into the iKnow.com websites, and the Core series was eventually extracted from that. Does anybody know if that's true, and if so, how the sentence pairs were selected and whether they were edited or corrected in any way?


Source of coreXk? - rich_f - 2014-05-26

Derp. This was discussed three years ago:

http://forum.koohii.com/showthread.php?pid=150258#pid150258

In rising order of my personal preference:

Tanaka/Tatoeba/a hojillion apps built off of Tanaka: not terrible, but there are so many better alternatives, it's not worth it. I don't like having to question every sentence: "Is this one okay? How about this one?" It's free, but that isn't always a good thing.

EIJIRO: Professionally edited. $20 or so for a copy of the latest rev. of the database. You can roll your own EPWING dictionary (but it's a pain in the butt). Also, there are an insane number of entries (almost TOO many.) It's good for the lost cause words you can't find anywhere else, but it's hard to manage. You can just search the text file with a good text reader, but results get messy. alc.co.jp has a limited version for free, or get web access to all of it for 324 yen/month.

dic.Yahoo.co.jp: Uses Shogakukan's Progressive JP->EN dictionaries. Same dictionary is in the Kindle Paperwhite. It's a good dictionary, with good sentences. I used this one a lot.

Kenkyusha: Works great if you can find an EPWING version to run with EPWING2Anki. I dropped $120 to get a version of this on my 電子辞書. That's how much I love this thing.

Books/Newspapers: The best way to come across natural, native Japanese. Reading books is a lot more fun than doing flashcards.


Source of coreXk? - Vempele - 2014-05-26

codex Wrote:The previous posts in this thread seem to imply that some part (10,000 sentence pairs) of the Corpus was incorporated into the iKnow.com websites, and the Core series was eventually extracted from that.
I don't see anyone saying anything like that in this thread. They only said that tatoeba.org and EDICT use the Tanaka Corpus, in response to the idea of using tatoeba instead of the Core sentences (which was in response to claims that Core contained unnatural sentences).


Source of coreXk? - lauri_ranta - 2014-05-26

codex Wrote:The previous posts in this thread seem to imply that some part (10,000 sentence pairs) of the Corpus was incorporated into the iKnow.com websites, and the Core series was eventually extracted from that. Does anybody know if that's true, and if so, how the sentence pairs were selected and whether they were edited or corrected in any way?
Not true. I don't know what the source of the Core 6000 sentences is, but it is not the Tanaka/Tatoeba corpus. Also, there have always been just 6000 entries (or pairs of words and sentences, even though some entries actually have multiple sentences).

"Core 10000" is what people from this forum started calling the data that overture2112 extracted from the Japanese Sensei Deluxe iPhone app. There are actually only 9619 pairs of words and sentences. When I compared the Japanese Sensei Deluxe spreadsheet with JSON files downloaded from iknow.jp, they included exactly the same versions of 3739 sentences and 5544 words. About 1500 more sentences had only small differences. There have been some changes to the Core 6000 data over time, but I don't know if the data from Japanese Sensei Deluxe has also been changed from whatever the original source was.

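For anyone who wants to repeat that kind of comparison, here is a minimal sketch, assuming both data sets have already been loaded into lists of dicts with `word` and `sentence` keys. The function name, the key names, the 0.8 similarity threshold, and the sample entries are all assumptions for illustration; this is not the script that produced the numbers above.

```python
from difflib import SequenceMatcher


def compare_decks(deck_a, deck_b, threshold=0.8):
    """Count exact and near-identical overlaps between two word/sentence lists."""
    words_a = {e["word"] for e in deck_a}
    words_b = {e["word"] for e in deck_b}
    sents_a = {e["sentence"] for e in deck_a}
    sents_b = {e["sentence"] for e in deck_b}

    # Sentences in A with no exact twin in B but a close one (small differences).
    near = 0
    for s in sents_a - sents_b:
        if any(SequenceMatcher(None, s, t).ratio() >= threshold for t in sents_b):
            near += 1

    return {
        "entries": len(deck_a),
        "unique_sentences": len(sents_a),
        "same_words": len(words_a & words_b),
        "same_sentences": len(sents_a & sents_b),
        "near_sentences": near,
    }


# Hypothetical sample entries, not real Core or Japanese Sensei data.
deck_a = [
    {"word": "食べる", "sentence": "朝ご飯を食べました。"},
    {"word": "行く", "sentence": "学校に行きます。"},
    {"word": "見る", "sentence": "映画を見た。"},
]
deck_b = [
    {"word": "食べる", "sentence": "朝ご飯を食べました。"},
    {"word": "行く", "sentence": "学校へ行きます。"},
]

print(compare_decks(deck_a, deck_b))
```

The near-match loop compares every unmatched sentence against every sentence in the other set, so on thousands of entries it will be slow; it is only meant to show the idea.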

Source of coreXk? - Thora - 2014-05-26

I recall from past threads that the Core 6000 sentences were licensed by iknow from Jack Halpern's CJK Dictionary Institute - http://www.cjk.org/cjk/index.htm

As lauri_ranta explained, JSensei apparently used almost the same 6000 plus an additional 4000 [edit: don't actually know the source of the 4000. Does the JSensei app not mention the source of its content?]

I think someone mentioned that the Core audio is better than the JSensei audio, so they apparently arranged their own recordings.

iKnow is a company formed by 2 former foreign investment bankers in Tokyo who found themselves needing to find new careers. They teamed up with a foreign English teacher and marketed him as some kind of revolutionary pedagogical thinker. (You can read a description of the ideas and 'technology' on the site and decide for yourselves how scientifically significant or innovative you think it is.)

Core is a vocabulary building tool. One tool of several which can facilitate learning from authentic reading/listening. It wasn't meant as a complete language learning method. You can also use other techniques better suited to develop grammar, listening comprehension, reading comprehension, kanji knowledge, etc.

"Unnatural" is a bit vague. If the suggestion is that one should ONLY learn from exposure to authentic native language, I don't believe there's any justification for that.


Source of coreXk? - codex - 2014-05-27

lauri_ranta and Thora, thank you for the clarification, and the interesting stories behind the Core decks and iKnow. I do like to know the source of the materials I'm using.


Source of coreXk? - lauri_ranta - 2014-05-27

Thora Wrote:I recall from past threads that the Core 6000 sentences were licensed by iknow from Jack Halpern's CJK Dictionary Institute - http://www.cjk.org/cjk/index.htm

As lauri_ranta explained, JSensei apparently used almost the same 6000 plus an additional 4000 [edit: don't actually know the source of the 4000. Does the JSensei app not mention the source of its content?]
The App Store page of Japanese Sensei says:

"This application has not been developed using free dictionary data available on the web, but rather has been built using quality dictionary data produced by Jack Halpern's CJK Dictionary Institute."

I don't know if that means that they (or iKnow/Smart.fm) got the Core-style sentences from the CJK Dictionary Institute though.

Japanese Sensei (or at least the Google Docs spreadsheet uploaded by overture2112) only has 9619 entries and 9604 unique sentences. It might be that iKnow/Smart.fm or Japanese Sensei or both only include a part of the sentences from the original source.

Also I'm not sure if the Core 6000 data was ever available with a Creative Commons license. http://web.archive.org/web/20090227183555/http://developer.iknow.co.jp/docs says that "Content on developer.iknow.co.jp is licensed under the Creative Commons Attribution 3.0 License" but that might not apply to the Core 6000 data.


Source of coreXk? - toshiromiballza - 2014-05-27

lauri_ranta Wrote:Also I'm not sure if the Core 6000 data was ever available with a Creative Commons license.
It was:
Quote:Thank you for contacting us with your content question.

A few years ago, we did offer access to Cerego content via an API, but we ceased offering this service and we no longer offer our content with a creative commons attribution license [...]

Thank you,

iKnow! Support Team

セレゴ・ジャパン株式会社
〒150-0031 東京都渋谷区桜丘町12-10 渋谷インフォスアネックス 9F

Cerego Japan Inc.
12-10 Sakuragaoka-cho, Shibuya Infoss Annex 9F
Shibuya-ku, 150-0031, Japan
They stopped offering the content under CC, but since CC licenses are not revocable, it makes all those Anki decks, etc. legal to use.