japanese 先生 = core 10000?

Reply #51 - 2011 January 25, 9:56 pm
overture2112 Member
From: New York Registered: 2010-05-16 Posts: 400

How I compare them:

I usually play with the data structures in a Python interpreter.  In this case I just created a list of all the expressions, calculated its length, then converted it to a set (thus removing duplicates) and calculated the length again.  This should be correct, albeit extremely strict (i.e., a sentence with an extra space or different punctuation would be considered different; it's also case sensitive).

Code:

>>> from mach4 import *
[ snipped output for brevity ]
>>> len( [ d[i]['sentExpr'] for i in d ] )
9669
>>> len(set( [ d[i]['sentExpr'] for i in d ] ))
9604
>>>
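
For anyone who wants a less strict count, here is a minimal sketch of a looser comparison. It is not what was run above: `normalize` is a hypothetical helper, and the `sample` dict is made-up stand-in data, since the real `d` comes from the `mach4` module.

```python
import unicodedata


def normalize(sentence):
    """Hypothetical helper: fold width/case variants, drop spaces and punctuation."""
    folded = unicodedata.normalize("NFKC", sentence).casefold()
    return "".join(
        ch
        for ch in folded
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


# Made-up stand-in for the real data (assumption: same 'sentExpr' field name)
sample = {
    1: {"sentExpr": "きりん の 首 は 長い。"},
    2: {"sentExpr": "きりんの首は長い"},
    3: {"sentExpr": "これはペンです。"},
}

# Strict count: any difference in spacing or punctuation makes a new sentence
strict = len(set(entry["sentExpr"] for entry in sample.values()))
# Loose count: compare only after normalizing
loose = len(set(normalize(entry["sentExpr"]) for entry in sample.values()))
print(strict, loose)
```

The gap between the two counts gives a rough idea of how many "different" sentences differ only in formatting.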

Last edited by overture2112 (2011 January 30, 12:48 am)

overture2112 Member
From: New York Registered: 2010-05-16 Posts: 400

Matthias wrote:

There are 60 sentences which are used multiple times:

Example: きりん の 首 は 長い。

Once for 麒麟 (5331 A) and twice for 首 (44 B and 44 S). That means multiple use can make sense but the 44 S could be removed.

Good catch, I didn't consider the same sentence being used for entirely different words.  What a fool I am.


How I determined the number of unique sentences:

Code:

>>> len(set( [ ( d[i]['read'], d[i]['expr'], d[i]['sentExpr'] ) for i in d ] ))
9619
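
To see which sentences are shared between different words (rather than just counting), something like the following could work. This is only a sketch: the first three entries mirror Matthias's example, the fourth is invented, and the readings and keys are assumptions about how the data is laid out.

```python
from collections import defaultdict

# Sample entries using the same field names as the session above.
# The keys and the last entry are made up for illustration.
sample = {
    "5331 A": {"expr": "麒麟", "read": "きりん", "sentExpr": "きりん の 首 は 長い。"},
    "44 B": {"expr": "首", "read": "くび", "sentExpr": "きりん の 首 は 長い。"},
    "44 S": {"expr": "首", "read": "くび", "sentExpr": "きりん の 首 は 長い。"},
    "1 A": {"expr": "犬", "read": "いぬ", "sentExpr": "犬 が 好き です。"},
}

# Group the distinct (expr, read) words under each sentence
words_by_sentence = defaultdict(set)
for key, entry in sample.items():
    words_by_sentence[entry["sentExpr"]].add((entry["expr"], entry["read"]))

# Sentences legitimately reused for more than one word
shared = {s: w for s, w in words_by_sentence.items() if len(w) > 1}

# Same counts as in the sessions above: unique sentences vs. unique triples
unique_sentences = len(words_by_sentence)
unique_triples = len(
    set((e["read"], e["expr"], e["sentExpr"]) for e in sample.values())
)
print(len(sample), unique_sentences, unique_triples)
```

Entries that collapse in the triple count (like 44 B and 44 S here) are the true duplicates; sentences in `shared` are reused on purpose for different words.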

Last edited by overture2112 (2011 January 30, 12:48 am)

Reply #53 - 2011 January 26, 1:52 am
overture2112 Member
From: New York Registered: 2010-05-16 Posts: 400

Since it seems a number of the words are marginally out of phonetic index order and the sentences' sub-order seems arbitrary, I think the best call at this point is to manually correct them.

I've made a publicly editable spreadsheet here.  I've added a few fields, but it should be easy enough to update with any existing corrections anyone has made (the PackIndex and PackName uniquely identify an entry, so just sort your spreadsheet and this one by those and copy/paste updated columns).
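
If sorting and pasting by hand gets error-prone, the same merge could be scripted after exporting both sheets. A rough sketch, assuming rows are loaded as dicts (e.g. via `csv.DictReader`); `merge_corrections` is a hypothetical helper and every column name other than PackIndex and PackName is made up:

```python
def merge_corrections(shared_rows, local_rows, columns):
    """Copy `columns` from local_rows into shared_rows, matched on (PackIndex, PackName)."""
    local_by_key = {(r["PackIndex"], r["PackName"]): r for r in local_rows}
    merged = []
    for row in shared_rows:
        updated = dict(row)  # leave the original row untouched
        local = local_by_key.get((row["PackIndex"], row["PackName"]))
        if local is not None:
            for col in columns:
                if col in local:
                    updated[col] = local[col]
        merged.append(updated)
    return merged


# Made-up rows; "Sentence" and "pack-a" are placeholders
shared_rows = [
    {"PackIndex": "1", "PackName": "pack-a", "Sentence": "old text"},
    {"PackIndex": "2", "PackName": "pack-a", "Sentence": "untouched"},
]
local_rows = [
    {"PackIndex": "1", "PackName": "pack-a", "Sentence": "corrected text"},
]

merged_rows = merge_corrections(shared_rows, local_rows, ["Sentence"])
```

Rows without a matching local entry pass through unchanged, so partial sets of corrections are safe to merge.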

The Core6k index and word/sentence audio files are based on guesses. Effectively, the word expression and reading must match exactly and in the case of multiple potential matches in Core, it first tries the first one that has exactly the same meaning, then tries the first one whose meaning contains one of the entry's meanings, and failing that just grabs the first match (ignoring meaning).
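
Roughly, the fallback order described above looks like this. This is a reconstruction for illustration, not the actual matching code: `match_core`, the `meaning` field name, and the assumption that meanings are comma-separated are all guesses.

```python
def match_core(entry, core_entries):
    """Return the best-guess Core entry for `entry`, or None if nothing matches."""
    # Expression and reading must match exactly
    candidates = [
        c
        for c in core_entries
        if c["expr"] == entry["expr"] and c["read"] == entry["read"]
    ]
    if not candidates:
        return None

    # 1. First candidate with exactly the same meaning
    for c in candidates:
        if c["meaning"] == entry["meaning"]:
            return c

    # 2. First candidate whose meaning contains one of the entry's meanings
    #    (assumption: meanings are separated by commas)
    entry_meanings = [m.strip() for m in entry["meaning"].split(",") if m.strip()]
    for c in candidates:
        if any(m in c["meaning"] for m in entry_meanings):
            return c

    # 3. Otherwise take the first match, ignoring meaning
    return candidates[0]


# Made-up Core entries for illustration
core = [
    {"id": 1, "expr": "首", "read": "くび", "meaning": "head, neck"},
    {"id": 2, "expr": "首", "read": "くび", "meaning": "neck"},
]

print(match_core({"expr": "首", "read": "くび", "meaning": "neck"}, core)["id"])
```

Because step 3 ignores meaning entirely, matches that fall through to it are the ones most worth checking by hand.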

If anyone can think of useful fields to add or has a better suggestion on how to collaboratively correct the issues, please suggest them.

Edit: My matching code mentioned above identifies 5792/9669 as in core6k and 3877/9669 as new (with 4 possible duplicates; same expression, reading, and English meaning).

Last edited by overture2112 (2011 January 26, 2:21 am)

brianobush Member
From: Portland Registered: 2008-06-28 Posts: 241 Website

If anyone has the sound files readily available, I can assist with the corrections.


EDIT: I have the files now and have started working on the non-core6k sentences.

Last edited by brianobush (2011 February 12, 8:55 pm)

Irixmark Member
From: 加奈陀 Registered: 2005-12-04 Posts: 291

I've also started doing a few. How do we match the leftover sound files to sentences... transcribe and search the spreadsheet?

And can someone tell me that the guy in js01115a and js01164a really is a native speaker?

Last edited by Irixmark (2011 February 13, 8:54 am)

radical_tyro Member
Registered: 2005-11-19 Posts: 272

Irixmark wrote:

And can someone tell me that the guy in js01115a and js01164a really is a native speaker?

my non-native ear says no, but not bad.

brianobush Member
From: Portland Registered: 2008-06-28 Posts: 241 Website

Irixmark wrote:

And can someone tell me that the guy in js01115a and js01164a really is a native speaker?

Native-speaking wife says no.

Irixmark Member
From: 加奈陀 Registered: 2005-12-04 Posts: 291

やっぱり。Thanks. The 'sh' is just too Western.
The other speakers sound pretty good.

I'm working my way down the spreadsheet. For the most part it's perfectly matched, but now and again some are randomly switched.

Last edited by Irixmark (2011 February 15, 4:18 pm)

Irixmark Member
From: 加奈陀 Registered: 2005-12-04 Posts: 291

Plodding away, but with only brianobush and me working on it, we'll take ages... anyone care to help out with a few sentences/audio matches?

I've dragged the sentence audio files into iTunes and just play them in order (left half of screen) while inserting a "Y" for every match. Works fine, but it's tiring.

brianobush Member
From: Portland Registered: 2008-06-28 Posts: 241 Website

Irixmark wrote:

I've dragged the sentence audio files into iTunes and just play them in order (left half of screen) while inserting a "Y" for every match. Works fine, but it's tiring.

Well, I am glad that I am not the only one doing this :)
You have me until we get all the -1's done, then I will most likely lose interest.

Irixmark Member
From: 加奈陀 Registered: 2005-12-04 Posts: 291

I wasn't planning to do more than those either... if I understand correctly, we have the other sentences from Core2k6k anyway?

Reply #62 - 2011 March 04, 12:17 am
notnestor New member
From: USA Registered: 2011-03-03 Posts: 1

A while ago someone wrote "nice work. thanks so much for putting in all this effort. It's definitely at a usable stage right now." You certainly could argue that it's not really important to put in the work to tidy it up. If you think so and are ready to use your skills on the next useful and similar project, what do you think of the audio sentences in

http://download.cnet.com/iKIC-Lite-Kanj … 55197.html
They are more complicated, but past the 10,000 mark that is a good thing.

Reply #63 - 2011 March 07, 9:47 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Did someone already find a way to match it all? I noticed this deck shared on Anki. (Japanese Core 10000 or something.)

Reply #64 - 2011 March 07, 11:37 pm
brianobush Member
From: Portland Registered: 2008-06-28 Posts: 241 Website

nest0r wrote:

Did someone already find a way to match it all? I noticed this deck shared on Anki. (Japanese Core 10000 or something.)

I just took a peek and it is the complete set; however, we haven't validated all the sentences. There are errors every once in a while, probably about 1 in 100. I have been going through 50 or so every other day and it is taking longer than I thought (I think only a couple of us are looking at these).