
japanese 先生 = core 10000?

#1
So this Japanese Sensei app has audio for 4000 more sentences from the same source as smart.fm? (The CJK Institute or whatnot.) This sounds great, but it's a shame it's only for iDevices; I'd feel weird having such a huge, great resource limited that way. Is there no Windows version or equivalent (i.e. other people who've used the same source to produce audio...)? I had no idea this "Core 10,000" was in existence even for iOS devices. I'm still new to that stuff. Is there no way to like put the app material somewhere that isn't an iDevice?
Edited: 2011-01-20, 3:38 am
#2
I think I tried this app before. I don't think it's SRS-based, which doesn't make it very useful; if I recall correctly, it would ask multiple-choice questions and put the same answer twice in the list. (Granted, it was the free version, which has less content.)

There might be a way to rip the content from it, but I'm not entirely sure; it would probably involve some shady hacking expertise.
#3
I use this app on my iPod touch.
It is the same content as core 4k+6k, which is licensed from the CJK Institute.
They added their own voice readings (from 3-4 voice actors).
I think it does some form of SRS, and it does keep track of the number of times seen and times correctly chosen.

If you are looking for a real SRS, "Japanese" has an SRS system with selectable JLPT vocabulary. However, since you wish it were on other platforms, I am assuming you use Android, where neither of these apps is available.
#4
brianobush Wrote:I use this app on my iPod touch.
It is the same content as core 4k+6k, which is licensed from the CJK Institute.
They added their own voice readings (from 3-4 voice actors).
I think it does some form of SRS, and it does keep track of the number of times seen and times correctly chosen.

If you are looking for a real SRS, "Japanese" has an SRS system with selectable JLPT vocabulary. However, since you wish it were on other platforms, I am assuming you use Android, where neither of these apps is available.
I mean that the extra sentences with audio, which are effectively a Core 10,000 (since smart.fm has audio for 6000 of the CJK sentences and this adds 4000 more), are a shame to be limited to just the iDevices, and I would feel weird using it on my own, as it's such a huge resource and I'd be stuck using it on one small device. As Daichi suggested, I think it'd be great if there were a way to export these sentences plus audio, i.e. text + mp3/etc., for use on whatever machine one wishes. The other stuff doesn't matter to me (this goes for all app study systems).
#5
brianobush Wrote:I use this app on my iPod touch.
It is the same content as core 4k+6k, which is licensed from the CJK Institute.
I've never heard of "core 4k" before. Did you mean that figuratively, or is there an actual core 4k?

nest0r Wrote:I'm still new to that stuff. Is there no way to like put the app material somewhere that isn't an iDevice?
Do you have the actual app? If so, what do you think of the quality of the list? I might try extracting the data but I want to make sure it's worthwhile before spending the time+money on it.
#6
overture2112 Wrote:I've never heard of "core 4k" before. Did you mean that figuratively, or is there an actual core 4k?
Oops, I meant core 2k instead of 4k.

nest0r Wrote:Do you have the actual app? If so, what do you think of the quality of the list? I might try extracting the data but I want to make sure it's worthwhile before spending the time+money on it.
Yes, I have it and use it daily. The sentences I have seen thus far are exactly the same as core 2k - what makes it different for me is the presentation and in-app "games."

There are three levels in the game: free, beginner and advanced. I only have beginner thus far. The free app has everything included; the levels are unlocked as you pay.

In summary, the app itself is well done and content is presented nicely. The only issue is the content is mostly what I have already seen.
#7
overture2112 Wrote:
brianobush Wrote:I use this app on my iPod touch.
It is the same content as core 4k+6k, which is licensed from the CJK Institute.
I've never heard of "core 4k" before. Did you mean that figuratively, or is there an actual core 4k?

nest0r Wrote:I'm still new to that stuff. Is there no way to like put the app material somewhere that isn't an iDevice?
Do you have the actual app? If so, what do you think of the quality of the list? I might try extracting the data but I want to make sure it's worthwhile before spending the time+money on it.
No, I haven't gotten it yet, so I don't know about the audio quality or whether there are actually 4000 sentences not in smart.fm. I can't bring myself to buy it knowing I'll always be looking at it and frowning, wishing I had an Anki deck I can use as a corpus on my other machines, like we have for other resources.
Edited: 2011-01-21, 2:14 pm
#8
nest0r Wrote:No, I haven't gotten it yet, so I don't know about the audio quality or whether there are actually 4000 sentences not in smart.fm. I can't bring myself to buy it knowing I'll always be looking at it and frowning, wishing I had an Anki deck I can use as a corpus on my other machines, like we have for other resources.
So I downloaded the lite version and inspected the files. It's about 378MB in size, so I assume the lite version has all the data as suggested above. I'm still figuring out how it's storing everything (just started looking at it), so maybe I've missed some stuff or made some mistake, but thus far:

----
The A1.dat to A10.dat and B1.dat to B10.dat files seem to have the words, from which I can extract the expression, reading, and an internal index. I've extracted 9619 words with about 200 duplicates (maybe for multiple meanings with the same reading??). Doing a simple comparison against the expression field, 3568 (of 9619) are not in Kore and 20 (of 6000) items from Kore are not in this.
----
sentences.idx and headwords.idx are some sort of mapping of sentences/words to the audio filename (stored in mpeg4 .m4a files). The names are js00002a.m4a through js10089a.m4a for sentences and jw00002a.m4a through jw10089a.m4a for words. There's additional data in these files but I'm not sure how to decode it yet.
----
sentences.dict and headwords.dict are some sort of archive storing all the audio files in base64.
----
JIB_ej{A,B,S}.{idx,dict} contains all the English meanings (and some wacky indexing method) and it would appear other JIB_* files have data for various other languages (eg, Chinese, Korean).


Either way, it certainly looks like extracting the data and correlating it all isn't unreasonably difficult (no encryption) [EDIT: actually, it appears some files are encrypted or, more likely, compressed in some way]; it's just a question of whether figuring out all the specifics and gluing it together is worth the effort.
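
If anyone wants to redo that comparison, here's a minimal sketch; it assumes the expressions have already been dumped into two plain-text files, one per line (both filenames are placeholders):

# Rough re-run of the overlap check above. Assumes two plain-text files,
# one expression per line (filenames are placeholders):
#   sensei_words.txt - expressions pulled from the A*.dat / B*.dat files
#   kore_words.txt   - expressions from the Kore / Core 6000 deck

def load_words(path):
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

sensei = load_words('sensei_words.txt')
kore = load_words('kore_words.txt')

print('Sensei words:', len(sensei))
print('Kore words:', len(kore))
print('in Sensei but not in Kore:', len(sensei - kore))
print('in Kore but not in Sensei:', len(kore - sensei))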


brianobush, could you explain what exactly you meant about the unlocking as you pay? And what exactly is the difference between the "lite" (aka free) version and the deluxe one for $16?


EDIT: Ah. So the free/lite version comes with some lessons unlocked, the beginner brings it to 1750 words for $6, advanced adds 8000 words for $10, and deluxe unlocks all 9750 for $16.
Edited: 2011-01-21, 5:27 pm
#9
overture2112 Wrote:----
The A1.dat to A10.dat and B1.dat to B10.dat files seem to have the words, from which I can extract the expression, reading, and an internal index. I've extracted 9619 words with about 200 duplicates (maybe for multiple meanings with the same reading??). Doing a simple comparison against the expression field, 3568 (of 9619) are not in Kore and 20 (of 6000) items from Kore are not in this.
----
sentences.idx and headwords.idx are some sort of mapping of sentences/words to the audio filename (stored in mpeg4 .m4a files). The names are js00002a.m4a through js10089a.m4a for sentences and jw00002a.m4a through jw10089a.m4a for words. There's additional data in these files but I'm not sure how to decode it yet.
----
sentences.dict and headwords.dict are some sort of archive storing all the audio files in base64.
----
JIB_ej{A,B,S}.{idx,dict} contains all the English meanings (and some wacky indexing method) and it would appear other JIB_* files have data for various other languages (eg, Chinese, Korean).


Either way, it certainly looks like extracting the data and correlating it all isn't unreasonably difficult (no encryption); it's just a question of whether figuring out all the specifics and gluing it together is worth the effort.
I'm confused; I spent quite a while trying to extract meaningful data from those files in the ipa, and I swear there was nothing in plaintext in there. Anyway, great news that the data is accessible. I managed to get some MP3s, but not all, using filejuicer on the idx files. Glad to have someone competent trying this!

If you can match up the mp3 names with the appropriate words and sentences, it would make a perfect addition to the Core decks for Anki. I've been tagging items in the Core deck with the J Sensei lesson numbers as a way to study J Sensei in Anki. If you could generate a new deck from what you extracted, that would be awesome, especially if it had lesson numbers tagged and there was a tag for cards which aren't in Core.
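
For the deck side, something like this rough sketch would do it (not overture2112's actual script; the item list, lesson numbering and filenames are made up for illustration) - write a tab-separated file with a tags column that Anki's importer can map to Tags:

# Hypothetical sketch: build a tab-separated file for Anki import, with a
# tags column (map it to "Tags" in the import dialog). `items` and
# `kore_expressions` stand in for the real extracted data.
import csv

items = [
    # (expression, reading, meaning, lesson_number) - made-up examples
    ('先生', 'せんせい', 'teacher', 1),
    ('学校', 'がっこう', 'school', 2),
]
kore_expressions = {'先生'}  # placeholder for the real Core/Kore word list

with open('jsensei_deck.tsv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    for expr, reading, meaning, lesson in items:
        tags = ['jsensei_lesson_%02d' % lesson]
        if expr not in kore_expressions:
            tags.append('not_in_core')  # tag for cards that aren't in Core
        writer.writerow([expr, reading, meaning, ' '.join(tags)])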

edit - just looked at it again, all the dat files are plaintext (doh!) whilst lots of the other files are compressed. I wonder if the numbers in the word list correspond to mp3 numbers...

edit2 - and some of the idx files are plaintext too, UTF-8 with no BOM. I feel like such a dumbass for skipping over this before. :(
Edited: 2011-01-21, 5:51 pm
#10
Blahah Wrote:edit - just looked at it again, all the dat files are plaintext (doh!) whilst lots of the other files are compressed. I wonder if the numbers in the word list correspond to mp3 numbers...
JIB_ejS.res.dict is particularly interesting in that it apparently isn't compressed/encrypted at all - it's just a plaintext XML file with expression, reading (including romaji), meaning, and audio file, plus all of those for the example sentence as well. Unfortunately, the S files represent only about 50 words, but perhaps the A and B packs have a similar format sans compression/encryption.
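
If anyone wants to poke at that file without guessing the schema, a generic walk will print whatever tags and text are actually in there (this assumes the file parses as a single well-formed XML document; if it's a series of fragments you'd have to wrap it in a dummy root first):

# Walk the XML and print tag names and text, without assuming any schema.
import xml.etree.ElementTree as ET

def walk(elem, depth=0):
    text = (elem.text or '').strip()
    print('  ' * depth + elem.tag + ((': ' + text) if text else ''))
    for child in elem:
        walk(child, depth + 1)

walk(ET.parse('JIB_ejS.res.dict').getroot())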
#11
overture2112 Wrote:
Blahah Wrote:edit - just looked at it again, all the dat files are plaintext (doh!) whilst lots of the other files are compressed. I wonder if the numbers in the word list correspond to mp3 numbers...
JIB_ejS.res.dict is particularly interesting in that it apparently isn't compressed/encrypted at all - it's just a plaintext XML file with expression, reading (including romaji), meaning, and audio file, plus all of those for the example sentence as well. Unfortunately, the S files represent only about 50 words, but perhaps the A and B packs have a similar format sans compression/encryption.
Hey, that is cool - it's so well structured! But that's the only plaintext XML file (apart from plists and Localizable.strings); I searched all the files for all the tags in JIB_ejS.res.dict and found zilch.

I am using the 'Deluxe' version of the paid app, so it seems the files are the same no matter what you pay. JIB_jeA.res.dict is large and probably contains the bulk of the material we are looking for, but it is compressed, with no headers identifying the compression. The accompanying idx is plaintext. Any ideas how to work out the compression?

edit - TRiD gives a 0% match on the dict files, and gives 100% match on the idx files as CAT files (MS Security files). Sounds unlikely.
Edited: 2011-01-21, 6:40 pm
#12
Blahah Wrote:The accompanying idx is plaintext. Any ideas how to work out the compression?
The .res.idx files are 18 bytes per record, with the first 9 being an index number, I'm guessing. Not sure what the last 9 are, but I'm hoping they relate the former to something in another file, perhaps byte offsets? The other .idx files are variable-length records that contain the meaning in some language and then 9 bytes that I assume act similarly to the .res.idx ones. So not completely plaintext, but hopefully still usable.
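
For anyone else poking at them, this is roughly how I'm eyeballing the records - read 18 bytes at a time and print both halves in hex (the 9/9 split is just the guess above, and the filename is only an example):

# Dump the first few fixed-length records of a .res.idx file, split 9 + 9
# bytes per the guess above, to see whether either half looks like an offset.
with open('JIB_ejA.res.idx', 'rb') as f:  # filename is just an example
    data = f.read()

for i in range(0, min(len(data), 18 * 20), 18):  # first 20 records
    rec = data[i:i + 18]
    print(rec[:9].hex(), rec[9:].hex())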

As for the apparent compression on the .res.dict, perhaps try forcing a bunch of algorithms on it and see if one works? If it's actually encrypted then we're probably SOL.
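
Along those lines, a quick sketch that just throws the usual Python stdlib decompressors at a file and reports which, if any, succeed (the filename is just an example):

# Try the common stdlib decompressors on a file and report which (if any) work.
import bz2, gzip, lzma, zlib

candidates = {
    'gzip': gzip.decompress,
    'zlib': zlib.decompress,
    'raw deflate': lambda d: zlib.decompress(d, -15),
    'bz2': bz2.decompress,
    'lzma/xz': lzma.decompress,
}

with open('JIB_ejA.res.dict', 'rb') as f:  # filename is just an example
    data = f.read()

for name, decompress in candidates.items():
    try:
        out = decompress(data)
        print(name, 'worked:', len(out), 'bytes out')
    except Exception as exc:
        print(name, 'failed:', exc)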
#13
*eyes glaze over*
#14
overture2112 Wrote:As for the apparent compression on the .res.dict, perhaps try forcing a bunch of algorithms on it and see if one works? If it's actually encrypted then we're probably SOL.
What are the first 8 bytes? Most compression methods have a header signature that will clue you in on the tool/technique.
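
Something like this quick sketch will print the first bytes and flag a few well-known signatures (the filename is just an example):

# Print the first 8 bytes in hex and flag a few well-known compression signatures.
SIGNATURES = {
    b'\x1f\x8b': 'gzip',
    b'BZh': 'bzip2',
    b'\xfd7zXZ\x00': 'xz',
    b'PK\x03\x04': 'zip',
    b'\x78\x01': 'zlib (fast)',
    b'\x78\x9c': 'zlib (default)',
    b'\x78\xda': 'zlib (best)',
}

with open('JIB_ejA.res.dict', 'rb') as f:  # filename is just an example
    head = f.read(8)

print('first 8 bytes:', ' '.join('%02x' % b for b in head))
for magic, name in SIGNATURES.items():
    if head.startswith(magic):
        print('looks like', name)
        break
else:
    print('no known compression signature')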
#15
brianobush Wrote:
overture2112 Wrote:As for the apparent compression on the .res.dict, perhaps try forcing a bunch of algorithms on it and see if one works? If it's actually encrypted then we're probably SOL.
What are the first 8 bytes? Most compression methods have a header signature that will clue you in on the tool/technique.
No such luck: the first 8 bytes are different on JIB_ejB.res.dict vs JIB_ejA.res.dict, and the `file` command is unable to identify them by magic number anyway.
#16
You guys are having too much fun with this hacktivity!

Of course, for those people who are content with the raw Japanese sentences and key words (without audio) the problem is effectively solved. Ok, first and last post. Gotta get back to studying Japanese!
#17
overture2112 Wrote:
brianobush Wrote:
overture2112 Wrote:As for the apparent compression on the .res.dict, perhaps try forcing a bunch of algorithms on it and see if one works? If it's actually encrypted then we're probably SOL.
What are the first 8 bytes? Most compression methods have a header signature that will clue you in on the tool/technique.
No such luck: the first 8 bytes are different on JIB_ejB.res.dict vs JIB_ejA.res.dict, and the `file` command is unable to identify them by magic number anyway.
The DICT standard uses .dict and .idx files. Dict files are basically made up of small gzipped pieces added together with an index (the idx file) which gives the addresses of the parts for rapid access. The idx file should have been compressed using gzip -9 but gunzip -9 doesn't recognise it (including if you change the extension).

Dict files can supposedly be decompressed using dictzip which comes in the dictd tar bundle from dict.org. I'll have to install xcode before I can compile it, so if someone else gets there first I would be v happy...
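
If it really is dictzip, the file should still start with an ordinary gzip header carrying an FEXTRA subfield with ID 'RA' (that's where dictzip keeps its random-access table), so a quick check like this sketch would tell us before bothering to compile dictd (the filename is just an example):

# Check for a dictzip header: dictzip output is ordinary gzip with an FEXTRA
# field whose subfield ID is 'R','A' (the random-access chunk table).
import struct

def looks_like_dictzip(path):
    with open(path, 'rb') as f:
        head = f.read(12)
        if len(head) < 12 or head[:2] != b'\x1f\x8b':
            return False  # not even a gzip stream
        if not head[3] & 0x04:
            return False  # FEXTRA flag not set
        xlen = struct.unpack('<H', head[10:12])[0]
        extra = f.read(xlen)
    i = 0
    while i + 4 <= len(extra):  # scan the extra field for the 'RA' subfield
        sub_id = extra[i:i + 2]
        sub_len = struct.unpack('<H', extra[i + 2:i + 4])[0]
        if sub_id == b'RA':
            return True
        i += 4 + sub_len
    return False

print(looks_like_dictzip('JIB_ejA.res.dict'))  # filename is just an example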
#18
Blahah Wrote:Dict files can supposedly be decompressed using dictzip which comes in the dictd tar bundle from dict.org. I'll have to install xcode before I can compile it, so if someone else gets there first I would be v happy...
Oh? I thought it might be some standard dictionary format but didn't have any luck searching around (it probably doesn't help that I know nothing of dictionary formats). I'll try checking that, but it'll probably have to wait until after I sleep.

----

In the meantime, I extracted all the audio for the words and sentences and have the files named according to the system they use (jw######.m4a and js######.m4a for words and sentences, respectively).

It's 155MB for the sentences and 85MB for the words (240MB total).

----

I wish they would just sell their data as a collection of CSV + audio files, but it works well enough to pay $16 through iTunes and then use this. I guess legally you "buy" the program and all its files for $0 with the lite edition (Apple even bills it that way), and the fact that you don't need to unlock it in their program is inconsequential because you're not using their program in and of itself; but I think it's more within the spirit to purchase the full thing, and it's well worth the money once we finalize an Anki deck.
#19
overture2112 Wrote:In the meantime, I extracted all the audio for the words and sentences and have the files named according to the system they use (jw######.m4a and js######.m4a for words and sentences, respectively).

It's 155MB for the sentences and 85MB for the words (240MB total).

----

I wish they would just sell their data as a collection of CSV + audio files, but it works well enough to pay $16 through iTunes and then use this. I guess legally you "buy" the program and all its files for $0 with the lite edition (Apple even bills it that way), and the fact that you don't need to unlock it in their program is inconsequential because you're not using their program in and of itself; but I think it's more within the spirit to purchase the full thing, and it's well worth the money once we finalize an Anki deck.
Nice work extracting the m4a's! How did you do it? I can see the mpeg headers in the files, but how did you extract them and get the filenames?

I did pay for the deluxe version, then I emailed them to ask for the data to be made exportable; they said they're considering it for the future. The app is one of the few which is definitely worth the money; I use it quite frequently.
#20
Blahah Wrote:Nice work extracting the m4a's! How did you do it? I can see the mpeg headers in the files, but how did you extract them and get the filenames?
I thought it was some sort of archive of mpeg4 files, as that seemed likely when viewing it as text, so I tried dissecting it and throwing chunks at mplayer. Eventually I realized I was an idiot and had somehow run `file` on everything but those 2 files - it turns out they're in a standard QuickTime container, albeit with an insane number of tracks in one file (which prevents any app I've found from playing it in a useful fashion).

With the notes from that site, it was then a trivial matter to split the file into chunks wherever the magic header appears, and then, with a little more effort, to extract the filenames.
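
For anyone else trying it, the splitting step looks roughly like this (a sketch, not the actual script; it assumes each embedded chunk starts with its own 4-byte size field followed by 'ftyp', and that the audio really does live in sentences.dict / headwords.dict as described earlier):

# Cut the blob 4 bytes before every 'ftyp' marker, since an MPEG-4 file starts
# with a 4-byte box size followed by 'ftyp'. A stricter version would validate
# the box size instead of trusting every match. Naming the output files after
# the embedded filenames is a separate step.

def split_on_ftyp(path, out_prefix='chunk'):
    with open(path, 'rb') as f:
        data = f.read()

    starts = []
    pos = data.find(b'ftyp')
    while pos != -1:
        starts.append(max(pos - 4, 0))
        pos = data.find(b'ftyp', pos + 4)

    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(data)
        with open('%s%05d.m4a' % (out_prefix, n), 'wb') as out:
            out.write(data[start:end])

split_on_ftyp('sentences.dict', 'sentence')
split_on_ftyp('headwords.dict', 'word')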
#21
Blahah Wrote:Dict files can supposedly be decompressed using dictzip which comes in the dictd tar bundle from dict.org. I'll have to install xcode before I can compile it, so if someone else gets there first I would be v happy...
Turns out they're actually in StarDict format (Star Dict File Format).

I have the expression, reading, and English meanings of the words linked now.
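
For reference, a stock StarDict pair reads like this: each .idx record is a NUL-terminated UTF-8 word followed by a big-endian 32-bit offset and a big-endian 32-bit size into the .dict. This is only a sketch of the plain-vanilla format (the filenames are examples); since these files go slightly off spec, treat it as a starting point rather than a drop-in:

# Minimal reader for a standard StarDict .idx/.dict pair.
import struct

def read_stardict(idx_path, dict_path):
    with open(dict_path, 'rb') as f:
        dict_data = f.read()
    with open(idx_path, 'rb') as f:
        idx = f.read()

    entries = {}
    i = 0
    while i < len(idx):
        end = idx.index(b'\x00', i)  # word is NUL-terminated UTF-8
        word = idx[i:end].decode('utf-8')
        offset, size = struct.unpack('>II', idx[end + 1:end + 9])
        entries[word] = dict_data[offset:offset + size]
        i = end + 9
    return entries

entries = read_stardict('JIB_ejA.idx', 'JIB_ejA.dict')  # example filenames
print(len(entries), 'entries')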
Edited: 2011-01-22, 8:20 am
#22
overture2112 Wrote:Turns out they're actually in StarDict format (Star Dict File Format).

I have the expression, reading, and English meanings of the words linked now.
I tried StarDict before but it needs an ifo file for each dict - how did you bypass that?

edit: OK, so I see how it's in the StarDict format (sort of), but what's the encoding? I can't get the type identifiers to look like normal chars. And you can interpret the files without the ifo, but did you write something to parse the files?
Edited: 2011-01-22, 9:20 am
#23
Blahah Wrote:
overture2112 Wrote:Turns out they're actually in StarDict format (Star Dict File Format).

I have the expression, reading, and English meanings of the words linked now.
I tried StarDict before but it needs an ifo file for each dict - how did you bypass that?

edit: OK, so I see how it's in the StarDict format, but did you write something to parse the files?
Yeah, I wrote my own code to manipulate the StarDict files, since it sort of goes off spec with a few things anyway. I'll post the code after I sleep.

EDIT: Actually, I put it here on pastebin in case you're curious and don't want to wait. It's not too refined, but it works fairly well. Should run if it's in the same directory as all the app's files and you make 'wordAudio' and 'sentenceAudio' subdirs for it to fill.
Edited: 2011-01-22, 9:39 am
#24
So you're close? Nice.
#25
overture2112 Wrote:EDIT: Actually, I put it here on pastebin in case you're curious and don't want to wait. It's not too refined, but it works fairly well. Should run if it's in the same directory as all the app's files and you make 'wordAudio' and 'sentenceAudio' subdirs for it to fill.
Awesome, I played with it and output a tab-delimited file of the words and expressions (that's my major Python achievement of the year), but I can't see how we can relate them to the audio.

edit: The audio is sorted in kana order, so the earliest-numbered sounds are ああ, あい, etc. So one way to match them would be to sort the reading field (but I dunno if Python can do kana-order sorting in Japanese?) and then assume the numbers match up.
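
On the sorting question: plain sorted() gets you most of the way, because the Unicode hiragana block is laid out in roughly gojūon order (small kana and voiced forms interleave, so it's only an approximation) - and matching the sorted order to the audio numbering is of course just the guess above:

# Code-point sorting approximates kana (gojūon) order for hiragana readings.
# Pairing the sorted position with the js#####a.m4a numbering (reportedly
# starting at js00002a) is only the hypothesis above - sanity-check it by ear.
readings = ['せんせい', 'ああ', 'がっこう', 'あい', 'かお']  # made-up sample
for n, reading in enumerate(sorted(readings), start=2):
    print('js%05da.m4a ?' % n, reading)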

Also we (i.e. you - I am only good for giving enthusiastic praise) need to get the English equivalents from JIB_ej*.index.example.idx and match them by ID. I could try, but I think by the time you had completed the task I would still be frantically googling python methods. It took me a good two hours to modify your code to print the dict :S
Edited: 2011-01-22, 4:07 pm