
Kanji compounds/words per lesson of Book 1?

#1
Hi everyone,

I'm wondering if anyone knows of any lists of kanji compounds or words organised per Lesson of RTK1? I learned Japanese through a totally different system and my reading ability is advanced. I know all the readings of the kanji and how to read thousands of compounds already. But my writing skills are weak. I am going through RTK1 now, even though I know it's not recommended to do it backwards.

I'd like to practice by writing words I already know and can read (but can't write from memory). If I do this lesson by lesson I'm hoping to fill in the gaps. For example, from Lessons 1 and 2, words like 明日, 早く, 古本, stuff like that.

Or does anyone know of a program or site where you can input a list of selected kanji (e.g. Lessons 1-5 or RTK1) and have it produce a list of words or compounds containing only those kanji?

Thanks in advance and I hope to be a regular contributor on the forum.
#2
If you go to the Reviewing the Kanji main site, the vocab shuffle in the Labs tab does just that. One condition though: it only uses kanji you have learned (three or more successful reviews).
#3
I think it's better to do it the 'pure' method, and then just add the readings later... You really only need the writing, because you can already recognise and read a lot of kanji, right? So don't clutter your study with things you already know.^^
#4
Well, it's not really things I already 'know', or at least know fully. I could easily read a word like 懲罰, for example, but there's no way I could write it from memory. If I just go through the pure method, I think I'll still be at a loss as to which kanji to use for the words I already know. That's why I wanted to start practicing writing as I go.

But I'm not actually sure what the best way to move forward is.
#5
ktcgx Wrote:I think it's better to do it the 'pure' method, and then just add the readings later... You really only need the writing, because you can already recognise and read a lot of kanji, right? So don't clutter your study with things you already know.^^
I see. Is there a downloadable list? Or do you have to do it through the site? I use Anki.
#6
If you use OS X, you can paste this to Terminal:

for x in edict_sub kanjidic;do curl ftp://ftp.monash.edu.au/pub/nihongo/$x|iconv -f euc-jp -t utf-8>$x.txt;done;sed -En 's/^([^ ]+).* L([0-9]+) .*/\2 \1/p' kanjidic.txt|sort -n|sed -n 1,94p|cut -d' ' -f2|tr -d '\n'|ruby -KUe 'kanji=STDIN.read;puts IO.read("edict_sub.txt").scan(/^([#{kanji}]{2}) \[(.*?)\] \/(?:\(.*?\) )*(.*?)\//).map{|l|l.join("\t")}'

It also worked on Ubuntu when I ran sudo apt-get install curl ruby1.9.1 and changed sed -E to sed -r.

sed -n 1,94p selects RTK frame numbers 1 to 94. KANJIDIC still uses fifth edition frame numbers.

[#{kanji}]{2} matches only two-kanji compounds. Change it to [#{kanji}\343\201\201-\343\202\237]{2,} to include words with hiragana.

Example output (where I replaced tabs with two spaces):

一日 ついたち first day of the month
目下 めした subordinate
明朝 みょうちょう tomorrow morning
一元 いちげん unitary
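
The hiragana variant mentioned above can also be written without the raw byte range. Below is a minimal sketch, assuming a reasonably modern Ruby where \p{Hiragana} is available in regular expressions. The kanji set, the sample entries and the names KNOWN, WORD_PATTERN and matching_entries are made up for illustration and are not part of the original one-liner.

```ruby
# Illustrative only: the kanji set and the sample lines are assumptions,
# not the real KANJIDIC/EDICT data.
KNOWN = "明日早古本".freeze

# Headword of two or more characters, each either a known kanji or hiragana.
# \p{Hiragana} is used here in place of the byte range from the post.
WORD_PATTERN = /^([#{KNOWN}\p{Hiragana}]{2,}) \[(.*?)\] \/(?:\(.*?\) )*(.*?)\//

SAMPLE = <<~EDICT
  明日 [あした] /(n-t) tomorrow/
  早く [はやく] /(adv) early/soon/
  古本 [ふるほん] /(n) secondhand book/
  懲罰 [ちょうばつ] /(n,vs) discipline/punishment/
EDICT

# Returns one tab-separated "word, reading, first gloss" string per match.
def matching_entries(text, pattern)
  text.scan(pattern).map { |fields| fields.join("\t") }
end

puts matching_entries(SAMPLE, WORD_PATTERN)
```

With this sample, 懲罰 is skipped because neither of its kanji is in the known set, while 早く is kept because く is hiragana.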
Edited: 2013-05-31, 5:02 am
#7
As far as I understand your problem (I may be wrong, of course), you don't seem to know
1. kanji stroke order rules - they are very simple
2. bushu (radicals) and their Japanese names.

You can find all the necessary info here:
http://users.bestweb.net/%7Esiom/martian_mountain/K.7z
The file includes two fonts as well: stroke orders and calligraphic font.
There are HTML files in there with the full new jouyou kanji list and example words. One of the lists is in Heisig order.

You may find this offline dictionary useful too.
http://zkanji.sourceforge.net/
#8
buonaparte Wrote:As far as I understand your problem (I may be wrong, of course), you don't seem to know
1. kanji stroke order rules - they are very simple
2. bushu (radicals) and their Japanese names.

You can find all the necessary info here:
http://users.bestweb.net/%7Esiom/martian_mountain/K.7z
The file includes two fonts as well: stroke orders and calligraphic font.
There are HTML files in there with the full new jouyou kanji list and example words. One of the lists is in Heisig order.

You may find this offline dictionary useful too.
http://zkanji.sourceforge.net/
I do know stroke order rules but not the names of the radicals. But I'm more interested in being able to write all the words I already know how to read, if that makes sense. So I'm going through RTK1 and using the stories to write each individual kanji. But I'd like to supplement that by practicing writing words I know that use kanji from the Lessons of RTK1 I've done so far.
#9
lauri_ranta Wrote:If you have access to a Unix shell:
I appreciate your time in answering my question, but I'm afraid my technical skills aren't up to that task. I don't have access to a Unix shell (to be honest I don't really know what it is). But I will continue to look for ways to generate words from a given list of kanji.
#10
ktcgx Wrote:I think it's better to do it the 'pure' method, and then just add the readings later... You really only need the writing, because you can already recognise and read a lot of kanji, right? So don't clutter your study with things you already know.^^
Doing that now and it looks good. Thanks. What I'm really looking for is the data source that the vocab shuffle tool uses to generate its words.
Edited: 2013-05-30, 8:25 am
#11
If you're familiar with most radicals and understand general stroke order rules, I think you can safely just reverse the cards in your vocab deck, starting with the simplest (so they go hiragana -> kanji). Indeed, if you can, you should, since practicing writing compounds will be more useful.

At least this is what I did when I forgot most of RTK a year or two ago. Just redownloaded core 6000, and reversed about 1500 cards. Friend did a similar thing.
Edited: 2013-05-30, 8:49 am
#12
choubatsu Wrote:
ktcgx Wrote:I think it's better to do it the 'pure' method, and then just add the readings later... You really only need the writing, because you can already recognise and read a lot of kanji, right? So don't clutter your study with things you already know.^^
I see. Is there a downloadable list? Or do you have to do it through the site? I use Anki.
There are downloadable, public RTK1 decks for Anki, but personally I prefer this site, as I find it much easier to use. Plus, all the stories published here help you remember how to write the characters...
#13
@lauri_ranta
Cleverly done :)

@choubatsu
If your machine is a Mac, you do have access to a Unix command line - this is where lauri_ranta's command can be executed. Below is my attempt to make his script a bit more accessible (# [nb] is a comment and as such has no effect on the execution of the command):

Code:
# [1]
for x in edict_sub kanjidic; do
    curl ftp://ftp.monash.edu.au/pub/nihongo/$x |
        iconv -f euc-jp -t utf-8 > $x.txt
done

# -E is the extended-regex flag understood by OS X (BSD) sed; GNU sed takes -r instead
sed -En 's/^([^ ]+).* L([0-9]+) .*/\2 \1/p' kanjidic.txt | # [2]
    sort -n | # [3]
    sed -n 1,94p | # [4]
    cut -d ' ' -f2 | # [5]
    tr -d '\n' | # [6]
    ruby -KUe '
        kanji=STDIN.read # [7]
        puts IO.read("edict_sub.txt"). # [8]
            scan(/^([#{kanji}]{2}) \[(.*?)\] \/(?:\(.*?\) )*(.*?)\//). # [9]
            map{|l|l.join("\t")}' # [10]
[1]
Download publicly available kanji dictionaries:
Edict and Kanjidic - save them to edict_sub.txt and kanjidic.txt files

[2]
From all lines in kanjidic.txt select a kanji and its frame number - print them as (number, kanji)

[3]
Sort the output of [2] - this sorts the kanji according to the frame number

[4]
From the output of [3] select lines 1 to 94, i.e. the first 94 frames

[5]
From the output of [4] strip out the first column (the frame number)

[6]
From the output of [5] remove linebreaks - all the selected kanji are concatenated on one line now

[7]
Store the output of [6] to a variable 'kanji'

[8]
Read in edict_sub.txt

[9]
In the string created at [8] select the entries whose headword is a two-kanji compound made up only of characters stored in the 'kanji' variable

[10]
The output of [9] is a list of (compound, reading, meaning) triples - join the three fields of each triple with a tab character and print one entry per line. This is our final output.
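
For anyone without a Unix shell, steps [2] to [10] can also be sketched in Ruby alone. This is an illustrative rewrite under assumptions, not lauri_ranta's script: the function names kanji_up_to and compounds_from are hypothetical, the inline samples are simplified placeholders rather than real file contents, and it filters on frame number <= limit instead of taking the first N sorted lines. The real kanjidic and edict_sub files would still have to be downloaded and converted to UTF-8 first.

```ruby
# Simplified stand-ins for kanjidic.txt and edict_sub.txt. The field values
# are placeholders for illustration, not copied from the real files.
KANJIDIC_SAMPLE = <<~KANJIDIC
  日 B72 G1 S4 L12 ニチ ジツ ひ {day} {sun}
  一 B1 G1 S1 L1 イチ ひと- {one}
  明 B72 G2 S8 L20 メイ あ.かり {bright}
  懲 B61 G8 S18 L888 チョウ こ.りる {penal}
KANJIDIC

EDICT_SAMPLE = <<~EDICT
  一日 [ついたち] /(n) first day of the month/
  明日 [あした] /(n-t) tomorrow/
  懲罰 [ちょうばつ] /(n,vs) discipline/
EDICT

# Steps [2]-[6]: pull (frame, kanji) pairs out of each line, keep frames up
# to the limit, sort by frame number and concatenate the kanji.
def kanji_up_to(kanjidic_text, limit)
  kanjidic_text.scan(/^(\S+).* L(\d+) /)
               .map { |kanji, frame| [frame.to_i, kanji] }
               .select { |frame, _| frame <= limit }
               .sort
               .map { |_, kanji| kanji }
               .join
end

# Steps [7]-[10]: keep entries whose headword is exactly two known kanji and
# join word, reading and first gloss with tabs.
def compounds_from(edict_text, kanji)
  return [] if kanji.empty? # an empty character class would be a regex error

  pattern = /^([#{Regexp.escape(kanji)}]{2}) \[(.*?)\] \/(?:\(.*?\) )*(.*?)\//
  edict_text.scan(pattern).map { |fields| fields.join("\t") }
end

known = kanji_up_to(KANJIDIC_SAMPLE, 94)
puts compounds_from(EDICT_SAMPLE, known)
```

To run it against the real data, the two samples would be replaced by something like File.read("kanjidic.txt", encoding: "UTF-8") and File.read("edict_sub.txt", encoding: "UTF-8").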
#14
Sorry I'm a PC user.
#15
EratiK Wrote:If you go to the Reviewing the Kanji main site, the vocab shuffle in the Labs tab does just that. One condition though: it only uses kanji you have learned (three or more successful reviews).
In addition to what EratiK mentioned, you can also do so without adding flashcards on the Labs page by entering a Heisig index. You will get a random selection of kanji compounds made of characters learned up to the given Heisig frame number. You don't get to choose a specific range of characters but you can speed through the ones you've already seen with the space bar.
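
The behaviour described here (a random pick of compounds whose kanji all fall at or below a given Heisig index) can be imitated offline. This is only a rough sketch and not how the Labs page is actually implemented, which isn't stated anywhere in this thread; the frame table, the word list and the function names are all made-up samples.

```ruby
# FRAME maps kanji to Heisig frame numbers. Tiny assumed sample for
# illustration; a real table would come from KANJIDIC.
FRAME = { "一" => 1, "日" => 12, "月" => 13, "明" => 20, "朝" => 52 }.freeze
WORDS = %w[一日 月日 明日 明朝 朝日].freeze

# A word is eligible when every character has a known frame number that is
# at or below the given index.
def eligible_words(words, frames, index)
  words.select do |word|
    word.each_char.all? { |ch| frames.key?(ch) && frames[ch] <= index }
  end
end

# Picks up to `count` eligible words at random. Passing a seeded Random
# makes the selection repeatable.
def random_compounds(words, frames, index, count, rng: Random.new)
  eligible_words(words, frames, index).sample(count, random: rng)
end

puts random_compounds(WORDS, FRAME, 20, 2)
```

Choosing a specific range of frames instead of "everything up to N" would only need a second bound in eligible_words.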
#16
Sorry to re-open this discussion after so long but I'm still looking for answers.

How does the Language lab generate words just from an input of a selection of characters? In other words, if I tell it I've studied up to Frame 750, how does it then generate a list of words which contain only Kanji from those frames?

* Edit * Also, what database of vocabulary is it searching for matching Kanji?
Edited: 2014-10-29, 4:50 pm
#17
Inny Jan Wrote:Below is my attempt to make his script a bit more accessible [...]
Do you know if this is possible on a PC somehow?
#18
I believe this is what you are looking for, but I'm a Mac user so I haven't used it.

https://www.cygwin.com/
#19
Why not try this? Get a couple of books like the ones in the links below. Look up the kanji from Heisig for which you want compounds and write down all the compounds that include only kanji up to the point in Heisig you desire. (We're not all computer programmers on this forum.)

http://www.amazon.com/Essential-Kanji-Ch...characters

http://www.amazon.com/Guide-Remembering-...characters

http://www.amazon.com/Guide-Reading-Writ...nce+sakade
#20
I just ran it and pasted the output below. Apparently it only goes up to frame 94. Email me if you like this and I'll send you a longer list so as not to fill up this thread with thousands of lines of text.

一員 いちいん person
一丸 いちがん lump
一月 いちがつ January
一見 いっけん look
一元 いちげん unitary
一口 ひとくち mouthful
一首 いっしゅ tanka
一寸 ちょっと just a minute
一世 いっせい generation
一切 いっさい all
一千 いっせん 1,000
一旦 いったん once
一丁 いっちょう one sheet
一二 いちに the first and second
一日 いちにち first day of the month
一日 ついたち first day of the month
一品 いっぴん item
一目 ひとめ glance
凹凸 おうとつ unevenness
下見 したみ preview
下旬 げじゅん month (last third of)
下町 したまち low-lying part of a city (usu. containing shops, factories, etc.)
下品 げひん vulgarity
九九 くく multiplication table
九月 くがつ September
九十 きゅうじゅう ninety
九日 ここのか the ninth day of the month
月見 つきみ viewing the moon
月日 がっぴ date
月日 つきひ time
元首 げんしゅ ruler
元旦 がんたん New Year's Day
元日 がんじつ New Year's Day
五月 ごがつ May
五十 ごじゅう fifty
五日 いつか the fifth day of the month
口上 こうじょう vocal message
工員 こういん factory worker
項目 こうもく item
左右 さゆう left and right
三月 さんがつ March
三十 さんじゅう thirty
三千 さんぜん 3000
三日 みっか the third day of the month
三百 さんびゃく 300
四月 しがつ April
四十 よんじゅう forty
四千 よんせん four thousand
四日 よっか fourth day of month
四百 よんひゃく four hundred
自首 じしゅ surrender
自白 じはく confession
自負 じふ conceit
自明 じめい obvious
七月 しちがつ July
七十 しちじゅう seventy
七日 なのか the seventh day of the month
首唱 しゅしょう advocacy
十一 じゅういち 11
十九 じゅうきゅう 19
十月 じゅうがつ October
十五 じゅうご 15
十三 じゅうさん 13
十四 じゅうし 14
十七 じゅうしち 17
十二 じゅうに 12
十日 とおか the tenth day of the month
十八 じゅうはち 18
十万 じゅうまん 100,000
十六 じゅうろく 16
上下 うえした top and bottom
上下 じょうげ top and bottom
上旬 じょうじゅん first 10 days of month
上昇 じょうしょう rising
上田 じょうでん high rice field
上品 じょうひん elegant
真上 まうえ just above
占有 せんゆう exclusive possession
早口 はやくち fast-talking
早朝 そうちょう early morning
卓上 たくじょう desktop
中元 ちゅうげん 15th day of the 7th lunar month
中古 ちゅうこ used
中旬 ちゅうじゅん middle of a month
中世 ちゅうせい Middle Ages (in Japan esp. the Kamakura and Muromachi periods )
中中 なかなか very
中日 ちゅうにち China and Japan
丁目 ちょうめ district of a town
朝日 あさひ morning sun
町中 まちなか downtown
頂上 ちょうじょう top
直下 ちょっか directly under
的中 てきちゅう striking home
凸凹 でこぼこ unevenness
二月 にがつ February
二見 ふたみ forked (road, river)
二十 にじゅう twenty
二世 にせい nisei
二日 ふつか second day of the month
二百 にひゃく two hundred
日中 にっちゅう daytime
日日 ひにち the number of days
八月 はちがつ August
八十 はちじゅう eighty
八丁 はっちょう skillfulness
八日 ようか the eighth day of the month
百万 ひゃくまん 1,000,000
品目 ひんもく item
万一 まんいち emergency
明朝 みょうちょう tomorrow morning
明日 あした tomorrow
明日 あす tomorrow
明日 みょうにち tomorrow
明白 めいはく obvious
目下 めした subordinate
目下 もっか at present
目上 めうえ superior
目的 もくてき purpose
目白 めじろ white-eye family of birds (Zosteropidae)
六月 ろくがつ June
六十 ろくじゅう sixty
六日 むいか sixth day of the month
#21
yogert909 Wrote:I just ran it and pasted the output below. Apparently it only goes up to frame 94.
Replace "94" with something larger in this command: "sed -n 1,94p |", i.e., step [4] in @InnyJan's deconstruction.
Edited: 2014-10-29, 8:32 pm
#22
Hey Aldebrn, thanks. I figured that out but I just didn't want to fill the thread with 1000s of lines of text. I know just enough about programming to make small edits to existing scripts, but not enough to write anything novel unfortunately. I'd love to be able to come up with the kinds of things you write, but I don't have the time to learn unfortunately.

Btw, is fuzzy-anki still working? I've tried it recently and it didn't seem to be working like it did the first few times.
#23
@choubatsu, take a look at https://gist.github.com/fasiha/42026df0e...ubatsu-txt (and documentation above it). It's a list of all 3000-odd kanji from RTK volumes 1 and 3, and for each kanji, there's a list of "compounds" in Edict (a popular but now-deprecated? open-source online J-E dictionary) that are built solely out of preceding kanji.

Is this something like what you're looking for? I know it's not grouped into lessons, and it doesn't have the nice accompanying definitions/readings, but these can be easily fixed.

One caveat is that when I say "compound", I mean basically any string of consecutive kanji. So there are numerous "compounds" which are actually concatenations of individual compounds, e.g., 朝鮮民主主義人民共和国 and 北海道開発庁長官 (which MeCab-wakati parses as 北海道開発庁 長官) and 国際協力事業団 (with spaces: 国際 協力 事業 団) & ... sorry. The code to build this is in JavaScript right now, and unlike the shell/Ruby script by lauri_ranta, it's much slower (since it's doing a lot more work; this list took ~20 minutes to generate). Based on feedback, I can make it faster & extend it to be more flexible so anyone can easily build lists like this given (1) some kanji and (2) a corpus of text. (Especially now that MeCab runs in JavaScript, we can have all kinds of linguistic parsing fun in the browser.)
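
The idea of treating any run of consecutive kanji as a "compound", and listing for each kanji the runs built solely from it and the kanji before it, can be sketched in a few lines of Ruby. This is a guess at the approach for illustration, not the JavaScript code behind the gist; ORDER is a tiny made-up stand-in for the RTK ordering and both function names are hypothetical.

```ruby
# ORDER is an assumed, tiny stand-in for the RTK ordering.
ORDER = %w[一 日 月 明 朝].freeze

# Every maximal run of consecutive kanji in the text counts as a "compound".
def kanji_runs(text)
  text.scan(/\p{Han}+/).uniq
end

# For each kanji, list the runs that contain it and otherwise use only
# kanji that come earlier in the ordering.
def compounds_per_kanji(order, text)
  rank = order.each_with_index.to_h
  runs = kanji_runs(text)
  order.map do |kanji|
    usable = runs.select do |run|
      run.include?(kanji) &&
        run.each_char.all? { |ch| rank.key?(ch) && rank[ch] <= rank[kanji] }
    end
    [kanji, usable]
  end.to_h
end

text = "明日は朝日が見える。一日、月日。"
compounds_per_kanji(ORDER, text).each do |kanji, runs|
  puts "#{kanji}: #{runs.join(' ')}"
end
```

Because runs are taken as-is, long concatenations like the ones mentioned above would show up as single "compounds" here too; splitting them would need a tokenizer such as MeCab.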

Edit: of all 14,000 "compounds" in Edict (not just those using the 2200 RTK1 kanji), here's a breakdown according to their length:
1 kanji: 1851 "compounds"
2 kanji: 10725 "compounds"
3 kanji: 1596 "compounds"
4 kanji: 361 "compounds"
5 kanji: 53 "compounds"
6 kanji: 11 "compounds"
7 kanji: 8 "compounds"
8 kanji: 1 "compound"
11 kanji: 1 "compound"
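
A breakdown like this is just a count of compounds per length. A minimal Ruby sketch, with a made-up five-item sample list instead of the real Edict data and a hypothetical function name:

```ruby
# Assumed sample; the real input would be the full list of kanji runs.
COMPOUNDS = %w[日 明日 朝日 一日中 北海道開発庁].freeze

# Maps each length (in characters) to the number of compounds of that
# length, ordered from shortest to longest.
def length_breakdown(compounds)
  compounds.group_by(&:length).transform_values(&:size).sort.to_h
end

length_breakdown(COMPOUNDS).each do |length, count|
  puts "#{length} kanji: #{count}"
end
```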

yogert909 Wrote:Hey Aldebrn, thanks. I figured that out but I just didn't want to fill the thread with 1000s of lines of text. I know just enough about programming to make small edits to existing scripts, but not enough to write anything novel unfortunately. I'd love to be able to come up with the kinds of things you write, but I don't have the time to learn unfortunately.

Btw, is fuzzy-anki still working? I've tried it recently and it didn't seem to be working like it did the first few times.
(1) I think just replacing 94 with 9400 does something somewhat bad: it changes the order of the printout so that 一 is no longer the first line printed: the 一 block of words at the top of your list there is some 200 lines in. I think this might be due to Ruby's regexp implementation?

(2) fuzzy-anki should still be working, I used its "review history" mode (versus "deck browse mode") a couple of days ago. It's terribly confusing and user-unfriendly, so let me know if something specific isn't working and I can try to simplify that (github issues: https://github.com/fasiha/fuzzy-anki/issues or email). I plan on improving that tool, but time spent coding is time not spent on Japanese :-/

(3) you come up with cool ideas that the coders can then go implement, so good on you!
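
On point (1): Ruby's String#scan returns matches in the order they occur in the text it scans, so the printout follows the order of entries in the dictionary file rather than RTK order. One way to get a list that follows the book is to sort the words by the highest frame number among their kanji. A small sketch, assuming a made-up frame table and hypothetical function names:

```ruby
# Assumed sample of kanji => frame number; a real table would come from
# KANJIDIC.
FRAMES = { "一" => 1, "日" => 12, "月" => 13, "明" => 20 }.freeze

# The newest kanji in a word decides when the word becomes writable.
# fetch raises KeyError if a character is missing from the table.
def newest_frame(word, frames)
  word.each_char.map { |ch| frames.fetch(ch) }.max
end

# Sorts by newest frame; the original position breaks ties so that the
# relative order of the input is preserved.
def sort_by_frame(words, frames)
  words.each_with_index
       .sort_by { |word, i| [newest_frame(word, frames), i] }
       .map(&:first)
end

puts sort_by_frame(%w[明日 一日 月日 一月], FRAMES)
```

Grouping the sorted words by newest_frame would then give one block of writable words per frame, which could be cut into lessons by hand.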
Edited: 2014-10-30, 7:25 am
#24
aldebrn Wrote:so anyone can easily build lists like this given (1) some kanji and (2) a corpus of text. (Especially now that MeCab runs in Javascript, we can have all kinds of linguistic parsing fun in the browser.)
That's basically what I'm looking for. A way to enter in a certain number of Kanji and extract from a corpus only the compounds which contain those Kanji.

I'd love to learn how to do it myself but, as I mentioned, I have zero experience with things like MeCab. Do you have any advice on how to get started using programs like MeCab? Or on how to get started writing basic code for linguistic problems?

BTW, I really appreciate everyone's help and the great suggestions I've seen on this thread. 誠に有難う!
#25
choubatsu Wrote:
aldebrn Wrote:so anyone can easily build lists like this given (1) some kanji and (2) a corpus of text. (Especially now that MeCab runs in Javascript, we can have all kinds of linguistic parsing fun in the browser.)
That's basically what I'm looking for. A way to enter in a certain number of Kanji and extract from a corpus only the compounds which contain those Kanji.
What did you think of this list I generated using a list of 2200 kanji from RTK1 and Edict as a corpus: https://gist.github.com/fasiha/42026df0e...ubatsu-txt ?

(It'll be a snap to make the code accept user input, so you could paste in your own kanji list & corpora, but I was looking for feedback on the output format.)

choubatsu Wrote:I'd love to learn how to do it myself but as I mentioned, I have zero experience with things like mecab. Do you have any advice on how to get started using programs like mecab? Or to get started learning how to do basic code for linguistic problems?
I'm the wrong person to ask how to get started, since I'm learning all this as I go along :-/ Jeff Berhow (a fellow Koohito) has a YouTube tutorial on installing MeCab on Windows, and if you don't want to go through all that pain, I compiled MeCab to JavaScript so you could play with it in your browser (click "Examples" there to see what options you can use). MeCab is a total mystery to me: all the documentation is in Japanese so I have no idea what it is actually capable of---I just see other people using it to make amazing things (like Anki's Japanese Support plugin which adds furigana to kanji-only text, or the wakati mode that adds spaces to Japanese) and reverse-engineer how they did it.

But specific problems like this, I can show you how I did it and answer questions: I generated your list of compounds using this software: https://github.com/fasiha/compounds-per-...er/code.js It's about as stupid an implementation as you can imagine, which is why it's terrifically slow. (I am pretty sure it can be sped up using the technique @lauri_ranta demonstrated...)
Edited: 2014-10-31, 8:31 am