kanji koohii FORUM
A program for handling vocab databanks? - Printable Version


A program for handling vocab databanks? - koyota - 2009-07-08

Hey everyone, so here's my situation: I'm finally done memorizing (at least in terms of recognition) all the JLPT vocab.

While I've been doing this, I've been keeping Wordpad files of all the words I read in books whose meaning or reading I didn't know and had to look up. I literally have thousands of words in these Wordpad files. Here's a sample of what it looks like:

揶揄 やゆ banter; raillery; tease; ridicule; banter with
範疇 はんちゅう category
無くて七癖 なくてななくせ Every man has his own peculiar habits
鬼の首を取ったよう おにのくびをとったよう (in a) triumphant manner
闡明 せんめい make clear; explicate; explain
浄瑠璃 じょうるり 1: ballad drama;
2: drama narrator for the bunraku theatre (theater)

I now want to do something with these Wordpad files: use some type of program that will dump all the Wordpad files together, go through them, and find words that are repeated multiple times. Then I'll take the time to memorize these repeated words and put them in Anki.

Is there any type of program that can help me out in doing this?

Along with that, for the future: is there a better way of handling vocab like this than just dumping it in a Wordpad file, e.g. having a program tell me if I've already seen the vocab before?

Thanks for your help


A program for handling vocab databanks? - AmberUK - 2009-07-08

If you do File - Import in Anki, it says that it imports text separated by tabs or semicolons. Have you tried that?
http://ichi2.net/anki/wiki/FileImport

Then you can order by different fields and see?


A program for handling vocab databanks? - koyota - 2009-07-08

AmberUK - Thanks for trying to help, but sadly that doesn't work.

The files wouldn't work because of words whose entries run over multiple lines, like the one below, not to mention that Anki doesn't seem to be able to import Japanese from Wordpad files.

虚構 きょこう 1: fiction; fabrication; concoction; (No-adjective)
2: fictitious; fictional; imaginary

Even something that I can just copy and paste all my wordpad files into some big document and then have a program search for duplicates of kanji/romaji combinations would be nice.


A program for handling vocab databanks? - AmberUK - 2009-07-08

It would have to be pretty clever to know that there were two words the same in there. Esp if you have more than one definition. If fiction is the first definition for one word but the second for another is that a duplication?

Someone out there has probs written a script. I presume there is no anki add-on script?


A program for handling vocab databanks? - Nukemarine - 2009-07-08

Well, if it has a consistent separator, you can do a find/replace in Excel or Word. Just include the "carriage return" in the find portion, which removes the line break and joins the second line onto the first.

Barring that, plain old-fashioned line-by-line checking can do it.

After that, a little manipulation to make it Anki friendly. Then just import with the rule to tag duplicates (make the fields unique). Tedious, but simple.
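
The find/replace idea above can also be sketched in a few lines of Python. This is only a rough sketch: it assumes that a continuation line always starts with a sense number followed by a colon (such as "2:"), as in koyota's samples, and the function name merge_continuations is made up for illustration.

```python
import re

# Assumption: a continuation line starts with a sense number such as "2:".
CONTINUATION = re.compile(r"^\s*\d+:")

def merge_continuations(lines):
    """Join numbered continuation lines onto the entry line before them."""
    merged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between entries
        if merged and CONTINUATION.match(line):
            merged[-1] += " " + line  # append to the previous entry
        else:
            merged.append(line)  # start a new entry
    return merged

sample = [
    "虚構 きょこう 1: fiction; fabrication; concoction; (No-adjective)",
    "2: fictitious; fictional; imaginary",
    "範疇 はんちゅう category",
]
for entry in merge_continuations(sample):
    print(entry)
```

After this step every entry sits on a single line, which is what a line-based import or duplicate check needs.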


A program for handling vocab databanks? - liosama - 2009-07-08

Which leads me to a question;

What would the best file format be for me if I was to do a similar thing as Amber, I was currently doing excel but if there is something less tedious then any ideas before I find myself dealing with a huge file like Amber


A program for handling vocab databanks? - shang - 2009-07-08

liosama Wrote:What would the best file format be for me if I was to do a similar thing as Amber, I was currently doing excel but if there is something less tedious then any ideas before I find myself dealing with a huge file like Amber
I don't know about tediousness, but in terms of what would be easy to export to some external program like ANKI, I recommend a plain TXT file (with UTF-8 encoding) using a single line for every entry and separating all fields with the same separator character (like TAB or semi-colon).
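
As a rough illustration of that layout, here is a small Python sketch that turns entries of the form "word reading meaning" into tab-separated lines in a UTF-8 file. The three-field layout and the function names to_tsv_line and write_tsv are assumptions for the example, not anything Anki requires.

```python
def to_tsv_line(entry):
    """Turn 'word reading meaning...' into three tab-separated fields."""
    parts = entry.split(None, 2)  # split on the first two runs of whitespace
    parts += [""] * (3 - len(parts))  # pad entries that lack a field
    return "\t".join(parts)

def write_tsv(entries, path):
    """Write one entry per line as UTF-8 text."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for entry in entries:
            handle.write(to_tsv_line(entry) + "\n")

print(to_tsv_line("範疇 はんちゅう category"))
```

TAB is used here rather than a semicolon because the meanings in the sample entries already contain semicolons.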


A program for handling vocab databanks? - koyota - 2009-07-08

Thanks for trying to help guys!

But I don't think what I'm asking about is clear. Basically, what I'm looking for is incredibly simple: what http://language.tiu.ac.jp/ (Reading Tutor) does, except able to work with a file that has tens of thousands of words (Reading Tutor seems not to accept more than a couple hundred words at a time).

I just want something that will ignore Roman characters and count the words that occur more than once, so I know that it's a word I've seen often and should memorize. I have no intention or desire to enter the entire list into Anki or an SRS program.

All my words are kept separate, with spaces between distinct words.

For example, just throwing one of my Wordpad files into Reading Tutor got me 淋しい 2; everything else only had one appearance. Thus I know the kanji for 淋しい is something I should probably memorize for the future.
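
A minimal Python sketch of that kind of count might look like the following. It assumes the text has already been saved as plain text and that "Japanese" means hiragana, katakana, or the common CJK ideograph block; the function name repeated_words is made up for the example. Note that it counts every Japanese token, so readings are counted alongside the kanji spellings.

```python
import re
from collections import Counter

# Assumption: "Japanese" means hiragana, katakana, or common CJK ideographs.
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def repeated_words(text):
    """Return {word: count} for Japanese tokens seen more than once."""
    # Tokens without any Japanese character (Roman letters, digits) are ignored.
    tokens = [tok for tok in text.split() if JAPANESE.search(tok)]
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n > 1}

sample = """淋しい さびしい lonely
範疇 はんちゅう category
淋しい さびしい lonesome
"""
print(repeated_words(sample))
```

Several files can be handled by concatenating their contents into one string before calling the function.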


A program for handling vocab databanks? - avparker - 2009-07-08

You could try the software mentioned in http://forum.koohii.com/showthread.php?pid=57870#pid57870 (I haven't used it).

Or perhaps try asking some of the people on this thread - http://forum.koohii.com/showthread.php?tid=3216&page=1


A program for handling vocab databanks? - Pauline - 2009-07-08

If you have Linux, you can use sed and filters on text files:

1. Remove all but the first "word" and save to a new file.
sed 's/\s\w*//g' in_file.txt > out_file.txt
Removes every word that follows a whitespace character (\s, or simply ' ' for a space). Replace \s if you have another separator (use \t for tabs). Saves the result in another file (out_file.txt).

2. Remove lines that start with letters or digits, as well as empty lines.
sed -i -e '/^[a-zA-Z0-9]/d' -e '/^$/d' out_file.txt
The -i means the changes are made in the file itself. The part within brackets lists the character ranges that must match for the whole line to be removed (^ is for start of line).

3. List the most common words.
cat out_file.txt | sort | uniq -c | sort -nr | head -10
List contents of file | sort the words | count and remove duplicates | numerical sort, largest first | list the first 10 lines (the most frequent words)

1-3. The combined script
sed -e 's/\s\w*//g' -e '/^[a-zA-Z0-9]/d' -e '/^$/d' in_file.txt | sort | uniq -c | sort -nr | head -10

No extra files are created and the result is simply printed.

Test data
in_file.txt:
から よる thou
1 2 4
か 345
Kanji
から

いか

zed
いか
い 予土
予土
いか
with the result:
3 いか
2 から
2 か
1 予土
1 い

Note: This solution was done very fast and is not well tested. Use it on copies of your word files. With some more work (and testing) it can be rewritten as a script that can be invoked with ./script in_file.txt 10.
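
For anyone without sed, the same three steps can be approximated in Python. This is a sketch, not a tested replacement: it takes the first whitespace-separated token of each line instead of applying the sed substitution, so edge cases (for example lines with leading spaces) may behave differently, and the function name top_words is made up for the example.

```python
from collections import Counter

def top_words(lines, limit=10):
    """Roughly mirror the three steps: first token, drop ASCII-led lines, count."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # step 2: skip empty lines
        first = tokens[0]  # step 1: keep only the first "word"
        if first[0].isascii() and first[0].isalnum():
            continue  # step 2: skip lines starting with a-z, A-Z or 0-9
        counts[first] += 1
    return counts.most_common(limit)  # step 3: most frequent first

sample = ["から よる thou", "1 2 4", "Kanji", "から", "", "いか"]
for word, count in top_words(sample, 10):
    print(count, word)
```

To run it on a real file, pass an open file object instead of the sample list, for example top_words(open("in_file.txt", encoding="utf-8"), 10).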


A program for handling vocab databanks? - Tobberoth - 2009-07-08

liosama Wrote:Which leads me to a question;

What would the best file format be for me if I was to do a similar thing as Amber, I was currently doing excel but if there is something less tedious then any ideas before I find myself dealing with a huge file like Amber
When it comes to data, it's ALWAYS best to use raw text files in UTF-8. Since raw text files are something the computer can read directly, programming for them is extremely simple compared to pretty much any other format. Therefore, almost all software out there supports them.

If I was going to program something for text files, I would prefer ; separated without whitespace since that's easier to parse. However, most software supports tab-separated stuff so... yeah.


A program for handling vocab databanks? - Transparent_Aluminium - 2009-07-08

The real question is: how did you learn all of the JLPT vocab? Did you learn it the traditional way using wordlists? If so, that's pretty impressive.


A program for handling vocab databanks? - ahibba - 2009-07-08

koyota, compress all your wordpad files and send them to my e-mail address. I will:

1. Remove all non-Japanese.
2. Re-sort the Japanese words according to their frequency (the most common words at the top.)
3. Send you back the wordfiles in one combined file.

If you send me the files today, I may send them back tomorrow. If I read your message tomorrow, I'll send them back within 3 days.


A program for handling vocab databanks? - ahibba - 2009-07-08

Transparent_Aluminium Wrote:The real question is: how did you learn all of the JLPT vocab? Did you learn it the traditional way using wordlists? If so, that's pretty impressive.
Memorizing JLPT vocabulary is not that difficult. With Iversen's method, you can memorize the whole JLPT1 vocab in 4-12 weeks, even without SRS.

You only need 2-3 hours/day, 2 sessions, one for memorizing new words, the other for reviewing the old words.


A program for handling vocab databanks? - Transparent_Aluminium - 2009-07-08

Is that the Iversen method? http://learnanylanguage.wikia.com/wiki/Word_lists

Do you know of anybody who would have tried it with Japanese?


A program for handling vocab databanks? - ahibba - 2009-07-08

Yes, many people have tried it with Japanese. You can use it with any language.

But I recommend you start by trying 100 vocab/day using romaji for fast results. Choose 100 words and convert them into romaji using kakasi or the j-talk converter.

Nothing wrong with romaji. It's faster to memorize romaji words than kana & kanji words. When you memorize the romaji words, you will recognize them in kana & kanji easily.


A program for handling vocab databanks? - Tobberoth - 2009-07-08

ahibba Wrote:When you memorize the romaji words, you will recognize them in kana & kanji easily.
No, you won't.

Romaji doesn't save you any time anyway; only complete beginners have problems reading kana. If you learn faster with romaji, it just means you need more exposure to kana. Therefore, it's a bad idea to use romaji whatsoever.


A program for handling vocab databanks? - ahibba - 2009-07-08

One of the famous users of Iversen's method with Japanese is Keith, a friend of Steve Kaufmann the great linguist:

http://natural-language-acquisition.blogspot.com/2008/06/iversen-method.php

See one of his papers:

http://kanji4.us/language-learning/uploaded_images/wordlist1_2008JUN05.jpg

He uses kanji because he is advanced. Don't try to imitate him.

For more information, read this:

http://forum.koohii.com/showthread.php?pid=55678#pid55678


A program for handling vocab databanks? - ahibba - 2009-07-08

Tobberoth Wrote:No, you won't.

Romaji doesn't save you any time anyway; only complete beginners have problems reading kana. If you learn faster with romaji, it just means you need more exposure to kana. Therefore, it's a bad idea to use romaji whatsoever.
Excuse me, Tobberoth. This time I totally disagree with you.

Using romaji at the beginning (or even in the intermediate stages) is not a problem at all.

If you have memorized a large amount of vocabulary using romaji, you will easily read the kanji once you have learned the readings. (My proof is that Japanese learn the readings of kanji easily because they have enough vocabulary; they learned it without reading kana. So you can do the same: learn as much vocabulary as you can using something you know very well, whether it's romaji, or audio if you are an auditory learner.)

Kana is similar to romaji, so there is no big difference if you use either of them. But reading romaji is faster.


A program for handling vocab databanks? - Transparent_Aluminium - 2009-07-08

Quote:He uses kanji because he is advanced. Don't try to imitate him.
What language is kanji? I'm confused.

It looks like he just tried this method a few times though. And if you look at his first interview with his coworker Charith, you can see that he's not that great in Japanese. The other guy is pretty good though.


A program for handling vocab databanks? - avparker - 2009-07-08

ahibba Wrote:my proof is that Japanese learn the readings of kanji easily because they have enough vocabulary, they learned it without reading kana
Have you ever looked at books for Japanese children? Looks an awful lot like kana to me. In fact, they are entirely written in kana.

ahibba Wrote:Kana is similar to romaji, so there is no big difference if you use either of them. But reading romaji is faster.
I find reading kana faster than reading romaji (romaji feels very unnatural). The sooner you learn to read kana, the sooner you will get fast.
Heck, coming from the same person that claims memorizing the entire JLPT1 vocab (10000 words) in 4-12 weeks is easy, how hard is it to learn 50-odd hiragana?

Anyway, interesting idea to learn vocab without Kanji.
My immediate reaction was this is a bad idea. That's how I learnt because I didn't study Kanji at all for a long time. The way I learnt sucked about as bad as is possible, and if I could go back and do it again I would do it very very differently.

If you've already done RTK, is using the Kanji so bad?

ahibba Wrote:Memorizing JLPT vocabulary is not that difficult. With Iversen's method, you can memorize the whole JLPT1 vocab in 4-12 weeks, even without SRS.

(...)

One of the famous users of Iversen's method with Japanese is Keith, a friend of Steve Kaufmann the great linguist:

http://natural-language-acquisition.blo … method.php
From his web page

Keith Wrote:So to fill up a page with 44 words took me 2 hours plus rest time in between sets.

(...)

Each night was a lot of work. I'm not sure I could keep this up. I was exhausted when I finished. I would have liked to have done more. But it was tiring so I could not. I suppose I will try with the 3-column version instead of four columns.
He's exhausted after doing 44 words.
10000 (JLPT1 vocab) words in 4 weeks is 357 words a day.
Even 12 weeks is 119 words a day.
Good luck with that.


A program for handling vocab databanks? - yukamina - 2009-07-08

ahibba Wrote:He uses kanji because he is advanced. Don't try to imitate him.
Why not use kanji or kana? At my level it would be ridiculous not to.
And most people on this site learn the kanji ahead of time so that they don't have to avoid them in cases like this.


A program for handling vocab databanks? - dat5h - 2009-07-09

avparker Wrote:Anyway, interesting idea to learn vocab without Kanji.
My immediate reaction was this is a bad idea. That's how I learnt because I didn't study Kanji at all for a long time. The way I learnt sucked about as bad as is possible, and if I could go back and do it again I would do it very very differently.
In my experience, understanding individual kanji conceptually drastically improves your retention of the meaning and pronunciation of vocab. Without kanji I spent several years tirelessly trying to learn vocabulary (which I am naturally terrible at). The problem I see is that adult brains retain vocabulary as distinct, unconnected forms, which makes the words unintuitive. Kanji seem like a good tool for bringing idea and sound together so that they work with each other. I have seen, as I am quickly improving my own vocab, that my knowledge of kanji is significantly improving my ability to guess the sound and meaning of a new vocabulary word (starting from either kanji or kana). Before understanding the kanji, I would just see sounds and then try to cram them into my brain with little success.

Also, in order to throw another voice into the mix on romaji, don't do it. Using romaji is only easier because you are resistant to kana. Embrace kana and it will work to improve you. In fact, I see people who work mostly with romaji as struggling with pronunciation while those who push to use kana quickly get good pronunciation. Ahibba, I want you to know that if I see romaji, it looks like meaningless scribble, but kana makes sense. When you come to Japan, do you want it to be the other way around? Do you want to think to yourself, "gee I wish all these signs were written in romaji instead of kana"? Think about it


A program for handling vocab databanks? - vosmiura - 2009-07-09

shang Wrote:
liosama Wrote:What would the best file format be for me if I was to do a similar thing as Amber, I was currently doing excel but if there is something less tedious then any ideas before I find myself dealing with a huge file like Amber
I don't know about tediousness, but in terms of what would be easy to export to some external program like ANKI, I recommend a plain TXT file (with UTF-8 encoding) using a single line for every entry and separating all fields with the same separator character (like TAB or semi-colon).
I usually add things to a spreadsheet (Open Office). It's easy to convert spreadsheet entries to a tab separated text file.


A program for handling vocab databanks? - ahibba - 2009-07-09

Transparent_Aluminium Wrote:It looks like he just tried this method a few times though. And if you look at his first interview with his coworker Charith, you can see that he's not that great in Japanese. The other guy is pretty good though.
Yes, Keith is a bad example. I mentioned him because he is one of the worst learners of Japanese as he says himself!


avparker Wrote:how hard is it to learn 50-odd hiragana?
Hiragana can be learned in one day (like I did), and katakana in two days, but this does not mean that you will read them faster than romaji. Reading Latin letters is faster because you're very familiar with them.


avparker Wrote:He's exhausted after doing 44 words.
10000 (JLPT1 vocab) words in 4 weeks is 357 words a day.
Even 12 weeks is 119 words a day.
Good luck with that
It seems that you don't know who Keith is. Read more about him, and you will understand.


yukamina Wrote:Why not use kanji or kana? At my level it would be ridiculous not to.
And most people on this site learn the kanji ahead of time so that they don't have to avoid them in cases like this.
Hey people, it seems that you forgot that I was talking to Transparent_Aluminium not to you! Look at this and you will understand:

http://kanji.koohii.com/showprofile.php?user=Transparent_Aluminium


dat5h Wrote:Ahibba, I want you to know that if I see romaji, it looks like meaningless scribble, but kana makes sense. When you come to Japan, do you want it to be the other way around? Do you want to think to yourself, "gee I wish all these signs were written in romaji instead of kana"? Think about it
I don't know where you are from. But for me, if I learn a word in any system I know (romaji, kana, kanji, or even spoken, not written), I can recognize it easily when I see it in another system that I know.