
Inquiry on feasibility/usefulness of my project

#1
November 1st, 2014 update: the project is now uploaded on GitHub: https://github.com/mifunetoshiro/kanjium

Long post ahead...

Over the course of the past 4 years I've been working on and off on a custom "kanji database," because I wasn't happy with the various clones that simply employ Jim Breen's databases. All in all it took me hundreds of hours of my time and my girlfriend's patience, but today I can finally say I've done it. Everything that I envisioned and wanted to add is now done, including extra information that serves no purpose to me personally, but might be useful/interesting to somebody else.

It started back when Jisho.org still had no stroke order images, so I used a Greasemonkey script with the KanjiStrokeOrders font to see the stroke order. I was also disappointed that the kanji had so many readings but no indication of which ones were actually useful. I downloaded KANJIDIC, stripped out all the useless things and added the official readings in a separate field. So my database began, 1 MB in size.
Later I would notice other oddities and annoyances, such as a kanji being listed as having "12,13" strokes. So which one is it? Some would say it's irrelevant, but unfortunately in my Japanese writing class in college, exams would include tasks where you had to write the correct stroke order, so I was determined to confront this issue.
I checked various Japanese sources to find which version is "correct," and manually fixed all such cases in the database. Then I noticed the stroke order font says otherwise! I would email Tim Eyre with these mistakes so he could correct them in a future version, and he did, but eventually I came across so many mistakes (wrong stroke order, wrong element shape or wrong elements used) that I decided to just fix them myself without waiting. So I generated 6355 images out of the font and fixed the mistakes with Photoshop. I came across new ones all the time and eventually lost track of all the corrections, so I didn't forward them to Tim Eyre any more. I went all in: if I found a kanji that had a mistake in a certain stroke or element, I would go through all kanji that have that element to make sure my corrections were thorough. I would compare them to kakijun.jp to make sure they were in fact correct. So now I would say these stroke order images are as accurate as kakijun.jp.
Still later, I would notice discrepancies between the radical data in KANJIDIC and my textbooks, so I went and compared them to a digital version of Kanjigen, as that would surely be a more accurate source. In it I noticed the radicals also include the radical variant, which I thought was useful, so I added that to my custom database. However, many were missing, or no distinction was made between radical variants (e.g. ⺌, ⺍), so I looked for all the possible radical variants in Unicode, went through all 6355 kanji and gave each kanji the correct radical variant.
Then I learned about phonetics, and was disappointed KANJIDIC does not provide this information. So I discovered RTK. Thanks to RTK, I could include the phonetic characters in my database, but since so many were outside the scope of KANJIDIC or JIS X 0213, I had to hunt through the tens of thousands of CJK characters in Unicode to finally find them. Still later, I would find a few phonetics that are not in RTK at all, so now, in my complete database, there are 438 phonetic characters in total.
Well, I thought, I might as well add the RTK indices to my database now, as someone would probably find that useful.
Then I thought I might as well add other cool features, to make this database really stand out from the other clones. I added antonyms, synonyms, homonyms, look-alike kanji, etc. See the list below for all the changes/additions.

Little by little, my database grew from 1 MB to 34 MB. Since a kanji-only database isn't very helpful in terms of vocabulary, I incorporated EDICT into it. However, since this was first and foremost a kanji dictionary whose main focus was correct kanji information and stroke order, and it was never meant to be a replacement for Jisho.org (or any other alternative) in terms of vocabulary, I opted to only provide official readings and common/frequent jukugo for each kanji. So for each kanji, you are presented with words that are actually useful/common and/or "official." This means that instead of the over 200,000 entries in EDICT, this database only has some 12,000 of them. Again, the database was never meant to be a Jisho.org replacement; this is just something extra I added over time to make it more useful.
However, with so few entries compared to EDICT, I knew I had to add something to make it stand out from the other clones. So I added pitch accent information for the words, 64 different conjugations and pitch accent information for conjugatable words, and, what I consider the most useful, particle information for verbs, so that you know which verbs take which particle. I also included percentage information, so that if a verb can take more than one particle, you'll know which one is the most commonly used. Clicking on a particle will show you 5 sentences where that verb is used with that particle. Note, however, that mistakes are possible, as some of the sentences come from Tatoeba.org. See the list below for all the changes/additions.

Anyway, the database is complete, but I'm not a designer or web developer, so I can't make it public and freely and easily accessible to all by myself. I considered just uploading the database and making it available for everyone, but after spending hundreds of hours on this, I thought I should at least host a website for it first.
Since the database has so many features now, I thought it would be best to make it completely customizable/user-configurable. Don't want to see homonyms? Turn them off. Don't want to see any images other than the stroke order image? Turn them off. Don't want to see etymology links pointing to KanjiNetworks.com or ChineseEtymology.org? Turn them off. I also want to implement customizable personal kanji/vocabulary lists which would be exportable to Anki. For example, you could save the kanji you come across and all their information that you want (radical, radical variant, phonetic, strokes, etc.) into your personal list (and include notes) and later export and review them in Anki. You could do the same with words, jukugo, etc., and include the pitch accent and particle information.

So this is what this thread is about. If you think this seems useful, I'd like to start an Indiegogo campaign to raise the funds to hire a designer/web developer and hosting to make this a reality. I'm not in a situation to be able to spend ~1000 € of my own money for this, unfortunately. If I get positive replies, I'll ask the users on reddit for their opinions too, because getting those kinds of funds from the Koohii community alone is not possible. If the campaign does go online, I'd also appreciate sharing it among fellow Japanese learners and other communities/forums. It would be a "Fixed funding campaign," by the way, which means that unless the required amount is met, nobody will be charged a penny, but unfortunately the project/website won't go live either (yet, at least).

Here is an ugly representation of some of the features, which I hope the website won't look anything like, lol!
And here's how I envision the multi-element look-up UI.

KANJIDIC differences:
● Includes 445 more kanji (for a total of 6800), most of which are jinmeiyō kanji, various kokuji kanji and kanji used as phonetics
○ includes all 861 jinmeiyō kanji
○ includes 503 kokuji kanji
● Radicals corrected to match 漢字源 (Kanjigen)
● Includes appropriate radical variants (e.g. for 羊: ? or ⺶)
● Includes phonetic kanji (only for jōyō and jinmeiyō kanji)
○ 438 different phonetics (where 2 or more kanji share the same phonetic and reading)
○ Shows how many kanji use that particular phonetic. E.g. 者: 24, 古: 12, etc.
○ Clicking on a phonetic would show all kanji that use that phonetic
○ Grouped into: Mixed-reading phonetics (e.g. 工: コウ, ク), Single-reading phonetics (e.g. 古: コ), Absolute phonetics (e.g. 夆: ホウ). Absolute phonetics are phonetics where a kanji will have ONLY that single reading, no other onyomi or kunyomi (referring to official readings, that is). Quick breakdown: Mixed-reading phonetics: 66, Single-reading phonetics: 250, Absolute phonetics: 122
○ 3 phonetics aren't even encoded in Unicode yet (and won't be), but I've assigned them Private Use Area code points and made the appropriate stroke order images to represent them
● Includes kanji shape. This does NOT use Jack Halpern's SKIP data with 4 possible shapes. See image for 22 variations
● Includes kanji type (Pictograph, Ideograph, Compound ideograph, Phono-semantic compound)
● Includes official jōyō readings in a separate field
● Onyomi readings include information when that reading was introduced to Japan (go-on 呉, kan-on 漢, tō-on 唐, kan'yō-on 慣)
● Shows how many jōyō kanji have the same (official) onyomi reading (e.g. 別: ベツ (2), clicking on "(2)" would show 蔑 as the only other kanji with that reading)
● Includes thousands of extra kunyomi readings that are otherwise not in KANJIDIC, especially those obscure ones found on the KanKen tests
● Stroke information is corrected to match the JIS X 0213 standard
● "Grade" information is updated, and besides the standard grade 1-6 kanji, the changes include:
○ High school kanji are further grouped into: grade 7 (1st grade of junior high school), grade 8 (2nd grade of junior high school), grade 9 (high school). Grade 7 and 8 are approximate estimates based on one Japanese school's education manual/pamphlet, KanKen level and frequency, and are therefore not "official."
○ Jinmeiyō kanji are further grouped into: Jinmeiyō (former Jōyō kanji), Jinmeiyō (traditional variant of Jōyō), Jinmeiyō (traditional variant of Jinmeiyō)
○ Hyōgaiji kanji are further grouped into: Hyōgaiji, Kyūjitai-Hyōgaiji, Hyōgaiji (tolerated Jōyō variant), Hyōgaiji (former Jōyō candidate), Hyōgaiji (former Jinmeiyō candidate)
● Includes antonym (521) and synonym kanji (628) (limited to jōyō kanji)
● Includes look-alike kanji (limited to jōyō kanji), total of 487 kanji
● Includes homonym information for all kunyomi readings (and some onyomi exceptions), and whether the homonyms have the same or different pitch accent (limited to jōyō kanji), total of 690
● Includes variant kanji:
○ Japanese variants, 118 common ryakuji, kyūjitai, shinjitai, Traditional Chinese variants, Simplified Chinese variants, ultra-simplified (unofficial) Chinese variants (basically Chinese ryakuji)
○ Common (3,500 daily-use hanzi) Simplified Chinese variants are marked
○ Common (4,808 daily-use hanzi) Traditional Chinese variants are marked
○ Ryakuji and ultra-simplified Chinese variants are rendered with custom woff/svg fonts (few kB in size)
● Includes new JLPT information (since no official list exists any longer, there is some guesswork involved by taking the KanKen level and frequency into account)
● Includes KanKen information
● Frequency is based on several averages (Wikipedia, novels, newspapers, ...)
● Besides the standard KANJIDIC definitions, includes "Compact meanings" in a separate field (only for jōyō kanji), which are only the most common definitions
● Initially I stripped pinyin out, but somebody with a Chinese background said it's really helpful, so I included pinyin with tone marks (not numbered), including 145 cases where the pronunciation differs in Taiwan (readings that are in brackets)
○ Meh, decided I'd throw hangul in there as well...
● Includes indices for RTK 1 & 3 (old and new editions), 2001 & 2301 Kanji Odyssey, White Rabbit Press' Japanese Kanji Flashcards and the custom indices for my printable flashcards
● Codepoints include: decimal, hexadecimal, UTF-8, UTF-16, JIS level, minkuten. All other indices and codepoints have been stripped out
● Also includes braille (rokutenkanji, kantenji) information in Unicode, because why the hell not. E.g. 亜: ⠠⠁⠃, ⠃⠊
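
To illustrate the codepoint fields mentioned in the list above (decimal, hexadecimal, UTF-8, UTF-16), here is a minimal Python sketch that derives them from a single character. The function name `codepoint_info` and the dictionary keys are placeholders made up for this example and are not the database's actual field names. JIS level and kuten are left out because they need a separate mapping table.

```python
def codepoint_info(kanji: str) -> dict:
    """Return the Unicode-derived code representations for one character."""
    if len(kanji) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(kanji)
    return {
        "decimal": cp,                                      # code point as a decimal number
        "hexadecimal": f"{cp:04X}",                         # code point in hex, as in U+XXXX
        "utf8": kanji.encode("utf-8").hex().upper(),        # UTF-8 bytes as hex
        "utf16": kanji.encode("utf-16-be").hex().upper(),   # UTF-16 (big-endian) code units as hex
    }


info = codepoint_info("亜")
print(info)
```

For 亜 this gives decimal 20124, hexadecimal 4E9C, UTF-8 E4BA9C and UTF-16 4E9C.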

KRADFILE (search by kanji parts) differences:
● Completely revamped. Thousands of additions and corrections
● 163 more elements to choose from (for a total of 415 instead of 252)
● Includes shape information filter, with 22 possible shapes to make searching more accurate, faster and easier:
[Image: idc.png]
● Groups kanji into 3 levels (jōyō, jinmeiyō and hyōgaiji), so you can easily filter which results you want to see
○ They are also coloured with a darker/lighter font in the results window to make them easier to distinguish
● The UI layout on the website would take radical types into account, so that kanmuri radicals are at the top, rare kanmuri below them, on the left are hen radicals, on the right are tsukuri radicals, etc.
● 41 elements on the UI layout are rendered with a custom ~10 kB woff/svg font, so there is no need to replace them with images to make them show on computers/devices that don't have huge CJK fonts installed (though, they won't get rendered correctly on devices running Symbian or Windows Phone 7, because they don't support woff/svg, sorry)
● Takes element position into account. E.g. selecting 木 from the kanmuri elements would show 査, 李, etc., selecting 木 from the hen radicals would result in 柿, 横, etc.
● Differentiates between elements that the default KRADFILE treats as the same. E.g. selecting 辶 would only show kanji that have the "road" radical with 1 dot, selecting 辶 would only show kanji that have 2 dots, the same with ⺌, ⺍ and all other such cases
● Limit results to kanji with more than/less than X strokes, +-
● Option to display compact kanji meanings in results window
● Includes a "part of" field that consists of kanji that contain the particular kanji, e.g. for 阿: 婀,痾
● You can also look up kanji by searching for any element/kanji it consists of, regardless of whether it is one of the 415 elements or not. E.g. 右 and 若 are not among the 415 elements, but inputting either of them into a special kanji look-up search box will result in 匿, 能 will result in 熊, etc.
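
The position-aware element search described in the list above could be sketched roughly like this. This is only a toy illustration in Python: the `DECOMPOSITION` table, the position label "other" and the function name `find_kanji` are all assumptions made for the example, not the real data format. Only the 木 examples (査, 李, 柿, 横) come from the list.

```python
# Hypothetical toy data: kanji -> list of (element, position) pairs.
# "kanmuri" = top element, "hen" = left element; "other" is a placeholder
# used here for the remaining elements.
DECOMPOSITION = {
    "査": [("木", "kanmuri"), ("且", "other")],
    "李": [("木", "kanmuri"), ("子", "other")],
    "柿": [("木", "hen"), ("市", "other")],
    "横": [("木", "hen"), ("黄", "other")],
}


def find_kanji(element, position=None):
    """Return kanji containing `element`, optionally only in `position`."""
    results = []
    for kanji, parts in DECOMPOSITION.items():
        for part, pos in parts:
            if part == element and (position is None or pos == position):
                results.append(kanji)
                break  # one match per kanji is enough
    return results


print(find_kanji("木", "kanmuri"))  # 木 as a top element
print(find_kanji("木", "hen"))      # 木 as a left element
print(find_kanji("木"))             # 木 anywhere
```

Leaving `position` empty falls back to the position-agnostic behaviour of a plain element search.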

JMdict/EDICT differences:
● Words for each kanji are grouped into: regular words (e.g. 一: イチ, 上: あげる), compound verbs (e.g. 受: 受け入れる), jukugo (e.g. 一: 一部), yojijukugo (e.g. 一: 一人一人)
○ The first 5 compound verbs, jukugo and yojijukugo for each kanji are the 5 most common ones. The rest are in random order
○ 801 distinct idiomatic expressions (yojijukugo)
○ 214 distinct compound verbs
○ 7771 distinct jukugo
○ 3449 distinct regular words
● Does NOT include obscure and "unofficial" words (e.g. there are no entries for 食む, 藩祖, etc.)
○ Only jōyō kanji (with a few exceptions) have words associated with them. E.g. 烹 therefore has zero words associated with it. Remember, this was not supposed to be a word dictionary or a Jisho.org replacement; this is first and foremost a kanji dictionary, and everything else is extra
● Compound verbs, jukugo and yojijukugo readings are segmented so that you can find compound words that include any possible (official) kanji reading. E.g. 火 か: 火事, 火 ひ: 火鉢, 火 び: 下火. All possible (official) readings (including rendaku) are listed for every kanji, and there is at least one word associated with it
● Includes pitch accent information with several possible (customizable) ways of displaying it: annotated (by inserting two Unicode arrows in the text), CSS (HTML styling), binary (e.g. L-H (Low-High)), accented mora position
● For regular words, includes information whether that particular word is taught in primary school, junior high school or high school
● Includes JLPT information for regular words, compound verbs and jukugo
○ Source is tanos.co.uk with corrections by me. Note that since the JLPT doesn't release official lists any more, some of the corrections might not be corrections at all, since it's all more or less guesswork
● Includes frequency information for regular words, compound verbs and jukugo (very common, common, uncommon, rare)
● Includes particle information for verbs, compound verbs and jukugo+する verbs, so that you know which particle usually goes in front of that verb. E.g. 飼う: を, 一致: と(56%),が(44%)
○ Clicking on a particular particle will show you 5 sentences where that verb is being used with that particle
○ The sentences come from smart.fm/inknow.co.jp (the Core 6000 deck) and the Tatoeba.org project. The sentences are sorted so that the Core 6000 sentences take priority over the Tatoeba ones, meaning they will show first as much as possible. Both are licensed under CC. Contains a total of 12,975 sentences
○ The sentences include the Japanese sentence, the Japanese sentence with furigana, the English translation, and a list of all the unique kanji in the sentence
● All regular verbs and い adjectives have conjugations available (plain polite, negative plain polite, past polite, passive, negative passive, past passive, negative past passive...). If it's a valid conjugation, it's there. A bit of an overkill, but who cares
○ Clicking on a conjugatable word will show you a list of 64 conjugated forms
○ Conjugated forms ALSO have pitch accent marked with CSS (not all, however)
○ All conjugations of irregular godan verbs and irregular する verbs have been taken care of and properly marked
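
As an illustration of how the "binary" pitch accent display in the list above relates to the accented mora position, here is a minimal Python sketch. It assumes the usual textbook rule for standard (Tokyo) Japanese: position 0 means low then high, position 1 means high then low, and any other position means low, then high up to the accented mora, then low. Whether the database derives its patterns this way is an assumption, and the function names are made up for the example.

```python
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_morae(kana):
    """Split a kana string into morae; small kana attach to the previous kana."""
    morae = []
    for ch in kana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def pitch_pattern(kana, accent):
    """Build an L/H pattern from the accented mora position (0 = no downstep)."""
    count = len(split_morae(kana))
    if not 0 <= accent <= count:
        raise ValueError("accent position out of range")
    pattern = []
    for i in range(1, count + 1):
        if accent == 0:
            high = i > 1            # low start, then high to the end
        elif accent == 1:
            high = i == 1           # high start, then low
        else:
            high = 1 < i <= accent  # low start, high up to the accented mora
        pattern.append("H" if high else "L")
    return "-".join(pattern)


print(pitch_pattern("はし", 1))
print(pitch_pattern("はし", 2))
print(pitch_pattern("きょうしつ", 0))
```

Note that positions 0 and 2 give the same pattern for a two-mora word; the difference only shows on a following particle, which this sketch does not model.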

Word/kanji search:
● You can search in any conjugation form, whether Hepburn, Kunrei or kana. E.g. isogashikunai, isogasikunai or いそがしくない will result in 忙しい and 忙. Also taken into account are Kunrei exceptions such as tu/tsu/du/zu/hu/fu, etc. Note that various irregular Hepburn-Kunrei hybrid searches will show no results; this is to "enforce" proper form
● You can also limit your search to just the official readings, so that e.g. しるし will result in 印, but not 験, 璽, 徽, etc.
● You can search by inputting a Simplified Chinese hanzi and it will redirect to the appropriate Japanese kanji variant/s (if it exists)
● Search by indices/codepoints possible as well
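
The romaji handling described in the list above (accept Hepburn or Kunrei, reject hybrids) could look roughly like the following Python sketch. The syllable table is a tiny made-up subset that only covers the example word and a few of the exceptions, and `romaji_to_kana` is a hypothetical name; a real table would have to cover every syllable.

```python
# Tiny illustrative table: romaji syllable -> (hiragana, system).
# "both" means the spelling is identical in Hepburn and Kunrei.
SYLLABLES = {
    "a": ("あ", "both"), "i": ("い", "both"), "u": ("う", "both"),
    "ka": ("か", "both"), "ku": ("く", "both"), "ga": ("が", "both"),
    "so": ("そ", "both"), "na": ("な", "both"), "n": ("ん", "both"),
    "shi": ("し", "hepburn"), "si": ("し", "kunrei"),
    "tsu": ("つ", "hepburn"), "tu": ("つ", "kunrei"),
    "fu": ("ふ", "hepburn"), "hu": ("ふ", "kunrei"),
}


def romaji_to_kana(text):
    """Greedy longest-match conversion; None for unknown or hybrid input."""
    kana, systems, i = [], set(), 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in SYLLABLES:
                hira, system = SYLLABLES[chunk]
                kana.append(hira)
                systems.add(system)
                i += len(chunk)
                break
        else:
            return None  # no syllable matched at this position
    if {"hepburn", "kunrei"} <= systems:
        return None      # mixed Hepburn and Kunrei spellings are rejected
    return "".join(kana)


print(romaji_to_kana("isogashikunai"))
print(romaji_to_kana("isogasikunai"))
print(romaji_to_kana("shitu"))
```

Both spellings of the example word normalize to the same kana, so a single kana-keyed index can serve all three input styles.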

Images:
● The Kanji Stroke Order images (KSO) were generated using the KanjiStrokeOrders font by Tim Eyre. Copyright is held by Ulrich Apel and the Wadoku project. The images were NOT generated using the KanjiVG project
○ Some ~400 were made by me with Photoshop and are not part of the font
○ Includes hundreds of corrections and modifications to match the JIS X 0213 standard. kakijun.jp was very helpful with that, and I noticed a couple of mistakes even there!
● Gyōsho, tensho and sōsho images were generated with freeware fonts whose names I've forgotten...
● The "origin" images (oracle, bronze, large seal, seal) are composite images generated from Richard Sears' website (chineseetymology.org). Please ask for permission to use first.

Double-sided printable flashcards (example: http://www.mediafire.com/download/ywz6kg...hcards.pdf):
● The front side of the card contains the stroke order image, 5 jukugo and the card index
○ The jukugo are (as much as possible) sorted so that for every new card, you should already know the other kanji in the jukugo. E.g. once you come to card 66 (生), you should have already seen the other kanji its jukugo consists of (一, 年, 先, 大, 学)
○ In cases where no such jukugo are possible, the jukugo are sorted so that the jukugo consist of new kanji closest to the current card index. E.g. card 1 (一) has its jukugo comprised of the following: 人, 日, 本, 大, 手
○ The jukugo are common compounds only
● The back side of the card contains: radical, phonetic, stroke information, RTK (6th ed.) and White Rabbit Press index, official onyomi and how many other cards have that onyomi in superscript, official kunyomi and their accented mora in subscript, compact kanji meanings, jukugo readings, their accented mora and meanings, homonyms and their card index, lookalike kanji and their card index, word meanings for each kunyomi
● Includes all 2136 jōyō kanji + 57 extra cards
● The cards are sorted first by grade and then by KanKen level and frequency
● The flashcard data is a little outdated in certain aspects, because at some point during the 4 years I've been working on the database, I stopped updating it. E.g. some cards are missing the phonetic data, homonyms or lookalikes, use less common jukugo, etc. I don't intend to ever update them because it's too much work
○ I stopped updating the data because of the prevalence of handheld devices (tablets, smartphones) that can do the same with Anki and don't require you to carry physical paper cards in your pockets. The concept of printable flashcards just seemed outdated, but maybe there are still some people who prefer them...
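
The jukugo ordering rule for the front side of the cards (prefer jukugo whose other kanji have already appeared, otherwise those whose new kanji are closest to the current card) can be approximated with a short Python sketch. The card order below is a made-up mini deck, not the real card indices, and `rank_jukugo` is a hypothetical name; the actual cards may well have been sorted differently.

```python
def rank_jukugo(card_kanji, candidates, card_order):
    """Sort jukugo so that those needing the least look-ahead come first."""
    index = {kanji: i for i, kanji in enumerate(card_order)}
    here = index[card_kanji]

    def lookahead(word):
        # 0 if every other kanji was already seen, otherwise the distance
        # to the furthest not-yet-seen kanji; kanji that are not on any
        # card are pushed to the very end.
        others = [k for k in word if k != card_kanji]
        return max(
            (max(0, index.get(k, len(card_order)) - here) for k in others),
            default=0,
        )

    # sorted() is stable, so ties keep their input (e.g. frequency) order
    return sorted(candidates, key=lookahead)


deck = ["一", "人", "日", "本", "大", "年", "生", "学", "先"]  # hypothetical order
print(rank_jukugo("生", ["学生", "先生", "一生", "人生"], deck))
```

With this toy deck, 一生 and 人生 come first because 一 and 人 precede 生, followed by 学生 and then 先生.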

Edit:
First reddit thread for feedback
Fundraising reddit thread
Indiegogo page
Edited: 2014-11-01, 12:32 pm