Inquiry on feasibility/usefulness of my project

#26
toshiromiballza Wrote:
headphone_child Wrote:I'm a web developer and I'd be able to work on this if I'm available for work at the time you're ready to hire someone, but I'm on east coast USA so I'm guessing we wouldn't be able to meet in person. Let me know if you're interested anyway though -- I think it'd be a plus to have this developed by someone with some knowledge of the Japanese language.
That would indeed be a plus, as well as save me the time of having to instruct the developer on what this or that is, where this or that should be, etc. I can already imagine the endless phone conversations and additional meetings I'd have to have with the otherwise clueless developer... How much would you say you'd charge for such a project (the flat rate)? I imagine the prices are way higher in the US than where I am (Slovenia).
toshiromiballza Wrote:
ikore Wrote:but this is easily a few months of full-time work (which would make it more into the range of 3000-6000 euro).
Wow, seriously? I didn't think it costs that much, or that it takes that long... I was imagining this being done in two weeks tops... Maybe I should learn to code and make a nice living off of that, lol.

Will a website in Python cost more than one done in PHP, by the way? I read some good articles about how it's becoming more and more used for web development, whereas PHP is "getting outdated."
I haven't scoped everything out, but I agree with ikore's ballpark. 1000 might be enough to get a website without any user-customization, but it also depends on how much of the data needs to be reorganized.

As for learning coding, go for it! Just note - there's more to software development than coding. :)

As for PHP, it gets a bad rap, but it's been picking up steam again with the excellent Laravel framework, which some people say is the first "good" PHP framework. Of course there's been a movement from pure backend development to more frontend lately, so you could serve the backend as a REST API with Laravel or anything else, and consume it with a JavaScript MV* framework such as AngularJS. That's probably how I'd do it. That way, when you're ready to open the data to the public, you have a public, consumable REST API with little additional work needed.

toshiromiballza Wrote:
headphone_child Wrote:And it depends on how the data is currently stored. Which DBMS? How many tables? Is it 3NF? If not, the tables could need redesigning too. And additional tables are needed for storing user settings as described above.
Initially I went with MySQL (because it was the only SQL database I was somewhat familiar with). I soon realized it's not adequate and doesn't even support many of the CJK characters my database consists of... So I googled around and came across SQLite; it was perfect and lightweight, and I think it really is the best option for the job (not sure about all the user-related stuff, though). 23 tables, but I'm sure I could merge some together (e.g. there are two tables for 'shinjitai' and 'kyuujitai', perhaps storing both that information in the 'variants' table would be a better idea, not sure if it really matters speed-wise, though?). No idea what 1NF, 2NF, 3NF is, even after I've read something about it, lol. Also, some of the entries in the tables would require additional Python/PHP magic before being output to the user. For example, here is an example of the 'particles' field: "が(59%),に(30%),を(11%)" Before this is output, Python/PHP should "explode" the entry to separate the actual particles from the brackets, percentages and commas, so that clicking on a particle would load the appropriate sentences by querying the 'sentences' table for that 'particle+word' combination.
Yeah, no need to worry about the rigorous definition of 3NF. What's important is database normalization -- I think that article is a bit easier to understand (again, that's just if you're interested; this is something a good developer should handle for you). The main motivations for this are data integrity (eliminating the possibility of data anomalies) and reducing data redundancy. 3NF actually tends to result in more tables rather than fewer. 23 sounds like a good number, but it'll probably be higher after normalization.

The particles field contains too much data for one column, which limits the utility of that data (this particular issue could be fixed by using multiple tables). This is especially a problem if you eventually want to make this data available to the public. If the database is designed to work only for your website, then it will be difficult for others to make use of the data, plus you'll have problems with flexibility if you ever want to make changes to your website. So the database needs to be designed independently of the website, because it's a separate application tier. Again, you don't have to worry about any of this though; there's nothing wrong with the data you've collected. I'm only bringing this up to show that there's more work to be done than what one might expect. Sorry about all the jargon.
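To make the "multiple tables" idea concrete, here is a minimal sketch in Python with SQLite. The table and column names (word, word_particle, percent) and the sample word 分かる are assumptions made up for illustration; they are not taken from the actual database.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (word, particle) pair instead of one
# packed string such as "が(59%),に(30%),を(11%)" in a single column.
conn.executescript("""
CREATE TABLE word (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL UNIQUE
);
CREATE TABLE word_particle (
    word_id  INTEGER NOT NULL REFERENCES word(id),
    particle TEXT    NOT NULL,
    percent  INTEGER NOT NULL CHECK (percent BETWEEN 0 AND 100),
    PRIMARY KEY (word_id, particle)
);
""")

# Sample data; the word is invented, the percentages come from the example above.
conn.execute("INSERT INTO word (id, text) VALUES (1, '分かる')")
conn.executemany(
    "INSERT INTO word_particle (word_id, particle, percent) VALUES (?, ?, ?)",
    [(1, "が", 59), (1, "に", 30), (1, "を", 11)],
)

# Each particle is now directly queryable, with no string splitting needed.
rows = conn.execute(
    "SELECT particle, percent FROM word_particle "
    "WHERE word_id = ? ORDER BY percent DESC",
    (1,),
).fetchall()
print(rows)  # [('が', 59), ('に', 30), ('を', 11)]
```

With this layout, anyone consuming the data can filter or join on the particle directly, which is the flexibility argument made above.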

toshiromiballza Wrote:
headphone_child Wrote:What could be nice is an app completely separate from the website where you download all the dictionary data onto your device initially when installing the app, so that you can use the app without an internet connection (a la JED for Android). I'd probably use an app like that.
I suppose, but this would have so many features, how do you display everything on the small screens of mobile devices, or even tablets? This would require a lot of scrolling or opening different tabs just to get to the part you wanted to see... I'm not entirely sure the app idea is feasible. And remember, this in no way replaces EDICT as a vocabulary dictionary, it's got 12,000+ entries compared to EDICT's 200,000+. So sure, it's a great kanji resource with cool extra bits for all the "official" words and common jukugo, etc., but a lot of people would be disappointed after searching, for example, "こんにちは" or some obscure jukugo, and there would be no results. I mean, I could append the rest of EDICT into the database, but those entries would have no extra features, they would just be as-is. I don't think it's worth it...
Have you seen what JED for Android looks like? It does have scrolling, but it's fine. Of course yours would have much more scrolling, though. Anyway, your concerns are understandable, and I agree this should be low priority anyway.

toshiromiballza Wrote:
headphone_child Wrote:But mainly, the really risky one is "Be a voice in the process" -- you have to be very careful of causing feature creep with this, and it could easily increase the cost of development.
Hm, well, I think I covered all the possible features myself already, and there really isn't anything to add! Well, somebody at Reddit did mention it would be nice to have pinyin (and Korean) included, so I guess I'll throw that in too... But I was referring more to the design itself. I mean, it would be better for people to see and comment on the design first, so appropriate changes can be made based on user input, instead of finishing the website and then people complaining this should be changed, this is ugly... Although, if somebody has some great ideas and I can include it, why not. In any case, I think access to the beta (or just preview screenshots) seems like a valid "award."
OK, if they just suggest minor UI changes (different colors, fonts, arrangements), things like that are generally fine (but again, it depends on who you hire; some people will charge for any change, no matter how minor). It's actual functionality that would add to the initial costs.
Edited: 2014-04-28, 5:40 pm
#27
headphone_child Wrote:The particles field contains too much data for one column, which limits the utility of that data (this particular issue could be fixed by using multiple tables).
You know, when I wrote the script to parse the particle data, I was actually thinking how to import this into the database. Do I use a field for each particle, or do I put everything into one field and let PHP "sort it out"? I did some googling, and people were actually saying PHP "exploding" the value was quicker than querying each particle from the table, so I stayed with this model. For sure, this isn't the appropriate or recommended way to organize the data, but I think that it really does not matter in my case that much, since the fields aren't really that long or complicated. I used the same logic when importing definitions from EDICT, where some "PHP magic" is needed to display the PoS information and so on.
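For reference, the "explode" step described here could look roughly like the following in Python. This is only a sketch that assumes every entry follows the exact "particle(NN%)" pattern shown earlier in the thread; the function name parse_particles is hypothetical.

```python
import re

# One entry looks like "が(59%)": a particle, then a percentage in brackets.
PARTICLE_RE = re.compile(r"^(?P<particle>[^(),]+)\((?P<percent>\d{1,3})%\)$")


def parse_particles(field):
    """Split a packed field such as 'が(59%),に(30%)' into (particle, percent) pairs."""
    result = []
    for chunk in field.split(","):
        match = PARTICLE_RE.match(chunk.strip())
        if match is None:
            # The database cannot enforce the format, so the code has to.
            raise ValueError(f"malformed particle entry: {chunk!r}")
        result.append((match["particle"], int(match["percent"])))
    return result


parsed = parse_particles("が(59%),に(30%),を(11%)")
print(parsed)  # [('が', 59), ('に', 30), ('を', 11)]
```

The parsing itself is short, but note that the format check lives in application code: every program that reads the column has to repeat it.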

headphone_child Wrote:Sorry about all the jargon.
Hah, that's okay.

I'll go through the database tomorrow and see what I can do to "minimize redundancy." Hopefully I won't maximize it, lol.
Edited: 2014-04-28, 6:47 pm
#28
toshiromiballza Wrote:
headphone_child Wrote:The particles field contains too much data for one column, which limits the utility of that data (this particular issue could be fixed by using multiple tables).
You know, when I wrote the script to parse the particle data, I was actually thinking how to import this into the database. Do I use a field for each particle, or do I put everything into one field and let PHP "sort it out"? I did some googling, and people were actually saying PHP "exploding" the value was quicker than querying each particle from the table, so I stayed with this model. For sure, this isn't the appropriate or recommended way to organize the data, but I think that it really does not matter in my case that much, since the fields aren't really that long or complicated. I used the same logic when importing definitions from EDICT, where some "PHP magic" is needed to display the PoS information and so on.
To clarify, normalization chooses to sacrifice some speed for data integrity and ease of maintenance. See http://www.25hoursaday.com/weblog/2007/0...abase.aspx or any article discussing normalization vs. denormalization. The point is, unless you have Facebook levels of traffic, the benefits tend to outweigh the performance hit. For example, in your case, what if one of the values is in the wrong format: it has a ; instead of a , or the % data is missing? Of course that's not likely since you generated these from a script, but what if you update the data later and make a typo? That's not good from a maintenance point of view -- the database doesn't guarantee that the format is maintained. And what if you later decide to track more information, like the frequency alongside the relative frequency? You'd have to change the format, which would break the application. If you have separate columns, you'd just add a column. Even if you think you'll never change the data or add data, it's good to be flexible.
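Both maintenance points can be sketched with SQLite from Python. The table layout, the sample word 分かる, and the occurrences column are hypothetical names chosen for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with the particle and its percentage in separate columns.
conn.execute("""
CREATE TABLE word_particle (
    word     TEXT    NOT NULL,
    particle TEXT    NOT NULL,
    percent  INTEGER NOT NULL CHECK (percent BETWEEN 0 AND 100),
    PRIMARY KEY (word, particle)
)
""")
conn.execute("INSERT INTO word_particle VALUES ('分かる', 'が', 59)")


def is_rejected(values):
    """Return True if the database refuses to store the given row."""
    try:
        conn.execute("INSERT INTO word_particle VALUES (?, ?, ?)", values)
    except sqlite3.IntegrityError:
        return True
    return False


# A missing percentage and an out-of-range typo are caught by the database.
missing_rejected = is_rejected(("分かる", "に", None))
typo_rejected = is_rejected(("分かる", "を", 150))

# Tracking more information later is just one extra column...
conn.execute("ALTER TABLE word_particle ADD COLUMN occurrences INTEGER")

# ...and queries written before the change keep working unchanged.
old_query = conn.execute(
    "SELECT particle, percent FROM word_particle WHERE word = ?", ("分かる",)
).fetchall()
print(missing_rejected, typo_rejected, old_query)
```

With a packed string, neither guarantee exists: a bad value is stored silently, and a new piece of information means changing the string format and every parser that reads it.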

For a normalized format, just search for "many-to-many mapping database" to see some examples. I think at the very least, the particle character and the frequency could be separated into their own columns. But it depends on a lot of factors and it's hard to say without seeing the design of all tables. Sorry I don't have time to explain it myself, I just boarded a 15-hour flight that's about to depart. Good luck, and make sure to back up your data frequently.
Edited: 2014-04-28, 8:58 pm
#29
Just a quick update. After all the positive feedback both on Reddit and here, I've finally launched the fundraising campaign here: http://igg.me/at/kanjiproject
#30
Great! 41 days, let's do it!
#31
Letting you guys know this is now uploaded on GitHub: https://github.com/mifunetoshiro/kanjium
#32
toshiromiballza Wrote:I came across new ones all the time and eventually lost track of all the corrections, so I didn't forward them to Tim Eyre any more. I went "full in" and if I found a kanji that has a mistake in a certain stroke or element, I would go through all kanji that have that element to make sure my corrections are thorough. I would compare them to kakijun.jp to make sure they are in fact correct. So now, these stroke order images are as accurate as kakijun.jp I would say.
I saw your posts on the KanjiVG mailing list in May containing these updates---thank you for giving back to the community!

What I'd like is a list of kanji where your GitHub repo and KanjiVG currently diverge. I see in your repo that you have raster images of the "canonical" stroke order per Kakijun (plus a couple of corrections thereupon!). Is there a *textual* record of all the changes you had to make? Or any other way you can think of to get a list of kanji for which the plain KanjiVG is "wrong"?

What triggered this question is frustration with 鞄, which Kakijun has written with 已 but which KanjiVG has with 巳. KanjiVG has a 已-version but calls it a "Hyougai" variant...
#33
aldebrn Wrote:Is there a *textual* record of all the changes you had to make?
Not really, but I did eventually email all the modified kanji (I hope I didn't miss some, but it's possible) to both Tim Eyre and KanjiVG. There were/are discrepancies between Tim's KSO font and KanjiVG as well, so often when I sent some issue kanji to e.g. KanjiVG, I didn't need to send them to Tim Eyre because they were already fixed in the font and vice versa. After that post in May I also fixed some kanji in my database thanks to ospalh's commits to KanjiVG. I'll also email Tim and ask if he's got any other pending fixes that I'm not aware of so I can fix them. I suppose I could make a textual record of all the modified kanji by traversing the emails and such and compiling them together, but I don't think that's really that important now that the changes have already been incorporated (KanjiVG) or are pending (KSO font, Tim is busy writing a book if I recall correctly).

I'll push an update for 42 kanji tomorrow to my database; 1 of which is already in KanjiVG, 1 is outside KanjiVG's scope, 1 is pending, for 33 I already opened an issue, 2 are wrong and 4 of which are just aesthetics, but I'll also open an issue for them tomorrow. So that'll be 40 kanji where my data and KanjiVG will be divergent (until KanjiVG updates again). Other than those, there's really only kanji with 舛 as a radical where my data differs, because not even various sources can agree on their stroke count (paper Kanjigen vs digital Kanjigen, Unicode, Adobe Japan, Kakijun), so I went ahead and made them all consistent by following the same logic they should have been following in the first place.

aldebrn Wrote:What triggered this question is frustration with 鞄, which Kakijun has written with 已 but which KanjiVG has with 巳. KanjiVG has a 已-version but calls it a "Hyougai" variant...
I don't understand, both KanjiVG and Kakijun have 巳 as the default (Jinmeiyou version), and 已 as the Hyougaiji version.
Edited: 2014-12-17, 8:45 am
#34
toshiromiballza Wrote:I'll push an update for 40 kanji tomorrow to my database; 1 of which is already in KanjiVG, 1 is pending, for 34 I already opened an issue, and 4 of which are just aesthetics, but I'll also open an issue for them tomorrow. So that'll be 39 kanji where my data and KanjiVG are divergent. Other than those, there's really only kanji with 舛 as a radical where my data differs, because not even various sources can agree on their stroke count (paper Kanjigen vs digital Kanjigen, Unicode, Adobe Japan, Kakijun), so I went ahead and made them all consistent by following the same logic they should have been following in the first place.
Are these updates going to be to the PNG files on GitHub?

toshiromiballza Wrote:
aldebrn Wrote:What triggered this question is frustration with 鞄, which Kakijun has written with 已 but which KanjiVG has with 巳. KanjiVG has a 已-version but calls it a "Hyougai" variant...
I don't understand, both KanjiVG and Kakijun have 巳 as the default (Jinmeiyou version), and 已 as the Hyougaiji version.
A thousand apologies, I completely confused myself because Tangorin.com's entry for 鞄 uses the hyougai/已, and it's exactly as you say: KanjiVG itself and Kakijun both agree that 巳 is the plain form. This is a Tangorin issue, not KanjiVG. Thanks for your patience with me!
#35
aldebrn Wrote:Are these updates going to be to the PNG files on GitHub?
Yes. Edit: update done.

aldebrn Wrote:A thousand apologies, I completely confused myself because Tangorin.com's entry for 鞄 uses the hyougai/已, and it's exactly as you say: KanjiVG itself and Kakijun both agree that 巳 is the plain form. This is a Tangorin issue, not KanjiVG. Thanks for your patience with me!
Ah, both Jisho.org and Tangorin.com still haven't updated their KanjiVG data. 已 used to be the default before.
Edited: 2014-12-17, 8:28 am
#36
This is an incredible amount of work! Thank you!
#37
toshiromiballza Wrote:
aldebrn Wrote:Are these updates going to be to the PNG files on GitHub?
Yes. Edit: update done.
Please consider using text-based files like SVG for these: you can edit these in Inkscape very easily. This will greatly increase the number of uses for this subset of your database, since text is much easier for tools to work with. It'll also (potentially) make it much easier to track differences with KanjiVG, removing all guesswork from where they stand.
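As a rough illustration of why a text format helps with tracking differences, the sketch below compares the stroke paths of two SVG documents in Python. The two SVG strings are toy data invented for this example, not real KanjiVG files, and the sketch assumes each stroke is a path element listed in stroke order.

```python
import xml.etree.ElementTree as ET

SVG_NS = {"svg": "http://www.w3.org/2000/svg"}


def stroke_paths(svg_text):
    """Return the 'd' attribute of every <path>, in document (stroke) order."""
    root = ET.fromstring(svg_text)
    return [path.get("d") for path in root.iterfind(".//svg:path", SVG_NS)]


def diff_strokes(svg_a, svg_b):
    """List (stroke_number, path_in_a, path_in_b) for every stroke that differs."""
    a, b = stroke_paths(svg_a), stroke_paths(svg_b)
    diffs = []
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else None
        db = b[i] if i < len(b) else None
        if da != db:
            diffs.append((i + 1, da, db))
    return diffs


# Toy data: the same two strokes, drawn in opposite order.
REFERENCE = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path d="M10,10 L90,10"/><path d="M50,10 L50,90"/></svg>'
)
CORRECTED = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path d="M50,10 L50,90"/><path d="M10,10 L90,10"/></svg>'
)

differences = diff_strokes(REFERENCE, CORRECTED)
for number, old, new in differences:
    print(f"stroke {number}: {old!r} -> {new!r}")
```

Nothing comparable is possible with PNG files short of image comparison, which is the guesswork mentioned above.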

I realize that KanjiVG's SVGs embed a lot of custom data, and that editing them in Inkscape might destroy a lot of that information. I believe there's some discussion on the KanjiVG mailing list about a web-based tool for editing their SVGs that will make it easy to edit that metadata and to re-draw or modify strokes: if you need that before you can consider switching from PNG to SVG, let me know and I can look into that more.

Your stroke order database is really valuable because it's probably the most accurate and accessible one on the web, and moving to a more easily-consumable format will really help it take off.
#38
I'm really not familiar with Inkscape and editing/drawing therein, so it'd be too much work for me. If I knew how to do it, I'd push the changes to KanjiVG myself instead of simply telling KanjiVG's maintainers about them. Tim Eyre also asked me if I wanted to take over maintaining his font, but likewise, I don't know how to use FontForge, so I can't help. Besides, my fixes are now in KanjiVG, so except for me having more kanji that aren't in KanjiVG, the data should be the same (except for this latest commit today and 4 other kanji mentioned above). Edit: I just figured out that isn't necessarily true, because I used the KSO font as my source, and I've come across instances where KanjiVG was wrong, but the KSO font was correct, so I didn't bother with it and thus the mistakes might still be in KanjiVG.

I know SVG is superior to binary PNGs, but when I started this I wasn't thinking that far ahead. KanjiVG has come a long way, and it's far more accurate today than it was last year or even half a year ago, so I'd definitely recommend it.
Edited: 2014-12-18, 4:33 am