kanji koohii FORUM
Zero no Kiseki - Programming Help - Printable Version

+- kanji koohii FORUM (http://forum.koohii.com)
+-- Forum: Learning Japanese (http://forum.koohii.com/forum-4.html)
+--- Forum: Off topic (http://forum.koohii.com/forum-13.html)
+--- Thread: Zero no Kiseki - Programming Help (/thread-12782.html)

Pages: 1 2


Zero no Kiseki - Programming Help - jcdietz03 - 2015-06-11

OK, I need help with two things. First, I should tell you where my team and I are with this:

Progress Video



We can basically do whatever, except for certain stuff like 3D models, movies, animated text... All of the basic text is figured out.

Need help with:
1. Machine translating. How do I machine translate this? I have 121,000 lines (approx.) to do, so I'd like some way to automate it.

2. The game supports furigana, but it isn't often used by the original developer. I want to prepare a version with full furigana. I prepared two examples for you: One from the original game and a custom one.
http://imgur.com/a/AnWt4

In the game script (original example) it looks like:
『魔導杖#6Rオーバルスタッフ#』
#6R means it should be displayed atop the previous 3 characters. There are codes for anywhere between 1 and 9 characters (more than that is not possible).
The closing # means it should go back to normal text display after this.

I need a way to run mecab. Mecab breaks a sentence apart into fragments and computes furigana for each one. There are lots of lines, so I need a way to automate it. If I had some output from mecab, like what you find in Anki, I could easily write a program to convert it into the script code you see above.

One thing I can think of is to make an Anki deck with all the lines from the game, import it into Anki, bulk-add readings, then export & convert. Will that work?
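(Editor's aside: if the Anki route works, the bracket output can be converted mechanically. Below is a rough sketch of such a converter — Python 3 for illustration, names hypothetical. Note that the count after # appears to be in half-width units, i.e. 2 per kanji, judging from the #6R/#4R/#2R examples in this thread.)

```python
import re

# Matches Anki/Japanese Support style furigana: " 漢字[かんじ]"
# (a space-separated morpheme followed by its reading in brackets).
FURIGANA = re.compile(r" ?([^ \[\]]+)\[([^\]]+)\]")

def anki_to_znk(text):
    """Rewrite Anki bracket furigana into ZnK's #NR...# opcode form.
    N is assumed to be the width of the base run in half-width units
    (2 per kanji), based on the #6R/#4R/#2R examples in this thread."""
    def repl(m):
        base, reading = m.group(1), m.group(2)
        return "%s#%dR%s#" % (base, 2 * len(base), reading)
    # Convert each bracketed reading, then drop Anki's morpheme separators.
    return FURIGANA.sub(repl, text).replace(" ", "")
```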


Zero no Kiseki - Programming Help - xfact2007 - 2015-06-11

What OS do you use for game translation: Windows, Linux, other?
1. What is the name of your preferred translation software? Is it an online service, or an offline program?
2. What is the character encoding of the target text? UTF-8, SHIFT-JIS?
If you can write a program to convert the (mecab/Anki) output to ZnK's format, parsing mecab's output directly is not much harder. I don't know which programming language your converter will be written in. But you don't need to write a parser (i.e., parse mecab's output to generate furigana) if you use an Anki plugin (e.g. Japanese Support's Bulk Add Reading function) to generate the furigana. Mecab can parse 120k lines in seconds; the SQL operations in Anki may slow the process down a little. The furigana readings are not always accurate, so you should edit the results afterwards. Then you can export the furigana fields from Anki, and your program can convert the exported text to ZnK's format.
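(Editor's aside: for reference, parsing mecab's default output directly takes only a few lines. Each output line is the surface form, a tab, then comma-separated features; with the IPADIC dictionary the katakana reading is the 8th feature field. A sketch, with the field layout assumed from IPADIC's defaults:)

```python
def parse_mecab_line(line):
    """Split one line of default-format mecab output into
    (surface, katakana_reading). With the IPADIC dictionary the
    reading is feature field 8 (index 7); unknown words and EOS
    lines have fewer fields, so fall back to None."""
    surface, _, features = line.partition("\t")
    fields = features.split(",")
    reading = fields[7] if len(fields) > 7 and fields[7] != "*" else None
    return surface, reading
```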


Zero no Kiseki - Programming Help - jcdietz03 - 2015-06-11

I should say I'm a terrible programmer. I wrote the tool that dumps the script and inserts it back because nobody else would do it, and it took me many hours to do. I did it in Python because I heard it's easy to learn.

I don't have a preferred translation software. I'm actually not sure what choices are available. I use Windows. This is a PSP game; I use PPSSPP as an emulator, which runs on Windows. I want to go with whatever's the easiest to work with. I'm willing to spend up to $100 on a software package if it'll help.

The game will read SHIFT-JIS only. Python has the codecs module in its standard library which I had to learn how to use because my team uses Google Sheets (UTF-8 is its format) for translation and the game uses SHIFT-JIS. I feel comfortable using the codecs module. I don't think converting encodings is likely to be a problem.
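(Editor's aside: the conversion described here is straightforward with codecs. A minimal sketch — the file names are made up:)

```python
import codecs

def utf8_to_sjis(src_path, dst_path):
    """Re-encode a UTF-8 script dump (e.g. exported from Google Sheets)
    as the Shift-JIS the game engine expects. Raises UnicodeEncodeError
    if a character has no Shift-JIS equivalent, which is worth catching
    early rather than discovering mojibake in-game."""
    with codecs.open(src_path, "r", "utf-8") as f:
        text = f.read()
    with codecs.open(dst_path, "w", "shift-jis") as f:
        f.write(text)
```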

Okay, it seems the Anki thing is likely to work then. I'll give it a try. It seems that will be simplest. Thanks for weighing in on it.


Zero no Kiseki - Programming Help - aldebrn - 2015-06-11

It's a lot easier just to install and use MeCab stand-alone (or extract it from the Japanese Support plugin, which installs a copy in your Anki folder) than to go through all of Anki's tedium. For each line of text input into it, it'll print one line for each morpheme it finds.

Edit: OH I forgot I wrote something for just this purpose! Download the script here and put it in the same directory as Japanese Support Anki plug-in (alongside "reading.py"), and then run it like this: "python readingStandAlone.py input.txt output.txt". You need UTF8 input.txt, but it sounds like you have that. The output will be UTF8 also. This produces the exact same readings as Anki plugin, i.e., it puts spaces between morphemes and uses "[" and "]" to denote furigana. You can add an extra argument: "python readingStandAlone.py input.txt output.txt verboseFormatter" and it'll produce furigana in a format more amenable to software processing, see the script for details. Let me know if you have questions, mecab can be a pain to figure out.

There are online furigana services, e.g. http://tatoeba.org/eng/tools/romaji_furigana (this one is powered by MeCab).

MeCab will give you the readings for every morpheme in the text. Are you going to have some way to choose which words get furigana?
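(Editor's aside: one simple answer to "which words get furigana" is: any morpheme whose surface contains a kanji. A sketch — the range used here is the common CJK Unified Ideographs block:)

```python
import re

# CJK Unified Ideographs; enough for deciding "does this need furigana?"
KANJI = re.compile(r"[\u4e00-\u9fff]")

def needs_furigana(surface):
    """True if the morpheme contains at least one kanji, i.e. its
    reading is not already visible in the text."""
    return bool(KANJI.search(surface))
```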

Also, if you want to try it, KyTea is a morphological parser & part-of-speech tagger like MeCab, but newer and, according to its author, better :) It should be pretty easy to compile under Cygwin, and you use it the same way as MeCab.


Zero no Kiseki - Programming Help - yudantaiteki - 2015-06-12

Any reason you're picking Zero?


Zero no Kiseki - Programming Help - xfact2007 - 2015-06-12

You can find a download link for ATLAS here (Translation_Tools)
kipchu.wordpress.com/2012/08/02/visual-novel-help-translation-tools-for-playing-untranslated-visual-novels/
The translation quality is questionable.

I took a crack at solving this. Because I don't have any test data, it probably won't work yet. Let me know what you think.
Here is the python code pastebin.com/x1PxGpwH
* you don't need Anki or Kakasi anymore, only Mecab
* it keeps the current opcodes
* it adds reading to words without furigana
* it will remember custom furigana, and ignores the Mecab's reading (useful for names)
* rare bug: if the last line of the input file is an english sentence, the program throws exception

『魔導杖』といいます will be converted to 『魔導杖#6Rオーバルスタッフ#R』といいます
or
彼奴#4Rきゃつ#Rらには届かん…… to 彼奴#4Rきゃつ#Rらには届#2Rとど#Rかん……


Zero no Kiseki - Programming Help - aldebrn - 2015-06-12

xfact2007 Wrote:I tried to solve the following. Because I don't have any test data, I think it won't work yet. Let me know what you think.
Here is the python code pastebin.com/vBHD9hRp
* you don't need Anki or Kakasi anymore, only Mecab
* it keeps the current opcodes
* it adds reading to words without furigana
* it will remember custom furigana, and ignores the Mecab's reading (useful for names)

『魔導杖』といいます will be converted to 『魔導杖#6Rオーバルスタッフ#R』といいます
or
彼奴#4Rきゃつ#Rらには届かん…… to 彼奴#4Rきゃつ#Rらには届#2Rとど#Rかん……
Hey, good work! Even though the custom-furigana memory is specific to ZnK, it's a great feature to have if your input has partial furigana. Also, kudos on getting rid of Kakasi; all it did here was katakana->hiragana, right? I admit even in Javascript I use a library (wanakana) to do this.

I saw "Expires: in 6 days" and freaked out—please for the love of god, put it on https://gist.github.com so people (me) can reuse this when they find need of it in a year. (I made a secret gist which should last forever.)


Zero no Kiseki - Programming Help - yogert909 - 2015-06-12

I'm a programming noob so this may or may not be useful, but I just came across this. It seems to be a python library with some of the same functions as mecab. Unfortunately it doesn't seem to generate furigana, but oddly it does romaji.

Japanese NLP Library


Zero no Kiseki - Programming Help - aldebrn - 2015-06-12

yogert909 Wrote:I'm a programming noob so this may or may not be useful, but I just came across this. It seems to be a python library with some of the same functions as mecab. Unfortunately it doesn't seem to generate furigana, but oddly it does romaji.

Japanese NLP Library
Good find! This uses CaboCha instead of MeCab. Both are morphological parsers, written by the same guy, Dr Kudou, in C++. Both need one of MeCab's dictionaries (IPADIC, Juman, etc.) for training. CaboCha uses support vector machines as its machine learning horsepower while MeCab uses conditional random fields.

jNlp's tokenizer returns katakana readings of tokens, so it should be straightforward to build furigana out of that, same as out of MeCab's output. It should be a lot easier to build what jcdietz03 wanted using this instead of mucking with the filthy Python script that came with Anki Japanese Support plugin.

Ve's goal is to make a unified front-end for all these parsers and part-of-speech taggers. But so far Kimtaro has only gotten to MeCab/IPADIC and FreeLing :P

Edit: Ooh, I didn't have the whole picture. CaboCha can also do dependency analysis (screenshot from jGlossator). Not sure how useful that is for this project, but it's certainly going to be useful to someone sometime. MeCab can't do this (yet?), but Dr Yoshinaga &co have written J.DepP, which post-processes MeCab output for both bunsetsu chunking and dependency calculation. I took a screenshot of its output.


Zero no Kiseki - Programming Help - jcdietz03 - 2015-06-13

I used the tool xfact2007 wrote.
Anyway, the output format wasn't correct, but it wasn't anything a small change to line 110 couldn't fix.
This is the result of the 1st test: http://imgur.com/N29ike1 There are some Rs which shouldn't be there.
This is the result of the 2nd test:


It should be noted that xfact2007's tool is either written in Python 2 or compatible across Python versions (not sure which).


Zero no Kiseki - Programming Help - aldebrn - 2015-06-13

Definitely Python 2; that's what Anki ships with. Lol, will Anki ever move to Python 3?

Edit: fixed link to Youtube:
cool!


Zero no Kiseki - Programming Help - Roketzu - 2015-06-13

aldebrn Wrote:Definitely Python 2; that's what Anki ships with. Lol, will Anki ever move to Python 3?

Edit: fixed link to Youtube:
cool!
Forgive my ignorance, but do you know if issues like the one I brought up here: https://anki.tenderapp.com/discussions/ankidesktop/12552-fonts-losing-clarityfullness-in-anki

.. have anything to do with the fact Anki hasn't moved to Python3? I'm not sure what toolkit even means in the context brought up here, or how much of a hassle it would be to update it.


Zero no Kiseki - Programming Help - xfact2007 - 2015-06-13

jcdietz03 Wrote:I used the tool xfact2007 wrote.
Anyway, the output format wasn't correct, but it wasn't anything a small change to line 110 couldn't fix.
This is the result of the 1st test: http://imgur.com/N29ike1 There are some Rs which shouldn't be there.
This is the result of the 2nd test:


It should be noted that xfact2007's tool is either written in Python 2 or compatible across Python versions (not sure which).
It's written in Python 2.7. Pardon me, but it seems you used an old, mirrored, outdated version; I know because line 110 is a blank line in the new version. Please use this version pastebin.com/x1PxGpwH (then remove the second R character), because the old version generates wrong furigana for frequently used kanji characters.

aldebrn Wrote:Hey good work! Even though the feature where it remembers custom furigana is custom to ZnK, that's a great feature to have if your input has partial furigana. Also kudos on getting rid of Kakasi, all it did here was katakana->hiragana right?
There is a DISABLE_AUTO_FURIGANA constant; set it to true and the program will use only Mecab for unknown words. If you keep the default value of false, the program will maintain a translation furigana memory.
I used the Unicode table to convert between katakana and hiragana; I haven't tested it much, though.
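(Editor's aside: the Unicode-table trick mentioned here works because the katakana block sits a fixed 0x60 above hiragana. A sketch:)

```python
def kata_to_hira(s):
    """Map katakana (U+30A1..U+30F6) onto the corresponding hiragana
    (U+3041..U+3096) by subtracting the fixed 0x60 offset; everything
    else, including the long-vowel mark ー, passes through unchanged."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in s
    )
```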

aldebrn Wrote:CaboCha uses a support vector machines as its machine learning horsepower while MeCab uses conditional random fields.
I tested only Mecab and Chasen. For these translation projects a custom dictionary must be built; we used the default dictionary. I don't know the theme of the game, but I hope the accuracy of the generated furigana is not lower than 93%.


Zero no Kiseki - Programming Help - jcdietz03 - 2015-06-13

I was wondering if you can help me by developing this tool more.
I would like you to debug the orbal staff test case.

In the test cases at the bottom, add:
mecab.translation[u"魔導杖"] = u"オーバルスタッフ"
expr = u"魔導杖"
print mecab.opcode_restore(expr)

The lines that care about the translation dictionary are 251 and 252.
Other lines in the program set up the translation dictionary. I tested and it looks like these are working properly.
mecab is breaking down 魔導杖 into three morphemes of 1 kanji each. I get 魔#2Rま#導#2Rしるべ#杖#2Rつえ# for a result. When it does that, the individual morphemes don't match the dictionary entry. Any ideas on how to fix?
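(Editor's aside: one workaround — not necessarily how xfact2007 ended up fixing it — is a greedy pre-pass that stamps the translation dictionary's multi-morpheme words into opcode form before mecab ever sees the line, longest entries first:)

```python
def apply_translations(text, translations):
    """Replace each known word with its ZnK furigana opcode before
    per-morpheme processing, so mecab's splitting can't break it up.
    Longest entries go first, so 魔導杖 wins over any shorter overlap.
    The 2*len width follows the #6R/#4R/#2R pattern in this thread."""
    for word, reading in sorted(translations.items(),
                                key=lambda kv: -len(kv[0])):
        text = text.replace(word, "%s#%dR%s#" % (word, 2 * len(word), reading))
    return text
```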


Zero no Kiseki - Programming Help - xfact2007 - 2015-06-13

Thank you for the bug report.
Mecab parses the word as three separate characters: 魔+導+杖
The translation furigana dictionary then tries to find a predefined furigana for 魔, 導 and 杖 individually, and doesn't find any :D
It's a serious flaw, I have to rewrite the MecabController.reading() function. I will leave the download link here, later.

aldebrn Wrote:The long-winded and perhaps more long-term solution would be to make a custom MeCab dictionary with this one word 魔導杖, and give it a low probability of being split.
Not that easy. As I said before,
xfact2007 Wrote:For these translation projects, a custom dictionary must be built, we used the default dictionary.
In a perfect world, we could have lots of example sentences with 魔導杖, tagged correctly, preferably from RPG games. We could have the data to train the learning algorithm, and that will generate a dictionary.

@aldebrn, if you ever write a useful model that can solve this problem, please share it with us.
All I've heard so far is that we should use
* a) Mecab
* b) custom Mecab dictionaries
* c) CaboCha

I don't have the dialogues from the game. The result won't be perfect either way, whether it comes from Mecab or CaboCha, with or without a custom dictionary.


Zero no Kiseki - Programming Help - aldebrn - 2015-06-13

jcdietz03 Wrote:mecab is breaking down 魔導杖 into three morphemes of 1 kanji each. I get 魔#2Rま#導#2Rしるべ#杖#2Rつえ# for a result. When it does that, the individual morphemes don't match the dictionary entry. Any ideas on how to fix?
The fabulous xfact2007 will probably find some simple and practical solution for this. The long-winded and perhaps more long-term solution would be to make a custom MeCab dictionary with this one word 魔導杖, and give it a low probability of being split.

Here's a gist from Kimtaro (of jisho.org fame) showing how to do this. In his example, the new word is "fasihsignal", and the *cost* is 100 (the lower this cost, the less likely MeCab is to split the word in question). Here's a really great StackOverflow answer explaining the same process in more words. (If you have a StackOverflow account, consider upvoting this answer; it's not mine.)

You can look at the raw data the various MeCab dictionaries use to see examples of what the cost is for actual words. For example, this is the raw CSV data for IPADIC's adjectival nouns (very large file, Github will render it as a table). There we see this line:

史的,1287,1287,6608,名詞,形容動詞語幹,*,*,*,*,史的,シテキ,シテキ

Per Kimtaro's notes, these fields correspond to:

surface_form: 史的
left_context_id: 1287
right_context_id: 1287
cost: 6608
part_of_speech: 名詞
pos_division_1: 形容動詞語幹
pos_division_2: *
pos_division_3: *
inflection_type: *
inflection_style: *
lemma: 史的
reading: シテキ
pronunciation: シテキ

For 魔導杖, you won't need to provide left/right context IDs (the two explanatory links above omit them), just a cost, part of speech, part of speech division 1 (i.e., main sub-part-of-speech), lemma (can be 魔導杖 again, "lemma" means "root form of word"), and "オーバルスタッフ" for both "reading" and "pronunciation".
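(Editor's aside: generating the CSV row programmatically keeps the field order straight. A sketch — per the links above, the two context-ID fields can be left empty and mecab-dict-index will fill them in:)

```python
def user_dict_row(surface, reading, cost=100, pos="名詞", pos_div1="一般"):
    """Build one IPADIC-style user-dictionary CSV row:
    surface, left/right context IDs (blank), cost, POS fields,
    lemma, reading, pronunciation. Lower cost = less likely to split."""
    return ",".join([surface, "", "", str(cost),
                     pos, pos_div1, "*", "*", "*", "*",
                     surface, reading, reading])
```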

Let me know if the above is unclear and I can try clarifying it.


Zero no Kiseki - Programming Help - jcdietz03 - 2015-06-13

yudantaiteki Wrote:Any reason you're picking Zero?
I should reply to you. I do not mean to ignore you, but I don't know what to write as a response.

I think the main reason is I developed a tool that can insert the text.
I am working on an English translation project here: http://retro-type.com/heroesoflegend/forums/index.php
But, I was thinking about the full scope of possibilities now that the script inserting tool is developed, and I came up with this.
Does that address your question?

Because this furigana project uses computer automation, it can be ready much faster than the English translation, where all lines must be translated manually. It's almost ready now, actually. I need to explore mecab custom dictionaries as suggested by aldebrn below.

I feel that a version of the game with full furigana could be of interest to those studying Japanese. Also, I feel like it is feasible to actually complete.

"I like this game"
-I'm not sure I can say this. I have played maybe the first 1.5 hours of the game (I think it's ~25 hours total), and it's not too bad so far.

"The game is regarded as a good one"
-Well, I think it is, but I can't point you to any evidence for this opinion.


Zero no Kiseki - Programming Help - aldebrn - 2015-06-13

Roketzu Wrote:Forgive my ignorance, but do you know if issues like the one I brought up here: https://anki.tenderapp.com/discussions/ankidesktop/12552-fonts-losing-clarityfullness-in-anki .. have anything to do with the fact Anki hasn't moved to Python3? I'm not sure what toolkit even means in the context brought up here, or how much of a hassle it would be to update it.
No, this issue is probably related to Qt/PyQt, which is the GUI toolkit that Anki uses. Anki ships with a bunch of supporting libraries, including Python itself and PyQt, the idea being that it doesn't rely on your system having anything installed. Some of these packages are quite old. Last year I ran into a bug in the bundled PyInstaller that had been fixed upstream in mid-2012, but the Anki guys didn't plan on upgrading to the latest PyInstaller, for a few obvious reasons. The case here is probably that the bundled Qt is from a vintage that hadn't yet figured out antialiased fonts…

Python2 vs Python3 is another case where if you were starting a new project, you should clearly use Python3, but the cost-benefit analysis for Anki, and other legacy projects, advised conservatism.


Zero no Kiseki - Programming Help - yudantaiteki - 2015-06-13

jcdietz03 Wrote:
yudantaiteki Wrote:Any reason you're picking Zero?
I should reply to you. I do not mean to ignore you, but I don't know what to write as a response.
I just wondered since the plot of Zero depends on a lot of stuff introduced in Sora SC and Sora 3rd. It's probably understandable on its own though (unlike Ao).

Quote:"I like this game"
-I'm not sure I can say this. I have played maybe the first 1.5 hours of the game (I think it's ~25 hours total), and it's not too bad so far.
The whole series is great. For best results it really should be played in release order (Zero being the 4th game); the story writers have increasingly shown that they're not interested in catering to people who haven't played the entire series so far.


Zero no Kiseki - Programming Help - xfact2007 - 2015-06-14

Changelog
* restored the power of orbal staff
* tested both 魔導杖, and 魔導杖#6Rオーバルスタッフ#
* the old and the new version generate the same mecab output
* updated "usage" notes, if you run the program without parameters
* accepts two or three parameters
* new optional --dictionary parameter
* it shouldn't duplicate every newline
* it removes opcodes, before giving the line to mecab as input text.
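(Editor's aside: the opcode-stripping step in the changelog can be done with a single regex. A sketch against the #NR...# / #NR...#R forms seen in this thread; any other control codes would need their own patterns:)

```python
import re

# A furigana opcode: '#', digits, 'R', the reading, a closing '#'
# (optionally followed by the stray 'R' some tool versions emit).
FURI_OPCODE = re.compile(r"#\d+R[^#]*#R?")

def strip_opcodes(line):
    """Remove furigana opcodes so mecab receives plain text."""
    return FURI_OPCODE.sub("", line)
```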

I removed the link; see the modified post later.

jcdietz03 Wrote:I need to explore mecab custom dictionaries
use --dictionary as the third parameter to get the list of words between the #_R and # opcodes


Zero no Kiseki - Programming Help - Roketzu - 2015-06-14

@aldebrn

Thanks a lot for the explanation! That clears some things up for me.


Zero no Kiseki - Programming Help - Kuzunoha13 - 2015-06-14

yudantaiteki Wrote:The whole series is great. For best results it really should be played in release order (Zero being the 4th game); the story writers have increasingly shown that they're not interested in catering to people who haven't played the entire series so far.
In the US, XSEED handles the Eiyuu Densetsu localizations. They published Sora no Kiseki FC a few years ago, and SC should be coming out in a few months. The next game they have planned is SEN NO KISEKI (already fully translated! albeit sans voice.) Zero and Ao are not scheduled to come over. The Falcom Prez says that "We wanted [Sen] and Zero/[Ao] to be playable in any order because we noticed that, with [Sen], the demographic suddenly became younger. Many new fans of the series were going back to the old games."

Here's the link: http://www.siliconera.com/2015/06/10/trails-of-cold-steel-is-already-100-translated-xseed-open-to-prior-trails-games/

Although this series is like, the top reason I started learning Japanese, I just never got around to it. Fortunately, they just released the "Evolution" version of SnK FC, so holding off in this case actually paid off.


Zero no Kiseki - Programming Help - yudantaiteki - 2015-06-14

I haven't played Sen II yet, but even Sen 1 really seemed to assume you had played Ao and Zero. I understand why they jumped directly to Sen (although Zero and Ao both have Vita re-releases), but it's unfortunate that they had to do that.


Zero no Kiseki - Programming Help - xfact2007 - 2015-06-14

Changelog
* given the Tanaka corpus as input, the output text no longer contains duplicated words
* tested with 15k custom furigana

pastebin.com/XXTi8yyn


Zero no Kiseki - Programming Help - jcdietz03 - 2015-06-14

ZnK Furigana Patch 1: https://www.mediafire.com/?0el2g2v5cp255fy
I was able to play through the first hour of the game. There could still be bugs. I definitely haven't done everything there is to do in the game. The entire game has had the furigana inserted, so you will know if there's a problem.

Choice boxes and dialogue names don't support furigana, so furigana only appears in the main dialogue boxes.

This patch is based on binary difference (and made using Xdelta), so it should be respectful of the original creator's rights. Moderator, please let me know if there's a problem with posting a link to the patch here (and I will remove it).

Everything in this post is using the 14Jun15 0300 version of xfact2007's tool.
--------------
Because it can be a pain to play through the game to look for problems, I will post the sources for this patch here:
https://docs.google.com/spreadsheets/d/1SJc8kOagBsk_3VjiIlViCIlniy6dDsAcjlfup8q5Kuo/edit#gid=515299165
You can see area indexes here:
https://docs.google.com/spreadsheets/d/1XD1lxQ43tYo65l65ZFFflDkvBD8cRQrFLpRIG9sTOXk/edit#gid=1895448308

I think there is still work to do on the automation side, so I will not begin editing these yet.
-------------------
I ran into a problem. But it's not too bad. Solved a few problems too.
MeCab wants x (ばつ) -> かける, which could be appropriate some of the time, but not here.
I was making up the custom dictionary. http://pastebin.com/L0s7JGU6
I don't like the auto-furigana feature. So I made a list with the dictionary feature and picked the ones I wanted to keep. And then, in my custom application: (sorry, not good at coding...)
import codecs
with codecs.open('ZnK Dictionary.txt', 'r', 'utf-8') as f:
    for line in f:
        line = line.rstrip('\n').split(',')
        znk_plus_mecab.mecab.remember(line[0], line[1])

But your program does not like it if the furigana is "" so I made it one fullwidth space instead and that seemed to work.
-------------------
I was able to run the mecab dictionary indexer, but I could not get the resulting dictionary to work.
I cannot get mecab.exe direct input mode to work at all- I never see furigana (but I can type Japanese into the input just fine).
I can get mecab.exe to work in the file processing mode:
mecab X.txt -o Y.txt
Where X.txt is the input file and Y.txt is the output file.

I was able to compile the custom dictionary. My input for the dictionary was:
魔導杖,,,50000,名詞,一般,*,*,*,*,魔導杖,オーバルスタッフ,オーバルスタッフ
I ran the dictionary indexer: (after setting the path to C:\Program Files (x86)\MeCab\bin)
mecab-dict-index -d "C:\Program Files (x86)\MeCab\dic\ipadic" -u mydic.dic -f utf-8 -t utf-8 ./mydic
And then I edited dicrc with userdic = mydic.dic (output of dictionary indexer).
No matter what I did, I could not get it to recognize 魔導杖 as a token.
------------------
About 10 boxes in I spotted the first mistake: (should be あげる)
http://imgur.com/J56zgl0
If your Japanese is better than mine you might see a mistake sooner.
I really think this is good enough, but improvements can always be made.
For this one, a custom Mecab dictionary is needed, or otherwise the entry 挙げる,あXXXX could be added to the ZnK dictionary (X = fullwidth space character).