#1
This is just a draft. I'm going to keep working on it to add a graphical interface and more features (various options to search words/sentences, editing of wrongly split words, custom dictionary integration like EPWING, furigana, export of words, sentences and definitions, cloze deletion, a LingQ-like feature but with dictionary form support, and so on...).

Sorry if this sucks for now, I did it in like 5 minutes.
As I'm not a programmer please be kind to me, and btw every suggestion and help is welcome.


SOURCECODE

How it works:

Extract the .rar archive, double-click the .exe, then type the path of the file that you want to process.


What it does:

It splits the text into words and encloses each one in <div></div>, like:

<div>怪しく</div>


Then it adds an attribute inside the opening <div>, which contains the dictionary form:

<div dictf="怪しい">怪しく</div>
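To make the tagging step concrete, here is a minimal Python sketch of it. This is not the tool's actual code (the tool is a .NET program that uses MeCab); the function name `tag_tokens` and the hand-written token list are assumptions that stand in for real analyzer output.

```python
from html import escape

def tag_tokens(tokens):
    """Wrap each (surface, dictionary_form) pair in a <div> with a dictf attribute."""
    parts = []
    for surface, base in tokens:
        parts.append(
            '<div dictf="{}">{}</div>'.format(escape(base, quote=True), escape(surface))
        )
    return "".join(parts)

# Hand-written pairs standing in for what a morphological analyzer would return.
tokens = [("怪しく", "怪しい"), ("光っ", "光る")]
tagged = tag_tokens(tokens)
print(tagged)
```

Escaping the text is a precaution for input that itself contains `<`, `&` or quotes; whether the original program does this is not stated.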


example:

部屋に、真っ暗な夜の帳がおりる。

窓の外には月が二個、怪しく光っていた



becomes:

<div dictf="部屋">部屋</div><div dictf=""></div><div dictf=""></div><div dictf="真っ暗">真っ暗</div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf="おりる">おりる</div><div dictf=""></div><div dictf="*"><br /></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf=""></div><div dictf="怪しい">怪しく</div><div dictf="光る">光っ</div><div dictf=""></div><div dictf="いる"></div><div dictf=""><br /></div>


To search words using the dictionary form, use a regular expression inside Notepad++ (or another text editor), like this (example for the word 怪しい):

<div dictf="怪しい">(.*?)</div>


will return:

<div dictf="怪しい">怪しく</div>
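The same search can be scripted instead of run by hand in an editor. A small sketch with Python's `re` module, assuming the tagged text has the single-attribute `dictf` format shown above; `find_forms` and the sample string are made up for illustration.

```python
import re

def find_forms(tagged_text, dict_form):
    """Return every surface form whose dictf attribute equals dict_form."""
    pattern = re.compile('<div dictf="' + re.escape(dict_form) + '">(.*?)</div>')
    return pattern.findall(tagged_text)

sample = (
    '<div dictf="、">、</div>'
    '<div dictf="怪しい">怪しく</div>'
    '<div dictf="光る">光っ</div>'
)
forms = find_forms(sample, "怪しい")
print(forms)
```

`re.escape` is there so that a dictionary form containing regex metacharacters (for example `*`, which appears as a `dictf` value for line breaks) is matched literally.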


while:

<div dictf="。">。(.*?)<div dictf="怪しい">(.*?)</div>(.*?)<br /></div>


will return:

<div dictf="。">。</div><div dictf="*"><br /></div><div dictf="窓">窓</div><div dictf="の">の</div><div dictf="外">外</div><div dictf="に">に</div><div dictf="は">は</div><div dictf="月">月</div><div dictf="が">が</div><div dictf="二">二</div><div dictf="個">個</div><div dictf="、">、</div><div dictf="怪しい">怪しく</div><div dictf="光る">光っ</div><div dictf="て">て</div><div dictf="いる">い</div><div dictf="た">た<br /></div>
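A match like this is still full of markup. One way to get a readable sentence back out of it is to drop everything that looks like a tag. A rough sketch, assuming the fragment contains only `<div ...>`, `</div>` and `<br />` tags as in the examples; `strip_markup` is a hypothetical helper, not part of the tool.

```python
import re

# Anything between '<' and the next '>' is treated as a tag and removed.
TAG = re.compile(r"<[^>]+>")

def strip_markup(fragment):
    """Remove all tags from a tagged fragment, leaving only the original text."""
    return TAG.sub("", fragment)

fragment = (
    '<div dictf="、">、</div><div dictf="怪しい">怪しく</div>'
    '<div dictf="光る">光っ</div><div dictf="て">て</div>'
    '<div dictf="いる">い</div><div dictf="た">た<br /></div>'
)
clean = strip_markup(fragment)
print(clean)
```

This would break on text that contains a literal `<` or `>`, so it is only good enough for plain prose.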


Sorry, I know this is a mess.
I'm going to add a search and export feature soon, where you will be able to specify the sentence length and/or boundaries, and it will extract a clean version of all the sentences which satisfy your search.

So if you want all sentences which contain "食べる" in all its forms, you will specify the boundaries of the sentence (like 。,!,?,「」, newline) and the minimum length. If your sentence is "「食べたい!」" it's obviously too short, so it will search for the next upper and/or lower boundaries to include a larger sentence, and it will do this recursively until the length you specified is satisfied.
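The boundary-expansion idea could look roughly like the following Python sketch. It is only one possible reading of the description: it works on a list of (surface, dictionary form) pairs rather than on the tagged file, it uses a loop instead of recursion, and the names `BOUNDARIES` and `extract_sentence` as well as the sample tokens are invented for the example.

```python
# Tokens that end or delimit a sentence (ASCII and full-width variants).
BOUNDARIES = {"。", "!", "?", "!", "?", "「", "」", "\n"}

def extract_sentence(tokens, hit, min_len):
    """Grow a window around tokens[hit], boundary by boundary, until it holds
    at least min_len characters or covers the whole text."""
    start, end = hit, hit + 1  # the window is tokens[start:end]
    while True:
        # Extend left up to, but not including, the previous boundary token.
        while start > 0 and tokens[start - 1][0] not in BOUNDARIES:
            start -= 1
        # Extend right until the last token in the window is a boundary.
        while end < len(tokens) and tokens[end - 1][0] not in BOUNDARIES:
            end += 1
        text = "".join(surface for surface, _ in tokens[start:end])
        if len(text) >= min_len or (start == 0 and end == len(tokens)):
            return text
        # Too short: step over one boundary on each side and expand again.
        if start > 0:
            start -= 1
        if end < len(tokens):
            end += 1

tokens = [
    ("ご飯", "ご飯"), ("だ", "だ"), ("。", "。"),
    ("「", "「"), ("食べ", "食べる"), ("たい", "たい"), ("!", "!"), ("」", "」"),
]
hits = [i for i, (_, base) in enumerate(tokens) if base == "食べる"]
short = extract_sentence(tokens, hits[0], 3)
longer = extract_sentence(tokens, hits[0], 8)
print(short)
print(longer)
```

With a minimum length of 3 the window stops at the nearest boundaries; with 8 it first takes in the quote marks and then the previous sentence.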

Hope this helps. I know for now it's just a stupid script, but I hope to turn it into a good project as soon as I can!
Edited: 2017-02-12, 12:46 pm
#2
I've added a second attribute with the word reading, so you will end up with something like:


<div dictf="怪しい", reading="アヤシイ">怪しく</div>
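With two attributes, a script can pull out the dictionary form, reading and surface form together. A small sketch, assuming the output looks exactly like the line above (including the comma between the attributes); `ENTRY` and `parse_entries` are hypothetical names.

```python
import re

# Mirrors the format shown above, comma between the attributes included.
ENTRY = re.compile(r'<div dictf="(.*?)", reading="(.*?)">(.*?)</div>')

def parse_entries(tagged_text):
    """Return one dict per tagged word with its dictionary form, reading and surface."""
    return [
        {"dictf": dictf, "reading": reading, "surface": surface}
        for dictf, reading, surface in ENTRY.findall(tagged_text)
    ]

entries = parse_entries('<div dictf="怪しい", reading="アヤシイ">怪しく</div>')
print(entries)
```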
#3
Looks like it very quickly encounters an out of memory exception if run on a large (450 MB) file.
Might help if it were compiled as a 64-bit application, if that is possible. Or perhaps the code can be refactored so that it's not performing so many operations in memory.
Edited: 2017-02-12, 12:08 pm
#4
(2017-02-12, 12:07 pm)Zarxrax Wrote: Looks like it very quickly encounters an out of memory exception if run on a large (450 MB) file.
Might help if it were compiled as a 64-bit application, if that is possible. Or perhaps the code can be refactored so that it's not performing so many operations in memory.

Yeah, that's because it loads the whole file into a single string asd
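One way around the memory problem is to process the input line by line instead of holding it all in one string. The sketch below shows the idea in Python rather than in the tool's own .NET code; `process_file` and `fake_tag_line` are made-up names, and `fake_tag_line` is only a placeholder for the real tokenize-and-tag step.

```python
import os
import tempfile

def process_file(src_path, dst_path, tag_line):
    """Tag a file one line at a time so memory use does not grow with file size."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # the file object yields lines lazily
            dst.write(tag_line(line.rstrip("\n")) + "\n")

def fake_tag_line(line):
    # Placeholder: a real version would tokenize the line and tag each word.
    return "<div>" + line + "</div>"

# Small demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("一行目\n二行目\n")
    process_file(src, dst, fake_tag_line)
    with open(dst, encoding="utf-8") as f:
        result = f.read()
print(result)
```

This only helps if the input has reasonably short lines; a 450 MB file that is one single line would still be read in one piece.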

Unfortunately I've just realized how much time this kind of project takes, time which I would rather spend on the Japanese language instead, so I think I will set this project aside for a while and just work on it from time to time :/

Anyway I'm going to upload the source code, so that anyone with programming skills can take it and turn it into something useful xD

If someone decides to take this and work on it, consider changing the tagging format to an XML one, so it will be easier to navigate the file structure (.NET has methods to work with XML files).
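To show what an XML version of the format might look like, here is a sketch using Python's standard `xml.etree.ElementTree` as a stand-in for the .NET XML classes. The element names `text` and `w` are arbitrary choices for the example, not something the tool defines.

```python
import xml.etree.ElementTree as ET

def build_xml(tokens):
    """Build a <text> element with one <w dictf="..."> child per token."""
    root = ET.Element("text")
    for surface, base in tokens:
        word = ET.SubElement(root, "w", dictf=base)
        word.text = surface
    return root

root = build_xml([("怪しく", "怪しい"), ("光っ", "光る")])

# Searching by dictionary form becomes a tree walk instead of a regex.
hits = [w.text for w in root.iter("w") if w.get("dictf") == "怪しい"]
xml_text = ET.tostring(root, encoding="unicode")
print(hits)
print(xml_text)
```

A proper XML library also takes care of escaping and of well-formedness, which the hand-built `<div>` strings do not.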

PS: there is an issue with MeCab: when you compile the program it puts the dictionary in "release/dict/", but when you run the program it gives an error because it tries to open the dictionary at "release/dict/ipadict/", so you need to move it inside that directory (or figure out how to instruct MeCab to look inside another directory).

Here is the source code!
Edited: 2017-02-12, 12:44 pm
#5
Thanks, I managed to bend the code to fit my needs.
#6
(2017-02-12, 9:42 pm)Zarxrax Wrote: Thanks, I managed to bend the code to fit my needs.

Happy to hear this! (Sorry I couldn't be of more help ._. )