kanji koohii FORUM
Spreadsheet/regex questions

Spreadsheet/regex questions - Thora - 2015-09-24

Hi folks. I'm hoping to split a block of kanji into one kanji per row for use in a spreadsheet.

Is there a simple way to use regex in Notepad or Open Office to insert a line break after each kanji? After much searching, I came across some indications I might try something like:
Find: (.)
Replace: \1\n
Unfortunately, that doesn't work.

I didn't find anything in Open Office's spreadsheet program either. Any solution would have to be simple since I have no idea what I'm doing.


Spreadsheet/regex questions - yogert909 - 2015-09-24

What's not working? That should add a line break after every single character... If you want line breaks after just kanji, try changing (.) to ([\u4e00-\u9faf]).

It might help better if you post a few lines from your input file and an example of what your desired output looks like.
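
For anyone who wants to try the kanji-only pattern outside an editor, here is a minimal JavaScript sketch of the same find/replace. The helper name `breakAfterKanji` and the sample string are made up for illustration, and the range only covers the main CJK ideograph block, so rarer characters outside it would be left alone.

```javascript
// Hypothetical helper: add a line break after every character in the
// range U+4E00 to U+9FAF and leave all other characters untouched.
function breakAfterKanji(text) {
  return text.replace(/([\u4e00-\u9faf])/g, '$1\n');
}

// Illustrative input: three kanji followed by plain ASCII.
const sample = '逢芦飴abc';
console.log(breakAfterKanji(sample));
```

Note that JavaScript uses $1 for the captured group in the replacement; other tools may expect \1 instead.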


Spreadsheet/regex questions - anotherjohn - 2015-09-24

You could try importing into the spreadsheet first and separating them there :)

Put the numbers 1 - 9999 or whatever in column A

In C1 put the kanji all on one line

In B1 put: =mid(c$1,a1,1) and copy down
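
As a rough illustration of what those steps do, here is a JavaScript sketch of the same idea. The `mid` function is a hypothetical stand-in written for this example, not the spreadsheet's own implementation; it assumes a 1-based start position followed by a character count.

```javascript
// Hypothetical stand-in for the spreadsheet MID function:
// start is 1-based, count is the number of characters to take.
function mid(text, start, count) {
  return text.slice(start - 1, start - 1 + count);
}

// "C1" holds all the kanji on one line (illustrative sample).
const c1 = '逢芦飴溢茨鰯';

// "Column A" is 1, 2, 3, ...; "column B" picks one character per row.
const columnB = [];
for (let a = 1; a <= c1.length; a++) {
  columnB.push(mid(c1, a, 1));
}
console.log(columnB.join('\n'));
```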


Spreadsheet/regex questions - Thora - 2015-09-24

Thank you both for your quick responses. :)

yogert909, I just discovered that using Replace: $1\n works in OpenOffice Writer (rather than \1\n).

Neither of them worked in Notepad++. It would result in "XE9 [line break]X80 XA2" instead. ??

Thanks for the tip about using [\u4e00-\u9faf] for kanji. I did come across mention of using [1-9] and [a-z] for alphanumeric and wondered what would work for kanji.

I believe you correctly understood what I was trying to do:
From (e.g.): 逢芦飴溢茨鰯淫迂厩噂餌襖迦牙廻恢晦蟹 etc. to:
逢
芦
飴
溢
茨
鰯 etc.


Spreadsheet/regex questions - Thora - 2015-09-24

anotherjohn Wrote: In B1 put: =mid(c$1,a1,1) and copy down
It worked (I just needed to use semicolons instead of commas). What a useful formula to know. Thanks!

Playing around with it a bit, it seems to mean (source cell; starting character position; number of characters).

What about converting the formula cells back to plain values? Is Paste Special the only/best way to do that? I don't want to turn off autocalculate because I will be using formulas elsewhere to compare columns.


Spreadsheet/regex questions - anotherjohn - 2015-09-24

Thora Wrote: It worked (I just needed to use semi-colons). What a useful formula to know. Thanks!
You're welcome :)

Yep, just replace the formulas with values

Unfortunately doing so is a bit cumbersome in OpenOffice iirc, which is a serious drawback given how common an operation it is, though there may be a shortcut for it


Spreadsheet/regex questions - aldebrn - 2015-09-24

If it's not a super-huge file you want to do this in, you can just use your browser's Javascript Console (Tools -> Web Developer -> Web Console in Firefox, or in Chrome: Settings -> More Tools -> Developer Tools -> Console tab; all computer browsers have this).

I just ran:
Code:
copy('逢芦飴溢茨鰯淫迂厩噂餌襖迦牙廻恢晦蟹'.split('').join('\n'))
and now I'll paste the contents of my clipboard:
逢
芦
飴
溢
茨
鰯
淫
迂
厩
噂
餌
襖
迦
牙
廻
恢
晦
蟹

If you wind up needing to do this to a huge file which you can't paste into the web console, you can try venturing into the haunted pleasurelands of Node.js.
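
For the curious, a minimal sketch of what the Node.js route might look like. The function names and the file names in the usage comment are placeholders, and this assumes the input file is saved as UTF-8 and small enough to read into memory in one go.

```javascript
// Minimal Node.js sketch; helper names and file names are placeholders.
const fs = require('fs');

// Strip whitespace, then put each remaining character on its own line.
// Array.from splits by code point, so characters outside the Basic
// Multilingual Plane are not cut in half the way split('') would cut them.
function onePerLine(text) {
  return Array.from(text.replace(/\s+/g, '')).join('\n');
}

// Read inPath as UTF-8 and write the one-per-line version to outPath.
function splitFile(inPath, outPath) {
  const text = fs.readFileSync(inPath, 'utf8');
  fs.writeFileSync(outPath, onePerLine(text) + '\n', 'utf8');
}

// Example call (uncomment and adjust the paths):
// splitFile('kanji.txt', 'kanji-rows.txt');
```

Save it as a .js file, uncomment the last line, and run it with the node command from the folder that holds the input file.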


Spreadsheet/regex questions - aldebrn - 2015-09-24

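Thora Wrote: Neither of them worked in Notepad++. It would result in "XE9 [line break]X80 XA2" instead. ??
Horrific. Notepad++ seems to not know Unicode: it looks like the replace ran byte by byte and split a multi-byte character. E9 80 A2 would be the UTF-8 bytes of U+9022, 逢.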
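
One way to check what those bytes are is the same JavaScript console. In this small sketch (`utf8Hex` is a made-up helper name), the first kanji of the sample string encodes to exactly the three bytes in the quoted output, which suggests the line break landed in the middle of one character.

```javascript
// Hypothetical helper: show the UTF-8 bytes of a string as hex pairs.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) =>
    b.toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

console.log(utf8Hex('逢'));                      // E9 80 A2
console.log('逢'.codePointAt(0).toString(16));   // 9022
```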


Spreadsheet/regex questions - Vempele - 2015-09-25

http://stackoverflow.com/questions/18411903/anyone-know-how-to-use-regex-in-notepad-to-find-arabic-characters
Quote: This is happening because the Notepad++ regex engine is PCRE, which doesn't support the syntax you have provided.

To match a unicode codepoint you have to use \x{NNNN} so your regular expression becomes:
[\x{0600}-\x{06FF}]
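
Adapting the quoted syntax to the kanji range mentioned earlier in the thread, a small JavaScript sketch can build the pattern string from two code points. The helper name `pcreRange` is invented for this example, and the resulting pattern is an untested assumption about what Notepad++ would accept, based only on the quote above.

```javascript
// Hypothetical helper: format a character range in \x{NNNN} notation.
function pcreRange(first, last) {
  const hex = (cp) => cp.toString(16).toUpperCase().padStart(4, '0');
  return `[\\x{${hex(first)}}-\\x{${hex(last)}}]`;
}

// The Arabic range from the quote, then the kanji range from this thread.
console.log(pcreRange(0x0600, 0x06ff)); // [\x{0600}-\x{06FF}]
console.log(pcreRange(0x4e00, 0x9faf)); // [\x{4E00}-\x{9FAF}]
```

Wrapped in parentheses, the second pattern would go in the Find box in place of (.).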



Spreadsheet/regex questions - Thora - 2015-09-25

aldebrn, I've saved your instructions for future reference. Thank you.

aldebrn Wrote: If you wind up needing to do this to a huge file which you can't paste into the web console, you can try venturing into the haunted pleasurelands of Node.js.
While this sounds somewhat intriguing, I suspect I don't have what it takes to get past the velvet rope. So I shall resign myself to my ordinary little files and try to take pleasure in my new javascript friend. ;-)