cb4960 Wrote:
@nest0r,
Thanks for the quick tutorial!
And it looks like I'll have to add batch support for Gaiji Replacer or gaiji support for AozoraRemover (or maybe both).
That would be cool. ;p Maybe a two-way tool?
Also:
An even quicker AntConc tutorial. ^_^ Well, maybe not quicker.
Go here:
http://www.antlab.sci.waseda.ac.jp/software.html and download the relevant file. If it's the Windows .exe, open it (no installation that I recall), and:
1) Global Settings→Language Encodings→UTF-8 or Shift_JIS
2) File→Export Settings→save under the default name (antconc_settings_320.ant) in the same folder as the .exe. The program then loads it at startup, so you don't have to keep selecting the language encoding. You can also import settings files at any time.
3) File→Open File(s)/Open Dir.
4) With those files showing up in a column to your left, go to the Clusters tab.
5) Check ‘N-Grams’ box, and the Clusters tab will now say N-grams instead of Clusters.
6) Set N-Gram Size to Min. Size 1 and Max. Size 20 (For some reason you have to scroll up/down rather than type numbers.)
7) Min. N-Gram Frequency = 4
(The size and especially the frequency settings above can vary, but I find this size range works best because of how Japanese is processed in AntConc. You'll probably want much lower for English, or somewhat lower for larger numbers [like triple digits] of Japanese texts.)
8) Sort by Word or Frequency. You can always re-sort by clicking Sort. Sorting by Word is useful because of how n-grams are processed:
http://www.antlab.sci.waseda.ac.jp/softw...grams).htm (One set of n-grams for ‘This is a pen’ would be ‘This/This is/This is a/This is a pen’, so word-sorting lets you skim more easily to that final line.)
9) Click Start!
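For anyone curious what steps 5 to 8 amount to, here's a rough Python sketch of n-gram counting with a size range and a minimum frequency. It is not AntConc's actual code: the function name `ngram_counts` and the whitespace tokenization are my own assumptions, and AntConc's handling of Japanese text will differ.

```python
from collections import Counter

def ngram_counts(tokens, min_size=1, max_size=20, min_freq=4):
    """Count every run of min_size..max_size consecutive tokens.

    Hypothetical helper, only meant to mirror the Min./Max. Size and
    Min. Frequency settings from the steps above.
    """
    counts = Counter()
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # ran off the end of the text
            counts[" ".join(tokens[start:end])] += 1
    # Drop anything rarer than the minimum frequency.
    return {gram: n for gram, n in counts.items() if n >= min_freq}

# Assumption: plain whitespace splitting stands in for the tokenizer.
tokens = "This is a pen".split()
result = ngram_counts(tokens, min_freq=1)

# Sorting by word puts each n-gram right after its shorter prefixes,
# which is why word-sorting makes the longest version easy to spot.
for gram in sorted(result):
    print(gram, result[gram])
```

The first four lines of the sorted output are This / This is / This is a / This is a pen, same as the example in step 8.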
You can copy the results manually by clicking slightly to the side on the same line, or click directly for Concordance, or File→Save Output to Text File. You can also duplicate and undock the results window by clicking Save Window.
Also, because we're using the default n-gram settings, it only processes words, not punctuation, numbers, symbols, etc. You can enable those in the Global Settings, but it skews the results quite a bit, so I just leave it. When I see results with gaps in them indicating where those tokens (question mark, number, etc.) used to be, I click the result to see the context in the Concordance tab. In those cases you'll get no hits, because it uses the punctuation-removed line as the search query, so you'll need to select just a section of it and re-search. E.g. if you click だけど それは in the search results, it'll take you to the Concordance tab and say No Hits, but just re-search for だけど or それは and sort by 1L or 2R or what have you.
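A toy illustration of why that No Hits thing happens, in Python. This is only a model of the behaviour described above, not how AntConc works internally: `word_tokens` and `fallback_queries` are made-up names, and the "letters only" regex is my assumption about what a default token definition might look like.

```python
import re

def word_tokens(text):
    # Hypothetical "letters only" token definition: keep runs of letters,
    # drop punctuation, digits and whitespace.
    return re.findall(r"[^\W\d_]+", text)

def fallback_queries(ngram, text):
    """Return the n-gram if it occurs verbatim, else the pieces that do."""
    if ngram in text:
        return [ngram]
    return [part for part in ngram.split() if part in text]

text = "だけど、それは"
ngram = " ".join(word_tokens(text))   # the comma is gone, a gap remains

print(ngram in text)                  # the gapped n-gram is not in the text
print(fallback_queries(ngram, text))  # but each piece on its own is
```

So the gapped line can't match the original text, while either half of it can, which is why re-searching for just a section works.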
Lastly, we're obviously only skimming the surface of what's possible with this tool.
Edit: Earlier I was reading about ‘concgrams’ via that thesis paper I mentioned before, and I see that AntConc's author is thinking of incorporating those into a later version, so that's cool. Concgrams account for all the constituency and positional variation, e.g. ‘government expenditure’ can be ‘government's own expenditure’ or ‘expenditure of the government’, but normally this wouldn't get picked up by linear-only searches. There are some tools that do this automatically for texts, but they seem to be hard to get. You can manually achieve a similar effect in AntConc by searching for ‘government’ and adding the context word (in Advanced settings) ‘expenditure’ with a horizon (another instance of selecting the span, this time how many words to the L or R of the search term to look within). But I digress.
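To make the context-word-plus-horizon idea concrete, here's a minimal Python sketch. It's my own approximation of the manual trick, not a real concgram tool and not AntConc's implementation: `hits_with_context` is a hypothetical name, tokens are split on whitespace, and matching is exact (so ‘government's’ would need extra handling).

```python
def hits_with_context(tokens, term, context, left=5, right=5):
    """Find each `term` that has `context` within the horizon.

    `left` and `right` are how many tokens to look at on each side
    of the search term. Returns (index, window text) pairs.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        lo = max(0, i - left)
        hi = min(len(tokens), i + right + 1)
        # Neighbours only: the search term itself is excluded.
        window = tokens[lo:i] + tokens[i + 1:hi]
        if context in window:
            hits.append((i, " ".join(tokens[lo:hi])))
    return hits

tokens = ("the expenditure of the government rose "
          "while government policy stalled").split()

# A linear search for "government expenditure" finds nothing here,
# but a horizon of 3 words either side picks up the reversed order.
for index, window in hits_with_context(tokens, "government", "expenditure",
                                       left=3, right=3):
    print(index, window)
```

Only the first ‘government’ counts as a hit, since ‘expenditure’ is three words to its left; the second one has no ‘expenditure’ within range.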
Guess I should've posted this elsewhere, sorry. Back to Rikaichan: RevTK Community Edition!
Edited: 2011-06-15, 9:12 pm