I've concatenated the text files, which might make it easier to weed out all the junk data and focus on the text. (The concatenated files are here.) I did this using the following bash script:
(The sorting isn't perfect -- it thinks files 100-199 come between files 1 and 2, but at least it'll ensure the English and Japanese files are processed in the same order.)
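If GNU coreutils is available, the numeric ordering can actually be fixed with version sort; a minimal sketch (assuming `sort -V`, which is a GNU extension, not POSIX):

```shell
#!/bin/bash
# sort -V ("version sort") compares numeric runs as numbers rather than
# character by character, so _out_2.txt sorts before _out_100.txt.
ls _out_*.txt | sort -V
```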
There are some large chunks of data that have small, well-hidden bits of English (or Japanese), though, so you have to be REALLY careful before just deleting huge blobs of data.
For the English text, it should be possible to write a script that filters out the junk data. For the Japanese text this probably isn't so easy, but even doing it just on the English text could still help us identify which bits of text are particularly interesting.
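As a crude starting point for the English side, you could keep only lines containing a run of several ASCII letters; the threshold of 4 and the assumption that "junk" means letter-free binary noise are just guesses to tune against the real dumps:

```shell
#!/bin/bash
# Keep lines with a run of 4+ ASCII letters -- likely real English words.
# Lines of pure binary garbage or symbol runs are dropped.
grep -E '[A-Za-z]{4,}' output.txt > english_candidates.txt
```

This will still miss short strings like "OK" or "HP", so it's for triage, not for deciding what to delete.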
(EDIT: I've edited the script so that there are blank lines around the filenames in the output files now. I've updated the upload to reflect this as well.)
Code:
#!/bin/bash
outfile=output.txt
# Creates blank file for output
cat /dev/null >"$outfile"
for f in _out_*.txt   # glob expansion is already sorted and safe with odd filenames
do
echo "$f"
echo "****************************************" >>"$outfile"
echo "$f" >>"$outfile"
echo "****************************************" >>"$outfile"
echo >>"$outfile"
cat "$f" >>"$outfile"
echo >>"$outfile"
echo >>"$outfile"
done
Edited: 2011-02-19, 9:22 pm
