Since it seems a number of the words are marginally out of phonetic index order and the sentences' sub-order seems arbitrary, I think the best call at this point is to manually correct them.
I've made a publicly editable spreadsheet here. I've added a few fields, but it should be easy enough to update with any existing corrections anyone has made (the PackIndex and PackName together uniquely identify an entry, so just sort your spreadsheet and this one by those and copy/paste the updated columns).
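If sorting and copy/pasting by hand is awkward, the same merge can be scripted. This is only a rough sketch: it assumes both sheets have been exported to rows keyed by column name, and the "Sentence" column used in the usage note is a made-up example rather than an actual field name. Only PackName and PackIndex come from the spreadsheet itself.

```python
def merge_corrections(shared_rows, local_rows, columns):
    """Copy the given columns from local_rows into shared_rows.

    Rows are dicts (e.g. from csv.DictReader). Entries are paired on the
    (PackName, PackIndex) key; rows without a local counterpart, and
    local cells that are empty, leave the shared value untouched.
    """
    def key(row):
        return (row["PackName"], row["PackIndex"])

    local_by_key = {key(row): row for row in local_rows}
    merged = []
    for row in shared_rows:
        updated = dict(row)  # do not mutate the caller's rows
        fix = local_by_key.get(key(row))
        if fix is not None:
            for col in columns:
                if fix.get(col):  # skip missing or empty corrections
                    updated[col] = fix[col]
        merged.append(updated)
    return merged
```

For example, `merge_corrections(shared, mine, ["Sentence"])` would return a copy of the shared rows with your non-empty "Sentence" cells pasted in, without needing either sheet to be sorted first.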
The Core6k index and word/sentence audio files are based on guesses. Effectively, the word's expression and reading must both match exactly. When there are multiple potential matches in Core, it first tries the first one that has exactly the same meaning, then the first one whose meaning contains one of the entry's meanings, and failing that it just grabs the first match (ignoring meaning).
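For anyone who wants to reproduce or tweak the guessing, here is a minimal sketch of that fallback order. It is not the actual matching code: the `CoreEntry` type, the `find_core_match` name, the ";" separator between meanings, and the case-insensitive comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CoreEntry:
    # Hypothetical shape of a Core6k row; field names are assumptions.
    index: int
    expression: str
    reading: str
    meaning: str


def find_core_match(expression: str, reading: str, meaning: str,
                    core: Sequence[CoreEntry]) -> Optional[CoreEntry]:
    # Expression and reading must both match exactly.
    candidates = [c for c in core
                  if c.expression == expression and c.reading == reading]
    if not candidates:
        return None  # treated as a new word

    # 1. First candidate with exactly the same meaning.
    target = meaning.strip().lower()
    for c in candidates:
        if c.meaning.strip().lower() == target:
            return c

    # 2. First candidate whose meaning contains one of the entry's
    #    meanings (assumed here to be separated by ";").
    parts = [p.strip().lower() for p in meaning.split(";") if p.strip()]
    for c in candidates:
        core_meaning = c.meaning.lower()
        if any(p in core_meaning for p in parts):
            return c

    # 3. Otherwise take the first match, ignoring meaning.
    return candidates[0]
```

Step 2 uses plain substring containment, so a short meaning like "hot" will also match inside longer words; that is one reason the resulting index is only a guess and worth checking by hand.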
If anyone can think of useful fields to add or has a better suggestion on how to collaboratively correct the issues, please suggest them.
Edit: My matching code mentioned above identifies 5792/9669 as in Core6k and 3877/9669 as new (with 4 possible duplicates: same expression, reading, and English meaning).
Last edited by overture2112 (2011 January 26, 2:21 am)