cbJisho - E-J Dictionary Based on Word Frequency

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

cbJisho is an English-to-Japanese dictionary that sorts results by relative frequency in blogs, newspapers and novels. You can filter words by JLPT level (1-5), by whether they are JDIC "common" words, and by the character length of the definition. You can also search by regular expression or by whole word only. Search text can be in English (in which case the definitions are searched) or in Japanese (in which case the kanji and kana fields are searched), or you can use a simple SQL query. Results can be saved to the clipboard or to a file.

For Windows users:
Download cbJisho v3.1 via Google Code

For Linux users:
Python 2 Version:
Download cbJisho v3.1 source code python2 via Google Code

Python 3 Version:
Download cbJisho v3.1 source code python3 via Google Code
Note: I haven't actually tested the Python 3 version on Linux

Documentation:
View the documentation for cbJisho (includes screenshots)

Last edited by cb4960 (2011 April 24, 12:18 am)

Zarxrax Member
From: North Carolina Registered: 2008-03-24 Posts: 949

This looks interesting, I'll give it a try later.
Is it possible to browse words by frequency, rather than searching? (basically to obtain a list of vocab to learn)

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

Zarxrax wrote:

This looks interesting, I'll give it a try later.
Is it possible to browse words by frequency, rather than searching? (basically to obtain a list of vocab to learn)

You can press the search button while leaving the search text blank.

Although, for that purpose, it would probably be easier to look at Dict/edict_processed.txt. This is the "database" that cbJisho uses. The entries in that file are sorted by most frequent to least frequent based on Overall frequency (the average of the blog, newspaper and novel frequencies). You might want to open it in Excel and remove some of the frequency columns though.
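To illustrate, sorting by Overall frequency amounts to something like this (the data and column layout here are made up for the example; the real file may differ):

```python
# Overall frequency is described above as the average of the blog,
# newspaper and novel frequencies; entries are then sorted from most
# to least frequent. Illustrative data only.
entries = [
    # (word, blog, newspaper, novel)
    (u'謝る', 92.1, 88.4, 95.0),
    (u'王女', 70.5, 60.2, 91.3),
    (u'欲望', 85.0, 90.1, 89.9),
]

def overall(entry):
    word, blog, newspaper, novel = entry
    return (blog + newspaper + novel) / 3.0

ranked = sorted(entries, key=overall, reverse=True)
```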

Last edited by cb4960 (2011 January 22, 7:25 pm)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Cool! Could use it as a kind of 'word validator' (re: Globefish Language Tool's 'phrase validation'), perhaps. But what's wrong with Edict for J-E?

Have you thought about trying to use the list FooSoft made based on the 'totally innocent' books thread? I never really trusted that pomax thing myself, as I noted a couple of years ago, specifically because we were never told what sources were used. Just "lots of books I found somewhere and you'll never know what they were".

There's also the Wikipedia list.

Somehow I feel like this brings us closer to having a framework to integrate with subs2srs with regards to frequency and user-customized study lists.

Last edited by nest0r (2011 January 22, 8:02 pm)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

http://forum.koohii.com/viewtopic.php?pid=94075#p94075

Or more specifically: http://forum.koohii.com/viewtopic.php?pid=94092#p94092

Edit: Damn! Just noticed that's for kanji only. Nevermind. ;p

Last edited by nest0r (2011 January 22, 8:28 pm)

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

nest0r wrote:

Cool! Could use it as a kind of 'word validator' (re: Globefish Language Tool's 'phrase validation'), perhaps.

That's kinda what I was thinking. It can help for those times when you hesitate to write a word because you don't know if it's the "right" word.

nest0r wrote:

But what's wrong with Edict for J-E?

The existence of Kenkyusha makes it difficult to recommend EDICT for J-E. But on second thought, such a statement isn't necessary for the description.

nest0r wrote:

Have you thought about trying to use the list FooSoft made based on the 'totally innocent' books thread? I never really trusted that pomax thing myself, as I noted a couple of years ago, specifically because we were never told what sources were used. Just "lots of books I found somewhere and you'll never know what they were".

There's also the Wikipedia list.

As you stated in your second post, those lists are for kanji only. If you know of any other frequency lists though, I'll be happy to incorporate them. I agree with you about the pomax list. It would have been nice if he gave us a few more details.

nest0r wrote:

Somehow I feel like this brings us closer to having a framework to integrate with subs2srs with regards to frequency and user-customized study lists.

As always, if you come up with any specific ideas, I'd love to read them.

Last edited by cb4960 (2011 January 22, 8:46 pm)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

hehe

Working on it. Still fuzzy. Maybe start with smaller steps. Exporting/importing lists, and multiple/automated look-up functions in cbJisho?

I'm tentatively speculating that it could be an extra constraint when generating subs2srs cards, narrowing the results by frequency range, perhaps? But I suppose this would involve a possibly overcomplicated process of importing subtitles to derive a variable/custom frequency range, then exporting the results to filter subs2srs.

Also, something to do with a personal database based on processed subtitles. Fuzzy there also.

I might be trying to think too much of using cbJisho as a kind of tool for utter n00bs like me to dynamically produce secondhand frequency lists from material they input.

Edit: Or perhaps taking the subs2srs results and processing them with cbJisho to derive definitions arranged by frequency, to aid prioritization of study? Something involving fields and tags or somesuch?

Last edited by nest0r (2011 January 22, 9:21 pm)

FooSoft Member
From: Seattle, WA Registered: 2009-02-15 Posts: 513 Website

Cool, after some tweaking I got it to run on my Linux computer. Pretty crazy doing this in Python 3, lol; nobody runs that on Mac or Linux yet. I only had to make a couple of tiny changes to get it running with Python 2, though, so you might consider just supporting Python 2 for us non-Windows users.

I basically changed your open calls to codecs.open, and added a unicode cast to gatherInterfaceData on this line (which sadly won't work on python3):
self.searchText = unicode(self.ui.lineEditSearchText.text())
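Roughly, the difference codecs.open makes (illustrative path and contents; codecs.open behaves the same on Python 2 and 3, returning decoded strings rather than raw bytes):

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Write a small UTF-8 file, then read it back decoded. On Python 2 the
# built-in open() would return raw bytes here; codecs.open yields unicode.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'謝る\t8297\n')
with codecs.open(path, 'r', encoding='utf-8') as f:
    line = f.readline()
```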

Also, when specifying the interpreter you can't have a space after the #, so you might want to change it to:
#!/usr/bin/env python3

Or #!/usr/bin/env python2 if you change it to be Python 2. Note that you don't need the minor version number, because on Linux you usually just have either python2 or python3 installed (or both), and python2/python3 are just symlinks to the real executables. Previously you could just use "python", but now some distros are actually defaulting that to Python 3 even though there still aren't a lot of libraries compatible with Python 3.

Looking good!

Last edited by FooSoft (2011 January 23, 12:13 pm)

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

FooSoft wrote:

Cool, after some tweaking I got it to run on my Linux computer. Pretty crazy doing this in Python 3, lol; nobody runs that on Mac or Linux yet. I only had to make a couple of tiny changes to get it running with Python 2, though, so you might consider just supporting Python 2 for us non-Windows users.

I basically changed your open calls to codecs.open, and added a unicode cast to gatherInterfaceData on this line (which sadly won't work on python3):
self.searchText = unicode(self.ui.lineEditSearchText.text())

It might be apparent from the code, but I'm a Python n00b, so tips like this are much appreciated. Since I'm not using many Python 3-specific features, I'll release any future versions for Python 2, as per your suggestion. From my n00b perspective, the changes from 2 to 3 don't seem significant enough to warrant the loss of backwards compatibility, but that's a discussion for a different forum.

FooSoft wrote:

Also, when specifying the interpreter you can't have a space after the #, so you might want to change it to:
#!/usr/bin/env python3

Or #!/usr/bin/env python2 if you change it to be Python 2. Note that you don't need the minor version number, because on Linux you usually just have either python2 or python3 installed (or both), and python2/python3 are just symlinks to the real executables. Previously you could just use "python", but now some distros are actually defaulting that to Python 3 even though there still aren't a lot of libraries compatible with Python 3.

Looking good!

Again, thanks for the info. I just copy-pasted the shebang from the python3.1 tutorial:
http://docs.python.org/py3k/tutorial/in … on-scripts

Reply #10 - 2011 January 23, 4:16 pm
cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

I just released version 1.0 of cbJisho with python 2 source code for Linux/Mac users:

Download cbJisho v1.0 source code python2 via MediaFire

I've tested it under Linux Mint and it seems to work okay.

The compiled Windows version will continue to use python 3 however. For some reason, on my copy of Windows XP, the python 2 version runs about 2x slower than the python 3 version. I'm not sure why yet.

Reply #11 - 2011 January 30, 5:14 pm
cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

Hello,

I have just released version 2.0 of cbJisho.

For Windows users:
Download cbJisho v2.0 via MediaFire

For Linux/Mac users:
Python 2 Version:
Download cbJisho v2.0 source code python2 via MediaFire
Python 3 Version:
Download cbJisho v2.0 source code python3 via MediaFire

What changed?
cbJisho now uses an sqlite database for searches. This means that searches are much faster than they were before. (On my slow netbook, search times went from seconds to near-instant.)
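Roughly, the idea looks like this (the table and column names here are illustrative, not the actual schema):

```python
import sqlite3

# In-memory stand-in for the dictionary database: entries are indexed
# once, then lookups become plain (and fast) SQL queries.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entries (kanji TEXT, kana TEXT, definition TEXT)')
conn.execute("INSERT INTO entries VALUES ('謝る', 'あやまる', 'to apologize')")
conn.execute("INSERT INTO entries VALUES ('王女', 'おうじょ', 'princess')")

# English search text is matched against the definition field.
rows = conn.execute(
    "SELECT kanji FROM entries WHERE definition LIKE ?",
    ('%apologize%',)).fetchall()
```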

cb4960

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

Hello,

I have just released version 2.1 of cbJisho.

For Windows users:
Download cbJisho v2.1 via MediaFire

For Linux/Mac users:
Python 2 Version:
Download cbJisho v2.1 source code python2 via MediaFire
Python 3 Version:
Download cbJisho v2.1 source code python3 via MediaFire

What changed?
I updated the novel column of the database.

Previous versions used the frequency information provided here. However, based on Pomax's explanation found in that link, the validity of those frequencies is a bit questionable.

So I decided to generate my own novel frequency list. The new novel frequency list is based on 5109 novels. The list of novels used: http://pastebin.com/VLJpTREd. The first 50 lines and last 20 lines were removed from each file so that things like table of contents, copyright and publisher information were not parsed. The readings (between 《 and 》) were also removed.
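In Python terms, that preprocessing amounts to something like this (function and names are illustrative):

```python
import re

# Aozora-style readings are enclosed between 《 and 》.
READING = re.compile(u'《[^》]*》')

def preprocess(lines, chop_front=50, chop_back=20):
    # Drop the first 50 and last 20 lines of each file (table of contents,
    # copyright and publisher information), then strip the readings.
    if len(lines) <= chop_front + chop_back:
        return []
    body = lines[chop_front:len(lines) - chop_back]
    return [READING.sub(u'', line) for line in body]
```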

The utility that I wrote to create the frequency list will be given in the following post.

cb4960

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

This tool is deprecated. Use cb's Japanese Text Analysis Tool instead. Thread for cb's Japanese Text Analysis Tool.

Here is the utility that I mentioned in the previous post. I call it cb's Japanese Word Frequency List Generator.

Download cb's Japanese Word Frequency List Generator v1.1 via MediaFire

Download cb's Japanese Word Frequency List Generator v1.1 Source Code via MediaFire

Here is the custom Mecab user dictionary that I created. Used together with the default Mecab dictionary, it adds several thousand words:
Download custom Mecab user dictionary via MediaFire

http://subs2srs.sourceforge.net/JapaneseWordFrequencyListGenerator/main_v1.1.png

From the readme file:

What is cb's Japanese Word Frequency List Generator?
As the name implies, it is a utility for generating word frequency lists from
files that contain Japanese text.

The input to the program is a directory that contains the .txt files with Japanese
text (files from subdirectories will also be used).

The output of the program is a text file with each line consisting of a Japanese
word that was encountered in the input files and a number indicating the number
of times that the word was encountered. Example:

...
8297    謝る
8290    王女
8284    欲望
...

In this example, 謝る was encountered 8297 times in the input files, 王女 was
encountered 8290 times, etc.

How to Install and Launch:
1) Unzip cb's Japanese Word Frequency List Generator.
2) In the unzipped directory, simply double-click
   JapaneseWordFrequencyListGenerator.exe.
   
   Note: Requires the .Net Framework.

How to Use:
It's fairly straightforward, just fill out each of the fields and press the Start
button. The fields are explained below:

Root Directory:
  The directory that contains the .txt files to analyze. Files from subdirectories
  will also be used.

Encoding:
  The encoding of the input files.

Output File:
  The output file that will contain the frequency information.

Mecab Location:
  The location of Mecab. Mecab is the sophisticated morphological analyzer that this
  program relies on to extract words from the input files. You can download it from: 

  https://sourceforge.net/projects/mecab/ … cab-win32/

  During the installation of Mecab, you will be prompted for the encoding to
  use. Select UTF-8.

Mecab User Dictionary Location (Optional):
  If you have a user dictionary that you would like to use in addition to the default Mecab
  dictionary, you may enter it here. Leave blank if you don't have a user dictionary.

Chop Front:
  The # of lines to ignore from the start of each file. Can be useful for removing
  TOC and other header information.

Chop Back:
  The # of lines to ignore from the end of each file.
  Can be useful for removing copyright and publisher information.

To customize the field defaults, edit the "settings.txt" file that is in the
same directory as the .exe file.

How Much Time Does the Analysis Take?
On my 3.5-year-old computer, analysis of 5109 Japanese novels took 34 minutes.
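In Python terms (the tool itself is a .Net program, so this is just an illustrative sketch), the counting and output step described in the readme amounts to:

```python
from collections import Counter

def frequency_lines(token_streams):
    # Tally every word across all input files (one token list per file),
    # then emit "count<TAB>word" lines from most to least frequent,
    # matching the output format shown above.
    counts = Counter()
    for tokens in token_streams:
        counts.update(tokens)
    return [u'%d\t%s' % (n, w) for w, n in counts.most_common()]
```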

cb4960

Last edited by cb4960 (2012 May 27, 1:12 pm)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

I just had one of those manga nosebleeds, but from looking at these updates.

Last edited by nest0r (2011 February 05, 3:19 pm)

Nukemarine Member
From: 神奈川 Registered: 2007-07-15 Posts: 2347

First, awesome work as always, CB. Second, care to post the generated frequency list (top 30,000 words only)?

I've also been wondering about something thanks to Yomichan's interface. However, I'll post that on the subs2srs thread.

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

Nukemarine wrote:

First, awesome work as always, CB. Second, care to post the generated frequency list (top 30,000 words only)?

No problem:

List used for cbJisho 3.1:
http://www.mediafire.com/?b6lj0y483g7q1d4

List used for cbJisho 3.0:
http://www.mediafire.com/?nlgz8lp8zey278u

List used for cbJisho 2.0:
http://www.mediafire.com/?r4uc88nwh1adrdr

Last edited by cb4960 (2011 April 25, 10:00 pm)

deign Member
From: Paris Registered: 2010-02-17 Posts: 31

Awesome!!

I hope this will be integrated in Anki also one day!

belton New member
From: UK Registered: 2008-02-05 Posts: 5 Website

Very interesting. Thanks for posting.
Unfortunately I haven't had much success running these on my Mac, either in Python or under Windows emulation.

I had a look at the output you posted.
I would suggest that the frequency list generator filter out punctuation, roman characters and possibly single kana. (Is a particle really a word?)

Where do you get the JLPT levels from? I had thought that official listings were not going to be published for the new levels.

>> 100 - ((Relative frequency / total # of unique relative frequencies) * 100)

How are you defining "relative frequency" in your equation? At the moment it doesn't make sense to me. Why isn't the relative frequency expressed as a percentage enough information? What is achieved by the "total # of unique relative frequencies"?

Apologies if I'm just being statistics challenged.

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

belton wrote:

Very interesting. Thanks for posting.
Unfortunately I haven't had much success running these on my Mac, either in Python or under Windows emulation.

Post the error message and I'll take a look.

belton wrote:

I had a look at the output you posted.
I would suggest that the frequency list generator filter out punctuation, roman characters and possibly single kana. (Is a particle really a word?)

I suppose I can add an option to remove such items. Note that for cbJisho, I only include words that exist in EDICT (though EDICT does include particles).

belton wrote:

Where do you get the JLPT levels from? I had thought that official listings were not going to be published for the new levels.

I extracted the JLPT levels from the corePlus deck made by rachels.

belton wrote:

>> 100 - ((Relative frequency / total # of unique relative frequencies) * 100)

How are you defining "relative frequency" in your equation? At the moment it doesn't make sense to me. Why isn't the relative frequency expressed as a percentage enough information? What is achieved by the "total # of unique relative frequencies"?

Apologies if I'm just being statistics challenged.

If you look at the output that I've posted, you'll notice that some words have the same number of hits. By "total # of unique relative frequencies", I mean the count of unique hits. For example, in the novel frequency list, there are 7676 unique hits.

"Relative frequency" is the position of a word within the range [1 - count_of_unique_hits]. Words closer to 1 are more frequent than words closer to count_of_unique_hits.

The calculation merely transforms the [1 - count_of_unique_hits] scale to a [0 - 100] scale.
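As a rough sketch (names are mine, not the actual code):

```python
def frequency_score(all_hit_counts, word_hits):
    # "Relative frequency" is the word's rank among the unique hit counts
    # (1 = most frequent). The score rescales that rank so the most
    # frequent words land near 100 and the least frequent land at 0.
    unique = sorted(set(all_hit_counts), reverse=True)
    rank = unique.index(word_hits) + 1
    return 100.0 - (float(rank) / len(unique)) * 100.0
```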

belton New member
From: UK Registered: 2008-02-05 Posts: 5 Website

cb4960 wrote:

Post the error message and I'll take a look.

On the Mac, I think it might be my version of Python (2.5), which could be because I'm running 10.5.8 and not 10.6.x.
For what it's worth, it says:
Syntax error in script : invalid syntax(cbJisho.py, line 183)

cb4960 wrote:

If you look at the output that I've posted, you'll notice that some words have the same number of hits. By "total # of unique relative frequencies", I mean the count of unique hits. For example, in the novel frequency list, there are 7676 unique hits.

"Relative frequency" is the position of a word within the range [1 - count_of_unique_hits]. Words closer to 1 are more frequent than words closer to count_of_unique_hits.

The calculation merely transforms the [1 - count_of_unique_hits] scale to a [0 - 100] scale.

I don't quite follow. But I'm willing to take your logic as given.

I'd have thought ("number of instances of a word" / "total number of words in the sample" )*100 would have given a relative frequency expressed as a percentage.

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Hey cb4960, O tireless programmer... per this thread: http://forum.koohii.com/viewtopic.php?pid=133031

Plus, what I mentioned before about being able to give cbJisho multiple words for batch processing (to generate a list of definitions). How about I do some speculation on possibilities?

Let's see, instead of relying on WWWJDIC for sentence glossing, how about combining the tools in this thread and expanding them so one could input a text or texts, generate a list of the words, then input them into cbJisho to derive a list of definitions, perhaps arranged by not just the novel frequency in the generator tool, but multiple ones (with or without statistical information?) as cbJisho uses (blogs, JLPT, novels, etc.).

And/or perhaps it could be compared against a user database so it's not just generating frequencies from 5000 novels or a handful, but you could calculate the frequency of a text or texts and compare them against 1-5000 other texts or whathaveyou, perhaps Anki information (seen cards, cards of a certain maturity, whatever).

Anyway sorry just rambling. All these tools you've posted have gotten me excited about possibilities again.

Oh, and: http://language.tiu.ac.jp/tools_e.html - We might end up with something like this but offline and in English, with more customizable options?

I really like the idea of colour-coding in readers too, this is very tangential but I'm thinking of gradients to aid in focusing on best/least known items (per my 60/30/10 media ecology or the 80/20 media principle thing Cranks mentioned). Of course they could be distracting so customizing colours and being able to toggle them would be interesting. Oh! And tying it together with yomichan, JDIC audio, Kage Shibari, a card reminder plugin, JNF, etc. etc. Okay way off the rails now, I better study some actual Japanese.

Last edited by nest0r (2011 February 11, 3:28 pm)

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

belton wrote:

cb4960 wrote:

Post the error message and I'll take a look.

On the Mac, I think it might be my version of Python (2.5), which could be because I'm running 10.5.8 and not 10.6.x.
For what it's worth, it says:
Syntax error in script : invalid syntax(cbJisho.py, line 183)

cb4960 wrote:

If you look at the output that I've posted, you'll notice that some words have the same number of hits. By "total # of unique relative frequencies", I mean the count of unique hits. For example, in the novel frequency list, there are 7676 unique hits.

"Relative frequency" is the position of a word within the range [1 - count_of_unique_hits]. Words closer to 1 are more frequent than words closer to count_of_unique_hits.

The calculation merely transforms the [1 - count_of_unique_hits] scale to a [0 - 100] scale.

I don't quite follow. But I'm willing to take your logic as given.

I'd have thought ("number of instances of a word" / "total number of words in the sample" )*100 would have given a relative frequency expressed as a percentage.

When I was figuring out how to display the frequency numbers, I had a few goals in mind:
1) Make them accurate
2) Make them pretty
3) Make it easy to compare columns

Expressing the frequencies as percentages was actually my first approach. However, since the total number of words encountered is so large, nearly all of the percentages were very close to 0. This approach failed goals 2 and 3.

In response to this, I decided to display relative frequencies instead. For example, for novels I displayed a number 1-7676. The problem I found with this is that I had to remember the maximum for each column (blog, newspaper, novel). And I didn't like the idea of displaying the totals in the GUI. It was also kind of difficult to compare the columns without doing some calculations in my head. This approach definitely failed goal 3.

My solution was to convert the 1-7676 range (for example) to 0.00-100.00 with 100.00 being the most frequent. This way I could easily compare relative frequencies of a word for each column.

Perhaps I should make it look less like a percentage to avoid confusion. Maybe a 0-100 range without the decimals would be sufficient (or 0-1000 scale if I wanted a bit more precision).

Anyway, I'll think about it for the next version. Perhaps I'm missing something. Also, if you feel strongly about a particular solution, please feel free to convince me.

Maybe in the end I'll implement all approaches and have it be user selectable.

cb4960 Member
From: Los Angeles Registered: 2007-06-22 Posts: 917

nest0r wrote:

Hey cb4960, O tireless programmer... per this thread: http://forum.koohii.com/viewtopic.php?pid=133031

Plus, what I mentioned before about being able to give cbJisho multiple words for batch processing (to generate a list of definitions). How about I do some speculation on possibilities?

Let's see, instead of relying on WWWJDIC for sentence glossing, how about combining the tools in this thread and expanding them so one could input a text or texts, generate a list of the words, then input them into cbJisho to derive a list of definitions, perhaps arranged by not just the novel frequency in the generator tool, but multiple ones (with or without statistical information?) as cbJisho uses (blogs, JLPT, novels, etc.).

And/or perhaps it could be compared against a user database so it's not just generating frequencies from 5000 novels or a handful, but you could calculate the frequency of a text or texts and compare them against 1-5000 other texts or whathaveyou, perhaps Anki information (seen cards, cards of a certain maturity, whatever).

Anyway sorry just rambling. All these tools you've posted have gotten me excited about possibilities again.

Oh, and: http://language.tiu.ac.jp/tools_e.html - We might end up with something like this but offline and in English, with more customizable options?

I really like the idea of colour-coding in readers too, this is very tangential but I'm thinking of gradients to aid in focusing on best/least known items (per my 60/30/10 media ecology or the 80/20 media principle thing Cranks mentioned). Of course they could be distracting so customizing colours and being able to toggle them would be interesting. Oh! And tying it together with yomichan, JDIC audio, Kage Shibari, a card reminder plugin, JNF, etc. etc. Okay way off the rails now, I better study some actual Japanese.

Introducing cb's nest0r Stream of Consciousness Parser v1.0 ... wink

Paragraphs 1-3 are definitely doable.

Paragraph 4 is somewhat confusing to me, but basing output on Anki decks (somehow) is a novel idea.

That's an interesting pair of tools in paragraph 6. The Dictionary Tool is kind of like a rikaichan pre-parser. The Level Checker is also interesting. You can have a scenario where a beginner wants to know which of, say, 5000 novels would be a good one to read with regard to that i+1 thingy. Said beginner could run those 5000 novels through the tool and get a sorted difficulty ranking.

The way I see Paragraph 7 is that the user could focus on words that matter for them at whatever level they are currently at. Or sometimes I get into the situation where I'm not sure if it's even worth putting effort into trying to remember a word I've encountered because I'm not sure I'll ever see it again. This tool could help with that among other things.

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

I need that nest0r parser.

Yeah for the Anki basis, perhaps it could somehow have to do with tagging cards based on their intervals (i.e. determining young/mature... based on what I've said about a card maturity reminder plugin and the 60/30/10 thing, I think a 3-part basis is good, like 21 and under, 22-120, and 120+, or even something more refined?). Anyway, since you can export with tags, perhaps exporting the deck with those tags and then using all that information could work? Though then there'd be keeping things up-to-date to worry about...

At any rate, as for the 4th paragraph, I was speaking in the context of a level-checking type mechanism. Rather than having only a master list to compare against (this also goes for the stats in cbJisho definitions) consisting of the 5000 novels, JLPT, blogs, Anki, etc., users could generate a list from a smaller number of, say, books by a certain author, or a genre, or whatever, and then use that generated list as the master list that single texts might be compared against (whatever that comparison might entail: stats/percentages/ratios, eliminating duplicates or eliminating non-duplicates, etc.).

Maybe I'm thinking about that all wrong, though. Anyway, that was something in addition to/other than the general glossing/level-checking thing.

So to put it another way, when thinking of all these possibilities, the context would be, in that set of master lists of JLPT, 5000 novels, blogs, JDIC common words, Anki... the extra category of ‘custom’.

Last edited by nest0r (2011 February 11, 8:47 pm)

belton New member
From: UK Registered: 2008-02-05 Posts: 5 Website

cb4960 wrote:

My solution was to convert the 1-7676 range (for example) to 0.00-100.00 with 100.00 being the most frequent. This way I could easily compare relative frequencies of a word for each column.

Ok, I understand now. I'd think of that as a relative ranking. The only thing I'd say about the approach is that while it tells you word A is more frequent than word B in a given sample, it doesn't necessarily let you compare frequency across samples. Word A could be the highest ranked in two samples yet have a different frequency in each, i.e. it tells you word A is more frequent than word B, but not how frequent. (Mind you, that's how kanji frequencies are presented in works such as the KKLD, which even goes as far as breaking 2000 kanji into 4 groups.)

I haven't had the chance to run your novel output through a spreadsheet, but I did notice that after about 5000 words (after stripping punctuation etc.) there is less than one usage per novel on average. I suspect the distribution isn't uniform, though; those words may well cluster in a particular genre or even a single novel. This is from an original list of about 126,500 entries.
Maybe the larger the sample, the flatter the distribution curve for much of the data.
In a general list, after about 500 words there's not much between them, and one word is about as likely as any other nearby. In the top 100 we're seeing grammatical constructs (ある、する、です、この、その、etc.) rather than more specific nouns, verbs and adjectives.
Maybe, as a tool for learners, we are better off applying it to works we are interested in, to get a personal frequency ranking of words to concentrate on.

This flatness of distribution may be why the studies I've seen look at kanji frequency rather than word frequency. It's all of interest though, and it's great that you are making tools to look at these things.