Combining the dataset on kanji review failure rates kindly provided by Fabrice with the kanji data in Edict, I constructed a dataset on the RTK1 kanji, consisting of:
Here are a few simple numbers.
First, some summary statistics:
And the correlation matrix:
10 best-remembered kanji, best to worst:
And 20 worst-remembered, worst to best:
There are quite a few kanji with a high Heisig number here, especially among the 10 worst. This could be due to Dr. Heisig leaving the harder kanji for later in the text, or perhaps people just haven't learned them well enough yet; there may be many other explanations:
One also wonders, first, how kanji difficulty is affected by the predictable factors and, second, which kanji are the most difficult after correcting for those factors -- i.e. which kanji seem more difficult than they should be.
It is possible that those would be the kanji that could benefit from better stories the most.
First, let's fit a simple linear model:
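As we can see, all the coefficients are highly significant. Kanji that appear later in the book, are taught in a higher Japanese school grade, or have more strokes are harder to recall, as expected. The frequency coefficient is negative, though: if frequency is a newspaper rank (1 = most frequent), as the values in the lists suggest, then rarer kanji are slightly easier to recall once grade and the other factors are held fixed, even though the raw correlation between frequency and failure rate is positive (0.32).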
This model drops 63 of the 2,042 kanji, because some of them either have no assigned school grade or do not appear in the newspaper frequency dataset. I introduce two dummies (GRADEMISS and FREQMISS, equal to 1 for the missing observations and 0 elsewhere) to account for those. In addition, I add two more variables -- dummies for the kanji in Pt. 1 and Pt. 2 of the Heisig book (the first part has detailed stories by Dr. Heisig, and the second has less detailed ones; iirc, the first part is also available online):
Interestingly, the kanji NOT in the newspaper frequency dataset appear easier to recall than others. Perhaps this is because, for a rare kanji to be included in the 常用 set, it had to have a particularly simple structure or be otherwise easy. Kanji from Pt. 1 of RTK also appear somewhat easier than others (p = 0.077), but there is no effect for those in Pt. 2 (p = 0.852).
Dropping Pt.2 from the model, here are the estimation results:
And the list of 20 kanji that appear most difficult after correcting for the above factors:
As we can see these are somewhat more evenly distributed over the book length:
Now the interesting question is, of course, what can we do to make those kanji easier to recall.
Probably, the best thing would be if we could think of what stories we ourselves used there, whether they helped, and if they did, whether they could be shared with the community.
For comparison, here is also a list of 10 easiest kanji after controlling for the factors in the last model:
Pepe might also be interested in checking if he can find any systematic difference between the stories that are available for these 'best' and 'worst' kanji.
Code:
  obs:         2,042
 vars:             9
 size:        98,016 (99.1% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
framenum        int    %8.0g                  Heisig No.
sfc             int    %8.0g                  Failures
ssc             int    %8.0g                  Successes
str             int    %8.0g                  Reviews
tt              float  %9.0g                  Failure rate
frequency       int    %8.0g                  Frequency
grade           byte   %8.0g                  Grade
strokecount     byte   %8.0g                  Strokes
englishmeaning  str28  %28s
-------------------------------------------------------------------------------
Code:
. su sfc ssc str tt frequency grade strokecount
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sfc |      2042     141.454    113.8418          3        848
         ssc |      2042    723.1572    597.9186        132       3293
         str |      2042    864.5916    672.2959        169       3338
          tt |      2042    .1833305    .0793473   .0037594   .3950617
   frequency |      2007    1028.071    616.4795          1       2495
-------------+--------------------------------------------------------
       grade |      2004    5.915669    2.408873          1          9
 strokecount |      2042    10.30852    3.782139          1         23
Code:
. cor sfc ssc str tt frequency grade strokecount
(obs=1979)
             |      sfc      ssc      str       tt freque~y    grade stroke~t
-------------+---------------------------------------------------------------
         sfc |   1.0000
         ssc |   0.5955   1.0000
         str |   0.6991   0.9907   1.0000
          tt |   0.2847  -0.4171  -0.3231   1.0000
   frequency |   0.0813  -0.1552  -0.1244   0.3182   1.0000
       grade |   0.1483  -0.2756  -0.2202   0.5346   0.6821   1.0000
 strokecount |   0.0726  -0.3328  -0.2839   0.5087   0.2044   0.3323   1.0000
Code:
     +----------------------------------------------------------------------+
     | framenum   sfc    ssc    str  freque~y  grade  stroke~t   englishm~g |
     |----------------------------------------------------------------------|
  1. |      768     3    795    798       131      1         3     mountain |
  2. |      286     9   1516   1525       333      1         7          car |
  3. |      595     7    920    927       157      2         4        heart |
  4. |      195    14   1751   1765       317      1         4         tree |
  5. |      951     6    618    624         5      1         2       person |
     |----------------------------------------------------------------------|
  6. |     1616     3    301    304       452      2         8        gates |
  7. |      107    25   2048   2073         7      1         3        large |
  8. |        2    40   3265   3305         9      1         2          two |
  9. |        3    41   3235   3276        14      1         3        three |
 10. |       14    39   3015   3054        90      1         5   rice field |
     +----------------------------------------------------------------------+
Code:
     +-----------------------------------------------------------------------+
     | framenum   sfc    ssc    str  freque~y  grade  stroke~t   englishme~g |
     |-----------------------------------------------------------------------|
  1. |     1908    96    147    243      1097      8        11   explanation |
  2. |     2000    89    137    226      1082      8        16     recommend |
  3. |     1563   143    221    364      1723      8        16           sew |
  4. |     1789   116    182    298      1536      8        18        appear |
  5. |     1733   115    182    297      1549      8        11        solemn |
     |-----------------------------------------------------------------------|
  6. |     1914    89    144    233         .      8        10      decrease |
  7. |     1577   135    220    355       830      6        12     diligence |
  8. |      766   380    622   1002      1278      8        10         tempt |
  9. |     1954    83    138    221      1903      8        15       entrust |
 10. |      631   437    731   1168      1682      8        16       remorse |
     |-----------------------------------------------------------------------|
 11. |     1939    78    135    213      1064      8        13    reputation |
 12. |     1394   160    282    442       889      8        20       suspend |
 13. |     1562   139    247    386      1836      8        10        summit |
 14. |     1803    98    176    274      1737      8        10       respect |
 15. |     1372   157    283    440      1291      8        15      affinity |
     |-----------------------------------------------------------------------|
 16. |      387   587   1060   1647       897      8        12       surpass |
 17. |     1570   126    228    354      1281      8        10      peaceful |
 18. |      998   251    455    706      2073      8        15        praise |
 19. |     1969    76    138    214       624      6        12    concerning |
 20. |     1841    97    178    275       905      8         8     residence |
     +-----------------------------------------------------------------------+
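The ordering in the best- and worst-remembered lists can be rechecked from the listed counts. Here is a minimal Python sketch (the post itself uses Stata), assuming the failure rate tt is simply failures divided by reviews (sfc/str); that assumption reproduces the minimum (.0037594) and maximum (.3950617) in the summary statistics. The rows are a handful copied from the two lists, and the names ROWS and failure_rate are made up for this illustration.

```python
# A few (meaning, failures sfc, reviews str) rows copied from the two lists.
ROWS = [
    ("two", 40, 3305),
    ("tempt", 380, 1002),
    ("mountain", 3, 798),
    ("explanation", 96, 243),
    ("surpass", 587, 1647),
]


def failure_rate(failures, reviews):
    """Share of reviews that failed (assumed definition of tt)."""
    return failures / reviews


# Sort from best remembered (lowest failure rate) to worst remembered.
ranked = sorted(ROWS, key=lambda row: failure_rate(row[1], row[2]))

for meaning, failures, reviews in ranked:
    print(f"{meaning:12s} {failure_rate(failures, reviews):.7f}")
```

With these five rows, "mountain" comes out first and "explanation" last, matching their positions at the top of the two lists.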
Code:
[Scatter plot: Failure rate (.352727 to .395062) vs. Heisig No. (387 to 2000)]
Code:
RECALL = b0+b1*FRAMENUM+b2*GRADE+b3*STROKECOUNT+b4*FREQUENCY+eps
. regress tt framenum grade stroke frequency
      Source |       SS       df       MS              Number of obs =    1979
-------------+------------------------------           F(  4,  1974) =  386.69
       Model |  5.45293192     4  1.36323298           Prob > F      =  0.0000
    Residual |  6.95910812  1974  .003525384           R-squared     =  0.4393
-------------+------------------------------           Adj R-squared =  0.4382
       Total |    12.41204  1978  .006275046           Root MSE      =  .05937
------------------------------------------------------------------------------
          tt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    framenum |   .0000235   2.38e-06     9.86   0.000     .0000188    .0000281
       grade |   .0142809    .000791    18.05   0.000     .0127297    .0158321
 strokecount |   .0069658   .0003857    18.06   0.000     .0062094    .0077221
   frequency |  -8.71e-06   2.98e-06    -2.92   0.004    -.0000146   -2.86e-06
       _cons |   .0123982   .0045974     2.70   0.007      .003382    .0214144
------------------------------------------------------------------------------
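To get a feel for the size of these effects, each coefficient can be multiplied by a plausible change in its variable. The small Python sketch below uses the coefficients from the regression output above. The chosen shifts (1000 frames, 7 school grades, 10 strokes, 1000 steps of the frequency variable) are arbitrary illustrations and not part of the original analysis, and the names COEF, SHIFTS and effect are hypothetical.

```python
# Coefficients copied from the first regression (dependent variable: failure rate tt).
COEF = {
    "framenum": 0.0000235,
    "grade": 0.0142809,
    "strokecount": 0.0069658,
    "frequency": -8.71e-06,
}

# Illustrative changes in each regressor (arbitrary choices, not from the post).
SHIFTS = {"framenum": 1000, "grade": 7, "strokecount": 10, "frequency": 1000}


def effect(variable, change):
    """Predicted change in failure rate when one regressor moves, others held fixed."""
    return COEF[variable] * change


for variable, change in SHIFTS.items():
    print(f"{variable:12s} +{change:<5d} {effect(variable, change):+.4f}")
```

On these numbers, a grade 8 kanji is predicted to fail roughly 10 percentage points more often than a grade 1 kanji with the same frame number, stroke count and frequency, while the frame number and frequency effects are much smaller.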
Code:
RECALL = b0+b1*FRAMENUM+b2*GRADE+b3*STROKECOUNT+b4*FREQUENCY+
. regress tt framenum grade0 grademiss stroke freq0 freqmiss pt1 pt2
      Source |       SS       df       MS              Number of obs =    2042
-------------+------------------------------           F(  8,  2033) =  198.27
       Model |  5.63176092     8  .703970115           Prob > F      =  0.0000
    Residual |  7.21836221  2033  .003550596           R-squared     =  0.4383
-------------+------------------------------           Adj R-squared =  0.4361
       Total |  12.8501231  2041  .006295994           Root MSE      =  .05959
------------------------------------------------------------------------------
          tt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    framenum |   .0000201   3.48e-06     5.78   0.000     .0000133    .0000269
      grade0 |   .0141365   .0007917    17.86   0.000     .0125838    .0156893
   grademiss |   .0782822   .0118053     6.63   0.000     .0551303     .101434
 strokecount |   .0070189   .0003824    18.36   0.000      .006269    .0077688
       freq0 |  -8.70e-06   2.97e-06    -2.93   0.003    -.0000145   -2.88e-06
    freqmiss |  -.0236515   .0114678    -2.06   0.039    -.0461413   -.0011617
         pt1 |  -.0098122   .0055459    -1.77   0.077    -.0206884    .0010641
         pt2 |  -.0009667    .005195    -0.19   0.852    -.0111547    .0092213
       _cons |   .0175479   .0060571     2.90   0.004     .0056692    .0294267
------------------------------------------------------------------------------
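The post does not show how grade0, grademiss, freq0 and freqmiss were built. One common construction, assumed here, is the missing-indicator method: the missing value is replaced by 0 and a separate dummy marks the observation as missing, so no kanji is dropped from the regression. A Python sketch of that encoding follows; the function name encode is hypothetical, and the sample rows use the grade and frequency values shown in the lists, with None standing in for Stata's ".".

```python
def encode(value):
    """Return (filled_value, missing_flag); a missing value becomes (0, 1)."""
    if value is None:
        return 0, 1
    return value, 0


# (meaning, grade, frequency) as shown in the lists; None marks a missing value.
SAMPLE = [
    ("tempt", 8, 1278),
    ("decrease", 8, None),
    ("envious", None, None),
    ("crow", None, 2042),
]

encoded = {}
for meaning, grade, frequency in SAMPLE:
    grade0, grademiss = encode(grade)
    freq0, freqmiss = encode(frequency)
    encoded[meaning] = (grade0, grademiss, freq0, freqmiss)
    print(f"{meaning:10s} grade0={grade0} grademiss={grademiss} "
          f"freq0={freq0} freqmiss={freqmiss}")
```

Under this encoding the dummy's coefficient absorbs the average difference for kanji with missing values, so it should be read relative to a hypothetical kanji with grade0 or freq0 equal to 0.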
Code:
. regress tt framenum grade0 grademiss stroke freq0 freqmiss pt1
      Source |       SS       df       MS              Number of obs =    2042
-------------+------------------------------           F(  7,  2034) =  226.69
       Model |  5.63163798     7  .804519711           Prob > F      =  0.0000
    Residual |  7.21848515  2034  .003548911           R-squared     =  0.4383
-------------+------------------------------           Adj R-squared =  0.4363
       Total |  12.8501231  2041  .006295994           Root MSE      =  .05957
------------------------------------------------------------------------------
          tt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    framenum |   .0000205   2.82e-06     7.27   0.000      .000015     .000026
      grade0 |   .0141403   .0007913    17.87   0.000     .0125884    .0156921
   grademiss |   .0783013   .0118021     6.63   0.000     .0551559    .1014468
 strokecount |   .0070126   .0003808    18.42   0.000     .0062658    .0077594
       freq0 |  -8.71e-06   2.97e-06    -2.94   0.003    -.0000145   -2.89e-06
    freqmiss |  -.0237389   .0114554    -2.07   0.038    -.0462045   -.0012732
         pt1 |  -.0093114    .004848    -1.92   0.055     -.018819    .0001962
       _cons |   .0170344   .0053907     3.16   0.002     .0064626    .0276063
------------------------------------------------------------------------------
Code:
     +----------------------------------------------------------------+
     | framenum   sfc    ssc  stroke~t  grade  freque~y   englishme~g |
     |----------------------------------------------------------------|
  1. |      766   380    622        10      8      1278         tempt |
  2. |     1577   135    220        12      6       830     diligence |
  3. |     1914    89    144        10      8         .      decrease |
  4. |     1510   137    279         5      6       337        income |
  5. |     1445   131    250        12      4       769       rejoice |
     |----------------------------------------------------------------|
  6. |     1489   119    267         7      4       896          hope |
  7. |      863   248    633         5      4       857   achievement |
  8. |     1908    96    147        11      8      1097   explanation |
  9. |     1733   115    182        11      8      1549        solemn |
 10. |      936   220    494        11      4       799     salvation |
     |----------------------------------------------------------------|
 11. |     1562   139    247        10      8      1836        summit |
 12. |      387   587   1060        12      8       897       surpass |
 13. |      553   382    844        13      .         .       envious |
 14. |     1241   157    342         5      8      1537        adroit |
 15. |     1169   183    399        12      4       515          full |
     |----------------------------------------------------------------|
 16. |     1841    97    178         8      8       905     residence |
 17. |     1803    98    176        10      8      1737       respect |
 18. |      177   608   1538        12      4       469      quantity |
 19. |     1570   126    228        10      8      1281      peaceful |
 20. |     1969    76    138        12      6       624    concerning |
     +----------------------------------------------------------------+
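"Most difficult after correcting for the above factors" presumably means ranking by regression residual: the observed failure rate minus the rate the model predicts. The Python sketch below checks this reading on four kanji from the lists, using the coefficients of the last regression. It rests on three assumptions not stated in the post: the list is sorted by residual, freq0 is 0 where frequency is missing, and pt1 is 0 for these four kanji (all have frame numbers above 700). The names COEF, KANJI, predicted and residual are made up for the example.

```python
# Coefficients copied from the last regression output (pt2 dropped).
COEF = {
    "_cons": 0.0170344, "framenum": 0.0000205, "grade0": 0.0141403,
    "grademiss": 0.0783013, "strokecount": 0.0070126, "freq0": -8.71e-06,
    "freqmiss": -0.0237389, "pt1": -0.0093114,
}

# Failures (sfc), reviews (str) and regressors copied from the lists.
# Assumed: freq0 = 0 where frequency is missing, pt1 = 0 for all four kanji.
KANJI = {
    "explanation": {"sfc": 96, "str": 243, "framenum": 1908, "grade0": 8,
                    "grademiss": 0, "strokecount": 11, "freq0": 1097,
                    "freqmiss": 0, "pt1": 0},
    "decrease": {"sfc": 89, "str": 233, "framenum": 1914, "grade0": 8,
                 "grademiss": 0, "strokecount": 10, "freq0": 0,
                 "freqmiss": 1, "pt1": 0},
    "diligence": {"sfc": 135, "str": 355, "framenum": 1577, "grade0": 6,
                  "grademiss": 0, "strokecount": 12, "freq0": 830,
                  "freqmiss": 0, "pt1": 0},
    "tempt": {"sfc": 380, "str": 1002, "framenum": 766, "grade0": 8,
              "grademiss": 0, "strokecount": 10, "freq0": 1278,
              "freqmiss": 0, "pt1": 0},
}


def predicted(row):
    """Failure rate predicted by the fitted linear model."""
    return COEF["_cons"] + sum(
        COEF[name] * row[name] for name in COEF if name != "_cons"
    )


def residual(row):
    """Observed failure rate minus the model's prediction."""
    return row["sfc"] / row["str"] - predicted(row)


by_raw = sorted(KANJI, key=lambda k: KANJI[k]["sfc"] / KANJI[k]["str"], reverse=True)
by_residual = sorted(KANJI, key=lambda k: residual(KANJI[k]), reverse=True)

print("raw failure rate:", by_raw)
print("residual:        ", by_residual)
```

For these four kanji the residual ordering (tempt, diligence, decrease, explanation) agrees with their relative order in the list above and reverses their order by raw failure rate, which is consistent with the list being sorted by residual.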
Code:
[Scatter plot: Failure rate (.281498 to .395062) vs. Heisig No. (177 to 1969)]
Code:
     +------------------------------------------------------------------+
     | framenum   sfc    ssc  stroke~t  grade  freque~y   englishmean~g |
     |------------------------------------------------------------------|
  1. |      593    28    853        11      8      1142            hemp |
  2. |      315   148   1334        19      8      1486           whale |
  3. |      551   105    925        17      8       355           fresh |
  4. |     1766    21    206        13      8      1609              Go |
  5. |     1869    16    179        11      8      1753           liner |
     |------------------------------------------------------------------|
  6. |     1623    18    293        11      6       951          closed |
  7. |     1806    12    243         4      8       339            well |
  8. |      518    55    943        11      8      2031   lightning-bug |
  9. |     2003    25    168        14      9      1105            bear |
 10. |     1944    10    209        10      .      2042            crow |
     +------------------------------------------------------------------+
Edited: 2006-11-02, 3:34 pm
