It dawned on me, though, that the mini-sims are effectively sampling from an infinite process, and my small sample sizes cause problems when evaluating advantage/reward. In effect, the machine remembers too much about optimal or advantaged spend outliers, especially those on one side of the tail.
This go-round, which probably is not the last -- I have some ideas for the future -- I am continuing with the policy-reinforcement aspect while also now "tracing" the advantage discovered during the meta-simulation. That lets me lean on the central limit theorem when looking at policy recommendations, i.e., use the mean of the sampling distribution (if I have the stats lingo right) at a given wealth/age level to watch the spend policy evolve. Every time an advantage is detected I trace/store the wealth/age/spend/utility, and then use the mean spend for that wealth/age as an indicator of what the policy should have been. The next step, in the future, would be to make that linkage more direct. TBD.
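To make that concrete, here is a minimal sketch of the trace-and-average bookkeeping in Python. The bucket size, the advantage test, and all of the names are placeholders of mine, not the actual model:

```python
from collections import defaultdict
import numpy as np

# traces[(age, wealth_bucket)] holds every spend rate that showed an
# advantage in a mini-sim for that age/wealth cell (hypothetical structure)
traces = defaultdict(list)

def record_advantage(age, wealth, spend, utility, baseline_utility, bucket=50_000):
    """Store the spend rate whenever a mini-sim beats the baseline utility."""
    if utility > baseline_utility:          # stand-in for the actual advantage test
        traces[(age, round(wealth / bucket))].append(spend)

def inferred_policy(age, wealth, bucket=50_000, default=0.04):
    """Sample mean of advantaged spends for this cell; the CLT does the smoothing."""
    cell = traces.get((age, round(wealth / bucket)), [])
    return float(np.mean(cell)) if cell else default
```

The point of the averaging step is that any one traced spend can be a fluke, but the mean over a cell should settle down as the traces pile up.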
To remind us, this is the schematic of the RL process model:
Also, as a reminder, this is where we left off -- after ~57,000 iterations, 1.14M sim-years, and 1.6 days of run time -- before the changes made for this post...
Figure 1 |
Now we change to the sample-detect-trace-average approach. I mean, just going from discrete spend chunks of 0.005 to an averaging process will smooth out the line a bit, but I still think it works better so far because it is less influenced by sample outliers. Here is the chart of the new approach for different generations of results.
Figure 2 |
This was at around 2,500 (blue) and 28,000 (green) iterations. I see two things that I have not bothered to measure too closely: 1) I can see the machine training itself as iterations go up -- that's the curve forming at later ages as it goes from blue to green, and 2) to the eye it is smoother than Figure 1, and at a much earlier stage of learning. Well, I can see it, anyway. Note, btw, that the machine starts with a presumptive 4% spend, give or take some randomness, which means that even the early learning in blue is already aware of some age sensitivity in the policy.
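For reference, that kind of starting point could be seeded with something like the minimal sketch below; the grids and noise scale are my guesses for illustration, not the actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

ages = np.arange(60, 101)                          # hypothetical age grid
wealth_buckets = np.arange(0, 3_000_001, 50_000)   # hypothetical wealth grid

# Presumptive 4% spend rate everywhere, give or take some randomness,
# so the table starts near 0.04 rather than with zero knowledge.
policy = 0.04 + rng.normal(0.0, 0.005, size=(len(ages), len(wealth_buckets)))
```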
Now this below in Figure 3 is:
a) in red, the policy spend from the last generation with the prior method (Figure 1), and
b) in blue, the current inferred spend policy, obtained by tracing and averaging the sample of spend rates from the mini-sims where an advantage is detected. Same as the green line in Figure 2.
I didn't do myself any favors by adding a year to the interval between the two methods or by nudging the Y-axis scale vs Figure 2, but I'm rolling with it anyway.
Recall from the first post that the benchmarks, shown in grey but not individually labeled here, are:
- RMD-style
- Merton with tuning
- Kolmogorov PDE solved for constant risk
- PMT method
- A rule of thumb based on age
all of which are described in the referenced link. A rough sketch of a couple of them follows below.
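Just to give a flavor, here are stylized versions of two of those benchmarks; the parameters are mine and not necessarily what was used in the chart:

```python
def rmd_style_spend(wealth, remaining_life_expectancy):
    """RMD-style: divide current wealth by remaining life expectancy in years."""
    return wealth / remaining_life_expectancy

def pmt_spend(wealth, rate, years_remaining):
    """PMT method: level annuity payment over the remaining horizon at an assumed rate."""
    if rate == 0:
        return wealth / years_remaining
    return wealth * rate / (1 - (1 + rate) ** -years_remaining)

# e.g., a 70-year-old with $1M and ~20 years of remaining life expectancy
print(rmd_style_spend(1_000_000, 20))   # 50,000 -> a 5% spend rate
print(pmt_spend(1_000_000, 0.03, 20))   # ~67,216 at an assumed 3% rate
```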
I'll leave discussion of this until later, but in general blue looks like it serves me better than red. That is encouraging me to try some other moves later, such as:
- make the link from the sampled advantages to the policy table more direct and therefore less discrete
- maybe play some statistical games, like throwing out obvious outliers within a wealth/age class (see the sketch after this list)
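For the outlier idea, something as simple as a trimmed mean within each wealth/age cell might do. A minimal sketch, assuming the traced spends for a cell are just a list of floats and picking an arbitrary trim fraction:

```python
import numpy as np

def trimmed_mean_spend(spends, trim=0.10):
    """Drop the lowest and highest `trim` fraction of traced spends, then average.

    A stand-in for "throwing out obvious outliers" within a wealth/age class.
    """
    s = np.sort(np.asarray(spends, dtype=float))
    if s.size == 0:
        return float("nan")
    k = int(s.size * trim)
    core = s[k:s.size - k] if s.size > 2 * k else s
    return float(core.mean())
```

SciPy's scipy.stats.trim_mean does the same thing if you'd rather not roll your own.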