Jan 14, 2020

Trying to increase the learning speed of my naive RL machine

In the last go-round, which started with this post and added some enhancements aimed at goosing the reinforcement aspect, my machine was slow and its output was choppy and unstable. Part of that is the rough, amateur nature of the experiment: I have an agent taking fuzzy actions in fairly discrete chunks, each evaluated with a small internal dynamic mini-simulation.

It dawned on me, though, that the mini-sims are effectively sampling from an infinite population, and my small sample sizes cause problems when evaluating advantage/reward. Effectively the machine remembers too much about "optimal" or advantaged spend outliers, especially on one side of a skewed tail.
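Here's a toy illustration of that problem -- a minimal sketch with a made-up utility curve and made-up noise, not my actual model. Give each candidate spend rate a true expected utility plus heavy-right-tailed noise and watch how much the "winning" spend bounces around when the mini-sim batches are small versus large:

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate spend rates on a coarse .005 grid and a made-up "true" expected
# utility that peaks near 4.5% -- purely illustrative numbers.
spends = np.arange(0.02, 0.0801, 0.005)
true_util = 1.0 - 400.0 * (spends - 0.045) ** 2

def winning_spend(n_sims):
    """Return the spend whose sampled mean utility comes out highest.
    Noise is lognormal (heavy right tail) to mimic skewed sim outcomes."""
    noise = rng.lognormal(mean=0.0, sigma=1.0, size=(len(spends), n_sims))
    sampled = true_util[:, None] + noise
    return spends[np.argmax(sampled.mean(axis=1))]

small = [winning_spend(20) for _ in range(500)]    # small mini-sim batches
large = [winning_spend(2000) for _ in range(500)]  # much bigger batches
print("spread of winners,   20 sims:", np.std(small))
print("spread of winners, 2000 sims:", np.std(large))
```

With small batches the tail noise dominates and the "best" spend is mostly a lottery; with bigger batches it settles toward the true optimum.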


This go-round, which is probably not the last -- I have some ideas for the future -- I continue with the policy-reinforcement aspect while also now "tracing" the advantage discovered during the meta-simulation. That lets me lean on the central limit theorem when looking at policy recommendations, i.e., use the mean of the sampling distribution (if I have the stats lingo right) at a given wealth/age level to see the evolution of the spend policy. Every time an advantage is detected I trace/store the wealth, age, spend, and utility, and then use the mean of the traced spends for that wealth/age as an indicator of what the policy should have been. The next step, in the future, would be to make that linkage to the policy more direct. TBD.
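Here is a minimal sketch of that trace-and-average idea, assuming a simple bucketed wealth/age table; the function names and the bucket width are placeholders rather than my actual code.

```python
from collections import defaultdict
import numpy as np

# Traced samples keyed by a coarse (wealth-bucket, age) state. Every time the
# meta-simulation detects an advantage, the spend that produced it gets stored.
traces = defaultdict(list)

WEALTH_BUCKET = 50_000   # assumed bucket width, purely illustrative

def record_advantage(wealth, age, spend, utility):
    """Trace/store wealth, age, spend, and utility for an advantaged outcome."""
    key = (int(wealth // WEALTH_BUCKET), int(age))
    traces[key].append((spend, utility))

def inferred_policy(wealth, age):
    """Policy recommendation = mean of traced spends for this wealth/age,
    leaning on the CLT to wash out the mini-sim sampling noise."""
    key = (int(wealth // WEALTH_BUCKET), int(age))
    if not traces[key]:
        return 0.04          # fall back to the presumptive ~4% starting spend
    return float(np.mean([s for s, _ in traces[key]]))

# e.g., after a couple of detections near $1M of wealth at age 65:
record_advantage(1_000_000, 65, 0.042, utility=12.3)
record_advantage(1_010_000, 65, 0.046, utility=12.9)
print(inferred_policy(1_005_000, 65))   # -> 0.044
```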

To remind us, this is the schematic of the RL process model:


Also, as a reminder, this is where we left off before the changes made for this post: ~57,000 iterations, 1.14M sim-years, and 1.6 days of run time...

Figure 1

...where red is the recommended spend rate at each age for a given level ($1M) of wealth at that age, taken at the end of the run. Choppy and unstable even though trained, but curving up with age as theory would predict. In past posts I hacked around this by averaging generations recorded at various checkpoints -- an inelegant solution, but it worked for those posts at the time.

Now we change to the sample-detect-trace-average approach. Just going from discrete spend chunks of .005 to an averaging process will smooth out the line a bit on its own, but I think the new approach also works better because it is less influenced by the sample outliers. Here is the chart of the new approach for different generations of results.

Figure 2


This was at around 2,500 (blue) and 28,000 (green) iterations. I see two things that I have not bothered to measure too closely: 1) the machine training itself as iterations go up -- that's the curve forming at later ages as it goes from blue to green -- and 2) a line that is, to the eye, smoother than Figure 1 at a much earlier stage of learning. Well, I can see it anyway. Note, by the way, that the machine starts with a presumptive 4% spend, give or take some randomness, which means that even the early learning in blue is already aware of some age-sensitivity in the policy.
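As a cartoon of why the averaging reads smoother and more outlier-resistant than a policy living on a .005 grid that chases good-looking single observations (a caricature of the old behavior with made-up numbers, not its literal code):

```python
import numpy as np

# Traced spends for one wealth/age cell, including a single lucky outlier.
samples = np.array([0.041, 0.043, 0.044, 0.042, 0.065])

# Caricature of the old behavior: the policy sits on a .005 grid and gets
# pulled toward the best-looking single observation.
chunk = 0.005
chunked = round(samples.max() / chunk) * chunk    # -> 0.065

# New behavior: average the traced samples; the outlier is diluted.
averaged = samples.mean()                         # -> ~0.047
print(chunked, averaged)
```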

Now this below in Figure 3 is:

   a) in red, the policy spend from the last generation of the prior method (Figure 1), and

   b) in blue, the current inferred spend policy obtained by tracing and averaging the sample of spend rates from the mini-sims where an advantage is detected. Same as green in Figure 2.

I didn't do myself any favors by adding a year to the age interval between the two methods or by nudging the Y-axis scale relative to Figure 2, but I'm rolling with it anyway.

Figure 3

Recall from the first post that the benchmarks, shown in grey here but not individually labeled, are:

RMD-style
Merton with tuning
Kolmogorov PDE solved for constant risk
PMT method
A rule of thumb based on age

which are described in the referenced link.
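For concreteness, here is roughly what two of the simpler benchmarks compute. This is a hedged sketch: the life-expectancy factors and the return/horizon assumptions below are placeholders, not the settings used in the linked post, and the Merton, PDE, and rule-of-thumb lines are more involved so I won't sketch them here.

```python
def rmd_style_spend(age, balance, life_expectancy_table):
    """RMD-style: divide the balance by a remaining-life-expectancy factor.
    The table here is a stand-in, not the actual IRS-type table."""
    return balance / life_expectancy_table[age]

def pmt_spend(balance, rate, years_remaining):
    """PMT-style: the level payment that amortizes the balance over the
    remaining horizon at an assumed return, like a spreadsheet PMT()."""
    if rate == 0:
        return balance / years_remaining
    return balance * rate / (1 - (1 + rate) ** -years_remaining)

# Placeholder life-expectancy factors: years remaining to age 95.
toy_table = {age: max(95 - age, 1) for age in range(60, 96)}
print(rmd_style_spend(70, 1_000_000, toy_table))   # -> 40,000
print(pmt_spend(1_000_000, 0.03, 25))              # -> ~57,400
```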

I'll leave fuller discussion for later, but in general blue looks like it serves me better than red. That is encouraging me to try some other moves later, such as:

 - make the link from the sampled/traced spends to the policy table more direct and therefore less discrete
 - maybe play some statistical games like throwing out obvious outliers within a wealth/age class
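On that second item, the simplest version is probably just a trimmed mean within each wealth/age class. A sketch with made-up spend samples:

```python
import numpy as np
from scipy.stats import trim_mean

# Traced spends for one wealth/age class, with a couple of high-side strays.
spends = np.array([0.040, 0.041, 0.042, 0.042, 0.043,
                   0.044, 0.044, 0.070, 0.075])

plain = spends.mean()               # pulled up by the strays  (~0.049)
trimmed = trim_mean(spends, 0.25)   # drop 25% from each tail first (~0.043)
print(plain, trimmed)
```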




