Jan 11, 2020

Update on my Reinforcement Learning Experiment

I've now run my reinforcement learning experiment through close to 40 hours over 2 rounds of training and maybe around 1.1 million sim-years. That's evidently thin for training these kinds of things, but maybe enough for me to evaluate where I am.

I've kept my data in generations for restart-recovery purposes, but that also allows me a window into the evolution of what it is finding. And, pretty much, what it is finding is a choppy result that isn't changing too much any more but is still imprecise or inconsistent in its policy recommendations by age. That inconsistency is what I wanted to think about today.

The inconsistency (for the given parameters, that is, e.g., 4.5% at 60, 5% at 61, 4.5% at 62, 5% again at 63, and so forth) comes, no doubt, from the methods and shortcuts I've chosen as well as from the nature of what I am trying to do. Here are the reasons I think I am seeing the inconsistencies:

1. Under-training. 1.1M sim-years is not enough. Over billions, my guess is that I would get a more consistent result. But maybe not.

2. I am effectively rounding stuff. In order to simplify and speed things up, I am working in $500k chunks of wealth and .005 chunks of spending (see the first sketch after this list). That means the policy recommendations will "pop" to the closest grid point, and in "in the middle" conditions or other fuzz, the recommendation is going to buzz up and down depending on the iteration and the randomness dealt to the machine.

3. The machine is probably starting to see into the idea of false precision. In a standard ret-fin "model" where everything is tidy, the horizon is fixed, and spending is constant, the difference between a spend of 4.21 and 4.22 might make a big difference. In real life, spending variance is way, way beyond that ".01 diff," so it will dominate the mini-differences and the false precision of a standard academic model. Also, the machine, in its many iterations, is probably reflecting that in a dynamic system that adapts to the environment and "local choice," the exact choice between 4.21 and 4.22, or even between 4 and 4.5, doesn't matter much. In any alternative universe both could be right or wrong depending on the cards dealt, but probably both represent a zone of acceptability, where 2% or 7%, on the other hand, would be wrong in spades.

4. Inside the machine, in order to evaluate randomized "actions," a mini forward-utility sim is run to score each choice. That mini sim is really, really tiny. About 200 iterations in the latest effort took me 50 minutes (I haven't dropped this on AWS yet; not sure how); that was double the effort at 100 iterations, and 29,000 iterations took me 24 hours. That small set of mini-runs, while efficient, is going to leave behind a wake of outliers that the policy will remember, incorrectly, as important but that won't get shaken out unless a future iteration is even more extreme...which we wouldn't expect. That problem alone will explain the instability and inconsistency (see the second sketch after this list). Until I figure out a better way to do this project on the cloud, that inconsistency will remain.
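To make point 2 concrete, here is a minimal sketch of the kind of rounding I mean, using the chunk sizes mentioned above. The function names and the 4.7% example are just illustrative, not my actual code.

```python
# Minimal sketch of the rounding in point 2; chunk sizes are the ones above,
# everything else (names, the 4.7% example) is illustrative.
WEALTH_CHUNK = 500_000    # wealth evaluated in $500k chunks
SPEND_CHUNK = 0.005       # spend rates evaluated in 0.5% chunks

def snap_wealth(w):
    """Round wealth to the nearest $500k grid point."""
    return round(w / WEALTH_CHUNK) * WEALTH_CHUNK

def snap_spend(rate):
    """Round a spend rate to the nearest 0.5% grid point."""
    return round(round(rate / SPEND_CHUNK) * SPEND_CHUNK, 3)

# A "true" optimum near 4.7% sits between grid points, so a little noise
# in the estimate flips the recommendation between 4.5% and 5.0%.
print(snap_wealth(1_130_000))   # -> 1000000
print(snap_spend(0.047))        # -> 0.045
print(snap_spend(0.048))        # -> 0.05
```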
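And to make point 4 concrete, here is a rough sketch of why a tiny forward-utility sim is noisy. This is not my engine; the return, horizon, and CRRA-utility assumptions are made up for illustration. The point is that two small-sample estimates of the same spend choice can come back meaningfully different, and the policy can remember the lucky one.

```python
import numpy as np

def mini_forward_utility(spend_rate, wealth=1_000_000, years=30, n_paths=200,
                         mu=0.05, sigma=0.11, gamma=2.0, seed=None):
    """Crude Monte Carlo estimate of expected CRRA utility of consumption
    for a constant spend rate. Return and utility assumptions are
    illustrative, not my actual model inputs."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_paths):
        w, u = float(wealth), 0.0
        for _ in range(years):
            c = max(spend_rate * w, 1.0)                 # consume; floor keeps utility finite
            w = max((w - c) * (1 + rng.normal(mu, sigma)), 0.0)
            u += c ** (1 - gamma) / (1 - gamma)          # CRRA utility, gamma = 2 here
        total += u
    return total / n_paths

# Two small-sample estimates of the SAME action can disagree noticeably --
# the "wake of outliers" that a policy update can latch onto.
print(mini_forward_utility(0.045, seed=1))
print(mini_forward_utility(0.045, seed=2))
```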

All of this is a way of saying that I don't think additional iterations will help me much and that I, or we, can derive some implications from what we already know. Let's take a look at some of the output so far and see what, if anything, we can conclude. In the following, recall that I keep "generations" of the policy conclusions (i.e., a save at 5k, 10k, 15k, 20k, etc. iterations), all of which are hopping around like mad when we look at any particular generation.
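(For the curious, the "generation" bookkeeping is nothing fancy. A minimal sketch of the idea, with hypothetical names, is below: snapshot the policy every 5k iterations so a crashed run can restart from the last save and so generations can be compared later.)

```python
import pickle

SAVE_EVERY = 5_000   # snapshot cadence: 5k, 10k, 15k, ... iterations

def save_generation(policy, iteration, prefix="policy_gen"):
    """Write the current policy to disk so a run can restart from the last
    save and so each generation can be inspected afterwards. Names are
    hypothetical, not my actual file layout."""
    if iteration > 0 and iteration % SAVE_EVERY == 0:
        with open(f"{prefix}_{iteration}.pkl", "wb") as f:
            pickle.dump(policy, f)
```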

Here is what the spend policy, for the $1M level of wealth, looks like by age for 14 different generations in 2 tranches - one seed stage without hard reinforcement, then one "with."  The last few generations will dominate the average, by the way, since they didn't change much.

Even the last run after 1.1M sim-years (red) is still a little bit ambiguous. The trend line, however (blue, a 2nd-order polynomial trend of the average of all generations), is perhaps worthy -- and has some information value, in my amateur opinion, given points 1-4 above.

Now let's go back and overlay that blue line on top of the benchmarks we set in the first post. Recall that the benchmarks were: Merton, RMD, PMT, RH40, and LPR. See prior posts for definitions (or the rough sketch below for two of them), but I consider them worthy.
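For readers who don't want to dig through the prior posts, here is a rough, generic sketch of two of the simpler benchmarks (RMD-style and PMT-style). The divisor and rate inputs are textbook placeholders, not necessarily the exact parameterizations from the earlier posts; Merton, RH40, and LPR are defined there.

```python
def rmd_style_rate(age, divisor_at_70=27.4):
    """Generic RMD-style spend rate: 1 / remaining distribution period.
    The linearly declining divisor is a placeholder for the IRS table."""
    divisor = max(divisor_at_70 - (age - 70), 1.0)
    return 1.0 / divisor

def pmt_style_rate(age, horizon_end=95, r=0.03):
    """Generic PMT-style spend rate: amortize current wealth to a fixed
    horizon at an assumed real rate r (both placeholders)."""
    n = max(horizon_end - age, 1)
    return r / (1 - (1 + r) ** -n)

for age in (60, 70, 80):
    print(age, round(rmd_style_rate(age), 3), round(pmt_style_rate(age), 3))
```

Here, then, is that blue line in this context: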



As before, the machine is playing the same game, I think. It also looks relatively optimistic over the 60-80 interval, an artifact, no doubt, of my assumptions. So the runs of the machine suck individually, but as a group they don't look totally incoherent.








