Learning #4: Getting statistical significance can be really challenging
When you create a program that’s supposed to play a game, the only way to know if it’s any good is to have it play lots of games. As you iterate on the solution, it then becomes important to be able to check whether your changes are improving the solution or making it worse. At first, this was easy. I started with a completely random game player and added in simple rules that had obvious impacts on the results. I had the code play 1,000 random games in each of a variety of different configurations. I picked 1,000 because I’d written a small program that tested the variance of the random game player at different batch sizes and found that, at around 1,000, the variance between batches had essentially converged.
As I got further along, though, I found that sometimes my changes made little improvement in the results and at times, beyond all sense, a new, important rule made the results worse. For a while I chose to remain perplexed about these anomalies and just backed out the seemingly-bad ideas. After it happened a number of times, though, I got curious. It occurred to me to run the variance check code against my rule-based automaton. Can you guess what I found? When the rule-based game player was playing random batches of 1,000 games, the variance in the results between batches was massive! In fact, the variance of the rule-based player only converged once I increased the batch size to 32,000. When I had added a rule and seen the results get worse, all I was actually seeing was statistical noise! Worse even than the rules I’d left out, was that many of the rules where I’d seen slight improvements had probably also just been statistical noise. This didn’t mean they were duds, but I had no trustworthy data for showing they weren’t. I thought 1,000 random games was a lot, but it turns out getting a statistically significant sample across something as complex as a board game-playing automaton requires a huge mount of data points.