RMNIST with annealing and ensembling

By Michael Nielsen

In the last post I described Reduced MNIST, or RMNIST, a very stripped-down version of the MNIST training set. As a side project, I’ve been exploring RMNIST as an entree to the problem of using machines to generalize from extremely small data sets, as humans often do. Using just 10 examples of each training digit, in that post I described how to achieve a classification accuracy of 92.07%.

That 92.07% accuracy was achieved using a simple convolutional neural network, with dropout and data augmentation to reduce overfitting.

In this post I report the results obtained by using three additional ideas:

The use of simulated annealing to do hyper-parameter optimization;
Voting by an ensemble of neural nets, rather than just a single neural net; and
l2 regularization.

The code is available in anneal.py.

The experiments in the last post were done on my laptop, using the CPU – a nice thing about tiny training sets is that you can experiment using relatively few computational resources. But for these experiments, it was helpful to use an NVIDIA Tesla P100, run in the Google Compute cloud. This sped my experiments up by a factor of about 10.

These changes resulted in an accuracy of 93.81%, a considerable improvement over the 92.07% obtained previously. I suspect that further improvements using these ideas, along the lines described below, will bump that accuracy over 95%, and possibly higher. Ideally, I’d like to achieve better than 99% accuracy. My guess is that this would be close to how humans would perform, starting with a training set of this size.

Detailed working notes and ideas for improvement

Through the remainder of this post, I assume you’re familiar with the way annealing works.

The annealing strategy is to make local “moves” in hyper-parameter space. For instance, a typical move was to increase by 2 the number of kernels in the first convolutional layer. Another move was to decrease by 2 the number of kernels. Two more moves were to increase or decrease the learning rate by a constant factor of 10^¼.

Overall, the anneal involved modifying four hyper-parameters using such local moves: the learning rate, the weight decay (for l2 regularization), the number of kernels in the first convolutional layer, and the number of kernels in the second convolutional layer.
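As a rough illustration, the local moves over these four hyper-parameters could be sketched as follows. This is not the code in anneal.py: the config keys and helper names are hypothetical. The post only states the step for the learning rate, so the multiplicative step for the weight decay is an assumption, as is the floor of one kernel.

```python
# All names here (the config keys, make_moves) are hypothetical, not from anneal.py.
LR_FACTOR = 10 ** 0.25  # four steps in one direction change the value tenfold
KERNEL_STEP = 2


def scale(key, factor):
    """Move that multiplies one hyper-parameter by a constant factor."""
    return lambda cfg: {**cfg, key: cfg[key] * factor}


def shift(key, delta):
    """Move that adds a constant to one hyper-parameter, keeping it at least 1."""
    return lambda cfg: {**cfg, key: max(1, cfg[key] + delta)}


def make_moves():
    """Return a dict mapping a move name to a function config -> new config."""
    moves = {}
    # Assumption: weight decay uses the same multiplicative step as the learning rate.
    for key in ("learning_rate", "weight_decay"):
        moves[key + " up"] = scale(key, LR_FACTOR)
        moves[key + " down"] = scale(key, 1 / LR_FACTOR)
    for key in ("kernels_1", "kernels_2"):
        moves[key + " up"] = shift(key, KERNEL_STEP)
        moves[key + " down"] = shift(key, -KERNEL_STEP)
    return moves
```

Each move returns a new config rather than mutating the old one, so a rejected move can simply be discarded. Choosing a move uniformly at random is then a matter of picking one name from the dict.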

The “energy” associated with a hyper-parameter configuration was just the validation accuracy of an ensemble of nets with those hyper-parameters. More precisely, I used the negative of the validation accuracy – the negative since the goal of annealing is to minimize the energy, and thus to maximize the accuracy.
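To make the sign convention concrete, here is a minimal sketch of an anneal that uses the negative validation accuracy as its energy. The standard Metropolis acceptance rule and the fixed temperature are assumptions of the sketch, since the post does not spell out either. The function names and the `propose` and `evaluate` callables are hypothetical.

```python
import math
import random


def energy(validation_accuracy):
    """Energy is the negative accuracy, so minimizing energy maximizes accuracy."""
    return -validation_accuracy


def accept(current_energy, proposed_energy, temperature, rng=random):
    """Metropolis rule (an assumption; the post does not specify the rule used)."""
    delta = proposed_energy - current_energy
    if delta <= 0:
        return True  # never reject a move that does not raise the energy
    return rng.random() < math.exp(-delta / temperature)


def anneal(config, propose, evaluate, temperature, steps, rng=random):
    """Run a fixed-temperature anneal; return the best (config, accuracy) seen.

    propose(config, rng) returns a neighbouring config, and evaluate(config)
    returns a validation accuracy. Both are supplied by the caller.
    """
    current_energy = energy(evaluate(config))
    best = (config, -current_energy)
    for _ in range(steps):
        candidate = propose(config, rng)
        candidate_energy = energy(evaluate(candidate))
        if accept(current_energy, candidate_energy, temperature, rng):
            config, current_energy = candidate, candidate_energy
            if -current_energy > best[1]:
                best = (config, -current_energy)
    return best
```

Because accuracies lie between 0 and 1, the temperature needs to be small for the rule to be selective: with a temperature of 0.01, a move that costs one percentage point of accuracy is accepted with probability exp(-1), or about 37%.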

These were first experiments, and it’d likely be easy to considerably improve the results. To do that, it’d be useful to have monitoring tools which help us debug and improve the anneal. Such tools could help us:

Identify which hyper-parameters make a significant difference to performance, and which do not. Bergstra and Bengio find that typically only a few hyper-parameters make much difference. How can we identify those hyper-parameters and ensure that we concentrate on those?
Identify when we should change the structure of a move. For instance, instead of changing the number of kernels by 2, perhaps it would be better to change the number by 5. What step sizes are best? Should we have a distribution? How sensitive is validation accuracy to the size of the steps?
Identify changes to the way we should sample from the moves. At the moment I simply choose a move at random. But if statistics are kept of previous moves, it would be possible to estimate the probability of a given move improving the validation accuracy, and sample accordingly. What is the probability distribution with which particular moves improve the accuracy? What’s a good model for the size of the expected improvements? These are questions closely related to the work of Snoek, Larochelle, and Adams on Bayesian hyper-parameter optimization.
Identify pairs of moves which work well together. For instance, it may be that increasing the number of kernels works well provided the l2 regularization is also increased. But each move on its own might be unfavourable. Which pairs of moves often produce good outcomes, even when the individual moves do not? Is it possible for the annealer to automatically learn such pairs and incorporate them into the annealing?
Identify when we should change the energy scale of the anneal, i.e., the effective temperature. A characteristic question here is how often we accept moves which make the accuracy lower, despite the fact that a different move would have made the accuracy higher. If this happens too often it likely means the energy scale should be made smaller (i.e., the temperature of the anneal should be decreased).
By sampling from the hyper-parameter space can we build a good model which lets us predict accuracy from the hyper-parameters? And then use something like gradient ascent to optimize that function?
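One of the simplest of these tools to prototype is the move-statistics idea. The sketch below keeps a count of how often each move improved the validation accuracy and samples moves in proportion to a smoothed success rate. It is a crude heuristic, far simpler than Bayesian hyper-parameter optimization, and the class name, method names, and add-one smoothing are all assumptions made for illustration.

```python
import random


class MoveStats:
    """Track how often each move improved accuracy, and sample accordingly."""

    def __init__(self, move_names):
        self.tries = {name: 0 for name in move_names}
        self.wins = {name: 0 for name in move_names}

    def record(self, name, improved):
        """Record one trial of a move and whether it improved the accuracy."""
        self.tries[name] += 1
        if improved:
            self.wins[name] += 1

    def success_rate(self, name):
        """Add-one smoothed estimate, so untried moves start at 0.5, not 0."""
        return (self.wins[name] + 1) / (self.tries[name] + 2)

    def sample(self, rng=random):
        """Pick a move with probability proportional to its estimated success rate."""
        names = list(self.tries)
        weights = [self.success_rate(name) for name in names]
        return rng.choices(names, weights=weights, k=1)[0]
```

The smoothing keeps every weight strictly positive, so no move is ever ruled out entirely. That matters because the usefulness of a move is likely to change as the anneal wanders through hyper-parameter space.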

Each of these ideas suggests good small follow-up projects. Those projects would be of interest in their own right; I also wouldn’t be surprised if they resulted in considerable improvement in performance.

Insofar as such tools would change the way we do the anneal, we’d be doing hyper-parameter optimization optimization.

A few miscellaneous observations:

Good performance even with a small number of kernels in the first layer: I was surprised how well the network performed with just 2 (!) kernels in the first convolutional layer – it was relatively easy to get validation accuracies above 93%. What can we learn from this? What would happen with just 1 kernel? How much is it possible to reduce the number of kernels in the second convolutional layer? In a situation where the key problem is overfitting and generalization, it seems like an important observation that we can get 93% performance with just 2 kernels.

Batch size mattered a lot for speed: As a legacy of my CPU code I started with a mini-batch size of 10. I changed that to 64, since increasing mini-batch size often helps with speed, particularly on a GPU, where these computations are easily parallelized. I was, however, surprised by the speedup – I didn’t do a detailed benchmark, but it was easily a factor of 2 or 3. Further experimentation with mini-batch size would be useful. (Note: I’d never used the P100 GPU before. I’ve seen speedups with other GPUs when changing mini-batch size, but I’m pretty sure this is the largest I’ve seen.)

Adding other hyper-parameters: I suspect adding other hyper-parameters would result in significantly better results. In rough order of priority, it’d be good to add: initialization parameters for the weights, different types of data augmentation, size of the fully-connected layer, the kernel sizes, learning rate decay rate, and stride length.

Understand performance across ensembles of nets: Something I understand poorly is the behaviour of ensembles of neural nets. What is the distribution of performance across the ensemble? How much can aggregating the outputs help? What are the best strategies for aggregating outputs? How much does it help to increase the size of the ensemble?
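Two common aggregation strategies are easy to compare side by side: majority voting over each net's predicted label, and averaging the nets' output probabilities before taking the most likely class. The sketch below is illustrative only. The function names are hypothetical, the tie-breaking rule is an arbitrary choice, and neither function is claimed to match what anneal.py does.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions[n][i] is net n's label for example i; return the voted labels.

    Ties are broken in favour of the smallest label (an arbitrary choice).
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(min(label for label, count in counts.items() if count == top))
    return voted


def average_probabilities(probabilities):
    """probabilities[n][i][c] is net n's probability of class c for example i.

    Average over nets, then return the most probable class for each example.
    """
    n_nets = len(probabilities)
    labels = []
    for per_net in zip(*probabilities):  # one tuple of class distributions per example
        mean = [sum(column) / n_nets for column in zip(*per_net)]
        labels.append(max(range(len(mean)), key=mean.__getitem__))
    return labels
```

The two can disagree. Voting treats every net equally, however unsure it is, while averaging lets a single confident net outweigh several hesitant ones.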

How stable are the results for large ensembles? The questions in the last item are all intrinsically interesting. They’re also interesting for a practical reason: sometimes I found hyper-parameter choices which did not provide stable performance across repeated training using those same hyper-parameters. But perhaps with large enough ensemble sizes that instability could be eliminated. A related point: I achieved validation accuracies up to 94.39%, but didn’t report them above, because they were not easy to reproduce while using the same hyper-parameters.
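One way to put a number on that instability is to retrain several times with the same hyper-parameters and look at the spread of validation accuracies. A minimal sketch, assuming a hypothetical `train_and_evaluate(seed)` function that trains an ensemble from scratch and returns its validation accuracy:

```python
import statistics


def summarize_runs(train_and_evaluate, runs):
    """Repeat training `runs` times (at least 2) and summarize the accuracies.

    train_and_evaluate(seed) is a hypothetical caller-supplied function that
    trains with a fixed set of hyper-parameters and returns validation accuracy.
    """
    accuracies = [train_and_evaluate(seed) for seed in range(runs)]
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),  # sample standard deviation
        "min": min(accuracies),
        "max": max(accuracies),
    }
```

Running this for several ensemble sizes would show directly whether the spread shrinks as the ensemble grows.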

Adding interactivity: Something that’s often frustrating while annealing is that a question will occur to me, based on observing the program output, but I have no way to modify the anneal in real time. It’d be exceptionally helpful to be able to break in, access the REPL, modify the structure of the anneal, and restart.
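A minimal version of this is possible with Python's standard `code` module: catch Ctrl-C, open a REPL with the anneal's state in scope, and resume when the REPL exits. In the sketch below, `run_with_breakin`, `step`, and `state` are hypothetical names, and the anneal is assumed to keep all of its mutable settings in a single dict.

```python
import code


def run_with_breakin(step, state, steps, shell=None):
    """Call step(state) `steps` times; Ctrl-C opens a REPL where `state` can be edited.

    An interrupted step is abandoned and retried, so step should only write its
    results into `state` once it has finished (an assumption of this sketch).
    """
    if shell is None:
        def shell(namespace):
            banner = "Anneal paused. Edit `state`, then send end-of-file to resume."
            code.interact(banner=banner, local=namespace)
    done = 0
    while done < steps:
        try:
            step(state)
            done += 1
        except KeyboardInterrupt:
            shell({"state": state})
    return state
```

Inside the REPL, something like `state["temperature"] = 0.005` would change the temperature for all subsequent steps. End-of-file (Ctrl-D on Unix-like systems) leaves the REPL and resumes the anneal. The `shell` parameter exists only so that the REPL can be replaced by a stub when testing.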

The addictive psychology of training neural nets: Watching the outputs flow by – all the ups and downs of performance – produces a feeling which mirrors the appeal many people (including myself) feel while watching sport. There’s lots of random intermittent reward, and the perhaps illusory sense that you’re watching something important, something which your mind really wants to find patterns in. Indeed, on occasion you do find patterns, and it can be helpful. Nonetheless, I wonder if there aren’t healthier ways of engaging with neural nets.