Hyperparameter Tuning: Finding the Sweet Spot

August 2023
This is just a quick summary of a chapter I wrote in Deep Learning for Molecules and Materials, an awesome book by my awesome PhD advisor, Andrew White. Head there for the full treatment, including code examples and a hands-on walkthrough of tuning a peptide hemolysis classifier.

Training deep learning models is a bit like tuning a car: you've got a bunch of knobs to adjust, and finding the sweet spot takes patience and know-how.

The Learning Rate Saga

The learning rate is probably the most critical hyperparameter you'll deal with. Set it too high and your model overshoots the optimal solution; too low and you'll be waiting forever for it to converge [1]. The animation below shows this in action: watch how a tiny learning rate barely moves along the loss surface, a reasonable one glides toward the minimum, and a large one just bounces right past it:

Effect of learning rate on loss convergence
Fig 1. Effect of learning rate on loss. Too small crawls, too large overshoots, and a reasonable value converges efficiently.
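To make that concrete, here's a toy sketch (my own, not the book's walkthrough) of plain gradient descent on a one-dimensional quadratic loss. The three learning rates roughly reproduce the three behaviors in Fig 1:

```python
# Toy example: gradient descent on loss(w) = w**2 with three learning rates.
def gradient_descent(lr, steps=20, w0=5.0):
    w = w0
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # the basic gradient descent update
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.3f}")
# lr=0.01 barely moves, lr=0.1 converges toward the minimum at 0,
# and lr=1.1 overshoots and diverges.
```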

But here's the thing: even a "reasonable" constant learning rate can only get you so far. What if you could start bold and get more careful as you go? That's exactly what learning rate decay schedules do. By starting with a larger learning rate, you get the rarely discussed benefit of being able to escape sharp local minima that overfit; then, as the rate decreases, you settle into a broader minimum that generalizes better:

Learning rate decay schedule vs constant learning rate
Fig 2. A constant learning rate can get stuck in a local minimum, while a decay schedule allows the optimizer to escape and find a better solution.
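A decay schedule can be as simple as multiplying the rate by a constant factor every epoch. Here's a minimal exponential-decay sketch (the numbers are illustrative assumptions; Keras and PyTorch ship ready-made schedules like tf.keras.optimizers.schedules.ExponentialDecay if you'd rather not roll your own):

```python
# Exponential decay: shrink the learning rate by a constant factor each epoch.
def decayed_lr(initial_lr, decay_rate, epoch):
    return initial_lr * decay_rate ** epoch

for epoch in range(0, 50, 10):
    print(f"epoch {epoch:2d}: lr = {decayed_lr(0.5, 0.9, epoch):.4f}")
# Starts bold at 0.5 and is down to roughly 0.007 by epoch 40.
```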

Momentum, Batch Size, and the Balancing Act

Momentum helps smooth things out by remembering which direction you've been heading, kind of like a ball rolling downhill that doesn't stop at every little dip [1]. Batch size is another interesting trade-off: smaller batches tend to generalize better because they add a bit of noise that helps escape local minima, though they're slower to train [2, 3]. And then there's the architecture itself: how many layers, how many nodes, how much regularization. Too simple and you underfit; too complex and you overfit. It's a balancing act.
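To see the ball-rolling-downhill analogy in code, here's a minimal sketch (mine, not the book's) of the classic momentum update applied to the same toy quadratic loss:

```python
# SGD with momentum: the velocity remembers the running direction of descent,
# so the parameter keeps moving through small dips instead of stalling.
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    velocity = beta * velocity - lr * grad  # accumulate past gradients
    return w + velocity, velocity

w, v = 5.0, 0.0
for _ in range(50):
    w, v = momentum_step(w, v, grad=2 * w)  # gradient of loss(w) = w**2
print(f"w after 50 steps = {w:.4f}")  # settles toward 0, with some damped oscillation
```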

The Search for Optimal Hyperparameters

When it comes to actually searching for the best hyperparameters, you've got options ranging from brute-force grid search to smarter approaches like Bayesian optimization, which learns from previous trials to focus on promising regions [4]. But since training is expensive, early-stopping strategies like Successive Halving and HyperBand have become popular: they essentially give every configuration a quick audition and cut the underperformers early, saving your compute budget for the winners [5, 6].
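Here's a rough sketch of the successive-halving idea in plain Python (my own simplification; the evaluate function is a made-up stand-in for actually training a model on a given budget):

```python
import random

# Successive halving: audition every config on a small budget,
# keep the best half, and double the budget for the survivors.
def successive_halving(configs, evaluate, budget=1, rounds=4):
    survivors = list(configs)
    for _ in range(rounds):
        survivors.sort(key=lambda c: evaluate(c, budget))  # lower loss is better
        survivors = survivors[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]

# Stand-in for "train this config for `budget` epochs, return validation loss".
def evaluate(config, budget):
    noise = random.gauss(0, 1.0 / budget)   # bigger budget, less noisy estimate
    return abs(config["lr"] - 0.1) + noise  # pretend lr = 0.1 is ideal

configs = [{"lr": 10 ** random.uniform(-4, 0)} for _ in range(16)]
print("winner:", successive_halving(configs, evaluate))
```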

The key takeaway? Start with a baseline, add complexity gradually, and let your validation metrics guide you. And maybe use one of the many toolkits out there (Ray Tune, Optuna, Keras-Tuner) so you're not reinventing the wheel [7, 8, 9].
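For example, a search with Optuna [8] might look something like the sketch below. The training loop is a stand-in (a made-up loss curve), and the search space and trial count are illustrative assumptions, not recommendations:

```python
import optuna

def objective(trial):
    # Sample hyperparameters from the search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])

    loss = 1.0
    for epoch in range(20):  # stand-in training loop with a fake validation loss
        loss = 0.5 * loss + abs(lr - 1e-2) + 1e-4 * batch_size
        trial.report(loss, epoch)    # report intermediate values...
        if trial.should_prune():     # ...so the pruner can cut this trial early
            raise optuna.TrialPruned()
    return loss

study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())
study.optimize(objective, n_trials=50)
print("best hyperparameters:", study.best_params)
```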

References

  1. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge, MA: MIT Press, 2017.
  2. Keskar, Nitish Shirish, et al. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." arXiv preprint arXiv:1609.04836, 2016.
  3. Hoffer, Elad, Itay Hubara, and Daniel Soudry. "Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks." Advances in Neural Information Processing Systems, 2017.
  4. Golovin, Daniel, et al. "Google Vizier: A Service for Black-Box Optimization." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, 1487–1495.
  5. Jamieson, Kevin, and Ameet Talwalkar. "Non-stochastic Best Arm Identification and Hyperparameter Optimization." Artificial Intelligence and Statistics, PMLR, 2016, 240–248.
  6. Li, Lisha, et al. "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." Journal of Machine Learning Research 18, no. 185 (2018): 1–52.
  7. Liaw, Richard, et al. "Tune: A Research Platform for Distributed Model Selection and Training." arXiv preprint arXiv:1807.05118, 2018.
  8. Akiba, Takuya, et al. "Optuna: A Next-Generation Hyperparameter Optimization Framework." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
  9. O'Malley, Tom, et al. "Keras Tuner." 2019.