[ML Crash Course] No Overfit, No Underfit, Just Right Fit

A Ydobon
6 min read · Sep 2, 2020


Machine learning’s goal is to predict well on new data drawn from a (hidden) true probability distribution. Unfortunately, the model can’t see the whole truth; the model can only sample from a training data set. If a model fits the current examples well, how can you trust the model will also make good predictions on never-before-seen examples?

William of Ockham, a 14th-century friar and philosopher, loved simplicity. He believed that scientists should prefer simpler formulas or theories over more complex ones. To put Ockham’s razor in machine learning terms:

The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.

In modern times, we’ve formalized Ockham’s razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds — a statistical description of a model’s ability to generalize to new data based on factors such as:

  • the complexity of the model
  • the model’s performance on training data

While these theoretical analyses provide formal guarantees under idealized assumptions, the guarantees can be difficult to apply in practice. Machine Learning Crash Course focuses instead on empirical evaluation to judge a model’s ability to generalize to new data.

A machine learning model aims to make good predictions on new, previously unseen data. But if you are building a model from your data set, how would you get the previously unseen data? Well, one way is to divide your data set into two subsets:

  • training set — a subset to train a model.
  • test set — a subset to test the model.
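To make the idea concrete, here is a minimal sketch of such a split in plain NumPy; the names `X`, `y`, and `split_train_test` are hypothetical, not part of any particular library:

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the examples, then hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))       # random order, so the cut is unbiased
    n_test = int(len(X) * test_fraction)    # size of the held-out test set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Hypothetical data: 1,000 examples with 5 features each.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

X_train, y_train, X_test, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 800 200
```

Shuffling before slicing matters: if the examples are stored in some meaningful order, a straight head/tail cut would give the two subsets different characteristics.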

Good performance on the test set is a useful indicator of good performance on the new data in general, assuming that:

  • The test set is large enough.
  • You don’t cheat by using the same test set over and over.

The ML fine print

The following three basic assumptions guide generalization:

  • We draw examples independently and identically distributed (i.i.d.) at random from the distribution. In other words, examples don’t influence each other. (An alternate explanation: i.i.d. is a way of referring to the randomness of variables.)
  • The distribution is stationary; that is, the distribution doesn’t change within the data set.
  • We draw examples for all partitions from the same distribution.

In practice, we sometimes violate these assumptions. For example:

  • Consider a model that chooses ads to display. The i.i.d. assumption would be violated if the model bases its choice of ads, in part, on what ads the user has previously seen.
  • Consider a data set that contains retail sales information for a year. Users’ purchases change seasonally, which would violate stationarity.

When we know that any of the preceding three basic assumptions is violated, we must pay careful attention to metrics.
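When the ordering of the examples encodes a shifting distribution, as in the retail example above, a common precaution is to shuffle before splitting. Shuffling doesn’t make a seasonal signal stationary, but it does keep the training and test partitions drawn from the same mixture of seasons. A sketch, assuming the sales records sit in a pandas DataFrame with hypothetical `month` and `amount` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical retail data: one row per transaction, ordered by month.
sales = pd.DataFrame({
    "month": np.repeat(np.arange(1, 13), 100),  # 12 months x 100 transactions
    "amount": np.random.gamma(shape=2.0, scale=30.0, size=1200),
})

# Shuffling before splitting makes both partitions cover every season,
# so they are at least drawn from the same (mixed) distribution.
shuffled = sales.sample(frac=1.0, random_state=7).reset_index(drop=True)
train, test = shuffled.iloc[:960], shuffled.iloc[960:]  # 80/20 split

# Sanity check: the two partitions should look alike.
print(train["amount"].mean(), test["amount"].mean())
print(train["month"].value_counts(normalize=True).sort_index())
```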

Training and Test Sets: Splitting Data

We introduced the idea of dividing the data set into two subsets:

  • training set — a subset to train a model.
  • test set — a subset to test the trained model.

You could imagine slicing the single data set so that most of it becomes the training set and a smaller slice becomes the test set.

Make sure that your test set meets the following two conditions:

  • Is large enough to yield statistically meaningful results.
  • Is representative of the data set as a whole. In other words, don’t pick a test set with different characteristics than the training set.
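One way to keep the test set representative, assuming scikit-learn is available, is a stratified split that preserves the label proportions in both subsets; the data below are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 90% negative, 10% positive labels.
X = np.random.rand(5000, 8)
y = (np.random.rand(5000) < 0.1).astype(int)

# stratify=y keeps the class proportions (roughly) identical in both subsets,
# so the test set stays representative of the data set as a whole.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())  # both close to 0.10
```

Without `stratify=y`, a rare class could easily end up over- or under-represented in a small test set, making its metrics misleading.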

Assuming that your test set meets the preceding two conditions, your goal is to create a model that generalizes well to new data. Our test set serves as a proxy for new data. Suppose, for example, that the model learned from the training data is very simple. This model doesn’t do a perfect job; a few predictions are wrong. However, it does about as well on the test data as it does on the training data. In other words, this simple model does not overfit the training data.

Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, high accuracy might indicate that test data has leaked into the training set.

For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender’s email address as features. We apportion the data into training and test sets, with an 80–20 split. After training, the model achieves 99% precision on both the training set and the test set. We’d expect a lower precision on the test set, so we take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set (we neglected to scrub duplicate entries for the same spam email from our input database before splitting the data). We’ve inadvertently trained on some of our test data, and as a result, we’re no longer accurately measuring how well our model generalizes to new data.
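One way to guard against that kind of leakage is to drop duplicates before splitting, so no example can land in both sets. The sketch below assumes the emails sit in a pandas DataFrame whose `subject`, `body`, and `sender` columns are hypothetical stand-ins for the features described above:

```python
import pandas as pd

def dedupe_and_split(emails: pd.DataFrame, test_fraction=0.2, seed=0):
    """Remove duplicate emails, then split, so no example appears in both sets."""
    unique = emails.drop_duplicates(subset=["subject", "body", "sender"])
    shuffled = unique.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled.iloc[n_test:], shuffled.iloc[:n_test]  # train, test

# Hypothetical data set with injected duplicates.
base = pd.DataFrame({
    "subject": [f"subject {i}" for i in range(100)],
    "body": [f"body {i}" for i in range(100)],
    "sender": [f"user{i}@example.com" for i in range(100)],
    "is_spam": [i % 2 for i in range(100)],
})
emails = pd.concat([base, base.sample(20, random_state=1)], ignore_index=True)

train, test = dedupe_and_split(emails)
print(len(emails), len(train) + len(test))  # 120 vs. 100: duplicates are gone
```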

Validation Set: Another Partition

We introduced partitioning a data set into a training set and a test set. This partitioning enabled you to train on one set of examples and then to test the model against a different set of examples. With two partitions, the workflow is a loop: train the model on the training set, evaluate the model on the test set, then tweak the model according to the results on the test set, and repeat.

Here, “Tweak model” means adjusting anything about the model you can dream up: changing the learning rate, adding or removing features, or even designing a completely new model from scratch. At the end of this workflow, you pick the model that does best on the test set.

Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into three subsets: a training set, a validation set, and a test set.

Use the validation set to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has “passed” the validation set.

In this improved workflow:

  1. Pick the model that does best on the validation set.
  2. Double-check that model against the test set.

This is a better workflow because it creates fewer exposures to the test set.
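As a rough sketch of this workflow, assuming a scikit-learn style classifier and a handful of candidate learning rates (both stand-ins for whatever you are actually tuning):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data, split once into train / validation / test (60/20/20).
X = np.random.rand(5000, 10)
y = (X[:, 0] + 0.1 * np.random.randn(5000) > 0.5).astype(int)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# 1. Tweak the model and keep the variant that does best on the validation set.
best_model, best_score = None, -np.inf
for lr in [0.0001, 0.001, 0.01, 0.1]:
    model = SGDClassifier(learning_rate="constant", eta0=lr, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy
    if score > best_score:
        best_model, best_score = model, score

# 2. Double-check the chosen model against the test set, used exactly once.
print("validation accuracy:", best_score)
print("test accuracy:", best_model.score(X_test, y_test))
```

Because the test set enters only at the final step, the accuracy it reports remains an honest estimate of how the chosen model is likely to do on new data.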

Test sets and validation sets “wear out” with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confidence you’ll have that these results actually generalize to new, unseen data.

If possible, it’s a good idea to collect more data to “refresh” the test set and validation set. Starting anew is a great reset.
