# Revisiting machine learning 1: overfitting

After having written 4 posts to practice my blog writing skill, I decided to move on to more complicated subjects. This research was done at the same period (last Christmas) as my Pokémon posts, right after Andrew Ng’s talk in NIPS 2016. By the way, I find it particularly worthwhile to attend these kinds of conferences, though I had to pay for the entrance myself.

This subject will include a series of posts, with each focusing on a slightly different topic. Each post will be based on the understanding of previous ones. They are written for experienced researchers/practitioners as well as for beginners. For experienced researchers/practitioners, these posts can correct some biased understanding that are formed inadvertently. For beginners, these posts alert them some misunderstanding traps that they should stay away from during the learning process.

The one you are currently reading is the first post, which discusses the most popular and fundamental concept of machine learning – overfitting and underfitting.
Although anyone who has ever played with machine learning has heard about these concepts, they may not understand the **cause** correctly.
This post hence intends to clarify these misunderstood concepts.

**Tl;dr**:
The **cause** of overfitting is neither using too many parameters nor using a larger model than the ground truth.
Instead, it occurs when the decrease of approximation error is *less* significant than the increase of estimation error.
Oppositely, when the decrease of approximation error is *more* significant than the increase of estimation error, it is the realm of underfitting.
This clarification helps us to understand (1) why deep learning can be successful and (2) why good data scientists are sometimes “bad”.

## Introduction

Many consider that the overfitting is due to excessive use of parameters or a model much larger than the ground truth.
This understanding is not accurate.
For example, it cannot explain why *Neural Network*, the most popular model nowadays, which uses extremely many parameters, can be so successful.

In order to understand overfitting, we must resort to the concept of approximation error and estimation error. Any other shortcuts will introduce an understanding bias. This argument will be re-emphasized later after we have developed necessary tools to understand these concepts.

I will use OLS to illustrate the idea because of its popularity and simplicity. I believe that other families of models can also be used for this purpose though they may have some extra subtleties. In addition, I suppose that the data are i.i.d., which is a common hypothesis in machine learning.

## Approximation error and estimation error

Suppose that the ground truth is [ y = f(x) + \varepsilon ,] where $ x \in \mathbb{R}^d $ is a vector and $ \varepsilon $ is a random variable with zero conditional expectation \[ \mathbb{E} [\varepsilon | x] = 0 .\]

Assuming $ S \subset [d]:= \{1,2,\ldots, d\} $, we denote $ x_S $ as the abridged vector of $x$ indexed by $S$. For example, $ x_{\{3, 5, 8\}} = (x_3, x_5, x_8)^T $.

The researcher in question supposes that
\[ f \in \mathcal{M}_ S := \{ f_\theta = x_S^T \theta : \theta \in \mathbb{R}^{|S|} \}, \]
*i.e.*, a fixed subgroup of features $S$ are relevant.
He possesses an i.i.d. dataset $ \mathcal{D} = \{ (x^{(i)}, y^{(i)}) \in \mathbb{R}^d \times \mathbb{R}, i \in [n] \} $, which is generated from the ground truth.
The design can be either fixed or random.

To simplify the notation of the upcoming analysis, I denote the matrix representation of the dataset as \[ X=[ x^{(1)}, x^{(2)}, \ldots, x^{(n)} ]^T \] and \[ Y=[ y^{(1)}, y^{(2)}, \ldots, y^{(n)} ]^T .\] $ f(X) $ and $ X_S $ are defined vectorwisely: \[ f(X) = [ f(x^{(1)}), f(x^{(2)}), \ldots, f(x^{(n)}) ]^T \] and \[ X_S=[ x^{(1)}_S, x^{(2)}_S, \ldots, x^{(n)}_S ]^T .\]

Naturally, he uses OLS to solve the problem: \[ \hat{\theta}_ S := \arg\min_{\theta \in \mathbb{R}^{|S|}} \lVert Y - X_S \theta \rVert_2^2 = (X_S^T X_S)^{-1} X_S^T Y. \]

The Mean Squared Error (MSE) of this estimator at a point $x$ is \[ \begin{align} \text{MSE}(x) & = \mathbb{E}_ \mathcal{D} [ f_ {\hat{\theta}_ S} (x) - f(x) ]^2 \newline & = \mathbb{E}_ \mathcal{D} [ x_S^T (X_S^T X_S)^{-1} X_S^T Y - f(x) ]^2 \newline & = \mathbb{E}_ \mathcal{D} [ x_S^T (X_S^T X_S)^{-1} X_S^T f(X) - f(x) + x_S^T (X_S^T X_S)^{-1} X_S^T \varepsilon ]^2 \newline & = \underbrace{\mathbb{E}_ \mathcal{D} [ x_S^T (X_S^T X_S)^{-1} X_S^T f(X) - f(x) ]^2}_ {\text{approximation error}} + \underbrace{\mathbb{E}_ \mathcal{D} [ x_S^T (X_S^T X_S)^{-1} X_S^T \varepsilon ]^2}_ {\text{estimation error}}, \end{align} \] where the last equality uses $\varepsilon$’s zero conditional expectation property.

The first term above, denoted as **approximation error**, represents the difference between the ground truth and its projection on the linear feature space.
The approximation error captures the error of the model used: The larger $S$ is, the smaller the approximation error will be.
The second term, denoted as **estimation error**, represents the projection of the noise on the linear feature space.
The estimation error captures the error towards the best estimator in the current model: The larger $S$ is, the bigger the estimation error will be.
Indeed, by classic parametric statistics, the estimation error is roughly $O(\frac{ |S| \sigma}{n})$, where $\sigma$ is the variance of the noise: This error increases linearly with regard to the size of $S$.

Now let us gradually increase the complexity of the model via enlarging $S$.
When the decrease of the approximation error is *more* significant than the increase of the estimation error, we are at the underfitting region, which implies that we may further enlarge $S$.
When the decrease of the approximation error is *less* significant than the increase of the estimation error, we are at the overfitting region, which implies that we should not further enlarge $S$.
The point separating these two regions are the optimal model size for this family of models.

An important insight here is that the estimation error is proportional to $\frac{\sigma}{n}$. The more data we have or the less noise there is in the data, the more complicated model we will be able to use. The optimal model size is actually not fixed: It depends on the size of the dataset and its quality.

Therefore, we cannot say that overfitting is caused by using too many parameters. How many are too many? It is not fixed; it depends on the data. Similarly, overfitting is not caused by using a larger model than the ground truth either. In fact, if there is no noise in the data, you can use an as large model as you like; the estimator will still recover the relatively simpler ground truth.

This insight has practical implications: When the data is small and noisy, data scientists are not likely to learn good models.
Garbage in, garbage out; they cannot outperform the data.
For data scientists, this means that they should prevent them from overthinking and making “false discoveries”.
For employers, this means that there is no need to employ highly skillful data scientists if their data are of small size and low quality. (**Update**: you should still hire skillful data scientists because they can make you realize that the problem is the data, not the model, instead of misleading you to a wrong direction.)
Moreover, it is unfair to use this kind of data to evaluate the ability of the data scientists: Giants are as high as dwarfs in this case.

It also explains the root reason why deep learning takes off only today: Bigdata enables deep learning. Both the volume and the noise of the data grow. When the advantage of the volume of the data overwhelms the disadvantage of the noise of the data, people can use larger, more complex models. Of course, the advance of hardware and computation power also contribute, but only through the root reason.

## Related works

Similar observations are also made in Frank Dieterle’s PhD thesis, which is available online.

## Limitations

This post reveals the true cause of overfitting, but it does not explain how to detect an overfitting in practice, for you will not have the ground truth $f$. This issue will be the topic of the next post.