Regularization in deep learning

YaLinChen (Amber)
Dec 1, 2021

Reference: An Introduction to Statistical Learning
All pictures shared in this post are from the reference.


One of the main differences between a machine/deep learning (M/DL) model and a statistical inference model is the number of variables: the merit of an M/DL model is that the user can include as many variables as desired, regardless of problems like (multi)collinearity. However, the massive number of variables can lead to “overfitting”.

Regularization in deep learning takes three forms:
- Ridge regression
- Lasso
- Dropout

The problem: overfitting

Figure 1. The left side shows a model fit to 20 data points, and the right side shows a model fit to only 2 data points, illustrating the overfitting problem.

In a simple linear regression where we have one variable, a model trained on 20 data points gets an approximate line, whereas one trained on only 2 data points gets a “perfect” line that passes exactly through the 2 points we have. This is in fact not perfect, but overfit.

Ridge regression and Lasso

Ridge regression and the Lasso actually come from the math of linear regression. (For more linear model selection methods, see the previous post: Linear model selection methods — subset selection)

Typically, the loss (residual sum of squares, RSS) of a linear model is computed from the differences between the observed and predicted values. However, the RSS can be deceptively small because of overfitting.

Figure 2. Residual sum of squares of a linear model
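Written out (a reconstruction of the standard formula from the reference), the RSS of a linear model with p predictors and n observations is:

```latex
\mathrm{RSS} = \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
             = \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
```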

We want to adjust (increase) the loss by adjusting the model, that is, by altering the model's beta coefficients.

The formulas

Figure 3. Formula of ridge regression and Lasso

In ridge regression, we make this adjustment by taking the sum of the squared beta coefficients and multiplying it by the tuning parameter (lambda); this term, which is added to the RSS, is called the shrinkage penalty.

Lambda controls the relative importance of the original model loss and the penalty: the bigger the lambda, the larger the contribution of the penalty term to the new loss, and the smaller the impact of the original model loss.

Note that the shrinkage penalty is a sum of squared terms, which is always greater than or equal to zero, so the new loss is always greater than or equal to the original loss. Mathematically, another way to make a term always ≥ 0 is to take the absolute value; when the penalty uses absolute values instead of squares, the method is called the Lasso.
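Putting this together (reconstructing the formulas of Figure 3 from the book's definitions), ridge regression and the Lasso minimize:

```latex
\text{Ridge:}\quad \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2}
\qquad\qquad
\text{Lasso:}\quad \mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```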

When computing ridge regression and the Lasso, we fit the model over a range of lambda values; in other words, for each lambda we try, we get a new set of beta coefficients. The best lambda (and hence the best model) is then typically chosen by cross-validation.
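As a minimal sketch of this lambda sweep (using scikit-learn rather than the book's R labs, with synthetic data; scikit-learn calls the tuning parameter alpha):

```python
# A minimal sketch (synthetic data): fit ridge and Lasso for several values of
# the tuning parameter and watch the coefficients change with each lambda.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                              # 50 observations, 4 predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

for alpha in [0.01, 1.0, 10.0]:                           # alpha plays the role of lambda
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"lambda = {alpha}")
    print("  ridge coefficients:", np.round(ridge.coef_, 3))
    print("  lasso coefficients:", np.round(lasso.coef_, 3))
```

As lambda grows, the ridge coefficients shrink towards zero but stay nonzero, while the Lasso sets some of them exactly to zero; this is the feature-selection behavior discussed below.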

The penalty is called a “shrinkage penalty” because the bigger the lambda, the smaller the coefficients become: they are driven towards zero (Figure 4).

Figure 4. The least squares solution is marked with beta-hat, and the red contours show the RSS produced by each solution set; all solution sets on the same contour give the same RSS. The blue area represents the solution space of the ridge regression and Lasso coefficients. The intersection (the yellow icon) of the red contour and the blue area is the ridge regression/Lasso solution that gives the corresponding RSS.

How do we find the coefficients for a given lambda? Equivalently, minimizing the new loss can be written as minimizing the RSS subject to the constraint that the shrinkage penalty stays within a budget (s), which corresponds to the tuning parameter lambda. (See the formula in Figure 4. The smaller the budget s, the more the coefficients are shrunk.)
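In this equivalent constrained form used in Figure 4 (reconstructed from the book's formulation), the two methods minimize the RSS subject to a budget on the penalty:

```latex
\text{Ridge:}\quad \min_{\beta}\ \mathrm{RSS}\ \ \text{subject to}\ \ \sum_{j=1}^{p}\beta_j^{2} \le s
\qquad\qquad
\text{Lasso:}\quad \min_{\beta}\ \mathrm{RSS}\ \ \text{subject to}\ \ \sum_{j=1}^{p}\lvert\beta_j\rvert \le s
```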

Geometrically, “the contours of the model's least squares loss” and “the solution space of the Lasso and ridge regression coefficients” can be presented graphically as in Figure 4. The intersection of the two is the Lasso/ridge regression solution set that gives the minimum loss under the constraint.

One difference between the Lasso and ridge regression is the shape of the solution space. Because the Lasso solution space has corners, the intersection can land on an axis; in the example of Figure 4, it lands on the beta-2 axis, so beta 1 becomes zero. This removes the coefficient beta 1 from the model, and this is the process of feature selection.

How do ridge regression and the Lasso work in a deep neural network?

Let’s take a single-layer neural network as an example.

Figure 5. A single-layer neural network with 4 input variables and 5 hidden units.

A hidden unit (Ak) is a linear combination of the inputs transformed by a nonlinear activation function. In the example of Figure 5, where we have 5 hidden units, we get 5 different linear combinations, each transformed by the activation function.
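Written out (following the book's notation for a single-layer network, with g as the nonlinear activation function), the k-th hidden unit is:

```latex
A_k = h_k(X) = g\Bigl(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\Bigr), \qquad k = 1, \dots, K
```

In Figure 5, p = 4 inputs and K = 5 hidden units.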

Now we do the math. There are 4 input variables and 5 hidden units in the next layer, so we get (4+1)*5 = 25 weights for just 4 input variables (the +1 is the bias term in each linear combination). This is how the number of parameters can quickly exceed the number of observations.

So, when we pass the L1/L2 regularization argument while coding a neural network layer, we are actually penalizing the weights of the linear combination in each hidden unit.

In Python packages, L1 corresponds to the Lasso and L2 to ridge regression.
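For example, in Keras (a minimal sketch, assuming TensorFlow is installed; the penalty value 0.01 is arbitrary), the penalty is attached to a layer through the kernel_regularizer argument, mirroring the 4-input, 5-hidden-unit network of Figure 5:

```python
# A minimal Keras sketch: kernel_regularizer adds the shrinkage penalty on the
# layer's weights to the training loss. l2 corresponds to ridge regression,
# l1 to the Lasso; the penalty strength 0.01 is an arbitrary example value.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                                # 4 input variables
    tf.keras.layers.Dense(
        5, activation="relu",                                  # 5 hidden units
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),    # ridge-style (L2) penalty
    tf.keras.layers.Dense(1),
])
```

Swapping tf.keras.regularizers.l2 for tf.keras.regularizers.l1 gives the Lasso-style penalty instead.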

Dropout learning

Dropout learning is inspired by the concept of random forests: each time a split is made, a random sample of predictors is chosen as split candidates. Similarly, in dropout learning, in each layer the model randomly removes a fraction (φ) of the hidden units during training. The surviving units’ weights are then scaled up by a factor of 1/(1−φ) to compensate.
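As a rough sketch of the mechanism (plain NumPy, not the book's code; in this common “inverted dropout” variant the scaling is applied to the surviving activations during training):

```python
# A minimal NumPy sketch of dropout during training: each hidden unit is kept
# with probability 1 - phi, and the survivors are scaled up by 1 / (1 - phi)
# so that the expected total activation stays the same.
import numpy as np

def dropout(activations, phi, rng):
    keep_mask = rng.random(activations.shape) >= phi    # drop a fraction phi of units
    return activations * keep_mask / (1.0 - phi)        # rescale the survivors

rng = np.random.default_rng(0)
hidden = np.array([0.5, 1.2, -0.3, 0.8, 2.0])           # activations of 5 hidden units
print(dropout(hidden, phi=0.4, rng=rng))
```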

“Dropout learning prevents nodes from being over-specialized.” This is because the dropout process is random rather than based on loss or significance level.

Expanding a little further beyond regularization…

Network tuning is the process of adjusting hyperparameters to optimize a model. Regularization and dropout are two of these methods; others include

  • the model architecture: number of hidden layers/units
  • stochastic gradient descent: the batch size, the learning rate, etc.
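For instance, in Keras (a minimal sketch with synthetic data; all values are arbitrary), the learning rate is set on the stochastic gradient descent optimizer and the batch size is passed to fit():

```python
# A minimal Keras sketch (synthetic data, arbitrary hyperparameter values).
import numpy as np
import tensorflow as tf

X = np.random.normal(size=(100, 4)).astype("float32")
y = np.random.normal(size=(100, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # learning rate
              loss="mse")
model.fit(X, y, batch_size=32, epochs=5, verbose=0)                   # batch size
```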

This is a summary of parts of Chapter 6 and Chapter 10 of the book “An Introduction to Statistical Learning”. I hope you enjoy the article and the book.
