Linear model selection methods — subset selection

6 min read · Nov 7, 2021

Reference: An Introduction to Statistical Learning
All pictures shared in this post are from the reference.

A (multiple) linear model typically takes the form of

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$

Source: An Introduction to Statistical Learning with Applications in R. Second Edition

It is good if we can incorporate as many variables as possible into our model, but only if they are “meaningful/significant”. Adding insignificant variables to the model simply adds noise, which of course is not preferred. Reducing the number of variables in a model also makes it easier to comprehend and interpret.

To choose the most appropriate model, there are generally three methods: subset selection, shrinkage, and dimension reduction. In today’s post, we will discuss the subset selection method, particularly, best subset selection and stepwise selection.

Best subset selection and stepwise selection

Best subset selection, as its name suggests, means exhaustively testing every possible combination of variables. Given p variables/predictors, we fit p models that contain exactly one predictor, p(p-1)/2 models that contain exactly two predictors, and so on, for a total of 2^p models.
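To get a feel for how fast the number of candidate models grows, here is a small counting sketch. The function name `n_models_by_size` is only illustrative, and p = 10 is chosen to match the example used later in this post.

```python
from math import comb

def n_models_by_size(p):
    """Number of candidate models with exactly k predictors, for k = 0..p."""
    return [comb(p, k) for k in range(p + 1)]

counts = n_models_by_size(10)
print(counts[1], counts[2])  # 10 45
print(sum(counts))           # 1024, i.e. 2**10 models in total
```

Each extra predictor doubles the total, which is why exhaustive search quickly becomes impractical.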

For each group of k-variable models (k = 1, …, p), we select the best one based on residual sum of squares (RSS) or R-squared. This gives p models, with 1 to p variables respectively. We then compare these models, together with the null model that contains no variables, using criteria such as cross-validation prediction error, BIC, adjusted R-squared, etc.

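Algorithm (best subset selection), paraphrased from the reference:
Step 1: Let M0 be the null model, which contains no predictors.
Step 2: For k = 1, …, p: (a) fit all models that contain exactly k predictors; (b) pick the best among them, the one with the smallest RSS or largest R-squared, and call it Mk.
Step 3: Select a single best model from M0, …, Mp using cross-validation prediction error, Cp (AIC), BIC, or adjusted R-squared.

Source: An Introduction to Statistical Learning with Applications in R. Second Edition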
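The per-size search can be sketched in plain Python as follows. This is only a minimal illustration, not code from the reference: the helper `rss`, the function `best_subset`, and the synthetic data (where y depends only on x0 and x2) are all assumptions made for the example.

```python
import random
from itertools import combinations

def rss(cols, y):
    """Training RSS of a least squares fit (with intercept) on the given columns."""
    n, m = len(y), len(cols) + 1
    rows = [[1.0] + [c[i] for c in cols] for i in range(n)]
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[j] * r[k] for r in rows) for k in range(m)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))]
         for j in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for j in range(m):
        piv = max(range(j, m), key=lambda r: abs(A[r][j]))
        A[j], A[piv] = A[piv], A[j]
        for r in range(m):
            if r != j:
                f = A[r][j] / A[j][j]
                A[r] = [a - f * b for a, b in zip(A[r], A[j])]
    beta = [A[j][m] / A[j][j] for j in range(m)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def best_subset(X_cols, y):
    """For each size k = 0..p, the tuple of column indices with the smallest RSS."""
    p = len(X_cols)
    return {
        k: min(combinations(range(p), k),
               key=lambda idx: rss([X_cols[j] for j in idx], y))
        for k in range(p + 1)
    }

# Synthetic data (illustrative): y depends only on x0 and x2.
rng = random.Random(0)
n, p = 50, 4
X_cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * X_cols[0][i] - 2 * X_cols[2][i] + rng.gauss(0, 0.1) for i in range(n)]

best = best_subset(X_cols, y)
rss_by_size = {k: rss([X_cols[j] for j in idx], y) for k, idx in best.items()}
for k in best:
    print(k, best[k], round(rss_by_size[k], 3))
```

The training RSS keeps shrinking as k grows, so the final choice among these winners has to rely on a criterion other than RSS.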

Best subset selection is very computationally expensive since it goes through every possible model. Stepwise selection, on the other hand, is more efficient and more commonly used, because it explores a far smaller set of models. Let’s assume we have 10 variables and want to find the best model. We can use one of the following:

1. Forward selection: We start with a null model that contains no variables, then add one variable at a time until all 10 are in the model. At each step, the variable that gives the greatest additional improvement to the model fit is added.
For example, suppose two variables are already in the model. We fit the 8 models that each add one of the remaining variables, and keep the one with the smallest RSS (or largest R-squared) as our best 3-variable model. Repeating this at every step gives 10 best models, one for each size from 1 to 10, plus the null model. We then select a single model among them using cross-validation prediction errors, BIC, adjusted R-squared, etc.

2. Backward selection: We start with a full model with all 10 variables, then eliminate one variable at a time until none is left. At each step, the least useful variable is eliminated.
For example, suppose four variables remain in the model. We fit the 4 models that each drop one of them, and keep the one with the smallest RSS (or largest R-squared) as our best 3-variable model. Again, we end up with one best model per size, and we select a single model among them using cross-validation prediction errors, BIC, adjusted R-squared, etc.

3. Hybrid approach: This approach combines forward selection and backward selection. We start with a null model and add one variable at a time; after each addition, we also remove any variables that no longer improve the model fit.
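Forward selection, the first of the three approaches, can be sketched in plain Python like this. It is a minimal illustration under stated assumptions, not the reference's code: `rss`, `forward_stepwise`, and the synthetic data (y depends only on x0 and x2) are made up for the example.

```python
import random

def rss(cols, y):
    """Training RSS of a least squares fit (with intercept) on the given columns."""
    n, m = len(y), len(cols) + 1
    rows = [[1.0] + [c[i] for c in cols] for i in range(n)]
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[j] * r[k] for r in rows) for k in range(m)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))]
         for j in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for j in range(m):
        piv = max(range(j, m), key=lambda r: abs(A[r][j]))
        A[j], A[piv] = A[piv], A[j]
        for r in range(m):
            if r != j:
                f = A[r][j] / A[j][j]
                A[r] = [a - f * b for a, b in zip(A[r], A[j])]
    beta = [A[j][m] / A[j][j] for j in range(m)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def forward_stepwise(X_cols, y):
    """Return [(selected indices, RSS)] for model sizes 0..p."""
    p = len(X_cols)
    selected = []
    path = [((), rss([], y))]  # null model: intercept only
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        # Try adding each remaining variable; keep the one with the smallest RSS.
        j_best = min(remaining,
                     key=lambda j: rss([X_cols[i] for i in selected + [j]], y))
        selected.append(j_best)
        path.append((tuple(selected), rss([X_cols[i] for i in selected], y)))
    return path

# Synthetic data (illustrative): y depends only on x0 and x2.
rng = random.Random(0)
n, p = 50, 4
X_cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * X_cols[0][i] - 2 * X_cols[2][i] + rng.gauss(0, 0.1) for i in range(n)]

path = forward_stepwise(X_cols, y)
for idx, value in path:
    print(len(idx), idx, round(value, 3))
```

With p = 10, this procedure fits 1 + 10·11/2 = 56 models instead of the 2^10 = 1,024 that best subset selection requires. Backward selection follows the same pattern, starting from the full model and dropping the variable whose removal increases RSS the least.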

Choosing the best model

As seen above, there are two steps involving model comparison: Step 2(b) and Step 3. Why do they use different approaches to compare the models?

The difference comes down to model size. In Step 2(b), all the candidate models contain the same number of variables, so comparing their training fit is fair: the one with the smallest RSS (or the smallest deviance, for models such as logistic regression) is the best of that size.

In Step 3, the models have different numbers of variables, and here training fit is no longer a fair yardstick. This is easiest to see with nested models. “Two models are nested if one model contains all the terms of the other, and at least one additional term. The larger model is the complete (or full) model, and the smaller is the reduced (or restricted) model.” The complete model always fits the training data at least as well as the reduced one, so RSS or R-squared would simply pick the largest model. This is why, in Step 3, we need other methods such as cross-validation prediction error, BIC, or adjusted R-squared.

RSS and R-squared

Both measures are computed from the training data alone, so they reflect only the training error of the model.

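The residual sum of squares (RSS) measures the difference between each observed y and the corresponding predicted y. We square each difference so that it counts as positive, regardless of whether the predicted value is above or below the observed one.

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

R-squared is then defined as

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where TSS is the total sum of squares, $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$, which measures the difference between each observed y and the mean of the observed y. For a least squares fit with an intercept, TSS = RSS + ESS, where ESS is the explained sum of squares.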

“TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R² measures the proportion of variability in Y that can be explained using X.” R-squared lies between 0 and 1, and a value close to 1 means the regression explains most of the variability in Y.
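In code, taking RSS as the sum of squared residuals, TSS as the sum of squared deviations from the mean of y, and R-squared as 1 − RSS/TSS, the calculation looks like this. The five data points and the variable names are made up for illustration, and the fit is a simple one-predictor least squares line.

```python
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept for a single predictor.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

rss_value = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variability
tss_value = sum((yi - y_bar) ** 2 for yi in y)               # total variability
ess_value = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variability
r_squared = 1 - rss_value / tss_value

print(rss_value, tss_value, ess_value, r_squared)
```

Here the line fits the points closely, so RSS is a small fraction of TSS and R-squared is close to 1.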

To estimate the test error “indirectly”, we can use adjusted R-squared, AIC, or BIC, which adjust the training error to account for the number of variables in the model.

Adjusted R-squared

For linear models, adding variables never decreases R-squared on the training data, so a larger model always appears to fit at least as well. This is fine when comparing models of the same size, as in each step of stepwise selection; however, when comparing models with different numbers of variables, R-squared alone is misleading. Therefore, we use adjusted R-squared, which brings the number of observations and the number of variables into the formula.

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$

Source: An Introduction to Statistical Learning with Applications in R. Second Edition. d: number of variables. n: number of observations

With adjusted R-squared, a model pays a price for unnecessary variables. “Once all of the correct variables have been included in the model, adding additional noise variables will lead to only a very small decrease in RSS”. Because d increases by one while RSS barely changes, RSS/(n − d − 1), the numerator of the fraction in formula 6.4, increases, and hence adjusted R-squared decreases.
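A quick numeric check makes this concrete. The function follows the adjusted R-squared definition 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]; the numbers are invented for illustration (20 observations, TSS of 100, and a fourth variable that lowers RSS only from 20 to 19.9).

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R-squared for a model with d variables fit on n observations."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

n, tss = 20, 100.0

# Model A: 3 useful variables. Model B: the same plus one noise variable.
r2_a, r2_b = 1 - 20.0 / tss, 1 - 19.9 / tss
adj_a = adjusted_r2(20.0, tss, n, d=3)
adj_b = adjusted_r2(19.9, tss, n, d=4)

print(r2_a, r2_b)    # plain R-squared goes up slightly
print(adj_a, adj_b)  # adjusted R-squared goes down
```

Plain R-squared rewards the noise variable, while adjusted R-squared penalizes it.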

Akaike’s Information Criterion (AIC), Bayesian Information Criterion (BIC)

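AIC and BIC are defined for a broad class of models fit by maximum likelihood, which includes linear models as well as non-linear ones such as logistic regression. Like adjusted R-squared, both add a penalty for the number of variables to a measure of training fit. The details of the formulas are beyond the scope of this post. In modern machine learning, these two criteria are used less often than cross-validation, and the reason is in the next section.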

Why do we hear about cross-validation a lot? A method to directly estimate the test error

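For details about cross-validation, please refer to my previous post. In simple words, cross-validation is an approach that “directly” estimates the test error: each fold is held out in turn, the model is fit on the remaining data, and the prediction error is measured on the held-out fold.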
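Here is a minimal sketch of k-fold cross-validation in plain Python, using a one-predictor least squares line as the model. The helpers `fit_line` and `kfold_mse` and the synthetic data are assumptions made for this example, not part of the reference.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def kfold_mse(x, y, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(x)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        # Fit on the training folds only.
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        # Measure the error on the fold the model has not seen.
        mse = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Synthetic data (illustrative): y = 1 + 2x plus noise with variance 1.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(40)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]

cv_mse = kfold_mse(x, y, k=5)
print(cv_mse)
```

Because every error is measured on observations the model was not fit on, the result estimates the test error without any adjustment for model size. To choose among candidate models, we would compute this value for each one and keep the model with the smallest value.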

“In the past, performing cross-validation was computationally prohibitive for many problems with large p and/or large n, … nowadays with fast computers, the computations required to perform cross-validation are hardly ever an issue.”

This is a summary of part of Chapter 6 of the book “An Introduction to Statistical Learning”. I hope you enjoy the article and the book.