A gentle guide to statistical resampling methods
Reference: An Introduction to Statistical Learning
All figures shared in this post are from the reference.
Let’s start with a simple concept: validation. “Validate” is defined by the Merriam-Webster dictionary as “to support or corroborate on a sound or authoritative basis.” In plain words, you are interested in using one dataset to test a model that was previously built and trained on another dataset. In data science, to achieve generalizability, we “randomly” divide the data into a training set and a validation set, also called a hold-out set (in programming, the validation set is often called the test set, e.g. train_test_split in the sklearn package). (Figure 1)
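For example, here is a minimal sketch of such a random split using scikit-learn’s train_test_split; the synthetic data and the 80/20 split ratio are illustrative assumptions, not part of the book’s example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for "the dataset" (100 observations, 5 features)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Randomly hold out 20% of the observations as the validation (test) set
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_valid.shape)  # (80, 5) (20, 5)
```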
Three ways of doing validation
In machine learning, there are typically three ways to do the validation: validation set approach, leave-one-out cross-validation (LOOCV), and k-fold cross-validation.
1. Validation set approach (one-time train/test split)
This method is exactly what Figure 1 shows: randomly divide the dataset into a training set and a validation set. One interesting observation is that if we do the splitting several times, we get different error estimates (e.g. mean squared error, which is commonly used in regression problems). Figure 2 demonstrates the variability of error estimates across several splits.
The high variability comes from the heterogeneity of the training sets drawn in different splits. To overcome this variability, leave-one-out cross-validation was developed.
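As a rough illustration of that variability, the sketch below repeats the one-time split several times; the synthetic data and the plain linear regression are assumptions made for demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# Each random 50/50 split yields a different validation MSE
for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, model.predict(X_va))
    print(f"split {seed}: validation MSE = {mse:.1f}")
```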
2. Leave-one-out cross-validation (LOOCV)
As simple as it sounds, LOOCV picks only one observation as the validation set. This way, the model is trained on almost the same training set every time, and therefore the variability of the error estimates is minimal. See Figure 3 for a demonstration.
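A minimal LOOCV sketch with scikit-learn’s LeaveOneOut is shown below; the synthetic data and the linear model are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=15.0, random_state=0)

# LeaveOneOut fits the model once per observation, holding out a single point each time
scores = cross_val_score(
    LinearRegression(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
print("number of fits:", len(scores))         # 50, one per observation
print("LOOCV MSE estimate:", -scores.mean())
```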
Imagine a dataset with n observations. LOOCV has to fit the model n times! In the era of big data, this is almost impossible (mentally, physically, and financially :P). Here comes the well-known k-fold cross-validation!
3. k-fold cross-validation
k-fold means that you split the data into k folds (k < n). Typically, we use k = 5 or 10. As shown in Figure 4, when k = 5, the model is fit 5 times with five different training and validation sets. This approach still yields higher variability in error estimates than LOOCV, because the training data vary more between fits; however, the variability is definitely smaller than that of the validation set approach.
In addition, what makes k-fold preferable to LOOCV is not entirely its computational advantage.
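Before looking at that trade-off, here is a minimal 5-fold sketch with scikit-learn’s KFold; again, the synthetic data and the linear model are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# 5-fold CV: the model is fit 5 times, each time validating on a different fifth of the data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
print("per-fold MSE:", -scores)
print("5-fold CV MSE estimate:", -scores.mean())
```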
Bias-variance trade-off
It is without question that, ranked from high to low, the bias of the three methods is: validation set approach, k-fold cross-validation, and LOOCV. The ranking follows from how complete each training set is compared to the original dataset: the fewer observations used for training, the more the method tends to overestimate the test error.
When it comes to variance, however, the variance of the LOOCV estimate is mathematically higher than that of k-fold cross-validation! Variance shows how far the error estimates spread out: the higher the variance, the less centralized the estimates are. For more details about the bias-variance trade-off, I recommend reading this post.
The Bootstrap
Bootstrap means, “Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set.”
The sampling is performed with replacement. In Figure 5, suppose we generate B data sets (Z1–ZB). In the Z1 dataset, Obs 3 appears 2 times, and in the ZB dataset, Obs 2 appears 2 times. This is what replacement means: every time you pick an observation from the dataset, you put it back before picking the next one, so the next pick may be the same observation you just picked up.
The bootstrap is incredibly powerful in practice, especially for small datasets or when we are concerned about the uncertainty of a statistical estimate. We will not always be able to obtain the “entire” dataset (technically, it is impossible in any field); however, with the bootstrap strategy, we can use the computer to simulate the process of “generating new datasets”, “so that we can estimate the variability of [an estimate] without generating additional samples”.
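As a small sketch of that idea, the code below bootstraps the standard error of a sample mean; the synthetic data and the choice of statistic are assumptions, not the book’s example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small "original" data set standing in for the only sample we actually have
data = rng.normal(loc=5.0, scale=2.0, size=30)

B = 1000                  # number of bootstrap data sets (Z1 ... ZB)
boot_means = np.empty(B)
for b in range(B):
    # Draw n observations WITH replacement from the original data set
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# The spread of the bootstrap estimates approximates the standard error
# of the sample mean, without collecting any new data
print("bootstrap SE of the mean:", boot_means.std(ddof=1))
```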
This is a summary of Chapter 5 of the book “An Introduction to Statistical Learning”. I hope you enjoy the article and the book.