The bias-variance dilemma is relevant for supervised machine learning. It’s a way to diagnose an algorithm’s performance by breaking down its prediction error. There are three types of prediction error: bias, variance, and irreducible error.
- Bias error: The error due to bias is the difference between the expected (or average) prediction of the model and the true value it is trying to predict. Of course, there is only one model, so talking about expected or average predictions might seem a little weird. However, imagine repeating the model building process more than once, each time gathering new data and running a new analysis to create a new model. Due to randomness in the underlying datasets, the resulting models will have a range of predictions. Bias measures how far off, in general, these models’ predictions are from the correct value. Imagine fitting a linear regression to a dataset that has a non-linear pattern:
No matter how many more observations are collected, a linear regression won’t be able to model the curves in that data! This is known as underfitting.
- Variance error: The error due to variance is the variability of a model’s prediction for a given data point. Again, imagine it were possible to repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model. For example, imagine an algorithm that fits a completely unconstrained, flexible model to the dataset.
As seen in the figure above, this unconstrained model has basically memorized the training set, including all the noise. This is known as overfitting.
- Irreducible error: The irreducible error is the noise term in the true relationship that cannot fundamentally be reduced by any model. It typically comes from inherent randomness or an incomplete feature set.
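The "repeat the model building process" thought experiment above can be simulated directly. The following is only a minimal sketch under stated assumptions: the quadratic ground truth `true_f`, the noise level, the sample sizes, and the two stand-in models (a straight-line fit as the rigid model, 1-nearest-neighbour as the flexible one) are all hypothetical choices made for illustration.

```python
import random
import statistics

def true_f(x):
    # Hypothetical non-linear ground truth, chosen only for illustration.
    return x * x

def sample_dataset(n, rng, noise_sd=0.3):
    # Draw a fresh noisy dataset, as if new data were gathered each time.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Rigid model: ordinary least squares for y = a + b * x.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest(xs, ys):
    # Flexible model: 1-nearest-neighbour, which memorizes the training set.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bias_and_variance(fit, x0, n_repeats=500, n_points=30, seed=0):
    # Rebuild the model on many fresh datasets and collect its prediction at x0.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        xs, ys = sample_dataset(n_points, rng)
        preds.append(fit(xs, ys)(x0))
    bias = statistics.fmean(preds) - true_f(x0)   # average prediction minus truth
    variance = statistics.pvariance(preds)        # spread of the predictions
    return bias, variance

line_bias, line_var = bias_and_variance(fit_line, x0=0.0)
nn_bias, nn_var = bias_and_variance(fit_nearest, x0=0.0)
print(f"line: bias={line_bias:.3f} variance={line_var:.4f}")
print(f"1-NN: bias={nn_bias:.3f} variance={nn_var:.4f}")
```

In this setup the straight line should show the larger bias at `x0 = 0` (it cannot follow the curve) and the nearest-neighbour model the larger variance (it follows the noise), which is the pattern the two bullets describe.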
At its root, dealing with bias and variance is really about dealing with underfitting and overfitting. As model complexity increases, bias decreases and variance increases. For example, the more polynomial terms are added to a linear regression, the greater the resulting model’s complexity will be. In other words, bias has a negative slope (first-order derivative) with respect to model complexity, while variance has a positive slope.
Why is there a trade-off between bias and variance?
Low variance (high bias) algorithms tend to be less complex, with a simple or rigid underlying structure. These models include linear or parametric algorithms such as regression and naive Bayes.
On the other hand, low bias (high variance) algorithms tend to be more complex, with a flexible underlying structure. These models include non-linear or non-parametric algorithms such as decision trees and nearest neighbors.
This tradeoff in complexity is why there’s a tradeoff in bias and variance: an algorithm cannot simultaneously be more complex and less complex.
What is the total error?
The total error may then be decomposed into bias, variance, and irreducible error components:
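Under squared-error loss, and assuming the data follow y = f(x) + ε with zero-mean noise of variance σ² (this notation is introduced here for illustration and is not taken from the original text), the standard form of the decomposition at a point x is:

```latex
\mathrm{Err}(x)
  = \mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Here \hat{f} is the fitted model, and the expectations are taken over repeated draws of the training set (and of the noise in y), matching the "repeat the model building process" picture used earlier.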
How can overfitting and underfitting be detected, and what solutions exist to address them?
Overfitting results in low training error and high test error, while underfitting results in high errors in both the training and test set.
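As a rough illustration of that diagnostic, the sketch below compares training and test error for a deliberately rigid model and a deliberately flexible one. Everything in it is an assumption made for the example: the noisy sine data, the split sizes, and the two toy models (`fit_mean` and `fit_nearest`).

```python
import math
import random

def make_data(n, rng):
    # Hypothetical data: a sine curve plus Gaussian noise.
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    # Mean squared error of a model on a dataset.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    # Very rigid model: always predicts the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_nearest(xs, ys):
    # Very flexible model: 1-nearest-neighbour memorizes the training set.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(1)
train_x, train_y = make_data(40, rng)
test_x, test_y = make_data(200, rng)

report = {}
for name, fit in [("mean", fit_mean), ("1-NN", fit_nearest)]:
    model = fit(train_x, train_y)
    report[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name}: train MSE={report[name][0]:.3f} test MSE={report[name][1]:.3f}")
```

The thing to look at is the pattern rather than the absolute numbers: the mean model has similar, high error on both sets (underfitting), while 1-NN has zero training error and a clearly larger test error (overfitting).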
However, measuring training and test errors is hard when there are relatively few data points and the algorithms need many data points. In this case, a good choice is a technique called cross-validation.
This is where we take our entire dataset and split it into k groups. For each of the k groups, we train on the remaining k−1 groups and validate on the held-out group. This way we can make the most of our data, essentially taking one dataset and training k times on it.
As for what to do after you detect a problem: high bias is symptomatic of a model that is not complex enough. In that case, the best bet is to pick a more complex model (for example, by getting more features or adding polynomial features).
The problem of high variance is a bit more interesting. One naive approach to reduce high variance is to use more data. Theoretically, with a complex model, the variance tends toward zero as the number of samples tends toward infinity. However, this approach is naive because the rate at which the variance decreases is typically fairly slow, and large datasets are almost always very hard to come by.
A better solution to reduce variance is to use regularization. The model is still fit to the training data, but it is penalized for growing too complex. Essentially, regularization injects bias into the model by telling it not to become too complex. Common regularization techniques include lasso or ridge regression, dropout for neural networks, and soft margin SVMs.
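The k-fold procedure described above can be sketched in a few lines of plain Python. The helper names (`k_fold_indices`, `cross_validate`), the toy dataset, and the mean-predicting model are hypothetical and only serve to make the loop concrete.

```python
import random

def k_fold_indices(n, k, seed=0):
    # Shuffle the indices 0..n-1 and deal them into k disjoint groups.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, k=5, seed=0):
    # For each group: train on the other k-1 groups, validate on the held-out one.
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        err = sum((model(xs[i]) - ys[i]) ** 2 for i in held_out) / len(held_out)
        scores.append(err)
    return scores

def fit_mean(xs, ys):
    # Toy model for the example: always predicts the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]
folds = k_fold_indices(len(xs), 5)
scores = cross_validate(xs, ys, fit_mean, k=5)
cv_error = sum(scores) / len(scores)
print(f"per-fold MSE: {scores}")
print(f"cross-validated MSE: {cv_error:.3f}")
```

Every point is used for validation exactly once and for training k−1 times, and the average of the per-fold errors is the cross-validated estimate.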
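To make the "penalize complexity" idea concrete, here is a minimal sketch of ridge regression for the simplest possible case: a single slope with no intercept. The function name `ridge_slope` and the five data points are invented for illustration. The closed form comes from minimizing sum((y - b*x)^2) + lam * b^2 over b.

```python
def ridge_slope(xs, ys, lam):
    # Closed-form minimizer of sum((y - b*x)^2) + lam * b^2 for a no-intercept model.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Made-up data that roughly follows y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

# lam = 0 is plain least squares; larger lam shrinks the slope toward zero.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, b in slopes.items():
    print(f"lambda={lam:>5}: slope={b:.3f}")
```

As the penalty grows, the fitted slope is pulled away from the least-squares value toward zero. That pull is the injected bias, and the payoff is a coefficient that reacts less to noise in any one training set.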