Model Selection Criteria
2019-05-01
Chapter 1 Prerequisites
Consider a multiple linear regression model: \[\begin{eqnarray*} {\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon}, \end{eqnarray*}\] where \({\bf y}=(y_1,y_2,\dots,y_n)^{{\bf T}}\) is the \(n\)-dimensional response vector, \({\bf X}=({\bf x}_1,{\bf x}_2, \dots, {\bf x}_n)^{{\bf T}}\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_n({\boldsymbol 0},\sigma^2{\bf I}_n)\). We assume that \(p<n\) and that \({\bf X}\) has full column rank.
By maximum likelihood estimation (MLE), we obtain \[\begin{eqnarray*} &&\hat{\boldsymbol \beta}=({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\\ &&{\hat\sigma}^2 = \frac{SSE}{n}=\frac{||{\bf y}-{\bf X}\hat{\boldsymbol \beta}||^2}{n} = \frac{{\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf y}}{n} = \frac{{\bf y}^{{\bf T}}{\bf P}{\bf y}}{n}, \end{eqnarray*}\] where \({\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\) is the hat matrix and \({\bf P}= {\bf I}-{\bf H}\).
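To make these formulas concrete, here is a minimal NumPy sketch of the closed-form estimates. The simulated data, the true coefficients, and the function name `fit_ols_mle` are my own choices for illustration, not from any particular library or dataset.

```python
import numpy as np

def fit_ols_mle(X, y):
    """Closed-form MLE for the linear model y = X beta + eps.

    Returns beta_hat = (X^T X)^{-1} X^T y and
    sigma2_hat = SSE / n = y^T (I - H) y / n.
    """
    n = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)      # (X^T X)^{-1}, X assumed full column rank
    beta_hat = XtX_inv @ X.T @ y          # MLE / OLS estimate of beta
    H = X @ XtX_inv @ X.T                 # hat matrix H
    P = np.eye(n) - H                     # P = I - H
    sigma2_hat = (y @ P @ y) / n          # MLE of sigma^2 (divides by n, not n - p)
    return beta_hat, sigma2_hat

# Simulated example (assumed data, for illustration only)
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat, sigma2_hat = fit_ols_mle(X, y)
print(beta_hat, sigma2_hat)
```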
1.1 Bias-variance tradeoff
According to Wikipedia:
> In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.
>
> Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
1.2 Bias–variance decomposition of mean squared error (MSE)
We assume \({\bf y}= f(x) + \varepsilon\), where \(\mathbb{E}(\varepsilon)=0\) and \(\text{Var}(\varepsilon)=\sigma^2\). Our goal is to find a function \(\hat f(x)\) that minimizes the MSE of \(\hat f\), namely \(\mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\}\).
The bias-variance decomposition of the MSE proceeds as follows: \[\begin{eqnarray*} \mathbb{E}\{({\bf y}-\hat f)^{{\bf T}}({\bf y}-\hat f)\} &=& \{\mathbb{E}({\bf y}-\hat f)\}^{{\bf T}}\mathbb{E}({\bf y}-\hat f) + \text{Var}({\bf y}-\hat f)\\ &=&||\text{Bias}(\hat f)||^2 + \text{Var}({\bf y})+\text{Var}(\hat f) - 2\text{cov}({\bf y},\hat f)\\ &=& ||\text{Bias}(\hat f)||^2 +\text{Var}(\hat f) + \sigma^2, \end{eqnarray*}\] where \[\begin{eqnarray*}\text{cov}({\bf y},\hat f) &=& \mathbb{E}({\bf y}\hat f) - \mathbb{E}({\bf y})\mathbb{E}(\hat f)\\ &=& \mathbb{E}[(f+\varepsilon)\hat f] - \mathbb{E}(f+\varepsilon)\mathbb{E}(\hat f)\\ &=& f\mathbb{E}(\hat f) + \mathbb{E}(\varepsilon\hat f) - f\mathbb{E}(\hat f)\\ &=&\mathbb{E}(\varepsilon\hat f)\\ &=&0. \end{eqnarray*}\] The last step holds because the noise \(\varepsilon\) attached to the test response is independent of \(\hat f\), which is built from the training data only; independence gives \(\mathbb{E}(\varepsilon\hat f)=\mathbb{E}(\varepsilon)\mathbb{E}(\hat f)=0\cdot\mathbb{E}(\hat f)=0\), which is exactly the orthogonality \(\varepsilon \bot \hat f\).
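To see the decomposition numerically, here is a small Monte Carlo sketch of my own, assuming a fixed test point \(x_0\), a true function \(f(x)=\sin(x)\), and a deliberately simple straight-line fit: averaged over repeated training sets, the squared prediction error at \(x_0\) should be close to \(\text{bias}^2 + \text{variance} + \sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
f = lambda x: np.sin(x)          # true regression function (assumed for illustration)
x0 = 1.5                         # fixed test point
n, reps = 50, 5000

preds = np.empty(reps)           # \hat f(x0) over repeated training sets
sq_err = np.empty(reps)          # (y0 - \hat f(x0))^2 with fresh test noise
for r in range(reps):
    x = rng.uniform(0, 3, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    # fit a straight line: a deliberately simple (biased) model
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[r] = beta[0] + beta[1] * x0
    y0 = f(x0) + rng.normal(scale=sigma)   # independent test response at x0
    sq_err[r] = (y0 - preds[r]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print("Monte Carlo MSE     :", sq_err.mean())
print("bias^2 + var + sigma^2:", bias2 + var + sigma ** 2)
```

The two printed numbers should agree up to Monte Carlo error, matching the identity derived above.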
Bias-variance tradeoff:
- Models that include many covariates tend to have low bias but high variance.
- Models that include few covariates tend to have high bias but low variance.
Hence, we need criteria that take into account both model complexity (the number of predictors) and the quality of fit, as the small simulation below illustrates.
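The sketch below (simulated data, with only the first two covariates truly active; all choices are mine for illustration) shows why fit alone is not enough: the training SSE can only decrease as covariates are added, even when the extra covariates are pure noise, while the out-of-sample error eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_test, p_max = 60, 1000, 15
# only the first 2 covariates matter; the rest are pure noise
beta = np.concatenate([[2.0, -1.0], np.zeros(p_max - 2)])
X, X_test = rng.normal(size=(n, p_max)), rng.normal(size=(n_test, p_max))
y = X @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=n_test)

for p in range(1, p_max + 1):
    # fit the nested model using the first p covariates plus an intercept
    Xp = np.column_stack([np.ones(n), X[:, :p]])
    b = np.linalg.lstsq(Xp, y, rcond=None)[0]
    sse_train = np.sum((y - Xp @ b) ** 2)
    Xp_test = np.column_stack([np.ones(n_test), X_test[:, :p]])
    mse_test = np.mean((y_test - Xp_test @ b) ** 2)
    print(f"p = {p:2d}  train SSE = {sse_train:7.2f}  test MSE = {mse_test:.3f}")
```

Running this typically shows the test MSE bottoming out near \(p=2\) and then creeping up as noise covariates enter, while the training SSE keeps shrinking; this is exactly the gap that penalized criteria are designed to close.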
1.3 Structure of my note
I plan to introduce model selection criteria, from well-known methods such as BIC to methods from papers. I will update the following list as I complete each item:
- \(R^2\)
- Mallows \(C_p\)
- AIC
- BIC
- DIC
- EBIC