Chapter 2 The coefficient of determination (\(R^2\))

Summary: \(R^2\) is not a good model-selection criterion because it increases with the size of the model; in other words, it always chooses the biggest model.
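As a quick illustration of this summary, below is a minimal sketch on simulated data (plain NumPy; the data, seed, and helper name `r_squared` are purely illustrative). It fits a nested sequence of OLS models with an intercept and shows that \(R^2\) never decreases as predictors are added, even though every predictor here is pure noise.

```python
# Sketch: R^2 is non-decreasing in nested models, so it always favors the biggest one.
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)                       # response unrelated to the predictors
X = rng.normal(size=(n, 10))                 # 10 candidate predictors (pure noise)

def r_squared(X_sub, y):
    """R^2 = 1 - RSS/TSS for an OLS fit with an intercept."""
    X_design = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# R^2 for the nested models using the first k predictors, k = 1, ..., 10
r2_path = [r_squared(X[:, :k], y) for k in range(1, 11)]
print(np.round(r2_path, 4))                  # monotonically non-decreasing
```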

Interpretation from Wikipedia: it is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

2.1 \(R^2\)

Definition: \[\begin{eqnarray} R^2 = 1-\frac{RSS}{TSS} = 1- \frac{\sum_i(y_i-\hat{f_i})^2}{\sum_i(y_i-\bar y)^2}, \end{eqnarray}\] where TSS is the total sum of squares and RSS is the residual sum of squares. Also define \(\text{ESS} = \sum_i(\hat f_i - \bar y)^2\), the explained sum of squares, also called the regression sum of squares. \(R^2\) is based on the assumption that \(TSS = RSS + ESS\), which usually holds under the linear regression model setting.
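Before turning to the proof of this decomposition, here is a quick numerical sanity check on hypothetical simulated data (NumPy only; the data-generating model is an assumption for illustration): compute RSS, TSS, and ESS from an OLS fit with an intercept and confirm that \(TSS = RSS + ESS\), so the two usual expressions for \(R^2\) agree.

```python
# Sketch: verify TSS = RSS + ESS for an OLS fit with an intercept.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

rss = np.sum((y - y_hat) ** 2)                # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)         # explained (regression) sum of squares

print(np.isclose(tss, rss + ess))             # True: the decomposition holds
print(1 - rss / tss, ess / tss)               # the two expressions for R^2 agree
```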

Proof:

\[\begin{eqnarray*} &&\sum_i(y_i-\bar y)^2 = \sum_i(y_i-\hat{f_i}+ \hat{f_i} - \bar y)^2 \\ &=&\sum_i(y_i-\hat{f_i})^2 + \sum_i(\hat{f_i} - \bar y)^2 + 2\sum_i(y_i-\hat{f_i})(\hat{f_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{f_i} - \bar y) \quad(\hat{f_i}=\hat{y_i}=({\bf X}\hat{{\boldsymbol \beta}})_i\enspace\text{in the linear model}) \\ &=& RSS + ESS + 2\sum_i\hat{e}_i(\hat{y_i} - \bar y)\\ &=& RSS + ESS + 2\sum_i\hat{e}_i\hat{y_i}-2\bar y\sum_i\hat{e}_i \end{eqnarray*}\] Then, the remaining part is to prove that the cross term \(\sum_i\hat{e}_i(\hat{y_i} - \bar y)=0\).

First, \(\sum_i\hat{e}_i\hat{y_i} = \hat{\bf e}^{{\bf T}}{\bf H}{\bf y}= {\bf y}^{{\bf T}}({\bf I}-{\bf H}){\bf H}{\bf y}= 0\) because \({\bf H}\) is idempotent. Then if we can also show \(\sum_i \hat{e}_i=0\), the proof is done. However, this cannot be shown for a model without an intercept.
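A short sketch of this hat-matrix argument on hypothetical simulated data (NumPy only; the explicit \({\bf H}\) is built just for illustration): \({\bf H}={\bf X}({\bf X}^{\bf T}{\bf X})^{-1}{\bf X}^{\bf T}\) is idempotent, so \(({\bf I}-{\bf H}){\bf H}=0\) and the residuals are orthogonal to the fitted values; with an intercept column they also sum to zero.

```python
# Sketch: H is idempotent, residuals are orthogonal to fitted values.
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])          # model with an intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
y_hat = H @ y
e_hat = (np.eye(n) - H) @ y                   # residuals

print(np.allclose(H @ H, H))                  # H is idempotent
print(np.isclose(e_hat @ y_hat, 0.0))         # sum_i e_i * yhat_i = 0
print(np.isclose(e_hat.sum(), 0.0))           # sum_i e_i = 0 (relies on the intercept)
```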

2.1.1 \(R^2\) in the model with an intercept

To see this, set the partial derivative of the residual sum of squares with respect to \(\hat\beta_0\) to zero (the first normal equation): \[ \frac{\partial RSS}{\partial\hat\beta_0} = \frac{\partial\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)^2}{\partial\hat\beta_0} = -2\sum_i(y_i-\hat\beta_0-x_i\hat\beta_1)=0, \] which can be rearranged to \(\sum_iy_i = \sum_i\hat\beta_0+\hat\beta_1\sum_ix_i=\sum_i\hat y_i\). Thus, \(\sum_i\hat e_i = \sum_iy_i - \sum_i\hat y_i = 0\).

Hence, in a model with an intercept, we have \(TSS = RSS + ESS\), or equivalently \(1 = \frac{RSS}{TSS} + \frac{ESS}{TSS}\).

From this, \(R^2\) is defined as \(R^2\overset{def}{=}1-\frac{RSS}{TSS} = \frac{ESS}{TSS}\).

By the above, since \(RSS\geq0\) and \(ESS\geq0\), we have \(0\leq R^2\leq1\).
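As a small check of this bound on simulated data (an illustrative sketch, not part of the original derivation): refitting an intercept model on many independent random datasets, every resulting \(R^2\) lands in \([0,1]\).

```python
# Sketch: with an intercept, 1 - RSS/TSS always lies in [0, 1].
import numpy as np

rng = np.random.default_rng(3)

def r2_with_intercept(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

r2s = [r2_with_intercept(rng.normal(size=20), rng.normal(size=20))
       for _ in range(1000)]
print(min(r2s) >= 0 and max(r2s) <= 1)        # True
```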

2.1.2 \(R^2\) in the model without an intercept

\(R^2\overset{def}{=}1-\frac{RSS}{TSS} = \frac{ESS+2\sum_i(y_i-\hat{y_i})(\hat{y_i} - \bar y)}{\sum_i(y_i-\bar y)^2}\). Without an intercept, the cross term \(2\sum_i(y_i-\hat{y_i})(\hat{y_i} - \bar y)\) no longer vanishes, so \(TSS \neq RSS + ESS\) in general and the two expressions for \(R^2\) disagree. If the cross term is negative enough that \(RSS > TSS\), then \(1-\frac{RSS}{TSS}\) is negative; and under the alternative definition \(R^2 = \frac{ESS}{TSS}\), the value can exceed 1.
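A sketch of this failure mode on hypothetical data (NumPy only; the specific data-generating choice is an assumption made so that the effect is obvious): regressing through the origin a response whose mean is far from zero and unrelated to \(x\), the decomposition breaks, \(1-\frac{RSS}{TSS}\) is strongly negative, and the residuals no longer sum to zero.

```python
# Sketch: regression through the origin where 1 - RSS/TSS is negative.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=50)
y = 5.0 + rng.normal(scale=0.1, size=50)      # y is nearly constant, unrelated to x

beta = np.sum(x * y) / np.sum(x ** 2)         # OLS slope through the origin
y_hat = beta * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)

print(1 - rss / tss)                          # large negative value
print(np.isclose(tss, rss + ess))             # False: the decomposition fails
print(np.isclose(np.sum(y - y_hat), 0.0))     # False: residuals do not sum to zero
```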

2.2 Adjusted \(R^2\)