Chapter 1 Introduction

1.1 Linear model

Consider a multiple linear regression model as follows: \[{\bf y}={\bf X}{\boldsymbol \beta}+ {\boldsymbol \epsilon},\] where \({\bf y}=(y_1,y_2,\dots,y_n)^{{\bf T}}\) is the \(n\)-dimensional response vector, \({\bf X}=({\bf x}_1,{\bf x}_2, \dots, {\bf x}_n)^{{\bf T}}\) is the \(n\times p\) design matrix, and \({\boldsymbol \epsilon}\sim \mathcal{N}_n({\boldsymbol 0},\sigma^2{\bf I}_n)\). We assume that \(p<n\) and \({\bf X}\) has full column rank.

1.2 Maximum likelihood estimation (MLE) approach

Since \({\bf y}\sim \mathcal{N}_n({\bf X}{\boldsymbol \beta}, \sigma^2{\bf I}_n)\), the likelihood function is given as \[\begin{eqnarray*} L({\boldsymbol \beta},\sigma^2)&=&f({\bf y}|{\boldsymbol \beta},\sigma^2)\\ &= &(2\pi)^{-\frac{n}{2}}\lvert {\boldsymbol \Sigma}\lvert^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}{\boldsymbol \Sigma}^{-1}({\bf y}-{\bf X}{\boldsymbol \beta})\right\} , \end{eqnarray*}\]

where \({\boldsymbol \Sigma}=\sigma^2 {\bf I}_n\). Then the log likelihood can be written as \[\begin{eqnarray*} l({\boldsymbol \beta},\sigma^2)&=&\log L({\boldsymbol \beta},\sigma^2)\\ &=&-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta}). \end{eqnarray*}\] Note that \(l({\boldsymbol \beta},\sigma^2)\) is concave in \({\boldsymbol \beta}\) for fixed \(\sigma^2\) and, for fixed \({\boldsymbol \beta}\), has a unique maximizer in \(\sigma^2\). Hence, the maximum likelihood estimator (MLE) can be obtained by solving the following score equations: \[\begin{eqnarray*} \frac{\partial l({\boldsymbol \beta},\sigma^2)}{\partial {\boldsymbol \beta}} &=&- \frac{1}{2\sigma^2}(-2{\bf X}^{{\bf T}} {\bf y}+ 2{\bf X}^{{\bf T}} {\bf X}{\boldsymbol \beta})=0; \\ \frac{\partial l({\boldsymbol \beta},\sigma^2)}{\partial \sigma^2} &=&-\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2}\frac{1}{(\sigma^2)^2} ({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta})=0. \end{eqnarray*}\] Therefore, the MLEs of \({\boldsymbol \beta}\) and \(\sigma^2\) are given as \[\begin{eqnarray*} \hat{{\boldsymbol \beta}} &=& ({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}} {\bf y};\\ \hat{\sigma}^2 &=& \frac{({\bf y}-{\bf X}\hat{{\boldsymbol \beta}})^{{\bf T}}({\bf y}-{\bf X}\hat{{\boldsymbol \beta}})}{n}=\frac{ \|{\bf y}-\hat{{\bf y}}\|^2}{n}, \end{eqnarray*}\] where \(\hat{{\bf y}}={\bf X}\hat{{\boldsymbol \beta}}\).
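The closed-form MLEs above can be checked numerically. The following sketch (the simulated data and variable names are illustrative, not part of the notes) computes \(\hat{{\boldsymbol \beta}}\) by solving the normal equations and \(\hat{\sigma}^2\) as the mean squared residual:

```python
import numpy as np

# Illustrative data: n observations, p covariates, Gaussian noise.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 1.5
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Solve the normal equations X^T X beta = X^T y
# (numerically preferable to forming an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n  # the MLE divides by n, not n - p
```

With \(n=200\), `beta_hat` lands close to `beta_true` and `sigma2_hat` close to \(\sigma^2=2.25\); note the MLE \(\hat\sigma^2\) is biased downward by the factor \((n-p)/n\).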

1.2.1 Distribution of \(\hat{\boldsymbol \beta}\)

Note that if \({\bf z}\sim \mathcal{N}_k(\mu,\Sigma)\), then \(A{\bf z}\sim \mathcal{N}_m(A\mu,A\Sigma A^{{\bf T}})\), where \(A\) is an \(m\times k\) matrix. Since \({\bf y}\sim \mathcal{N}_n({\bf X}{\boldsymbol \beta}, \sigma^2{\bf I}_n)\) and \(\hat{{\boldsymbol \beta}} =({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}}{\bf y}\), we have \[\begin{eqnarray*} \hat{{\boldsymbol \beta}} &\sim& \mathcal{N}_p\left( ({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X}{\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X}({\bf X}^{{\bf T}} {\bf X})^{-1}\right)\\ &=&\mathcal{N}_p\left({\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}\right). \end{eqnarray*}\]
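The sampling distribution \(\hat{{\boldsymbol \beta}}\sim \mathcal{N}_p({\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1})\) can be verified by Monte Carlo. In this sketch (simulation settings are my own), \({\bf X}\) is held fixed while \({\bf y}\) is redrawn, and the empirical mean and covariance of the resulting \(\hat{{\boldsymbol \beta}}\) draws are compared with the theoretical values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 2, 1.0
X = rng.normal(size=(n, p))          # fixed design
beta = np.array([2.0, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw y many times and recompute the MLE each time.
draws = np.empty((5000, p))
for i in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    draws[i] = np.linalg.solve(X.T @ X, X.T @ y)

emp_mean = draws.mean(axis=0)        # should approach beta
emp_cov = np.cov(draws.T)            # should approach sigma^2 (X^T X)^{-1}
```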

1.2.2 Distribution of \(\hat\sigma^2\)

Note that \({\bf y}-\hat{{\bf y}}=({\bf I}_n-{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}){\bf y}\), where \({\bf I}_n-{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\) is an idempotent matrix with rank \((n-p)\).

\(\color{red}{\text{Can you prove that ${\bf I}_n-{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}$ is an idempotent matrix of rank $(n-p)$?}}\)

Proof. Let \({\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}\).

\({\bf H}{\bf H}= {\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}={\bf X}({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}={\bf H}\); thus \({\bf H}\) is an idempotent matrix.

Similarly, as \(({\bf I}_n-{\bf H})({\bf I}_n-{\bf H}) = {\bf I}_n-2{\bf H}+{\bf H}{\bf H}= {\bf I}_n-{\bf H}\), \(({\bf I}_n-{\bf H})\) is also idempotent.

Hence we have \(r({\bf I}_n-{\bf H})=tr({\bf I}_n-{\bf H})=n-tr({\bf H})=n-tr(({\bf X}^{{\bf T}}{\bf X})^{-1}{\bf X}^{{\bf T}}{\bf X})=n-tr({\bf I}_p)=n-p\).

\(\color{red}{\text{How to prove $r({\bf I}_n-{\bf H})=tr({\bf I}_n-{\bf H})$? }}\)

The eigenvalues of an idempotent matrix are either 0 or 1; hence its rank equals the sum of its eigenvalues, which in turn equals its trace.

\(\color{red}{\text{How to prove trace of matrix is the sum of eigenvalues?}}\)

This follows from the characteristic polynomial: in \(\det(\lambda{\bf I}-{\bf A})\), the coefficient of \(\lambda^{k-1}\) equals \(-tr({\bf A})\), and factoring the polynomial as \(\prod_i(\lambda-\lambda_i)\) shows the same coefficient equals \(-\sum_i\lambda_i\); hence \(tr({\bf A})=\sum_i\lambda_i\).
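The idempotency and rank claims are easy to confirm numerically. A small sanity-check sketch (my own example, not from the notes): for a random full-rank \({\bf X}\), the matrix \({\bf I}_n-{\bf H}\) satisfies \(({\bf I}_n-{\bf H})^2={\bf I}_n-{\bf H}\) and \(r({\bf I}_n-{\bf H})=tr({\bf I}_n-{\bf H})=n-p\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 4
X = rng.normal(size=(n, p))           # full column rank with probability 1

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
M = np.eye(n) - H                      # residual-maker matrix I - H

idempotent = np.allclose(M @ M, M)     # (I - H)^2 = I - H
rank = np.linalg.matrix_rank(M)        # should be n - p
trace = np.trace(M)                    # should also be n - p
```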

From Lemma 1.1, we have that \[\begin{eqnarray} \label{eq:1} n\frac{\hat{\sigma}^2}{\sigma^2}\sim \chi^2(n-p), \end{eqnarray}\] where \(\chi^2(n-p)\) denotes the chi-squared distribution with degrees of freedom \(n-p\).

Proof. Let \(\hat{{\bf e}}={\bf y}-\hat{{\bf y}}=({\bf I}_n-{\bf H}){\bf y}\). By Lemma 1.1, we have
\(n\frac{\hat{\sigma}^2}{\sigma^2}=\frac{\hat{{\bf e}}^{{\bf T}}\hat{{\bf e}}}{\sigma^2}={\bf y}^{{\bf T}}\left(\frac{{\bf I}_n-{\bf H}}{\sigma^2}\right) {\bf y}={\bf y}^{{\bf T}}{\bf A}{\bf y}\sim\chi^2\left(r({\bf A}{\boldsymbol \Sigma}),\mu^{{\bf T}}{\bf A}\mu/2\right),\)

where \({\bf A}=\frac{{\bf I}_n-{\bf H}}{\sigma^2}\) and \(\mu={\bf X}{\boldsymbol \beta}\). Since \(({\bf I}_n-{\bf H}){\bf X}={\bf 0}\), the non-centrality parameter is \(\mu^{{\bf T}}{\bf A}\mu/2=({\bf X}{\boldsymbol \beta})^{{\bf T}}\left(\frac{{\bf I}_n-{\bf H}}{\sigma^2}\right)({\bf X}{\boldsymbol \beta})/2=0\). Moreover, \({\bf A}{\boldsymbol \Sigma}= \left(\frac{{\bf I}_n-{\bf H}}{\sigma^2}\right)\sigma^2{\bf I}_n={\bf I}_n-{\bf H}\) is idempotent, hence \(r({\bf A}{\boldsymbol \Sigma}) = n-p\). Therefore \(n\frac{\hat{\sigma}^2}{\sigma^2}\sim \chi^2(n-p)\).
Lemma 1.1 Let \({\bf z}\sim \mathcal{N}_k(\mu,\Sigma)\) with \(r(\Sigma)=k\), where \(r(\Sigma)\) denotes the rank of \(\Sigma\). If \({\bf A}{\boldsymbol \Sigma}\) (or \({\boldsymbol \Sigma}{\bf A}\)) is an idempotent matrix of rank \(m\), then \({\bf z}^{{\bf T}}{\bf A}{\bf z}\) follows the non-central chi-squared distribution with degrees of freedom \(m\) and non-centrality parameter \(\lambda=\mu^{{\bf T}}{\bf A}\mu/2\).
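The result \(n\hat{\sigma}^2/\sigma^2\sim\chi^2(n-p)\) can also be seen in simulation. This hedged sketch (settings are mine) repeatedly draws \({\bf y}\), computes the scaled residual sum of squares, and checks that its empirical mean and variance match \(n-p\) and \(2(n-p)\), the moments of a \(\chi^2(n-p)\) distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 30, 5, 2.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (fixed across replicates)

stats = np.empty(20000)
for i in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    resid = y - H @ y                   # residuals (I - H) y
    stats[i] = resid @ resid / sigma**2 # = n * sigma2_hat / sigma^2

emp_mean, emp_var = stats.mean(), stats.var()
# A chi^2(n - p) variable has mean n - p and variance 2(n - p).
```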

1.3 Bayesian approach

1.3.1 \(\sigma^2\) is known

Suppose \(\sigma^2\) is known. We define the prior distribution of \({\boldsymbol \beta}\) by \({\boldsymbol \beta}\sim \mathcal{N}_p({\boldsymbol 0},\sigma^2\nu{\bf I}_p)\). Then the posterior density of \({\boldsymbol \beta}\) can be obtained by \[\begin{eqnarray*} \pi ({\boldsymbol \beta}|{\bf y}) &\propto& f({\bf y}|{\boldsymbol \beta})\pi ({\boldsymbol \beta}) \\ &\propto& \exp\left(-\frac{1}{2\sigma^2}\| {\bf y}-{\bf X}{\boldsymbol \beta}\|^2\right)\times \exp\left(-\frac{1}{2\sigma^2\nu}\|{\boldsymbol \beta}\|^2\right)\\ &=&\exp\left[-\frac{1}{2\sigma^2}\left\{ ({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta}) + \frac{1}{\nu}{\boldsymbol \beta}^{{\bf T}} {\boldsymbol \beta}\right\}\right]\\ &\propto&\exp\left\{-\frac{1}{2\sigma^2}\left( - 2 {\boldsymbol \beta}^{{\bf T}} {\bf X}^{{\bf T}} {\bf y}+ {\boldsymbol \beta}^{{\bf T}} {\bf X}^{{\bf T}} {\bf X}{\boldsymbol \beta}+ \frac{1}{\nu}{\boldsymbol \beta}^{{\bf T}}{\boldsymbol \beta} \right)\right\}\\ &=&\exp\left\{-\frac{1}{2\sigma^2}\left( - 2 {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right) \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}+ {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right){\boldsymbol \beta}\right)\right\}\\ &\propto&\exp\left\{-\frac{1}{2\sigma^2}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) \right\}, \end{eqnarray*}\] where \(\tilde{{\boldsymbol \beta}}=\left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}\).

This implies that \[\begin{eqnarray} \label{eq:2} {\boldsymbol \beta}|{\bf y}\sim \mathcal{N}_p\left(\left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y},\sigma^{2} \left({\bf X}^{{\bf T}} {\bf X} +\frac{1}{\nu}{\bf I}_p\right)^{-1}\right). \end{eqnarray}\] The Bayesian estimate is \(\hat{\boldsymbol \beta}_{Bayes} = \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}\). It is worth noting that as \(\nu\to \infty\) (a diffuse prior), \(\hat{\boldsymbol \beta}_{Bayes}\to \hat{{\boldsymbol \beta}}_{MLE}\) and the posterior converges to \(\mathcal{N}_p\left(\hat{{\boldsymbol \beta}}_{MLE}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}\right)\), whose covariance matches that of the sampling distribution \(\hat{{\boldsymbol \beta}}_{MLE}\sim \mathcal{N}_p\left({\boldsymbol \beta}, \sigma^2 ({\bf X}^{{\bf T}} {\bf X})^{-1}\right)\).
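A minimal sketch of the known-\(\sigma^2\) posterior (simulated data and function name are my own): the posterior mean is a ridge-type estimator, shrinking toward \({\boldsymbol 0}\) for small \(\nu\) and approaching the MLE for large \(\nu\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 80, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(scale=sigma, size=n)

def posterior(nu):
    # Posterior N_p(A^{-1} X^T y, sigma^2 A^{-1}) with A = X^T X + I_p / nu.
    A = X.T @ X + np.eye(p) / nu
    mean = np.linalg.solve(A, X.T @ y)
    cov = sigma**2 * np.linalg.inv(A)
    return mean, cov

beta_mle = np.linalg.solve(X.T @ X, X.T @ y)
mean_small, _ = posterior(0.01)   # tight prior: strong shrinkage toward 0
mean_large, _ = posterior(1e8)    # diffuse prior: essentially the MLE
```

For small \(\nu\) the posterior mean has a smaller norm than \(\hat{{\boldsymbol \beta}}_{MLE}\), illustrating the shrinkage induced by the prior.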

1.3.2 \(\sigma^2\) is unknown

In general, \(\sigma^2\) is unknown. Then, we need to assign a reasonable prior distribution for \(\sigma^2\). We consider the inverse gamma distribution, which is a conjugate prior for \(\sigma^2\), as follows: \(\sigma^2\sim \mathcal{IG}(a_0,b_0)\) with the density function \[\begin{eqnarray} \pi(\sigma^2)=\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right), \end{eqnarray}\] where \(a_0>0\) and \(b_0>0\). In addition, we need to introduce a prior for \({\boldsymbol \beta}|\sigma^2\):

\[{\boldsymbol \beta}|\sigma^2 \sim \mathcal{N}_p({\boldsymbol 0},\sigma^2\nu{\bf I}_p).\]

\(\color{red}{\text{Today, we derive the joint posterior distribution } \pi({\boldsymbol \beta},\sigma^2|{\bf y})\propto f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2).}\)

Show that

    1. \(\pi(\sigma^2|{\boldsymbol \beta},{\bf y}) = \mathcal{IG}(a^*,b^*)\);
    2. \(\pi({\boldsymbol \beta}|{\bf y})\) follows a multivariate t-distribution.

The posterior density of \(\sigma^2\) given \(({\boldsymbol \beta},{\bf y})\) can be obtained as follows.

Proof (1). \[\begin{eqnarray*} &&\pi(\sigma^2|{\boldsymbol \beta},{\bf y}) = \frac{f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)}{\int f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)d\sigma^2} = \frac{f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2)}{\int f({\bf y},{\boldsymbol \beta},\sigma^2)d\sigma^2}\\ &\propto& f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2)\\ &\propto& (\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2\right) \times (\sigma^2)^{-\frac{p}{2}}\exp\left( -\frac{1}{2\sigma^2\nu}\|{\boldsymbol \beta}\|^2\right)\times (\sigma^2)^{-a_0-1}\exp \left(-\frac{b_0}{\sigma^2}\right)\\ &=& (\sigma^2)^{-(\frac{n+p}{2}+a_0)-1}\exp \left[-\frac{1}{\sigma^2}\left\{\frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0 \right\}\right]\\ &=& (\sigma^2)^{-a^*-1} \exp(-\frac{b^*}{\sigma^2}) \end{eqnarray*}\] where \(a^* = \frac{n+p}{2}+a_0\), \(b^* = \frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0\).

This implies that \[\begin{eqnarray} \sigma^2|{\boldsymbol \beta},{\bf y}\sim \mathcal{IG}\left(\frac{n+p}{2}+a_0, \frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0\right). \end{eqnarray}\]
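The conditional \(\mathcal{IG}(a^*,b^*)\) update above is exactly the \(\sigma^2\)-step one would use inside a Gibbs sampler. A hedged sketch (the data and hyperparameter values are my own choices): since \(1/\sigma^2\sim \mathcal{G}amma(a^*,\text{rate}=b^*)\) when \(\sigma^2\sim\mathcal{IG}(a^*,b^*)\), we can draw \(\sigma^2\) by inverting gamma draws:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, nu, a0, b0 = 50, 2, 10.0, 2.0, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0])          # conditioning value of beta
y = X @ beta + rng.normal(size=n)

# a* and b* as derived above.
resid = y - X @ beta
a_star = (n + p) / 2 + a0
b_star = 0.5 * resid @ resid + 0.5 / nu * beta @ beta + b0

# numpy's gamma uses shape/scale, so scale = 1 / rate = 1 / b_star.
draws = 1.0 / rng.gamma(shape=a_star, scale=1.0 / b_star, size=20000)

emp_mean = draws.mean()
ig_mean = b_star / (a_star - 1)       # mean of IG(a*, b*) for a* > 1
```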


Proof (2). \[\begin{eqnarray*} &&\pi({\boldsymbol \beta}|{\bf y}) \\ &=& \int\pi({\boldsymbol \beta},\sigma^2|{\bf y})d\sigma^2\\ &=& \int \frac{f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)}{\iint f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta},\sigma^2)d{\boldsymbol \beta}d\sigma^2}d\sigma^2\\ &\propto& \int f({\bf y}|{\boldsymbol \beta},\sigma^2)\pi({\boldsymbol \beta}|\sigma^2)\pi(\sigma^2) d\sigma^2\\ &\propto&\Gamma(a^*)(b^*)^{-a^*}\\ &\propto&(b^*)^{-a^*}\\ &=&\left[\frac{1}{2}\|{\bf y}-{\bf X}{\boldsymbol \beta}\|^2+\frac{1}{2\nu}\|{\boldsymbol \beta}\|^2 + b_0\right]^{-a^*}\\ &\propto&\left[({\bf y}-{\bf X}{\boldsymbol \beta})^{{\bf T}}({\bf y}-{\bf X}{\boldsymbol \beta})+\frac{1}{\nu}{\boldsymbol \beta}^{{\bf T}}{\boldsymbol \beta}+ 2b_0 \right]^{-a^*}\\ &=&\left[{\bf y}^{{\bf T}}{\bf y}- 2{\boldsymbol \beta}^{{\bf T}}{\bf X}^{{\bf T}}{\bf y}+ {\boldsymbol \beta}^{{\bf T}}{\bf X}^{{\bf T}}{\bf X}{\boldsymbol \beta}+ \frac{1}{\nu}{\boldsymbol \beta}^{{\bf T}}{\boldsymbol \beta}+2b_0 \right]^{-a^*}\\ &=& \left[ {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right){\boldsymbol \beta}- 2 {\boldsymbol \beta}^{{\bf T}} \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right) \left({\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p\right)^{-1} {\bf X}^{{\bf T}} {\bf y}+ {\bf y}^{{\bf T}}{\bf y}+2b_0 \right]^{-a^*}\\ &=& \left[ {\boldsymbol \beta}^{{\bf T}}{\bf A}{\boldsymbol \beta}- 2{\boldsymbol \beta}^{{\bf T}}{\bf A}\tilde{{\boldsymbol \beta}} + \tilde{{\boldsymbol \beta}}^{{\bf T}} {\bf A}\tilde{{\boldsymbol \beta}} - \tilde{{\boldsymbol \beta}}^{{\bf T}}{\bf A}\tilde{{\boldsymbol \beta}} + {\bf y}^{{\bf T}}{\bf y}+2b_0\right]^{-a^*}\\ &=& \left[ ({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^{{\bf T}}{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) + {\bf y}^{{\bf T}}{\bf y}- {\bf y}^{{\bf T}}{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}{\bf y}+ 2b_0\right]^{-a^*}\\ &=& \left[ ({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^{{\bf T}}{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) + {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_0\right]^{-a^*}\\ &=& \left[({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^{{\bf T}}{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}}) + c^* \right]^{-a^*}\\ &\propto&\left[1+ \frac{1}{c^*}({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^{{\bf T}}{\bf A}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})\right]^{-\frac{n+p+2a_0}{2}}\\ &=&\left[1+ \frac{1}{\nu^*}({\boldsymbol \beta}- \tilde{{\boldsymbol \beta}})^{{\bf T}}\left(\frac{c^*}{\nu^*}{\bf A}^{-1}\right)^{-1}({\boldsymbol \beta}-\tilde{{\boldsymbol \beta}})\right]^{-\frac{p+\nu^*}{2}}, \end{eqnarray*}\] where the fifth line integrates out \(\sigma^2\) using the normalizing constant of the \(\mathcal{IG}(a^*,b^*)\) density, and the last exponent uses \(a^*=\frac{n+p+2a_0}{2}=\frac{p+\nu^*}{2}\). This implies that \[\begin{eqnarray} {\boldsymbol \beta}|{\bf y}\sim {\bf \mathcal{MT}}\left(\tilde{{\boldsymbol \beta}},\frac{c^*}{\nu^*}{\bf A}^{-1},\nu^*\right), \tag{1.1} \end{eqnarray}\]

where

\[\begin{eqnarray*} {\bf A}&=&{\bf X}^{{\bf T}} {\bf X}+\frac{1}{\nu}{\bf I}_p,\\ \tilde{{\boldsymbol \beta}} &=& {\bf A}^{-1} {\bf X}^{{\bf T}} {\bf y},\\ c^* &=& {\bf y}^{{\bf T}}({\bf I}_n-{\bf X}{\bf A}^{-1}{\bf X}^{{\bf T}}){\bf y}+ 2b_0,\\ \nu^* &=& n+2a_0. \end{eqnarray*}\]


Note: the density of a \(p\)-dimensional multivariate t-distribution with location \({\boldsymbol \mu}\), scale matrix \({\boldsymbol \Sigma}\), and degrees of freedom \(\nu\), as used in (1.1), is \[\begin{eqnarray*} \frac{\Gamma[(\nu+p)/2]}{\Gamma(\nu/2)\nu^{p/2}\pi^{p/2}|{\boldsymbol \Sigma}|^{1/2}}\left[1+\frac{1}{\nu}({\bf x}- {\boldsymbol \mu})^{{\bf T}} {\boldsymbol \Sigma}^{-1}({\bf x}-{\boldsymbol \mu})\right]^{-(\nu+p)/2}. \end{eqnarray*}\]