$$
\begin{gathered}
D=\left\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots ,(x_{N},y_{N})\right\}\\
x_{i}\in \mathbb{R}^{p},y_{i}\in \mathbb{R},i=1,2,\cdots ,N\\
X=\begin{pmatrix}
x_{1} & x_{2} & \cdots & x_{N}
\end{pmatrix}^{T}=\begin{pmatrix}
x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T}
\end{pmatrix}=\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np}
\end{pmatrix}_{N \times p}\\
Y=\begin{pmatrix}
y_{1} \\ y_{2} \\ \vdots \\ y_{N}
\end{pmatrix}_{N \times 1}
\end{gathered}
$$
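As a quick illustration (not part of the original notes), the design matrix $X$ and target vector $Y$ can be assembled from raw $(x_i, y_i)$ pairs as sketched below; the helper name `make_design_matrix` and the toy data are hypothetical:

```python
import numpy as np

def make_design_matrix(samples):
    """Stack sample pairs (x_i, y_i), x_i in R^p, into X (N x p) and Y (N,)."""
    X = np.vstack([x for x, _ in samples])   # each row is x_i^T
    Y = np.array([y for _, y in samples])    # targets as a length-N vector
    return X, Y

# toy data: N = 4 samples in R^2
rng = np.random.default_rng(0)
samples = [(rng.normal(size=2), rng.normal()) for _ in range(4)]
X, Y = make_design_matrix(samples)
print(X.shape, Y.shape)   # (4, 2) (4,)
```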
Therefore, for the least squares estimate, we have
$$
\begin{aligned}
L(\omega)&=\sum\limits_{i=1}^{N}||\omega^{T}x_{i}-y_{i}||^{2}\\
&=\sum\limits_{i=1}^{N}(\omega^{T}x_{i}-y_{i})^{2}\\
&=\begin{pmatrix}
\omega^{T}x_{1}-y_{1} & \omega^{T}x_{2}-y_{2} & \cdots & \omega^{T}x_{N}-y_{N}
\end{pmatrix}\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=\left[\begin{pmatrix}
\omega^{T}x_{1} & \omega^{T}x_{2} & \cdots & \omega^{T}x_{N}
\end{pmatrix}-\begin{pmatrix}
y_{1} & y_{2} & \cdots & y_{N}
\end{pmatrix}\right]\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=\left[\omega^{T}\begin{pmatrix}
x_{1} & x_{2} & \cdots & x_{N}
\end{pmatrix}-\begin{pmatrix}
y_{1} & y_{2} & \cdots & y_{N}
\end{pmatrix}\right]\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})(X \omega-Y)\\
&=\omega^{T}X^{T}X \omega-2 \omega^{T}X^{T}Y+Y^{T}Y
\end{aligned}
$$
For $\hat{\omega}$, setting the derivative to zero gives
$$
\begin{aligned}
\hat{\omega}&=\mathop{argmin}\limits_{\omega}L(\omega)\\
\frac{\partial L(\omega)}{\partial \omega}&=2X^{T}X \omega-2X^{T}Y\\
2X^{T}X \omega-2X^{T}Y&=0\\
\hat{\omega}&=(X^{T}X)^{-1}X^{T}Y
\end{aligned}
$$
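A minimal numpy sketch of this closed-form solution on synthetic data (the data and variable names are illustrative); numerically it is preferable to solve the normal equations as a linear system rather than forming $(X^{T}X)^{-1}$ explicitly, and `np.linalg.lstsq` serves as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
X = rng.normal(size=(N, p))
w_true = np.array([1.5, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

# normal equation: w_hat = (X^T X)^{-1} X^T Y, solved as a linear system
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# cross-check against numpy's built-in least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_hat, w_lstsq))   # True
```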
Supplement: matrix differentiation rules
$$
\begin{aligned}
x&=\begin{pmatrix}x_{1} & x_{2} & \cdots & x_{n}\end{pmatrix}^{T}\\
f(x)&=Ax \implies \frac{\partial (Ax)}{\partial x^{T}}=A\\
f(x)&=x^{T}Ax \implies \frac{\partial (x^{T}Ax)}{\partial x}=Ax+A^{T}x\\
f(x)&=a^{T}x \implies \frac{\partial (a^{T}x)}{\partial x}=\frac{\partial (x^{T}a)}{\partial x}=a\\
f(x)&=x^{T}Ay \implies \frac{\partial (x^{T}Ay)}{\partial x}=Ay,\quad \frac{\partial (x^{T}Ay)}{\partial A}=xy^{T}
\end{aligned}
$$
Author: zealscott
Link: 矩阵求导法则与性质 (matrix differentiation rules and properties)
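As a sanity check on the rule $\partial(x^{T}Ax)/\partial x = Ax + A^{T}x$, here is a small finite-difference comparison (an illustrative sketch, not from the linked article):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

f = lambda v: v @ A @ v                 # f(x) = x^T A x
analytic = A @ x + A.T @ x              # rule: d(x^T A x)/dx = A x + A^T x

# central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```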
Geometrically, least squares sums the squared distances between the model (here, a line) and the observed values. Suppose the sample vectors span a $p$-dimensional space (in the full-rank case): $X=Span(x_1,\cdots,x_N)$, and the model can be written as $f(\beta)=x_{i}^{T}\beta$, i.e., some combination of $x_1,\cdots,x_N$. Least squares asks that $Y$ be as close to this model as possible, so their difference should be perpendicular to the spanned space:
$$X\bot(Y-X\beta)\longrightarrow X^T\cdot(Y-X\beta)=0_{p\times1}\longrightarrow\beta=(X^TX)^{-1}X^TY$$
Author: tsyw
A few points of my own understanding here:
Since $X=\begin{pmatrix}x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T}\end{pmatrix}$, stacking the $x_{i}^{T}\beta$ gives exactly $X \beta$.
In general, $Y$ does not lie in this $p$-dimensional space.
$$
\begin{aligned}
X \beta&=\begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np}\end{pmatrix}\begin{pmatrix}\beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p}\end{pmatrix}\\
&=\beta_{1}\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix}+\beta_{2}\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix}+\cdots +\beta_{p}\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}
\end{aligned}
$$
This can be viewed as $\beta$ being mapped by the matrix $X$ from the original basis $\begin{pmatrix}1 \\ 0 \\ \vdots \\ 0\end{pmatrix},\begin{pmatrix}0 \\ 1 \\ \vdots \\ 0\end{pmatrix},\cdots ,\begin{pmatrix}0 \\ 0 \\ \vdots \\ 1\end{pmatrix}$ to the new basis $\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix},\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix},\cdots ,\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}$, so the vector $X \beta$ always lies in the $p$-dimensional column space. Since $Y$ in general does not lie in that space, minimizing the distance between $Y$ and $X \beta$ means adjusting $\beta$ so that the vector $Y-X \beta$ is exactly perpendicular to the $p$-dimensional space, which is where the distance is smallest. Hence $X^{T}(Y -X \beta)=\boldsymbol{0}$.
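This orthogonality condition is easy to verify numerically. The sketch below (with made-up data) checks that the residual of the normal-equation solution satisfies $X^{T}(Y - X\beta)\approx \boldsymbol{0}$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 30, 4
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)          # in general Y is NOT in the column space of X

beta = np.linalg.solve(X.T @ X, X.T @ Y)
residual = Y - X @ beta

# the residual is orthogonal to every column of X: X^T (Y - X beta) = 0
print(np.allclose(X.T @ residual, 0.0, atol=1e-10))   # True
```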
For the case of one-dimensional Gaussian noise, write $y=\omega^{T}x+\epsilon$, $\epsilon \sim N(0,\sigma^{2})$. Then
$$
y|x;\omega \sim N(\omega^{T}x, \sigma^{2})
$$
Note that here $x$ is known data and $\omega$ is a parameter, so $y$ follows the same Gaussian form as $\epsilon$, shifted by $\omega^{T}x$.
We have
$$
P(y|x;\omega)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left[- \frac{(y-\omega^{T}x)^{2}}{2\sigma^{2}}\right]
$$
The maximum likelihood estimate is then
$$
\begin{aligned}
L(\omega)&=\log P(Y|X;\omega)\\
&=\log \prod\limits_{i=1}^{N}P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\log P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\left\{\log \frac{1}{\sqrt{2\pi}\sigma}+\log \exp\left[- \frac{(y_{i}-\omega^{T}x_{i})^{2}}{2\sigma^{2}}\right]\right\}\\
\hat{\omega}&=\mathop{argmax}\limits_{\omega}L(\omega)\\
&=\mathop{argmax}\limits_{\omega}\sum\limits_{i=1}^{N}\left[- \frac{1}{2\sigma^{2}}(y_{i}-\omega^{T}x_{i})^{2}\right]\\
&=\mathop{argmin}\limits_{\omega}\sum\limits_{i=1}^{N}(y_{i}-\omega^{T}x_{i})^{2}
\end{aligned}
$$
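To see this equivalence numerically, the sketch below (assuming scipy is available; the data and names are illustrative) minimizes the negative log-likelihood directly and compares the result with the normal-equation solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p, sigma = 100, 3, 0.3
X = rng.normal(size=(N, p))
w_true = np.array([0.7, -1.2, 2.0])
Y = X @ w_true + sigma * rng.normal(size=N)

# negative log-likelihood under y_i | x_i ~ N(w^T x_i, sigma^2), constants dropped
def neg_log_lik(w):
    return np.sum((Y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_lik, x0=np.zeros(p), method="BFGS").x
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(w_mle, w_ls, atol=1e-5))   # maximizing likelihood = least squares
```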
So far, as far as determining $\omega$ is concerned, maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function defined by
$$E(\omega)=\frac{1}{2}\sum\limits_{n=1}^{N}[y(x_{n},\omega)-t_{n}]^{2}$$
Thus, under the assumption of Gaussian noise, the sum-of-squares error function arises as a natural consequence of maximizing the likelihood.
Source: PRML Chinese translation, p. 27
Translator: 马春鹏
Original: *Pattern Recognition and Machine Learning*
Author: Christopher M. Bishop
PRML also gives the maximum likelihood estimate of the precision $\beta$, which corresponds to $1/\sigma^{2}$ here; the $y$ here is the $t$ of PRML.
(PRML notation is used below unless otherwise stated.)
$$
\begin{aligned}
\ln p(T|X,\omega,\beta)&=- \frac{\beta}{2}\sum\limits_{n=1}^{N}[y(x_{n},\omega)-t_{n}]^{2}+ \frac{N}{2}\ln \beta- \frac{N}{2}\ln (2 \pi)\\
L(\beta)&=- \beta\sum\limits_{n=1}^{N}[y(x_{n},\omega)-t_{n}]^{2}+ N\ln \beta,\qquad \hat{\beta}=\mathop{argmax}\limits_{\beta}L(\beta)\\
\frac{\partial L(\beta)}{\partial \beta}&=-\sum\limits_{n=1}^{N}[y(x_{n},\omega_\text{MLE})-t_{n}]^{2}+ \frac{N}{\beta_\text{MLE}}=0\\
\frac{1}{\beta_\text{MLE}}&=\frac{1}{N}\sum\limits_{n=1}^{N}[y(x_{n},\omega_\text{MLE})-t_{n}]^{2}
\end{aligned}
$$
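A brief numerical check of this result (illustrative data only): the MLE of $1/\beta$ is just the mean squared residual at $\omega_\text{MLE}$, which approaches the true noise variance for large $N$:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma_true = 200, 3, 0.5
X = rng.normal(size=(N, p))
w_true = rng.normal(size=p)
t = X @ w_true + sigma_true * rng.normal(size=N)

w_mle = np.linalg.solve(X.T @ X, X.T @ t)       # omega_MLE from the normal equation
inv_beta_mle = np.mean((X @ w_mle - t) ** 2)    # 1 / beta_MLE = mean squared residual
print(inv_beta_mle, sigma_true ** 2)            # close to the true noise variance
```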