# Gradient Descent and the Hessian Matrix

## Saddle Points

A point where $f'(x)=0$ is not necessarily a local optimum; it may instead be a saddle point (see the figure above), so the second derivative is needed to tell the cases apart.

| $f'(x)$ | $f''(x)$ | $f(x)$ |
| --- | --- | --- |
| $f'(x)=0$ | $f''(x)>0$ | local minimum |
| $f'(x)=0$ | $f''(x)<0$ | local maximum |
| $f'(x)=0$ | $f''(x)=0$ | local extremum or saddle point |
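The second-derivative test in the table can be sketched numerically. The following is a minimal illustration (the step size `h`, the tolerance, and the test functions are arbitrary choices, not part of the original text), using a central difference to approximate $f''$:

```python
def classify_critical_point(f, x0, h=1e-4):
    """Classify a point where f'(x0) = 0 via a central-difference f''(x0)."""
    f2 = (f(x0 + h) - 2 * f(x0) + f(x0 - h)) / h**2  # ≈ f''(x0)
    if f2 > 1e-6:
        return "local minimum"
    if f2 < -1e-6:
        return "local maximum"
    return "inconclusive (possible extremum or saddle point)"

print(classify_critical_point(lambda x: x**2, 0.0))   # local minimum
print(classify_critical_point(lambda x: -x**2, 0.0))  # local maximum
print(classify_critical_point(lambda x: x**3, 0.0))   # inconclusive
```

When $f''(x_0)=0$ the test is inconclusive, matching the last row of the table: higher-order information is required.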

## Diagonalization

$$A = W\Lambda W^{-1}$$

\begin{align*} W &= {\left(\begin{matrix}a_1 & a_2 & \dots & a_n\end{matrix}\right)} \\ \Lambda &= \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}\tag{1}\end{align*}

$$W\Lambda = {\left(\begin{matrix}\lambda_1 a_1 & \lambda_2 a_2 & \dots & \lambda_n a_n\end{matrix}\right)}$$

The $n$ eigenvectors in $W$ form an orthonormal basis, i.e. they satisfy $W^TW=I$, which is equivalent to $W^T=W^{-1}$; in other words, $W$ is an orthogonal matrix (the real-valued analogue of a unitary matrix).
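A quick NumPy check of the decomposition above (the matrix is an illustrative example; `np.linalg.eigh` is used because it is designed for symmetric matrices and returns orthonormal eigenvectors):

```python
import numpy as np

# A symmetric matrix has an orthonormal eigenbasis: A = W Λ Wᵀ.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, W = np.linalg.eigh(A)    # eigenvalues ascending; columns of W are eigenvectors
Lambda = np.diag(lam)

assert np.allclose(W.T @ W, np.eye(2))   # W is orthogonal: WᵀW = I
assert np.allclose(A, W @ Lambda @ W.T)  # A = W Λ W⁻¹ with W⁻¹ = Wᵀ
assert np.allclose(A @ W, W @ Lambda)    # AW = WΛ: each column is scaled by its λᵢ
print(lam)
```

The last assertion is exactly the column-by-column statement $W\Lambda = (\lambda_1 a_1 \;\; \lambda_2 a_2 \;\; \dots)$ from the text.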

## Singular Value Decomposition

$$A_{m\times n}=U_{m\times m}\Sigma_{m\times n}V^T_{n\times n}$$
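The subscripts in the formula are just shape bookkeeping, which is easy to verify numerically. A minimal sketch (the dimensions and random seed are arbitrary choices):

```python
import numpy as np

# Full SVD shapes: A (m×n) = U (m×m) · Σ (m×n) · Vᵀ (n×n).
m, n = 4, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:min(m, n), :min(m, n)] = np.diag(s)   # embed singular values into m×n

assert U.shape == (m, m) and Sigma.shape == (m, n) and Vt.shape == (n, n)
assert np.allclose(A, U @ Sigma @ Vt)        # the factorization reconstructs A
```

Note that NumPy returns the singular values as a 1-D array `s`, so $\Sigma$ has to be rebuilt explicitly when checking the $m\times n$ form.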

## The Hessian Matrix

$$Hf = \left[ \begin{array}{ccc} \dfrac{\partial^2 f}{\partial \color{blue}{x}^2} & \dfrac{\partial^2 f}{\partial \color{blue}{x} \partial \color{red}{y}} & \dfrac{\partial^2 f}{\partial \color{blue}{x} \partial \color{green}{z}} & \cdots \\\\ \dfrac{\partial^2 f}{\partial \color{red}{y} \partial \color{blue}{x}} & \dfrac{\partial^2 f}{\partial \color{red}{y}^2} & \dfrac{\partial^2 f}{\partial \color{red}{y} \partial \color{green}{z}} & \cdots \\\\ \dfrac{\partial^2 f}{\partial \color{green}{z} \partial \color{blue}{x}} & \dfrac{\partial^2 f}{\partial \color{green}{z} \partial \color{red}{y}} & \dfrac{\partial^2 f}{\partial \color{green}{z}^2} & \cdots \\\\ \vdots & \vdots & \vdots & \ddots \end{array} \right]$$

$Hf$ is not an ordinary matrix but a matrix whose entries are functions. In other words, $Hf$ only takes concrete values once a point $(x_0, y_0, \dots)$ is supplied.

$$Hf(x_0, y_0, \dots) = \left[ \begin{array}{ccc} \dfrac{\partial^2 f}{\partial \color{blue}{x}^2}(x_0, y_0, \dots) & \dfrac{\partial^2 f}{\partial \color{blue}{x} \partial \color{red}{y}}(x_0, y_0, \dots) & \cdots \\\\ \dfrac{\partial^2 f}{\partial \color{red}{y} \partial \color{blue}{x}}(x_0, y_0, \dots) & \dfrac{\partial^2 f}{\partial \color{red}{y}^2}(x_0, y_0, \dots) & \cdots \\\\ \vdots & \vdots & \ddots \end{array} \right]$$

$$\frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x) \implies H_{ij} = H_{ji}$$
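The symmetry $H_{ij}=H_{ji}$ can be observed on a concrete function. Below is a small finite-difference Hessian (the helper `hessian`, the test function, and the step size are illustrative assumptions, not from the original text):

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Numerical Hessian of a scalar function f at point x (central differences)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

# f(x, y) = x²y + sin(y): mixed partials agree, so H is symmetric.
f = lambda v: v[0] ** 2 * v[1] + np.sin(v[1])
H = hessian(f, np.array([1.0, 2.0]))
assert np.allclose(H, H.T, atol=1e-4)   # H_ij == H_ji
print(H)
```

At $(1, 2)$ the analytic entries are $f_{xx}=2y=4$, $f_{xy}=f_{yx}=2x=2$, and $f_{yy}=-\sin 2$, which the finite differences reproduce.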

$$\forall v \in \mathbb{R}^n, \|v\|=1 \implies v^T Hv\leq \lambda_{\max}$$

Since $H$ is symmetric, write $H = Q\Lambda Q^T$ with $Q$ orthogonal, and let ${\bf y} = Q^T{\bf v}$. Then

$${\bf v}^TH{\bf v} = {\bf y}^T\Lambda {\bf y}\tag{2}$$

\begin{align}{\bf y}^T\Lambda {\bf y}&=\lambda_{\max}y_1^2 + \lambda_{2}y_2^2 + \cdots + \lambda_{\min}y_N^2 \\&\le \lambda_{\max}y_1^2 + \lambda_{\max}y_2^2 + \cdots + \lambda_{\max}y_N^2 \\ & =\lambda_{\max}(y_1^2 +y_2^2 + \cdots + y_N^2) \\ & =\lambda_{\max} {\bf y}^T{\bf y} \\ & \implies {\bf y}^T\Lambda {\bf y} \le \lambda_{\max} {\bf y}^T{\bf y} \tag{3}\end{align}

$$\begin{eqnarray}{\bf y}^T{\bf y} &= &{\bf v}^TQQ^T{\bf v} = {\bf v}^T{\bf v} \tag{4}\end{eqnarray}$$

$$\begin{eqnarray}{\bf y}^T\Lambda {\bf y} & \le & \lambda_{\max} {\bf y}^T{\bf y} \tag{5}\end{eqnarray}$$

$${\bf v}^TH{\bf v} \le \lambda_{\max}{\bf v}^T{\bf v}$$

$$\begin{eqnarray}{\bf v}^TH{\bf v} & \le & \lambda_{\max} \tag{6}\end{eqnarray}$$
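The bound (6) is easy to stress-test empirically. A minimal sketch (the matrix, seed, and sample count are arbitrary choices):

```python
import numpy as np

# Check (6): for any unit vector v, vᵀHv ≤ λ_max.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
lam_max = np.linalg.eigvalsh(H).max()   # (5 + √5)/2 ≈ 3.618

rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)              # enforce ‖v‖ = 1
    assert v @ H @ v <= lam_max + 1e-12
print(lam_max)
```

Equality is attained exactly when $v$ is the eigenvector belonging to $\lambda_{\max}$, which is why random unit vectors only ever approach the bound from below.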

## Condition Number of the Hessian

$$\underset{i, j}{\max} \left| \frac{\lambda_i}{\lambda_j} \right|$$
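For a symmetric Hessian this maximum is simply the ratio of the largest to the smallest eigenvalue magnitude. A minimal sketch (the diagonal matrix is an illustrative example of an ill-conditioned quadratic bowl):

```python
import numpy as np

# Condition number of the Hessian: max over i, j of |λ_i / λ_j|.
H = np.array([[100.0, 0.0],
              [0.0,   1.0]])           # elongated quadratic bowl
lam = np.linalg.eigvalsh(H)
cond = np.abs(lam).max() / np.abs(lam).min()
print(cond)  # 100.0
```

A large condition number means the curvature differs sharply between directions, which is exactly the regime where plain gradient descent zig-zags and a single learning rate cannot fit all directions.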

## Line Search

With $3$ parameters:

With $N$ parameters:

\begin{align*}f\left( x \right) & = \sum\limits_{n = 0}^\infty {\frac{{{f^{\left( n \right)}}\left( a \right)}}{{n!}}{{\left( {x - a} \right)}^n}} \\ & = f\left( a \right) + f'\left( a \right)\left( {x - a} \right) + \frac{{f''\left( a \right)}}{{2!}}{\left( {x - a} \right)^2} + \frac{{f'''\left( a \right)}}{{3!}}{\left( {x - a} \right)^3} + \cdots \end{align*}

\begin{align*}f(x) & = f(x^{(0)}) + (x-x^{(0)})^T g + \frac{1}{2} (x-x^{(0)})^T H (x-x^{(0)}) + \ldots \quad \text{where } g \text{ is the vector of partial derivatives} \\f(x) & \approx f(x^{(0)}) + (x-x^{(0)})^T g + \frac{1}{2} (x-x^{(0)})^T H (x-x^{(0)}) \\f(x^{(0)} - \epsilon g) & \approx f(x^{(0)}) - \epsilon g^T g + \frac{1}{2} \epsilon^2 g^T H g \tag{7} \quad\ \text{setting } x=x^{(0)} - \epsilon g,\ \epsilon \in(0,1)\\\end{align*}

Minimizing the right-hand side of (7) over $\epsilon$ gives the optimal step

$$\epsilon^{*} = \frac{g^T g}{g^T H g} \geqslant \frac{g^T g}{\lambda_{\max}\, g^T g} = \frac{1}{\lambda_{\max}} \quad \text{using } g^T H g \leqslant \lambda_{\max}\, g^T g \text{, which follows from (6)}.\tag{8}$$
Here $\lambda_{\max}$ is the largest eigenvalue of $H$, so the Hessian determines a lower bound on the optimal learning rate. Substituting the optimal step (8) back into (7) gives
$$\begin{split}f(x^{(0)} - \epsilon^{*} g) & \approx f(x^{(0)}) - \frac{1}{2} \epsilon^{*} g^T g \\\end{split}$$

## Newton's Method

$$\begin{split} f(x) & \approx f(x_n) + f'(x_n)\Delta{x} + \frac{1}{2} f''(x_n)\Delta{x}^2 \\ \frac{ df(x)}{d\Delta{x}} & \approx f'(x_n)+ f''(x_n)\Delta{x} \\ \end{split}$$

$$\begin{split} f'(x_n)+ f''(x_n)\Delta{x} = 0 \\ \Delta{x} = -\frac{f'(x_n)}{f''(x_n)} \\ \end{split}$$

$$\begin{split} x_{n+1} & = x_n + \Delta{x} \\ & = x_n -\frac{f'(x_n)}{f''(x_n)} \\ \end{split}$$

In the multivariate case, $f''$ becomes the Hessian and dividing becomes multiplying by its inverse:

$$x_{n+1} =x_{n} -[H f(x_{n})]^{-1} f'(x_{n})$$

$$\begin{split} x^{'} = x - [H f(x)]^{-1} f'(x) \\ \end{split}$$

$$\epsilon=[H f(x)]^{-1} f'(x)$$
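On a quadratic the Hessian is constant, so a single Newton update lands exactly on the minimum. A minimal sketch (the matrix, vector, and starting point are illustrative assumptions):

```python
import numpy as np

# Newton's update x' = x − [Hf(x)]⁻¹ f'(x) on the quadratic
# f(x) = ½ xᵀHx + bᵀx, whose gradient is f'(x) = Hx + b.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([5.0, -5.0])            # arbitrary starting point
g = H @ x + b                        # gradient f'(x)
x_new = x - np.linalg.solve(H, g)    # solve H·Δx = g instead of inverting H

print(H @ x_new + b)  # the gradient vanishes after one step
```

In practice one solves the linear system $H\,\Delta x = f'(x)$ rather than forming $[Hf(x)]^{-1}$ explicitly, which is cheaper and numerically more stable.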
