# CS224n Notes 4: Word Window Classification and Neural Networks

## Classification

### Softmax in detail

The softmax classification function:

$$p(y_j = 1|x) = \frac{\exp(W_{j\cdot}x)}{\sum_{c=1}^C\exp(W_{c\cdot}x)}$$
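As a concrete sketch (my own illustration, not code from the notes), the softmax over the class scores $Wx$ can be computed stably by subtracting the maximum score before exponentiating:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax: shift by the max so exp never overflows."""
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

# scores = W @ x gives one score per class; softmax turns them into probabilities
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The shift leaves the result unchanged, because the constant factor cancels between numerator and denominator.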

### Softmax and cross-entropy error

For a single example with true class $k$, the objective is the negative log probability assigned to that class:

$$-\log \bigg(\frac{\exp(W_{k\cdot}x)}{\sum_{c=1}^C\exp(W_{c\cdot}x)}\bigg)$$

Because the target $y$ is one-hot, the cross-entropy between $y$ and the prediction $\hat{y}$ reduces to exactly this loss:

\begin{align}H(\hat{y},y) &= -\sum_{j = 1}^{C} y_j \log(\hat{y}_j)\\ &=-\sum_{j=1}^{C}y_j\log(p(y_j = 1|x)) \\ &= -\sum_{j=1}^{C}y_j\log \bigg(\frac{\exp(W_{j\cdot}x)}{\sum_{c=1}^C\exp(W_{c\cdot}x)}\bigg)\\ &= -\log(\hat{y}_k)\end{align}

Over a training set of $N$ examples, where $k(i)$ denotes the true class of example $i$:

$$-\sum_{i = 1}^N\log \bigg(\frac{\exp(W_{k{(i)}\cdot}x^{(i)})}{\sum_{c=1}^C\exp(W_{c\cdot}x^{(i)})}\bigg)$$

Adding L2 regularization over all $C\cdot d + |V|\cdot d$ parameters $\theta$ (the classifier weights plus every word vector):

$$-\sum_{i = 1}^N\log \bigg(\frac{\exp(W_{k{(i)}\cdot}x^{(i)})}{\sum_{c=1}^C\exp(W_{c\cdot}x^{(i)})}\bigg) + \lambda \sum_{k=1}^{C\cdot d + |V|\cdot d} \theta_k^2$$
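A minimal sketch of this regularized objective (my own illustration; for simplicity only the entries of $W$ are regularized here, and labels are given as class indices):

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

def regularized_loss(W, X, labels, lam):
    """Cross-entropy summed over examples plus lambda * sum of squared weights."""
    loss = 0.0
    for x, k in zip(X, labels):
        probs = softmax(W @ x)      # p(y | x) over all C classes
        loss += -np.log(probs[k])   # -log probability of the true class k(i)
    return loss + lam * np.sum(W ** 2)

W = np.zeros((3, 4))    # C = 3 classes, d = 4 features
X = np.ones((5, 4))     # N = 5 examples
labels = [0, 1, 2, 0, 1]
loss = regularized_loss(W, X, labels, lam=0.1)  # zero weights -> uniform predictions
```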

### Re-training word vectors risks losing generalization

Word vectors = word embeddings = word representations (mostly)

## Window classification

### The simplest classifier: softmax

We differentiate $J$ with respect to $x$; note that $x$ here denotes the concatenation of the word vectors of every word in the window.

\begin{align} \nabla_xJ &=\frac{\partial}{\partial x}\Big(-\log \mathrm{softmax}\left(f_y(x)\right)\Big)\\ & = \sum_{c=1}^C-\frac{\partial \log \mathrm{softmax}\left(f_y(x)\right)}{\partial f_c}\cdot \frac{\partial f_c(x)}{\partial x}\\ &=\left[ \begin{array}{c} \hat{y}_1\\ \vdots\\ \hat{y}_y-1\\ \vdots\\ \hat{y}_C \end{array} \right]\cdot \frac{\partial f(x)}{\partial x}\\ &=\delta \cdot \frac{\partial f(x)}{\partial x}\\ &=\sum_{c=1}^C \delta_c W_{c\cdot}^T\\ &=W^T\delta \in \mathbb{R}^{5d} \end{align}

The gradient with respect to the window vector stacks the gradients for each word in the window (here a window running from *museums* to *amazing*):

$$\nabla_{\theta} J(\theta) = \left[ \begin{array}{c} \nabla_{x_{museums}} \\ \vdots \\ \nabla_{x_{amazing}} \end{array} \right]$$

Updating all parameters at once, the full gradient stacks the columns of $W$ and every word vector in the vocabulary:

$$\nabla_{\theta} J(\theta) = \left[ \begin{array}{c} \nabla_{W_{\cdot 1}} \\ \vdots \\ \nabla_{W_{\cdot d}} \\ \nabla_{x_{aardvark}} \\ \vdots \\ \nabla_{x_{zebra}} \end{array} \right] \in \mathbb{R}^{Cd+Vd}$$
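In code, the error signal $\delta$ is simply the predicted distribution with 1 subtracted at the true class; a sketch with a numerical gradient check (my own illustration):

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

np.random.seed(0)
C, d = 3, 4
W = np.random.randn(C, d)
x = np.random.randn(d)           # concatenated window vector
k = 1                            # true class

y_hat = softmax(W @ x)
delta = y_hat.copy()
delta[k] -= 1.0                  # delta = y_hat - y, with y one-hot
grad_x = W.T @ delta             # analytic gradient of -log y_hat[k] w.r.t. x

# central-difference check against the analytic gradient
eps = 1e-6
num_grad = np.zeros(d)
for i in range(d):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    num_grad[i] = (-np.log(softmax(W @ xp)[k])
                   + np.log(softmax(W @ xm)[k])) / (2 * eps)
```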

## Using neural networks

### Why non-linearities are needed

Stacked linear transforms collapse into a single linear transform, so depth adds nothing without a non-linearity between layers:

$$W_1W_2x=Wx$$
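A quick numerical illustration (my own sketch): two stacked linear layers compute exactly what one precomputed matrix computes.

```python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(3, 5)
W2 = np.random.randn(5, 4)
x = np.random.randn(4)

two_layer = W1 @ (W2 @ x)   # "deep" but purely linear network
W = W1 @ W2                 # collapses to a single linear map
one_layer = W @ x
```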

### Feed-forward network

A single neuron applies a sigmoid non-linearity to an affine transform of its input:

$$a = \frac{1}{1 + \exp(-(w^Tx + b))}$$

The bias can be folded into the weight vector by appending a constant 1 to the input:

$$a = \frac{1}{1 + \exp\left(-\begin{bmatrix}w^T & b\end{bmatrix}\cdot\begin{bmatrix}x\\ 1\end{bmatrix}\right)}$$
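A sketch of both forms (my own illustration), confirming the bias trick changes nothing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])
b = 0.3
x = np.array([1.0, 2.0, -0.5])

a_explicit = sigmoid(w @ x + b)        # bias kept separate
w_aug = np.append(w, b)                # [w; b]
x_aug = np.append(x, 1.0)              # [x; 1]
a_folded = sigmoid(w_aug @ x_aug)      # bias folded into the weights
```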

### Max-margin objective

Let $s$ be the score of a true window and $s_c$ the score of a corrupt window; we want $s$ to beat $s_c$:

$$\text{minimize } J = \max(s_c - s, 0)$$

With an explicit margin $\Delta$:

$$\text{minimize } J = \max(\Delta + s_c - s, 0)$$

Since scores can be rescaled, we can fix $\Delta = 1$:

$$\text{minimize } J = \max(1 + s_c - s, 0)$$
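With $\Delta = 1$, the objective in code (a sketch; `s_true` and `s_corrupt` stand for window scores from any scoring function):

```python
def max_margin_loss(s_true, s_corrupt, delta=1.0):
    """Hinge loss: zero once the true score beats the corrupt score by the margin."""
    return max(delta + s_corrupt - s_true, 0.0)

violated = max_margin_loss(s_true=0.5, s_corrupt=2.0)   # margin violated -> positive loss
satisfied = max_margin_loss(s_true=3.0, s_corrupt=0.5)  # margin satisfied -> zero loss
```

Once the margin is satisfied the loss (and hence the gradient) is exactly zero, so such examples stop influencing training.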

## Training with backpropagation

$$\nabla_{W^{(k)}} = \begin{bmatrix} \delta_{1}^{(k+1)} a_1^{(k)} & \delta_{1}^{(k+1)} a_2^{(k)} & \cdots \\ \delta_{2}^{(k+1)} a_1^{(k)} & \delta_{2}^{(k+1)} a_2^{(k)} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} = \delta^{(k+1)} a^{(k)T}$$

$$\delta^{(k)} = f'(z^{(k)}) \circ (W^{(k)T} \delta^{(k+1)})$$

$$\frac{\partial s}{\partial x}=W^T\delta$$
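The three formulas above can be sketched end-to-end for a one-hidden-layer scorer $s = U^\top f(Wx + b)$ with sigmoid $f$ (my own illustration, checked numerically):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
W = np.random.randn(4, 3)    # hidden-layer weights
b = np.random.randn(4)
U = np.random.randn(4)       # scoring vector: s = U . a
x = np.random.randn(3)

# forward pass
z = W @ x + b
a = sigmoid(z)
s = U @ a

# backward pass
delta = a * (1 - a) * U      # delta = f'(z) o (U), since sigmoid'(z) = a(1-a)
grad_W = np.outer(delta, x)  # nabla_W = delta . (activation below)^T
grad_x = W.T @ delta         # ds/dx = W^T delta

# central-difference check of grad_x
eps = 1e-6
num_grad = np.zeros(3)
for i in range(3):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    num_grad[i] = (U @ sigmoid(W @ xp + b) - U @ sigmoid(W @ xm + b)) / (2 * eps)
```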