CS224n Assignment 3-码农场

3 GRU
a latch
b toggle
c Implementing the GRU cell
d Learning the latch with TF's built-in RNN model
e Analyzing the plots
f Running the NER model from question 2
Easter egg
References

3 GRU

As covered in lecture, the GRU can effectively mitigate vanishing gradients:

$$\begin{align*}
z_{t} &= \sigma(W^{(z)}x_{t} + U^{(z)}h_{t-1}+b_z)&~\text{(Update gate)}\\
r_{t} &= \sigma(W^{(r)}x_{t} + U^{(r)}h_{t-1}+b_r)&~\text{(Reset gate)}\\
\tilde{h}_{t} &= \operatorname{tanh}(r_{t}\circ Uh_{t-1} + Wx_{t} +b_h)&~\text{(New memory)}\\
h_{t} &= (1 - z_{t}) \circ \tilde{h}_{t} + z_{t} \circ h_{t-1}&~\text{(Hidden state)}
\end{align*}$$
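To make the role of the update gate concrete, here is a minimal scalar sketch of the four equations above. The parameter names in the dictionary and the chosen values are assumptions for illustration only. With a large $b_z$ the update gate stays close to 1 and the state is copied almost unchanged across many steps, while a very negative $b_z$ lets the state decay quickly.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical scalar parameters."""
    z = sigmoid(p["W_z"] * x + p["U_z"] * h_prev + p["b_z"])          # update gate
    r = sigmoid(p["W_r"] * x + p["U_r"] * h_prev + p["b_r"])          # reset gate
    h_tilde = math.tanh(r * p["U"] * h_prev + p["W"] * x + p["b_h"])  # new memory
    return (1.0 - z) * h_tilde + z * h_prev                           # hidden state

def run(b_z, steps=50, h0=0.8):
    """Feed `steps` zeros and return the final hidden state."""
    p = dict(W_z=0.0, U_z=0.0, b_z=b_z, W_r=0.0, U_r=0.0, b_r=0.0,
             W=1.0, U=1.0, b_h=0.0)
    h = h0
    for _ in range(steps):
        h = gru_step(0.0, h, p)
    return h

h_keep = run(b_z=6.0)     # update gate near 1: state is mostly preserved
h_forget = run(b_z=-6.0)  # update gate near 0: state shrinks every step
```

The same mechanism that preserves the state also preserves the gradient flowing back through $z_t \circ h_{t-1}$.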

For consistency with the GRU notation, write the RNN as

$$h_{t} = \sigma(Uh_{t-1} + Wx_{t} +b_h)$$

a latch

Use an RNN to simulate an automaton that outputs the first bit of the sequence.

[Figure: state diagram of the latch automaton]

Assume the initial hidden state is 0 and replace the activation functions with indicator functions:

$$\sigma(x)\rightarrow
\begin{cases}
1, & \text{if $x$ > 0} \\
0, & \text{otherwise}
\end{cases}$$

$$\tanh(x)\rightarrow \begin{cases} 1, & \text{if $x$ > 0} \\ 0, & \text{otherwise} \end{cases}$$

Derive the conditions that the RNN parameters must satisfy.

$$h^{(t)} = \sigma(x^{(t)}U_h+h^{(t-1)}W_h+b_h)$$

When $h^{(t-1)}=0,x^{(t)}=0$, getting $h^{(t)}=0$ requires

$$\begin{align}
\sigma(b_h)&=0\\
b_h&\leq 0
\end{align}$$

When $h^{(t-1)}=0,x^{(t)}=1$, getting $h^{(t)}=1$ requires

$$\begin{align} \sigma(U_h+b_h)&=1\\ U_h+b_h&> 0 \end{align}$$

When $h^{(t-1)}=1,x^{(t)}=0$, getting $h^{(t)}=1$ requires

$$\begin{align} \sigma(W_h+b_h)&=1\\ W_h+b_h&> 0 \end{align}$$

When $h^{(t-1)}=1,x^{(t)}=1$, getting $h^{(t)}=1$ requires

$$\begin{align} \sigma(W_h+U_h+b_h)&=1\\ W_h+U_h+b_h&> 0 \end{align}$$

The fourth inequality follows from the first three (since $W_h>-b_h\geq 0$), so the parameters must satisfy

$$\begin{align}
b_h&\leq0\\
U_h+b_h&>0\\
W_h+b_h&>0
\end{align}$$
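These conditions can be checked with a few lines of code. The sketch below uses the indicator activation and one assumed set of values ($U_h=2$, $W_h=2$, $b_h=-1$) that satisfies the three inequalities; any other values that satisfy them should behave the same way.

```python
def step(a):
    """Indicator activation: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0

def rnn_latch(xs, U_h, W_h, b_h, h0=0):
    """Run the 1-D indicator RNN over the bit sequence xs."""
    h = h0
    for x in xs:
        h = step(x * U_h + h * W_h + b_h)
    return h

# Assumed example values: b_h <= 0, U_h + b_h > 0, W_h + b_h > 0.
U_h, W_h, b_h = 2, 2, -1

# Transition table keyed by (previous state, input).
table = {(h, x): step(x * U_h + h * W_h + b_h)
         for h in (0, 1) for x in (0, 1)}
```

Once the state reaches 1 it stays there, for example `rnn_latch([0, 0, 1, 0, 0], U_h, W_h, b_h)` ends in state 1.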

Set $w_r=u_r=b_r=b_z=b_h=0$ and simulate the same automaton with a GRU. The GRU cell simplifies to:

$$\begin{align}
z^{(t)}&=\sigma(x^{(t)}U_z+h^{(t-1)}W_z)\\
r^{(t)}&=0\\
\tilde{h}^{(t)}&=\tanh(x^{(t)}U_h)\\
h^{(t)}&=z^{(t)} \circ h^{(t-1)}+(1-z^{(t)}) \circ \tilde{h}^{(t)}
\end{align}$$

When $h^{(t-1)}=0,x^{(t)}=0$, $h^{(t)}=0$ always holds:

$$\begin{align}
z^{(t)}&=\sigma(0)=0 \\
\tilde{h}^{(t)}&=\tanh(0)=0 \\
h^{(t)}&=0
\end{align}$$

When $h^{(t-1)}=0,x^{(t)}=1$, getting $h^{(t)}=1$ requires

$$\begin{align}
z^{(t)}&=\sigma(U_z)\\
\tilde{h}^{(t)}&=\tanh(U_h)\\
h^{(t)}&=(1-\sigma(U_z)) \circ \tanh(U_h)=1\\
\rightarrow U_z &\leq 0 \\
U_h&>0
\end{align}$$

When $h^{(t-1)}=1,x^{(t)}=0$, getting $h^{(t)}=1$ requires

$$\begin{align}
z^{(t)}&=\sigma(W_z)\\
\tilde{h}^{(t)}&=\tanh(0)=0\\
h^{(t)}&=z^{(t)} \circ h^{(t-1)}=\sigma(W_z)=1\\
\rightarrow W_z&>0
\end{align}$$

When $h^{(t-1)}=1,x^{(t)}=1$, $h^{(t)}=1$ holds automatically once $U_h>0$, whatever the value of $z^{(t)}$:

$$\begin{align}
z^{(t)}&=\sigma(U_z+W_z)\\
\tilde{h}^{(t)}&=\tanh(U_h)=1\\
h^{(t)}&=z^{(t)}+(1-z^{(t)}) \circ \tanh(U_h)=1
\end{align}$$

Putting it together:

$$\begin{align}
W_z&>0\\
U_z&\leq 0\\
U_h&>0
\end{align}$$
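The simplified GRU can be checked the same way. This is a minimal sketch with the indicator activations and one assumed choice of values ($U_z=-1$, $W_z=1$, $U_h=1$) that satisfies the three inequalities.

```python
def step(a):
    """Indicator activation: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0

def gru_latch_step(x, h, U_z, W_z, U_h):
    """One step of the simplified GRU (reset gate fixed at 0, biases at 0)."""
    z = step(x * U_z + h * W_z)   # update gate
    h_tilde = step(x * U_h)       # new memory; r = 0 removes the h term
    return z * h + (1 - z) * h_tilde

# Assumed example values: W_z > 0, U_z <= 0, U_h > 0.
U_z, W_z, U_h = -1, 1, 1

# Transition table keyed by (previous state, input).
table = {(h, x): gru_latch_step(x, h, U_z, W_z, U_h)
         for h in (0, 1) for x in (0, 1)}
```

In the $(1, 1)$ case the update gate here is $\sigma(0)=0$, yet the state is still 1 because the new memory is 1.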

b toggle

Simulate a toggle switch that flips its state whenever it sees a 1:

[Figure: state diagram of the toggle automaton]

For the RNN, when $x=0$ the previous state must be preserved:

$$\begin{align}
0\times w_h+0\times u_h+b_h &\leq0\\
1 \times w_h + 0 \times u_h + b_h &> 0\\
\rightarrow w_h&>0
\end{align}$$

When $x=1$, the RNN must flip the previous state:

$$\begin{align}
0\times w_h+1\times u_h+b_h &>0\\
1 \times w_h + 1 \times u_h + b_h &\leq 0\\
\rightarrow w_h&<0
\end{align}$$

The two requirements contradict each other, so the RNN cannot implement the toggle.
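As a sanity check on this impossibility argument, a brute-force search over a small parameter grid should find no solution. The grid below (values from -5 to 5 in steps of 0.5) is an arbitrary assumption. An empty result does not prove the claim by itself, it only agrees with the derivation.

```python
from itertools import product

def step(a):
    """Indicator activation: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0

def is_toggle(w_h, u_h, b_h):
    """True if the 1-D indicator RNN holds on x=0 and flips on x=1."""
    return all(
        step(h * w_h + x * u_h + b_h) == (h ^ x)
        for h, x in product((0, 1), repeat=2)
    )

grid = [v * 0.5 for v in range(-10, 11)]   # -5.0, -4.5, ..., 5.0
solutions = [p for p in product(grid, repeat=3) if is_toggle(*p)]
```

The target function is XOR of the previous state and the input, which a single thresholded linear unit cannot represent.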

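Assume $w_r=u_r=b_z=b_h=0$. For the GRU, first set $b_r=1$ so that the reset gate is always 1 and the previous state reaches $\tilde{h}$ unchanged. When $x=0$ the state must be kept: $h=0$ stays 0 automatically, and $h=1$ needs $z=\sigma(w_z)=1$, i.e. $w_z>0$. When $x=1$ the state must flip, so the update gate must be 0 and $\tilde{h}$ must be the opposite of $h$: this needs $u_z\leq 0$ and $u_z+w_z\leq 0$, together with $u_h>0$ and $u_h+w_h\leq 0$. For example, $w_z=1,u_z=-1,u_h=1,w_h=-1$.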
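A candidate parameter set for the GRU toggle can be verified against the truth table directly. The sketch below assumes the indicator activations and the update rule $h^{(t)}=z^{(t)}\circ h^{(t-1)}+(1-z^{(t)})\circ\tilde{h}^{(t)}$ used throughout this post. The values in `params` are one assumed choice, not the only one.

```python
def step(a):
    """Indicator activation: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0

def gru_toggle_step(x, h, p):
    """One scalar GRU step with indicator activations."""
    r = step(x * p["u_r"] + h * p["w_r"] + p["b_r"])            # reset gate
    z = step(x * p["u_z"] + h * p["w_z"] + p["b_z"])            # update gate
    h_tilde = step(x * p["u_h"] + r * h * p["w_h"] + p["b_h"])  # new memory
    return z * h + (1 - z) * h_tilde

# Assumed values: reset gate always open (b_r = 1), all other biases 0.
params = dict(u_r=0, w_r=0, b_r=1,
              u_z=-1, w_z=1, b_z=0,
              u_h=1, w_h=-1, b_h=0)

def run(xs, h=0):
    """Feed the bit sequence xs and return the final state."""
    for x in xs:
        h = gru_toggle_step(x, h, params)
    return h

# Transition table keyed by (previous state, input).
table = {(h, x): gru_toggle_step(x, h, params)
         for h in (0, 1) for x in (0, 1)}
```

The final state equals the parity of the number of 1s seen, so `run([1, 0, 1, 1, 0])` ends in state 1.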

c Implementing the GRU cell

def __call__(self, inputs, state, scope=None):
    """Updates the state using the previous @state and @inputs.
    Remember the GRU equations are:
    z_t = sigmoid(x_t U_z + h_{t-1} W_z + b_z)
    r_t = sigmoid(x_t U_r + h_{t-1} W_r + b_r)
    o_t = tanh(x_t U_o + r_t * h_{t-1} W_o + b_o)
    h_t = z_t * h_{t-1} + (1 - z_t) * o_t
    TODO: In the code below, implement a GRU cell using @inputs
    (x_t above) and the state (h_{t-1} above).
        - Define W_r, U_r, b_r, W_z, U_z, b_z and W_o, U_o, b_o to
          be variables of the appropriate shape using the
          `tf.get_variable' functions.
        - Compute z, r, o and @new_state (h_t) defined above
    Tips:
        - Remember to initialize your matrices using the xavier
          initialization as before.
    Args:
        inputs: is the input vector of size [None, self.input_size]
        state: is the previous state vector of size [None, self.state_size]
        scope: is the name of the scope to be used when defining the variables inside.
    Returns:
        a pair of the output vector and the new state vector.
    """
    scope = scope or type(self).__name__
    # It's always a good idea to scope variables in functions lest they
    # be defined elsewhere!
    with tf.variable_scope(scope):
        ### YOUR CODE HERE (~20-30 lines)
        initFunc = tf.contrib.layers.xavier_initializer(uniform=False)
        W_r = tf.get_variable('W_r', [self.state_size, self.state_size], initializer=initFunc, dtype = tf.float32)
        U_r = tf.get_variable('U_r', [self.input_size, self.state_size], initializer=initFunc, dtype = tf.float32)
        b_r = tf.get_variable('b_r', [self.state_size,], initializer=tf.constant_initializer(0), dtype = tf.float32)
        W_z = tf.get_variable('W_z', [self.state_size, self.state_size], initializer=initFunc, dtype = tf.float32)
        U_z = tf.get_variable('U_z', [self.input_size, self.state_size], initializer=initFunc, dtype = tf.float32)
        b_z = tf.get_variable('b_z', [self.state_size,], initializer=tf.constant_initializer(0), dtype = tf.float32)    ## Recommend on Piazza
        W_o = tf.get_variable('W_o', [self.state_size, self.state_size], initializer=initFunc, dtype = tf.float32)
        U_o = tf.get_variable('U_o', [self.input_size, self.state_size], initializer=initFunc, dtype = tf.float32)
        b_o = tf.get_variable('b_o', [self.state_size,], initializer=tf.constant_initializer(0), dtype = tf.float32)
        z_t = tf.sigmoid(tf.matmul(inputs, U_z) + tf.matmul(state, W_z) + b_z)
        r_t = tf.sigmoid(tf.matmul(inputs, U_r) + tf.matmul(state, W_r) + b_r)
        o_t = tf.tanh(tf.matmul(inputs, U_o) + tf.matmul(r_t * state, W_o) + b_o)
        new_state = z_t * state + (1 - z_t) * o_t
        ### END YOUR CODE ###
    # For a GRU, the output and state are the same (N.B. this isn't true
    # for an LSTM, though we aren't using one of those in our
    # assignment)
    output = new_state
    return output, new_state

d Learning the latch with TF's built-in RNN model

First, complete the most important piece, add_prediction_op:

def add_prediction_op(self):
    """Runs an rnn on the input using TensorFlow's
    @tf.nn.dynamic_rnn function, and returns the final state as a prediction.
    TODO: 
        - Call tf.nn.dynamic_rnn using @cell below. See:
          https://www.tensorflow.org/api_docs/python/nn/recurrent_neural_networks
        - Apply a sigmoid transformation on the final state to
          normalize the inputs between 0 and 1.
    Returns:
        preds: tf.Tensor of shape (batch_size, 1)
    """
    # Pick out the cell to use here.
    if self.config.cell == "rnn":
        cell = RNNCell(1, 1)
    elif self.config.cell == "gru":
        cell = GRUCell(1, 1)
    elif self.config.cell == "lstm":
        cell = tf.nn.rnn_cell.LSTMCell(1)
    else:
        raise ValueError("Unsupported cell type.")
    x = self.inputs_placeholder
    ### YOUR CODE HERE (~2-3 lines)
    preds = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)[1]
    preds = tf.sigmoid(preds)
    ### END YOUR CODE
    return preds

Here dynamic_rnn automatically unrolls the cell over the length of the sequence, and the second element of the returned pair is the cell's final state.
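Conceptually, the unrolling amounts to the loop below. This is a plain-Python sketch of the idea, not TensorFlow's implementation, and `toy_cell` with its weights is a hypothetical stand-in for the real cell.

```python
import math

def unroll(cell, xs, h0=0.0):
    """Apply `cell` once per time step; return (outputs, final_state)."""
    outputs, h = [], h0
    for x in xs:
        out, h = cell(x, h)
        outputs.append(out)
    return outputs, h

def toy_cell(x, h):
    """Hypothetical scalar RNN cell whose output equals its new state."""
    h_new = math.tanh(0.5 * x + 0.9 * h)
    return h_new, h_new

outputs, final_state = unroll(toy_cell, [1.0, 0.0, 0.0, 0.0])
```

For cells whose output is the state itself, such as the RNN and GRU cells here, the last output and the final state coincide when every sequence is run to its full length. That does not hold for an LSTM, whose state also carries the cell memory.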

Next, compute the norm of the gradients and clip them by it:

def add_training_op(self, loss):
    """Sets up the training Ops.
    Creates an optimizer and applies the gradients to all trainable variables.
    The Op returned by this function is what must be passed to the
    `sess.run()` call to cause the model to train. See
    TODO:
        - Get the gradients for the loss from optimizer using
          optimizer.compute_gradients.
        - if self.clip_gradients is true, clip the global norm of
          the gradients using tf.clip_by_global_norm to self.config.max_grad_norm
        - Compute the resultant global norm of the gradients using
          tf.global_norm and save this global norm in self.grad_norm.
        - Finally, actually create the training operation by calling
          optimizer.apply_gradients.
    See: https://www.tensorflow.org/api_docs/python/train/gradient_clipping
    Args:
        loss: Loss tensor.
    Returns:
        train_op: The Op for training.
    """
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.config.lr)
    ### YOUR CODE HERE (~6-10 lines)
    # - Remember to clip gradients only if self.config.clip_gradients
    # is True.
    # - Remember to set self.grad_norm
    grads_and_vars = optimizer.compute_gradients(loss)
    variables = [output[1] for output in grads_and_vars]
    gradients = [output[0] for output in grads_and_vars]
    if self.config.clip_gradients:
        tmp_gradients = tf.clip_by_global_norm(gradients, clip_norm=self.config.max_grad_norm)[0]
        gradients = tmp_gradients
    grads_and_vars = [(gradients[i], variables[i]) for i in range(len(gradients))]
    self.grad_norm = tf.global_norm(gradients)
    train_op = optimizer.apply_gradients(grads_and_vars)
    ### END YOUR CODE
    assert self.grad_norm is not None, "grad_norm was not set properly!"
    return train_op

This uses TF's built-in op clip_by_global_norm.
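The arithmetic behind global-norm clipping is short enough to write out. The sketch below is a plain-Python illustration of the rule (every gradient is scaled by `clip_norm / max(global_norm, clip_norm)`), and the function names are my own rather than TensorFlow's.

```python
import math

def global_norm(grads):
    """L2 norm of all gradient entries taken together."""
    return math.sqrt(sum(g * g for vec in grads for g in vec))

def clip_to_global_norm(grads, clip_norm):
    """Rescale all gradients jointly so their global norm is at most clip_norm."""
    norm = global_norm(grads)
    scale = clip_norm / max(norm, clip_norm)
    return [[g * scale for g in vec] for vec in grads], norm

grads = [[3.0, 4.0], [12.0]]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_to_global_norm(grads, clip_norm=5.0)
small, _ = clip_to_global_norm([[0.3, 0.4]], clip_norm=5.0)
```

Because all gradients share one scale factor, their direction is preserved. Gradients whose global norm is already below the threshold pass through unchanged.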

Run

python q3_gru.py predict -c [rnn|gru] [-g]

to produce the plots for the RNN or the GRU, with or without gradient clipping.

e Analyzing the plots

RNN

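Because the RNN's gradients vanish so quickly, they die out before ever reaching the maximum gradient norm of 5, so clipping makes no difference.

The GRU, by contrast, has an occasional gradient explosion, so clipping is quite useful there.

Whether this behavior shows up depends heavily on parameter initialization. Mine is: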

initFunc = tf.contrib.layers.xavier_initializer(uniform=False)

f Running the NER model from question 2

This time, try the GRU cell:

python q2_rnn.py train -c gru

The results are about the same:

DEBUG:Token-level confusion matrix:
go\gu   PER     ORG     LOC     MISC    O    
PER     2887    62      39      49      112  
ORG     78      1692    67      125     130  
LOC     36      99      1855    61      43   
MISC    25      48      31      1065    99   
O       19      52      11      47      42630
DEBUG:Token-level scores:
label   acc     prec    rec     f1   
PER     0.99    0.95    0.92    0.93 
ORG     0.99    0.87    0.81    0.84 
LOC     0.99    0.93    0.89    0.91 
MISC    0.99    0.79    0.84    0.81 
O       0.99    0.99    1.00    0.99 
micro   0.99    0.98    0.98    0.98 
macro   0.99    0.90    0.89    0.90 
not-O   0.99    0.90    0.87    0.88 
INFO:Entity level P/R/F1: 0.84/0.85/0.85
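The token-level precision and recall in this log can be recomputed from the confusion matrix, reading rows as gold labels and columns as guessed labels. A minimal sketch follows. The variable and function names are my own, not the assignment's.

```python
labels = ["PER", "ORG", "LOC", "MISC", "O"]
confusion = [            # rows: gold label, columns: guessed label
    [2887,   62,   39,   49,   112],
    [  78, 1692,   67,  125,   130],
    [  36,   99, 1855,   61,    43],
    [  25,   48,   31, 1065,    99],
    [  19,   52,   11,   47, 42630],
]

def prf(i):
    """Precision, recall and F1 for the label at index i."""
    tp = confusion[i][i]
    guessed = sum(row[i] for row in confusion)   # column sum
    gold = sum(confusion[i])                     # row sum
    prec = tp / float(guessed)
    rec = tp / float(gold)
    return prec, rec, 2 * prec * rec / (prec + rec)

scores = {label: prf(i) for i, label in enumerate(labels)}
```

Rounded to two decimals, `scores["PER"]` gives 0.95, 0.92 and 0.93, matching the PER row of the token-level table.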

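Easter egg

There is an Easter egg in the code. Running

q3_gru.py dynamics

produces plots of how the hidden state changes from the previous step to the current one, for both the RNN and the GRU. The original code was actually never finished and cannot produce the RNN results. After completing it, I got:

For $x=0$:

For $x=1$: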

References

https://github.com/hankcs/CS224n

https://github.com/rymc9384/DeepNLP_CS224N.git

https://github.com/gxlzj/cs224n-hw3

Creative Commons Attribution-NonCommercial-ShareAlike: 码农场 » CS224n Assignment 3

In add_prediction_op() of q3_gru.py, shouldn't dynamic_rnn's output be used rather than its state?
outputs, state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)
output = outputs[:, -1]
preds = tf.sigmoid(output)

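thinkdoom (2018-05-29)

The docstring in the code says "returns the final state as a prediction".

wswsdcc (2018-12-24)

In the last case of question 3 GRU, a latch, ii, where h(t−1)=1 and x(t)=1, why must h~(t) be 0? Couldn't it be that h~(t)=1 and z(t)=0, which still gives h(t)=1?
Could the author take a moment to explain? Thanks.

人墙 (2017-11-28)

In the last case of question 3 GRU, a latch, ii, when x_t=h_t=1, why can h~t only be 0? If h~t and z_t are both 1, h_t can also be 1, right? Could the author take a moment to answer? Thanks.

Thank you, author!
I am on Windows 10 and the assignment requires Python 2.7, but there is no matching TensorFlow build for that (unbelievable).
Your code is the only reason I was able to check my work.
Thank you very much.

zy (2017-08-17)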
