First implement multinomial logistic regression on TensorFlow as a warm-up, then raise the difficulty and implement a neural-network-based transition dependency parser, experimenting with Xavier initialization, Dropout and the Adam optimizer. Finally, derive the perplexity, gradients, backpropagation and computational complexity of RNNs and language models.
The Python code is open-sourced on GitHub. This is where TensorFlow comes into play for real; the setup here is the latest TensorFlow r1.2 with Python 2.7. If you run into a permission problem while installing or upgrading TF, such as:
IOError: [Errno 13] Permission denied: '/usr/local/bin/markdown_py'
try:
pip install tensorflow --user
or follow 《从源码编译安装TensorFlow》 (building TensorFlow from source); a library you compile and install yourself runs more efficiently.
Some TensorFlow functions assume their input is a row vector, so you must right-multiply by the weight matrix ($xW+b$) rather than left-multiply ($Wx+b$).
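A minimal shape check of this convention (the sizes below are made up purely for illustration):

import tensorflow as tf

# Hypothetical sizes, just to show the row-vector convention.
batch_size, n_features, n_classes = 4, 3, 2

x = tf.placeholder(tf.float32, shape=(batch_size, n_features))  # one example per row
W = tf.Variable(tf.zeros((n_features, n_classes)))
b = tf.Variable(tf.zeros((n_classes,)))

logits = tf.matmul(x, W) + b  # shape (batch_size, n_classes): right-multiply, xW + b
# tf.matmul(W, x) would raise an error here because the inner dimensions do not match.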
1 Tensorflow Softmax
Implement a linear classifier with the cross-entropy loss function
$$\begin{align}
J(\boldsymbol{W}) &= CE(\boldsymbol{y}, softmax(\boldsymbol{xW}))
\end{align}$$
The requirement is to fit the model to the data using TensorFlow's automatic differentiation.
a softmax
Implement softmax by hand; do not call tf.nn.softmax directly. If the input is a matrix, treat it as a collection of independent row vectors.
def softmax(x):
    """
    Compute the softmax function in tensorflow.

    You might find the tensorflow functions tf.exp, tf.reduce_max,
    tf.reduce_sum, tf.expand_dims useful. (Many solutions are possible, so you
    may not need to use all of these functions). Recall also that many common
    tensorflow operations are sugared (e.g. x * y does a tensor multiplication
    if x and y are both tensors). Make sure to implement the numerical
    stability fixes as in the previous homework!

    Args:
        x: tf.Tensor with shape (n_samples, n_features). Note feature vectors are
           represented by row-vectors. (For simplicity, no need to handle 1-d
           input as in the previous homework)
    Returns:
        out: tf.Tensor with shape (n_samples, n_features). You need to construct
             this tensor in this problem.
    """
    ### YOUR CODE HERE
    x_max = tf.reduce_max(x, 1, keep_dims=True)         # find row-wise maximums
    x_sub = tf.subtract(x, x_max)                       # subtract maximums for numerical stability
    x_exp = tf.exp(x_sub)                               # exponentiate
    sum_exp = tf.reduce_sum(x_exp, 1, keep_dims=True)   # row-wise sums
    out = tf.div(x_exp, sum_exp)                        # normalize each row
    ### END YOUR CODE
    return out
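A quick sanity check one might run against the built-in implementation (not part of the assignment; it assumes the TF 1.x Session API):

import numpy as np
import tensorflow as tf

data = np.random.randn(5, 10).astype(np.float32)  # arbitrary test matrix
x = tf.constant(data)
with tf.Session() as sess:
    ours, theirs = sess.run([softmax(x), tf.nn.softmax(x)])
print(np.allclose(ours, theirs))  # expect True
print(ours.sum(axis=1))           # every row should sum to 1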
b Cross-Entropy
def cross_entropy_loss(y, yhat):
    """
    Compute the cross entropy loss in tensorflow.
    The loss should be summed over the current minibatch.

    y is a one-hot tensor of shape (n_samples, n_classes) and yhat is a tensor
    of shape (n_samples, n_classes). y should be of dtype tf.int32, and yhat
    should be of dtype tf.float32.

    The functions tf.to_float, tf.reduce_sum, and tf.log might prove useful.
    (Many solutions are possible, so you may not need to use all of these
    functions).

    Note: You are NOT allowed to use the tensorflow built-in cross-entropy
    functions.

    Args:
        y: tf.Tensor with shape (n_samples, n_classes). One-hot encoded.
        yhat: tf.Tensor with shape (n_samples, n_classes). Each row encodes a
              probability distribution and should sum to 1.
    Returns:
        out: tf.Tensor with shape (1,) (Scalar output). You need to construct
             this tensor in the problem.
    """
    ### YOUR CODE HERE
    l_yhat = tf.log(yhat)                            # elementwise log of predictions
    product = tf.multiply(tf.to_float(y), l_yhat)    # mask with the one-hot labels
    out = tf.negative(tf.reduce_sum(product))        # negate the sum to get a scalar
    ### END YOUR CODE
    return out
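A tiny numeric check (again just a sketch, not required by the assignment): with the one-hot label y = [0, 1] and the prediction yhat = [0.5, 0.5], the loss should be -log(0.5) ≈ 0.693.

import numpy as np
import tensorflow as tf

y = tf.constant(np.array([[0, 1]], dtype=np.int32))           # one-hot label
yhat = tf.constant(np.array([[0.5, 0.5]], dtype=np.float32))  # predicted distribution
with tf.Session() as sess:
    print(sess.run(cross_entropy_loss(y, yhat)))              # ~0.6931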
c Placeholders & Feed Dictionaries
Omitted.
The abstraction layer in assignment2/model.py is actually written quite elegantly.
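Since section (c) is skipped above, here is a rough sketch of what the placeholder and feed-dict plumbing looks like, assuming the skeleton's add_placeholders / create_feed_dict interface (names and shapes follow the docstrings quoted in this post, so treat this as an outline rather than the official solution):

def add_placeholders(self):
    # Graph inputs: one feature row per example, one-hot integer labels.
    self.input_placeholder = tf.placeholder(
        tf.float32, shape=(self.config.batch_size, self.config.n_features))
    self.labels_placeholder = tf.placeholder(
        tf.int32, shape=(self.config.batch_size, self.config.n_classes))

def create_feed_dict(self, inputs_batch, labels_batch=None):
    # Map placeholders to the current minibatch; labels are optional at prediction time.
    feed_dict = {self.input_placeholder: inputs_batch}
    if labels_batch is not None:
        feed_dict[self.labels_placeholder] = labels_batch
    return feed_dict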
d Softmax & CE Loss
def add_prediction_op(self):
    """Adds the core transformation for this model which transforms a batch of
    input data into a batch of predictions. In this case, the transformation is
    a linear layer plus a softmax transformation:

    y = softmax(Wx + b)

    Hint: Make sure to create tf.Variables as needed.
    Hint: For this simple use-case, it's sufficient to initialize both weights W
          and biases b with zeros.

    Args:
        input_data: A tensor of shape (batch_size, n_features).
    Returns:
        pred: A tensor of shape (batch_size, n_classes)
    """
    ### YOUR CODE HERE
    with tf.variable_scope("transformation"):
        bias = tf.Variable(tf.random_uniform([self.config.n_classes]))
        W = tf.Variable(tf.random_uniform([self.config.n_features, self.config.n_classes]))
        z = tf.matmul(self.input_placeholder, W) + bias
    pred = softmax(z)
    ### END YOUR CODE
    return pred

def add_loss_op(self, pred):
    """Adds cross_entropy_loss ops to the computational graph.

    Hint: Use the cross_entropy_loss function we defined. This should be a very
          short function.
    Args:
        pred: A tensor of shape (batch_size, n_classes)
    Returns:
        loss: A 0-d tensor (scalar)
    """
    ### YOUR CODE HERE
    loss = cross_entropy_loss(self.labels_placeholder, pred)
    ### END YOUR CODE
    return loss
e Training Optimizer
def add_training_op(self, loss):
    """Sets up the training Ops.

    Creates an optimizer and applies the gradients to all trainable variables.
    The Op returned by this function is what must be passed to the `sess.run()`
    call to cause the model to train. See

    https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer

    for more information.

    Hint: Use tf.train.GradientDescentOptimizer to get an optimizer object.
          Calling optimizer.minimize() will return a train_op object.

    Args:
        loss: Loss tensor, from cross_entropy_loss.
    Returns:
        train_op: The Op for training.
    """
    ### YOUR CODE HERE
    train_op = tf.train.GradientDescentOptimizer(self.config.lr).minimize(loss)
    ### END YOUR CODE
    return train_op
TF computes the partial derivatives automatically, and GradientDescentOptimizer takes care of updating the parameters.
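To see the automatic differentiation explicitly, one can ask for the gradients by hand instead of calling minimize(); a toy sketch (the numbers are arbitrary):

import tensorflow as tf

W = tf.Variable([[1.0, 2.0]])
x = tf.constant([[3.0], [4.0]])
loss = tf.reduce_sum(tf.matmul(W, x) ** 2)  # toy scalar loss: (Wx)^2 = 121

grad_W, = tf.gradients(loss, [W])           # what the optimizer computes internally
train_op = tf.train.GradientDescentOptimizer(0.1).apply_gradients([(grad_W, W)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_W))  # dloss/dW = 2 * (Wx) * x^T = [[66., 88.]]
    sess.run(train_op)       # one gradient-descent step on W

minimize(loss) is just shorthand for compute_gradients followed by apply_gradients.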
Running it produces:
Epoch 47: loss = 0.45 (0.007 sec)
Epoch 48: loss = 0.44 (0.007 sec)
Epoch 49: loss = 0.43 (0.007 sec)
Basic (non-exhaustive) classifier tests pass
I got 90.5 points, hahaha.
On 2g(i) (the following comes from the official Stanford solution): Adam uses momentum, which keeps the gradient updates from swinging too fast. First, when the model falls into a local optimum where the gradient is zero, the update is still nonzero, so it can escape the local optimum. Second, it makes each gradient estimate closer to the gradient over the whole dataset.
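For reference, the standard Adam updates behind that explanation, in the textbook form with bias correction omitted ($\beta_1, \beta_2, \alpha, \epsilon$ are the usual hyperparameters, not the assignment's notation):

$$\begin{align}
\boldsymbol{m} &\gets \beta_1 \boldsymbol{m} + (1-\beta_1)\,\nabla_{\boldsymbol{\theta}} J \\
\boldsymbol{v} &\gets \beta_2 \boldsymbol{v} + (1-\beta_2)\,(\nabla_{\boldsymbol{\theta}} J)^2 \\
\boldsymbol{\theta} &\gets \boldsymbol{\theta} - \alpha\, \boldsymbol{m} \big/ \big(\sqrt{\boldsymbol{v}} + \epsilon\big)
\end{align}$$

The rolling average $\boldsymbol{m}$ is the momentum term the solution refers to, and dividing by $\sqrt{\boldsymbol{v}}$ shrinks the step for parameters whose gradients are consistently large.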
I'd like to ask: why is the perplexity in the lecture notes, perplexity = 2^J, different from the perplexity defined in the assignment??
On the dropout part: if p is the drop probability, then gamma should be \frac{1}{1-p}.
For 2.b, every word has to go in and come back out, so it should be 2n steps; you can count it on the 5 words above, which take 10 steps.
Xavier initialization uses a uniform distribution; the post presumably has a typo where it says normal distribution.
Thanks for the corrections.
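For readers following the exchange above, the standard forms (under the usual conventions; the symbols are not necessarily the assignment's) are as follows. With a one-hot target, per-token perplexity is $1/\hat{y}_{correct}$, which equals $2^{J}$ when the cross entropy $J$ is computed with $\log_2$ and $e^{J}$ when the natural log is used, so the notes and the assignment agree once the log base is fixed. For inverted dropout with drop probability $p$, requiring the expected activation to be unchanged gives

$$\gamma\,(1-p)\,h = h \;\Rightarrow\; \gamma = \frac{1}{1-p}$$

($\gamma = 1/p$ only when $p$ denotes the keep probability). And Xavier (Glorot) initialization in its uniform form draws

$$W_{ij} \sim U\!\left[-\sqrt{\frac{6}{n_{in}+n_{out}}},\; +\sqrt{\frac{6}{n_{in}+n_{out}}}\right],$$

while the normal variant uses variance $2/(n_{in}+n_{out})$ instead.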
Thanks for sharing!
I end up referring to your code every time = =~~ Many thanks!
When taking partial derivatives, how do you determine whether a parameter matrix needs to be transposed?
Do you first derive the expression and then decide whether to add a transpose by checking the shapes of the terms in the result?
My derivations come out a mess... Could you recommend some material on matrix calculus? It feels like there is a gap between this and what I learned in calculus.
In the Stanford videos, Richard differentiates with respect to individual elements and then writes the result back in matrix form.
The blogger here probably just used the standard results for the linear case; there are ready-made formulas for left- and right-multiplication.
As for materials, you can look at http://web.stanford.edu/class/cs224n/syllabus.html
There is a supplementary reading there on computing gradients for neural networks; I haven't read it yet, so I can't say how it is.
I previously worked through a summary written by another netizen; = = after finishing it my results came out wrong, and I even felt the correct solution contradicted the theorem,
so I won't recommend it.
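For what it's worth, the "ready-made formulas" mentioned above are just the shape-matching rules for a linear layer. Under the row-vector convention used in this post, for $Z = XW + b$ with upstream gradient $\delta = \partial J/\partial Z$:

$$\frac{\partial J}{\partial W} = X^\top \delta, \qquad \frac{\partial J}{\partial X} = \delta\, W^\top, \qquad \frac{\partial J}{\partial b} = \sum_i \delta_{i,:}$$

The transposes are forced by the shapes: $\partial J/\partial W$ must have the same shape as $W$, and $X^\top \delta$ is the only product of $X$ and $\delta$ that does. Differentiating element-wise, as Richard does in the lectures, and then checking shapes gives the same result.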
I don't quite follow the computation of the backpropagation complexity in 3.d. Is it the sum of the complexities of dJ/dL, dJ/dI and dJ/dH? Since the softmax is only applied once at the end, when propagating for t steps shouldn't the first two terms be multiplied by t while the last term is counted only once?
1. Yes.
2. There are two interpretations; see the supplementary note added above.
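To make the two readings concrete (the symbols below are chosen here and are not necessarily the assignment's): with embedding size $d$, hidden size $D_h$ and vocabulary size $|V|$, one backward step through the recurrent part costs $O(d\,D_h + D_h^2)$ and one output softmax costs $O(D_h\,|V|)$. If the loss is evaluated only at the final step, backpropagating through $\tau$ steps costs roughly

$$O\big(\tau\,(d\,D_h + D_h^2) + D_h\,|V|\big),$$

whereas with a loss attached to every time step the $D_h\,|V|$ term is multiplied by $\tau$ as well, which is presumably the distinction between the two interpretations.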
Regarding part d, Minibatch Parsing:
line 27 of the code reads minibatch = [parse for parse in partial_parses if len(parse.stack) > 1 or len(parse.buffer) > 0]
where partial_parses should be minibatch.
Thanks for the correction.
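For context, a sketch of what the corrected loop might look like, assuming the assignment's q2 interface (a PartialParse with .stack, .buffer, .dependencies and .parse_step, and a model.predict that maps a list of partial parses to transitions); the filtering line is the one the correction above refers to:

def minibatch_parse(sentences, model, batch_size):
    partial_parses = [PartialParse(sentence) for sentence in sentences]
    unfinished_parses = partial_parses[:]           # shallow copy
    while unfinished_parses:
        minibatch = unfinished_parses[:batch_size]
        while minibatch:
            transitions = model.predict(minibatch)  # one transition per partial parse
            for parse, transition in zip(minibatch, transitions):
                parse.parse_step(transition)
            # Filter the minibatch itself, not partial_parses, as noted above.
            minibatch = [parse for parse in minibatch
                         if len(parse.stack) > 1 or len(parse.buffer) > 0]
        unfinished_parses = unfinished_parses[batch_size:]
    return [parse.dependencies for parse in partial_parses]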