First implement multinomial logistic regression on TensorFlow as a warm-up, then raise the difficulty and implement a neural-network-based transition dependency parser, experimenting with Xavier initialization, Dropout and the Adam optimizer. Finally, derive the perplexity, gradients, backpropagation and computational complexity of RNNs and language models.
The Python code is open-sourced on GitHub. This is where TensorFlow comes into play for real; the setup here is the latest TensorFlow r1.2 with Python 2.7. If you run into a permission problem while installing or upgrading TF, such as:
IOError: [Errno 13] Permission denied: '/usr/local/bin/markdown_py'
try:
pip install tensorflow --user
or follow 《从源码编译安装TensorFlow》 (building TensorFlow from source); a library you compile and install yourself runs more efficiently.
Some TensorFlow functions assume their input is a row vector, so you must right-multiply by the weight matrix ($xW+b$) rather than left-multiply ($Wx+b$).
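A minimal shape check of this convention (the sizes below are made up purely for illustration):

import tensorflow as tf

# Hypothetical sizes, just to show the row-vector convention.
batch_size, n_features, n_classes = 4, 3, 2

x = tf.placeholder(tf.float32, shape=(batch_size, n_features))  # one example per row
W = tf.Variable(tf.zeros((n_features, n_classes)))
b = tf.Variable(tf.zeros((n_classes,)))

logits = tf.matmul(x, W) + b  # shape (batch_size, n_classes): right-multiply, xW + b
# tf.matmul(W, x) would raise an error here because the inner dimensions do not match.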
1 Tensorflow Softmax
Implement a linear classifier with the cross-entropy loss function
$$\begin{align}
J(\boldsymbol{W}) &= CE(\boldsymbol{y}, softmax(\boldsymbol{xW}))
\end{align}$$
The requirement is to fit the model to the data using TensorFlow's automatic differentiation.
a softmax
Implement softmax by hand; do not call tf.nn.softmax directly. If the input is a matrix, treat it as a collection of independent row vectors.
def softmax(x):
    """
    Compute the softmax function in tensorflow.

    You might find the tensorflow functions tf.exp, tf.reduce_max,
    tf.reduce_sum, tf.expand_dims useful. (Many solutions are possible, so you
    may not need to use all of these functions). Recall also that many common
    tensorflow operations are sugared (e.g. x * y does a tensor multiplication
    if x and y are both tensors). Make sure to implement the numerical
    stability fixes as in the previous homework!

    Args:
        x: tf.Tensor with shape (n_samples, n_features). Note feature vectors are
           represented by row-vectors. (For simplicity, no need to handle 1-d
           input as in the previous homework)
    Returns:
        out: tf.Tensor with shape (n_samples, n_features). You need to construct
             this tensor in this problem.
    """
    ### YOUR CODE HERE
    x_max = tf.reduce_max(x, 1, keep_dims=True)         # find row-wise maximums
    x_sub = tf.subtract(x, x_max)                       # subtract maximums for numerical stability
    x_exp = tf.exp(x_sub)                               # exponentiate
    sum_exp = tf.reduce_sum(x_exp, 1, keep_dims=True)   # row-wise sums
    out = tf.div(x_exp, sum_exp)                        # normalize each row
    ### END YOUR CODE
    return out
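A quick sanity check one might run against the built-in implementation (not part of the assignment; it assumes the TF 1.x Session API):

import numpy as np
import tensorflow as tf

data = np.random.randn(5, 10).astype(np.float32)  # arbitrary test matrix
x = tf.constant(data)
with tf.Session() as sess:
    ours, theirs = sess.run([softmax(x), tf.nn.softmax(x)])
print(np.allclose(ours, theirs))  # expect True
print(ours.sum(axis=1))           # every row should sum to 1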
b Cross-Entropy
def cross_entropy_loss(y, yhat):
    """
    Compute the cross entropy loss in tensorflow.
    The loss should be summed over the current minibatch.

    y is a one-hot tensor of shape (n_samples, n_classes) and yhat is a tensor
    of shape (n_samples, n_classes). y should be of dtype tf.int32, and yhat
    should be of dtype tf.float32.

    The functions tf.to_float, tf.reduce_sum, and tf.log might prove useful.
    (Many solutions are possible, so you may not need to use all of these
    functions).

    Note: You are NOT allowed to use the tensorflow built-in cross-entropy
    functions.

    Args:
        y: tf.Tensor with shape (n_samples, n_classes). One-hot encoded.
        yhat: tf.Tensor with shape (n_samples, n_classes). Each row encodes a
              probability distribution and should sum to 1.
    Returns:
        out: tf.Tensor with shape (1,) (Scalar output). You need to construct
             this tensor in the problem.
    """
    ### YOUR CODE HERE
    l_yhat = tf.log(yhat)                            # elementwise log of predictions
    product = tf.multiply(tf.to_float(y), l_yhat)    # mask with the one-hot labels
    out = tf.negative(tf.reduce_sum(product))        # negate the sum to get a scalar
    ### END YOUR CODE
    return out
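A tiny numeric check (again just a sketch, not required by the assignment): with the one-hot label y = [0, 1] and the prediction yhat = [0.5, 0.5], the loss should be -log(0.5) ≈ 0.693.

import numpy as np
import tensorflow as tf

y = tf.constant(np.array([[0, 1]], dtype=np.int32))           # one-hot label
yhat = tf.constant(np.array([[0.5, 0.5]], dtype=np.float32))  # predicted distribution
with tf.Session() as sess:
    print(sess.run(cross_entropy_loss(y, yhat)))              # ~0.6931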
c Placeholders & Feed Dictionaries
Omitted.
The abstraction layer in assignment2/model.py is actually written quite elegantly.
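Since section (c) is skipped above, here is a rough sketch of what the placeholder and feed-dict plumbing looks like, assuming the skeleton's add_placeholders / create_feed_dict interface (names and shapes follow the docstrings quoted in this post, so treat this as an outline rather than the official solution):

def add_placeholders(self):
    # Graph inputs: one feature row per example, one-hot integer labels.
    self.input_placeholder = tf.placeholder(
        tf.float32, shape=(self.config.batch_size, self.config.n_features))
    self.labels_placeholder = tf.placeholder(
        tf.int32, shape=(self.config.batch_size, self.config.n_classes))

def create_feed_dict(self, inputs_batch, labels_batch=None):
    # Map placeholders to the current minibatch; labels are optional at prediction time.
    feed_dict = {self.input_placeholder: inputs_batch}
    if labels_batch is not None:
        feed_dict[self.labels_placeholder] = labels_batch
    return feed_dict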
d Softmax & CE Loss
def add_prediction_op(self):
    """Adds the core transformation for this model which transforms a batch of
    input data into a batch of predictions. In this case, the transformation is
    a linear layer plus a softmax transformation:

    y = softmax(Wx + b)

    Hint: Make sure to create tf.Variables as needed.
    Hint: For this simple use-case, it's sufficient to initialize both weights W
          and biases b with zeros.

    Args:
        input_data: A tensor of shape (batch_size, n_features).
    Returns:
        pred: A tensor of shape (batch_size, n_classes)
    """
    ### YOUR CODE HERE
    with tf.variable_scope("transformation"):
        bias = tf.Variable(tf.random_uniform([self.config.n_classes]))
        W = tf.Variable(tf.random_uniform([self.config.n_features, self.config.n_classes]))
        z = tf.matmul(self.input_placeholder, W) + bias
    pred = softmax(z)
    ### END YOUR CODE
    return pred

def add_loss_op(self, pred):
    """Adds cross_entropy_loss ops to the computational graph.

    Hint: Use the cross_entropy_loss function we defined. This should be a very
          short function.
    Args:
        pred: A tensor of shape (batch_size, n_classes)
    Returns:
        loss: A 0-d tensor (scalar)
    """
    ### YOUR CODE HERE
    loss = cross_entropy_loss(self.labels_placeholder, pred)
    ### END YOUR CODE
    return loss
e Training Optimizer
def add_training_op(self, loss):
    """Sets up the training Ops.

    Creates an optimizer and applies the gradients to all trainable variables.
    The Op returned by this function is what must be passed to the `sess.run()`
    call to cause the model to train. See

    https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer

    for more information.

    Hint: Use tf.train.GradientDescentOptimizer to get an optimizer object.
          Calling optimizer.minimize() will return a train_op object.

    Args:
        loss: Loss tensor, from cross_entropy_loss.
    Returns:
        train_op: The Op for training.
    """
    ### YOUR CODE HERE
    train_op = tf.train.GradientDescentOptimizer(self.config.lr).minimize(loss)
    ### END YOUR CODE
    return train_op
TF computes the partial derivatives automatically, and GradientDescentOptimizer takes care of updating the parameters.
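To see the automatic differentiation explicitly, one can ask for the gradients by hand instead of calling minimize(); a toy sketch (the numbers are arbitrary):

import tensorflow as tf

W = tf.Variable([[1.0, 2.0]])
x = tf.constant([[3.0], [4.0]])
loss = tf.reduce_sum(tf.matmul(W, x) ** 2)  # toy scalar loss: (Wx)^2 = 121

grad_W, = tf.gradients(loss, [W])           # what the optimizer computes internally
train_op = tf.train.GradientDescentOptimizer(0.1).apply_gradients([(grad_W, W)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_W))  # dloss/dW = 2 * (Wx) * x^T = [[66., 88.]]
    sess.run(train_op)       # one gradient-descent step on W

minimize(loss) is just shorthand for compute_gradients followed by apply_gradients.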
Running it produces:
Epoch 47: loss = 0.45 (0.007 sec)
Epoch 48: loss = 0.44 (0.007 sec)
Epoch 49: loss = 0.43 (0.007 sec)
Basic (non-exhaustive) classifier tests pass
I got 90.5 points, hahaha.
On 2g(i) (the following comes from the official Stanford solution): Adam uses momentum, which keeps the gradient updates from swinging too fast. First, when the model falls into a local optimum where the gradient is zero, the update is still nonzero, so it can escape the local optimum. Second, it makes each gradient estimate closer to the gradient over the whole dataset.
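For reference, the standard Adam updates behind that explanation, in the textbook form with bias correction omitted ($\beta_1, \beta_2, \alpha, \epsilon$ are the usual hyperparameters, not the assignment's notation):

$$\begin{align}
\boldsymbol{m} &\gets \beta_1 \boldsymbol{m} + (1-\beta_1)\,\nabla_{\boldsymbol{\theta}} J \\
\boldsymbol{v} &\gets \beta_2 \boldsymbol{v} + (1-\beta_2)\,(\nabla_{\boldsymbol{\theta}} J)^2 \\
\boldsymbol{\theta} &\gets \boldsymbol{\theta} - \alpha\, \boldsymbol{m} \big/ \big(\sqrt{\boldsymbol{v}} + \epsilon\big)
\end{align}$$

The rolling average $\boldsymbol{m}$ is the momentum term the solution refers to, and dividing by $\sqrt{\boldsymbol{v}}$ shrinks the step for parameters whose gradients are consistently large.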
I'd like to ask: why is the perplexity in the lecture notes, perplexity = 2^J, different from the perplexity defined in the assignment??
On the dropout part: if p is the drop probability, then gamma should be \frac{1}{1-p}.
For 2.b, every word has to go in and come back out, so it should be 2n steps; you can count it on the 5 words above, which take 10 steps.
Xavier initialization uses a uniform distribution; the post presumably has a typo where it says normal distribution.
Thanks for the corrections.
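For readers following the exchange above, the standard forms (under the usual conventions; the symbols are not necessarily the assignment's) are as follows. With a one-hot target, per-token perplexity is $1/\hat{y}_{correct}$, which equals $2^{J}$ when the cross entropy $J$ is computed with $\log_2$ and $e^{J}$ when the natural log is used, so the notes and the assignment agree once the log base is fixed. For inverted dropout with drop probability $p$, requiring the expected activation to be unchanged gives

$$\gamma\,(1-p)\,h = h \;\Rightarrow\; \gamma = \frac{1}{1-p}$$

($\gamma = 1/p$ only when $p$ denotes the keep probability). And Xavier (Glorot) initialization in its uniform form draws

$$W_{ij} \sim U\!\left[-\sqrt{\frac{6}{n_{in}+n_{out}}},\; +\sqrt{\frac{6}{n_{in}+n_{out}}}\right],$$

while the normal variant uses variance $2/(n_{in}+n_{out})$ instead.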
Thanks for sharing!
I end up referring to your code every time = =~~ Many thanks!
When taking partial derivatives, how do you determine whether a parameter matrix needs to be transposed?
Do you first derive the expression and then decide whether to add a transpose by checking the shapes of the terms in the result?
My derivations come out a mess... Could you recommend some material on matrix calculus? It feels like there is a gap between this and what I learned in calculus.
In the Stanford videos, Richard differentiates with respect to individual elements and then writes the result back in matrix form.
The blogger here probably just used the standard results for the linear case; there are ready-made formulas for left- and right-multiplication.
As for materials, you can look at http://web.stanford.edu/class/cs224n/syllabus.html
There is a supplementary reading there on computing gradients for neural networks; I haven't read it yet, so I can't say how it is.
I previously worked through a summary written by another netizen; = = after finishing it my results came out wrong, and I even felt the correct solution contradicted the theorem,
so I won't recommend it.
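For what it's worth, the "ready-made formulas" mentioned above are just the shape-matching rules for a linear layer. Under the row-vector convention used in this post, for $Z = XW + b$ with upstream gradient $\delta = \partial J/\partial Z$:

$$\frac{\partial J}{\partial W} = X^\top \delta, \qquad \frac{\partial J}{\partial X} = \delta\, W^\top, \qquad \frac{\partial J}{\partial b} = \sum_i \delta_{i,:}$$

The transposes are forced by the shapes: $\partial J/\partial W$ must have the same shape as $W$, and $X^\top \delta$ is the only product of $X$ and $\delta$ that does. Differentiating element-wise, as Richard does in the lectures, and then checking shapes gives the same result.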
I don't quite follow the computation of the backpropagation complexity in 3.d. Is it the sum of the complexities of dJ/dL, dJ/dI and dJ/dH? Since the softmax is only applied once at the end, when propagating for t steps shouldn't the first two terms be multiplied by t while the last term is counted only once?
1. Yes.
2. There are two interpretations; see the supplementary note added above.
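To make the two readings concrete (the symbols below are chosen here and are not necessarily the assignment's): with embedding size $d$, hidden size $D_h$ and vocabulary size $|V|$, one backward step through the recurrent part costs $O(d\,D_h + D_h^2)$ and one output softmax costs $O(D_h\,|V|)$. If the loss is evaluated only at the final step, backpropagating through $\tau$ steps costs roughly

$$O\big(\tau\,(d\,D_h + D_h^2) + D_h\,|V|\big),$$

whereas with a loss attached to every time step the $D_h\,|V|$ term is multiplied by $\tau$ as well, which is presumably the distinction between the two interpretations.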
Regarding part d, Minibatch Parsing:
line 27 of the code reads minibatch = [parse for parse in partial_parses if len(parse.stack) > 1 or len(parse.buffer) > 0]
where partial_parses should be minibatch.
Thanks for the correction.
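For context, a sketch of what the corrected loop might look like, assuming the assignment's q2 interface (a PartialParse with .stack, .buffer, .dependencies and .parse_step, and a model.predict that maps a list of partial parses to transitions); the filtering line is the one the correction above refers to:

def minibatch_parse(sentences, model, batch_size):
    partial_parses = [PartialParse(sentence) for sentence in sentences]
    unfinished_parses = partial_parses[:]           # shallow copy
    while unfinished_parses:
        minibatch = unfinished_parses[:batch_size]
        while minibatch:
            transitions = model.predict(minibatch)  # one transition per partial parse
            for parse, transition in zip(minibatch, transitions):
                parse.parse_step(transition)
            # Filter the minibatch itself, not partial_parses, as noted above.
            minibatch = [parse for parse in minibatch
                         if len(parse.stack) > 1 or len(parse.buffer) > 0]
        unfinished_parses = unfinished_parses[batch_size:]
    return [parse.dependencies for parse in partial_parses]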