2 Neural Transition-Based Dependency Parsing
a Working through transition-based dependency parsing by hand
b Complexity
c Initialization & parse_step
def __init__(self, sentence):
    """Initializes this partial parse.

    Your code should initialize the following fields:
        self.stack: The current stack represented as a list with the top of the stack as the
                    last element of the list.
        self.buffer: The current buffer represented as a list with the first item on the
                     buffer as the first item of the list
        self.dependencies: The list of dependencies produced so far. Represented as a list of
                tuples where each tuple is of the form (head, dependent).
                Order for this list doesn't matter.

    The root token should be represented with the string "ROOT"

    Args:
        sentence: The sentence to be parsed as a list of words.
                  Your code should not modify the sentence.
    """
    # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
    self.sentence = sentence

    ### YOUR CODE HERE
    self.stack = ['ROOT']
    self.buffer = sentence[:]  # shallow copy, so the input sentence is left untouched
    self.dependencies = []
    ### END YOUR CODE
def parse_step(self, transition):
    """Performs a single parse step by applying the given transition to this partial parse

    Args:
        transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                    and right-arc transitions.
    """
    ### YOUR CODE HERE
    if transition == "S":
        # shift: move the first buffer item onto the top of the stack
        self.stack.append(self.buffer.pop(0))
    elif transition == "LA":
        # left-arc: the top of the stack becomes the head of the second item
        self.dependencies.append((self.stack[-1], self.stack[-2]))
        self.stack.pop(-2)
    else:
        # right-arc: the second item becomes the head of the top of the stack
        self.dependencies.append((self.stack[-2], self.stack[-1]))
        self.stack.pop(-1)
    ### END YOUR CODE
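As a quick sanity check of the transitions, here is a minimal, self-contained sketch that re-declares a `PartialParse` class with the same three fields and runs a hand-written transition sequence. The example sentence and its transition sequence are illustrative assumptions, not something stated in this post. The run also illustrates why a sentence of n words needs 2n steps: every word is shifted onto the stack once and removed by exactly one arc transition.

```python
class PartialParse:
    """Minimal re-declaration of the parser state (stack, buffer, dependencies)."""

    def __init__(self, sentence):
        self.sentence = sentence
        self.stack = ["ROOT"]
        self.buffer = list(sentence)   # copy, so the input is not modified
        self.dependencies = []

    def parse_step(self, transition):
        if transition == "S":          # shift: buffer front -> stack top
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":       # left-arc: top is the head of the second item
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        elif transition == "RA":       # right-arc: second item is the head of the top
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop()
        else:
            raise ValueError("unknown transition: %r" % transition)


# Hypothetical example sentence and hand-written transition sequence.
sentence = ["I", "parsed", "this", "sentence", "correctly"]
transitions = ["S", "S", "LA", "S", "S", "LA", "RA", "S", "RA", "RA"]

parse = PartialParse(sentence)
for t in transitions:
    parse.parse_step(t)

num_steps = len(transitions)           # 10 steps for 5 words
```

After the loop the buffer is empty and only `ROOT` remains on the stack, which is the termination condition the batched parser checks for.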
d Minibatch Parsing
def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    Args:
        sentences: A list of sentences to be parsed (each sentence is a list of words)
        model: The model that makes parsing decisions. It is assumed to have a function
               model.predict(partial_parses) that takes in a list of PartialParses as input and
               returns a list of transitions predicted for each parse. That is, after calling
                   transitions = model.predict(partial_parses)
               transitions[i] will be the next transition to apply to partial_parses[i].
        batch_size: The number of PartialParses to include in each minibatch
    Returns:
        dependencies: A list where each element is the dependencies list for a parsed sentence.
                      Ordering should be the same as in sentences (i.e., dependencies[i] should
                      contain the parse for sentences[i]).
    """
    ### YOUR CODE HERE
    # refer: https://github.com/zysalice/cs224/blob/master/assignment2/q2_parser_transitions.py
    partial_parses = [PartialParse(s) for s in sentences]
    unfinished_parse = partial_parses
    while len(unfinished_parse) > 0:
        minibatch = unfinished_parse[0:batch_size]
        # perform transition and single step parser on the minibatch until it is empty
        while len(minibatch) > 0:
            transitions = model.predict(minibatch)
            for index, action in enumerate(transitions):
                minibatch[index].parse_step(action)
            minibatch = [parse for parse in minibatch
                         if len(parse.stack) > 1 or len(parse.buffer) > 0]
        # move to the next batch
        unfinished_parse = unfinished_parse[batch_size:]
    dependencies = []
    for n in range(len(sentences)):
        dependencies.append(partial_parses[n].dependencies)
    ### END YOUR CODE
    return dependencies
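To try the batching logic without a trained network, the sketch below pairs a compact copy of the parser state with a hypothetical `DummyModel` whose `predict` shifts while the buffer is non-empty and applies right-arc otherwise. `DummyModel`, the compact `PartialParse`, and the sample sentences are assumptions made for illustration. This variant always takes the first `batch_size` unfinished parses on every step, which is a different scheduling of the same work and should yield the same dependencies.

```python
class PartialParse:
    """Compact copy of the parser state (stack, buffer, dependencies)."""

    def __init__(self, sentence):
        self.stack = ["ROOT"]
        self.buffer = list(sentence)
        self.dependencies = []

    def parse_step(self, transition):
        if transition == "S":
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":
            dependent = self.stack.pop(-2)
            self.dependencies.append((self.stack[-1], dependent))
        else:  # "RA"
            dependent = self.stack.pop()
            self.dependencies.append((self.stack[-1], dependent))


class DummyModel:
    """Hypothetical stand-in for the network: shift first, then right-arc."""

    def predict(self, partial_parses):
        return ["S" if p.buffer else "RA" for p in partial_parses]


def is_unfinished(parse):
    # A parse is done once the buffer is empty and only ROOT is left on the stack.
    return bool(parse.buffer) or len(parse.stack) > 1


def minibatch_parse(sentences, model, batch_size):
    partial_parses = [PartialParse(s) for s in sentences]
    unfinished = [p for p in partial_parses if is_unfinished(p)]
    while unfinished:
        minibatch = unfinished[:batch_size]
        for parse, transition in zip(minibatch, model.predict(minibatch)):
            parse.parse_step(transition)
        # Drop completed parses so the next minibatch is refilled from the rest.
        unfinished = [p for p in unfinished if is_unfinished(p)]
    return [p.dependencies for p in partial_parses]


sentences = [["a", "b", "c"], ["d", "e"], ["f"]]
deps = minibatch_parse(sentences, DummyModel(), batch_size=2)
```

Because completed parses are filtered out after every step, each call to `predict` works on a full batch for as long as enough unfinished sentences remain.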
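e Xavier initialization
To keep neurons from becoming too correlated with one another, parameters are initialized randomly, and one of the most common schemes is Xavier initialization. Given a matrix $\boldsymbol{A}$ of size $m \times n$, each entry $A_{ij}$ is drawn from a uniform distribution over $[-\epsilon, \epsilon]$, where:
$$\epsilon = \frac{\sqrt{6}}{\sqrt{m + n}}$$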
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API


def xavier_weight_init():
    """Returns function that creates random tensor.

    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.

    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.

        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6 / <sum of the sizes of shape's dimensions>)
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))

        This function will be used as a variable initializer.

        Args:
            shape: Tuple or 1-d array that specifies the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        ### YOUR CODE HERE
        # 6.0 forces float division; with an integer 6 the result is truncated under Python 2.
        epsilon = np.sqrt(6.0 / np.sum(shape))
        out = tf.Variable(tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon))
        ### END YOUR CODE
        return out
    # Returns defined initializer function.
    return _xavier_initializer
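The sampling rule itself does not depend on TensorFlow. A framework-free sketch using only Python's standard library might look like the following; the helper name `xavier_uniform` and the list-of-lists matrix representation are illustrative assumptions.

```python
import math
import random


def xavier_uniform(m, n, rng=None):
    """Return an m x n matrix (list of lists) whose entries are drawn uniformly
    from [-epsilon, epsilon], where epsilon = sqrt(6) / sqrt(m + n)."""
    rng = rng or random.Random()
    epsilon = math.sqrt(6.0) / math.sqrt(m + n)
    return [[rng.uniform(-epsilon, epsilon) for _ in range(n)] for _ in range(m)]


rng = random.Random(0)               # fixed seed so the run is repeatable
A = xavier_uniform(2, 3, rng)
epsilon = math.sqrt(6.0 / (2 + 3))   # about 1.095 for a 2 x 3 matrix
```

Larger layers get a smaller range: the bound shrinks as the sum of the two dimensions grows.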
f Dropout
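Dropout randomly sets each hidden-layer activation to $0$ with probability $p$ and then multiplies the result by a constant $\gamma$. This can be written as:
$$\boldsymbol{h}_{drop} = \gamma \boldsymbol{d} \circ \boldsymbol{h}$$
where $\boldsymbol{d} \in \{0,1\}^{D_{h}}$ ($D_{h}$ is the number of units in the hidden layer $\boldsymbol{h}$) is a mask vector whose entries are each 0 with probability $p$ and 1 with probability $1-p$. $\gamma$ is chosen so that the expected activation is unchanged, i.e. for all $1 \le i \le D_{h}$:
$$\mathbb{E}_{p}[\boldsymbol{h}_{drop}]_{i} = \boldsymbol{h}_{i}$$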
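So, given $p$, what must $\gamma$ be? Each mask entry $d_{i}$ equals 1 with probability $1-p$, so $\mathbb{E}_{p}[\boldsymbol{h}_{drop}]_{i} = \gamma (1-p) \boldsymbol{h}_{i}$. Setting this equal to $\boldsymbol{h}_{i}$ gives:
$$\gamma = \frac{1}{1-p}$$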
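A quick Monte Carlo check can make the scaling constant concrete. The sketch below zeroes each unit with probability $p$ and rescales the survivors by $1/(1-p)$, the value that keeps the expected activation unchanged because a unit survives with probability $1-p$. The function name `dropout`, the seed, and the sample size are illustrative assumptions.

```python
import random


def dropout(h, p, rng):
    """Inverted dropout: drop each unit with probability p and scale the
    surviving units by gamma = 1 / (1 - p)."""
    gamma = 1.0 / (1.0 - p)
    return [gamma * x if rng.random() >= p else 0.0 for x in h]


rng = random.Random(0)                   # fixed seed so the run is repeatable
p = 0.5
h = [1.0] * 200000                       # many identical units, to estimate the mean
h_drop = dropout(h, p, rng)
mean_after = sum(h_drop) / len(h_drop)   # should be close to the input value 1.0
```

With $p = 0.5$ every surviving unit is doubled, so roughly half the outputs are 2.0 and the rest are 0.0, and the average stays near 1.0.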
g Adam Optimizer
The standard SGD update rule is:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} J_{minibatch}(\boldsymbol{\theta})$$
Adam first adds momentum by keeping a moving average $\boldsymbol{m}$ of the gradients:
$$\begin{aligned}
\boldsymbol{m} &\leftarrow \beta_{1}\boldsymbol{m} + (1 - \beta_{1}) \nabla_{\boldsymbol{\theta}} J_{minibatch}(\boldsymbol{\theta})\\
\boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} - \alpha\boldsymbol{m}
\end{aligned}$$
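where $\beta_{1}$ is a hyperparameter between 0 and 1 (often set to 0.9). Because $\beta_{1}$ is close to 1, each update stays roughly the same as the previous one, so momentum reduces the variance of the updates and avoids sharp oscillations in the gradient.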
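Adam additionally keeps a moving average $\boldsymbol{v}$ of the squared gradients, controlled by a second hyperparameter $\beta_{2}$, and uses it to scale each component of the update:
$$\begin{aligned}
\boldsymbol{m} &\leftarrow \beta_{1}\boldsymbol{m} + (1 - \beta_{1}) \nabla_{\boldsymbol{\theta}} J_{minibatch}(\boldsymbol{\theta})\\
\boldsymbol{v} &\leftarrow \beta_{2}\boldsymbol{v} + (1 - \beta_{2})( \nabla_{\boldsymbol{\theta}} J_{minibatch}(\boldsymbol{\theta}) \circ \nabla_{\boldsymbol{\theta}} J_{minibatch}(\boldsymbol{\theta})) \\
\boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} - \alpha\boldsymbol{m} / \sqrt{\boldsymbol{v}}
\end{aligned}$$
where $\circ$ and $/$ are both elementwise operations.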
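The three update rules translate almost line by line into code. The following sketch applies them to a hypothetical one-dimensional objective $J(\theta) = (\theta - 3)^2$. The objective, the learning rate, the value $\beta_{2} = 0.99$, and the small constant `eps` added to the denominator to avoid division by zero are all assumptions for illustration and do not appear in the formulas. Like the formulas, the sketch omits the bias-correction terms of the full Adam algorithm.

```python
import math


def adam_step(theta, m, v, grad, alpha=0.01, beta1=0.9, beta2=0.99, eps=1e-8):
    """One simplified Adam update for a scalar parameter (no bias correction)."""
    m = beta1 * m + (1 - beta1) * grad            # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # moving average of squared gradients
    theta = theta - alpha * m / (math.sqrt(v) + eps)
    return theta, m, v


def grad_J(theta):
    """Gradient of the toy objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta, m, v = 0.0, 0.0, 0.0
# A single step from the starting point, kept separately for inspection.
first_theta, _, _ = adam_step(theta, m, v, grad_J(theta))

for _ in range(5000):
    theta, m, v = adam_step(theta, m, v, grad_J(theta))
```

On the first step $m/\sqrt{v}$ has magnitude close to 1 regardless of how large the gradient is, so the parameter moves by roughly $\alpha$: dividing by $\sqrt{\boldsymbol{v}}$ normalizes the step size per parameter.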
h Implementing the neural dependency parser
def add_prediction_op(self):
    """Adds the 1-hidden-layer NN:
        h = Relu(xW + b1)
        h_drop = Dropout(h, dropout_rate)
        pred = h_dropU + b2

    Note that we are not applying a softmax to pred. The softmax will instead be done in
    the add_loss_op function, which improves efficiency because we can use
    tf.nn.softmax_cross_entropy_with_logits

    Use the initializer from q2_initialization.py to initialize W and U (you can initialize b1
    and b2 with zeros)

    Hint: Here are the dimensions of the various variables you will need to create
                W:  (n_features*embed_size, hidden_size)
                b1: (hidden_size,)
                U:  (hidden_size, n_classes)
                b2: (n_classes)
    Hint: Note that tf.nn.dropout takes the keep probability (1 - p_drop) as an argument.
        The keep probability should be set to the value of self.dropout_placeholder

    Returns:
        pred: tf.Tensor of shape (batch_size, n_classes)
    """
    x = self.add_embedding()
    ### YOUR CODE HERE
    xavier = xavier_weight_init()
    with tf.variable_scope("transformation"):
        # The docstring allows zero-initialized biases; random uniform values are used here.
        b1 = tf.Variable(tf.random_uniform([self.config.hidden_size,]))
        b2 = tf.Variable(tf.random_uniform([self.config.n_classes]))

        W = xavier([self.config.n_features * self.config.embed_size, self.config.hidden_size])
        U = xavier([self.config.hidden_size, self.config.n_classes])

        z1 = tf.matmul(x, W) + b1
        h = tf.nn.relu(z1)
        # self.dropout_placeholder holds the keep probability (1 - p_drop)
        h_drop = tf.nn.dropout(h, self.dropout_placeholder)
        pred = tf.matmul(h_drop, U) + b2
    ### END YOUR CODE
    return pred
924/924 [==============================] - 44s - train loss: 0.0602
Evaluating on dev set - dev UAS: 88.44
New best dev UAS! Saving model in ./data/weights/parser.weights

================================================================================
TESTING
================================================================================
Restoring the best model weights found on the dev set
Final evaluation on test set - test UAS: 88.69
Writing predictions
Done!

924/924 [==============================] - 49s - train loss: 0.0631
Evaluating on dev set - dev UAS: 88.54
New best dev UAS! Saving model in ./data/weights/parser.weights

================================================================================
TESTING
================================================================================
Restoring the best model weights found on the dev set
Final evaluation on test set - test UAS: 88.92
Writing predictions
Done!
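On 2g(i) (the following is from the solution published on the Stanford course site): Adam uses momentum, which keeps the updates from changing too abruptly. First, when the optimizer lands in a local optimum, the accumulated momentum keeps the update nonzero, so it can still escape. Second, averaging brings each gradient estimate closer to the gradient over the whole dataset.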
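May I ask why the perplexity in the lecture notes, perplexity = 2^J, differs from the perplexity defined in the assignment?
2.b: each word goes onto the stack once and comes off once, so it should be 2n steps; you can count that the 5 words above take 10 steps.
I end up referring to your code every time = =~~ Thank you very much!
I had been following a summary written by another user before. = = After reading it my results still came out wrong, and I even thought the correct solution contradicted the theorem.
I don't quite understand the backpropagation complexity calculation in 3.d. Is it the sum of the complexities of the three terms dJ/dL, dJ/dI, and dJ/dH? Given that the softmax is applied only once at the end, when propagating back t steps, shouldn't the first two terms be multiplied by t while the last term is counted only once?
1. Yes.
2. There are two interpretations; see the addendum in the post above.
In section d, Minibatch Parsing,
line 27 of the code: minibatch = [parse for parse in partial_parses if len(parse.stack) > 1 or len(parse.buffer) > 0]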