# A Hierarchical Bayesian Language Model based on Pitman-Yor Processes

$$P(sentence)\approx \prod^T_{i=1}P(word_i\vert word^{i-1}_{i-n+1})$$

## Probability Review

### Multinomial Distribution

$$\mathbf{x}=(0,0,1,0,0,0)^\text{T}$$

$$p(\mathbf{x}|\mathbf{\mu})=\prod_{k=1}^{K} \mu_{k}^{x_k}$$

$$p(\mathcal{D}|\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$$

$$\text{Mult}(m_1, m_2, \cdots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \cdots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$$
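The multinomial pmf above can be checked numerically; a minimal sketch (the function name `multinomial_pmf` is my own, not from the source):

```python
import math

def multinomial_pmf(counts, mu):
    """Mult(m_1..m_K | mu, N) = N! / (m_1! ... m_K!) * prod_k mu_k^{m_k}."""
    n = sum(counts)
    coef = math.factorial(n)
    for m in counts:
        coef //= math.factorial(m)  # multinomial coefficient
    p = float(coef)
    for m, mu_k in zip(counts, mu):
        p *= mu_k ** m
    return p

# Fair six-sided die, N = 3 throws: P(one 1, one 2, one 3) = 3!/(1!1!1!) * (1/6)^3 = 1/36
print(multinomial_pmf([1, 1, 1, 0, 0, 0], [1/6] * 6))
```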

### Dirichlet Distribution

$$\text{Dir}(\boldsymbol{\mu}\vert\boldsymbol{\alpha})=\frac{\Gamma(\alpha_{1}+\cdots+\alpha_{K})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_{k}^{\alpha_{k}-1} \label{eqn:dirichlet}$$

$$\Gamma(x) = \int_0^\infty s^{x-1} e^{-s}\, ds = (x-1)\,\Gamma(x-1)$$

The Dirichlet distribution is the conjugate prior of the multinomial distribution. First, sample the multinomial parameter $\boldsymbol{\mu}$ from a Dirichlet prior, then multiply by the likelihood to obtain the posterior over the parameter:

$$p(\mu |{\cal D},\alpha ) \propto {\color{red} p({\cal D}|\mu )} {\color{blue} p(\mu |\alpha )} \propto \prod\limits_{k = 1}^K {\mu _k^{{\alpha _k} + {m_k} - 1}}$$

\begin{aligned} p(\mu |D,\alpha ) &= {\text{Dir}}(\mu |\alpha + {\mathbf{m}}) \\ &= \frac{{\Gamma ({\alpha _1} + \cdots + {\alpha _K} + N)}}{{\Gamma ({\alpha _1} + {m_1}) \cdots \Gamma ({\alpha _K} + {m_K})}}\prod\limits_{k = 1}^K {\mu _k^{{\alpha _k} + {m_k} - 1}} \\ \end{aligned}
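Conjugacy makes the posterior update a simple count increment, $\alpha_k \to \alpha_k + m_k$. A minimal sketch (function names are my own):

```python
def dirichlet_posterior(alpha, data):
    """Posterior parameters alpha_k + m_k, where m_k counts category k in data."""
    post = list(alpha)
    for x in data:
        post[x] += 1
    return post

def dirichlet_mean(alpha):
    """E[mu_k] = alpha_k / sum(alpha)."""
    s = sum(alpha)
    return [a / s for a in alpha]

alpha = [1.0, 1.0, 1.0]          # symmetric prior over 3 categories
data = [0, 0, 1, 0, 2, 0]        # N = 6 observations, m = (4, 1, 1)
post = dirichlet_posterior(alpha, data)
print(post)                      # [5.0, 2.0, 2.0]
print(dirichlet_mean(post))      # posterior mean of mu: [5/9, 2/9, 2/9]
```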

### Dirichlet Process

$$\begin{aligned} \pi|\alpha &\sim \text{Dir}(\tfrac{\alpha}{K}, \cdots, \tfrac{\alpha}{K}) \\ z_i|\pi &\sim \text{Mult}(\pi) \end{aligned}$$

$$(G(A_1), \cdots, G(A_r)) \sim \text{Dir}(\alpha H(A_1), \cdots, \alpha H(A_r))$$

$$\begin{aligned} E[G(A)] &= H(A) \\ V[G(A)] &= \frac{H(A)(1 - H(A))}{\alpha + 1} \end{aligned}$$
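These moments can be sanity-checked with the finite symmetric-Dirichlet construction above ($\pi \sim \text{Dir}(\alpha/K, \cdots, \alpha/K)$), since aggregating Dirichlet coordinates gives a Beta variable whose mean and variance match these formulas. A Monte Carlo sketch (all names are my own):

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw from Dir(alphas) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

# H is uniform over K categories; A is the first half of them, so H(A) = 0.5.
K, alpha, H_A = 50, 2.0, 0.5
rng = random.Random(0)
draws = []
for _ in range(5000):
    mu = dirichlet_sample([alpha / K] * K, rng)
    draws.append(sum(mu[: K // 2]))          # G(A)

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(mean)  # close to H(A) = 0.5
print(var)   # close to H(A)(1 - H(A)) / (alpha + 1) = 1/12
```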

### Chinese Restaurant Process

The DP has a property called the clustering property. Take $n \in \mathbb{N}$ natural numbers and divide them into some number of subsets; call one such division a partition. We want a distribution over partitions, i.e., the probability that a particular partition occurs. Since the number of possible partitions has no closed form and is only known to be bounded by $O(n^n)$, it cannot be computed directly by combinatorics.

1. The first customer always sits at the first table.

2. The $n$-th customer sits at the first empty table with probability $\frac{\alpha}{n-1+\alpha}$, and at a given occupied table with probability $\frac{c}{n-1+\alpha}$, where $\alpha$ is a parameter of the process and $c$ is the number of customers already seated at that table.
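The two seating rules can be simulated directly; a minimal sketch (the function `crp` is my own name, assuming nothing beyond the rules above):

```python
import random

def crp(n, alpha, seed=0):
    """Seat n customers by the Chinese restaurant process; return table sizes."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for _ in range(n):
        # Existing table k has weight tables[k]; a new table has weight alpha,
        # so the n-th customer opens a new table with prob alpha / (n - 1 + alpha).
        k = rng.choices(range(len(tables) + 1), weights=tables + [alpha])[0]
        if k == len(tables):
            tables.append(1)   # first empty table
        else:
            tables[k] += 1     # join an occupied table
    return tables

print(crp(20, 1.0, seed=42))
```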

COS 597C: Bayesian nonparametrics

$$\begin{aligned} P(\pi_{[N]}) &= \frac{1}{\alpha(\alpha+1)\cdots(\alpha+N-1)} \prod_{c \in \pi_{[N]}} \alpha\,(|c|-1)! \\ &= \frac{\alpha^{K}}{\alpha(\alpha+1)\cdots(\alpha+N-1)} \prod_{c \in \pi_{[N]}} (|c|-1)! \end{aligned}$$

(Figure omitted: the power law $y=ax^{-k}$ has very small probability density or mass over most of its range; image source: Wikipedia.)

### Pitman-Yor Process

The Pitman-Yor process is a nonparametric Bayesian model (the Bayesian approach of letting the data determine model complexity); here it is used to model a unigram language model. Let $W$ be a vocabulary containing a finite number $V$ of words. Each word $w \in W$ occurs with probability $G(w)$, and the probabilities of all words form the vector $G=[G(w)]_{w \in W}$. Assume the prior of $G$ is a Pitman-Yor process:

$$G \sim \text{PY}(d,\theta,G_0)$$

1. The rich-gets-richer clustering property: the more often a word has occurred, the more likely it is to occur again ($\frac{c_k-d}{\theta+c_\cdot}$).

2. The more often draws have been made from $G_0$, the more likely future draws are to come from $G_0$ ($\frac{\theta+dt}{\theta+c_\cdot}$).
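Both effects fall out of the two-parameter (Pitman-Yor) generalization of the CRP: occupied table $k$ is chosen with probability $\frac{c_k-d}{\theta+c_\cdot}$ and a new table with probability $\frac{\theta+dt}{\theta+c_\cdot}$, where $t$ is the current number of tables. A sampling sketch (names are mine):

```python
import random

def pitman_yor_crp(n, d, theta, seed=0):
    """Seat n customers by the Pitman-Yor CRP (0 <= d < 1, theta > -d)."""
    rng = random.Random(seed)
    tables = []  # tables[k] = customers at table k
    for _ in range(n):
        t = len(tables)
        # occupied table k: weight c_k - d;  new table: weight theta + d*t
        weights = [c - d for c in tables] + [theta + d * t]
        k = rng.choices(range(t + 1), weights=weights)[0]
        if k == t:
            tables.append(1)
        else:
            tables[k] += 1
    return tables

# Larger discount d gives a heavier tail: many more tables for the same n.
print(len(pitman_yor_crp(2000, 0.0, 1.0)), len(pitman_yor_crp(2000, 0.9, 1.0)))
```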

## Hierarchical Pitman-Yor Language Model

$$G_{\mathbf{u}} \sim \text{PY}(d_{|\mathbf{u}|}, \theta_{|\mathbf{u}|}, G_{\pi(\mathbf{u})})$$

$$G_{\emptyset} \sim \text{PY}(d_0, \theta_0, G_0)$$

the lazy dog
the lazy cat
the lazy cat
that lazy fox
that lazy fox
that lazy fox

$$\begin{aligned} G_{\text{the lazy}}(\text{fox}) &\sim \text{PY}(d_2, \theta_2, G_{\text{lazy}}(\text{fox})) \\ &\sim \text{PY}(d_2, \theta_2, \text{PY}(d_1, \theta_1, G_{\emptyset}(\text{fox}))) \\ &\sim \text{PY}(d_2, \theta_2, \text{PY}(d_1, \theta_1, \text{PY}(d_0, \theta_0, G_0))) \end{aligned}$$

## Hierarchical Chinese Restaurant Processes

$$\begin{cases} t_{\mathbf{u}w\cdot} = 0 & \text{if } c_{\mathbf{u}w\cdot} = 0; \\ 1 \leqslant t_{\mathbf{u}w\cdot} \leqslant c_{\mathbf{u}w\cdot} & \text{if } c_{\mathbf{u}w\cdot} > 0. \end{cases}$$

$$c_{\mathbf{u}w\cdot} = \sum_{\mathbf{u'}:\pi(\mathbf{u'}) = \mathbf{u}} t_{\mathbf{u'}w\cdot}$$

For example, $t_{\mathbf{u'}w\cdot}=1$ ("the lazy fox") or $t_{\mathbf{u'}w\cdot}=3$ ("that lazy fox" × 3), while $c_{\mathbf{u}w\cdot}=4$ in the parent context ("the lazy fox" plus "that lazy fox" × 3).
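The consistency constraint $c_{\mathbf{u}w\cdot} = \sum_{\mathbf{u'}:\pi(\mathbf{u'})=\mathbf{u}} t_{\mathbf{u'}w\cdot}$ can be checked mechanically on the toy corpus. A sketch, assuming the extreme seating where every occurrence of "that lazy fox" opens its own table (the dictionaries and `parent` helper are my own construction):

```python
from collections import defaultdict

# t[u][w] = t_{uw.}: number of tables serving w in the restaurant for context u
t = {
    ("the", "lazy"): {"fox": 1},   # one table for "the lazy fox"
    ("that", "lazy"): {"fox": 3},  # extreme seating: one table per occurrence
}

def parent(u):
    """pi(u): drop the earliest word of the context."""
    return u[1:]

# Each table at context u sends one customer to the parent restaurant,
# so c_{uw.} in the parent = sum of t_{u'w.} over children u'.
c = defaultdict(lambda: defaultdict(int))
for u, words in t.items():
    for w, t_uw in words.items():
        c[parent(u)][w] += t_uw

print(c[("lazy",)]["fox"])  # 4
```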

## Training

$$p(\mathbf{G},\Theta|\mathcal{D}) = \frac{p(\mathbf{G},\Theta,\mathcal{D})}{p(\mathcal{D})}$$

$$p(\mathbf{S},\Theta|\mathcal{D}) = \frac{p(\mathbf{S},\Theta,\mathcal{D})}{p(\mathcal{D})}$$

$$p(w|\mathbf{u},\mathcal{D}) = \int {\color{red} p(w|\mathbf{u},\mathbf{S},\Theta)}\, {\color{blue} p(\mathbf{S},\Theta|\mathcal{D})}\, d(\mathbf{S},\Theta)$$

$$p(w|\mathbf{u},\mathcal{D}) \approx \frac{1}{I}\sum_{i=1}^{I} p\left( w|\mathbf{u},\mathbf{S}^{(i)},\Theta^{(i)} \right)$$

$$\begin{aligned} p\left( w|0,\mathbf{S},\Theta \right) &= \frac{1}{V} \\ p\left( w|\mathbf{u},\mathbf{S},\Theta \right) &= \frac{c_{\mathbf{u}w\cdot} - d_{|\mathbf{u}|} t_{\mathbf{u}w\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot\cdot}} + \frac{\theta_{|\mathbf{u}|} + d_{|\mathbf{u}|} t_{\mathbf{u}\cdot\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot\cdot}}\, p(w|\pi(\mathbf{u}),\mathbf{S},\Theta) \end{aligned}$$
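The recursion translates directly into code. A minimal sketch, assuming contexts are word tuples, `d`/`theta` are indexed by context length, and the seating arrangement is summarized by count dictionaries (all names are my own):

```python
def predictive(w, u, counts, tables, d, theta, V):
    """p(w | u, S, Theta): recursive Pitman-Yor predictive probability.
    counts[u][w] = c_{uw.}, tables[u][w] = t_{uw.}; u is a tuple of words,
    and u = None stands for the global base distribution G_0 = Uniform(V)."""
    if u is None:
        return 1.0 / V
    c_uw = counts.get(u, {}).get(w, 0)
    c_u = sum(counts.get(u, {}).values())   # c_{u..}
    t_uw = tables.get(u, {}).get(w, 0)
    t_u = sum(tables.get(u, {}).values())   # t_{u..}
    du, th = d[len(u)], theta[len(u)]
    # back off to the parent context pi(u); the empty context backs off to G_0
    back = predictive(w, u[1:] if u else None, counts, tables, d, theta, V)
    return (c_uw - du * t_uw) / (th + c_u) + (th + du * t_u) / (th + c_u) * back

d, theta, V = [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], 10
counts = {(): {"fox": 4}}   # c_{w.} = 4 in the empty (unigram) context
tables = {(): {"fox": 1}}   # one table serving "fox"
# (4 - 0.5*1)/(1+4) + (1 + 0.5*1)/(1+4) * (1/10) = 0.7 + 0.03 ≈ 0.73
print(predictive("fox", (), counts, tables, d, theta, V))
```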

## References

[1] Y. W. Teh, "A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes," in Proc. COLING/ACL, 2006.
