Python Preprocessing Scripts for the Penn Treebank and CTB - 码农场


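Before writing a syntactic parser, PTB and CTB usually need to be preprocessed so that:

each sentence is on one line, in a single file;
the data is split into training/development/test sets in the conventional proportions;
the XML tags in CTB are removed, leaving only the sentences, and the encoding is converted.

These steps are tedious: a bracketed tree usually has to be parsed before it can be flattened onto one line, and CTB's directory layout and file format differ from PTB's.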
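
The flattening step can be illustrated without NLTK. The sketch below is not the code used by the scripts (which rely on NLTK for parsing); it is a minimal stand-in that tracks bracket depth to find where each top-level tree ends and then collapses whitespace. The function name `iter_one_line_trees` is hypothetical, and the sketch assumes that literal parentheses inside tokens are already escaped, as PTB does with -LRB- and -RRB-.

```python
def iter_one_line_trees(text):
    """Yield each top-level bracketed tree in `text` as a single line.

    Assumption: parentheses only ever delimit tree nodes, i.e. literal
    parentheses in the sentence are escaped (e.g. -LRB- / -RRB-).
    """
    depth = 0
    buf = []
    for ch in text:
        if ch == "(":
            depth += 1
        if depth > 0:
            # Only collect characters that belong to some tree.
            buf.append(ch)
        if ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # A top-level tree just closed: collapse newlines and
                # indentation into single spaces.
                yield " ".join("".join(buf).split())
                buf = []


if __name__ == "__main__":
    sample = """( (S
    (NP (DT The) (NN cat))
    (VP (VBZ sleeps))))
( (S (NP (PRP It)) (VP (VBZ purrs))))
"""
    for line in iter_one_line_trees(sample):
        print(line)
```

A real parser such as NLTK's also validates the tree, which this character scan does not attempt.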

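I assumed that for datasets this old, some open-source project would already do this grunt work. To my surprise there was none, so I wrote a few scripts that automate the preprocessing and open-sourced them on GitHub.

The dataset split follows the conventional one used by Chen and Manning (2014), Dyer et al. (2015), and others: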

PTB Training: 02-21. Development: 22. Test: 23.
CTB Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
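
As a rough illustration of how these ranges could be applied, the sketch below maps a file name to its split. It is not taken from the scripts: the helper names `ptb_split` and `ctb_split` are hypothetical, and the file-name patterns (`wsj_SSNN.mrg`, where `SS` is the section, and `chtb_NNN.fid`) are assumptions about how the corpora are distributed.

```python
import re

# Section numbers (PTB) and file IDs (CTB) from the split listed above.
PTB_SPLITS = {
    "train": [range(2, 22)],
    "dev": [range(22, 23)],
    "test": [range(23, 24)],
}
CTB_SPLITS = {
    "train": [range(1, 816), range(1001, 1137)],
    "dev": [range(886, 932), range(1148, 1152)],
    "test": [range(816, 886), range(1137, 1148)],
}


def _lookup(number, splits):
    """Return the split whose ranges contain `number`, or None."""
    for name, ranges in splits.items():
        if any(number in r for r in ranges):
            return name
    return None


def ptb_split(filename):
    """Map e.g. 'wsj_0201.mrg' to 'train' (assumed naming: wsj_SSNN)."""
    match = re.search(r"wsj_(\d{2})\d{2}", filename.lower())
    return _lookup(int(match.group(1)), PTB_SPLITS) if match else None


def ctb_split(filename):
    """Map e.g. 'chtb_001.fid' to 'train' (assumed naming: chtb_NNN)."""
    match = re.search(r"chtb_(\d+)", filename.lower())
    return _lookup(int(match.group(1)), CTB_SPLITS) if match else None


if __name__ == "__main__":
    print(ptb_split("wsj_2300.mrg"))
    print(ctb_split("chtb_1148.fid"))
```

Files that fall outside every range (for example PTB sections 00, 01, and 24) map to `None` and would simply be left out of all three sets.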

Dependencies

Python 3
NLTK

PTB

1. Import PTB into NLTK

Parsing the bracketed tree structure relies on NLTK. Following the NLTK instructions, put BROWN and WSJ into nltk_data/corpora/ptb, which gives:

ptb
├── BROWN
└── WSJ
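
Before running the script, it may help to confirm that the layout above is in place. The following is a small, hypothetical check (the helper `missing_ptb_dirs` is not part of the scripts); the location of `nltk_data` varies between machines, so the path is passed in rather than assumed.

```python
from pathlib import Path


def missing_ptb_dirs(ptb_root, expected=("BROWN", "WSJ")):
    """Return the expected subdirectories that are absent under `ptb_root`."""
    root = Path(ptb_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: nltk_data lives in the home directory; adjust as needed.
    ptb_root = Path.home() / "nltk_data" / "corpora" / "ptb"
    missing = missing_ptb_dirs(ptb_root)
    if missing:
        print("Missing under", ptb_root, ":", ", ".join(missing))
    else:
        print("PTB layout looks complete.")
```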

2. Run `ptb.py`

This script performs the conversion and splitting automatically; you only need to specify a directory for the output files.

usage: ptb.py [-h] --output OUTPUT
Combine Penn Treebank WSJ MRG files into train/dev/test set
optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt

For example:

$ python3 ptb.py --output ptb-combined
Importing ptb from nltk
Generating ptb-combined/train.txt
1875 files...
100.00%
39832 sentences.
Generating ptb-combined/dev.txt
83 files...
100.00%
1700 sentences.
Generating ptb-combined/test.txt
100 files...
100.00%
2416 sentences.

CTB

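CTB's structure is somewhat messy. Its documents contain XML tags that need to be removed, its file names are not numbered consecutively, and its encoding is GBK. NLTK does not support it out of the box either. I wrote code to handle each of these, so you only need to specify the CTB root directory (the one that contains index.html).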

usage: ctb.py [-h] --ctb CTB --output OUTPUT
Combine Chinese Treebank 5.1 fid files into train/dev/test set
optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt

For example:

$ python3 ctb.py --ctb corpus/ctb5.1 --output ctb5.1-combined
Converting CTB: removing xml tags...
Importing to nltk...
Generating ctb5.1-combined/train.txt
773 files...
100.00%
16083 sentences.
Generating ctb5.1-combined/dev.txt
36 files...
100.00%
803 sentences.
Generating ctb5.1-combined/test.txt
81 files...
100.00%
1910 sentences.
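
The "removing xml tags" step and the encoding conversion can be pictured roughly as follows. This is a simplified sketch, not the code in `ctb.py`: the function name `strip_ctb_markup` is hypothetical, and it assumes that every markup line starts with `<` while tree lines never do. The result is an ordinary Python string, so writing it out as UTF-8 (an assumed target encoding) is up to the caller.

```python
def strip_ctb_markup(raw_bytes, encoding="gbk"):
    """Decode a CTB file and drop blank lines and lines that look like tags.

    Assumption: markup such as <DOC> or <S ID=1> always sits on its own
    line and starts with '<'; bracketed tree lines never start with '<'.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("<")
    ]
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    raw = (
        "<DOC>\n"
        "<S ID=1>\n"
        "( (IP (NP (NN 中国))\n"
        "      (VP (VV 发展))))\n"
        "</S>\n"
        "</DOC>\n"
    ).encode("gbk")
    print(strip_ctb_markup(raw), end="")
```

The surviving lines are still multi-line bracketed trees, so they still need to be parsed and flattened to one sentence per line afterwards.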

After that, you can get on with your real work.

[Figure: 神经网络依存句法分析51.png (neural-network dependency parsing)]

[Figure: screenshot, hankcs.com 2017-07-13 上午11.30.59.png]

Creative Commons Attribution-NonCommercial-ShareAlike: 码农场 » Python Preprocessing Scripts for the Penn Treebank and CTB

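The files are parsed successfully, but the sentence count is 0 for all three files.

RyanBin, 2 years ago (2022-03-03)

I found a copy of the ctb8.0 corpus online, but running ctb8.py fails with an encoding error. Looking at the code, it seems to use a ctb8 that ships with nltk, which doesn't seem to have anything to do with my corpus. Could you explain?

outsider, 5 years ago (2019-01-01)

nltk does not ship with ctb8.0; at least my nltk 3.5 doesn't. It only ships ptb (just a list of file names). Line 75 of ctb8.py downloads ptb only to create an nltk_data path. The script then strips the XML tags (such as document IDs) from the raw ctb8.0 data and places the result in nltk_data, and after that uses nltk to produce the data each task needs.
If you cannot download the data (no network access, or the download is blocked), you can create the nltk_data path yourself; on Windows you can create ’C:\\nltk_data’.

风维月魄, 4 years ago (2020-09-02)
If your nltk_data path already contains ctb8.0, I suggest deleting it and then rerunning the processing script.

风维月魄, 4 years ago (2020-09-02)

Whoa, while searching for material I suddenly noticed that the second-to-last figure was drawn with my software, DependencyViewer...

nixius, 5 years ago (2018-12-26)

Blogger, are CTB and PTB both paid? Is there any way for an individual developer to obtain these two datasets?

cer, 6 years ago (2017-11-27)

LDC offers commercial licenses, and individual users can apply to purchase one.

hankcs, 6 years ago (2017-12-09)
- Blogger, how much does it cost to buy?
  
  vipning, 5 years ago (2019-05-29)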
