Following the rollout of support for 130 languages, including Simplified and Traditional Chinese, English, Japanese, Russian, French, and German, HanLP today officially releases its open-source Classical Chinese model, supporting automatic word segmentation, lemmatization, part-of-speech tagging, and dependency parsing for Classical Chinese (Literary Chinese). Thanks to multi-task learning, a single model supports all of these tasks, as well as coarse/fine segmentation standards and the UPOS/XPOS/PKU POS tagsets.
Usage
Calling it takes only three lines of code:
# pip install hanlp -U
import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.KYOTO_EVAHAN_TOK_LEM_POS_UDEP_LZH)
doc = HanLP(['晋太元中,武陵人捕鱼为业。', '司馬牛問君子'])
print(doc)
Output:
{ "tok/fine": [ ["晋", "太元", "中", ",", "武陵", "人", "捕", "鱼", "为", "业", "。"], ["司馬", "牛", "問", "君子"] ], "tok/coarse": [ ["晋", "太元", "中", ",", "武陵", "人", "捕", "鱼", "为", "业", "。"], ["司馬牛", "問", "君子"] ], "lem": [ ["晉", "太元", "中", ",", "武陵", "人", "捕", "魚", "爲", "業", "。"], ["司馬", "牛", "問", "君子"] ], "pos/upos": [ ["PROPN", "NOUN", "NOUN", "PUNCT", "PROPN", "NOUN", "VERB", "NOUN", "AUX", "NOUN", "PUNCT"], ["PROPN", "PROPN", "VERB", "NOUN"] ], "pos/xpos": [ ["n,名詞,主体,国名", "n,名詞,時,*", "n,名詞,固定物,関係", "p,補助記号", "n,名詞,固定物,地名", "n,名詞,人,人", "v,動詞,行為,動作", "n,名詞,主体,動物", "v,動詞,存在,存在", "n,名詞,可搬,成果物", "p,補助記号"], ["n,名詞,人,姓氏", "n,名詞,人,名", "v,動詞,行為,伝達", "n,名詞,人,役割"] ], "pos/pku": [ ["ns", "t", "f", "w", "ns", "n", "v", "n", "v", "n", "w"], ["nr", "nr", "v", "n"] ], "dep": [ [[3, "nmod"], [3, "nmod"], [7, "obl"], [6, "fixed"], [6, "nmod"], [7, "nsubj"], [0, "root"], [7, "obj"], [10, "cop"], [7, "parataxis"], [10, "conj"]], [[3, "nsubj"], [1, "flat"], [0, "root"], [3, "obj"]] ] }
Visualization:
doc.pretty_print()
Result:
Dep Tree To Relation  Le PoS
──────── ── ───────── ── ─────
    ┌──► 晋 nmod      晉 PROPN
    │┌─► 太元 nmod    太元 NOUN
┌──►└┴── 中 obl       中 NOUN
│   ┌──► , fixed     , PUNCT
│   │┌─► 武陵 nmod    武陵 PROPN
│┌─►└┴── 人 nsubj     人 NOUN
└┴┬──┬── 捕 root      捕 VERB
  │  └─► 鱼 obj       魚 NOUN
  │  ┌─► 为 cop       爲 AUX
  └─►├── 业 parataxis 業 NOUN
     └─► 。 conj      。 PUNCT

Dep Tr To Relat Le PoS
────── ── ───── ── ─────
┌─►┌── 司馬 nsubj 司馬 PROPN
│  └─► 牛 flat   牛 PROPN
└──┬── 問 root   問 VERB
   └─► 君子 obj   君子 NOUN
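Each entry of dep is a (head, relation) pair, where head is the 1-based index of the head token and 0 marks the root. A short sketch that reconstructs the arcs printed above from the raw fields:

# Pair every token with its syntactic head; indices in dep are 1-based, 0 = root.
for tokens, arcs in zip(doc['tok/fine'], doc['dep']):
    for i, (head, rel) in enumerate(arcs):
        head_token = 'ROOT' if head == 0 else tokens[head - 1]
        print(f'{tokens[i]} --{rel}--> {head_token}')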
Note that the default fine-grained segmentation standard (tok/fine) follows UD_Classical_Chinese-Kyoto, which splits a full name into surname + given name. If you need the coarse-grained standard instead, skip the fine-grained task:
HanLP('司馬牛問君子', skip_tasks='tok/fine').pretty_print()
This yields:
Dep Tok   Relat Lem   PoS
─── ───── ───── ───── ─────
┌─► 司馬牛 nsubj 司馬牛 PROPN
├── 問    root  問    VERB
└─► 君子   obj   君子   NOUN
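Conversely, if only a subset of annotations is needed, the tasks argument (the counterpart of skip_tasks) restricts the call to the named tasks, which also speeds up inference. A sketch, assuming the multi-task model loaded earlier:

# Run only PKU POS tagging; the tokenizer it depends on still runs as a prerequisite.
doc = HanLP('司馬牛問君子', tasks='pos/pku')
print(doc['pos/pku'])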
Also note that because Classical Chinese texts contain no punctuation, UD_Classical_Chinese-Kyoto has no punctuation annotations at all, so UD annotations involving punctuation marks may come out undefined.
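If undefined punctuation annotations are a concern, one workaround is to strip modern punctuation before parsing. A minimal sketch, assuming a hand-written character class of common CJK punctuation (illustrative, not exhaustive):

import re

# Hypothetical pre-processing: drop modern punctuation, which carries no
# annotation in UD_Classical_Chinese-Kyoto anyway.
PUNCT = re.compile(r'[,。、;:?!「」『』()]')
print(HanLP(PUNCT.sub('', '晋太元中,武陵人捕鱼为业。')))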
Classical Chinese Word Segmentation
If you only need Classical Chinese word segmentation, the dedicated tokenization model below performs slightly better (EvaHan TestB F1: 93.98% vs. 93.60%):
import hanlp

HanLP = hanlp.load(hanlp.pretrained.tok.KYOTO_EVAHAN_TOK_LZH)
doc = HanLP('司馬牛問君子')
print(doc)
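The tokenizer also accepts a list of sentences and processes them as one batch, which is generally faster than calling it in a loop:

# Passing a list batches all sentences through the encoder in a single call.
print(HanLP(['司馬牛問君子', '晋太元中,武陵人捕鱼为业。']))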
Performance and Approach
For the concrete performance of these tasks and how they work, please refer to the documentation.
Citation
If you use HanLP in your research, please cite our EMNLP paper as follows:
@inproceedings{he-choi-2021-stem,
    title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
    author = "He, Han and Choi, Jinho D.",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.451",
    pages = "5555--5577",
    abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}
References
- Coarse-grained segmentation and PKU POS tags use the EvaHan corpus.
- Fine-grained segmentation and UD annotations use UD_Classical_Chinese-Kyoto.
- The encoder is bert-ancient-chinese.