Phrase-Structure Parsing

How to compile

Suppose that ZPar has been downloaded to the directory zpar. To make a phrase-structure parsing system for English, type make english.conparser. This will create a directory zpar/dist/english.conparser, in which there are two files: train and conparser. The file train is used to train a parsing model, and the file conparser is used to parse new texts using a trained parsing model. Similarly, we can make a phrase-structure parsing system for Chinese by typing make chinese.conparser. The train and conparser files are then created under the directory zpar/dist/chinese.conparser. Note that the English and Chinese parsers are designed specifically for the Penn Treebanks.

Format of inputs and outputs

The input file to the train executable contains a set of parse trees, one for each line. An example parse tree is as follows:

 ( S r ( NP r ( NNP t Ms. )  ( NNP t Haag )  )  ( S l* ( VP l ( VBZ t plays )  ( NP s ( NNP t Elianti )  )  )  ( . t . )  )  )
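
To make the layout of such a line concrete, here is a minimal Python sketch of a reader. It is an illustration only, not part of ZPar. Judging from the example alone, each node appears to be written as ( label marker children... ), with the marker t on nodes whose single child is a word. What the other markers (r, l, l*, s) encode is not spelled out here, so the sketch simply stores them. The function names read_tree and tagged_words are hypothetical.

```python
def read_tree(line):
    """Parse one tree line into nested (label, marker, children) tuples.

    Assumption (inferred from the single example in the text): tokens are
    separated by whitespace and every node is "( label marker child... )".
    """
    tokens = line.split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        label, marker = tokens[pos + 1], tokens[pos + 2]
        pos += 3
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word under a terminal node
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, marker, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Collect (word, tag) pairs from the nodes marked 't', left to right."""
    label, marker, children = tree
    if marker == "t":
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_words(child))
    return pairs


example = ("( S r ( NP r ( NNP t Ms. )  ( NNP t Haag )  )  "
           "( S l* ( VP l ( VBZ t plays )  ( NP s ( NNP t Elianti )  )  )  "
           "( . t . )  )  )")
tree = read_tree(example)
words = tagged_words(tree)
```

The sketch would misread a word that is itself a bare ( or ), so it assumes that brackets in the text have already been escaped, as they are in the Penn Treebank.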

This format differs from the original format used in the Penn Treebanks. A Python script, binarize.py, converts the original Penn Treebank format to the ZPar format. Its usage is

 python binarize.py <rule-file> <input-file>

Here rule-file is a file containing head-finding rules (see the example rules for Penn Chinese Treebank), and the conversion results will be printed to the console. Note that, for Chinese, the input-file given to binarize.py should be encoded in gb, and the output will be encoded in utf8. Here is a script that converts files encoded in gb to the utf8 encoding.
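
A conversion of this kind can also be sketched in a few lines of Python. This is not the script referred to above. It assumes that "gb" means one of the GB2312/GBK/GB18030 family and decodes with Python's gb18030 codec, which is designed to be backward compatible with the older two. The function name gb_to_utf8 and the file names in the usage note are hypothetical.

```python
def gb_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file from a GB-family encoding to UTF-8.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, gb_to_utf8("input.gb.txt", "input.utf8.txt") would write a utf8 copy of a gb-encoded file. If a file turns out to use a different encoding, decoding raises UnicodeDecodeError instead of silently producing garbage.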

The input file to conparser contains POS-tagged sentences. The formats for English and Chinese are different.

English:

 Ms./NNP Haag/NNP plays/VBZ Elianti/NNP

Chinese:

 ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN

For Chinese, inputs to both train and conparser must be encoded in utf8.
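
The two input formats differ only in the separator between a word and its tag: / for English and _ for Chinese. The following Python sketch writes and reads such lines. It is an illustration under the assumption that a tag never contains the separator, so that each token can be split at its last separator. The names SEPARATORS, format_tagged and read_tagged are hypothetical.

```python
# Word/tag separators as shown in the two examples in the text.
SEPARATORS = {"english": "/", "chinese": "_"}


def format_tagged(pairs, language="english"):
    """Join (word, tag) pairs into one input line for the given language."""
    sep = SEPARATORS[language]
    return " ".join(word + sep + tag for word, tag in pairs)


def read_tagged(line, language="english"):
    """Split one input line back into (word, tag) pairs.

    Each token is split at its LAST separator, so a word that itself
    contains the separator is kept whole.
    """
    sep = SEPARATORS[language]
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError("not a word%stag token: %r" % (sep, token))
        pairs.append((word, tag))
    return pairs


english_line = format_tagged(
    [("Ms.", "NNP"), ("Haag", "NNP"), ("plays", "VBZ"), ("Elianti", "NNP")])
chinese_pairs = read_tagged("ZPar_NR 可以_MD 分析_VV", language="chinese")
```

When writing Chinese input to a file, open it with encoding="utf-8" to meet the encoding requirement above.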

How to train a model

To train a model, use

 zpar/dist/english.conparser/train <train-file> <model-file> <number of iterations>

For example, using the example train file, you can train a model by

 zpar/dist/english.conparser/train train.txt model 1

After training is completed, a new file named model will be created in the current directory; it can be used to parse POS-tagged sentences. The above command trains for one iteration (see How to tune the performance of a system) on the training file. The commands for training Chinese parsing models are the same.

How to parse new texts

To apply an existing model to parse new texts, use

 zpar/dist/english.conparser/conparser <input-file> <output-file> <model>

For example, using the model we just trained, we can parse an example input by

 zpar/dist/english.conparser/conparser input.txt output.txt model

The output file contains automatically parsed trees. The commands for parsing Chinese texts are the same.
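
If you prefer to drive the two executables from a script, the training and parsing commands above can be wrapped as in the following Python sketch. The argument order is copied from the commands in this document; the helper names train_command, parse_command and run_example are hypothetical, and the sketch assumes that ZPar has been built under zpar/dist as described earlier.

```python
import subprocess


def train_command(parser_dir, train_file, model_file, iterations):
    """Command line for: train <train-file> <model-file> <number of iterations>."""
    return [parser_dir + "/train", train_file, model_file, str(iterations)]


def parse_command(parser_dir, input_file, output_file, model_file):
    """Command line for: conparser <input-file> <output-file> <model>."""
    return [parser_dir + "/conparser", input_file, output_file, model_file]


def run_example(parser_dir="zpar/dist/english.conparser"):
    """Train for one iteration, then parse input.txt with the new model."""
    # check=True raises CalledProcessError if either program exits with an error.
    subprocess.run(train_command(parser_dir, "train.txt", "model", 1), check=True)
    subprocess.run(parse_command(parser_dir, "input.txt", "output.txt", "model"),
                   check=True)
```

Calling run_example() reproduces the two example commands; passing zpar/dist/chinese.conparser as parser_dir does the same for Chinese.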

Outputs and evaluation

To evaluate the quality of the outputs, we can manually specify the gold parse trees of a sample and compare the outputs against them.

Manually specified parse trees of the input file are given in the example reference file, reference.txt. The evalb software performs the automatic evaluation.

Using the above output.txt and reference.txt, we can evaluate the accuracies by typing

 ./evalb -p <config.file> reference.txt output.txt

Here config.file sets running parameters of the evaluation. COLLINS.prm is a widely used configuration file. Evaluation results will be printed to the console.
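
To use the f-score in a script, the evalb report can be captured and searched, as in the following Python sketch. It relies on two assumptions about evalb that this document does not state: that the summary contains a line of the form "Bracketing FMeasure = <number>", and that the first such line is the score over all sentences. evalb's usage line puts the gold file before the test file; the F-measure is symmetric in precision and recall, so the extracted value is the same in either order. The function names are hypothetical.

```python
import re
import subprocess


def first_fmeasure(report):
    """Return the first 'Bracketing FMeasure' value in an evalb report, or None.

    Assumption: the report contains a line such as
    'Bracketing FMeasure       =  88.10'.
    """
    match = re.search(r"Bracketing FMeasure\s*=\s*([0-9]+(?:\.[0-9]+)?)", report)
    return float(match.group(1)) if match else None


def evaluate(evalb, config_file, gold_file, test_file):
    """Run evalb and return the f-score it reports (None if no score is found)."""
    result = subprocess.run([evalb, "-p", config_file, gold_file, test_file],
                            check=True, capture_output=True, text=True)
    return first_fmeasure(result.stdout)
```

For the files above, the call would be evaluate("./evalb", "COLLINS.prm", "reference.txt", "output.txt").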

How to tune the performance of a system

The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations and compare the performance after each one, then choose the model that gives the highest f-score on your test data. This test file is conventionally called the development test data, because it is used to develop the parsing model. Here is a shell script that automatically trains the parser for 30 iterations and, after the ith iteration, stores the model file as model.i. You can compare the f-scores of all 30 iterations and choose model.k, the one that gives the best f-score, as the final model. In this script, there is a variable called parser. You need to set this variable to the relative path of zpar/dist/english.conparser or zpar/dist/chinese.conparser.
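
The selection procedure itself can be sketched in Python as follows. This is not the shell script mentioned above. The two callbacks are hypothetical hooks that you would supply: train_to(i, model_file) must produce the model after i training iterations and store it under the given name, and score_model(model_file) must parse the development test data with that model and return its f-score. How train_to reaches iteration i is left open, because this document does not say how train treats an existing model file.

```python
def best_iteration(scores):
    """Return the iteration with the highest f-score; ties go to the earliest."""
    return max(sorted(scores), key=lambda i: scores[i])


def tune(train_to, score_model, iterations=30):
    """Train model.1 .. model.N, score each one, and pick the best.

    train_to(i, model_file)  -- hypothetical hook: store the model after
                                i iterations in model_file
    score_model(model_file)  -- hypothetical hook: f-score of that model
                                on the development test data
    """
    scores = {}
    for i in range(1, iterations + 1):
        model_file = "model.%d" % i
        train_to(i, model_file)
        scores[i] = score_model(model_file)
    k = best_iteration(scores)
    return "model.%d" % k, scores
```

The returned pair gives the name of the chosen model.k together with the f-score of every iteration, so the whole curve can be inspected before settling on a final model.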

Source code

The source code for the English phrase-structure parser can be found at

 zpar/src/common/conparser/implementation/ENGLISH_CONPARSER_IMPL

where ENGLISH_CONPARSER_IMPL is a macro defined in the Makefile that selects a particular implementation of the English phrase-structure parser.

The source code for the Chinese phrase-structure parser can be found at

 zpar/src/common/conparser/implementation/CHINESE_CONPARSER_IMPL

where CHINESE_CONPARSER_IMPL is a macro defined in the Makefile that selects a particular implementation of the Chinese phrase-structure parser.

Reference

Yue Zhang and Stephen Clark. 2009. Transition-Based Parsing of the Chinese Treebank using a Global Discriminative Model. In Proc. of IWPT, pages 162-171.
Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang and Jingbo Zhu. 2013. Fast and Accurate Shift-Reduce Constituent Parsing. In Proc. of ACL, pages 434-443.