Phrase-Structure Parsing
How to compile
Suppose that ZPar has been downloaded to the directory zpar. To build a phrase-structure parsing system for English, type make english.conparser. This creates a directory zpar/dist/english.conparser containing two files: train and conparser. The file train is used to train a parsing model, and the file conparser is used to parse new texts with a trained parsing model. Similarly, a phrase-structure parsing system for Chinese can be built by typing make chinese.conparser, which creates the train and conparser files under the directory zpar/dist/chinese.conparser. Note that the English and Chinese parsers are designed specifically for the Penn Treebanks.
Format of inputs and outputs
The input file to the train executable contains a set of parse trees, one per line. An example parse tree is as follows:
( S r ( NP r ( NNP t Ms. ) ( NNP t Haag ) ) ( S l* ( VP l ( VBZ t plays ) ( NP s ( NNP t Elianti ) ) ) ( . t . ) ) )
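To make the layout of such a line concrete, the following is a minimal Python sketch that reads one tree in this format and recovers its words and POS tags. It assumes, based only on the example above, that every token is separated by whitespace, that each node has the form "( label marker ... )", and that the marker t identifies a terminal whose single remaining field is the word. The meaning of the other markers (r, l, l*, s) is not described here, so the sketch simply stores them. The function names parse_zpar_tree and tagged_words are hypothetical and are not part of ZPar.

```python
def parse_zpar_tree(line):
    """Parse one whitespace-tokenised tree into (label, marker, children).

    For a terminal (marker "t", as in the example), children is the word.
    """
    tokens = line.split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        label, marker = tokens[pos + 1], tokens[pos + 2]
        pos += 3
        if marker == "t":
            # Terminal node: the next token is the word itself.
            children = tokens[pos]
            pos += 1
        else:
            # Non-terminal node: read child nodes until the closing bracket.
            children = []
            while tokens[pos] != ")":
                children.append(node())
        if tokens[pos] != ")":
            raise ValueError(f"expected ')' at token {pos}")
        pos += 1
        return (label, marker, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Return the (word, tag) pairs of a parsed tree, left to right."""
    label, marker, children = tree
    if marker == "t":
        return [(children, label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_words(child))
    return pairs
```

Applied to the example line above, tagged_words(parse_zpar_tree(line)) yields the words paired with their tags in sentence order, which is a quick way to check that a converted training file is well formed.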
The format is different from the original format used in Penn Treebanks. Here is a Python script to convert the original Penn Treebank format to the ZPar format. The usage is
python binarize.py <rule-file> <input-file>
Here rule-file is a file containing head-finding rules (see the example rules for Penn Chinese Treebank), and the conversion results are printed to the console. Note that, for Chinese, the input-file given to binarize.py should be encoded in gb, and the output will be encoded in utf8. Here is a script that converts files encoded in gb to the utf8 encoding.
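For readers who prefer to do the re-encoding themselves, the conversion can be sketched in a few lines of Python. This is not the script referred to above. It assumes that "gb" means a GB-family encoding and uses Python's gb18030 codec, which also decodes GB2312 and GBK text; the function name gb_to_utf8 and the file names are hypothetical.

```python
def gb_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Copy a text file, decoding from a GB encoding and writing UTF-8.

    src_encoding is an assumption: gb18030 is a superset of GB2312 and GBK.
    newline="" keeps the original line endings unchanged.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

A call such as gb_to_utf8("trees.gb.txt", "trees.utf8.txt") would then produce a file suitable for tools that expect utf8. Decoding fails with a UnicodeDecodeError if the source file is not actually in a GB encoding, which is a useful early warning.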
The input file to conparser contains POS-tagged sentences. The formats for English and Chinese are different.
English:
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP
Chinese:
ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN
For Chinese, inputs to both train and conparser must be encoded in utf8.
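The difference between the two input formats comes down to the character that joins a word to its tag: "/" in the English example and "_" in the Chinese one. The Python sketch below writes and reads such lines. The names SEPARATORS, format_tagged and parse_tagged are hypothetical, and splitting on the last separator in a token (so that a word may itself contain the separator) is an assumption rather than documented ZPar behaviour.

```python
# Word/tag separators, as seen in the examples above.
SEPARATORS = {"english": "/", "chinese": "_"}


def format_tagged(pairs, language):
    """Join (word, tag) pairs into one conparser input line."""
    sep = SEPARATORS[language]
    return " ".join(f"{word}{sep}{tag}" for word, tag in pairs)


def parse_tagged(line, language):
    """Split one input line back into (word, tag) pairs."""
    sep = SEPARATORS[language]
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST separator in the token.
        word, found, tag = token.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError(f"malformed token: {token!r}")
        pairs.append((word, tag))
    return pairs
```

Round-tripping a line through parse_tagged and format_tagged is a cheap sanity check before handing a large file to conparser. For Chinese, the resulting file should still be written with encoding="utf-8".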
How to train a model
To train a model, use
zpar/dist/english.conparser/train <train-file> <model-file> <number of iterations>
For example, using the example train file, you can train a model by
zpar/dist/english.conparser/train train.txt model 1
After training is completed, a new file named model is created in the current directory; it can be used to parse POS-tagged sentences. The above command performs training with one iteration (see How to tune the performance of a system) using the training file. The commands for training Chinese parsing models are the same.
How to parse new texts
To apply an existing model to parse new texts, use
zpar/dist/english.conparser/conparser <input-file> <output-file> <model>
For example, using the model we just trained, we can parse an example input by
zpar/dist/english.conparser/conparser input.txt output.txt model
The output file contains automatically parsed trees. The commands for parsing Chinese texts are the same.
Outputs and evaluation
To evaluate the quality of the outputs, we can manually specify the gold-standard parse trees for a sample and compare the outputs against them.
Manually specified parse trees for the input file are given in this example reference file. Refer to evalb to obtain software that performs automatic evaluation.
Using the above output.txt and reference.txt, we can evaluate the accuracy by typing
./evalb -p <config.file> output.txt reference.txt
Here config.file sets the running parameters of the evaluation. COLLINS.prm is a widely used configuration file.
Evaluation results will be printed to the console.
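The central figures in such an evaluation are bracketing precision, recall and f-score. As a rough illustration of how they relate, the Python sketch below computes them from two lists of labeled brackets, each written as a (label, start, end) tuple over word positions. It is a simplified sketch, not a reimplementation of evalb: it ignores whatever additional conventions the parameter file sets, and the function name bracket_scores is hypothetical.

```python
from collections import Counter


def bracket_scores(gold, predicted):
    """Return (precision, recall, f_score) for two lists of brackets.

    Each bracket is a (label, start, end) tuple. Brackets are matched as
    multisets, so a repeated bracket only counts as often as it occurs
    in both lists.
    """
    matched = sum((Counter(gold) & Counter(predicted)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Precision is the fraction of predicted brackets that appear in the gold tree, recall is the fraction of gold brackets that were predicted, and the f-score is their harmonic mean.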
How to tune the performance of a system
The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations, comparing the performance after each one, and choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because it is used to develop the parsing model. Here is a shell script that automatically trains the parser for 30 iterations and, after the ith iteration, stores the model file as model.i. You can compare the f-scores of all 30 iterations and choose model.k, the one that gives the best f-score, as the final model. In this script, there is a variable called parser. You need to set this variable to the relative path of zpar/dist/english.conparser or zpar/dist/chinese.conparser.
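The tuning procedure can also be sketched in Python. This is not the shell script mentioned above. It relies on one assumption that this page does not state explicitly: that running train with an iteration count of 1 on an existing model file continues training that model, so that a snapshot can be copied after each pass. The function names train_incrementally and best_model are hypothetical, and the f-scores passed to best_model are expected to come from your own evaluation of each snapshot on the development test data.

```python
import shutil
import subprocess


def train_incrementally(parser_dir, train_file, model_file, iterations=30,
                        run=subprocess.run, copy=shutil.copyfile):
    """Train one iteration at a time and keep a snapshot after each pass.

    ASSUMPTION: invoking train again on an existing model file continues
    training from it. parser_dir plays the role of the parser variable,
    e.g. zpar/dist/english.conparser.
    """
    snapshots = []
    for i in range(1, iterations + 1):
        # Same argument order as documented: train-file, model-file, iterations.
        run([f"{parser_dir}/train", train_file, model_file, "1"], check=True)
        snapshot = f"{model_file}.{i}"
        copy(model_file, snapshot)
        snapshots.append(snapshot)
    return snapshots


def best_model(f_scores):
    """Return the snapshot name with the highest f-score.

    f_scores maps a snapshot name (e.g. "model.7") to its f-score on the
    development test data.
    """
    if not f_scores:
        raise ValueError("no f-scores given")
    return max(f_scores, key=f_scores.get)
```

The run and copy parameters exist only so that the loop can be tried out without ZPar installed, by passing in stand-in functions. In normal use the defaults call the real train executable and copy the real model file.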
Source code
The source code for the English phrase-structure parser can be found at
zpar/src/common/conparser/implementation/ENGLISH_CONPARSER_IMPL
where ENGLISH_CONPARSER_IMPL is a macro defined in the Makefile that selects a specific implementation of the English phrase-structure parser.
The source code for the Chinese phrase-structure parser can be found at
zpar/src/common/conparser/implementation/CHINESE_CONPARSER_IMPL
where CHINESE_CONPARSER_IMPL is a macro defined in the Makefile that selects a specific implementation of the Chinese phrase-structure parser.
Reference
- Yue Zhang and Stephen Clark. 2009. Transition-Based Parsing of the Chinese Treebank using a Global Discriminative Model. In Proc. of IWPT, pages 162-171.
- Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang and Jingbo Zhu. 2013. Fast and Accurate Shift-Reduce Constituent Parsing. In Proc. of ACL, pages 434-443.