Dependency Parsing
How to compile
Suppose that ZPar has been downloaded to the directory zpar. To make a dependency parsing system for English, type make english.depparser. This will create a directory zpar/dist/english.depparser, in which there are two files: train and depparser. The file train is used to train a parsing model, and the file depparser is used to parse new texts using a trained parsing model. Similarly, we can make a dependency parsing system for Chinese by typing make chinese.depparser. The train and depparser files are then created under the directory zpar/dist/chinese.depparser.
Format of inputs and outputs
The input file to the train executable contains a set of parse trees. An example parse tree is as follows:
Ms. NNP 1
Haag NNP 2
plays VBZ -1
Elianti NNP 2
. . 2
Here the first column contains the words of the sentence, the second column contains the POS tags of the words, and the third column contains the indices of the heads of the words. Indices start from 0. For example, the head index of the word Ms. is 1, which means its head is Haag. The head index of the root word of the sentence is -1. Note that on each line, tab characters separate the word, the POS tag, and the head index.
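The column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of ZPar: the names read_trees and check_tree are hypothetical, and it assumes that consecutive sentences are separated by a blank line and that each sentence has exactly one root, neither of which is stated in the text.

```python
from typing import List, Tuple

Token = Tuple[str, str, int]  # (word, POS tag, head index)


def read_trees(text: str) -> List[List[Token]]:
    """Split tab-separated training data into sentences of tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag, head = line.split("\t")
        current.append((word, tag, int(head)))
    if current:
        sentences.append(current)
    return sentences


def check_tree(sentence: List[Token]) -> None:
    """Raise ValueError if head indices are inconsistent."""
    roots = [i for i, (_, _, head) in enumerate(sentence) if head == -1]
    # Assumption: exactly one word per sentence has head index -1.
    if len(roots) != 1:
        raise ValueError(f"expected one root, found {len(roots)}")
    for i, (word, _, head) in enumerate(sentence):
        # Head indices are 0-based positions within the same sentence.
        if head != -1 and not 0 <= head < len(sentence):
            raise ValueError(f"head {head} of {word!r} is out of range")
        if head == i:
            raise ValueError(f"{word!r} is its own head")


example = "Ms.\tNNP\t1\nHaag\tNNP\t2\nplays\tVBZ\t-1\nElianti\tNNP\t2\n.\t.\t2\n"
trees = read_trees(example)
check_tree(trees[0])
```

Running a check like this over a training file before calling train may catch lines that use spaces instead of tabs, since line.split("\t") then fails to produce three fields.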
The input file to the depparser executable contains POS-tagged sentences. The formats for English and Chinese are different.
English:
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP
Chinese:
ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN
For Chinese, inputs to both train and depparser must be encoded in UTF-8.
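The two tagged-sentence formats differ only in the character that joins a word to its tag: a slash in the English example and an underscore in the Chinese one. The sketch below converts between such lines and (word, tag) pairs. The function names are hypothetical, and splitting on the last separator is an assumption that tags never contain the separator themselves.

```python
from typing import List, Tuple

# Separator between word and POS tag, as seen in the two examples.
SEPARATORS = {"english": "/", "chinese": "_"}


def parse_tagged(line: str, language: str) -> List[Tuple[str, str]]:
    """Turn one POS-tagged sentence into a list of (word, tag) pairs."""
    sep = SEPARATORS[language]
    pairs = []
    for token in line.split():
        # Split on the last separator so a word such as "1/2" stays whole.
        word, tag = token.rsplit(sep, 1)
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs: List[Tuple[str, str]], language: str) -> str:
    """Join (word, tag) pairs back into one input line for depparser."""
    sep = SEPARATORS[language]
    return " ".join(f"{word}{sep}{tag}" for word, tag in pairs)
```

When writing Chinese input produced this way, opening the file with open(path, "w", encoding="utf-8") keeps it in the required encoding.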
How to train a model
To train a model, use
zpar/dist/english.depparser/train <train-file> <model-file> <number of iterations>
For example, using the English example train file, you can train a model by
zpar/dist/english.depparser/train train.txt model 1
After training is completed, a new file model will be created in the current directory, which can be used to parse POS-tagged sentences. The above command performs training with one iteration (see How to tune the performance of a system) using the training file.
The commands for training Chinese parsing models are the same. For example, using the Chinese example train file, you can train a model by
zpar/dist/chinese.depparser/train train.txt model 1
How to parse new texts
To apply an existing model to parse new texts, use
zpar/dist/english.depparser/depparser <model> <input-file> <output-file>
For example, using the model we just trained, we can parse an example input by
zpar/dist/english.depparser/depparser model input.txt output.txt
The output file contains automatically parsed trees. The commands for parsing Chinese texts are the same, using zpar/dist/chinese.depparser/depparser on a Chinese input file.
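If the two executables are driven from a script, the command lines shown above can be assembled programmatically. The following is a minimal sketch: the helper names are hypothetical, and the only arguments used are the positional ones documented in this section.

```python
import subprocess
from typing import List


def train_command(system_dir: str, train_file: str, model_file: str,
                  iterations: int) -> List[str]:
    """Build: <system_dir>/train <train-file> <model-file> <iterations>."""
    return [f"{system_dir}/train", train_file, model_file, str(iterations)]


def parse_command(system_dir: str, model_file: str, input_file: str,
                  output_file: str) -> List[str]:
    """Build: <system_dir>/depparser <model> <input-file> <output-file>."""
    return [f"{system_dir}/depparser", model_file, input_file, output_file]


def run(command: List[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

For example, run(train_command("zpar/dist/english.depparser", "train.txt", "model", 1)) followed by run(parse_command("zpar/dist/english.depparser", "model", "input.txt", "output.txt")) would reproduce the two English commands above, provided the executables have been built.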
Outputs and evaluation
In order to evaluate the quality of the outputs, we can manually specify the gold parse trees of a sample, and compare the outputs with the correct sample.
Manually specified parse trees of the input file are given in this example reference file (find a Chinese reference file here). Here is a Python script that performs automatic evaluation.
Using the above output.txt and reference.txt, we can evaluate the accuracies by typing
python evaluate.py output.txt reference.txt
The script reports the precision, recall, and f-score. See the explanation of these measures on Wikipedia.
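To make the three measures concrete, the sketch below computes them over dependency arcs. It is not the evaluate.py script mentioned above, whose exact counting rules are not shown here. It assumes each sentence is given as a list of head indices and treats an arc as correct when the same word in the same sentence has the same head in both files.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, int, int]  # (sentence number, word index, head index)


def arcs(sentences: List[List[int]]) -> Set[Arc]:
    """Collect one arc per word from per-sentence lists of head indices."""
    return {(s, i, head)
            for s, heads in enumerate(sentences)
            for i, head in enumerate(heads)}


def precision_recall_f(output: List[List[int]],
                       reference: List[List[int]]) -> Tuple[float, float, float]:
    """Compare predicted arcs against gold arcs."""
    predicted, gold = arcs(output), arcs(reference)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

Under these assumptions, when the output and the reference contain the same words, both sides have one arc per word, so precision and recall coincide.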
How to tune the performance of a system
The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations and compare the performance after each one, then choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because you use it to develop a parsing model. Here is a shell script that automatically trains the parser for 30 iterations and, after the ith iteration, stores the model file to model.i. You can compare the f-scores of all 30 iterations and choose model.k, which gives the best f-score, as the final model. In this file, there is a variable called zpar. You need to set this variable to the relative path of zpar/dist/english.depparser or zpar/dist/chinese.depparser.
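The selection step can be sketched in Python as follows. This is an illustration of the procedure, not the shell script referred to above. The callback train_and_score is hypothetical: it stands for whatever trains the ith iteration, stores model.i, parses the development test data, and returns the resulting f-score. Preferring the earliest iteration on a tie is also an assumption.

```python
from typing import Callable, Dict, Tuple


def select_best(scores: Dict[int, float]) -> Tuple[int, float]:
    """Return (iteration, f-score) with the highest f-score."""
    # On ties, prefer the earliest iteration (an arbitrary choice).
    best = max(scores, key=lambda i: (scores[i], -i))
    return best, scores[best]


def tune(train_and_score: Callable[[int], float],
         iterations: int = 30) -> Tuple[int, float]:
    """Score every iteration in order and pick the best one."""
    scores: Dict[int, float] = {}
    for i in range(1, iterations + 1):
        # Hypothetical callback: trains iteration i and evaluates model.i.
        scores[i] = train_and_score(i)
    return select_best(scores)
```

If tune returns iteration k, then model.k is the file to keep as the final model.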
Source code
The source code for the English dependency parser can be found at
zpar/src/common/depparser/implementation/ENGLISH_DEPPARSER_IMPL
where ENGLISH_DEPPARSER_IMPL is a macro defined in Makefile that selects a specific implementation for the English dependency parser.
The source code for the Chinese dependency parser can be found at
zpar/src/common/depparser/implementation/CHINESE_DEPPARSER_IMPL
where CHINESE_DEPPARSER_IMPL is a macro defined in Makefile that selects a specific implementation for the Chinese dependency parser.
Reference
- Yue Zhang and Stephen Clark. 2008. A Tale of Two Parsers: Investigating and Combining Graph-based and Transition-based Dependency Parsing Using Beam-search. In Proc. of EMNLP, pages 562-571.
- Yue Zhang and Joakim Nivre. 2011. Transition-based Dependency Parsing with Rich Non-local Features. In Proc. of ACL, pages 188-193.