Dependency Parsing

How to compile

Suppose that ZPar has been downloaded to the directory zpar. To build a dependency parsing system for English, type make english.depparser. This creates the directory zpar/dist/english.depparser, which contains two files: train and depparser. The train executable trains a parsing model, and the depparser executable parses new texts using a trained parsing model. Similarly, typing make chinese.depparser builds a dependency parsing system for Chinese, with the train and depparser files created under zpar/dist/chinese.depparser.

Format of inputs and outputs

The input file to the train executable contains a set of parse trees. An example parse tree is as follows:


Ms.        NNP    1
Haag       NNP    2
plays      VBZ    -1
Elianti    NNP    2
.          .      2

Here the first column contains the words of the sentence, the second column contains the POS tags of the words, and the third column contains the indices of the heads of the words. Indices start from 0. For example, the head index of the word Ms. is 1, which means its head is Haag. The head index of the root word of the sentence is -1. Note that on each line, tab characters separate the word, the POS tag, and the head index.
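As an illustration of the format described above, the following Python sketch reads tab-separated lines into sentences and checks the head indices. The function names read_trees and check_tree are hypothetical and not part of ZPar. The sketch assumes that sentences are separated by blank lines and that each sentence has exactly one root; neither is stated on this page.

```python
from typing import List, Tuple

Token = Tuple[str, str, int]  # (word, POS tag, head index)


def read_trees(text: str) -> List[List[Token]]:
    """Parse tab-separated (word, tag, head) lines into sentences.

    Assumption: a blank line ends a sentence.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag, head = line.split("\t")
        current.append((word, tag, int(head)))
    if current:
        sentences.append(current)
    return sentences


def check_tree(sentence: List[Token]) -> None:
    """Raise ValueError unless heads are in range and one root exists.

    Assumption: exactly one word per sentence has head index -1.
    """
    n = len(sentence)
    roots = [i for i, (_, _, head) in enumerate(sentence) if head == -1]
    if len(roots) != 1:
        raise ValueError(f"expected one root, found {len(roots)}")
    for i, (word, _, head) in enumerate(sentence):
        if head != -1 and not 0 <= head < n:
            raise ValueError(f"head {head} of {word!r} is out of range")
        if head == i:
            raise ValueError(f"{word!r} is its own head")


example = (
    "Ms.\tNNP\t1\n"
    "Haag\tNNP\t2\n"
    "plays\tVBZ\t-1\n"
    "Elianti\tNNP\t2\n"
    ".\t.\t2\n"
)
trees = read_trees(example)
check_tree(trees[0])
word, tag, head = trees[0][0]
print(word, "->", trees[0][head][0])  # prints: Ms. -> Haag
```

Because indices start from 0, the head index can be used directly as a list index into the sentence.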

The input file to the depparser executable contains POS tagged sentences. The formats for English and Chinese are different.

English:

 Ms./NNP Haag/NNP plays/VBZ Elianti/NNP

Chinese:

 ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN

For Chinese, inputs to both train and depparser must be encoded in UTF-8.
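The two input formats differ only in the character that joins a word to its tag. The Python sketch below builds and splits such lines. The helper names format_tagged and parse_tagged are hypothetical, and splitting at the last separator is an assumption made so that words which themselves contain / or _ stay intact.

```python
from typing import Iterable, List, Tuple

# Separator between word and POS tag, taken from the examples above.
SEPARATORS = {"english": "/", "chinese": "_"}


def format_tagged(tokens: Iterable[Tuple[str, str]], language: str) -> str:
    """Join (word, tag) pairs into one space-separated input line."""
    sep = SEPARATORS[language]
    return " ".join(f"{word}{sep}{tag}" for word, tag in tokens)


def parse_tagged(line: str, language: str) -> List[Tuple[str, str]]:
    """Split an input line back into (word, tag) pairs.

    rpartition splits at the last separator in each item.
    """
    sep = SEPARATORS[language]
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs


english = format_tagged(
    [("Ms.", "NNP"), ("Haag", "NNP"), ("plays", "VBZ"), ("Elianti", "NNP")],
    "english",
)
print(english)  # prints: Ms./NNP Haag/NNP plays/VBZ Elianti/NNP

chinese = "ZPar_NR 可以_MD 分析_VV"
print(parse_tagged(chinese, "chinese")[0])  # prints: ('ZPar', 'NR')
```

When writing Chinese input from Python, opening the file with open(path, "w", encoding="utf-8") produces the required encoding.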

How to train a model

To train a model, use

 zpar/dist/english.depparser/train <train-file> <model-file> <number of iterations>

For example, using the English example train file, you can train a model by

 zpar/dist/english.depparser/train train.txt model 1

After training is completed, a new file model will be created in the current directory, which can be used to parse POS-tagged sentences. The above command performs training with one iteration (see How to tune the performance of a system) using the training file.

The commands for training Chinese parsing models are the same. For example, using the Chinese example train file, you can train a model by

 zpar/dist/chinese.depparser/train train.txt model 1

How to parse new texts

To apply an existing model to parse new texts, use

 zpar/dist/english.depparser/depparser <model> <input-file> <output-file>

For example, using the model we just trained, we can parse an example input by

 zpar/dist/english.depparser/depparser model input.txt output.txt

The output file contains the automatically parsed trees. The commands for parsing Chinese texts are the same. See an example of a Chinese input file.

Outputs and evaluation

In order to evaluate the quality of the outputs, we can manually specify the gold parse trees of a sample, and compare the outputs with the correct sample.

Manually specified parse trees of the input file are given in this example reference file (find a Chinese reference file here). Here is a Python script that performs automatic evaluation.

Using the above output.txt and reference.txt, we can evaluate the accuracies by typing

 python evaluate.py output.txt reference.txt

The output reports the precision, recall, and f-score. See the explanation of these measures on Wikipedia.
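To make the comparison concrete, here is a minimal Python sketch that scores predicted head indices against gold head indices. It is not the evaluate.py script mentioned above: it computes a simpler measure, the fraction of words whose predicted head matches the gold head, rather than precision, recall, and f-score. The function name attachment_accuracy is hypothetical, and the head lists are assumed to have already been read from the output and reference files.

```python
from typing import Sequence


def attachment_accuracy(
    predicted: Sequence[Sequence[int]],
    gold: Sequence[Sequence[int]],
) -> float:
    """Return the fraction of words whose predicted head equals the gold head.

    Each argument holds one sequence of head indices per sentence.
    """
    if len(predicted) != len(gold):
        raise ValueError("different number of sentences")
    correct = 0
    total = 0
    for pred_heads, gold_heads in zip(predicted, gold):
        if len(pred_heads) != len(gold_heads):
            raise ValueError("sentence lengths differ")
        correct += sum(1 for p, g in zip(pred_heads, gold_heads) if p == g)
        total += len(gold_heads)
    return correct / total if total else 0.0


# Gold heads for "Ms. Haag plays Elianti ." and a parse with one wrong head.
gold = [[1, 2, -1, 2, 2]]
predicted = [[1, 2, -1, 2, 0]]
print(attachment_accuracy(predicted, gold))  # prints: 0.8
```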

How to tune the performance of a system

The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations, comparing the performance after each one, and choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because you use it to develop a parsing model.

Here is a shell script that automatically trains the parser for 30 iterations and, after the ith iteration, stores the model file as model.i. You can compare the f-scores of all 30 iterations and choose model.k, the model with the best f-score, as the final model. The script contains a variable called zpar, which you need to set to the relative path of zpar/dist/english.depparser or zpar/dist/chinese.depparser.
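The tuning procedure can also be sketched in Python. This is not the shell script mentioned above. It assumes that running train for one iteration on an existing model file continues training that model, which this page does not confirm. The function names are hypothetical, and the f-scores in the example are placeholder values, not measured results.

```python
import shutil
import subprocess
from typing import Dict, List


def training_commands(
    zpar: str, train_file: str, model: str, iterations: int
) -> List[List[str]]:
    """Build one single-iteration train command per iteration."""
    return [[f"{zpar}/train", train_file, model, "1"] for _ in range(iterations)]


def train_and_snapshot(
    zpar: str, train_file: str, model: str, iterations: int
) -> None:
    """Run training and copy the model to model.i after the ith iteration.

    Assumption: each run resumes from the existing model file.
    """
    commands = training_commands(zpar, train_file, model, iterations)
    for i, command in enumerate(commands, start=1):
        subprocess.run(command, check=True)
        shutil.copyfile(model, f"{model}.{i}")


def best_iteration(f_scores: Dict[int, float]) -> int:
    """Return the iteration with the highest f-score (first one on ties)."""
    return max(f_scores, key=f_scores.get)


# Placeholder development-set f-scores for three iterations.
scores = {1: 0.85, 2: 0.88, 3: 0.87}
k = best_iteration(scores)
print(f"model.{k}")  # prints: model.2
```

Calling train_and_snapshot("zpar/dist/english.depparser", "train.txt", "model", 30) would mirror the 30-iteration setup described above, provided the resume assumption holds.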

Source code

The source code for the English dependency parser can be found at

 zpar/src/common/depparser/implementation/ENGLISH_DEPPARSER_IMPL

where ENGLISH_DEPPARSER_IMPL is a macro defined in the Makefile that selects a particular implementation of the English dependency parser.

The source code for the Chinese dependency parser can be found at

 zpar/src/common/depparser/implementation/CHINESE_DEPPARSER_IMPL

where CHINESE_DEPPARSER_IMPL is a macro defined in the Makefile that selects a particular implementation of the Chinese dependency parser.

Reference

Yue Zhang and Stephen Clark. 2008. A Tale of Two Parsers: Investigating and Combining Graph-based and Transition-based Dependency Parsing Using Beam-search. In Proc. of EMNLP, pages 562-571.
Yue Zhang and Joakim Nivre. 2011. Transition-based Dependency Parsing with Rich Non-local Features. In Proc. of ACL, pages 188-193.