English POS Tagging
How to compile
Suppose that ZPar has been downloaded to the directory zpar. To build the POS tagging system for English, type make english.postagger. This will create a directory zpar/dist/english.postagger, which contains two executables: train and tagger. The train executable is used to train a tagging model, and the tagger executable is used to tag new texts with a trained tagging model.
Format of inputs and outputs
The input files to the tagger executable are formatted as a sequence of tokenized English sentences. An example input is:
Ms. Haag plays Elianti .
The output files contain space-separated words, each followed by a slash and its POS tag:
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.
The output format is also the format of training files for the train executable.
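To make the word/tag format concrete, here is a minimal Python sketch that reads a tagged line and recovers the tokenized input from it. The function names are hypothetical and not part of ZPar, and splitting each token on its last slash is an assumption made here so that words containing a slash stay intact; the source does not say how ZPar itself treats such words.

```python
def parse_tagged_sentence(line):
    """Split a line such as 'Ms./NNP Haag/NNP' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash in the token.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError("token is not in word/tag form: %r" % token)
        pairs.append((word, tag))
    return pairs


def strip_tags(line):
    """Recover the tokenized, untagged sentence from a tagged line."""
    return " ".join(word for word, _ in parse_tagged_sentence(line))


example = "Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./."
print(parse_tagged_sentence(example))
print(strip_tags(example))
```

Applied to a reference or training file line by line, strip_tags yields text in the input format expected by the tagger executable.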
How to train a model
To train a model, use
zpar/dist/english.postagger/train <train-file> <model-file> <number of iterations>
For example, using the example training file, you can train a model by
zpar/dist/english.postagger/train train.txt model 1
After training is completed, a new file model will be created in the current directory, which can be used to assign POS tags to tokenized sentences. The above command trains for one iteration (see How to tune the performance of a system) on the training file train.txt.
How to tag new texts
To apply an existing model to tag new texts, use
zpar/dist/english.postagger/tagger <input-file> <output-file> <model>
For example, using the model we just trained, we can tag an example input by
zpar/dist/english.postagger/tagger input.txt output.txt model
The output file contains automatically tagged sentences.
Outputs and evaluation
Automatically tagged texts contain errors. In order to evaluate the quality of the outputs, we can manually specify the POS tags of a sample, and compare the outputs with the correct sample.
Manually specified POS tags for the input file are given in the example reference file reference.txt, and the Python script evaluate.py performs automatic evaluation.
Using the above output.txt and reference.txt, we can evaluate the accuracy by typing
python evaluate.py output.txt reference.txt
The output of the evaluation script is a single number: per-token accuracy, which is defined as the ratio of correctly POS-tagged words to all the words in output.txt.
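The definition of per-token accuracy can be sketched in a few lines of Python. This is an illustrative reimplementation and not the evaluate.py script referred to above. It assumes that the output and the reference contain the same sentences with the same tokenization, and that the tag follows the last slash of each token.

```python
def per_token_accuracy(output_lines, reference_lines):
    """Return the fraction of tokens whose tag matches the reference tag."""
    output_lines = list(output_lines)
    reference_lines = list(reference_lines)
    if len(output_lines) != len(reference_lines):
        raise ValueError("output and reference differ in number of sentences")
    correct = 0
    total = 0
    for out_line, ref_line in zip(output_lines, reference_lines):
        out_tokens = out_line.split()
        ref_tokens = ref_line.split()
        if len(out_tokens) != len(ref_tokens):
            raise ValueError("sentences differ in number of tokens")
        for out_token, ref_token in zip(out_tokens, ref_tokens):
            # Compare only the tags (the part after the last slash).
            out_tag = out_token.rpartition("/")[2]
            ref_tag = ref_token.rpartition("/")[2]
            total += 1
            if out_tag == ref_tag:
                correct += 1
    return correct / total if total else 0.0


# Made-up example: one of five tags differs from the reference.
output = ["Ms./NNP Haag/NNP plays/NNS Elianti/NNP ./."]
reference = ["Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./."]
print(per_token_accuracy(output, reference))
```

To score real files, the two arguments can be the open file objects for output.txt and reference.txt.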
How to tune the performance of a system
The performance of the system after one training iteration may not be optimal. You can train the model for a few more iterations, comparing the performance after each one, and choose the model that gives the highest accuracy on your test data. We conventionally call this test file the development test data, because you use it to develop a tagging model. Here is a shell script that automatically trains the POS tagger for 30 iterations and, after the ith iteration, stores the model file as model.i. You can compare the accuracies of all 30 iterations and choose model.k, which gives the best accuracy, as the final model. In this script, there is a variable called zpar. You need to set this variable to the relative path of the directory zpar/dist/english.postagger.
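Once a development-set score has been recorded for each iteration, picking model.k is a simple maximum. The Python sketch below illustrates only this selection step. It is not the shell script mentioned above, the function name is hypothetical, and the scores are made-up numbers for illustration, not measured results.

```python
def choose_best_model(accuracies):
    """Pick the iteration with the highest development accuracy.

    accuracies maps an iteration number i to the accuracy of model.i.
    """
    if not accuracies:
        raise ValueError("no iterations to choose from")
    # Highest accuracy wins; ties go to the earliest iteration.
    best = min(accuracies, key=lambda i: (-accuracies[i], i))
    return best, "model.%d" % best


# Made-up scores for five iterations (illustration only).
scores = {1: 0.912, 2: 0.934, 3: 0.941, 4: 0.941, 5: 0.938}
print(choose_best_model(scores))
```

Breaking ties in favour of the earliest iteration is a design choice made here; picking the latest tied iteration would be equally valid.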
Source code
The source code for the English POS tagger can be found at
zpar/src/common/tagger/implementation/ENGLISH_TAGGER_IMPL
where ENGLISH_TAGGER_IMPL is a macro defined in Makefile that selects a particular implementation of the English POS tagger.