TWeb Wrapper: ZPar Support for the Neural Network Tagger
Overview
This is the ZPar wrapper for TWeb, a robust web tagger based on a neural network.
How to compile
Suppose that ZPar has been downloaded to the directory zpar. The TWeb source is already included and is located in zpar/src/common/tagger/implementations/tweb/TWeb.
To build a POS tagging system using TWeb, change the GENERIC_TAGGER_IMPL macro defined in the Makefile to tweb, and type make postagger.
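For example, after this change the relevant line in the Makefile should read something like the following (the exact assignment syntax may differ, so check the Makefile itself):

    GENERIC_TAGGER_IMPL = tweb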
This will create a directory zpar/dist/postagger, which contains two files: train and tagger. The file train is used to train a tagging model, and the file tagger is used to tag new texts using a trained tagging model.
Format of inputs and outputs
The input files to the tagger executable are formatted as a sequence of tokenized English sentences. An example input is:
Ms. Haag plays Elianti .
The output files contain space-separated word/POS pairs:
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.
The format of the training files for the train executable is quite different from that of the other tagger implementations in ZPar; the details are given in the next section.
How to train a model
To train a model, use
zpar/dist/postagger/train training <configureFile>
This is the same as the original TWeb usage.
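For example, using the sample configuration file shipped with TWeb (assuming the command is run from the directory containing zpar, and that the paths inside the configuration file have been edited to point at your data):
zpar/dist/postagger/train training zpar/src/common/tagger/implementations/tweb/TWeb/config/conf_sample.txt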
For details about configureFile and the training data format, please refer to the TWeb README; a copy is included below.
1. Build.
Go into the src directory and run make:
a. cd src
b. make -f Makefile
After make, an executable named Tagger will be created in src/
2. Training.
You can train your own model with the following command:
src/Tagger training configureFile
for example:
src/Tagger training config/conf_sample.txt
where configureFile specifies some of the parameters used during training.
There are some examples of the configureFile in config/.
(When training your own model, you only need to specify the paths of the training set,
the development set, and the trained model;
for the other parameters, the default values are fine.)
Lines starting with :: will be ignored.
Take conf/conf.txt as an example; the parameters are described below, and a hypothetical example file is sketched after the list:
trainPath : specify the training data
devPath : specify the development set; you can specify several development sets
testPath : similar to devPath.
strLogDir : a log file will be generated during training;
it records the tagging accuracy of each epoch on the development and test sets.
strModelPath : specify the directory where the model will be generated
nRound : the maximum number of training epochs.
prefix : the prefix of the name of the trained model
strRBMPrefix : specify the "common prefix" of the word-representation RBM (WRRBM),
which is used as features to improve tagging accuracy.
Example of "common prefix":
a WRRBM consists of several files, for example
wrrbm_abc.dict wrrbm_abc.model wrrbm_abc.random
whose common prefix is "wrrbm_abc".
Using the WRRBM makes both training and testing slower,
but it improves web-domain tagging accuracy.
If you want to train a tagger on the standard WSJ data set,
you can ignore the WRRBM by adding "::" at the beginning
of the line containing this parameter.
bEnTagger : "true" denotes the English tagger,
"false" denotes the Chinese tagger (currently not supported)
bEarlyUpdate : "true" means that whenever a word is incorrectly tagged,
the parameters are updated and the rest of the sentence is ignored.
fMargin : since we use a margin loss (Ma et al., 2014) to train the model,
this parameter specifies the value of the margin
fRate : learning rate; currently, we do not use weight decay
vIHSize : size of the feature embedding, corresponding to the dimension of the hidden layer of
the sparse feature module described in Section 3.2 of (Ma et al., 2014)
vIHType : "linear" denotes a linear projection layer
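For illustration only, a configureFile following the parameter names above might look like the sketch below; the exact syntax should be checked against the sample files in config/, and all paths and numeric values here are hypothetical:

    ::example configureFile with hypothetical paths and values
    trainPath : data/train.txt
    devPath : data/dev.txt
    testPath : data/test.txt
    strLogDir : log/
    strModelPath : model/
    nRound : 30
    prefix : WebTagger
    ::strRBMPrefix : wrrbm/wrrbm_abc
    bEnTagger : true
    bEarlyUpdate : true
    fMargin : 0.1
    fRate : 0.1
    vIHSize : 50
    vIHType : linear

Note that the strRBMPrefix line is commented out with "::", as suggested above for training on the WSJ data set without the WRRBM.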
3. Tagging
src/Tagger tagging modelPrefix inputFile outputFile
After training, several files will be generated at the location specified by
"strModelPath"; these files share a common prefix, for example:
WebTagger_AvgParam.confg WebTagger_AvgParam.model ....
Here the common prefix is "WebTagger_AvgParam",
and "modelPrefix" is that common prefix.
Data format:
See the sample data provided at sampleData/.
Each line of the data file is a token and contains three columns:
Word generalizedForm POS
Word: the word form
generalizedForm: the lowercased Word, with digits converted to #DIG...
(scripts for this processing will be released soon; a rough sketch is given below)
POS: the part of speech; when tagging new data, this column is ignored by the tagger
A sentence consists of several tokens, and sentences are separated by an empty line.
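For illustration (assuming whitespace-separated columns, as in the files under sampleData/), the example sentence from the input/output section might be encoded as:

    Ms. ms. NNP
    Haag haag NNP
    plays plays VBZ
    Elianti elianti NNP
    . . .

followed by an empty line before the next sentence.
Since the official conversion scripts are not included, the following is a minimal, unofficial Python sketch of the generalization step described above (lowercasing and replacing digit runs with a #DIG placeholder); the exact placeholder scheme used by TWeb may differ:

    import re
    import sys

    def generalize(word):
        # Lowercase the token and replace digit runs with the #DIG placeholder.
        # NOTE: this placeholder scheme is a guess based on the "#DIG" hint above;
        # the official TWeb scripts may handle digits and mixed tokens differently.
        return re.sub(r"\d+", "#DIG", word.lower())

    # Read "Word POS" lines (or plain tokens) from stdin and write
    # "Word generalizedForm POS" lines, keeping empty lines as sentence breaks.
    for line in sys.stdin:
        fields = line.split()
        if not fields:
            print()
            continue
        word = fields[0]
        pos = fields[1] if len(fields) > 1 else "_"  # dummy POS when tagging new data
        print(word, generalize(word), pos)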
How to cite:
Cite either the URL or the paper
"Tagging The Web: Building A Robust Web Tagger with Neural Network".
Example configuration files and a sample data set can be found in config and sampleData under zpar/src/common/tagger/implementations/tweb/TWeb.
How to tag new texts
To apply an existing model to tag new texts, use
zpar/dist/postagger/tagger <modelPrefix> <input-file> <output-file>
The modelPrefix is the same as described in the TWeb README above.
For example, using the model we just trained, we can tag an example input by
zpar/dist/postagger/tagger modelPrefix input.txt output.txt
The output file contains automatically tagged sentences.
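For instance, if input.txt contains the example sentence from the format section,
Ms. Haag plays Elianti .
then output.txt should contain a line of the form
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.
(the actual tags depend on the trained model).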
Reference
- Ji Ma, Yue Zhang and Jingbo Zhu. 2014. Tagging The Web: Building A Robust Web Tagger with Neural Network. In Proc. of ACL, 144--154.