RDRPOSTagger

A Ripple Down Rules-based Part-Of-Speech Tagging Toolkit

http://rdrpostagger.sourceforge.net

Copyright © 2013-2014 by Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham


1. Introduction

2. Adapting RDRPOSTagger to a specific language

3. RDRPOSTagger for English and Vietnamese

3.1. RDRPOSTagger for English

3.2. RDRPOSTagger for Vietnamese

4. Integrating with an external initial tagger

5. An external implementation in Java to speed up tagging process

References

 

1. Introduction

RDRPOSTagger is a robust, easy-to-use and language-independent POS tagging toolkit. It is a rule-based tagger that employs the Single Classification Ripple Down Rules (SCRDR) methodology [C90, R09]. In short, the RDRPOSTagger approach compares an initialized corpus, produced by an initial tagger, with a golden corpus in order to automatically construct transformation rules in the form of an SCRDR tree. All rules are stored in this exception-structured SCRDR tree, and new rules are only added to correct errors made by existing rules.
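The exception structure can be illustrated with a minimal Python sketch. It is not the toolkit's actual code: the class Node, the function classify, the context keys ("word", "tag", "prev_tag") and the three example rules are all hypothetical and chosen only for illustration. The sketch assumes the usual SCRDR evaluation order, in which a rule that fires hands control to its exception child, a rule that does not fire hands control to its if-not child, and the conclusion of the last rule that fired is returned.

```python
class Node(object):
    """One rule in a simplified, hypothetical SCRDR tree."""

    def __init__(self, condition, conclusion):
        self.condition = condition    # function: context dict -> bool
        self.conclusion = conclusion  # tag to assign, or None to keep the initial tag
        self.except_child = None      # evaluated next if this rule fires
        self.if_not_child = None      # evaluated next if this rule does not fire


def classify(root, context):
    """Walk the tree and return the conclusion of the last rule that fired.

    Falls back to the initial tag stored in context["tag"] when no rule
    with a concrete conclusion fired.
    """
    node, result = root, None
    while node is not None:
        if node.condition(context):
            result = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    return context["tag"] if result is None else result


# Hypothetical example tree. The root is a default rule that always fires
# and keeps the initial tag.
root = Node(lambda c: True, None)
rule_a = Node(lambda c: c["tag"] == "VB" and c["prev_tag"] == "DT", "NN")
rule_b = Node(lambda c: c["word"] == "lead", "JJ")  # exception to rule_a
rule_c = Node(lambda c: c["tag"] == "NN" and c["prev_tag"] == "TO", "VB")

root.except_child = rule_a
rule_a.except_child = rule_b   # corrects errors made by rule_a
rule_a.if_not_child = rule_c   # tried when rule_a does not fire
```

In this sketch, rule_b is only ever reached for contexts in which rule_a has fired, which mirrors the idea that a new rule is added solely to correct the errors of an existing rule.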

On the Penn WSJ Treebank corpus [M93], RDRPOSTagger takes 40 minutes to train on WSJ sections 0-18 and obtains a competitive result compared to other state-of-the-art English POS taggers on the test set of WSJ sections 22-24. For Vietnamese, RDRPOSTagger outperforms machine learning-based POS tagging systems, reaching the highest result reported to date on the Vietnamese Treebank corpus [N09], and it also achieved first place in the POS tagging task of the VLSP 2013 workshop’s evaluation campaign.

The RDRPOSTagger approach is described in detail in our CICLing2011 paper, whilst its architecture is presented in our EACL2014 demo paper:

Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., & Pham, D. D. (2011). Ripple Down Rules for Part-of-Speech Tagging. In Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics - Volume Part I, CICLing’11, Springer-Verlag LNCS, pp. 190–201.

Nguyen, D. Q., Nguyen, D. Q., Pham, D. D., & Pham, S. B. (2014). RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. To appear in Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL’14, in press.

Please cite the CICLing2011 paper when referring to the RDRPOSTagger approach, and cite the EACL2014 paper when using the tagger.

RDRPOSTagger is available for download at http://sourceforge.net/projects/rdrpostagger/
(Size: 0.6 MB. Tagging speeds, measured for instance on a Core 2 Duo 2.4 GHz computer with 3 GB of RAM: 92k words/second for English, 45k words/second for Vietnamese)

We would highly appreciate your bug reports, comments and suggestions about RDRPOSTagger. As a free open-source implementation, RDRPOSTagger is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

2. Adapting RDRPOSTagger to a specific language

For a specific language, RDRPOSTagger assumes that the golden training corpus is formatted in the same way as the sample datasets in the directory “Sample”: each line is a sequence of WORD/TAG pairs separated by white space characters.
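As an illustration of this format, the following Python sketch reads one corpus line into (word, tag) pairs. The function name parse_tagged_line is hypothetical and is not part of the toolkit. The sketch assumes that the last "/" in a token separates the word from its tag, so that a word which itself contains a slash is kept whole; whether the toolkit handles such words in the same way is not stated here.

```python
def parse_tagged_line(line):
    """Split one corpus line into a list of (word, tag) pairs.

    Assumption: the LAST '/' in each token separates word and tag,
    so that a token such as '1/2/CD' yields ('1/2', 'CD').
    """
    pairs = []
    for token in line.split():
        word, separator, tag = token.rpartition("/")
        if not separator or not word or not tag:
            raise ValueError("Token is not in WORD/TAG form: %r" % token)
        pairs.append((word, tag))
    return pairs
```

For example, the line "The/DT price/NN rose/VBD 1/2/CD" is read as four pairs, the last one being the word "1/2" with the tag "CD".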

RDRPOSTagger requires an initial tagger. RDRPOSTagger includes a simple initial tagger that assigns each word the tag listed for it in a lexicon. If you want to integrate your own initial tagger into RDRPOSTagger, see the instructions in section 4.

We assume that Python 2.x is already installed on your operating system and can be run from the command line (e.g. by adding Python to the environment variables on Windows). The following steps apply RDRPOSTagger to your own language:

1.    Use the module LexiconCreator.py in the Utility package to generate a lexicon of words and their most frequent associated tags from the input golden corpus. The lexicon also contains an entry DefaultTag, which is used to tag unknown words (out-of-dictionary words) with the most frequent label in the corpus.

Utility> python LexiconCreator.py PATH-TO-GOLDEN-CORPUS PATH-TO-OUTPUT-LEXICON OPTION-VALUE

OPTION-VALUE takes the string value True or False. True outputs a full lexicon extracted from the input golden training corpus. False means that the output lexicon does not include words that occur only once in the golden corpus. Examples:

Utility> python LexiconCreator.py ../Sample/En/correctTrain  ../Sample/En/fullDict True

Utility> python LexiconCreator.py ../Sample/En/correctTrain  ../Sample/En/shortDict False
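The idea behind step 1 can be sketched in Python as follows. This is an illustrative sketch only, not the code of LexiconCreator.py: the function names build_lexicon and initial_tag, the in-memory dictionary representation and the tiny example corpus are assumptions. It keeps the most frequent tag for each word, optionally drops words that occur only once, and stores the most frequent tag of the corpus under the key DefaultTag for unknown words.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_sentences, full=True):
    """Map each word to its most frequent tag (hypothetical helper).

    tagged_sentences: iterable of lists of (word, tag) pairs.
    full=False leaves out words that occur only once in the corpus.
    """
    tags_per_word = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_per_word[word][tag] += 1
            tag_counts[tag] += 1
    if not tag_counts:
        raise ValueError("The corpus contains no tagged words")

    lexicon = {}
    for word, counts in tags_per_word.items():
        if not full and sum(counts.values()) == 1:
            continue  # skip words seen only once
        lexicon[word] = counts.most_common(1)[0][0]
    # Most frequent tag in the corpus, used for unknown words.
    lexicon["DefaultTag"] = tag_counts.most_common(1)[0][0]
    return lexicon


def initial_tag(words, lexicon):
    """Tag each word by lexicon lookup, falling back to DefaultTag."""
    return [(word, lexicon.get(word, lexicon["DefaultTag"])) for word in words]


# Tiny made-up corpus, for illustration only.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sees", "VBZ"), ("the", "DT"), ("dog", "NN")],
    [("dog", "NN"), ("barks", "VBZ")],
]
full_lexicon = build_lexicon(corpus, full=True)
short_lexicon = build_lexicon(corpus, full=False)
```

With the short lexicon, a word such as "cat", which occurs only once in the example corpus, is treated like an unknown word and receives the default tag.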

2.    [Skip this step if not applicable] You could develop heuristic rules to deal with unknown words instead of assigning the default tag to them (see code line 21 of the module InitialTagger.py in the InitialTagger package). Please refer to EnInitialTagger.py as an example.

3.    You can train the tagger on the golden corpus by executing the following command (assuming the current directory is the pSCRDRtagger package):

pSCRDRtagger> python RDRPOSTagger.py train PATH-TO-LEXICON PATH-TO-CORPUS-DIRECTORY GOLDEN-CORPUS MODEL-NAME

Example:

pSCRDRtagger> python RDRPOSTagger.py train ../Sample/En/fullDict ../Sample/En correctTrain postagging.rdr

Or:

pSCRDRtagger> python RDRPOSTagger.py train ../Sample/En/shortDict ../Sample/En correctTrain postagging.rdr
(Using this command means that words occurring only once in the training corpus are initially tagged as out-of-dictionary words. This scheme will (mostly) return a higher accuracy; however, it slightly slows down the tagging process because the trained model is larger.)

The golden corpus correctTrain is located in the directory Sample/En. The rule-based model postagging.rdr will be generated in a new directory named T3-2 inside Sample/En, because the default threshold pair of 3 and 2 is applied when learning the model. You can easily add other thresholds by changing code line 17 in the module RDRPOSTagger.py.

4.    After training is completed, you can perform POS tagging on raw tokenized data by executing the following command:

pSCRDRtagger> python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-FULL-LEXICON PATH-TO-RAW-CORPUS

Example:

pSCRDRtagger> python RDRPOSTagger.py tag ../Sample/En/T3-2/postagging.rdr ../Sample/En/fullDict ../Sample/En/rawTest

A file named rawTest.TAGGED will be generated in the directory Sample/En.

5.    To evaluate the tagging result, you can use the module Eval.py in the Utility package:

Utility> python Eval.py PathToTaggedFile PathToGoldenFile

Example: Utility> python Eval.py ../Sample/En/rawTest.TAGGED ../Sample/En/correctTest
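The kind of comparison made in step 5 can be sketched as a token-level accuracy computation. The Python function below, tagging_accuracy, is a hypothetical illustration and not the code of Eval.py. It assumes that the tagged file and the golden file contain the same sentences with the same tokenization, line by line, and that the last "/" in each token precedes the tag.

```python
def tagging_accuracy(tagged_lines, gold_lines):
    """Return the percentage of tokens whose tag matches the golden tag.

    Both arguments are lists of lines in WORD/TAG format. The two corpora
    are assumed to be aligned line by line and token by token.
    """
    if len(tagged_lines) != len(gold_lines):
        raise ValueError("The two corpora have different numbers of lines")
    correct = 0
    total = 0
    for tagged_line, gold_line in zip(tagged_lines, gold_lines):
        tagged_tokens = tagged_line.split()
        gold_tokens = gold_line.split()
        if len(tagged_tokens) != len(gold_tokens):
            raise ValueError("Token count mismatch in line: %r" % gold_line)
        for tagged_token, gold_token in zip(tagged_tokens, gold_tokens):
            total += 1
            # rpartition("/")[2] is the text after the last slash, i.e. the tag.
            if tagged_token.rpartition("/")[2] == gold_token.rpartition("/")[2]:
                correct += 1
    return 100.0 * correct / total if total else 0.0
```

For instance, if two of the three tags in a one-sentence corpus agree with the golden corpus, the function returns roughly 66.67.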

3. RDRPOSTagger for English and Vietnamese

For English and Vietnamese, we use some heuristics to initially label out-of-lexicon words. Therefore, we provide two separate modules, EnRDRPOSTagger.py and VnRDRPOSTagger.py. Use EnRDRPOSTagger.py and VnRDRPOSTagger.py in the same way as described in section 2 to retrain the tagger (if necessary) for English and Vietnamese, respectively.

The following instructions show how to use the pre-trained models for English and Vietnamese.

3.1. RDRPOSTagger for English

Perform POS tagging on raw English data with:

pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/EN.RDR ../Dicts/EN.DICT PATH-TO-RAW-CORPUS

Example: pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/EN.RDR ../Dicts/EN.DICT ../Sample/En/rawTest

in which EN.RDR is a model of 2319 transformation rules that has been trained on the Penn WSJ Treebank sections 0-18. The EN.DICT lexicon is also generated from the Treebank sections 0-18.

In addition, we supply a model of 2418 rules named EN1.RDR, which was trained using a smaller lexicon that does not contain words occurring only once (only during training are these words initially labeled as out-of-lexicon words).

Example: pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/EN1.RDR ../Dicts/EN.DICT ../Sample/En/rawTest

·         Accuracy: the 2319-rule RDRPOSTagger model obtains an accuracy of 96.49%, while the 2418-rule model obtains 96.51%, on the test corpus of Penn WSJ sections 22-24.

·         Training and tagging times, measured on a Core 2 Duo 2.4 GHz computer with 3 GB of RAM: training RDRPOSTagger on WSJ sections 0-18 takes 40 minutes and returns the 2319-rule model. With the latest Python implementation of RDRPOSTagger, the tagging speed is 2800 words/second. To speed up the tagging process to 92k words/second, we provide an external implementation in Java, as described in section 5.

3.2. RDRPOSTagger for Vietnamese

You can do POS tagging on raw word-segmented data in Vietnamese as follows:

pSCRDRtagger> python VnRDRPOSTagger.py tag ../Models/VN.RDR ../Dicts/VN.DICT PATH-TO-RAW-CORPUS

Example: pSCRDRtagger> python VnRDRPOSTagger.py tag ../Models/VN.RDR ../Dicts/VN.DICT ../Sample/Vn/rawTest

in which the model VN.RDR of 2896 rules and the VN.DICT lexicon were obtained by training on a golden Vietnamese POS tagging corpus of 28k sentences supplied by the VLSP 2013 workshop’s evaluation campaign [set of POS tags/labels].

·         Training on the golden corpus of 28k Vietnamese sentences takes 100 minutes.

·         Tagging speed: 1100 words/second for the Python implementation; 45k words/second for the external Java implementation (see section 5 for more details).

4. Integrating with an external initial tagger

Assuming that you have an external initial tagger, you can retrain RDRPOSTagger by executing:

pSCRDRtagger> python iRDRPOSTagger.py train PATH-TO-DIRECTORY GOLDEN-CORPUS INITIALIZED-CORPUS MODEL-NAME

Example: pSCRDRtagger> python iRDRPOSTagger.py train ../Sample/En correctTrain initTrain postagging.rdr

where the file initTrain is in the same directory En as the file correctTrain. The file initTrain is produced by using the external initial tagger to perform POS tagging on the raw word-segmented corpus corresponding to the golden file correctTrain (the function getRawTextFromFile in Utils.py in the Utility package can be applied to extract raw text from the golden training corpus).
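Extracting raw text from a golden corpus amounts to removing the tag from every WORD/TAG pair. The Python sketch below illustrates this step only; the function strip_tags is hypothetical and is not the toolkit's getRawTextFromFile. It assumes that the last "/" in each token separates the word from its tag and leaves tokens without a slash unchanged.

```python
def strip_tags(tagged_line):
    """Return the raw word sequence of one WORD/TAG formatted line."""
    words = []
    for token in tagged_line.split():
        word, separator, _tag = token.rpartition("/")
        # Tokens without a slash carry no tag; keep them as they are.
        words.append(word if separator else token)
    return " ".join(words)
```

Applying such a function to every line of the golden corpus yields a raw word-segmented corpus that an external initial tagger can then label to produce the initialized corpus.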

To perform POS tagging on an initialized corpus using the retrained model:

pSCRDRtagger> python iRDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-INITIALIZED-CORPUS

Example: pSCRDRtagger> python iRDRPOSTagger.py tag ../Sample/En/T3-2/postagging.rdr ../Sample/En/initTest

5. An external implementation in Java to speed up tagging process

We separately provide an external package in Java for the tagging process.

To tag a corpus (assuming the current directory is the jSCRDRTagger package):

jSCRDRTagger> java RDRPOSTagger OPTION PATH-TO-LEARNED-MODEL [PATH-TO-LEXICON] PATH-TO-CORPUS

in which OPTION takes one of the four values 'init', 'en', 'vn' and 'other', corresponding to tagging an initialized corpus, English, Vietnamese and other languages, respectively. [PATH-TO-LEXICON] is omitted for the option 'init'.

Example1:

jSCRDRTagger> java RDRPOSTagger en ../Models/EN.RDR ../Dicts/EN.DICT ../Sample/En/rawTest

Example2:

jSCRDRTagger> java RDRPOSTagger vn ../Models/VN.RDR ../Dicts/VN.DICT ../Sample/Vn/rawTest

Example3:

jSCRDRTagger> java RDRPOSTagger other ../Sample/En/T3-2/postagging.rdr ../Sample/En/fullDict ../Sample/En/rawTest

Example4:

jSCRDRTagger> java RDRPOSTagger init ../Sample/En/T3-2/postagging.rdr ../Sample/En/initTest
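When the Java tagger is called from a script, the argument rules above can be checked before the command is run. The following Python sketch builds the argument list for the documented command line; the helper build_tag_command is hypothetical and not part of the toolkit, and it only encodes the rule that a lexicon path is given for 'en', 'vn' and 'other' but not for 'init'.

```python
VALID_OPTIONS = ("init", "en", "vn", "other")


def build_tag_command(option, model_path, corpus_path, lexicon_path=None):
    """Return the argument list for the Java tagging command (hypothetical helper)."""
    if option not in VALID_OPTIONS:
        raise ValueError("OPTION must be one of %s" % ", ".join(VALID_OPTIONS))
    if option == "init":
        if lexicon_path is not None:
            raise ValueError("Option 'init' takes no lexicon path")
        return ["java", "RDRPOSTagger", option, model_path, corpus_path]
    if lexicon_path is None:
        raise ValueError("Option %r requires a lexicon path" % option)
    return ["java", "RDRPOSTagger", option, model_path, lexicon_path, corpus_path]
```

The returned list could, for instance, be passed to subprocess.call with the cwd argument set to the jSCRDRTagger directory.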

For the OPTION 'other', used for other languages, you could develop some heuristic rules to deal with unknown words, instead of assigning a default tag, in the function InitTagger4Sentence in the module InitialTagger.java (code line 36).

Then rebuild RDRPOSTagger using: javac -encoding UTF-8 RDRPOSTagger.java, and perform POS tagging with the above Java command and the option 'other' (NOTE that the Python implementation must be used to train a model before executing the Java command).

Acknowledgement

We would like to thank Sourceforge.net for hosting this project.

References

[C90] Compton, P., & Jansen, R. (1990). A philosophical basis for knowledge acquisition. Knowledge Acquisition, 2(3), 241–257.

[R09] Richards, D. (2009). Two decades of ripple down rules research. Knowledge Engineering Review, 24(2), 159–184.

[M93] Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

[B95] Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4), 543–565.

[N09] Nguyen, P. T., Vu, X. L., Nguyen, T. M. H., Nguyen, V. H., & Le, H. P. (2009). Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proceedings of the Third Linguistic Annotation Workshop (pp. 182–185).


Last updated: February 28, 2014