A Rule-based Part-of-Speech and Morphological Tagging Toolkit


Copyright © 2013-2014 by Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham

1. Introduction

2. Adapting RDRPOSTagger to a specific language

3. Pre-trained POS and morphological tagging models

3.1 RDRPOSTagger for English and Vietnamese

3.2. RDRPOSTagger for other languages

4. Combination with an external initial tagger

5. An external implementation in Java to speed up tagging process

References


·         14/05/2014: Released the RDRPOSTagger version 1.1.3 with minor changes in the package Utility.

·         28/04/2014: Updated pre-trained POS and morphological tagging models for 13 other languages: Bulgarian, Czech, Danish, Dutch, French, German, Hindi, Italian, Lao, Portuguese, Spanish, Swedish and Thai.

·         28/02/2014: Updated the RDRPOSTagger version 1.1.2.

1. Introduction

The RDRPOSTagger is a rule-based tagger employing the Single Classification Ripple Down Rules (SCRDR) methodology [C90, R09]. In short, the RDRPOSTagger automatically constructs transformation-based error-driven rules in the form of a SCRDR tree where new rules are only added to correct the errors of existing rules.
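To make the tree structure concrete, the following is a minimal sketch of how a generic SCRDR tree is evaluated. The Node class, the classify function, the keys of the case dictionary and the three example rules are illustrative assumptions, not the toolkit's own classes or learned rules. Each node has an exception link, followed when its condition holds, and an if-not link, followed otherwise; the conclusion of the last node whose condition held is returned.

```python
class Node(object):
    """One rule in an SCRDR tree: a condition, a conclusion and two child links."""

    def __init__(self, condition, conclusion):
        self.condition = condition    # callable: case -> bool
        self.conclusion = conclusion  # tag proposed when the condition holds
        self.except_child = None      # visited next when this rule fires
        self.if_not_child = None      # visited next when this rule does not fire


def classify(root, case):
    """Return the conclusion of the last rule that fired on the evaluation path."""
    conclusion = None
    node = root
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    # A conclusion of None means "keep the initial tag" in this sketch.
    return case["tag"] if conclusion is None else conclusion


# Default rule at the root: always fires and keeps the initial tag.
root = Node(lambda c: True, None)

# Hypothetical rules, each added to correct errors made by an existing rule.
r1 = Node(lambda c: c["tag"] == "NN" and c["prev_tag"] == "TO", "VB")
r2 = Node(lambda c: c["word"] == "date", "NN")  # exception to r1
r3 = Node(lambda c: c["tag"] == "VBP" and c["prev_tag"] == "MD", "VB")

root.except_child = r1
r1.except_child = r2
r1.if_not_child = r3

print(classify(root, {"word": "walk", "tag": "NN", "prev_tag": "TO"}))
```

Because r2 hangs off the exception link of r1, it is consulted only for cases on which r1 has already fired, so it can only change the outcome for those cases.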

For English POS tagging, RDRPOSTagger took 40 minutes to train on the Penn WSJ Treebank sections 0-18 [M93] and then obtained a tagging accuracy of 96.51% on sections 22-24, with a tagging speed of 92K words per second (using the Java implementation described in section 5), on a laptop running Windows 7 with a Core 2 Duo 2.4GHz processor and 3GB of memory. For Vietnamese POS tagging, RDRPOSTagger obtained the highest tagging accuracy reported to date on the Vietnamese Treebank corpus [N09] and also achieved 1st place in the VLSP 2013 POS tagging task. RDRPOSTagger is described in detail in the following papers:

Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., & Pham, D. D. (2011). Ripple Down Rules for Part-of-Speech Tagging. In Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics - Volume Part I, CICLing2011, pp. 190-201.

Nguyen, D. Q., Nguyen, D. Q., Pham, D. D., & Pham, S. B. (2014). RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL2014, pp. 17-20.

Please cite one of the RDRPOSTagger papers in any publication reporting on results obtained with the help of RDRPOSTagger.

RDRPOSTagger is available to download at http://sourceforge.net/projects/rdrpostagger/

We would greatly appreciate your bug reports, comments and suggestions about RDRPOSTagger. As a free open-source implementation, RDRPOSTagger is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

2. Adapting RDRPOSTagger to a specific language

Input format: RDRPOSTagger assumes that each line in the golden training corpus is a sequence of WORD/TAG pairs separated by a white space character. See sample training and test datasets in the directory Sample.
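As an illustration of this format, the sketch below splits one corpus line into (word, tag) pairs. It is not the toolkit's own reader: parse_tagged_line is a hypothetical helper, and it assumes that the last slash in a token separates the word from its tag, so that words which themselves contain a slash stay intact.

```python
def parse_tagged_line(line):
    """Split one WORD/TAG corpus line into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the last "/" separates word and tag (e.g. "1/2/CD").
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError("token has no tag: %r" % token)
        pairs.append((word, tag))
    return pairs


print(parse_tagged_line("The/DT price/NN rose/VBD 1/2/CD ./."))
```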

Initial tagger: RDRPOSTagger requires an initial tagger. In RDRPOSTagger, the initial tagger uses a lexicon to assign a tag to each word. See section 4 for instructions on using RDRPOSTagger with an external POS tagger as the initial tagger.

In RDRPOSTagger, the training process is implemented in Python, while the tagging process is implemented in both Python and Java. See section 5 for instructions on using the Java implementation to speed up the tagging process.

We assume that Python 2.x can be run from the command line (e.g. by adding Python to the environment variable ‘path’ on Windows). The following steps adapt RDRPOSTagger to a specific corpus/language:

1.    Use the module LexiconCreator.py in the package Utility to generate a lexicon in which each entry consists of a word type and its most frequent tag in the golden training corpus. The lexicon also contains an entry for a default tag, the tag associated with the highest number of word types, which is used to label all out-of-training-corpus words (i.e. unknown words).

Utility> python LexiconCreator.py PATH-TO-GOLDEN-TRAINING-CORPUS PATH-TO-OUTPUT-LEXICON OPTION

where OPTION takes the string value True or False: True returns a full lexicon containing all word types, while False returns a smaller lexicon that excludes words occurring only once in the golden training corpus. Examples:

Utility> python LexiconCreator.py ../Sample/En/correctTrain  ../Sample/En/fullDict True

Utility> python LexiconCreator.py ../Sample/En/correctTrain  ../Sample/En/shortDict False

NOTE that the actual command starts with python. The PACKAGE-NAME> prefix at the start of each example, such as Utility> or pSCRDRtagger>, simply denotes the current source package.
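The lexicon construction described in step 1 can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code of LexiconCreator.py: build_lexicon is a hypothetical name, the input is assumed to be already parsed into (word, tag) pairs, and the default tag is taken to be the tag assigned to the largest number of word types in the resulting lexicon.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_sentences, full=True):
    """Map each word type to its most frequent tag and pick a default tag."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1

    lexicon = {}
    for word, counts in tag_counts.items():
        if not full and sum(counts.values()) == 1:
            continue  # OPTION False: leave out words that occur only once
        lexicon[word] = counts.most_common(1)[0][0]

    # Assumption: the default tag is the one covering the most word types.
    types_per_tag = Counter(lexicon.values())
    default_tag = types_per_tag.most_common(1)[0][0]
    return lexicon, default_tag


corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("runs", "NNS")],
    [("runs", "VBZ")],
]
full_lexicon, default_tag = build_lexicon(corpus, full=True)
short_lexicon, _ = build_lexicon(corpus, full=False)
```

Here full=True corresponds to the True option and full=False to the False option of the command above.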

2.    [Skip this step if not applicable] Develop heuristic rules to handle unknown words instead of using the default tag (modify code line 21 of the module InitialTagger.py in the package InitialTagger). Refer to EnInitialTagger.py for an example.
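To show what such heuristics might look like, here is a small sketch of a lexicon-based initial tagger with a fallback for unknown words. The three rules and the Penn-style tags CD, NNP and RB are hypothetical examples chosen for illustration; they are not the rules in EnInitialTagger.py, and guess_unknown_tag and initial_tag are assumed helper names.

```python
import re

# Tokens made of digits and common numeric punctuation, e.g. "42" or "3,500".
NUMBER = re.compile(r"^[0-9][0-9.,/-]*$")


def guess_unknown_tag(word, default_tag):
    """Hypothetical heuristics for a word that is missing from the lexicon."""
    if NUMBER.match(word):
        return "CD"
    if word[:1].isupper():
        return "NNP"
    if word.endswith("ly"):
        return "RB"
    return default_tag


def initial_tag(words, lexicon, default_tag):
    """Tag known words from the lexicon and unknown words heuristically."""
    tagged = []
    for word in words:
        if word in lexicon:
            tagged.append((word, lexicon[word]))
        else:
            tagged.append((word, guess_unknown_tag(word, default_tag)))
    return tagged


lexicon = {"the": "DT", "dog": "NN"}
print(initial_tag(["the", "Zorblat", "dog", "quickly", "42"], lexicon, "NN"))
```

Real heuristics depend on the language and the tag set of the training corpus, so the conditions above would need to be replaced accordingly.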

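3.    Train RDRPOSTagger on the golden training corpus by executing (assuming the current directory is the package pSCRDRtagger):

pSCRDRtagger> python RDRPOSTagger.py train PATH-TO-LEXICON PATH-TO-CORPUS-DIRECTORY GOLDEN-TRAINING-CORPUS-NAME MODEL-NAME

Example: pSCRDRtagger> python RDRPOSTagger.py train ../Sample/En/shortDict ../Sample/En correctTrain postagging.rdr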

The golden training corpus correctTrain is located in the corpus directory Sample/En. The rule-based model postagging.rdr will be generated in a new directory named T3-2 inside the corpus directory Sample/En, because the default threshold pair of 3 and 2 is applied when learning the model. The threshold pair can be changed by modifying code line 17 of the module RDRPOSTagger.py.

4.    Perform POS tagging on a raw tokenized/word-segmented test corpus:

pSCRDRtagger> python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEST-CORPUS

Example: pSCRDRtagger> python RDRPOSTagger.py tag ../Sample/En/T3-2/postagging.rdr ../Sample/En/fullDict ../Sample/En/rawTest

A file named rawTest.TAGGED will be generated in Sample/En, the directory that contains the test corpus rawTest.

5.    Use the module Eval.py in the package Utility to evaluate the tagging result:

Utility> python Eval.py Path-To-Tagged-Test-Data Path-To-Golden-Test-Data

Example: Utility> python Eval.py ../Sample/En/rawTest.TAGGED ../Sample/En/correctTest
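The evaluation in step 5 amounts to comparing the predicted tags with the golden tags token by token. The sketch below shows one way to compute such a token-level accuracy; it is not the code of Eval.py, tagging_accuracy is a hypothetical name, and it assumes that both inputs use the WORD/TAG format with the same tokenization.

```python
def tagging_accuracy(tagged_lines, gold_lines):
    """Return the percentage of tokens whose tag matches the golden tag."""
    correct = 0
    total = 0
    for tagged, gold in zip(tagged_lines, gold_lines):
        tagged_tokens = tagged.split()
        gold_tokens = gold.split()
        if len(tagged_tokens) != len(gold_tokens):
            raise ValueError("token counts differ: %r vs %r" % (tagged, gold))
        for t, g in zip(tagged_tokens, gold_tokens):
            total += 1
            # Compare only the tag, i.e. the part after the last slash.
            if t.rpartition("/")[2] == g.rpartition("/")[2]:
                correct += 1
    return 100.0 * correct / total if total else 0.0


tagged = ["The/DT dog/NN barks/NNS", "It/PRP runs/VBZ"]
gold = ["The/DT dog/NN barks/VBZ", "It/PRP runs/VBZ"]
accuracy = tagging_accuracy(tagged, gold)
```

In this toy example four of the five tokens carry the golden tag, so the accuracy is 80%.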

3. Pre-trained POS and morphological tagging models

3.1 RDRPOSTagger for English and Vietnamese

For English and Vietnamese, we use some heuristics to initially tag unknown words, so we developed two separate modules, EnRDRPOSTagger.py and VnRDRPOSTagger.py, respectively. These modules can be used in the same way as presented in section 2 to retrain the tagger if necessary.

The following instructions show how to use the pre-trained POS tagging models for English and Vietnamese.

RDRPOSTagger for English

Perform POS tagging on raw English data:

pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/English.RDR ../Dicts/English.DICT PATH-TO-RAW-TEST-CORPUS

Example: pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/English.RDR ../Dicts/English.DICT ../Sample/En/rawTest

in which the lexicon English.DICT contains all word types in the Penn WSJ Treebank sections 0-18, on which the model English.RDR was trained.

·         Tagging accuracy of 96.51% on the test corpus of the Penn WSJ Treebank sections 22-24.

·         Computed on a laptop running Windows 7 with a Core 2 Duo 2.4GHz processor and 3GB of memory: training time of 40 minutes and tagging speed of 2800 words/second. To increase the tagging speed to 92K words/second, we provide an implementation in Java as presented in section 5.

RDRPOSTagger for Vietnamese

Perform POS tagging on raw word-segmented Vietnamese data:

pSCRDRtagger> python VnRDRPOSTagger.py tag ../Models/Vietnamese.RDR ../Dicts/Vietnamese.DICT PATH-TO-RAW-TEST-CORPUS

Example: pSCRDRtagger> python VnRDRPOSTagger.py tag ../Models/Vietnamese.RDR ../Dicts/Vietnamese.DICT ../Sample/Vn/rawTest

in which the model Vietnamese.RDR and the lexicon Vietnamese.DICT were obtained by training on a golden Vietnamese POS-tagged corpus of 28K sentences. This corpus was supplied by the VLSP 2013 workshop’s evaluation campaign [set of POS tags/labels].

Computed on the same laptop (Windows 7, Core 2 Duo 2.4GHz, 3GB of memory): training time of 100 minutes and tagging speeds of 1100 words/second for the implementation in Python and 45K words/second for the implementation in Java (see section 5).

3.2. RDRPOSTagger for other languages

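Pre-trained POS tagging models (one .zip file per language), listed by language and training corpus:

·         French: French Treebank [A03]

·         German: TIGER Corpus [B04]

·         Hindi: Hindi Treebank [P09]

·         Italian: ISDT Treebank [B13]

·         Lao: English-Lao Parallel Corpus [P10]

·         Thai: ORCHID Corpus [S97]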

#sent: size of the corpus - the number of sentences

Pre-trained models for the combined POS and morphological tagging (one .zip file per language), listed by language and training corpus:

·         Bulgarian: BulTreeBank-Morph [S04]

·         Czech: Prague Dependency Treebank 2.5 [B12]

·         Danish: Danish Dependency Treebank [K04]

·         Dutch: Lassy Small Corpus [N13]

·         French: French Treebank [A03]

·         German: TIGER Corpus [B04]

·         Portuguese: Tycho Brahe Corpus [G10]

·         Spanish: IULA LSP Treebank [M12]

·         Swedish: Stockholm-Umeå Corpus 3.0 [S12]

Each .ZIP file contains a .RDR rule model file and a .DICT lexicon file. Each model was trained on nine-tenths (9/10) of the corresponding experimental corpus, measured in sentences. Find more details of experimental results [HERE].
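A sentence-level nine-tenths split can be illustrated as follows. This is only a sketch: the text does not say how the training sentences were selected, so taking the first nine-tenths in corpus order is an assumption, and split_nine_tenths is a hypothetical helper name.

```python
def split_nine_tenths(sentences):
    """Return (training part, held-out part) of a list of sentences."""
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(sentences) * 9 // 10
    return sentences[:cut], sentences[cut:]


sentences = ["sentence %d" % i for i in range(20)]
train, held_out = split_nine_tenths(sentences)
print(len(train), len(held_out))
```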

To use the pre-trained models:

·         Download RDRPOSTagger at: http://sourceforge.net/projects/rdrpostagger/

·         Download and unzip the .ZIP file

·         Execute tagging on raw tokenized/word-segmented text:


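pSCRDRtagger> python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEST-CORPUS

Example: pSCRDRtagger> python RDRPOSTagger.py tag ../Models/German.RDR ../Dicts/German.DICT ../Sample/German/rawTest

Or use the following Java command to obtain a faster tagging speed (detailed in section 5; we assume that Java can be run from the command line):

jSCRDRTagger> java RDRPOSTagger other PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEST-CORPUS

Example: jSCRDRTagger> java RDRPOSTagger other ../Models/German.RDR ../Dicts/German.DICT ../Sample/German/rawTest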

4. Combination with an external initial tagger

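Assuming that an external initial tagger is available, train RDRPOSTagger by executing:

pSCRDRtagger> python iRDRPOSTagger.py train PATH-TO-CORPUS-DIRECTORY GOLDEN-TRAINING-CORPUS-NAME INITIALIZED-TRAINING-CORPUS-NAME MODEL-NAME

Example: pSCRDRtagger> python iRDRPOSTagger.py train ../Sample/En correctTrain initTrain postagging.rdr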

where the initialized corpus initTrain is produced by running the external initial tagger on the raw text (without POS tags) extracted from the golden training corpus correctTrain. We provide the function getRawTextFromFile in the module Utils.py in the package Utility to extract raw text from a golden training corpus. Here initTrain is placed in the same corpus directory, Sample/En, as correctTrain.
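For readers who want to see what extracting the raw text involves, the sketch below removes the tags from one WORD/TAG line. It is an independent stand-in, not the toolkit's getRawTextFromFile: strip_tags is a hypothetical name, and it assumes that the last slash in each token separates the word from its tag.

```python
def strip_tags(tagged_line):
    """Return the space-separated words of one WORD/TAG line."""
    words = []
    for token in tagged_line.split():
        word, sep, _tag = token.rpartition("/")
        # Keep a token unchanged if it carries no tag at all.
        words.append(word if sep else token)
    return " ".join(words)


print(strip_tags("The/DT price/NN rose/VBD 1/2/CD ./."))
```

Applying the helper to every line of the golden training corpus yields the raw text that the external initial tagger would then be run on.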

To execute POS tagging on an initialized test corpus using the trained model:

pSCRDRtagger> python iRDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-INITIALIZED-TEST-CORPUS

Example: pSCRDRtagger> python iRDRPOSTagger.py tag ../Sample/En/T3-2/postagging.rdr ../Sample/En/initTest

Here the initialized test corpus initTest is produced by running the external initial tagger on a raw test corpus.

5. An external implementation in Java to speed up tagging process

We separately implemented a package in Java for the tagging process. To tag a raw tokenized/word-segmented corpus using the pre-trained models (assuming the current directory is the jSCRDRTagger package):

jSCRDRTagger> java RDRPOSTagger OPTION PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEST-CORPUS

in which OPTION takes one of the 3 values 'en', 'vn' and 'other', corresponding to the tagging process for English, Vietnamese and other languages, respectively.

Example1: jSCRDRTagger> java RDRPOSTagger en ../Models/English.RDR ../Dicts/English.DICT ../Sample/En/rawTest

Example2: jSCRDRTagger> java RDRPOSTagger vn ../Models/Vietnamese.RDR ../Dicts/Vietnamese.DICT ../Sample/Vn/rawTest

For other languages, users can develop heuristic rules to deal with unknown words in the function InitTagger4Sentence in the module InitialTagger.java (code line 36). Rebuild RDRPOSTagger using javac -encoding UTF-8 RDRPOSTagger.java, and then perform POS tagging with the above Java command and the OPTION value 'other'. NOTE that users have to apply the same heuristic rules to the module InitialTagger.py (code line 21) in the package InitialTagger, and then retrain the tagging model with the Python implementation before executing the 'other' Java command.

NOTE that the actual command starts with java; jSCRDRtagger> simply denotes the current source package.
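For scripted use, the Java command shown in the examples above can be assembled and launched from Python. The following is a minimal sketch, assuming that the compiled RDRPOSTagger class is in the package directory and that java is available on the command line; build_tag_command and run_tagger are hypothetical helper names, not part of the toolkit.

```python
import subprocess

# The three OPTION values described in this section.
VALID_OPTIONS = ("en", "vn", "other")


def build_tag_command(option, model_path, lexicon_path, corpus_path):
    """Build the argument list for tagging a raw corpus with the Java tagger."""
    if option not in VALID_OPTIONS:
        raise ValueError("OPTION must be one of: %s" % ", ".join(VALID_OPTIONS))
    return ["java", "RDRPOSTagger", option, model_path, lexicon_path, corpus_path]


def run_tagger(option, model_path, lexicon_path, corpus_path, package_dir="."):
    """Run the tagger from the Java package directory and return its exit code."""
    command = build_tag_command(option, model_path, lexicon_path, corpus_path)
    return subprocess.call(command, cwd=package_dir)


command = build_tag_command(
    "en", "../Models/English.RDR", "../Dicts/English.DICT", "../Sample/En/rawTest"
)
print(" ".join(command))
```

Passing the arguments as a list rather than as a single shell string avoids quoting problems when a path contains spaces.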

To tag an initialized corpus produced by an external initial tagger:

jSCRDRTagger> java RDRPOSTagger init PATH-TO-TRAINED-MODEL PATH-TO-INITIALIZED-TEST-CORPUS

Example: jSCRDRTagger> java RDRPOSTagger init ../Sample/En/T3-2/postagging.rdr ../Sample/En/initTest


We would like to thank Sourceforge.net for hosting this project.


References

[C90] P. Compton and R. Jansen. 1990. A philosophical basis for knowledge acquisition. Knowledge Acquisition, 2(3):241–257.

[R09] Debbie Richards. 2009. Two decades of Ripple Down Rules research. Knowledge Engineering Review, 24(2):159–184.

[M93] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.

[B95] Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

[N09] Phuong Thai Nguyen, Xuan Luong Vu, Thi Minh Huyen Nguyen, Van Hiep Nguyen, and Hong Phuong Le. 2009. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proc. of LAW III workshop, pages 182–185.

[A03] Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a Treebank for French. In Treebanks, volume 20 of Text, Speech and Language Technology, pages 165–187.

[B04] Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.

[P09] Martha Palmer, Rajesh Bhatt, Bhuvana Narasimhan, Owen Rambow, Dipti Misra Sharma, and Fei Xia. 2009. Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure. In Proceedings of ICON, pages 261–268.

[B13] C. Bosco, S. Montemagni, and M. Simi. 2013. Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 61–69.

[P10] PANL10N. 2010. Pan localization project: English-Lao parallel corpus. http://panl10n.net/english/OutputsLaos2.htm.

[S97] Virach Sornlertlamvanich, Thatsanee Charoenporn, and Hitoshi Isahara. 1997. ORCHID: Thai Part-Of-Speech Tagged Corpus. http://culturelab.in.th/files/orchid.html.

[S04] Kiril Simov, Petya Osenova, Alexander Simov, and Milen Kouylekov. 2004. Design and Implementation of the Bulgarian HPSG-based Treebank. Research on Language and Computation, 2:495–522.

[B12] Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, and Zdeněk Žabokrtský. 2012. Prague Dependency Treebank 2.5 - a revisited version of PDT 2.0. In Proceedings of COLING, pages 231–246.

[K04] M.T. Kromann and S.K. Lynge. 2004. Danish Dependency Treebank v. 1.0. Department of Computational Linguistics, Copenhagen Business School. http://www.id.cbs.dk/~mtk/ddt1.0.

[N13] Gertjan Noord, Gosse Bouma, Frank Eynde, Daniël Kok, Jelmer Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 147–164.

[G10] Charlotte Galves and Pablo Faria. 2010. Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html.

[M12] Montserrat Marimon, Beatríz Fisas, Núria Bel, Marta Villegas, Jorge Vivaldi, Sergi Torner, Mercè Lorente, Silvia Vázquez, and Marta Villegas. 2012. The IULA Treebank. In Proceedings of LREC, pages 1920–1926.

[S12] SUC-3.0. 2012. The Stockholm-Umeå Corpus (SUC) 3.0. http://spraakbanken.gu.se/eng/resource/suc3.

Last updated: May 14, 2014