RDRPOSTagger

A Ripple Down Rules-based Part-Of-Speech Tagging Toolkit

http://rdrpostagger.sourceforge.net

Copyright © 2013-2014 by Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham


1. Introduction

2. Adapting RDRPOSTagger to a specific language

3. Pre-trained models

3.1. RDRPOSTagger for English and Vietnamese

3.2. RDRPOSTagger for other languages

4. Integrating with an external initial tagger

5. An external implementation in Java to speed up the tagging process

References

 

News:

·         14/05/2014: Released RDRPOSTagger version 1.1.3 with improvements to the Utility package.

·         28/04/2014: Updated pre-trained models for 13 other languages: Bulgarian, Czech, Danish, Dutch, French, German, Hindi, Italian, Lao, Portuguese, Spanish, Swedish and Thai. See section 3.2 for details.

·         28/02/2014: Updated RDRPOSTagger version 1.1.2.

1. Introduction

The robust, easy-to-use and language-independent POS tagging toolkit RDRPOSTagger is a rule-based tagger employing the Single Classification Ripple Down Rules (SCRDR) methodology [C90, R09]. In short, the RDRPOSTagger approach compares an initialized corpus, produced by an initial tagger, against a golden corpus to automatically construct transformation rules in the form of an SCRDR tree. All rules are stored in an exception-structured SCRDR tree, and new rules are only added to correct the errors of existing rules.
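The exception structure of an SCRDR tree can be sketched in a few lines of Python. This is an illustrative model only — the node layout, the lambda conditions and the prev_tag context key are our assumptions, not the toolkit's actual data structures:

```python
class SCRDRNode:
    """One rule in a Single Classification Ripple Down Rules tree."""
    def __init__(self, condition, conclusion):
        self.condition = condition    # predicate over a tagging context
        self.conclusion = conclusion  # tag concluded when the rule fires
        self.except_child = None      # exception branch: refines this rule
        self.if_not_child = None      # tried when this rule does not fire


def evaluate(node, context):
    """Return the conclusion of the last satisfied rule on the ripple-down path."""
    conclusion = None
    while node is not None:
        if node.condition(context):
            conclusion = node.conclusion
            node = node.except_child   # look for a more specific exception
        else:
            node = node.if_not_child   # try the next alternative rule
    return conclusion


# Hypothetical rules: tag "run" as VB by default, except after a determiner.
root = SCRDRNode(lambda ctx: True, "VB")
root.except_child = SCRDRNode(lambda ctx: ctx["prev_tag"] == "DT", "NN")
```

Because a new rule is attached as an exception of the rule it corrects, existing rules keep firing for all contexts the new rule does not cover.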

On the Penn WSJ Treebank [M93], RDRPOSTagger takes 40 minutes to train on WSJ sections 0-18 and obtains results competitive with other state-of-the-art English POS taggers on the test set of WSJ sections 22-24. For Vietnamese, RDRPOSTagger outperforms machine learning-based POS tagging systems, achieving the highest result to date on the Vietnamese Treebank [N09], and placed 1st in the POS tagging task of the VLSP 2013 workshop's evaluation campaign.

A detailed description of the RDRPOSTagger approach is given in our CICLing 2011 paper, while its architecture is presented in our EACL 2014 demo paper:

Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., & Pham, D. D. (2011). Ripple Down Rules for Part-of-Speech Tagging. In Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics - Volume Part I, CICLing’11, Springer-Verlag LNCS, pp. 190-201. [.pdf] [.bib]

Nguyen, D. Q., Nguyen, D. Q., Pham, D. D., & Pham, S. B. (2014). RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL'14, pp. 17-20. [.pdf] [.bib]

Please cite the CICLing 2011 paper when referring to the RDRPOSTagger approach, and the EACL 2014 paper when using the tagger.

RDRPOSTagger is available to download at http://sourceforge.net/projects/rdrpostagger/
(Size: 0.6 MB. Tagging speed: for instance, 92K words/second for English on a Core 2 Duo 2.4GHz computer with 3GB of memory.)

We would highly appreciate your bug reports, comments and suggestions about RDRPOSTagger. As a free open-source implementation, RDRPOSTagger is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

2. Adapting RDRPOSTagger to a specific language

For a specific language, RDRPOSTagger assumes that the golden training corpus is formatted like the sample datasets in the directory "Sample": each line is a sequence of WORD/TAG pairs separated by white space characters, e.g. The/DT dog/NN barks/VBZ ./.

RDRPOSTagger requires an initial tagger in its processing. We provide a simple initial tagger that assigns each word a tag looked up from a lexicon. If you want to integrate your own initial tagger into RDRPOSTagger, see the instructions in section 4.

Assuming that Python 2.x is already installed on your operating system and can be run from the command line (e.g. by adding Python to the environment variables on Windows), the following steps apply RDRPOSTagger to your own language:

1.    Use the module LexiconCreator.py in the Utility package to generate a lexicon, where each entry consists of a word and its most frequent associated tag in the input golden training corpus. The lexicon also contains an entry DefaultTag, used to label unknown words (out-of-dictionary words) with the tag associated with the highest number of words in the input golden corpus.

Utility> python LexiconCreator.py PATH-TO-GOLDEN-CORPUS PATH-TO-OUTPUT-LEXICON OPTION-VALUE

OPTION-VALUE takes the string value True or False. True creates a full lexicon containing all words in the golden corpus; False returns a smaller lexicon that excludes words occurring only once in the golden corpus. Examples:

Utility> python LexiconCreator.py ../Sample/En/correctTrain  ../Sample/En/fullDict True

Utility> python LexiconCreator.py ../Sample/En/correctTrain  ../Sample/En/shortDict False
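What LexiconCreator.py computes can be sketched as follows. This is only an approximation of the module, under the assumption that the lexicon is a plain "word tag" list; the actual file format and tie-breaking may differ:

```python
from collections import Counter, defaultdict


def create_lexicon(golden_path, lexicon_path, full_lexicon=True):
    """Sketch of a most-frequent-tag lexicon built from a WORD/TAG corpus."""
    # Count tag frequencies per word.
    tag_counts = defaultdict(Counter)
    with open(golden_path) as f:
        for line in f:
            for pair in line.split():
                word, _, tag = pair.rpartition("/")
                tag_counts[word][tag] += 1
    # DefaultTag: the tag that is the most frequent tag of the largest
    # number of distinct words.
    words_per_tag = Counter()
    for word, tags in tag_counts.items():
        words_per_tag[tags.most_common(1)[0][0]] += 1
    default_tag = words_per_tag.most_common(1)[0][0]
    with open(lexicon_path, "w") as out:
        out.write("DefaultTag %s\n" % default_tag)
        for word, tags in sorted(tag_counts.items()):
            occurrences = sum(tags.values())
            # full_lexicon=False drops words occurring only once.
            if full_lexicon or occurrences > 1:
                out.write("%s %s\n" % (word, tags.most_common(1)[0][0]))
```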

2.    [Skip this step if not applicable] You could add heuristic rules to handle unknown words instead of assigning them the default tag (see code line 21 in the module InitialTagger.py in the InitialTagger package). Please refer to EnInitialTagger.py as an example.

3.    Train the tagger on the golden corpus by executing (from inside the pSCRDRtagger package):

pSCRDRtagger> python RDRPOSTagger.py train PATH-TO-SMALLER-IN-SIZE-LEXICON PATH-TO-CORPUS-DIRECTORY GOLDEN-CORPUS MODEL-NAME

Example: pSCRDRtagger> python RDRPOSTagger.py train ../Sample/En/shortDict ../Sample/En correctTrain postagging.rdr

The golden corpus correctTrain is located in the directory Sample/En. The rule-based model postagging.rdr will be generated in a new directory named T3-2 inside Sample/En, since we apply the default threshold pair (3, 2) for learning the model. You can easily try other thresholds by changing code line 17 in the module RDRPOSTagger.py.

4.    After training is complete, you can perform POS tagging on raw tokenized data with the following command:

pSCRDRtagger> python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-FULL-LEXICON PATH-TO-RAW-CORPUS

Example: pSCRDRtagger> python RDRPOSTagger.py tag ../Sample/En/T3-2/postagging.rdr ../Sample/En/fullDict ../Sample/En/rawTest

A file named rawTest.TAGGED will be generated in the directory Sample/En.

5.    To evaluate the tagging result, use the module Eval.py in the Utility package:

Utility> python Eval.py PathToTaggedFile PathToGoldenFile

Example: Utility> python Eval.py ../Sample/En/rawTest.TAGGED ../Sample/En/correctTest
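Such an evaluation amounts to token-level accuracy over WORD/TAG pairs. The sketch below is our reading of what Eval.py measures, not its actual implementation:

```python
def tagging_accuracy(tagged_path, golden_path):
    """Token-level accuracy: share of WORD/TAG pairs whose tags agree."""
    correct = total = 0
    with open(tagged_path) as tagged, open(golden_path) as golden:
        for tagged_line, golden_line in zip(tagged, golden):
            for t_pair, g_pair in zip(tagged_line.split(), golden_line.split()):
                total += 1
                # Compare only the tag after the last "/" of each pair.
                if t_pair.rpartition("/")[2] == g_pair.rpartition("/")[2]:
                    correct += 1
    return 100.0 * correct / total
```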

3. Pre-trained models

3.1. RDRPOSTagger for English and Vietnamese

For English and Vietnamese, we use some heuristics to initially label out-of-lexicon words, so we provide two separate modules, EnRDRPOSTagger.py and VnRDRPOSTagger.py. Use them in the same way as described in section 2 to retrain the tagger (if necessary) for English and Vietnamese, respectively.

The following instructions show how to use the pre-trained models for English and Vietnamese.

RDRPOSTagger for English

Perform POS tagging on raw English data with:

pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/English.RDR ../Dicts/English.DICT PATH-TO-RAW-CORPUS

Example: pSCRDRtagger> python EnRDRPOSTagger.py tag ../Models/English.RDR ../Dicts/English.DICT ../Sample/En/rawTest

in which the lexicon English.DICT contains all words in Penn WSJ Treebank sections 0-18, on which the model English.RDR was trained.

·         Accuracy: RDRPOSTagger obtains an accuracy of 96.51% on the test corpus of Penn WSJ Treebank sections 22-24.

·         Measured on a Core 2 Duo 2.4GHz computer with 3GB of memory: training RDRPOSTagger on sections 0-18 takes 40 minutes. The latest Python implementation tags at 2,800 words/second; to speed tagging up to 92K words/second, we provide an external implementation in Java, described in section 5.

RDRPOSTagger for Vietnamese

You can perform POS tagging on raw word-segmented Vietnamese data as follows:

pSCRDRtagger> python VnRDRPOSTagger.py tag ../Models/Vietnamese.RDR ../Dicts/Vietnamese.DICT PATH-TO-RAW-CORPUS

Example: pSCRDRtagger> python VnRDRPOSTagger.py tag ../Models/Vietnamese.RDR ../Dicts/Vietnamese.DICT ../Sample/Vn/rawTest

in which the model Vietnamese.RDR and the lexicon Vietnamese.DICT were obtained by training on a golden Vietnamese POS-tagged corpus of 28K sentences supplied by the VLSP 2013 workshop's evaluation campaign [set of POS tags/labels]. Measured on the same Core 2 Duo 2.4GHz computer with 3GB of memory:

·        It takes 100 minutes to complete the training process on the golden training corpus of 28K Vietnamese sentences.

·         Tagging speed: 1,100 words/second for the Python implementation; 45K words/second for the external Java implementation (see section 5 for details).

3.2. RDRPOSTagger for other languages

Pre-trained POS tagging models:

Model (.zip)    Corpus used to train                 #sent
French          French Treebank [A03]               21,562
German          TIGER Corpus [B04]                  50,470
Hindi           Hindi Treebank [P09]                26,547
Italian         ISDT Treebank [B13]                 10,206
Lao             English-Lao Parallel Corpus [P10]    2,114
Thai            ORCHID Corpus [S97]                 23,225

#sent: the number of sentences in the corpus

Pre-trained models for combined POS and morphological tagging:

Model (.zip)    Corpus used to train                    #sent
Bulgarian       BulTreeBank-Morph [S04]                20,558
Czech           Prague Dependency Treebank 2.5 [B12]  115,844
Danish          Danish Dependency Treebank [K04]        5,512
Dutch           Lassy Small Corpus [N13]               65,200
French          French Treebank [A03]                  21,562
German          TIGER Corpus [B04]                     50,470
Portuguese      Tycho Brahe Corpus [G10]               68,859
Spanish         IULA LSP Treebank [M12]                42,099
Swedish         Stockholm-Umeå Corpus 3.0 [S12]        74,245

Each associated .ZIP file contains a .RDR rule model file and a .DICT lexicon file. Each .RDR rule model was trained on a training set whose size is nine-tenths (9/10) of the corresponding corpus at the sentence level, whereas the .DICT lexicon was generated from the whole corpus. To use the pre-trained models:

·         Download RDRPOSTagger at: http://sourceforge.net/projects/rdrpostagger/

·         Download and unzip the .ZIP file

·         Execute tagging on raw tokenized/word-segmented text:

pSCRDRtagger> python RDRPOSTagger.py tag PATH-TO-RDR-RULE-MODEL PATH-TO-DICT-LEXICON PATH-TO-RAW-CORPUS

Example: pSCRDRtagger> python RDRPOSTagger.py tag ../Models/English.RDR ../Dicts/English.DICT ../Sample/En/rawTest

Or use the following Java command for a faster tagging process (detailed in section 5):

jSCRDRTagger> java RDRPOSTagger other PATH-TO-RDR-RULE-MODEL PATH-TO-DICT-LEXICON PATH-TO-RAW-CORPUS

Example: jSCRDRTagger> java RDRPOSTagger other ../Models/English.RDR ../Dicts/English.DICT ../Sample/En/rawTest
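Each .RDR model above was trained on nine-tenths of its corpus at the sentence level. The exact partition the authors used is not specified, so the deterministic split below is only an illustrative sketch of that shape:

```python
def nine_tenths_split(sentences):
    """Hold out every 10th sentence for evaluation; train on the rest."""
    train = [s for i, s in enumerate(sentences) if i % 10 != 9]
    held_out = [s for i, s in enumerate(sentences) if i % 10 == 9]
    return train, held_out
```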

4. Integrating with an external initial tagger

Assuming that we have an external initial tagger, you can retrain RDRPOSTagger by executing:

pSCRDRtagger> python iRDRPOSTagger.py train PATH-TO-DIRECTORY GOLDEN-CORPUS INITIALIZED-CORPUS MODEL-NAME

Example: pSCRDRtagger> python iRDRPOSTagger.py train ../Sample/En correctTrain initTrain postagging.rdr

where the file initTrain is located in the same directory En as the file correctTrain. The file initTrain is produced by running the external initial tagger on a raw word-segmented corpus corresponding to the golden file correctTrain (apply the function getRawTextFromFile in Utils.py in the Utility package to extract the raw text from the golden training corpus).
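Extracting raw text from a golden WORD/TAG corpus can be sketched as below; this mirrors what getRawTextFromFile presumably does, though the toolkit's helper may differ in details:

```python
def get_raw_text(golden_path, raw_path):
    """Drop the /TAG suffix of each WORD/TAG pair to recover raw text."""
    with open(golden_path) as golden, open(raw_path, "w") as raw:
        for line in golden:
            # rpartition keeps any "/" inside the word itself intact.
            words = [pair.rpartition("/")[0] for pair in line.split()]
            raw.write(" ".join(words) + "\n")
```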

To perform POS tagging on an initialized corpus using the retrained model:

pSCRDRtagger> python iRDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-INITIALLY-TAGGED-CORPUS

Example: pSCRDRtagger> python iRDRPOSTagger.py tag ../Sample/En/T3-2/postagging.rdr ../Sample/En/initTest

The file initTest is produced by running the external initial tagger on the raw data rawTest.

5. An external implementation in Java to speed up the tagging process

We provide a separate external Java package for the tagging process.

To tag a corpus using pre-trained models (from inside the jSCRDRTagger package):

jSCRDRTagger> java RDRPOSTagger OPTION PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-CORPUS

where OPTION takes one of the three values 'en', 'vn' and 'other', corresponding to tagging for English, Vietnamese and other languages, respectively.

Example 1: jSCRDRTagger> java RDRPOSTagger en ../Models/English.RDR ../Dicts/English.DICT ../Sample/En/rawTest

Example 2: jSCRDRTagger> java RDRPOSTagger vn ../Models/Vietnamese.RDR ../Dicts/Vietnamese.DICT ../Sample/Vn/rawTest

With OPTION 'other', employed for other languages, instead of assigning a default tag you could develop heuristic rules to deal with unknown words in the function InitTagger4Sentence in the module InitialTagger.java (code line 36).

Rebuild RDRPOSTagger using javac -encoding UTF-8 RDRPOSTagger.java, then perform POS tagging with the Java command above using 'other' (NOTE: a model must be trained with the Python implementation before the Java command can be executed).

To perform a tagging process on an initially tagged corpus:

jSCRDRTagger> java RDRPOSTagger init PATH-TO-TRAINED-MODEL PATH-TO-INITIALLY-TAGGED-CORPUS

Example: jSCRDRTagger> java RDRPOSTagger init ../Sample/En/T3-2/postagging.rdr ../Sample/En/initTest

Acknowledgement

We would like to thank Sourceforge.net for hosting this project.

References

[C90] P. Compton and R. Jansen. 1990. A philosophical basis for knowledge acquisition. Knowledge Acquisition, 2(3):241–257.

[R09] Debbie Richards. 2009. Two decades of Ripple Down Rules research. Knowledge Engineering Review, 24(2):159–184.

[M93] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.

[B95] Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

[N09] Phuong Thai Nguyen, Xuan Luong Vu, Thi Minh Huyen Nguyen, Van Hiep Nguyen, and Hong Phuong Le. 2009. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proc. of LAW III workshop, pages 182–185.

[A03] Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a Treebank for French. In Treebanks, volume 20 of Text, Speech and Language Technology, pages 165–187.

[B04] Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.

[P09] Martha Palmer, Rajesh Bhatt, Bhuvana Narasimhan, Owen Rambow, Dipti Misra Sharma, and Fei Xia. 2009. Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure. In Proceedings of ICON, pages 261–268.

[B13] C. Bosco, S. Montemagni, and M. Simi. 2013. Converting Italian Treebanks : Towards an Italian Stanford Dependency Treebank. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 61–69.

[P10] PANL10N. 2010. Pan localization project: English-Lao parallel corpus. http://panl10n.net/english/OutputsLaos2.htm.

[S97] Virach Sornlertlamvanich, Thatsanee Charoenporn, and Hitoshi Isahara. 1997. ORCHID: Thai Part-Of-Speech Tagged Corpus. http://culturelab.in.th/files/orchid.html.

[S04] Kiril Simov, Petya Osenova, Alexander Simov, and Milen Kouylekov. 2004. Design and Implementation of the Bulgarian HPSG-based Treebank. Research on Language and Computation, 2:495–522.

[B12] Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, and Zdeněk Žabokrtský. 2012. Prague Dependency Treebank 2.5 – a revisited version of PDT 2.0. In Proceedings of COLING, pages 231–246.

[K04] M. T. Kromann and S. K. Lynge. 2004. Danish Dependency Treebank v. 1.0. Department of Computational Linguistics, Copenhagen Business School. http://www.id.cbs.dk/~mtk/ddt1.0.

[N13] Gertjan van Noord, Gosse Bouma, Frank Van Eynde, Daniël de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 147–164.

[G10] Charlotte Galves and Pablo Faria. 2010. Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html.

[M12] Montserrat Marimon, Beatríz Fisas, Núria Bel, Marta Villegas, Jorge Vivaldi, Sergi Torner, Mercè Lorente, Silvia Vázquez, and Marta Villegas. 2012. The IULA Treebank. In Proceedings of LREC, pages 1920–1926.

[S12] SUC-3.0. 2012. The Stockholm-Umeå Corpus (SUC) 3.0. http://spraakbanken.gu.se/eng/resource/suc3.


Last updated: May 14, 2014