RDRPOSTagger

A Rule-based Part-of-Speech and Morphological Tagging Toolkit

http://rdrpostagger.sourceforge.net

Copyright © 2013-2015 by Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham


1. Introduction

2. Train RDRPOSTagger on a gold standard training corpus

3. Use pre-trained POS and morphological tagging models

4. Combine RDRPOSTagger with an external initial tagger

5. Speed up tagging process with an implementation in Java

References

News:

·         21/12/2015: release version 1.2.1 with improved tagging speed in Python

·         18/11/2015: release version 1.2

§  Yields improved tagging accuracy, especially on morphologically rich languages. See experimental results for 13 languages in our AI Communications article.

§  Includes new pre-trained Part-of-Speech (POS) and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese.

·         14/05/2014: release version 1.1.3

1. Introduction

RDRPOSTagger is a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. It employs an error-driven approach to automatically construct tagging rules in the form of a binary tree. The main properties of RDRPOSTagger are as follows:

·         RDRPOSTagger is fast in both the learning and tagging processes. For example, on the English Penn WSJ sections 22-24, RDRPOSTagger achieved tagging speeds of 5K and 90K words/second for single-threaded implementations in Python and Java, respectively, on a computer with a Core2Duo 2.4GHz CPU and 3GB of memory. See more results in our AI Communications article.

·         RDRPOSTagger achieves accuracy that is very competitive with state-of-the-art results.
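To make the binary-tree form of the rules more concrete, here is a minimal sketch of how such a tree could be evaluated. It is an illustration only, not the toolkit's code: the Node class, the classify and tag_with_rules functions and the sample rule are hypothetical, and the traversal follows the general Single Classification Ripple Down Rules scheme (when a rule fires, move to its exception child; otherwise move to its if-not child; the last rule that fired gives the tag).

```python
class Node(object):
    """One rule in a binary rule tree (hypothetical structure, for illustration)."""

    def __init__(self, condition, conclusion, except_child=None, if_not_child=None):
        self.condition = condition        # function: context dict -> bool
        self.conclusion = conclusion      # tag proposed when the condition holds
        self.except_child = except_child  # tried next when this rule fires
        self.if_not_child = if_not_child  # tried next when this rule does not fire


def classify(root, context):
    """Return the conclusion of the last rule that fired on the path through the tree."""
    conclusion = None
    node = root
    while node is not None:
        if node.condition(context):
            conclusion = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    return conclusion


# Made-up sample tree: the default rule always fires and concludes nothing,
# so the initial tag is kept; one exception rule corrects "VB" to "NN"
# when the previous word is "the".
fix_noun = Node(lambda c: c["tag"] == "VB" and c["prev_word"] == "the", "NN")
root = Node(lambda c: True, None, except_child=fix_noun)


def tag_with_rules(context):
    """Apply the sample tree; fall back to the initial tag if no rule concludes."""
    corrected = classify(root, context)
    return corrected if corrected is not None else context["tag"]
```

With this sample tree, a word initially tagged "VB" after "the" is corrected to "NN", while any other context keeps its initial tag.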

The general architecture and experimental results of RDRPOSTagger can be found in our papers:

Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 17-20, 2014.

Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging. AI Communications, to appear (accepted for publication on 3/12/2015).

Please cite our EACL 2014 demo paper whenever RDRPOSTagger is used to produce published results.

RDRPOSTagger is available to download (5MB .zip file) at: https://sourceforge.net/projects/rdrpostagger/files/RDRPOSTagger_v1.2.1.zip

RDRPOSTagger (version 1.2.1) is now also available to download at: https://github.com/datquocnguyen/RDRPOSTagger

We would greatly appreciate your bug reports, comments and suggestions about RDRPOSTagger. As a free open-source implementation, RDRPOSTagger is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

2. Train RDRPOSTagger on a gold standard training corpus

Notices:

·         In terms of implementation, the training process has been implemented in Python while the tagging process has been implemented in both Python and Java. See Section 5 for details of using the Java implementation.

·         RDRPOSTagger requires an initial tagger. The internal initial tagger developed within RDRPOSTagger uses a lexicon to assign a tag to each word. See Section 4 for how to combine RDRPOSTagger with an external initial tagger.

·         RDRPOSTagger assumes that each line in the gold standard training corpus is a sequence of WORD/TAG pairs separated by white space characters. See sample training and test sets in the data directory.
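The corpus format and the idea of a lexicon-based initial tagger described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the toolkit's implementation: the function names, the choice of the most frequent tag per word and the default_tag fallback for unknown words are all hypothetical, and the real .DICT lexicon may be built and used differently.

```python
from collections import Counter, defaultdict


def parse_tagged_line(line):
    """Split one corpus line into (word, tag) pairs.

    Assumption: the tag follows the LAST "/" of each token, so a word
    that itself contains "/" (e.g. "1/2/CD") keeps its slash.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def build_lexicon(lines):
    """Map each word to its most frequent tag in the given tagged lines."""
    counts = defaultdict(Counter)
    for line in lines:
        for word, tag in parse_tagged_line(line):
            counts[word][tag] += 1
    return dict((word, tags.most_common(1)[0][0]) for word, tags in counts.items())


def initial_tag(words, lexicon, default_tag="NN"):
    """Assign each word its lexicon tag, or default_tag if the word is unknown."""
    return [(word, lexicon.get(word, default_tag)) for word in words]
```

For example, build_lexicon(["run/VB run/NN run/VB"]) maps "run" to "VB", and initial_tag then assigns "VB" to every occurrence of "run" before any correction rules are applied.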

We assume that Python 2.x is already set up to run from the command line or terminal (e.g. by adding Python to the ‘path’ environment variable on Windows).

·         We train RDRPOSTagger on the gold standard training corpus by executing:

pSCRDRtagger$ python RDRPOSTagger.py train PATH-TO-GOLD-STANDARD-TRAINING-CORPUS

Example 1: pSCRDRtagger$ python RDRPOSTagger.py train ../data/goldTrain

Note that the actual command starts at python. The pSCRDRtagger$ prefix simply indicates that the command is run from within the pSCRDRtagger source package directory.

A .DICT lexicon file and an .RDR trained model file, for example goldTrain.DICT and goldTrain.RDR, will be generated in the same directory containing the gold standard training corpus.

·         To use the trained model for POS tagging on a raw, unlabeled text corpus, we run:

pSCRDRtagger$ python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example 2: pSCRDRtagger$ python RDRPOSTagger.py tag ../data/goldTrain.RDR ../data/goldTrain.DICT ../data/rawTest

A .TAGGED file, in this case rawTest.TAGGED, will be generated in the same directory containing the raw text corpus.

To speed up tagging in Python, set a higher value for the "NUMBER_OF_PROCESSES" variable in the "Config.py" module of the "Utility" package. The value should not be larger than the number of CPU cores your computer has.

·         To evaluate tagging accuracy, we can employ the Eval.py module in the Utility package:

Utility$ python Eval.py PATH-TO-TAGGED-TEST-CORPUS PATH-TO-GOLD-TEST-CORPUS

Example 3: Utility$ python Eval.py ../data/rawTest.TAGGED ../data/goldTest

·         Use RDRPOSTagger4En.py and RDRPOSTagger4Vn.py when retraining tagging models for English with Penn Treebank POS tags and Vietnamese with VietTreebank/VLSP POS tags, respectively.
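As an illustration of the evaluation step above, the following sketch computes token-level tagging accuracy from a tagged corpus and a gold corpus, both in the WORD/TAG format. It is not the code of Eval.py, whose exact metric and output we do not reproduce here; the function name tagging_accuracy and the choice to compare the text after the last "/" of each token are assumptions.

```python
def tagging_accuracy(tagged_lines, gold_lines):
    """Return the percentage of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for tagged_line, gold_line in zip(tagged_lines, gold_lines):
        tagged_tokens = tagged_line.split()
        gold_tokens = gold_line.split()
        if len(tagged_tokens) != len(gold_tokens):
            raise ValueError("Token count mismatch between tagged and gold sentence")
        for tagged_token, gold_token in zip(tagged_tokens, gold_tokens):
            # Compare only the tag, i.e. the part after the last "/".
            if tagged_token.rpartition("/")[2] == gold_token.rpartition("/")[2]:
                correct += 1
            total += 1
    if total == 0:
        return 0.0
    return 100.0 * correct / total
```

The two arguments can be any iterables of lines, for example two open file objects for rawTest.TAGGED and goldTest.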

3. Use pre-trained POS and morphological tagging models

Pre-trained models for POS tagging:

Language    Corpus                                Model                         Lexicon
English     Penn WSJ section 00-18 [M93]          ../Models/POS/English.RDR     ../Models/POS/English.DICT
French      French Treebank [A03]                 ../Models/POS/French.RDR      ../Models/POS/French.DICT
German      TIGER Corpus [B04]                    ../Models/POS/German.RDR      ../Models/POS/German.DICT
Hindi       Hindi Treebank [P09]                  ../Models/POS/Hindi.RDR       ../Models/POS/Hindi.DICT
Italian     ISDT Treebank [B13]                   ../Models/POS/Italian.RDR     ../Models/POS/Italian.DICT
Thai        ORCHID Corpus [S97]                   ../Models/POS/Thai.RDR        ../Models/POS/Thai.DICT
Vietnamese  VLSP 2013 POS-annotated corpus [N09]  ../Models/POS/Vietnamese.RDR  ../Models/POS/Vietnamese.DICT

Pre-trained models for the combined POS and morphological (POS+MORPH) tagging:

Language    Corpus                                Model                           Lexicon
Bulgarian   BulTreeBank-Morph [S04]               ../Models/MORPH/Bulgarian.RDR   ../Models/MORPH/Bulgarian.DICT
Czech       Prague Dependency Treebank 2.5 [B12]  ../Models/MORPH/Czech.RDR       ../Models/MORPH/Czech.DICT
Dutch       Lassy Small Corpus [N13]              ../Models/MORPH/Dutch.RDR       ../Models/MORPH/Dutch.DICT
French      French Treebank [A03]                 ../Models/MORPH/French.RDR      ../Models/MORPH/French.DICT
German      TIGER Corpus [B04]                    ../Models/MORPH/German.RDR      ../Models/MORPH/German.DICT
Portuguese  Tycho Brahe Corpus [G10]              ../Models/MORPH/Portuguese.RDR  ../Models/MORPH/Portuguese.DICT
Spanish     IULA LSP Treebank [M12]               ../Models/MORPH/Spanish.RDR     ../Models/MORPH/Spanish.DICT
Swedish     Stockholm-Umeå Corpus 3.0 [S12]       ../Models/MORPH/Swedish.RDR     ../Models/MORPH/Swedish.DICT

For each language other than English, we trained the tagging model on nine-tenths (9/10) of the corresponding corpus, except for Vietnamese, where we used only four-fifths (4/5) of the VLSP 2013 POS-annotated corpus. See the experimental results in our AI Communications article.
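The naming pattern of the model and lexicon files listed above can be captured in a small helper. This is only a convenience sketch: the function pretrained_paths, its parameters and the two language tuples are hypothetical names introduced here, and the default models_dir simply mirrors the relative paths shown in the tables.

```python
# Languages listed in the two tables above.
POS_LANGUAGES = ("English", "French", "German", "Hindi", "Italian", "Thai", "Vietnamese")
MORPH_LANGUAGES = ("Bulgarian", "Czech", "Dutch", "French", "German",
                   "Portuguese", "Spanish", "Swedish")


def pretrained_paths(language, task="POS", models_dir="../Models"):
    """Return (model_path, lexicon_path) for a listed language and task."""
    available = {"POS": POS_LANGUAGES, "MORPH": MORPH_LANGUAGES}
    if task not in available:
        raise ValueError("task must be 'POS' or 'MORPH'")
    if language not in available[task]:
        raise ValueError("No pre-trained %s model listed for %s" % (task, language))
    base = "%s/%s/%s" % (models_dir, task, language)
    return base + ".RDR", base + ".DICT"
```

For instance, pretrained_paths("German", "MORPH") gives the pair of paths used for German POS+MORPH tagging, and asking for a combination that is not listed (such as English with "MORPH") raises a ValueError.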

·         To use a pre-trained model for POS or POS+MORPH tagging on a raw text corpus in Python, we run:

pSCRDRtagger$ python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example 4: pSCRDRtagger$ python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest

Example 5: pSCRDRtagger$ python RDRPOSTagger.py tag ../Models/MORPH/German.RDR ../Models/MORPH/German.DICT ../data/GermanRawTest

Note that each line in the input raw text corpus must be a tokenized/word-segmented sentence. To use RDRPOSTagger programmatically, follow lines 92-98 of the RDRPOSTagger.py module in the pSCRDRtagger package. Here is an example:

# Assumes this runs inside the RDRPOSTagger.py module (pSCRDRtagger package),
# where RDRPOSTagger and readDictionary are already in scope
r = RDRPOSTagger()
# Load the POS tagging model for French
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR")
# Load the lexicon for French
DICT = readDictionary("../Models/POS/French.DICT")
# Tag a tokenized/word-segmented sentence
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe .")

·         Use RDRPOSTagger4En.py and RDRPOSTagger4Vn.py instead of RDRPOSTagger.py for pre-trained English and Vietnamese POS tagging models, respectively.

·         To utilize a pre-trained tagging model for POS or POS+MORPH tagging on a raw text corpus in Java, please see Section 5.

4. Combine RDRPOSTagger with an external initial tagger

·         To train RDRPOSTagger on the output of an external initial POS or POS+MORPH tagger:

pSCRDRtagger$ python ExtRDRPOSTagger.py train PATH-TO-GOLD-STANDARD-TRAINING-CORPUS PATH-TO-TRAINING-CORPUS-INITIALIZED-BY-EXTERNAL-TAGGER

Example 6: pSCRDRtagger$ python ExtRDRPOSTagger.py train ../data/goldTrain ../data/initTrain

Here the initialized training corpus initTrain is produced by running the external initial tagger (POS or POS+MORPH) on the raw text extracted from the gold standard training corpus goldTrain.

An .RDR trained model file, for example initTrain.RDR, will be generated in the same directory containing the initialized training corpus.

·         To use the trained model for POS or POS+MORPH tagging on a test corpus whose words have already been tagged by the external initial tagger:

pSCRDRtagger$ python ExtRDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-TEST-CORPUS-INITIALIZED-BY-EXTERNAL-TAGGER

Example 7: pSCRDRtagger$ python ExtRDRPOSTagger.py tag ../data/initTrain.RDR ../data/initTest
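Preparing input for the external initial tagger requires the raw text of the gold standard corpus, as noted above. A minimal sketch of that extraction step is given below. The function names strip_tags and write_raw_corpus are hypothetical, and we assume UTF-8 files in which the tag follows the last "/" of each token.

```python
import io


def strip_tags(tagged_line):
    """Turn "WORD/TAG WORD/TAG ..." into "WORD WORD ...", splitting on the last "/"."""
    words = []
    for token in tagged_line.split():
        word, separator, _ = token.rpartition("/")
        # A token without "/" has no tag; keep it unchanged.
        words.append(word if separator else token)
    return u" ".join(words)


def write_raw_corpus(gold_path, raw_path):
    """Write the untagged text of gold_path to raw_path, one sentence per line."""
    with io.open(gold_path, encoding="utf-8") as gold_file:
        with io.open(raw_path, "w", encoding="utf-8") as raw_file:
            for line in gold_file:
                raw_file.write(strip_tags(line) + u"\n")
```

The resulting raw file can then be passed to the external initial tagger, whose output (converted to the WORD/TAG format if necessary) plays the role of initTrain.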

5. Speed up tagging process with an implementation in Java

·        To utilize a pre-trained model for POS or POS+MORPH tagging on a raw text corpus:

jSCRDRTagger$ java RDRPOSTagger PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example 8: jSCRDRTagger$ java RDRPOSTagger ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest

Example 9: jSCRDRTagger$ java RDRPOSTagger ../Models/MORPH/German.RDR ../Models/MORPH/German.DICT ../data/GermanRawTest

RDRPOSTagger has an additional parameter specialized for POS tagging in English and Vietnamese:

Example 10: jSCRDRTagger$ java RDRPOSTagger en ../Models/POS/English.RDR ../Models/POS/English.DICT ../data/en/rawTest

Example 11: jSCRDRTagger$ java RDRPOSTagger vn ../Models/POS/Vietnamese.RDR ../Models/POS/Vietnamese.DICT ../data/vn/rawTest

·         In case of using an external initial POS or POS+MORPH tagger:

jSCRDRTagger$ java RDRPOSTagger ex PATH-TO-TRAINED-MODEL PATH-TO-TEST-CORPUS-INITIALIZED-BY-EXTERNAL-TAGGER

Example 12: jSCRDRTagger$ java RDRPOSTagger ex ../data/initTrain.RDR ../data/initTest

·         Recompile if there is any problem: jSCRDRTagger$ javac -encoding UTF-8 RDRPOSTagger.java
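When the Java tagger is called from a Python pipeline, the invocations shown in this section can be assembled programmatically. The sketch below only builds the argument lists given in Examples 8-12; the helper names java_tagger_command and run_java_tagger, their parameters and the tagger_dir default are hypothetical, and the code assumes that java is on the system path and that the command is run from the Java package directory.

```python
import subprocess


def java_tagger_command(model, corpus, lexicon=None, mode=None):
    """Build the argument list for the Java tagger invocations shown above.

    mode: None (general model), "en" or "vn" (specialized POS tagging),
          or "ex" (corpus already tagged by an external initial tagger).
    """
    if mode not in (None, "en", "vn", "ex"):
        raise ValueError("mode must be None, 'en', 'vn' or 'ex'")
    command = ["java", "RDRPOSTagger"]
    if mode is not None:
        command.append(mode)
    command.append(model)
    if mode == "ex":
        if lexicon is not None:
            raise ValueError("the 'ex' mode takes no lexicon")
    else:
        if lexicon is None:
            raise ValueError("a lexicon path is required unless mode is 'ex'")
        command.append(lexicon)
    command.append(corpus)
    return command


def run_java_tagger(command, tagger_dir="."):
    """Run the command from the Java package directory; raises on a non-zero exit."""
    subprocess.check_call(command, cwd=tagger_dir)
```

Keeping the command construction separate from its execution makes it easy to inspect or log the exact command before running it.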

References

[M93] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. http://www.cis.upenn.edu/~treebank/

[A03] A. Abeillé, L. Clément, and F. Toussenel. Building a Treebank for French. In Treebanks, volume 20 of Text, Speech and Language Technology, pages 165–187. 2003. http://www.llf.cnrs.fr/en/Gens/Abeille/French-Treebank-fr.php

[B04] S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. TIGER: Linguistic Interpretation of a German Corpus. Research on Language and Computation, 2(4):597–620, 2004. http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html

[P09] M. Palmer, R. Bhatt, B. Narasimhan, O. Rambow, D. M. Sharma, and F. Xia. Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and Phrase Structure. In Proceedings of 7th International Conference on Natural Language Processing, pages 261–268, 2009. http://verbs.colorado.edu/hindiurdu/index.html

[B13] C. Bosco, S. Montemagni, and M. Simi. Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69, 2013. http://medialab.di.unipi.it/wiki/ISDT

[S97] V. Sornlertlamvanich, T. Charoenporn, and H. Isahara. ORCHID: Thai Part-Of-Speech Tagged Corpus, 1997. http://culturelab.in.th/files/orchid.html

[N09] P. T. Nguyen, X. L. Vu, T. M. H. Nguyen, V. H. Nguyen, and H. P. Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proceedings of the Third Linguistic Annotation Workshop, pages 182–185, 2009. http://vlsp.vietlp.org:8080/

[S04] K. Simov, P. Osenova, A. Simov, and M. Kouylekov. Design and Implementation of the Bulgarian HPSG-based Treebank. Research on Language and Computation, 2:495–522, 2004. http://www.bultreebank.org

[B12] E. Bejček, J. Panevová, J. Popelka, P. Straňák, M. Ševčíková, J. Štěpánek, and Z. Žabokrtský. Prague Dependency Treebank 2.5 - a Revisited Version of PDT 2.0. In Proceedings of 24th International Conference on Computational Linguistics, pages 231–246, 2012. https://ufal.mff.cuni.cz/pdt2.5/

[N13] G. Noord, G. Bouma, F. Eynde, D. Kok, J. Linde, I. Schuurman, E. Sang, and V. Vandeghinste. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 147–164, 2013. http://www.let.rug.nl/~vannoord/Lassy/

[G10] C. Galves and P. Faria. Tycho Brahe Parsed Corpus of Historical Portuguese, 2010. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html

[M12] M. Marimon, B. Fisas, N. Bel, M. Villegas, J. Vivaldi, S. Torner, M. Lorente, and S. Vázquez. The IULA Treebank. In Proceedings of the eighth international conference on Language Resources and Evaluation, pages 1920–1926, 2012. https://www.iula.upf.edu/recurs01_tbk_uk.htm

[S12] SUC-3.0. The Stockholm-Umeå Corpus (SUC) 3.0, 2012. http://spraakbanken.gu.se/eng/resource/suc3


Last updated: December 21, 2015