Software tools

On this page you will find a sample of the Natural Language Processing tools built by the COLE Group.

COMPAS (COMpiler for PArsing Schemata)

COMPAS (COMpiler for PArsing Schemata) is a system that can be used to automatically compile formal specifications of parsing algorithms (in the form of parsing schemata) to efficient Java implementations of the corresponding parsers.

You can find more information, and download the source code and binaries of the system at the COMPAS home page.
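
To give a feel for what a parsing schema describes, here is a minimal sketch of the CYK algorithm written as a deduction system: items of the form [A, i, j] and one deduction step that combines two adjacent items. It is only an illustration in Python; it is not COMPAS's input notation nor the Java code COMPAS generates, and the toy grammar and the names `cyk_deduce` and `recognizes` are assumptions made for this example.

```python
from collections import deque

# Hypothetical toy grammar in Chomsky normal form (not taken from COMPAS).
UNARY = {("Det", "the"), ("N", "dog"), ("N", "cat"), ("V", "sees")}
BINARY = {("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")}


def cyk_deduce(words):
    """Close the set of CYK items (A, i, j) under the schema's deduction steps."""
    chart = set()
    agenda = deque()

    def add(item):
        # Each item is stored once and scheduled once for processing.
        if item not in chart:
            chart.add(item)
            agenda.append(item)

    # Initial items: [A, i, i+1] whenever A -> words[i] is a rule.
    for i, word in enumerate(words):
        for a, terminal in UNARY:
            if terminal == word:
                add((a, i, i + 1))

    # Deduction step: [B, i, k], [C, k, j] |- [A, i, j] if A -> B C is a rule.
    while agenda:
        x, i, j = agenda.popleft()
        for y, k, l in list(chart):
            for a, b, c in BINARY:
                if b == x and c == y and j == k:   # popped item on the left
                    add((a, i, l))
                if b == y and c == x and l == i:   # popped item on the right
                    add((a, k, j))
    return chart


def recognizes(words, start="S"):
    """The sentence is accepted if the final item [start, 0, n] is deduced."""
    return (start, 0, len(words)) in cyk_deduce(words)
```

For instance, `recognizes("the dog sees the cat".split())` evaluates to `True`, because the item `("S", 0, 5)` can be deduced from the initial items.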

LIBNADFA, a library for the efficient management of very large dictionaries

A library written in C that manages very large dictionaries efficiently and with minimal memory requirements, thanks to the use of Numbered Acyclic Deterministic Finite-State Automata. By dictionary we mean any data structure that associates its entries (typically words) with any kind of information.

This library integrates the techniques for the construction of minimal automata proposed by Jan Daciuk et al. in the article "Incremental Construction of Minimal Acyclic Finite-State Automata" with the techniques for managing the information associated with entries (words) proposed by Jorge Graña et al. in "Compilation Methods of Minimal Acyclic Finite-State Automata for Large Dictionaries".
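
The idea behind a numbered automaton can be sketched as follows: each state stores how many words can be recognized from it, which lets every word be mapped to its position in alphabetical order and back, so the associated information can live in a plain array indexed by that position. The sketch below is a simplification written in Python: it uses an unminimized trie instead of a minimal automaton, and none of the names (`State`, `build`, `word_to_index`, `index_to_word`) belong to the LIBNADFA API.

```python
class State:
    def __init__(self):
        self.edges = {}     # character -> State
        self.final = False  # True if a word ends in this state
        self.count = 0      # number of words recognized from this state


def number(state):
    """Fill in `count` bottom-up and return it."""
    state.count = int(state.final) + sum(number(s) for s in state.edges.values())
    return state.count


def build(words):
    """Build a numbered trie (a real library would also minimize it)."""
    root = State()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, State())
        node.final = True
    number(root)
    return root


def word_to_index(root, word):
    """Return the 0-based alphabetical position of `word`, or None if absent."""
    index, node = 0, root
    for ch in word:
        if ch not in node.edges:
            return None
        if node.final:
            index += 1  # the word ending here precedes all its extensions
        for label in node.edges:
            if label < ch:
                index += node.edges[label].count  # skip smaller branches
        node = node.edges[ch]
    return index if node.final else None


def index_to_word(root, index):
    """Inverse mapping: recover the word stored at position `index`."""
    if not 0 <= index < root.count:
        return None
    node, chars = root, []
    while True:
        if node.final:
            if index == 0:
                return "".join(chars)
            index -= 1
        for label in sorted(node.edges):
            child = node.edges[label]
            if index < child.count:
                chars.append(label)
                node = child
                break
            index -= child.count


# Usage: the information for each entry lives in an array parallel to the indices.
root = build(["casa", "caso", "cosa", "sobre"])
info = ["noun", "noun", "noun", "preposition"]  # hypothetical payload
```

With this data, `info[word_to_index(root, "sobre")]` looks up the entry for "sobre" without storing the word itself next to its information.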

This library is distributed under the terms of the GNU General Public License version 3.

Preprocessor

A linguistically-motivated preprocessor module for Spanish that performs tasks such as format conversion, tokenization, sentence segmentation, morphological pretagging, contraction splitting, separation of enclitic pronouns from verbal stems, expression identification, numeral identification and proper noun recognition. It is described in detail in the following publications:

Jorge Graña, Fco. Mario Barcala, and Jesús Vilares, Formal Methods of Tokenization for Part-of-Speech Tagging, in Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, volume 2276 of Lecture Notes in Computer Science, pp. 240-249, Springer-Verlag, Berlin-Heidelberg-New York, 2002. ISSN 0302-9743 / ISBN 3-540-43219-1. [GraBarVil2002a.ps.gz, 65 K]
Fco. Mario Barcala, Jesús Vilares, Miguel A. Alonso, Jorge Graña and Manuel Vilares, Tokenization and Proper Noun Recognition for Information Retrieval, in A Min Tjoa and Roland R. Wagner (eds.), Thirteenth International Workshop on Database and Expert Systems Applications. 2-6 September 2002. Aix-en-Provence, France, pp. 246-250, IEEE Computer Society Press, Los Alamitos, California, USA, 2002. ISSN 1529-4188 / ISBN 0-7695-1668-8. [BarVilAloGraVil2002a.ps.gz, 34 K]

To download this software, please contact Jesús Vilares (jvilares@udc.es).
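
As a rough illustration of three of the tasks listed above (tokenization, contraction splitting and sentence segmentation), the following Python sketch handles a couple of Spanish contractions with a regular expression and a lookup table. It is a toy under stated assumptions, not the preprocessor's actual method: the table `CONTRACTIONS`, the pattern `TOKEN` and both function names are invented for this example, and the expanded forms are emitted in lowercase regardless of the original capitalization.

```python
import re

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"del": ["de", "el"], "al": ["a", "el"]}

# Numbers (with internal . or ,), then words, then single punctuation marks.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, expanding the contractions in the table."""
    tokens = []
    for match in TOKEN.finditer(text):
        tok = match.group()
        parts = CONTRACTIONS.get(tok.lower())
        tokens.extend(parts if parts else [tok])
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each ., ? or ! token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "?", "!"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

For example, `tokenize("Voy al mercado del pueblo.")` returns `["Voy", "a", "el", "mercado", "de", "el", "pueblo", "."]`. A real preprocessor also has to deal with abbreviations, enclitic pronouns, multiword expressions and proper nouns, none of which this sketch attempts.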

PoS Tagger and Lemmatizer (MrTagoo)

MrTagoo is a high-performance part-of-speech tagger and lemmatizer based on a second-order Hidden Markov Model. It also provides a very efficient storage and search structure based on finite-state automata, handling of unknown words, and the ability to integrate external dictionaries into the probabilistic framework defined by the Hidden Markov Model. Support for ambiguous segmentations is currently being implemented.
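
To illustrate what tagging with a second-order Hidden Markov Model involves, the sketch below runs the Viterbi algorithm over states made of tag pairs, so that each transition probability is conditioned on the two previous tags. It is written in Python with hypothetical toy probabilities and is not MrTagoo's code: the function name, the table layout and the `floor` value used in place of real smoothing are all assumptions, and unknown-word handling and external dictionaries are left out.

```python
import math


def viterbi_trigram(words, tags, trans, emit, floor=1e-12):
    """Most likely tag sequence under a second-order (trigram) HMM.

    trans[(u, v, t)] = P(t | u, v) and emit[(t, w)] = P(w | t).
    Missing entries get the probability `floor`, a crude stand-in for smoothing.
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    if not words:
        return []

    start = "<s>"
    # best[(u, v)]: best log-probability of a prefix whose last two tags are u, v.
    best = {(start, start): 0.0}
    backpointers = []
    for word in words:
        new_best, back = {}, {}
        for (u, v), score in best.items():
            for t in tags:
                s = score + logp(trans, (u, v, t)) + logp(emit, (t, word))
                if (v, t) not in new_best or s > new_best[(v, t)]:
                    new_best[(v, t)] = s
                    back[(v, t)] = u  # tag two positions before t
        best = new_best
        backpointers.append(back)

    # Recover the sequence by following the backpointers from the best end state.
    state = max(best, key=best.get)
    tagged = [state[1]]
    for back in reversed(backpointers[1:]):
        previous = back[state]
        tagged.append(state[0])
        state = (previous, state[0])
    return tagged[::-1]


# Hypothetical toy model, for illustration only.
TAGS = ["D", "N", "V"]
EMIT = {("D", "the"): 1.0, ("N", "dog"): 0.9, ("V", "dog"): 0.1,
        ("V", "barks"): 0.8, ("N", "barks"): 0.2}
TRANS = {("<s>", "<s>", "D"): 0.8, ("<s>", "<s>", "N"): 0.2,
         ("<s>", "D", "N"): 0.9, ("<s>", "D", "V"): 0.1,
         ("D", "N", "V"): 0.8, ("D", "N", "N"): 0.2}
```

With these tables, `viterbi_trigram(["the", "dog", "barks"], TAGS, TRANS, EMIT)` returns `["D", "N", "V"]`.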

MrTagoo is described in detail in the following publications:

Jorge Graña, Jean-Cédric Chappelier, and Manuel Vilares, Integrating external dictionaries into stochastic part-of-speech taggers, in Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, Nicolas Nicolov and Nikolai Nikolov (eds.), EuroConference Recent Advances in Natural Language Processing. Proceedings, pp. 122-128, Tzigov Chark, Bulgaria, 2001. ISBN 954-90906-1-2. [GraChaVil2001a.ps.gz, 76 K]
Jorge Graña, Fco. Mario Barcala, and Miguel A. Alonso, Compilation Methods of Minimal Acyclic Finite-State Automata for Large Dictionaries, in Bruce W. Watson and Derick Wood (eds.), Implementation and Application of Automata, volume 2494 of Lecture Notes in Computer Science, pp. 135-148, Springer-Verlag, Berlin-Heidelberg-New York, 2002. ISSN 0302-9743 / ISBN 3-540-00400-9. [GraBarAlo2001a.ps.gz, 84 K]
Jorge Graña, Miguel A. Alonso and Manuel Vilares, A Common Solution for Tokenization and Part-of-Speech Tagging: One-Pass Viterbi Algorithm vs. Iterative Approaches, in Petr Sojka, Ivan Kopecek and Karel Pala (eds.), Text, Speech and Dialogue, volume 2448 of Lecture Notes in Artificial Intelligence, pp. 3-10, Springer-Verlag, Berlin-Heidelberg-New York, 2002. ISSN 0302-9743 / ISBN 3-540-44129-8. [GraAloVil2002a.ps.gz, 75 K]

To download this software, please contact Jorge Graña (grana@udc.es).

Definite Clause Grammar Parser

An efficient parser of Definite Clause Grammars guided by the LALR(1) automata constructed from the underlying context-free backbone. As a special feature, some kinds of non-terminating computations are detected and the resulting cycles on the logical arguments are represented in a compact manner.

This parser is described in detail in the following publication:

Manuel Vilares, David Cabrero and Miguel A. Alonso, On Non-Termination on DCGs, in Alexander Gelbukh (ed.), Topics in Computational Linguistics and Intelligent Text Processing, volume of Lecture Notes in Computer Science, Springer-Verlag, Berlin-Heidelberg-New York, 2003. ISSN 0302-9743. [VilCabAlo2003a.ps.gz, 92 K]

To download this software, please contact David Cabrero (cabrero@udc.es).