Stemmer

Abstract

This short guide to the Egothor stemmer module covers two things: how to use the default stemmer offered as part of this project, and how to build a stemmer for a new language if the default stemmers do not suit your needs.

The stemmer you can build with Egothor is one of the fastest, as fast as a simple table-lookup stemmer. In addition to its speed, it recognizes the same common prefixes, infixes, and suffixes that many popular stemmers do. Moreover, the Egothor stemmer can automatically learn all the (main) stemming rules from a sample set of transformations for the given language.

Theory. The Egothor stemmer generates the final stem with time complexity O(l), where l is the length of the original word form. It uses up to 50% less memory than similar stemmers and is able to process complex languages (e.g., German, with its infixes and compound words) more accurately than any other stemmer we have tried.
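To make the O(l) claim concrete, here is a minimal sketch of a stemmer that walks a trie once over the characters of a word. It is an illustration only, not Egothor's actual data structure: the reversed-suffix trie, the (strip, append) rule format, and the names TrieNode, insert_rule and stem are all assumptions made for this example.

```python
class TrieNode:
    """One node of a trie keyed on word endings (hypothetical structure)."""
    __slots__ = ("children", "rule")

    def __init__(self):
        self.children = {}
        self.rule = None  # (chars_to_strip, text_to_append) or None


def insert_rule(root, ending, rule):
    """Store a rule for words that end with `ending`."""
    node = root
    for ch in reversed(ending):
        node = node.children.setdefault(ch, TrieNode())
    node.rule = rule


def stem(root, word):
    """Walk the trie from the last character backwards: at most len(word) steps."""
    node = root
    best = None
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break
        if node.rule is not None:
            best = node.rule  # keep the rule for the longest matching ending
    if best is None:
        return word  # no rule matched, leave the word unchanged
    strip, append = best
    return word[:len(word) - strip] + append


# Hand-written rules for illustration; a real table would be learned from samples.
root = TrieNode()
insert_rule(root, "ing", (3, ""))
insert_rule(root, "es", (2, ""))
insert_rule(root, "men", (2, "an"))

print(stem(root, "abolishing"))  # abolish
print(stem(root, "yes-men"))     # yes-man
print(stem(root, "about"))       # about
```

Each character of the word is examined at most once, so the lookup cost grows linearly with the word length and does not depend on how many rules are stored.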

The stemmer was tested with 11 European languages.

Default Egothor Stemmer. To begin, download the language tables that Egothor uses to learn stemming rules. You can download the tables for more than 11 European languages as a separate package that is available on our download page (see stemmer-data package).

The next step is to install the package into the etc/ directory of your Egothor installation. For example, if you want to use the English dictionary, you should end up with the directory etc/stemmer/us_uk/.

The format of the sample file is as follows:

'nt 't nt t
:
abolish abolishes abolishing abolished
:
about-turn about-turns
:
yes-man yes-men
:

The first word is the stem and the rest of the line contains all variants of the stem. The file need not be a complete list of all transformations to their respective stems. As we stated before, Egothor will learn the language given just a sample set of transformations. We highly recommend that you use your own vocabulary for the language or slang you want to index. It will reduce the space and time requirements of the algorithms used by Egothor.
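The line format described above can be read with a few lines of code. The following is a minimal sketch, not part of Egothor: the function name parse_stem_samples is hypothetical, and it assumes that the ":" lines in the listing stand for omitted entries rather than being part of the file.

```python
def parse_stem_samples(text):
    """Map every word form in a sample file to its stem.

    Each non-blank line is read as: stem variant1 variant2 ...
    """
    form_to_stem = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        stem, variants = tokens[0], tokens[1:]
        form_to_stem.setdefault(stem, stem)  # the stem maps to itself
        for variant in variants:
            form_to_stem.setdefault(variant, stem)
    return form_to_stem


sample = """\
'nt 't nt t
abolish abolishes abolishing abolished
about-turn about-turns
yes-man yes-men
"""

table = parse_stem_samples(sample)
print(table["abolishing"])  # abolish
print(table["yes-men"])     # yes-man
```

A mapping like this is handy for checking your own vocabulary file (for example, spotting a variant listed under two different stems) before handing it to the compilation step.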

The last step is to compile the sample file into the Egothor stemming structure (a stemmer table) with a suitable learning method. If you want to know more about this process, we recommend the papers on this topic by Leo Galambos; the process is based on the latest paper, from 06/2002. In this guide, we will use the -0E2 method, which performs no multi-trie operations (multi-trie operations are the best methods for German and other languages with compound words).
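As a rough intuition for what "learning" from a sample file can mean, the sketch below derives a suffix-rewrite rule from each (variant, stem) pair and counts how often each rule occurs. This is a deliberately simplified stand-in, not the method from the papers mentioned above and not what the -0E2 option does; the names suffix_rule and learn_rules are hypothetical.

```python
import os
from collections import Counter


def suffix_rule(variant, stem):
    """Return (ending_to_remove, ending_to_add) that turns variant into stem."""
    common = len(os.path.commonprefix([variant, stem]))
    return variant[common:], stem[common:]


def learn_rules(lines):
    """Count the suffix rules implied by sample lines of the form: stem variant..."""
    rules = Counter()
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # nothing to learn from a line without variants
        stem = tokens[0]
        for variant in tokens[1:]:
            rules[suffix_rule(variant, stem)] += 1
    return rules


samples = [
    "abolish abolishes abolishing abolished",
    "about-turn about-turns",
    "yes-man yes-men",
]

rules = learn_rules(samples)
print(rules[("ing", "")])   # 1
print(rules[("en", "an")])  # 1
```

Because rules are derived from examples instead of being written by hand, a modest sample of transformations can generalize to word forms that never appear in the file, which is why the sample need not be complete.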

Make sure that you have Apache Ant installed on your machine, then run ant comp_stem in the root directory of the Egothor package, where build.xml is stored.

In the tmp/dist/stemmer-comp.zip file you will find the compiled stemmer structures for all languages in the etc/ directory. Pick your language (or rather its stemmer file from the zip archive) and copy it, for instance, to var/stemmer.comp.

To use the stemmer table in your indexing process, define the default stemmer property when you start Java, for example: java -Degothor.stemmer.default=var/stemmer.comp org.egothor.... where var/stemmer.comp is the path to the stemmer table.

Build Your Own Stemmer. If your language is not yet supported by our default stemmers, or you must process documents written in a specific slang, you will have to create your own stemmer table. In that case, we recommend that you follow the default process.

First, select the name (us_uk for English, es_es for Spanish, etc.) for your new table/language and create a directory with that name in the etc/ directory. Then put your stemmer file in this directory. The format of the file has been described above.

Second, execute ant comp_stem and follow the remaining steps as presented in the previous scenario.

If you succeed in creating a stemmer table for a language not yet supported by Egothor, please consider contributing it to the Egothor project.
