Indexing Local Pages Using the Indexer GUI

The Egothor Indexer. 

Egothor comes equipped with an indexing tool that is quite easy to use. Once you've compiled Egothor and set your classpath (make sure it includes all the jars in tmp/dist), all you need to do is run the following at the command prompt: currentDir> java org.egothor.test.Indexer
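If typing every jar into the classpath by hand is tedious, a small helper can assemble it. The sketch below is not part of Egothor. The class name ClasspathHelper is made up for illustration, and it assumes that tmp/dist is relative to the directory you run it from. It only prints a command line using the standard -cp option of the java launcher; it does not start the Indexer itself.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not shipped with Egothor.
public class ClasspathHelper {

    // Joins every *.jar name under distDir using the platform's path separator.
    static String buildClasspath(String distDir, List<String> fileNames) {
        List<String> entries = new ArrayList<>();
        for (String name : fileNames) {
            if (name.endsWith(".jar")) {
                entries.add(distDir + File.separator + name);
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Assumption: the jars live in tmp/dist below the current directory.
        File dist = new File("tmp" + File.separator + "dist");
        String[] names = dist.list();          // null if the directory is missing
        if (names == null) {
            names = new String[0];
        }
        Arrays.sort(names);                     // stable, readable ordering
        String cp = buildClasspath(dist.getPath(), Arrays.asList(names));
        System.out.println("java -cp " + cp + " org.egothor.test.Indexer");
    }
}
```

Copy the printed line into your shell, or set the classpath environment variable to the same value and run the command as shown above.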

Filling in the fields. 

Below is an image of the Indexer GUI that should pop up on your screen once you've run the above Java command.

The fields in the sections that do not require a path (all but Index, Source, and WWW) are set to the most desirable level by default. The charset should be your system charset. If your system charset is not in the properties file, the default charset, ISO-8859-1, is used; it is the most prevalent charset on the web.


Fill in the path where you'd like your index to reside or click the folder to browse for a directory you'd like.


Fill in the path where the pages you want to index are or click the folder to browse for a directory you'd like to index.


If you want search results to be available on the net, supply the web address under which the indexed pages are published. For instance, if you were to index the Egothor Javadocs for the Egothor web site, you'd fill in http://www.egothor.net/javadocs. If this field is left blank, the Source path is supplied to the application as the WWW value as well.


Set this to true or false depending on whether your language uses accent marks.
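To see why accent marks matter for matching, the following sketch shows one common way of folding accented characters to their base letters, using the standard java.text.Normalizer class. It is an illustration only: the class name AccentFolding is made up, and the source does not say how Egothor itself treats accents or which value of this setting corresponds to folding.

```java
import java.text.Normalizer;

// Illustrative only; not Egothor's implementation.
public class AccentFolding {

    // Decomposes each accented character into base letter plus combining
    // marks (NFD), then removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own charset.
        System.out.println(stripAccents("caf\u00e9 na\u00efve")); // prints "cafe naive"
    }
}
```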


Setting this to true lets the application ignore case in search terms. For example, a search for "chicago" will return hits for both "chicago" and "Chicago". A false setting forces the search mechanism to be case sensitive.
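The effect of this setting can be sketched in a few lines. The code below is a simplified illustration, not Egothor's search code: the class CaseFolding and the method termMatches are hypothetical, and a real index would normalize terms at indexing time rather than compare strings one by one.

```java
import java.util.Locale;

// Illustrative only; not Egothor's implementation.
public class CaseFolding {

    // With ignoreCase set, both terms are lower-cased before comparison;
    // otherwise they must match exactly.
    static boolean termMatches(String query, String indexed, boolean ignoreCase) {
        if (ignoreCase) {
            return query.toLowerCase(Locale.ROOT)
                        .equals(indexed.toLowerCase(Locale.ROOT));
        }
        return query.equals(indexed);
    }

    public static void main(String[] args) {
        System.out.println(termMatches("chicago", "Chicago", true));  // prints "true"
        System.out.println(termMatches("chicago", "Chicago", false)); // prints "false"
    }
}
```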


Snippets are the context surrounding the search term(s) in a hit. Setting this to true gives the searcher some context for the search term(s) in a document. The index will be larger with snippet support. If you are using Egothor with the Carrot2 result clustering application, snippet support must be on.
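As a rough picture of what a snippet is, the sketch below cuts a window of text around the first occurrence of a term. It is a minimal illustration under stated assumptions: the class SnippetSketch, the method snippet, and the radius parameter are all made up, the search is case sensitive, and Egothor's own snippet generation may work quite differently.

```java
// Illustrative only; not Egothor's implementation.
public class SnippetSketch {

    // Returns up to `radius` characters of context on each side of the first
    // occurrence of `term`, or an empty string if the term is absent.
    static String snippet(String text, String term, int radius) {
        int idx = text.indexOf(term);
        if (idx < 0) {
            return "";
        }
        int start = Math.max(0, idx - radius);
        int end = Math.min(text.length(), idx + term.length() + radius);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // prints "quick brown fox j"
        System.out.println(snippet("The quick brown fox jumps", "brown", 6));
    }
}
```

Storing enough text to produce such windows is what makes an index with snippet support larger.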


Set the application to be single- or multi-threaded. A single-threaded indexer will be slower but consume fewer system resources. Conversely, a multi-threaded indexer speeds up the indexing but also consumes more system resources. Set this according to the load level of your computer.


The tool will pick up your system's charset if it is one of the charsets listed in the file charsets.properties. If your system's charset is not on the list, simply add it to the properties file and restart the application. Use the charset in which the majority of your documents are written. If you are not sure, leave it at the default setting. More information about languages and charsets is available at [insert link here].
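The selection rule described above can be sketched as follows. This is an assumption-laden illustration, not the tool's code: the class CharsetChoice is hypothetical, the set of listed names is hard-coded with example values because the format and contents of charsets.properties are not described here, and names are compared exactly as spelled.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only; not Egothor's implementation.
public class CharsetChoice {

    static final String FALLBACK = "ISO-8859-1";

    // Uses the system charset when it is on the list, else the fallback.
    static String choose(String systemCharset, Set<String> listed) {
        return listed.contains(systemCharset) ? systemCharset : FALLBACK;
    }

    public static void main(String[] args) {
        // Assumption: example names standing in for charsets.properties.
        Set<String> listed = new HashSet<>(
                Arrays.asList("ISO-8859-1", "UTF-8", "windows-1250"));
        String system = Charset.defaultCharset().name();
        System.out.println("System charset: " + system);
        System.out.println("Indexer would use: " + choose(system, listed));
    }
}
```

Running it shows what your JVM reports as the system charset, which is the name you would need to find in, or add to, the properties file.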


This number represents a power of 2; thus the default value of 32 gives the index a capacity of 2^32 (4,294,967,296) documents. (As of this writing, on 9/3/03, Google was searching 3,307,998,701 documents.) Reducing or increasing this number will have no effect on performance.
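The arithmetic behind the capacity setting is easy to check. The snippet below only computes powers of two; the class IndexCapacity and its method are hypothetical names, and the upper bound of 62 is a limit of Java's long type, not of Egothor.

```java
// Illustrative arithmetic only.
public class IndexCapacity {

    // Returns 2^exponent as a long; valid for exponents 0..62.
    static long capacity(int exponent) {
        if (exponent < 0 || exponent > 62) {
            throw new IllegalArgumentException("exponent must be in 0..62");
        }
        return 1L << exponent;
    }

    public static void main(String[] args) {
        System.out.println(capacity(32)); // prints 4294967296
        System.out.println(capacity(20)); // prints 1048576
    }
}
```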


This is the number of documents that will be held in the cache as the index is built. Reducing or increasing this number has a proportionate effect on indexing speed and RAM demand. Thus, doubling the default cache size will double the indexing speed at the cost of doubling the demand on your RAM.

© 2003-2004 Egothor Developers