:: EGOTHOR

What Happens When You Index

This is a short guide to what is created when you index files using Egothor. A number of files are created in the index directory(ies), each of which has a specific function and is used by Egothor for a specific task. Below is a table with explanations of each file.

Table 2.2.  Index Files

Filename   Description
*.sep      Separate inverted lists that can be modified without rebuilding the index
doc.btm    Bitmap of removed documents (1 = document was removed, 0 = otherwise)
doc.dta    Document metadata
doc.idx    List of offsets into doc.dta; "seek(i*8); readLong;" gives the offset of the metadata of the i-th document
doc.mta    List of the Java class names of the document metadata used in doc.dta
ils.dta    All inverted lists (except those stored in *.sep)
prx.dta    Offset positions of all terms in all documents
trm.dta    List of all terms in A-Z order, with each term's offsets into ils.dta and prx.dta
trm.idx    Index over trm.dta
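
The doc.idx lookup described in the table can be sketched in Java as follows. This is only an illustration, not Egothor's own code: the class and method names are hypothetical, and it assumes that i counts from 0 and that each entry is an 8-byte big-endian long, which is what java.io.RandomAccessFile.readLong() expects.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of Egothor.
final class DocIdxSketch {
    /**
     * Returns the offset into doc.dta of the metadata of the i-th document.
     * Assumes i is 0-based and that doc.idx holds one 8-byte long per document.
     */
    static long metadataOffset(RandomAccessFile docIdx, long i) throws IOException {
        docIdx.seek(i * 8);        // jump to the i-th 8-byte entry
        return docIdx.readLong();  // read the stored offset (big-endian)
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocIdxSketch path/to/doc.idx 3
        try (RandomAccessFile docIdx = new RandomAccessFile(args[0], "r")) {
            System.out.println(metadataOffset(docIdx, Long.parseLong(args[1])));
        }
    }
}
```

Because every entry has the same size, the offset of any document's metadata is found with a single seek, without scanning the file.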

The last file, trm.idx, requires some further explanation. When the index is constructed, all terms are sorted, divided into groups of up to n terms, and the last term of each group is moved forward so it becomes the head element.
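
The grouping step can be modelled in memory with a short Java sketch. It is a minimal illustration under assumptions, not Egothor's implementation: the class and method names are hypothetical, terms are compared with Java's natural String order (Egothor's actual ordering may differ), and the on-disk offsets and skip fields are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the trm.idx block layout, not part of Egothor.
final class TrmIdxLayoutSketch {
    /** Sorts the terms, splits them into groups of up to n, and moves each group's last term to the head. */
    static List<List<String>> layout(List<String> terms, int n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted); // assumption: natural String order
        List<List<String>> blocks = new ArrayList<>();
        for (int start = 0; start < sorted.size(); start += n) {
            int end = Math.min(start + n, sorted.size());
            List<String> block = new ArrayList<>();
            block.add(sorted.get(end - 1));               // highest term of the group becomes the head
            block.addAll(sorted.subList(start, end - 1)); // the rest keep their sorted order
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("Mali", "Qatar", "Canada", "Vietnam", "Laos", "Poland",
                "Nigeria", "Georgia", "Tanzania", "England", "Djibouti", "Azerbaijan");
        for (List<String> block : layout(terms, 8)) {
            System.out.println(block);
        }
    }
}
```

Putting the highest term first means a reader can decide from the head alone whether a whole group can be skipped.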

Example 2.4. Looking for a term in trm.idx

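Imagine an index of 12 terms, divided into groups of up to 8 terms and stored in trm.idx in the order below (the last term of each group has been moved to the head):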

Nigeria [offset to trm.dta] [skip this block of 8 elements]
Azerbaijan [offset to trm.dta]
Canada [offset to trm.dta]
Djibouti [offset to trm.dta]
England [offset to trm.dta]
Georgia [offset to trm.dta]
Laos [offset to trm.dta]
Mali [offset to trm.dta]
Vietnam [offset to trm.dta] [skip this block of 4 elements = **end**]
Poland [offset to trm.dta]
Qatar [offset to trm.dta]
Tanzania [offset to trm.dta]

You are looking for the term "Poland". Egothor reads the head term of each block until it finds one that is lexicographically higher than the term you are looking for. The head of the first block, "Nigeria", is lower than "Poland", so Egothor knows that it can skip that entire block and move on to the next one. On reaching "Vietnam", which is higher than "Poland", it knows that the term must be in that block, and it returns the appropriate result.

Example 2.5. Looking for a term NOT in trm.idx

Now, let's say that you are looking for the term "Zanzibar". Egothor reads the file as above, but now when it gets to "Vietnam" there's a special instruction that says, "This is the highest term available; if you're looking for anything higher there won't be a result." In this way, a lot of unnecessary searching is avoided.

Example 2.6. Looking for a term that MIGHT be in trm.idx

Finally, let's say that you are looking for the term "Uruguay". Egothor reads the file as above and again reaches "Vietnam", which is marked as the highest term available. "Uruguay" is lower than "Vietnam", so it could only be in that last block. However, there is nothing between "Tanzania" and "Vietnam", so the term is not in the index and there is no need to go on searching.
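
The three lookups above can be traced with a small in-memory Java model. This is a sketch under assumptions rather than Egothor's actual code: the class and method names are hypothetical, each block is a plain list with its head term first, the "skip this block" field is modelled by simply moving to the next list, and terms are compared with Java's natural String order.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a trm.idx lookup, not part of Egothor.
final class TrmIdxLookupSketch {
    /** Returns true if the term is present in the given blocks (head term first in each block). */
    static boolean contains(List<List<String>> blocks, String term) {
        for (List<String> block : blocks) {
            String head = block.get(0);      // highest term in this block
            int cmp = term.compareTo(head);
            if (cmp == 0) return true;       // the term is the head itself
            if (cmp > 0) continue;           // term is above this block: skip it entirely
            // term is below the head, so it can only be in this block
            return block.subList(1, block.size()).contains(term);
        }
        return false;                        // term is above the last head: no result possible
    }

    public static void main(String[] args) {
        List<List<String>> blocks = Arrays.asList(
                Arrays.asList("Nigeria", "Azerbaijan", "Canada", "Djibouti",
                        "England", "Georgia", "Laos", "Mali"),
                Arrays.asList("Vietnam", "Poland", "Qatar", "Tanzania"));
        for (String term : Arrays.asList("Poland", "Zanzibar", "Uruguay")) {
            System.out.println(term + ": " + contains(blocks, term));
        }
    }
}
```

In this model "Poland" skips the first block and is found in the second, "Zanzibar" is rejected because it is higher than every head, and "Uruguay" is rejected after checking only the last block.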

© 2003-2004 Egothor Developers