What Happens When You Index

This is a short guide to what is created when you index files using Egothor. Indexing creates a number of files in the index directory (or directories), and Egothor uses each of them for a specific task. The table below explains each file.

Table 2.2. Index Files

Filename	Description
`*.sep`	separate inverted lists, which can be modified without rebuilding the index
`doc.btm`	stores a bitmap of removed documents (1 = document was removed, 0 = otherwise)
`doc.dta`	stores document metadata
`doc.idx`	the list of offsets into `doc.dta`; `seek(i*8); readLong();` gives the offset of the metadata of the i-th document
`doc.mta`	the list of Java class names of the document metadata used in `doc.dta`
`ils.dta`	stores all inverted lists (except those in `*.sep`)
`prx.dta`	stores the offset positions of all terms in all documents
`trm.dta`	list of all terms in A-Z order, with their respective offsets into `ils.dta` and `prx.dta`
`trm.idx`	index over `trm.dta`
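
The `doc.idx` lookup described in the table can be sketched in Java as follows. This is a minimal illustration, not Egothor's own code: the class name `DocIdxSketch` and the method `metadataOffset` are hypothetical, and the sketch assumes that each entry is an 8-byte big-endian long, which is the format `RandomAccessFile.readLong` reads.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper for illustration; not part of Egothor's API.
final class DocIdxSketch {

    // Returns the offset into doc.dta of the metadata of document i,
    // assuming doc.idx is a flat array of 8-byte longs.
    static long metadataOffset(File docIdx, long i) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(docIdx, "r")) {
            raf.seek(i * 8);       // each entry occupies 8 bytes
            return raf.readLong(); // big-endian 64-bit value
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: DocIdxSketch <path to doc.idx> <document number>");
            return;
        }
        long offset = metadataOffset(new File(args[0]), Long.parseLong(args[1]));
        System.out.println(offset);
    }
}
```

Because every entry has the same size, the offset of any document's metadata is found with a single seek rather than a scan of the file.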

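The last file, trm.idx, requires some further explanation. When the index is constructed, all terms are sorted and divided into groups of up to n terms, and the last (highest) term of each group is moved to the front of its group, where it becomes the head element. As the example below shows, each head element also records the size of its group, so that the whole group can be skipped at once.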
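
The construction step above can be sketched in Java. This is only an illustration under stated assumptions: the class `TermBlockLayout` and its `layout` method are hypothetical names, terms are compared with Java's natural `String` ordering (Egothor's actual ordering may differ), and the offsets stored alongside each term are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the grouping step; not Egothor's own code.
final class TermBlockLayout {

    // Sorts the terms, splits them into groups of up to n terms, and
    // moves the last term of each group to the front of that group.
    static List<String> layout(List<String> terms, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted); // natural String order (an assumption)
        List<String> stored = new ArrayList<>(sorted.size());
        for (int start = 0; start < sorted.size(); start += n) {
            int end = Math.min(start + n, sorted.size());
            stored.add(sorted.get(end - 1));               // head = highest term of the group
            stored.addAll(sorted.subList(start, end - 1)); // remaining terms, still sorted
        }
        return stored;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "Azerbaijan", "Canada", "Djibouti", "England", "Georgia", "Laos",
            "Mali", "Nigeria", "Poland", "Qatar", "Tanzania", "Vietnam");
        // One term per line, in stored order, for groups of up to 8 terms.
        for (String term : layout(terms, 8)) {
            System.out.println(term);
        }
    }
}
```

With groups of up to 8 terms, the 12 country names come out with "Nigeria" at the head of the first group and "Vietnam" at the head of the second.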

Example 2.4. Looking for a term in trm.idx

Imagine an index of 12 terms, divided into groups of up to 8 terms and stored in the order below:

Nigeria [offset to trm.dta] [skip this block of 8 elements]
Azerbaijan [offset to trm.dta]
Canada [offset to trm.dta]
Djibouti [offset to trm.dta]
England [offset to trm.dta]
Georgia [offset to trm.dta]
Laos [offset to trm.dta]
Mali [offset to trm.dta]
Vietnam [offset to trm.dta] [skip this block of 4 elements = **end**]
Poland [offset to trm.dta]
Qatar [offset to trm.dta]
Tanzania [offset to trm.dta]

You are looking for the term "Poland". Egothor compares it with the head of each block in turn until it finds a head that is lexicographically equal to or higher than the term you are looking for. The head of the first block is "Nigeria", which sorts before "Poland"; since the head is the highest term in its block, Egothor knows that it can skip that entire block and move on to the next one. The next head, "Vietnam", sorts after "Poland", so the term, if present, must be in that block. Egothor scans the block, finds "Poland", and returns the appropriate result.

Example 2.5. Looking for a term NOT in trm.idx

Now, let's say that you are looking for the term "Zanzibar". Egothor reads the file as above, but when it gets to "Vietnam" it finds a special end marker that says, "This is the highest term available; if you're looking for anything higher there won't be a result." Since "Zanzibar" sorts after "Vietnam", Egothor stops immediately. In this way, a lot of unnecessary searching is avoided.

Example 2.6. Looking for a term that MIGHT be in trm.idx

Finally, let's say that you are looking for the term "Uruguay". Egothor reads the file as above and reaches "Vietnam". This time the term sorts before the head, so it might be in the last block, and Egothor has to look inside. The block holds "Poland", "Qatar", and "Tanzania", all of which sort before "Uruguay". Since there is nothing between "Tanzania" and "Vietnam", Egothor knows that the term is not in the index and that there is no need to go on searching.
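
The three lookups in Examples 2.4 to 2.6 can be traced with a small Java sketch. It is a simplified in-memory model, not Egothor's implementation: the names `TermIndexLookup`, `Entry`, `find`, and `sample` are hypothetical, the offsets into trm.dta are omitted, and the end marker is modelled simply as running past the last block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the block layout; not Egothor's own code.
final class TermIndexLookup {

    // One stored entry. For a head entry, blockSize counts the head
    // plus the terms that follow it; for other entries it is 0.
    static final class Entry {
        final String term;
        final int blockSize;
        Entry(String term, int blockSize) {
            this.term = term;
            this.blockSize = blockSize;
        }
    }

    // Returns the position of the wanted term in the stored list, or -1.
    static int find(List<Entry> entries, String wanted) {
        int pos = 0;
        while (pos < entries.size()) {
            Entry head = entries.get(pos);
            if (head.blockSize < 1) {
                throw new IllegalArgumentException("not a head entry at " + pos);
            }
            int cmp = wanted.compareTo(head.term);
            if (cmp == 0) {
                return pos;            // the head itself is the term
            }
            if (cmp > 0) {
                pos += head.blockSize; // term is higher than the whole block: skip it
                continue;
            }
            // The term sorts before the head, so it can only be in this block.
            int end = Math.min(pos + head.blockSize, entries.size());
            for (int i = pos + 1; i < end; i++) {
                int c = wanted.compareTo(entries.get(i).term);
                if (c == 0) return i;
                if (c < 0) break;      // passed the place where it would be
            }
            return -1;
        }
        return -1; // higher than the last head: nothing left to search
    }

    // The 12-term example, in stored order.
    static List<Entry> sample() {
        List<Entry> e = new ArrayList<>();
        e.add(new Entry("Nigeria", 8));
        for (String t : new String[] {"Azerbaijan", "Canada", "Djibouti",
                                      "England", "Georgia", "Laos", "Mali"}) {
            e.add(new Entry(t, 0));
        }
        e.add(new Entry("Vietnam", 4));
        for (String t : new String[] {"Poland", "Qatar", "Tanzania"}) {
            e.add(new Entry(t, 0));
        }
        return e;
    }

    public static void main(String[] args) {
        List<Entry> entries = sample();
        for (String term : new String[] {"Poland", "Zanzibar", "Uruguay"}) {
            System.out.println(term + " -> " + find(entries, term));
        }
    }
}
```

In this model "Poland" is found after one block is skipped, "Zanzibar" is rejected as soon as it compares higher than the last head, and "Uruguay" is rejected after a scan of the last block only.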
