Indexing Remote WWW Documents

The Capek Web Crawler

Egothor comes with its own web crawler, code named Capek.

Figure 2.1. Karel Capek, Czech playwright

Karel Capek (1890-1938) was a Czech playwright early in the last century who wrote, most notably, Rossum's Universal Robots, which introduced the word robot into the English language.

The crawler uses "rules" contained in the etc/rules file in your Egothor directory. You'll have to edit this file to establish some boundaries for Capek to operate within. Below is what the rules file that ships with Egothor looks like:

# http://a.a/b/v/d.html?a=a
# {A }   {B}{    C    } {D}
# {          E            }

# this is default
must                    
# {A} allow http
protocol ^http$           
# {B} our local domain
domain	.shark$           
# {E}
whole	http://www.       
# {C}
uripath	^/                
# {D}
query	=                 

# what is prohibited
not                     
# antihackers
domain	^127.             
domain	^localhost
# antiapache dirlists
# {D}
query	^C=D&O=A$	  
query	^C=D&O=D$
query	^C=S&O=A$
query	^C=S&O=D$
query	^C=N&O=A$
query	^C=N&O=D$
query	^C=M&O=A$
query	^C=M&O=D$

The file is divided into two parts: must and not. The must block defines what must be in the URL. The not block, conversely, defines what cannot be in the URL.

URLs that use the HTTP protocol are allowed.

	Tip
	^ means begin string and $ means end string.

The host's name ends with ".shark"...

...and the path starts with a slash.

We also allow a query string which contains "=" (thus ...?abc is not allowed, but ...?id=3455 is).

Moreover, we disallow a URL if its hostname starts with "127." or "localhost".

The query string must not match any of the eight strings that the Apache httpd server generates for directory listings (?C=D&O=A etc.).
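The must/not logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Capek's actual matcher: the names MUST, NOT, url_parts and is_allowed are hypothetical, the host www.fish.shark is made up, and the sketch assumes that each pattern is a regular expression searched against the matching URL component and that a must rule on the query is skipped when the URL has no query string.

```python
import re
from urllib.parse import urlsplit

# Rules copied from the default rules file; the list-of-tuples layout is illustrative.
MUST = [
    ("protocol", r"^http$"),
    ("domain", r".shark$"),
    ("whole", r"http://www."),
    ("uripath", r"^/"),
    ("query", r"="),
]
NOT = [
    ("domain", r"^127."),
    ("domain", r"^localhost"),
] + [("query", rf"^C={c}&O={o}$") for c in "DSNM" for o in "AD"]


def url_parts(url):
    """Split a URL into the components named in the rules file."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "domain": parts.hostname or "",
        "uripath": parts.path,
        "query": parts.query,
        "whole": url,
    }


def is_allowed(url):
    """Return True if every must rule matches and no not rule matches."""
    parts = url_parts(url)
    for field, pattern in MUST:
        value = parts[field]
        # Assumption: the query rule only applies when a query string is present.
        if field == "query" and not value:
            continue
        if not re.search(pattern, value):
            return False
    return not any(re.search(pattern, parts[field]) for field, pattern in NOT)
```

Under these assumptions, is_allowed("http://www.fish.shark/docs/page.html?id=3455") is true, while the same URL ending in ?abc or in ?C=D&O=A is rejected.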

Example 2.1. A Rules File for Crawling Egothor.org

Let's say you wanted to crawl the Egothor web site to get its trove of fantastically designed web pages for your own use. Below is the rules file that you'd need to supply Capek with in order to accomplish this task:

# http://a.a/b/v/d.html?a=a
# {A }   {B}{    C    } {D}
# {          E            }

# this is default
must
# {A} only http is acceptable
protocol ^http$
# {B} only this site will be scanned
domain	^www.egothor.org$
# {C} this is not required, but why not...
uripath	^/
# {D} we will accept URLs with query strings
query	=

# what is prohibited
not
# antiapache dirlists
# {D}
query	^C=D&O=A$
query	^C=D&O=D$
query	^C=S&O=A$
query	^C=S&O=D$
query	^C=N&O=A$
query	^C=N&O=D$
query	^C=M&O=A$
query	^C=M&O=D$

Since we know exactly where the robot will be going, the prohibited list only excludes Apache httpd directory listings. In practice, you'll probably want to crawl more than one website, so your prohibited list should resemble the default list rather than this one.

Starting Capek.

To start Capek working for you (and no it won't write any famous works of literature, at least not yet ;-) ), just issue the command:

java org.egothor.robot.Capek [URLs of websites to crawl]

Additionally, there are a number of options that you can specify when starting up Capek. They are passed as Java system properties, using the -D flag placed between java and the class name in the command above.

egothor.server.pause - Specifies the time gap between two requests to the same hostname. If the value ends with "ms", it is read as milliseconds. (default: 15sec)
egothor.rules.file - Specifies the location of the "rules" file (default: "./rules")
egothor.capek.port - Specifies the port Capek listens on (default: 9713)
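As a rough illustration of how these options fit into the command line, the sketch below builds the argument list for a Capek run. The helper name capek_command and the example values (500ms, etc/rules) are assumptions made for the example; only the three property names and the class name come from this page. JVM -D properties have to come before the class name, which is why they are placed first.

```python
def capek_command(urls, pause=None, rules_file=None, port=None):
    """Build the argument list for starting Capek with optional -D properties."""
    properties = {
        "egothor.server.pause": pause,
        "egothor.rules.file": rules_file,
        "egothor.capek.port": port,
    }
    command = ["java"]
    # Only properties that were actually set are passed; the rest keep their defaults.
    command += [
        f"-D{name}={value}"
        for name, value in properties.items()
        if value is not None
    ]
    command.append("org.egothor.robot.Capek")
    command += list(urls)
    return command


if __name__ == "__main__":
    # Prints the command as it would be typed in a shell.
    print(" ".join(capek_command(["http://www.egothor.org/"],
                                 pause="500ms", rules_file="etc/rules")))
```

The returned list can be handed to subprocess.run if you want to launch the crawler from a script.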

Stopping Capek.

Stopping Capek can occur in one of two ways:

If it is idle for more than a minute, Capek dies a natural death and will need to be restarted if you are not done.
You can manually kill Capek by telnetting to 127.0.0.1 9713 and issuing a shutdown command.
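If you would rather script the shutdown than type it into telnet, something along these lines may work. It is a minimal sketch, assuming that the control port speaks a plain-text, line-oriented protocol in which each command is terminated by a newline; the function name send_command is hypothetical, and the exact wire format is not documented here.

```python
import socket


def send_command(command, host="127.0.0.1", port=9713, timeout=5.0):
    """Send one command line and return the text received before close or timeout."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # Assumption: commands are ASCII text terminated by a newline.
        conn.sendall((command + "\n").encode("ascii"))
        chunks = []
        try:
            while True:
                data = conn.recv(4096)
                if not data:  # the peer closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            # The peer kept the connection open; return what has arrived so far.
            pass
    return b"".join(chunks).decode("ascii", errors="replace")


if __name__ == "__main__":
    print(send_command("shutdown"))
```

If the robot keeps the connection open after answering, the call returns only once the timeout expires, so pick a timeout that suits your setup.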

Capek Options.

Capek can be run in daemon mode (meaning that it will run forever without an explicit stop call) by passing the -daemon option on the command line.

Example 2.2. Running Capek as a Daemon

Just issue this command and Capek will be at your service forever:

java org.egothor.robot.Capek -daemon [your URLs to crawl]

Capek's Controlling Terminal

Capek's control terminal offers several commands that let you monitor the crawl or stop the robot.

stat. This command prints out a brief summary of the robot's statistics.

stat
Threads 10
Max idle time 60000
Last job at 1068448877389
Started at 1068448864654
Tasks 12
Rate (t/s) 0.9397760200485551
Manager cache miss-rate 0.0
Corpus in ./corpus/
Cache url->id miss-rate 1.0
Cache NodeOfSH miss-rate 0.07692308
Link DB in ./linkdb/db.txt
Records 0 links 0
Scheduler in ./scheduler/
Ready servers 1
Root level 2 items planned 2

list. This command displays the last 20 URLs that were processed by the robot.

list
*capek#1 1068448881680 15 http://www0.it-dept.centralbank:80/manual/dso.html
*capek#1 1068448880579 14 http://www0.it-dept.centralbank:80/manual/programs/
*capek#1 1068448879529 13 http://www0.it-dept.centralbank:80/manual/mod/index.html
*capek#1 1068448878459 12 http://www0.it-dept.centralbank:80/manual/mod/index-bytype.html
*capek#1 1068448877389 11 http://www0.it-dept.centralbank:80/manual/stopping.html
 capek#1 1068448876329 36 10 http://www0.it-dept.centralbank:80/manual/invoking.html
 capek#1 1068448875259 64 9 http://www0.it-dept.centralbank:80/manual/install.html
 capek#1 1068448874229 19 8 http://www0.it-dept.centralbank:80/manual/LICENSE
 capek#1 1068448873189 37 7 http://www0.it-dept.centralbank:80/manual/upgrading_to_1_3.html
 capek#1 1068448872059 112 6 http://www0.it-dept.centralbank:80/manual/new_features_1_3.html
 capek#1 1068448871009 49 5 http://www0.it-dept.centralbank:80/manual/mod/
 capek#1 1068448869909 83 4 http://www0.it-dept.centralbank:80/manual/mod/directives.html
 capek#1 1068448868809 59 3 http://www0.it-dept.centralbank:80/manual/sitemap.html
 capek#1 1068448867609 140 2 http://www0.it-dept.centralbank:80/manual/mod/mod_ssl/
 capek#1 1068448865949 657 1 http://www0.it-dept.centralbank:80/manual/index.html
 capek#1 1068448864772 172 0 http://www0.it-dept.centralbank:80/

Tasks marked with an asterisk are still in progress.

The other fields are described below, using the last line as an example:

capek#1: Identification of the node which solves the task (a distributed robot may show many different IDs here).
1068448864772: Time in milliseconds when the task was prepared.
172: How many milliseconds the task took to complete.
0: Unique identification of the task.
http://www0.it-dept.centralbank:80/: URL to gather.
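The field layout above is easy to pick apart in a script. The following sketch parses one line of list output into named fields; the names Task and parse_list_line are hypothetical, and it assumes the fields are whitespace-separated and that tasks still in progress simply lack the duration field, as in the sample output.

```python
from typing import NamedTuple, Optional


class Task(NamedTuple):
    in_progress: bool
    node: str
    prepared_ms: int
    duration_ms: Optional[int]  # None while the task is still in progress
    task_id: int
    url: str


def parse_list_line(line):
    """Parse one line printed by the list command into a Task."""
    text = line.strip()
    in_progress = text.startswith("*")
    fields = text.lstrip("*").split()
    if in_progress:
        # Lines marked with an asterisk have no duration yet.
        node, prepared, task_id, url = fields
        duration = None
    else:
        node, prepared, duration_text, task_id, url = fields
        duration = int(duration_text)
    return Task(in_progress, node, int(prepared), duration, int(task_id), url)


if __name__ == "__main__":
    print(parse_list_line(" capek#1 1068448864772 172 0 http://www0.it-dept.centralbank:80/"))
```

A line that does not have the expected number of fields raises a ValueError, which is usually what you want when the output format changes.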

threads. This command prints out a list of threads which are used by the robot and what they do.

threads
 0 pop
 1 save: http://www0.it-dept.centralbank:80/manual/dso.html
 2 http://www0.it-dept.centralbank:80/manual/netware.html
 3 wait
 4 pop
 5 pop
 6 pop
 7 pop
 8 pop
 9 pop

shutdown. Stops the robot as soon as possible. However, this action may take a few seconds, because tasks which are still open are finished first.

quit. Disconnects from the peer.

Indexing Crawled Pages Using Michelangelo

Once you've used Capek to crawl the pages you want, you'll want to index them using the Michelangelo indexer.

Figure 2.2. Michelangelo Buonarroti, Italian Renaissance artist

While crawling, Capek saves the pages it gathers (compressed using gzip) in a directory named /corpus/. Capek also creates two other directories while it's at it: /linkdb/ and /scheduler/. The linkdb directory will contain just one file, db.txt, which contains the structure of hypertext links found in the documents. The scheduler directory contains a data structure which is used by a scheduler in the robot.

Starting Michelangelo.

Issue the command:

java org.egothor.robot.apps.Michelangelo [dir to create the corpus, et al., in]

	Note
	If you omit the parameter, the directories will be created in the current directory.

Michelangelo will create an index in, of all places, an /index/ directory inside the directory specified on the command line, or the current one if one isn't specified.

Your new index is ready for searching!

Delta Indexing Using Oracul

Now let's say that you want to compile an index with linksrank. This is just a bit more involved than using Michelangelo alone, but should not prove too difficult even for a novice Egothor user. You'll have to make use of a series of applications to form the eventual index. They should be called in this order:

java org.egothor.apps.Compile <location of db.txt> <name and location of output file>
java org.egothor.robot.oracle.LinksFileReader <location of output file generated above> <location of /ranks dir>
java org.egothor.apps.Michelangelo [optional location to create new index]
java org.egothor.apps.Oracul <location of index generated by Michelangelo> [ -linksrank <location of /ranks dir> ] [ -depthrank ]

Example 2.3. A script

java org.egothor.apps.Compile linkdb/db.txt linkdb/db.xxx
java org.egothor.oracle.LinksFileReader linkdb/db.xxx ranks/
java org.egothor.apps.Michelangelo
java org.egothor.apps.Oracul index/ -linksrank ranks/

Figure 2.3. Processing of links

Here is what happens with each (also see Figure 2.3):

Compile reads db.txt and converts it into a binary form used by...
LinksFileReader, which reads the info generated by Compile, computes the linksranks and saves them into a directory called /ranks.
Michelangelo does the raw indexing. If there is an existing index, Michelangelo will update it; otherwise a new index will be constructed.
	Important
	If delta indexing is being done, a new directory called /newbits is created. The /newbits directory must not be removed.
Oracul reads the index created above and saves the previously generated ranks into it.

	Note
	Michelangelo and Oracul cannot run concurrently as they both depend on and alter the index.
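Because the steps depend on each other and Michelangelo and Oracul must not overlap, it can help to drive them from a small script that runs them strictly one after another. The sketch below reuses the commands and paths from Example 2.3 verbatim; the function names build_pipeline and run_pipeline are hypothetical, and it assumes that each tool exits with a non-zero status on failure.

```python
import subprocess


def build_pipeline(db="linkdb/db.txt", compiled="linkdb/db.xxx",
                   ranks="ranks/", index="index/"):
    """Return the four commands from Example 2.3 in the order they must run."""
    return [
        ["java", "org.egothor.apps.Compile", db, compiled],
        ["java", "org.egothor.oracle.LinksFileReader", compiled, ranks],
        ["java", "org.egothor.apps.Michelangelo"],
        ["java", "org.egothor.apps.Oracul", index, "-linksrank", ranks],
    ]


def run_pipeline(steps, runner=subprocess.run):
    """Run each step to completion before starting the next one."""
    for step in steps:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed step stops the remaining ones from running.
        runner(step, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline())
```

The runner parameter exists only so the sequence can be dry-run or tested without a Java installation.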
