Egothor comes with its own web crawler, code named Capek.
Karel Capek (1890-1938) was a Czech playwright who wrote, most notably, R.U.R. (Rossum's Universal Robots), the play that introduced the word "robot" into the English language.
The crawler uses "rules" contained in the etc/rules file in your Egothor directory. You'll have to edit this file to establish some boundaries for Capek to operate within. Below is what the rules file that ships with Egothor looks like:
# http://a.a/b/v/d.html?a=a
# {A } {B}{ C } {D}
#      {    E    }
# this is default
must
# {A} allow http
protocol ^http$
# {B} our local domain
domain .shark$
# {E} whole http://www.
# {C}
uripath ^/
# {D}
query =
# what is prohibited
not
# antihackers
domain ^127.
domain ^localhost
# antiapache dirlists
# {D}
query ^C=D&O=A$
query ^C=D&O=D$
query ^C=S&O=A$
query ^C=S&O=D$
query ^C=N&O=A$
query ^C=N&O=D$
query ^C=M&O=A$
query ^C=M&O=D$
Example 2.1. A Rules File for Crawling Egothor.org
Let's say you wanted to crawl the Egothor web site to get its trove of fantastically designed web pages for your own use. Below is the rules file that you'd need to supply Capek with in order to accomplish this task:
# http://a.a/b/v/d.html?a=a
# {A } {B}{ C } {D}
#      {    E    }
# this is default
must
# {A} only http is acceptable
protocol ^http$
# {B} only this site will be scanned
domain ^www.egothor.org$
# {C} this is not required, but why not...
uripath ^/
# {D} we will accept URLs with query strings
query =
# what is prohibited
not
# antiapache dirlists
# {D}
query ^C=D&O=A$
query ^C=D&O=D$
query ^C=S&O=A$
query ^C=S&O=D$
query ^C=N&O=A$
query ^C=N&O=D$
query ^C=M&O=A$
query ^C=M&O=D$
Since we know exactly where the robot will be going, the prohibited list only excludes Apache httpd directory listing stuff. In practice, you'll probably want to crawl more than just one website, so your prohibited list will probably need to resemble the default list rather than this one.
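To make the must/not structure of a rules file concrete, here is a minimal sketch in Python of how such rules can filter URLs. This is an illustration only: Capek's actual parser lives in the Java sources and may differ in details, and the rule sets below are hypothetical, hand-picked from the example above.

```python
import re
from urllib.parse import urlparse

# Hypothetical rule sets, modeled on the example rules file above.
# A URL must match every `must` rule and no `not` rule.
MUST = {
    "protocol": re.compile(r"^http$"),
    "domain":   re.compile(r"^www\.egothor\.org$"),
    "uripath":  re.compile(r"^/"),
}
NOT = [
    ("query", re.compile(r"^C=D&O=A$")),   # Apache dir-listing sort links
    ("query", re.compile(r"^C=D&O=D$")),
]

def allowed(url):
    """Return True if url passes every `must` rule and no `not` rule."""
    u = urlparse(url)
    parts = {
        "protocol": u.scheme,
        "domain":   u.hostname or "",
        "uripath":  u.path,
        "query":    u.query,
    }
    if any(not rx.search(parts[field]) for field, rx in MUST.items()):
        return False
    return not any(rx.search(parts[field]) for field, rx in NOT)
```

With these rules, `allowed("http://www.egothor.org/docs/")` passes, while an `ftp://` URL, a foreign domain, or an Apache directory-listing query string is rejected.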
Starting Capek.
To start Capek working for you (and no, it won't write any famous works of literature, at least not yet ;-) ), just issue the command:
java org.egothor.robot.Capek [URLs of websites to crawl]
Additionally, there are a number of options you can specify when starting Capek. Each is passed as a -D system property on the java command line above (for example, -Degothor.capek.port=9713).
egothor.server.pause - Specifies the delay between two consecutive requests to the same hostname. If the value ends with "ms", it is read as milliseconds. (default: 15sec)
egothor.rules.file - Specifies the location of the "rules" file (default: "./rules")
egothor.capek.port - Specifies the port Capek listens to (default: 9713)
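The manual only says that a pause value ending in "ms" is read as milliseconds, with a 15-second default. A small Python sketch of that interpretation looks like this; the exact suffixes Capek accepts (e.g. "15sec") are an assumption here, not taken from the Egothor sources.

```python
def parse_pause(value, default_ms=15_000):
    """Interpret an egothor.server.pause value as milliseconds.

    Per the manual: a value ending in "ms" is milliseconds; the
    default is 15 seconds. Treating a "sec" suffix and a bare number
    as seconds is a guess about the real parser.
    """
    if value is None:
        return default_ms
    value = value.strip().lower()
    if value.endswith("ms"):
        return int(value[:-2])
    if value.endswith("sec"):
        return int(value[:-3]) * 1000
    return int(value) * 1000
```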
Stopping Capek.
Stopping Capek can occur in one of two ways:
If it is idle for more than a minute, Capek dies a natural death and will need to be restarted if you are not done.
You can manually kill Capek by telnetting to 127.0.0.1 9713 and issuing a shutdown command.
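Since the shutdown interaction is a plain line-based exchange over TCP, any socket client can stand in for telnet. The sketch below pairs a client with a toy stand-in for Capek's terminal so it can run self-contained; the "bye" acknowledgement and the exact protocol details are assumptions, not taken from Capek's sources.

```python
import socket
import threading

def toy_terminal(srv):
    """A toy stand-in for Capek's control terminal: accept one
    connection, read one command line, acknowledge a shutdown.
    (The real terminal listens on port 9713; its replies may differ.)"""
    conn, _ = srv.accept()
    with conn:
        cmd = conn.makefile().readline().strip()
        if cmd == "shutdown":
            conn.sendall(b"bye\n")  # hypothetical acknowledgement

def send_shutdown(host, port):
    """Do programmatically what telnetting to the port and typing
    `shutdown` does by hand."""
    with socket.create_connection((host, port)) as s:
        s.sendall(b"shutdown\r\n")  # telnet sends CRLF line endings
        return s.makefile().readline().strip()
```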
Capek Options.
Capek can be run in daemon mode (meaning it will run indefinitely until explicitly stopped, rather than dying when idle) by passing the -daemon flag on the command line.
Example 2.2. Running Capek as a Daemon
Just issue this command and Capek will be at your service forever:
java org.egothor.robot.Capek -daemon [your URLs to crawl]
Capek's terminal offers other commands that let you monitor the crawl or stop the robot.
stat. This command prints a brief summary of the robot's statistics.
stat
Threads 10
Max idle time 60000
Last job at 1068448877389
Started at 1068448864654
Tasks 12
Rate (t/s) 0.9397760200485551
Manager cache miss-rate 0.0
Corpus in ./corpus/
Cache url->id miss-rate 1.0
Cache NodeOfSH miss-rate 0.07692308
Link DB in ./linkdb/db.txt
Records 0 links 0
Scheduler in ./scheduler/
Ready servers 1
Root level 2 items planned 2
list. This command displays the last 20 URLs processed by the robot.
list
*capek#1 1068448881680 15 http://www0.it-dept.centralbank:80/manual/dso.html
*capek#1 1068448880579 14 http://www0.it-dept.centralbank:80/manual/programs/
*capek#1 1068448879529 13 http://www0.it-dept.centralbank:80/manual/mod/index.html
*capek#1 1068448878459 12 http://www0.it-dept.centralbank:80/manual/mod/index-bytype.html
*capek#1 1068448877389 11 http://www0.it-dept.centralbank:80/manual/stopping.html
capek#1 1068448876329 36 10 http://www0.it-dept.centralbank:80/manual/invoking.html
capek#1 1068448875259 64 9 http://www0.it-dept.centralbank:80/manual/install.html
capek#1 1068448874229 19 8 http://www0.it-dept.centralbank:80/manual/LICENSE
capek#1 1068448873189 37 7 http://www0.it-dept.centralbank:80/manual/upgrading_to_1_3.html
capek#1 1068448872059 112 6 http://www0.it-dept.centralbank:80/manual/new_features_1_3.html
capek#1 1068448871009 49 5 http://www0.it-dept.centralbank:80/manual/mod/
capek#1 1068448869909 83 4 http://www0.it-dept.centralbank:80/manual/mod/directives.html
capek#1 1068448868809 59 3 http://www0.it-dept.centralbank:80/manual/sitemap.html
capek#1 1068448867609 140 2 http://www0.it-dept.centralbank:80/manual/mod/mod_ssl/
capek#1 1068448865949 657 1 http://www0.it-dept.centralbank:80/manual/index.html
capek#1 1068448864772 172 0 http://www0.it-dept.centralbank:80/

Tasks denoted with an asterisk are still in progress.
The fields are, in order (using the last line as an example):
Identification of node which solves the task (distributed robot may contain many different IDs here).
Time in milliseconds when the task was prepared.
How many milliseconds it took to solve the task (not yet shown for in-progress tasks).
Unique identification of the task.
URL to gather.
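The field layout above can be captured in a small parser. This Python sketch follows the description just given; the assumption that fields are whitespace-separated (and that URLs contain no spaces) is mine, not stated by the manual.

```python
def parse_list_line(line):
    """Split one line of Capek's `list` output into its fields.

    Assumed layout, per the field description above: an optional
    leading asterisk (task still in progress), node id, preparation
    time (epoch ms), solve time (ms, absent while in progress),
    task id, and URL.
    """
    in_progress = line.startswith("*")
    fields = line.lstrip("*").split()
    if in_progress:
        node, prepared, task_id, url = fields
        solved = None
    else:
        node, prepared, solved, task_id, url = fields
        solved = int(solved)
    return {
        "in_progress": in_progress,
        "node": node,
        "prepared_ms": int(prepared),
        "solved_ms": solved,
        "task_id": int(task_id),
        "url": url,
    }
```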
threads. This command lists the threads used by the robot and what each is doing.
threads
0 pop
1 save: http://www0.it-dept.centralbank:80/manual/dso.html
2 http://www0.it-dept.centralbank:80/manual/netware.html
3 wait
4 pop
5 pop
6 pop
7 pop
8 pop
9 pop
shutdown. Stops the robot as soon as possible. However, this action may take a few seconds, because tasks which are still open are finished first.
quit. Disconnects from the peer, closing the terminal session without stopping the robot.
Once you've used Capek to crawl the pages you want, you'll want to index them using the Michelangelo indexer.
Capek saves the pages it gathers (compressed with gzip) while crawling in a directory named /corpus/. Capek also creates two other directories while it's at it: /linkdb/ and /scheduler/. The /linkdb/ directory contains just one file, db.txt, which records the structure of hypertext links found in the documents. The /scheduler/ directory contains a data structure used by the robot's scheduler.
Starting Michelangelo.
Issue the command:
java org.egothor.robot.apps.Michelangelo [dir to create the corpus, et al., in]

Note: If you omit the parameter, the directories will be created in the current directory.
Michelangelo will create an index in, of all places, an /index/ directory inside the directory specified on the command line, or the current one if one isn't specified.
Your new index is ready for searching!
Now let's say that you want to compile an index with linksrank. This is just a bit more involved than using Michelangelo alone, but should not prove too difficult for even a novice Egothor user. You'll have to make use of a series of applications to form the eventual index. They should be called in this order:
java org.egothor.apps.Compile [location of db.txt] [location of output file]
java org.egothor.robot.oracle.LinksFileReader [location of output file generated above] [location of /ranks dir]
java org.egothor.apps.Michelangelo [optional location to create new index]
java org.egothor.apps.Oracul [location of index generated by Michelangelo] [-linksrank location of /ranks dir] [-depthrank]
Example 2.3. A script
java org.egothor.apps.Compile linkdb/db.txt linkdb/db.xxx
java org.egothor.oracle.LinksFileReader linkdb/db.xxx ranks/
java org.egothor.apps.Michelangelo
java org.egothor.apps.Oracul index/ -linksrank ranks/
Here is what happens with each (also see Figure 2.3):
Compile reads db.txt and converts it into a binary form consumed by the next step.
LinksFileReader reads the binary file generated by Compile, computes the linksrank values, and saves them into a directory called /ranks.
Michelangelo does the raw indexing. If there is an existing index Michelangelo will update it, otherwise a new index will be constructed.
Important: If delta indexing is being done, a new directory called /newbits is created. The /newbits directory must not be removed.
Oracul reads the index created above and saves the previously generated ranks into it.
Note: Michelangelo and Oracul cannot run concurrently, as they both depend on and alter the index.
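The manual does not spell out how linksrank is computed, but ranks derived from hypertext link structure are commonly computed with a PageRank-style power iteration. The sketch below illustrates that general idea only; whether Egothor's linksrank uses this exact scheme, or this damping factor, is an assumption.

```python
def link_rank(links, damping=0.85, iters=50):
    """Generic PageRank-style rank over a link graph.

    `links` maps each page to the list of pages it links to. This is
    an illustration of link-based ranking in general, not a port of
    Egothor's LinksFileReader.
    """
    pages = set(links) | {d for dests in links.values() for d in dests}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for src, dests in links.items():
            if dests:
                share = damping * rank[src] / len(dests)
                for d in dests:
                    new[d] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new[p] += damping * rank[src] / n
        rank = new
    return rank
```

On a tiny graph where both a and b link to c, the iteration concentrates rank on c, the most-linked page, which is the intuition behind boosting well-linked documents in search results.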
© 2003-2004 Egothor Developers