:: EGOTHOR |
The parser is the first module that touches the input documents or queries. You will need to modify our implementation if your documents need special handling and their elements can be described by a regular expression (as shown below).
The parser in Egothor tries to reduce the total number of active words by keeping each valid token of text as one word. For instance, it is better for an IR system to store a date like "10-Jul-2002" as one word (term, token) than to decompose it into its three constituent elements: "10", "Jul" and "2002". The second approach would require a complicated algorithm to recognize that the three independent words are in reality just the constituent parts of a single date. Remember, when parsing, less is definitely more!
The table below outlines which text elements are recognized (the actual list can be found in org.egothor.parser.plain.Simple):
Table 1.1. Parser specification
Element | Template | Notes, examples |
---|---|---|
Acronym | L . ( L . )+ | I.B.M., s.r.o., R.U.R. |
Apostrophe | W ( ' W )+ | O'Hare, wouldn't |
Date | D(D)?.D(D)?.DD(DD)? | 28.10.02, 28.10.2002, but also 35.16.2002 |
Date | DD-DD-DDDD | 28-10-2002, but also 35-16-2002 |
Date | DD-W-DDDD | 28-Jul-2002, but also 35-squadron-1000 |
E-mail address | S ( . S )* @ H | leo.galambos@mff.cuni.cz |
Host name | S ( . S )* . L L (L)? (L)? | com-os2.ms.mff.cuni.cz |
Mark, sign | W & W | Woody&Son, AT&T, Me&You |
Mark, sign | W / N | OS/2, AS/400 |
Mark, sign | W ++ | C++ |
Number | (D)+ | 31415926535 |
IP address | (D)+.(D)+.(D)+.(D)+ | 192.168.0.0, but also 333.100.2.1000 |
String | (L \| D) ( L \| D \| _ \| - )* | D-N-A, R-N-A, com-os2, OS2, id_3 |
Word | L ( L \| D )+ | K2, guru |
N-Gram fragment | (CJK letter)+ | 亄亅了亇 |

Where: L is a letter, D is a digit, N is a number, S is a string, W is a word, H is a host name (N, S, W and H as defined in the table), () groups elements, + means "at least one", * means "zero or more times", ? means "optional", | means "or".
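
To make the notation concrete, here is a minimal sketch that rewrites a few of the templates as `java.util.regex` patterns. It is not the code in org.egothor.parser.plain.Simple: the class name `TemplateSketch`, the method `classify`, and the rule "try templates in a fixed order, first full match wins" are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: TemplateSketch and classify() are hypothetical names,
// not part of org.egothor.parser.plain.Simple.
final class TemplateSketch {
    // Building blocks from the legend.
    static final String L = "\\p{L}";                        // a letter
    static final String D = "\\d";                           // a digit
    static final String W = L + "[\\p{L}\\d]+";              // Word: L ( L | D )+
    static final String S = "[\\p{L}\\d][\\p{L}\\d_\\-]*";   // String: (L | D) ( L | D | _ | - )*
    // Host name: S ( . S )* . L L (L)? (L)?
    static final String H = S + "(?:\\." + S + ")*\\." + L + L + L + "?" + L + "?";

    // Assumption: templates are tried in insertion order; the first full match wins.
    static final Map<String, Pattern> TEMPLATES = new LinkedHashMap<>();
    static {
        TEMPLATES.put("Date", Pattern.compile(
            D + D + "?\\." + D + D + "?\\." + D + D + "(?:" + D + D + ")?"  // D(D)?.D(D)?.DD(DD)?
            + "|" + D + D + "-" + D + D + "-" + D + "{4}"                    // DD-DD-DDDD
            + "|" + D + D + "-" + W + "-" + D + "{4}"));                     // DD-W-DDDD
        TEMPLATES.put("E-mail address", Pattern.compile(S + "(?:\\." + S + ")*@" + H));
        TEMPLATES.put("Host name", Pattern.compile(H));
        TEMPLATES.put("IP address", Pattern.compile(D + "+\\." + D + "+\\." + D + "+\\." + D + "+"));
        TEMPLATES.put("Number", Pattern.compile(D + "+"));
        TEMPLATES.put("Word", Pattern.compile(W));
        TEMPLATES.put("String", Pattern.compile(S));
    }

    /** Returns the name of the first template that matches the whole token. */
    static String classify(String token) {
        for (Map.Entry<String, Pattern> e : TEMPLATES.entrySet()) {
            if (e.getValue().matcher(token).matches()) {
                return e.getKey();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String[] samples = {"10-Jul-2002", "35-squadron-1000", "leo.galambos@mff.cuni.cz",
                            "com-os2.ms.mff.cuni.cz", "333.100.2.1000", "K2", "id_3"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

The order matters in this sketch because several templates overlap: every Number and every Word also satisfies the String template, and "10-Jul-2002" satisfies both Date and String. How the real parser resolves such overlaps is not stated here, so treat the ordering as a guess.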
As you can see, the grammar does not validate, for instance, the date format. Validation can be done by filters that run after the initial parser. Filters can split already parsed elements into smaller pieces, discard or copy (clone) elements, and do many other interesting things.
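
The idea of a post-parser filter can be sketched as follows. The types `Token`, `TokenFilter` and `DateRangeFilter` are hypothetical and do not reflect Egothor's actual filter API. The sketch only assumes that a filter receives one parsed element and returns zero elements (discard), one element (keep or retype) or several elements (split).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: none of these types exist in Egothor under these names.
final class FilterSketch {
    /** A parsed element: its type name from the table and its text. */
    static final class Token {
        final String type;
        final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + ":" + text; }
    }

    /** Hypothetical contract: empty list = discard, one = keep or retype, several = split. */
    interface TokenFilter {
        List<Token> apply(Token token);
    }

    /** Splits all-numeric Date tokens with an impossible day or month into Number tokens. */
    static final class DateRangeFilter implements TokenFilter {
        @Override public List<Token> apply(Token token) {
            if (!token.type.equals("Date")) {
                return List.of(token);
            }
            String[] parts = token.text.split("[.-]");
            if (parts.length != 3) {
                return List.of(token);
            }
            for (String p : parts) {
                if (!p.matches("\\d+")) {
                    return List.of(token); // e.g. 28-Jul-2002: not handled by this filter
                }
            }
            int day = Integer.parseInt(parts[0]);
            int month = Integer.parseInt(parts[1]);
            if (day >= 1 && day <= 31 && month >= 1 && month <= 12) {
                return List.of(token);
            }
            List<Token> pieces = new ArrayList<>();
            for (String p : parts) {
                pieces.add(new Token("Number", p));
            }
            return pieces;
        }
    }

    /** Applies the filters in the given order to the whole token stream. */
    static List<Token> run(List<Token> tokens, List<TokenFilter> filters) {
        List<Token> current = tokens;
        for (TokenFilter f : filters) {
            List<Token> next = new ArrayList<>();
            for (Token t : current) {
                next.addAll(f.apply(t));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        List<Token> in = List.of(new Token("Date", "28.10.2002"), new Token("Date", "35.16.2002"));
        System.out.println(run(in, List.of(new DateRangeFilter())));
    }
}
```

Returning a list from each filter is one simple way to cover splitting, discarding and cloning with a single method; the real filtering subsystem may be organized quite differently.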
Note for geeks and gurus | |
---|---|
For more complex cases, where the elements depend on a language or other attributes, so that you cannot describe them by a regular expression or an LL(1) grammar, you would use a system of filters. The filters can contain other parsers or change the recognized elements to other types of elements. We are currently working on automating the filtering subsystem, because (in most cases) the computer can decide in what order the filters have to be applied. The new filtering system will be backward compatible, but we feel that it is not important to describe it here. Everything should become clear when you read the source code. |
Tip | |
---|---|
It is more flexible to have a filter that checks the date format depending on the language than to move that check into the core parser. |
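
As an illustration of the tip, a language-dependent check for the DD-W-DDDD form could lean on `java.time` with a locale passed in from outside. This is only a sketch: the class `LocaleDateCheck` and the method `isValidDate` are hypothetical names, and how a filter would learn the document's language is left open.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Locale;

// Illustrative only: not part of Egothor.
final class LocaleDateCheck {
    /** True if text is a real calendar date of the form day-month-year, with month names taken from the locale. */
    static boolean isValidDate(String text, Locale locale) {
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("d-MMM-uuuu")
            .toFormatter(locale)
            .withResolverStyle(ResolverStyle.STRICT); // rejects dates such as 31 February
        try {
            LocalDate.parse(text, fmt);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("28-Jul-2002", Locale.ENGLISH));
        System.out.println(isValidDate("35-squadron-1000", Locale.ENGLISH));
    }
}
```

Because the locale is a parameter, the same filter can serve documents in different languages while the core parser keeps its single, language-neutral DD-W-DDDD template.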
© 2003-2004 Egothor Developers |