--- trunk/README 2004/07/13 12:22:09 106
+++ tags/WAIT_1_900/README 2004/07/13 12:45:55 107
@@ -102,24 +102,22 @@
 addition a query and a display module must be choosen.

 Access
-
 The access module defines which documents are members of a database.
 Usually an access module is a tied hash, whose keys are the Ids of the
 documents (did = document id) and whose values are the documents
-themselves. The indexing process loops over the keys using `FIRSTKEY'
-and `NEXTKEY'. Documents are retrieved with `FETCH'.
+themselves. The indexing process loops over the keys using "FIRSTKEY"
+and "NEXTKEY". Documents are retrieved with "FETCH".

-By convention access modules should be members of the `WAIT::Document'
-hierarchy. Have a look at the `WAIT::Document::Split' module to get the
+By convention access modules should be members of the "WAIT::Document"
+hierarchy. Have a look at the "WAIT::Document::Split" module to get the
 idea.

 Parse
-
 The task of the parse module is to split the documents into logical
-parts via the `split' method. E.g. the `WAIT::Parse::Nroff' splits
+parts via the "split" method. E.g. the "WAIT::Parse::Nroff" splits
 manuals piped through nroff(1) into the sections *name*, *synopsis*,
 *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
-and *environment*. Here is the implementation of `WAIT::Parse::Base'
+and *environment*. Here is the implementation of "WAIT::Parse::Base"
 which handles documents with a pretty simple tagged format:

 AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
@@ -150,8 +148,8 @@
 we need a second method (*tag*) which marks the regions of the document
 with tags for the different attributes. This tagged form is used by the
 display module to hilight search terms in the documents. Besides the
-tags for the attributes, the method might assign the special tags `_b'
-and `_i' for indicating bold and italic regions.
+tags for the attributes, the method might assign the special tags "_b"
+and "_i" for indicating bold and italic regions.

 sub tag {
 my @result;
@@ -172,15 +170,14 @@
 return @result; # we don't go for speed
 }

-Obviously one could implement `split' via `tag'. The reason for having
-two functions is speed. We need to call `split' for each document when
+Obviously one could implement "split" via "tag". The reason for having
+two functions is speed. We need to call "split" for each document when
 indexing a collection. Therefore speed is essential. On the other hand,
-`tag' is called in order to display a single document and may be a
+"tag" is called in order to display a single document and may be a
 little slower. It may care about tagging bold and italic regions. See
-`WAIT::Parse::Nroff' how this might decrease performance.
+"WAIT::Parse::Nroff" how this might decrease performance.

 Filter definition
-
 From the Information Retrieval perspective, the hardest part of the
 system is the filter module. The database administrator defines for each
 attribute, how the contents should be processed before it is stored in
@@ -196,10 +193,10 @@

 [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

-The function `isotr' replaces unknown characters by blanks. `isolc'
-transforms to lower case. `split2' splits into words and removes words
-shorter than two characters. `stop' removes the freeWAIS-sf stopwords
-and `Stem' applies the Porter algorithm for computing the stem of the
+The function "isotr" replaces unknown characters by blanks. "isolc"
+transforms to lower case. "split2" splits into words and removes words
+shorter than two characters. "stop" removes the freeWAIS-sf stopwords
+and "Stem" applies the Porter algorithm for computing the stem of the
 words.

 The filter definition for a collection defines a set of pipelines for