addition a query and a display module must be chosen.

Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the IDs of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using "FIRSTKEY"
and "NEXTKEY". Documents are retrieved with "FETCH".

By convention, access modules should be members of the "WAIT::Document"
hierarchy. Have a look at the "WAIT::Document::Split" module to get the
idea.
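
As an illustration of the tied-hash idea, here is a minimal sketch of an
access module that keeps its documents in memory. The package name
"My::Document::Memory" and its constructor arguments are hypothetical and
not part of WAIT; a real access module would by convention live under
"WAIT::Document" and may need more than the three methods shown.

```perl
use strict;
use warnings;

package My::Document::Memory;   # hypothetical name, not part of WAIT

# Tie a hash to an in-memory set of documents: keys are document ids
# (did), values are the documents themselves.
sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs}, pending => [] }, $class;
}

# Retrieve one document by its id.
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Start an iteration over all document ids (sorted for a stable order).
sub FIRSTKEY {
    my ($self) = @_;
    $self->{pending} = [ sort keys %{ $self->{docs} } ];
    return shift @{ $self->{pending} };
}

# Continue the iteration; returns undef when no ids are left.
sub NEXTKEY {
    my ($self) = @_;
    return shift @{ $self->{pending} };
}

package main;

tie my %collection, 'My::Document::Memory',
    doc1 => "first document",
    doc2 => "second document";

# An indexing loop: keys() drives FIRSTKEY/NEXTKEY, element access
# drives FETCH.
my @seen;
for my $did (keys %collection) {
    push @seen, "$did: $collection{$did}";
}
print "$_\n" for @seen;
```

Because the collection is just a hash from the caller's point of view, the
indexing loop does not need to know where the documents come from.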

Parse

The task of the parse module is to split the documents into logical
parts via the "split" method. For example, "WAIT::Parse::Nroff" splits
manuals piped through nroff(1) into the sections *name*, *synopsis*,
*options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
and *environment*. Here is the implementation of "WAIT::Parse::Base",
which handles documents with a fairly simple tagged format:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
[...]

we need a second method (*tag*) which marks the regions of the document
with tags for the different attributes. This tagged form is used by the
display module to highlight search terms in the documents. Besides the
tags for the attributes, the method might assign the special tags "_b"
and "_i" for indicating bold and italic regions.

    sub tag {
      my @result;
      # [...]
      return @result; # we don't go for speed
    }

Obviously one could implement "split" via "tag". The reason for having
two functions is speed: "split" is called for every document when a
collection is indexed, so it must be fast. "tag", on the other hand, is
called only to display a single document and may be a little slower, so
it can afford to tag bold and italic regions as well. See
"WAIT::Parse::Nroff" for how this might decrease performance.
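
To make the relation between the two methods concrete, here is a rough
sketch of "split" implemented via "tag". It rests on an assumption that
the text above does not spell out: that "tag" returns a flat list of
pairs, each a hash reference naming the tags followed by the text chunk
they apply to. The "tag" below is a simplified stand-in written for this
example, and "split_via_tag" is a hypothetical name.

```perl
use strict;
use warnings;

# Simplified stand-in for a parse module's "tag" method.
# ASSUMPTION: the result is a flat list of (tag set, text chunk) pairs.
sub tag {
    my ($text) = @_;
    my (@result, $attr);
    for my $line (split /\n/, $text) {
        if ($line =~ s/^(\S+):\s*//) {
            $attr = lc $1;                       # "AU: ..." starts attribute "au"
            push @result, +{ _b => 1 }, "$1: ";  # field label marked as bold
        }
        my %tags = defined $attr ? ($attr => 1) : ();
        push @result, \%tags, "$line\n";
    }
    return @result;
}

# Derive the attribute => text mapping from the tagged form, ignoring
# the special display tags that start with an underscore.
sub split_via_tag {
    my ($text) = @_;
    my %parts;
    my @tagged = tag($text);
    while (my ($tags, $chunk) = splice @tagged, 0, 2) {
        for my $name (grep { !/^_/ } keys %$tags) {
            $parts{$name} .= $chunk;
        }
    }
    return \%parts;
}

my $parts = split_via_tag("AU: Pfeifer, U.; Fuhr, N.; Huynh, T.\nTI: An example title\n");
print "$_ => $parts->{$_}" for sort keys %$parts;
```

The extra pass over the tagged list is the kind of overhead that a
dedicated "split" avoids when a whole collection is indexed.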

Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for each
attribute, how its contents should be processed before they are stored in

[...]

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function "isotr" replaces unknown characters with blanks. "isolc"
transforms to lower case. "split2" splits into words and removes words
shorter than two characters. "stop" removes the freeWAIS-sf stopwords
and "Stem" applies the Porter algorithm for computing the stem of the
words.
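
The way such a list of filter names acts as a pipeline can be sketched as
follows. The filter bodies are simplified stand-ins written for this
example, not the WAIT implementations: "isotr" here keeps only ASCII
letters and digits, the stopword list is a tiny illustrative one, and
"Stem" is left out because a faithful Porter stemmer does not fit in a
short example. "run_pipeline" is a hypothetical helper name.

```perl
use strict;
use warnings;

# Tiny illustrative stopword list (NOT the freeWAIS-sf list).
my %STOP = map { ($_ => 1) } qw(a an and of the);

# Each filter takes a list of strings and returns a list of strings.
my %filter = (
    # replace everything except ASCII letters and digits with blanks
    isotr  => sub { my @out = @_; tr/a-zA-Z0-9/ /c for @out; return @out },
    # transform to lower case
    isolc  => sub { return map { lc $_ } @_ },
    # split into words, dropping words shorter than two characters
    split2 => sub { return grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # drop stopwords
    stop   => sub { return grep { !$STOP{$_} } @_ },
);

# Apply the named filters in order, feeding each one's output to the next.
sub run_pipeline {
    my ($names, @data) = @_;
    for my $name (@$names) {
        my $code = $filter{$name} or die "unknown filter '$name'";
        @data = $code->(@data);
    }
    return @data;
}

my @terms = run_pipeline([ 'isotr', 'isolc', 'split2', 'stop' ],
                         "The Design of a Retrieval-System!");
print "@terms\n";
```

The order of the names matters: "stop" only works on lower-cased single
words, so it has to come after "isolc" and "split2".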

The filter definition for a collection defines a set of pipelines for