WAIT 1.6

Copyright (c) 1996, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

This software is not actively maintained by its author.

For more than two years I have tried to find the time to clean this up,
without any luck, so I decided to pass the baton on. I consider the
input part pretty satisfying. The query part, while operable
and useful, needs a major overhaul. To provide a forum for further
discussion and to coordinate further development, I have set up a
mailing list. Drop me a line if you want to participate.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
WAIT - a rewrite of the freeWAIS-sf engine in Perl

Status of this document
In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete
at best. Any additions or corrections are welcome.

PURPOSE
As you might know, I developed and maintained freeWAIS-sf (with
the help of many people on the Net). freeWAIS-sf is based on
freeWAIS, maintained by the Clearinghouse for Networked
Information Discovery and Retrieval (CNIDR), which in turn is based on
wais-8-b5, implemented by Thinking Machines et al. During this long
history (implementation started around 1989) many people
contributed to the distribution and added features not foreseen
by the original design. While the system fulfills its task now,
the code has reached a state where adding new features is nearly
impossible, and even fixing longstanding bugs and removing
limitations has become very time-consuming.

Therefore I decided to pass the maintenance on to WSC Inc. and
build a new system from scratch. For obvious reasons I chose
Perl as the implementation language.

DESCRIPTION
The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users
might want to build. Obviously the framework limits the class of
systems that can be built.

        +------+     +-----+     +------+
    ==> |Access| ==> |Parse| ==> |      |
        +------+     +-----+     |      |
           ||                    |      |     +-----+
           ||                    |Filter| ==> |Index|
           \/                    |      |     +-----+
       +-------+     +-----+     |      |
    <= |Display| <== |Query| <-> |      |
       +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the
access and parse modules together with the filter definitions. At
query time, a query module and a display module must be
chosen in addition.

Access

The access module defines which documents are members of a
database. Usually an access module is a tied hash whose keys
are the ids of the documents (did = document id) and whose
values are the documents themselves. The indexing process loops
over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
retrieved with `FETCH'.

By convention, access modules should be members of the
`WAIT::Document' hierarchy. Have a look at the
`WAIT::Document::Split' module to get the idea.

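To illustrate the tied-hash interface described above, here is a
minimal sketch of an access module that serves documents from an
in-memory hash. The package name `My::Document::Memory' and its
constructor arguments are hypothetical and not part of the WAIT
distribution; a real module would presumably live in the
`WAIT::Document' hierarchy and read from files or a database.

```perl
use strict;
use warnings;

# Hypothetical access module: a read-only tied hash mapping did => document.
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Freeze the key order so FIRSTKEY/NEXTKEY iterate deterministically.
    my $self = { docs => {%docs}, keys => [sort keys %docs], pos => 0 };
    return bless $self, $class;
}

# Retrieve one document by its id.
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub EXISTS {
    my ($self, $did) = @_;
    return exists $self->{docs}{$did};
}

# Restart the iteration and hand out the first id.
sub FIRSTKEY {
    my ($self) = @_;
    $self->{pos} = 0;
    return $self->NEXTKEY;
}

# Hand out the next id, or undef when the ids are exhausted.
sub NEXTKEY {
    my ($self) = @_;
    return undef if $self->{pos} >= @{ $self->{keys} };
    return $self->{keys}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'My::Document::Memory',
    1 => "TI: First document\n",
    2 => "TI: Second document\n";

# An indexer would loop over the keys and fetch each document in turn.
my @seen;
for my $did (keys %collection) {
    push @seen, "$did => $collection{$did}";
}
print @seen;
```

Because the interface is just a hash, the indexing loop does not need to
know where the documents come from.
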
Parse

The task of the parse module is to split the documents into logical
parts via the `split' method. For example, `WAIT::Parse::Nroff'
splits manuals piped through nroff(1) into the sections *name*,
*synopsis*, *options*, *description*, *author*, *example*,
*bugs*, *text*, *see*, and *environment*. Below is the
implementation of `WAIT::Parse::Base', which handles documents
with a pretty simple tagged format such as this one:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
    TI: Searching Structured Documents with the Enhanced Retrieval
        Functionality of freeWAIS-sf and SFgate
    ER: D. Kroemker
    BT: Computer Networks and ISDN Systems; Proceedings of the third
        International World-Wide Web Conference
    PN: Elsevier
    PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
    PP: 1027-1036
    PY: 1995

    sub split {                     # called as a method
        my %result;
        my $fld;

        for (split /\n/, $_[1]) {   # $_[1] is the document text
            if (s/^(\S+):\s*//) {   # a "TAG:" prefix starts a new field
                $fld = lc $1;
            }
            # lines without a prefix continue the current field
            $result{$fld} .= $_ if defined $fld;
        }
        return \%result;            # attribute name => attribute text
    }

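The following sketch shows what `split' returns for a small record. It
repeats the logic above inside a hypothetical package
`My::Parse::Tagged' (a stand-in, not the real `WAIT::Parse::Base') so
that it can be run on its own; the constructor `new' and the shortened
sample record are assumptions made for the example.

```perl
use strict;
use warnings;

package My::Parse::Tagged;   # hypothetical stand-in for WAIT::Parse::Base

sub new { return bless {}, shift }

sub split {
    my ($self, $doc) = @_;
    my (%result, $fld);
    # CORE::split makes it explicit that the builtin is meant,
    # not this method of the same name.
    for (CORE::split /\n/, $doc) {
        $fld = lc $1 if s/^(\S+):\s*//;          # new field starts here
        $result{$fld} .= $_ if defined $fld;     # append (continuation) text
    }
    return \%result;
}

package main;

my $doc = join "\n",
    'AU: Pfeifer, U.',
    'TI: Searching Structured',
    '   Documents',
    'PY: 1995';

my $fields = My::Parse::Tagged->new->split($doc);
for my $attr (sort keys %$fields) {
    print "$attr => [$fields->{$attr}]\n";
}
```

Note that a continuation line is appended exactly as it appears, so its
leading whitespace is what separates it from the previous line.
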
Since the original document cannot be reconstructed from its
attributes, we need a second method (*tag*) which marks the
regions of the document with tags for the different attributes.
This tagged form is used by the display module to highlight search
terms in the documents. Besides the tags for the attributes, the
method may assign the special tags `_b' and `_i' to
indicate bold and italic regions.

    sub tag {                       # called as a method
        my @result;
        my $tag;

        for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;   # skip fields without content
            if (s/^(\S+)://) {      # emit the field label in bold
                push @result, {_b => 1}, "$1:";
                $tag = lc $1;
            }
            if (defined $tag) {     # text belonging to the current field
                push @result, {$tag => 1}, "$_\n";
            } else {                # text before the first field: no tags
                push @result, {}, "$_\n";
            }
        }
        return @result;             # we don't go for speed
    }

Obviously one could implement `split' via `tag'. The reason for
having two functions is speed. `split' is called for every
document when a collection is indexed, so its speed is
essential. `tag', on the other hand, is called only to
display a single document and may be a little slower. It can
therefore afford to tag bold and italic regions as well. See
`WAIT::Parse::Nroff' for an example of how this can decrease performance.

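As a sketch of the remark that `split' could be implemented via `tag',
the following derives the attribute hash from the list of (tags, text)
pairs. The function names `tag_doc' and `split_via_tag' are
hypothetical, and `tag_doc' is a plain-function copy of the `tag' logic
so that the example runs on its own. The result is not identical to
that of `split': text keeps the whitespace after the colon and a
trailing newline.

```perl
use strict;
use warnings;

# Plain-function copy of the `tag' logic: returns (tags, text) pairs.
sub tag_doc {
    my ($doc) = @_;
    my (@result, $tag);
    for (split /\n/, $doc) {
        next if /^\w\w:\s*$/;                    # skip empty fields
        if (s/^(\S+)://) {
            push @result, +{ _b => 1 }, "$1:";   # bold field label
            $tag = lc $1;
        }
        push @result, (defined $tag ? +{ $tag => 1 } : +{}), "$_\n";
    }
    return @result;
}

# Hypothetical helper: collect the text of each attribute from the pairs.
sub split_via_tag {
    my @pairs = tag_doc(@_);
    my %result;
    while (my ($tags, $text) = splice @pairs, 0, 2) {
        # Names starting with "_" (such as _b and _i) are markup, not attributes.
        for my $attr (grep { !/^_/ } keys %$tags) {
            $result{$attr} .= $text;
        }
    }
    return \%result;
}

my $fields = split_via_tag("AU: Pfeifer, U.\nPY: 1995\n");
print "$_ =>$fields->{$_}" for sort keys %$fields;
```

The extra pass over the pairs is the kind of overhead that a dedicated
`split' avoids at indexing time.
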
Filter definition

From the Information Retrieval perspective, the hardest part of
the system is the filter module. The database administrator
defines, for each attribute, how its contents are processed
before they are stored in the index. Usually the processing
consists of steps that restrict the character set, transform
case, split the text into words, and reduce the words to their
stems. In WAIT these steps are defined naturally as a pipeline
of processing steps. The pipelines are made up of functions in
the package WAIT::Filter, which is pre-populated with the most
common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be
this pipeline:

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function `isotr' replaces unknown characters with blanks.
`isolc' transforms to lower case. `split2' splits into words and
removes words shorter than two characters. `stop' removes the
freeWAIS-sf stopwords, and `Stem' applies the Porter algorithm
to compute the stem of each word.

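To make the pipeline idea concrete, here is a minimal sketch of how a
list of function names could be applied step by step. Everything in it
is an assumption made for illustration: the package `My::Filter', the
functions `lower', `words2' and `nostop', the tiny stopword list, and
the helper `run_pipeline' are stand-ins and do not reflect the actual
signatures or behaviour of the functions in WAIT::Filter.

```perl
use strict;
use warnings;

package My::Filter;   # hypothetical stand-in, not WAIT::Filter

# Each step takes a list of strings and returns a list of strings.
sub lower  { return map { lc $_ } @_ }
sub words2 { return grep { length($_) >= 2 } map { split ' ', $_ } @_ }

my %STOP = map { $_ => 1 } qw(the a of and);   # toy stopword list
sub nostop { return grep { !$STOP{$_} } @_ }

package main;

# Look up each step by name and feed it the output of the previous step.
sub run_pipeline {
    my ($pipeline, @data) = @_;
    for my $name (@$pipeline) {
        my $step = My::Filter->can($name)
            or die "unknown filter '$name'";
        @data = $step->(@data);
    }
    return @data;
}

my @terms = run_pipeline([qw(lower words2 nostop)], 'The Index of WAIT');
print "@terms\n";
```

Looking the steps up by name keeps a pipeline definition a plain list of
strings, which matches the notation used in this document.
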
The filter definition for a collection defines a set of pipelines
for the attributes, together with the modified pipelines that should be
used for prefix and interval searches.

Here is a complete example:

    my $stem = [{
                 'prefix'    => ['unroff', 'isotr', 'isolc'],
                 'intervall' => ['unroff', 'isotr', 'isolc'],
                },
                'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
    my $text = [{
                 'prefix'    => ['unroff', 'isotr', 'isolc'],
                 'intervall' => ['unroff', 'isotr', 'isolc'],
                },
                'unroff', 'isotr', 'isolc', 'split2', 'stop'];
    my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];

    my $spec = [
                'name'        => $stem,
                'synopsis'    => $stem,
                'bugs'        => $stem,
                'description' => $stem,
                'text'        => $stem,
                'environment' => $text,
                'example'     => $text,  'example' => $stem,
                'author'      => $sound, 'author'  => $stem,
               ];

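Note that `$spec' in the example is an array reference rather than a
hash reference, and that `example' and `author' each appear twice. A
plausible reading is that an attribute may be indexed through more than
one pipeline, which a hash could not express. The sketch below walks
such a list two elements at a time; the shortened pipelines and the
variable `%pipelines' are assumptions for illustration and say nothing
about how WAIT processes the list internally.

```perl
use strict;
use warnings;

# Shortened pipelines, just to have something to point at.
my $stem = ['isolc', 'split2', 'stop', 'Stem'];
my $text = ['isolc', 'split2', 'stop'];

my $spec = [
    'example' => $text, 'example' => $stem,
    'name'    => $stem,
];

# Walk the flat list in (attribute, pipeline) pairs and keep every
# pipeline that was given for an attribute.
my %pipelines;
for (my $i = 0; $i < @$spec; $i += 2) {
    my ($attr, $pipe) = @{$spec}[$i, $i + 1];
    push @{ $pipelines{$attr} }, $pipe;
}

printf "%s: %d pipeline(s)\n", $_, scalar @{ $pipelines{$_} }
    for sort keys %pipelines;
```

Had `$spec' been a hash, the second `example' entry would silently
replace the first.
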