Revision 11
Fri Apr 28 15:41:10 2000 UTC, by unknown
Original Path: branches/CPAN/README
File size: 8411 byte(s)
This commit was manufactured by cvs2svn to create branch 'CPAN'.
WAIT 1.6

Copyright (c) 1996, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

This software is not actively maintained by its author.

For more than two years now I have tried to steal some time to clean
this up, without any luck. So I decided to pass the baton on. I
consider the input part pretty satisfying. The query part - despite
being operable and useful - needs a major overhaul. To provide a forum
for further discussions and to coordinate further development, I have
set up a mailing list. Drop me a line if you want to participate.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
WAIT - a rewrite of the freeWAIS-sf engine in Perl

Status of this document
In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at
best. Any additions or corrections are welcome.

PURPOSE
As you might know, I developed and maintained freeWAIS-sf (with
the help of many people on the Net). FreeWAIS-sf is based on
freeWAIS, maintained by the Clearinghouse for Networked
Information Discovery and Retrieval (CNIDR), which in turn is
based on wais-8-b5, implemented by Thinking Machines et al.
During this long history - implementation started about 1989 -
many people contributed to the distribution and added features
not foreseen by the original design. While the system fulfills
its task now, the code has reached a state where adding new
features is nearly impossible, and even fixing longstanding bugs
and removing limitations has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and
build a new system from scratch. For obvious reasons I chose
Perl as the implementation language.

DESCRIPTION
The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users
might want to build. Obviously the framework limits the class of
systems which can be built.

             +------+     +-----+     +------+
         ==> |Access| ==> |Parse| ==> |      |
             +------+     +-----+     |      |
                            ||        |      |     +-----+
                            ||        |Filter| ==> |Index|
                            \/        |      |     +-----+
            +-------+     +-----+     |      |
        <=  |Display| <== |Query| <-> |      |
            +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the
access and parse modules together with the filter definitions.
At query time, a query module and a display module must be
chosen in addition.

Access

The access module defines which documents are members of a
database. Usually an access module is a tied hash whose keys
are the ids of the documents (did = document id) and whose
values are the documents themselves. The indexing process loops
over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
retrieved with `FETCH'.

By convention, access modules should be members of the
`WAIT::Document' hierarchy. Have a look at the
`WAIT::Document::Split' module to get the idea.
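
To make the tied-hash idea concrete, here is a minimal sketch of an access module that keeps its documents in memory. The package name `My::Document::Memory' and its constructor arguments are made up for illustration and are not part of the WAIT distribution; a real access module would normally read documents from files.

```perl
use strict;
use warnings;

# Hypothetical access module: documents live in an in-memory hash.
# The package name is illustrative and not part of WAIT.
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs} }, $class;
}

# Return the document text for a given document id (did).
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Start a new iteration over the document ids.
sub FIRSTKEY {
    my ($self) = @_;
    $self->{keys} = [ sort keys %{ $self->{docs} } ];
    return shift @{ $self->{keys} };
}

# Continue the iteration; returns undef once every id was visited.
sub NEXTKEY {
    my ($self) = @_;
    return shift @{ $self->{keys} };
}

package main;

tie my %collection, 'My::Document::Memory',
    doc1 => "TI: First title\n",
    doc2 => "TI: Second title\n";

# An indexing loop would visit every did and fetch its document.
my @seen;
for my $did (keys %collection) {
    push @seen, "$did => $collection{$did}";
}
print @seen;
```

Here `keys' drives `FIRSTKEY' and `NEXTKEY', and each hash lookup goes through `FETCH', which is the same access pattern the indexing process is described as using.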

Parse

The task of the parse module is to split the documents into
logical parts via the `split' method. For example,
`WAIT::Parse::Nroff' splits manuals piped through nroff(1) into
the sections *name*, *synopsis*, *options*, *description*,
*author*, *example*, *bugs*, *text*, *see*, and *environment*.
Here is the implementation of `WAIT::Parse::Base', which handles
documents with a pretty simple tagged format:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
    TI: Searching Structured Documents with the Enhanced Retrieval
        Functionality of freeWAIS-sf and SFgate
    ER: D. Kroemker
    BT: Computer Networks and ISDN Systems; Proceedings of the third
        International World-Wide Web Conference
    PN: Elsevier
    PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
    PP: 1027-1036
    PY: 1995

    sub split {                    # called as method; $_[1] is the document
        my %result;
        my $fld;

        for (split /\n/, $_[1]) {
            if (s/^(\S+):\s*//) {  # line starts a new field, e.g. "AU: "
                $fld = lc $1;
            }
            # append the line (field tag removed) to the current field
            $result{$fld} .= $_ if defined $fld;
        }
        return \%result;           # attribute name => text
    }
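
As a usage sketch, the `split' body above can be tried on a shortened version of the sample record. The package name `My::Parse::Tagged' and its `new' constructor are stand-ins invented for this example; only the `split' body is taken from the text.

```perl
use strict;
use warnings;

# Hypothetical stand-in package; only the `split' body comes from the text.
package My::Parse::Tagged;

sub new { return bless {}, shift }

sub split {
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $doc = <<'EOD';
AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
TI: Searching Structured Documents with the Enhanced Retrieval
    Functionality of freeWAIS-sf and SFgate
PY: 1995
EOD

my $fields = My::Parse::Tagged->new->split($doc);
for my $name (sort keys %$fields) {
    print "$name = $fields->{$name}\n";
}
```

Note that continuation lines are appended verbatim and without a newline, so the leading indentation of the second `TI' line is the only thing separating "Retrieval" from "Functionality" in the resulting attribute.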

Since the original document cannot be reconstructed from its
attributes, we need a second method (*tag*) which marks the
regions of the document with tags for the different attributes.
This tagged form is used by the display module to highlight
search terms in the documents. Besides the tags for the
attributes, the method might assign the special tags `_b' and
`_i' to indicate bold and italic regions.

    sub tag {
        my @result;
        my $tag;

        for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;                # skip empty fields
            if (s/^(\S+)://) {
                push @result, {_b => 1}, "$1:";  # field name in bold
                $tag = lc $1;
            }
            if (defined $tag) {
                push @result, {$tag => 1}, "$_\n";
            } else {
                push @result, {}, "$_\n";        # text before the first field
            }
        }
        return @result; # we don't go for speed
    }

Obviously one could implement `split' via `tag'. The reason for
having two functions is speed. We need to call `split' for each
document when indexing a collection, so speed is essential
there. On the other hand, `tag' is called in order to display a
single document and may be a little slower. It may care about
tagging bold and italic regions. See `WAIT::Parse::Nroff' for
how this might decrease performance.
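
The remark that `split' could be implemented via `tag' can be sketched as follows. The helper name `split_via_tag' is hypothetical, and `tag' is repeated here as a plain function so that the example runs on its own; this is an illustration of the idea, not code from WAIT.

```perl
use strict;
use warnings;

# `tag' as shown in the text, repeated so the example is self-contained.
sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

# Hypothetical helper: derive the attribute hash from the tagged list.
sub split_via_tag {
    my @tagged = tag(@_);
    my %result;

    # The list alternates between a tag hash and the text it covers.
    while (my ($tags, $text) = splice @tagged, 0, 2) {
        for my $name (grep { !/^_/ } keys %$tags) {   # skip _b and _i
            $result{$name} .= $text;
        }
    }
    return \%result;
}

my $fields = split_via_tag(undef, "AU: Pfeifer, U.\nPY: 1995\n");
print "$_ =$fields->{$_}" for sort keys %$fields;
```

The attributes built this way keep the blank after the colon and the trailing newline, so they differ in whitespace from what `split' returns. The extra pass over the tagged list is also the kind of overhead the dedicated `split' avoids.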

Filter definition

From the Information Retrieval perspective, the hardest part of
the system is the filter module. The database administrator
defines for each attribute how its contents should be processed
before they are stored in the index. Usually the processing
contains steps to restrict the character set, transform case,
split into words, and reduce words to their stems. In WAIT these
steps are defined naturally as a pipeline of processing steps.
The pipelines are made up of functions in the package
WAIT::Filter, which is pre-populated with the most common
functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function `isotr' replaces unknown characters by blanks.
`isolc' transforms to lower case. `split2' splits into words and
removes words shorter than two characters. `stop' removes the
freeWAIS-sf stopwords and `Stem' applies the Porter algorithm
for computing the stem of the words.
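
How such a list of names could be turned into processing can be sketched with a small dispatch table. The filter names `only_alnum', `lower', `words2' and `nostop', the tiny stopword list, and the `run_pipeline' helper are all assumptions made for this example; they only imitate the described behaviour and are not the functions shipped in WAIT::Filter.

```perl
use strict;
use warnings;

# Illustrative stand-ins for filter functions; these are NOT the
# implementations found in WAIT::Filter.
my %filter = (
    # replace everything except ASCII letters, digits and blanks by a blank
    only_alnum => sub { map { (my $s = $_) =~ s/[^A-Za-z0-9 ]/ /g; $s } @_ },
    # transform to lower case
    lower      => sub { map { lc $_ } @_ },
    # split into words and drop words shorter than two characters
    words2     => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # remove words found in a (made-up) stopword list
    nostop     => sub {
        my %stop = map { $_ => 1 } qw(the of and);
        grep { !$stop{$_} } @_;
    },
);

# Run each step of the pipeline on the output of the previous step.
sub run_pipeline {
    my ($pipeline, @items) = @_;
    @items = $filter{$_}->(@items) for @$pipeline;
    return @items;
}

my @terms = run_pipeline(
    [ 'only_alnum', 'lower', 'words2', 'nostop' ],
    'Proceedings of the third International World-Wide Web Conference',
);
print join(' ', @terms), "\n";
```

Because every step takes a list and returns a list, steps can be reordered or dropped just by editing the array of names, which is what makes the pipeline notation convenient.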

The filter definition for a collection defines a set of
pipelines for the attributes and, optionally, modified pipelines
which should be used for prefix and interval searches.

Here is a complete example:

    # Stemmed attributes; the prefix and interval pipelines leave out
    # word splitting, stopword removal and stemming.
    my $stem = [{
                 'prefix'    => ['unroff', 'isotr', 'isolc'],
                 'intervall' => ['unroff', 'isotr', 'isolc'],
                },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
    # Same as $stem, but without stemming.
    my $text = [{
                 'prefix'    => ['unroff', 'isotr', 'isolc'],
                 'intervall' => ['unroff', 'isotr', 'isolc'],
                },
                'unroff', 'isotr', 'isolc', 'split2', 'stop'];
    # Soundex codes instead of stems; no stopword removal.
    my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];

    # Attribute => pipeline pairs; an attribute may be listed twice.
    my $spec = [
                'name'        => $stem,
                'synopsis'    => $stem,
                'bugs'        => $stem,
                'description' => $stem,
                'text'        => $stem,
                'environment' => $text,
                'example'     => $text,  'example' => $stem,
                'author'      => $sound, 'author'  => $stem,
               ];
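
The specification is an array reference rather than a hash, and some attributes appear twice, each time with a different pipeline. A hash would silently keep only the last entry for a repeated key. The following sketch walks an abbreviated specification pairwise and collects all pipelines per attribute; it only illustrates the data structure, and how WAIT itself consumes the specification is not shown in this document.

```perl
use strict;
use warnings;

# Abbreviated pipelines and spec in the style of the example above.
my $stem  = [ 'isotr', 'isolc', 'split2', 'stop', 'Stem' ];
my $sound = [ 'isotr', 'isolc', 'split2', 'Soundex' ];

my $spec = [
    'name'   => $stem,
    'author' => $sound, 'author' => $stem,   # listed twice on purpose
];

# Walk the list two elements at a time and keep every pipeline.
my %pipelines;
for (my $i = 0; $i < @$spec; $i += 2) {
    my ($attr, $pipeline) = @{$spec}[ $i, $i + 1 ];
    push @{ $pipelines{$attr} }, $pipeline;
}

for my $attr (sort keys %pipelines) {
    printf "%s: %d pipeline(s)\n", $attr, scalar @{ $pipelines{$attr} };
}
```

With this reading, `author' ends up with both a Soundex and a stemming pipeline, while `name' has a single one.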