Revision 10
Fri Apr 28 15:40:52 2000 UTC (23 years, 11 months ago) by ulpfr
Original Path: cvs-head/README
File size: 8411 byte(s)
Initial revision

WAIT 1.6

Copyright (c) 1996, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

This software is not actively maintained by its author.

For more than two years now I have tried to steal some time to clean
this up, without any luck. So I decided to pass the baton on. I consider
the input part pretty satisfying. The query part - despite being operable
and useful - needs a major overhaul. To provide a forum for further
discussions and to coordinate further development, I have set up a
mailing list. Drop me a line if you want to participate.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
WAIT - a rewrite of the freeWAIS-sf engine in Perl

Status of this document
I started writing down some information about the implementation
in my spare time, before I forget it. The material is incomplete
at best. Any additions, corrections, ... are welcome.

PURPOSE
As you might know, I developed and maintained freeWAIS-sf (with
the help of many people on the Net). FreeWAIS-sf is based on
freeWAIS, maintained by the Clearinghouse for Networked
Information Discovery and Retrieval (CNIDR), which in turn is
based on wais-8-b5, implemented by Thinking Machines et al. During
this long history - implementation started about 1989 - many people
contributed to the distribution and added features not foreseen
by the original design. While the system fulfills its task now,
the code has reached a state where adding new features is nearly
impossible, and even fixing longstanding bugs and removing
limitations has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and
build a new system from scratch. For obvious reasons I chose
Perl as the implementation language.

DESCRIPTION
The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users
might want to build. Obviously the framework limits the class of
systems that can be built.

        +------+     +-----+     +------+
    ==> |Access| ==> |Parse| ==> |      |
        +------+     +-----+     |      |
           ||                    |      |     +-----+
           ||                    |Filter| ==> |Index|
           \/                    |      |     +-----+
        +-------+    +-----+     |      |
    <=  |Display| <==|Query| <-> |      |
        +-------+    +-----+     +------+

A collection (aka table) is defined by the instances of the
access and parse modules together with the filter definitions. At
query time, a query module and a display module must be chosen
in addition.

Access

The access module defines which documents are members of a
database. Usually an access module is a tied hash whose keys
are the ids of the documents (did = document id) and whose
values are the documents themselves. The indexing process loops
over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
retrieved with `FETCH'.

By convention, access modules should be members of the
`WAIT::Document' hierarchy. Have a look at the
`WAIT::Document::Split' module to get the idea.
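
The tied-hash interface described above can be sketched as follows. This
is a minimal illustration, not code from WAIT: the package name
`My::Document::Memory', its constructor arguments, and the in-memory
storage are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical access module: documents are held in memory.
# The name and the constructor arguments are illustrative only.
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs} }, $class;
}

sub FETCH {                     # return the document for a did
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub FIRSTKEY {                  # start a new iteration over all dids
    my ($self) = @_;
    my @dids = sort keys %{ $self->{docs} };
    $self->{pending} = \@dids;  # remaining dids for NEXTKEY
    return shift @dids;
}

sub NEXTKEY {                   # continue the iteration
    my ($self) = @_;
    return shift @{ $self->{pending} };
}

package main;

tie my %collection, 'My::Document::Memory',
    doc1 => "TI: First document\n",
    doc2 => "TI: Second document\n";

# What an indexing loop might look like: each() drives FIRSTKEY and
# NEXTKEY, and the value lookup goes through FETCH.
my @seen;
while (my ($did, $text) = each %collection) {
    push @seen, $did;
}
```

Because the collection is an ordinary tied hash, the same loop would work
unchanged for a module that reads documents from files or a database.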

Parse

The task of the parse module is to split the documents into logical
parts via the `split' method. E.g. the `WAIT::Parse::Nroff' module
splits manuals piped through nroff(1) into the sections *name*,
*synopsis*, *options*, *description*, *author*, *example*,
*bugs*, *text*, *see*, and *environment*. `WAIT::Parse::Base'
handles documents with a pretty simple tagged format. Here is a
sample document, followed by the implementation of the method:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
    TI: Searching Structured Documents with the Enhanced Retrieval
        Functionality of freeWAIS-sf and SFgate
    ER: D. Kroemker
    BT: Computer Networks and ISDN Systems; Proceedings of the third
        International World-Wide Web Conference
    PN: Elsevier
    PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
    PP: 1027-1036
    PY: 1995

    sub split {                 # called as method
        my %result;
        my $fld;

        for (split /\n/, $_[1]) {
            # A leading "XX:" starts a new field; remember its name.
            if (s/^(\S+):\s*//) {
                $fld = lc $1;
            }
            # Append the line (tag removed) to the current field.
            $result{$fld} .= $_ if defined $fld;
        }
        return \%result;
    }
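
To see what the method returns, it can be called on a shortened version
of the sample record. The sketch below copies the logic into a
hypothetical package `My::Parse::Tagged' (the name and the `new'
constructor are assumptions for illustration); it writes `CORE::split'
only to make explicit that the builtin is meant inside a method of the
same name.

```perl
use strict;
use warnings;

# Hypothetical stand-in package holding the split logic shown above.
package My::Parse::Tagged;

sub new { return bless {}, shift }

sub split {                     # called as method
    my %result;
    my $fld;
    for (CORE::split(/\n/, $_[1])) {
        if (s/^(\S+):\s*//) {   # "XX:" starts a new field
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $record = <<'END';
AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
TI: Searching Structured Documents with the Enhanced Retrieval
    Functionality of freeWAIS-sf and SFgate
PY: 1995
END

# $fields is a hash reference keyed by the lowercased field names.
my $fields = My::Parse::Tagged->new->split($record);
```

Here `$fields' has the keys `au', `ti', and `py'. The continuation line
of the title is appended to `ti' with its leading blanks intact, since
only lines that start with a tag are rewritten.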

Since the original document cannot be reconstructed from its
attributes, we need a second method (*tag*) which marks the
regions of the document with tags for the different attributes.
This tagged form is used by the display module to highlight search
terms in the documents. Besides the tags for the attributes, the
method may assign the special tags `_b' and `_i' to indicate
bold and italic regions.

    sub tag {
        my @result;
        my $tag;

        for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;               # skip fields without content
            if (s/^(\S+)://) {
                push @result, {_b => 1}, "$1:"; # field label is shown in bold
                $tag = lc $1;
            }
            if (defined $tag) {
                push @result, {$tag => 1}, "$_\n";
            } else {
                push @result, {}, "$_\n";
            }
        }
        return @result;         # we don't go for speed
    }

Obviously one could implement `split' via `tag'. The reason for
having two functions is speed. We need to call `split' for each
document when indexing a collection, so its speed is essential.
`tag', on the other hand, is called only to display a single
document and may be a little slower. It may also take care of
tagging bold and italic regions. See `WAIT::Parse::Nroff' for
how this might decrease performance.
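
How a display module might consume the tagged form can be sketched as
follows. This is not WAIT's display code: the `render' function and its
output conventions (bold regions wrapped in `*', hits wrapped in
brackets) are assumptions chosen to keep the example small.

```perl
use strict;
use warnings;

# Hypothetical renderer: takes the flat (tags, text, tags, text, ...)
# list in the shape a tag method returns and produces plain text.
# Regions tagged `_b' are wrapped in '*'; occurrences of $term in all
# other regions are wrapped in '[...]'.
sub render {
    my ($term, @pairs) = @_;
    my $out = '';
    while (@pairs) {
        my ($tags, $text) = splice @pairs, 0, 2;
        if ($tags->{_b}) {
            $out .= "*$text*";
        } else {
            $text =~ s/(\Q$term\E)/[$1]/gi;   # case-insensitive highlight
            $out .= $text;
        }
    }
    return $out;
}

# Hand-written input in the (tags, text) shape described above.
my @tagged = (
    { _b => 1 }, "TI:",
    { ti => 1 }, " Searching Structured Documents\n",
    { _b => 1 }, "PY:",
    { py => 1 }, " 1995\n",
);

my $shown = render('structured', @tagged);
```

Because every text region carries its attribute tags, a renderer could
also restrict highlighting to the attributes that were actually
searched.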

Filter definition

From the Information Retrieval perspective, the hardest part of
the system is the filter module. The database administrator
defines, for each attribute, how its content is processed
before it is stored in the index. Processing usually includes
steps that restrict the character set, transform case, split the
text into words, and reduce the words to their stems. In WAIT
these steps are defined naturally as a pipeline of processing
steps. The pipelines are made up of functions in the package
WAIT::Filter, which is pre-populated with the most common
functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be
this pipeline:

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function `isotr' replaces unknown characters with blanks.
`isolc' transforms to lower case. `split2' splits into words and
removes words shorter than two characters. `stop' removes the
freeWAIS-sf stopwords, and `Stem' applies the Porter algorithm
to compute the stems of the words.
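
The pipeline idea can be illustrated with a small sketch. The three
functions below are simplified stand-ins written for this example
(`lower', `words2', and a `stop' with a four-word list); they are not
the WAIT::Filter implementations, and `run_pipeline' is a hypothetical
helper, not part of WAIT.

```perl
use strict;
use warnings;

# Simplified stand-ins for filter functions.  Each takes a list of
# strings and returns a list of strings.
my %filter = (
    lower  => sub { map { lc($_) } @_ },
    words2 => sub { grep { length($_) >= 2 } map { split /\W+/, $_ } @_ },
    stop   => sub {
        my %stop = map { $_ => 1 } qw(the of and with);   # tiny sample list
        grep { !$stop{$_} } @_;
    },
);

# Run the named steps left to right, feeding each step's output
# into the next step.
sub run_pipeline {
    my ($steps, @input) = @_;
    for my $name (@$steps) {
        my $step = $filter{$name} or die "unknown filter '$name'";
        @input = $step->(@input);
    }
    return @input;
}

my @terms = run_pipeline(
    ['lower', 'words2', 'stop'],
    'Searching Structured Documents with the Enhanced Retrieval',
);
```

With this input, `@terms' holds the lowercased words with `with' and
`the' removed. Adding a stemming step would only mean adding one more
entry to the table and one more name to the list.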

The filter definition for a collection defines a set of pipelines
for the attributes and can specify modified pipelines to be
used for prefix and interval searches.

Here is a complete example:

    my $stem  = [{
                  'prefix'    => ['unroff', 'isotr', 'isolc'],
                  'intervall' => ['unroff', 'isotr', 'isolc'],
                 },
                 'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
    my $text  = [{
                  'prefix'    => ['unroff', 'isotr', 'isolc'],
                  'intervall' => ['unroff', 'isotr', 'isolc'],
                 },
                 'unroff', 'isotr', 'isolc', 'split2', 'stop'];
    my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];

    my $spec = [
                'name'        => $stem,
                'synopsis'    => $stem,
                'bugs'        => $stem,
                'description' => $stem,
                'text'        => $stem,
                'environment' => $text,
                'example'     => $text,  'example' => $stem,
                'author'      => $sound, 'author'  => $stem,
               ];
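
Note that the specification is an array of attribute/pipeline pairs, and
that `example' and `author' each appear twice; a plain hash could not
hold both entries. The sketch below shows one way such a list could be
walked. The function `pipelines_by_attribute' and the shape of its
result are assumptions for illustration, not how WAIT processes the
specification internally, and the pipelines are shortened.

```perl
use strict;
use warnings;

my $stem  = [{ prefix => ['isolc'] }, 'isolc', 'split2', 'Stem'];
my $sound = ['isolc', 'split2', 'Soundex'];
my $spec  = [ 'name' => $stem, 'author' => $sound, 'author' => $stem ];

# Collect, per attribute, every pipeline defined for it.  An optional
# leading hash reference (the prefix/intervall variants) is separated
# from the list of step names.
sub pipelines_by_attribute {
    my ($spec) = @_;
    my %by_attr;
    my @pairs = @$spec;                    # copy, so $spec stays intact
    while (my ($attr, $def) = splice @pairs, 0, 2) {
        my @steps = @$def;                 # copy, so $def stays intact
        my $opts  = ref($steps[0]) eq 'HASH' ? shift @steps : {};
        push @{ $by_attr{$attr} }, { options => $opts, steps => \@steps };
    }
    return \%by_attr;
}

my $by_attr = pipelines_by_attribute($spec);
```

After the call, `$by_attr->{author}' holds two entries, one for the
Soundex pipeline and one for the stemming pipeline, in the order in
which they were listed.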

Properties

Name Value
cvs2svn:cvs-rev 1.1
