Revision 20
Tue May 9 11:29:45 2000 UTC (23 years, 11 months ago) by cvs2svn
Original Path: cvs-head/README
File size: 8200 byte(s)
This commit was generated by cvs2svn to compensate for changes in r10,
which included commits to RCS files with non-trunk default branches.

WAIT 1.8

Copyright (c) 1996-2000, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

News:

Locking
=======

WAIT now supports some basic locking.

Speed
=====

Searching large collections is now considerably faster:

    $table->search({attr  => 'text',
                    cont  => $query,
                    top   => 1,
                    picky => 0});

Table indices may now be tuned to improve search performance. Index
tuning can be switched on and off with $table->set(top => 1) and
$table->set(top => 0), to allow for bulk inserts.

Documentation
=============

WAIT is still not really documented, but Andreas König took the
trouble to comment the example scripts. This will help you
implement your own applications. I added some tiny scripts to
index, e.g., your .yow file or the fortune databases.

SourceForge
===========

WAIT is registered on SourceForge now:

http://wait.sourceforge.net/
https://sourceforge.net/project/?group_id=4814

I will keep the CVS repository up to date. If you have some spare
tuits, feel free to contribute.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

SYNOPSIS
A synopsis is not yet available.

Status of this document
In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete, to say
the least. Any additions or corrections are welcome.

PURPOSE
As you might know, I developed and maintained freeWAIS-sf (with the help
of many people on the Net). FreeWAIS-sf is based on freeWAIS, maintained
by the Clearinghouse for Networked Information Discovery and Retrieval
(CNIDR), which in turn is based on wais-8-b5, implemented by Thinking
Machines et al. During this long history (implementation started around
1989) many people contributed to the distribution and added features not
foreseen by the original design. While the system fulfills its task now,
the code has reached a state where adding new features is nearly
impossible, and even fixing longstanding bugs and removing limitations
has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a new
system from scratch. For obvious reasons I chose Perl as the
implementation language.

DESCRIPTION
The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might want
to build. Obviously the framework limits the class of systems that can
be built.

          +------+     +-----+     +------+
      ==> |Access| ==> |Parse| ==> |      |
          +------+     +-----+     |      |
                         ||        |      |     +-----+
                         ||        |Filter| ==> |Index|
                         \/        |      |     +-----+
         +-------+     +-----+     |      |
      <= |Display| <== |Query| <-> |      |
         +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the access and
parse modules together with the filter definitions. At query time, a
query module and a display module must be chosen in addition.

Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using `FIRSTKEY'
and `NEXTKEY'. Documents are retrieved with `FETCH'.

By convention access modules should be members of the `WAIT::Document'
hierarchy. Have a look at the `WAIT::Document::Split' module to get the
idea.
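
As a rough illustration of the tied-hash idea (not code from the WAIT
distribution), the following sketch ties a hash to a tiny in-memory
collection. The package name `My::Document::InMemory' and its constructor
arguments are assumptions made for this example; a real access module
would live under `WAIT::Document' and would usually read its documents
from files.

```perl
use strict;
use warnings;

# Hypothetical access module: maps document ids to in-memory strings.
# The package name and constructor arguments are illustrative only.
package My::Document::InMemory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep a sorted id list so that iteration order is stable.
    my $self = { docs => {%docs}, ids => [sort keys %docs], pos => 0 };
    return bless $self, $class;
}

sub FETCH {                       # retrieve one document by its id
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub FIRSTKEY {                    # restart the iteration
    my ($self) = @_;
    $self->{pos} = 0;
    return $self->{ids}[ $self->{pos}++ ];
}

sub NEXTKEY {                     # returns undef once all ids are consumed
    my ($self) = @_;
    return $self->{ids}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'My::Document::InMemory',
    doc1 => "TI: First title\n",
    doc2 => "TI: Second title\n";

# An indexer would walk the ids and fetch each document in turn.
my @seen;
while (my ($did, $doc) = each %collection) {
    push @seen, $did;
    print "$did: $doc";
}
```

Because the loop only relies on the standard tie interface, the same
loop works unchanged for any other class that implements `FIRSTKEY',
`NEXTKEY' and `FETCH'.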

Parse

The task of the parse module is to split the documents into logical
parts via the `split' method. For example, `WAIT::Parse::Nroff' splits
manuals piped through nroff(1) into the sections *name*, *synopsis*,
*options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
and *environment*. `WAIT::Parse::Base' handles documents with a pretty
simple tagged format. Here is a sample document in that format, followed
by the implementation of the `split' method:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
    TI: Searching Structured Documents with the Enhanced Retrieval
        Functionality of freeWAIS-sf and SFgate
    ER: D. Kroemker
    BT: Computer Networks and ISDN Systems; Proceedings of the third
        International World-Wide Web Conference
    PN: Elsevier
    PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
    PP: 1027-1036
    PY: 1995
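
    sub split {                     # called as method; $_[1] is the document
        my %result;
        my $fld;

        for (split /\n/, $_[1]) {
            # A leading "XX:" starts a new field; strip it from the line
            if (s/^(\S+):\s*//) {
                $fld = lc $1;
            }
            # Append the rest of the line to the current field
            $result{$fld} .= $_ if defined $fld;
        }
        return \%result;            # hash ref: attribute name => contents
    }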
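
To see what `split' returns, here is a small self-contained sketch. The
package name `My::Parse::Tagged' and its trivial constructor are
assumptions made for this example and are not part of WAIT; the method
body follows the code above, with `CORE::split' spelled out so that the
builtin cannot be confused with the method of the same name.

```perl
use strict;
use warnings;

package My::Parse::Tagged;        # hypothetical name, for illustration only

sub new { return bless {}, shift }

sub split {
    my ($self, $doc) = @_;
    my (%result, $fld);
    for (CORE::split /\n/, $doc) {
        # A leading "XX:" starts a new field.
        $fld = lc $1 if s/^(\S+):\s*//;
        # Continuation lines are appended verbatim, without the newline.
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured',
    ' Documents',                 # continuation line of the TI field
    'PY: 1995';

my $attr = My::Parse::Tagged->new->split($record);
printf "%s => %s\n", $_, $attr->{$_} for sort keys %$attr;
```

Note that a continuation line is joined to its field without the line
break, so any separating blank has to come from the leading whitespace
of the continuation line itself.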

Since the original document cannot be reconstructed from its attributes,
we need a second method (*tag*) which marks the regions of the document
with tags for the different attributes. This tagged form is used by the
display module to highlight search terms in the documents. Besides the
tags for the attributes, the method might assign the special tags `_b'
and `_i' to indicate bold and italic regions.

    sub tag {                       # called as method; $_[1] is the document
        my @result;
        my $tag;

        for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;   # skip fields without contents
            if (s/^(\S+)://) {
                # Emit the field label as a bold region
                push @result, {_b => 1}, "$1:";
                $tag = lc $1;
            }
            if (defined $tag) {
                push @result, {$tag => 1}, "$_\n";
            } else {
                push @result, {}, "$_\n";   # text before the first field
            }
        }
        return @result;             # we don't go for speed
    }

Obviously one could implement `split' via `tag'. The reason for having
two functions is speed. We need to call `split' for each document when
indexing a collection, so speed is essential there. On the other hand,
`tag' is called in order to display a single document and may be a
little slower. It may care about tagging bold and italic regions. See
`WAIT::Parse::Nroff' for an example of how this can decrease performance.
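
The flat list returned by `tag' alternates between a hash of tags and
the text region they apply to. As a minimal sketch of how a display step
might consume it, the following code marks a search term only inside
regions that carry a given attribute tag. The function name `highlight'
and the asterisk markup are assumptions for this example, not the
behaviour of WAIT's display modules.

```perl
use strict;
use warnings;

# Mark occurrences of $term, but only in regions tagged with $attr.
# @pairs is a flat list: (tags hash ref, text, tags hash ref, text, ...).
sub highlight {
    my ($attr, $term, @pairs) = @_;
    my $out = '';
    while (@pairs) {
        my ($tags, $text) = splice @pairs, 0, 2;
        $text =~ s/(\Q$term\E)/*$1*/gi if $tags->{$attr};
        $out .= $text;
    }
    return $out;
}

# Hand-written sample in the shape produced by the tag method above.
my @tagged = (
    { _b => 1 }, 'TI:', { ti => 1 }, " Searching Structured Documents\n",
    { _b => 1 }, 'PY:', { py => 1 }, " 1995\n",
);

print highlight('ti', 'structured', @tagged);
```

Concatenating the text regions without any markup gives back the
document text, which is why the tagged form, unlike the result of
`split', is suitable for display.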

Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how the contents should be processed before they are
stored in the index. Usually the processing includes steps to restrict
the character set, transform case, split the text into words, and reduce
the words to their stems. In WAIT these steps are defined naturally as a
pipeline of processing steps. The pipelines are made up of functions in
the package WAIT::Filter, which is pre-populated with the most common
functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem' ]

The function `isotr' replaces unknown characters with blanks. `isolc'
transforms to lower case. `split2' splits into words and removes words
shorter than two characters. `stop' removes the freeWAIS-sf stopwords
and `Stem' applies the Porter algorithm to compute the stem of the
words.
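
The pipeline idea itself can be sketched in a few lines of Perl. The
step functions below are simplified stand-ins written for this example:
they are not the WAIT::Filter implementations, they handle ASCII only,
the stop list is made up, and stemming is left out. The `my_' prefix and
the name `run_pipeline' are likewise assumptions.

```perl
use strict;
use warnings;

# Tiny illustrative stop list (not the freeWAIS-sf one).
my %STOP = map { $_ => 1 } qw(a an and of the with);

# Each step maps a list of strings to a new list of strings.
my %step = (
    my_tr    => sub { map { my $s = $_; $s =~ tr/a-zA-Z0-9/ /c; $s } @_ },
    my_lc    => sub { map { lc $_ } @_ },
    my_split => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    my_stop  => sub { grep { !$STOP{$_} } @_ },
);

# Feed the output of each named step into the next one.
sub run_pipeline {
    my ($pipeline, @data) = @_;
    for my $name (@$pipeline) {
        my $code = $step{$name} or die "unknown filter step '$name'";
        @data = $code->(@data);
    }
    return @data;
}

my @terms = run_pipeline(
    [qw(my_tr my_lc my_split my_stop)],
    'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

Keeping the steps as named functions in a lookup table is what lets a
pipeline be written down as a plain list of names, as in the freeWAIS-sf
example above.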

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.

Properties

Name Value
cvs2svn:cvs-rev 1.1.1.2
