Revision 107
Tue Jul 13 12:45:55 2004 UTC (19 years, 9 months ago) by dpavlin
File size: 8197 byte(s)
tag for version 1.900

WAIT 1.8

Copyright (c) 1996-2000, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

News:

Locking
=======

WAIT now supports some basic locking.

Speed
=====

Searching large collections is now considerably faster:

    $table->search({attr  => 'text',
                    cont  => $query,
                    top   => 1,
                    picky => 0});

Table indices may now be tuned to improve search performance. Index
tuning can be switched on with $table->set(top => 1) and off with
$table->set(top => 0) to allow for bulk inserts.

Documentation
=============

WAIT is still not really documented. But Andreas König took the
trouble to comment the example scripts. This will help you
implement your own applications. I added some tiny scripts to
index e.g. your .yow file or the fortune databases.

SourceForge
===========

WAIT is registered on SourceForge now:

    http://wait.sourceforge.net/
    https://sourceforge.net/project/?group_id=4814

I will keep the CVS repository up to date. If you have some spare
tuits, feel free to contribute.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
    WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

SYNOPSIS
    A Synopsis is not yet available.

Status of this document
    In my spare time I started writing down some information about the
    implementation before I forget it. It is incomplete at best.
    Additions and corrections are welcome.

PURPOSE
    As you might know, I developed and maintained freeWAIS-sf (with the help
    of many people on the Net). FreeWAIS-sf is based on freeWAIS, maintained
    by the Clearinghouse for Networked Information Discovery and Retrieval
    (CNIDR), which in turn is based on wais-8-b5, implemented by Thinking
    Machines et al. During this long history (implementation started about
    1989) many people contributed to the distribution and added features
    not foreseen by the original design. While the system fulfills its task
    now, the code has reached a state where adding new features is nearly
    impossible, and even fixing longstanding bugs and removing limitations
    has become a very time-consuming task.

    Therefore I decided to pass the maintenance to WSC Inc. and build a new
    system from scratch. For obvious reasons I chose Perl as the
    implementation language.

DESCRIPTION
    The central idea of the system is to provide a framework and the
    building blocks for any indexing and search system the users might want
    to build. Obviously the framework limits the class of systems that can
    be built.

            +------+     +-----+     +------+
        ==> |Access| ==> |Parse| ==> |      |
            +------+     +-----+     |      |
                           ||        |      |     +-----+
                           ||        |Filter| ==> |Index|
                           \/        |      |     +-----+
           +-------+     +-----+     |      |
        <= |Display| <== |Query| <-> |      |
           +-------+     +-----+     +------+

    A collection (aka table) is defined by the instances of the access and
    parse modules together with the filter definitions. At query time, a
    query module and a display module must be chosen in addition.

  Access
    The access module defines which documents are members of a database.
    Usually an access module is a tied hash whose keys are the IDs of the
    documents (did = document id) and whose values are the documents
    themselves. The indexing process loops over the keys using "FIRSTKEY"
    and "NEXTKEY". Documents are retrieved with "FETCH".

    By convention, access modules should be members of the "WAIT::Document"
    hierarchy. Have a look at the "WAIT::Document::Split" module to get the
    idea.

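    As a rough sketch of the tied-hash convention described above, the
    following stand-alone class keeps a few documents in memory. The package
    name My::Document::Memory and its constructor arguments are made up for
    this illustration and are not part of WAIT; only the method names
    TIEHASH, FETCH, FIRSTKEY and NEXTKEY come from Perl's tie interface.

```perl
use strict;
use warnings;

# Hypothetical access module: an in-memory collection behind a tied hash.
# Name and constructor arguments are assumptions, not WAIT code.
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep the dids in a fixed order so that iteration is repeatable.
    my $self = { docs => {%docs}, keys => [sort keys %docs], pos => 0 };
    return bless $self, $class;
}

sub FETCH {                      # retrieve one document by its did
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub FIRSTKEY {                   # start a new pass over the collection
    my ($self) = @_;
    $self->{pos} = 0;
    return $self->{keys}[0];
}

sub NEXTKEY {                    # continue the pass; undef ends it
    my ($self) = @_;
    return $self->{keys}[ ++$self->{pos} ];
}

package main;

tie my %collection, 'My::Document::Memory',
    doc1 => "AU: Pfeifer, U.\nPY: 1995\n",
    doc2 => "AU: Fuhr, N.\nPY: 1995\n";

# An indexing loop would walk the collection roughly like this.
my @seen;
while (my ($did, $document) = each %collection) {
    push @seen, $did;
    print "$did: ", length($document), " characters\n";
}
```

    Because the collection is only ever touched through the hash
    interface, the same loop would work unchanged for documents stored in
    files or in a database.
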
  Parse
    The task of the parse module is to split the documents into logical
    parts via the "split" method. For example, "WAIT::Parse::Nroff" splits
    manuals piped through nroff(1) into the sections *name*, *synopsis*,
    *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
    and *environment*. Here is the implementation of "WAIT::Parse::Base",
    which handles documents with a pretty simple tagged format:

        AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
        TI: Searching Structured Documents with the Enhanced Retrieval
            Functionality of freeWAIS-sf and SFgate
        ER: D. Kroemker
        BT: Computer Networks and ISDN Systems; Proceedings of the third
            International World-Wide Web Conference
        PN: Elsevier
        PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
        PP: 1027-1036
        PY: 1995

        sub split {                  # called as method
          my %result;
          my $fld;

          for (split /\n/, $_[1]) {
            if (s/^(\S+):\s*//) {    # a "XX:" prefix starts a new field
              $fld = lc $1;
            }
            # lines without a prefix continue the current field
            $result{$fld} .= $_ if defined $fld;
          }
          return \%result;
        }

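    To make the result of "split" concrete, here is a small stand-alone
    sketch that applies the same logic to part of the sample record. The
    package name My::Parse::Tagged and its trivial constructor are
    assumptions made for this example; the builtin is called as
    CORE::split only to avoid an ambiguity warning inside a sub that is
    itself named split.

```perl
use strict;
use warnings;

# Stand-in parse class; the package name is a hypothetical example.
package My::Parse::Tagged;

sub new { return bless {}, shift }

sub split {                      # same logic as the listing above
    my %result;
    my $fld;
    for (CORE::split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

my $fields = My::Parse::Tagged->new->split($record);

# One entry per attribute, keyed by the lower-cased field name.
for my $name (sort keys %$fields) {
    print "$name => $fields->{$name}\n";
}
```

    Note that the continuation line of the TI field is appended as is, so
    its leading blanks are what separates it from the first line.
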
    Since the original document cannot be reconstructed from its attributes,
    we need a second method (*tag*) which marks the regions of the document
    with tags for the different attributes. This tagged form is used by the
    display module to highlight search terms in the documents. Besides the
    tags for the attributes, the method might assign the special tags "_b"
    and "_i" to indicate bold and italic regions.

        sub tag {
          my @result;
          my $tag;

          for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;    # skip fields without content
            if (s/^(\S+)://) {
              push @result, {_b => 1}, "$1:";   # field label in bold
              $tag = lc $1;
            }
            if (defined $tag) {
              push @result, {$tag => 1}, "$_\n";
            } else {
              push @result, {}, "$_\n";
            }
          }
          return @result;            # we don't go for speed
        }

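    The list returned by "tag" alternates between a hash of tags and the
    text those tags apply to. As a minimal sketch of how a display step
    might consume it, the code below marks bold regions with asterisks.
    The render function and the package name My::Parse::Tagged are
    hypothetical; this is not the WAIT display module.

```perl
use strict;
use warnings;

# tag() as in the listing above, wrapped in a hypothetical package.
package My::Parse::Tagged;

sub tag {
    my @result;
    my $tag;
    for (CORE::split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

package main;

# Hypothetical display step: walk the (tags, text) pairs and wrap bold
# regions in asterisks. Attribute tags are ignored here; a real display
# module would use them to highlight search terms.
sub render {
    my @pairs = @_;
    my $out = '';
    while (my ($tags, $text) = splice @pairs, 0, 2) {
        $out .= $tags->{_b} ? "*$text*" : $text;
    }
    return $out;
}

my $rendered = render(My::Parse::Tagged->tag("AU: Fuhr, N.\nPY: 1995\n"));
print $rendered;
```
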
    Obviously one could implement "split" via "tag". The reason for having
    two functions is speed. We need to call "split" for each document when
    indexing a collection. Therefore speed is essential. On the other hand,
    "tag" is called in order to display a single document and may be a
    little slower. It may care about tagging bold and italic regions. See
    "WAIT::Parse::Nroff" for how this might decrease performance.

  Filter definition
    From the Information Retrieval perspective, the hardest part of the
    system is the filter module. The database administrator defines for each
    attribute how its content should be processed before it is stored in
    the index. Usually the processing includes steps to restrict the
    character set, transform case, split into words, and reduce words to
    their stems. In WAIT these steps are defined naturally as a pipeline
    of processing steps. The pipelines are made up of functions in the
    package WAIT::Filter, which is pre-populated with the most common
    functions but may be extended at any time.

    The equivalent of typical freeWAIS-sf processing would be this
    pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

    The function "isotr" replaces unknown characters by blanks. "isolc"
    transforms to lower case. "split2" splits into words and removes words
    shorter than two characters. "stop" removes the freeWAIS-sf stopwords
    and "Stem" applies the Porter algorithm for computing the stem of the
    words.

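    The idea of a pipeline can be sketched without WAIT itself: each step
    takes a list of strings and returns a list of strings, and the steps
    are applied in order. The filters below are simplified stand-ins
    written for this example, NOT the WAIT::Filter implementations. In
    particular the stopword list is a tiny made-up sample and the
    stand-in for "Stem" only strips a trailing "s" instead of applying
    the Porter algorithm. The helper run_pipeline is hypothetical too.

```perl
use strict;
use warnings;

# Made-up stopword sample; the real list comes from freeWAIS-sf.
my %STOP = map { $_ => 1 } qw(a an and of the with);

# Simplified stand-ins for the filters named above (assumptions only).
my %filter = (
    isotr  => sub { map { (my $s = $_) =~ s/[^A-Za-z0-9]/ /g; $s } @_ },
    isolc  => sub { map { lc $_ } @_ },
    split2 => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    stop   => sub { grep { !$STOP{$_} } @_ },
    # Crude plural stripping, standing in for the Porter stemmer.
    Stem   => sub { map { (my $s = $_) =~ s/(?<=[a-z]{3})s$//; $s } @_ },
);

# Apply the named filters in order; each one maps a list to a list.
sub run_pipeline {
    my ($pipeline, @values) = @_;
    @values = $filter{$_}->(@values) for @$pipeline;
    return @values;
}

my @terms = run_pipeline(
    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem' ],
    'Computer Networks and ISDN Systems; Proceedings of the third Conference',
);
print "@terms\n";
# prints: computer network isdn system proceeding third conference
```

    Keeping every step as a list-to-list function is what makes the
    steps freely composable: a different attribute can simply be given a
    different list of names.
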
    The filter definition for a collection defines a set of pipelines for
    the attributes and modifies the pipelines which should be used for
    prefix and interval searches.

    Several complete working examples come with WAIT in the script
    directory. It is recommended to follow the pattern of the scripts
    smakewhatis and sman.

