This is a repository of my old source code, which is no longer updated. Go to git.rot13.org for current projects!
Annotation of /branches/unido/README

Revision 106
Tue Jul 13 12:22:09 2004 UTC by dpavlin
File size: 8200 byte(s)
Changes made by Andreas J. Koenig <andreas.koenig(at)anima.de> for Unido project

WAIT 1.8

Copyright (c) 1996-2000, Ulrich Pfeifer

------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

News:

Locking
=======

WAIT now supports some basic locking.

Speed
=====

Searching large collections is now considerably faster:

    $table->search({attr  => 'text',
                    cont  => $query,
                    top   => 1,
                    picky => 0});

Table indices may now be tuned to improve search performance. Index
tuning can be switched on and off using $table->set(top => 1) and
$table->set(top => 0) to allow for bulk inserts.

Documentation
=============

WAIT is still not really documented, but Andreas König took the
trouble to comment the example scripts. This will help you implement
your own applications. I added some tiny scripts to index e.g. your
.yow file or the fortune databases.

SourceForge
===========

WAIT is registered on SourceForge now:

    http://wait.sourceforge.net/
    https://sourceforge.net/project/?group_id=4814

I will keep the CVS repository up to date. If you have some spare
tuits, feel free to contribute.

Ulrich Pfeifer <upf@wait.de>

------------------------------------------------------------------------
NAME
WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

SYNOPSIS
A Synopsis is not yet available.

Status of this document
In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at best.
Any additions, corrections, ... are welcome.

PURPOSE
As you might know, I developed and maintained freeWAIS-sf (with the help
of many people on the Net). freeWAIS-sf is based on freeWAIS, maintained
by the Clearinghouse for Networked Information Discovery and Retrieval
(CNIDR), which in turn is based on wais-8-b5 implemented by Thinking
Machines et al. During this long history - implementation started about
1989 - many people contributed to the distribution and added features
not foreseen by the original design. While the system fulfills its task
now, the code has reached a state where adding new features is nearly
impossible, and even fixing longstanding bugs and removing limitations
has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a new
system from scratch. For obvious reasons I chose Perl as the
implementation language.

DESCRIPTION
The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might want
to build. Obviously the framework limits the class of systems which can
be built.

         +------+     +-----+     +------+
     ==> |Access| ==> |Parse| ==> |      |
         +------+     +-----+     |      |
                        ||        |      |     +-----+
                        ||        |Filter| ==> |Index|
                        \/        |      |     +-----+
        +-------+     +-----+     |      |
     <= |Display| <== |Query| <-> |      |
        +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the access and
parse modules together with the filter definitions. At query time, a
query and a display module must be chosen in addition.

Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the Ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using `FIRSTKEY'
and `NEXTKEY'. Documents are retrieved with `FETCH'.

By convention access modules should be members of the `WAIT::Document'
hierarchy. Have a look at the `WAIT::Document::Split' module to get the
idea.

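To illustrate the tied-hash interface, here is a minimal sketch of an
in-memory access module. The package name `WAIT::Document::InMemory' and
its constructor arguments are assumptions made for this example and are
not part of WAIT; only the `FIRSTKEY'/`NEXTKEY'/`FETCH' protocol is
taken from the description above.

```perl
use strict;
use warnings;

# Hypothetical access module (not shipped with WAIT): keeps all
# documents in memory and exposes them through the tied-hash protocol.
package WAIT::Document::InMemory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep a sorted key list so that iteration order is stable.
    return bless { docs => {%docs}, keys => [sort keys %docs], pos => 0 }, $class;
}

sub FETCH {                       # return the document for a given did
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub FIRSTKEY {                    # restart the iteration
    my ($self) = @_;
    $self->{pos} = 0;
    return $self->{keys}[ $self->{pos}++ ];
}

sub NEXTKEY {                     # next did, or undef when exhausted
    my ($self) = @_;
    return $self->{keys}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'WAIT::Document::InMemory',
    doc1 => "TI: First title\n",
    doc2 => "TI: Second title\n";

# An indexing loop only needs key iteration plus FETCH.
my @dids = keys %collection;
for my $did (@dids) {
    print "$did: $collection{$did}";
}
```

A real access module would typically read documents from files or a
database instead of holding them in a hash.
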
Parse

The task of the parse module is to split the documents into logical
parts via the `split' method. E.g. the `WAIT::Parse::Nroff' splits
manuals piped through nroff(1) into the sections *name*, *synopsis*,
*options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
and *environment*. Here is the implementation of `WAIT::Parse::Base'
which handles documents with a pretty simple tagged format:

    AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
    TI: Searching Structured Documents with the Enhanced Retrieval
        Functionality of freeWAIS-sf and SFgate
    ER: D. Kroemker
    BT: Computer Networks and ISDN Systems; Proceedings of the third
        International World-Wide Web Conference
    PN: Elsevier
    PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
    PP: 1027-1036
    PY: 1995

    sub split {                      # called as method: $parser->split($doc)
        my %result;                  # attribute name => accumulated text
        my $fld;                     # attribute currently being filled

        for (split /\n/, $_[1]) {
            if (s/^(\S+):\s*//) {    # a "XX:" prefix starts a new attribute
                $fld = lc $1;
            }
            $result{$fld} .= $_ if defined $fld;   # append continuation lines
        }
        return \%result;
    }

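As a usage sketch, the `split' method can be tried on a shortened
version of the sample record. The package name `Demo::Parse' is a
stand-in chosen for this example, and the builtin is spelled
`CORE::split' here only to make the distinction from the method
explicit.

```perl
use strict;
use warnings;

package Demo::Parse;    # hypothetical stand-in for a WAIT::Parse class

sub split {             # same logic as the split method shown above
    my %result;
    my $fld;

    for (CORE::split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

# Shortened sample record; the indented line continues the TI field.
my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

my $attr = Demo::Parse->split($record);

# Print one line per attribute, sorted by attribute name.
for my $name (sort keys %$attr) {
    print "$name => $attr->{$name}\n";
}
```

Continuation lines are appended verbatim, so the leading blanks of the
second title line end up inside the `ti' attribute.
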
Since the original document cannot be reconstructed from its attributes,
we need a second method (*tag*) which marks the regions of the document
with tags for the different attributes. This tagged form is used by the
display module to highlight search terms in the documents. Besides the
tags for the attributes, the method might assign the special tags `_b'
and `_i' for indicating bold and italic regions.

    sub tag {
        my @result;                  # flat list of (tag hash, text) pairs
        my $tag;                     # attribute of the current region

        for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;    # skip empty fields
            if (s/^(\S+)://) {       # field label: emit it as bold
                push @result, {_b => 1}, "$1:";
                $tag = lc $1;
            }
            if (defined $tag) {
                push @result, {$tag => 1}, "$_\n";
            } else {
                push @result, {}, "$_\n";   # text before the first field
            }
        }
        return @result;              # we don't go for speed
    }

Obviously one could implement `split' via `tag'. The reason for having
two functions is speed. We need to call `split' for each document when
indexing a collection, so speed is essential there. On the other hand,
`tag' is called in order to display a single document and may be a
little slower. It may care about tagging bold and italic regions. See
`WAIT::Parse::Nroff' for how this might decrease performance.

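The remark that `split' could be implemented via `tag' can be sketched
as follows. The method name `split_via_tag' and the package
`Demo::Parse' are made up for this illustration; the sketch walks the
(tag hash, text) pairs returned by `tag' and collects the text under
every attribute tag, ignoring the special tags starting with `_'.

```perl
use strict;
use warnings;

package Demo::Parse;    # hypothetical stand-in for a WAIT::Parse class

sub tag {               # same logic as the tag method shown above
    my @result;
    my $tag;

    for (CORE::split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

sub split_via_tag {     # illustrative only, not part of WAIT
    my ($self, $doc) = @_;
    my %result;
    my @pairs = $self->tag($doc);

    # Consume the flat list two elements at a time.
    while (my ($tags, $text) = splice @pairs, 0, 2) {
        for my $name (grep { !/^_/ } keys %$tags) {
            $result{$name} .= $text;
        }
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

my $attr = Demo::Parse->split_via_tag($record);

for my $name (sort keys %$attr) {
    (my $text = $attr->{$name}) =~ s/\s+/ /g;   # collapse whitespace for display
    print "$name =>$text\n";
}
```

Unlike `split', this variant keeps the blank after the colon and the
trailing newlines, and it builds an intermediate list for every
document, which is the kind of overhead the dedicated `split' method
avoids.
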
Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines for each
attribute how its contents should be processed before they are stored in
the index. Usually the processing contains steps to restrict the
character set, transform case, split into words, and reduce words to
their stems. In WAIT these steps are defined naturally as a pipeline of
processing steps. The pipelines are made up of functions in the package
WAIT::Filter, which is pre-populated with the most common functions but
may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

    [ 'isotr', 'isolc', 'split2', 'stop', 'Stem' ]

The function `isotr' replaces unknown characters by blanks. `isolc'
transforms to lower case. `split2' splits into words and removes words
shorter than two characters. `stop' removes the freeWAIS-sf stopwords
and `Stem' applies the Porter algorithm for computing the stem of the
words.

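The pipeline idea can be sketched in plain Perl without WAIT. All names
below (`demo_tr', `demo_lc', `demo_split2', `demo_stop',
`run_pipeline') and the tiny stopword list are assumptions for this
example; they only mimic the roles described above and are not the
WAIT::Filter functions. Stemming is left out to keep the sketch short.

```perl
use strict;
use warnings;

# Tiny illustrative stopword list, not the freeWAIS-sf one.
my %STOP = map { $_ => 1 } qw(the of and with);

# Each stage maps a list of strings to a list of strings.
my %filter = (
    # replace everything except ASCII letters and digits by blanks
    demo_tr     => sub { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ },
    # transform to lower case
    demo_lc     => sub { map { lc $_ } @_ },
    # split into words and drop words shorter than two characters
    demo_split2 => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # drop stopwords
    demo_stop   => sub { grep { !$STOP{$_} } @_ },
);

# Feed the output of each named stage into the next one.
sub run_pipeline {
    my ($stages, @input) = @_;
    for my $name (@$stages) {
        my $stage = $filter{$name} or die "unknown filter '$name'";
        @input = $stage->(@input);
    }
    return @input;
}

my @terms = run_pipeline(
    [qw(demo_tr demo_lc demo_split2 demo_stop)],
    'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

Naming the stages in a list keeps the definition declarative: an
attribute's processing can be changed by editing the list rather than
the code.
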
The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.
