Diff of /trunk/README

cvs-head/README revision 10 by ulpfr, Fri Apr 28 15:40:52 2000 UTC trunk/README revision 107 by dpavlin, Tue Jul 13 12:45:55 2004 UTC
# Line 1  Line 1 
1                               WAIT 1.6                               WAIT 1.8
2    
3                    Copyright (c) 1996, Ulrich Pfeifer                    Copyright (c) 1996-2000, Ulrich Pfeifer
4    
5  ------------------------------------------------------------------------  ------------------------------------------------------------------------
6      This program is free software; you can redistribute it and/or      This program is free software; you can redistribute it and/or
# Line 11  Line 11 
11      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
12  ------------------------------------------------------------------------  ------------------------------------------------------------------------
13    
14  This software is not actively maintained by it's author.  News:
15    
16  For more two years now I tried to steal some time to clean this up  Locking
17  without any luck. So I decided to pass the baton on. I consider the  =======
18  input part pretty satisfying. The query part - despite being operable  
19  and useful - needs a major overhaul. To provide a forum for further  WAIT now supports some basic locking.
20  discussions an to coordinate further developement, I did setup a  
21  mailinglist.  Drop me a line if you want to participate.  Speed
22    =====
23    
24    Searching large collections is now considerably faster:
25    
26            $table->search({attr  => 'text',
27                            cont  => $query,
28                            top   => 1,
29                            picky => 0});
30    
31    Table indices may now be tuned to improve search performance.  The
32    index tuning can be switched on and off using $table->set(top=>1/0) to
33    allow for bulk inserts.
34    
35    Documentation
36    =============
37    
38    WAIT is still not documented really.  But Andreas König took the
39    trouble to comment the example scripts.  This will help you
40    implement your own applications.  I added some tiny scripts to
41    index e.g. your .yow file or the fortune databases.
42    
43    SourceForge
44    ===========
45    
46    WAIT is registered on SourceForge now:
47    
48            http://wait.sourceforge.net/
49            https://sourceforge.net/project/?group_id=4814
50    
51    I will keep the CVS repository up to date.  If you have some spare
52    tuits, feel free to contribute.
53    
54  Ulrich Pfeifer <upf@wait.de>  Ulrich Pfeifer <upf@wait.de>
55    
56  ------------------------------------------------------------------------  ------------------------------------------------------------------------
57  NAME  NAME
58      WAIT - a rewrite of the freeWAIS-sf engine in Perl      WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS
59    
60    SYNOPSIS
61        A Synopsis is not yet available.
62    
63  Status of this document  Status of this document
64      I started writing down some information about the implementation      I started writing down some information about the implementation before
65      before I forget them in my spare time. The stuff is incomplete      I forget them in my spare time. The stuff is incomplete at least. Any
66      at least. Any additions, corrections, ... welcome.      additions, corrections, ... welcome.
67    
68  PURPOSE  PURPOSE
69      As you might know, I developed and maintained freeWAIS-sf (with      As you might know, I developed and maintained freeWAIS-sf (with the help
70      the help of many people in The Net). FreeWAIS-sf is based on      of many people in The Net). FreeWAIS-sf is based on freeWAIS maintained
71      freeWAIS maintained by the Clearing House for Network      by the Clearing House for Network Information Retrieval (CNIDR) which in
72      Information Retrieval (CNIDR) which in turn is based on wais-8-      turn is based on wais-8-b5 implemented by Thinking Machine et al. During
73      b5 implemented by Thinking Machine et al. During this long      this long history - implementation started about 1989 - many people
74      history - implementation started about 1989 - many people      contributed to the distribution and added features not foreseen by the
75      contributed to the distribution and added features not foreseen      original design. While the system fulfills its task now, the code has
76      by the original design. While the system fulfills its task now,      reached a state where adding new features is nearly impossible and even
77      the code has reached a state where adding new features is nearly      fixing longstanding bugs and removing limitations has become a very time
78      impossible and even fixing longstanding bugs and removing      consuming task.
79      limitations has become a very time consuming task.  
80        Therefore I decided to pass the maintenance to WSC Inc. and built a new
81      Therefore I decided to pass the maintenance to WSC Inc. and      system from scratch. For obvious reasons I chose Perl as
82      built a new system from scratch. For obvious reasons I chose      implementation language.
     Perl as implementation language.  
83    
84  DESCRIPTION  DESCRIPTION
85      The central idea of the system is to provide a framework and the      The central idea of the system is to provide a framework and the
86      building blocks for any indexing and search system the users      building blocks for any indexing and search system the users might want
87      might want to build. Obviously the framework limits the class of      to build. Obviously the framework limits the class of systems which can
88      systems which can be built.      be built.
89    
90             +------+     +-----+     +------+             +------+     +-----+     +------+
91         ==> |Access| ==> |Parse| ==> |      |         ==> |Access| ==> |Parse| ==> |      |
# Line 64  DESCRIPTION Line 97  DESCRIPTION
97         <= |Display| <== |Query| <-> |      |         <= |Display| <== |Query| <-> |      |
98            +-------+     +-----+     +------+            +-------+     +-----+     +------+
99    
100      A collection (aka table) is defined by the instances of the      A collection (aka table) is defined by the instances of the access and
101      access and parse module together with the filter definitions. At      parse module together with the filter definitions. At query time in
102      query time in addition a query and a display module must be      addition a query and a display module must be chosen.
     chosen.  
103    
104    Access    Access
105        The access module defines which documents are members of a database.
106      The access module defines which documents where members of a      Usually an access module is a tied hash, whose keys are the Ids of the
107      database. Usually an access module is a tied hash, whose keys      documents (did = document id) and whose values are the documents
108      are the Ids of the documents (did = document id) and whose      themselves. The indexing process loops over the keys using "FIRSTKEY"
109      values are the documents themselves. The indexing process loops      and "NEXTKEY". Documents are retrieved with "FETCH".
110      over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are  
111      retrieved with `FETCH'.      By convention access modules should be members of the "WAIT::Document"
112        hierarchy. Have a look at the "WAIT::Document::Split" module to get the
113      By convention access modules should be members of the      idea.
     `WAIT::Document' hierarchy. Have a look at the  
     `WAIT::Document::Split' module to get the idea.  
114    
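The tied-hash convention described in the Access section can be sketched as follows. This is only an illustration under stated assumptions: the class name My::Document::Memory and its in-memory storage are hypothetical and are not part of the WAIT distribution, where a real access module would live in the WAIT::Document hierarchy.

```perl
use strict;
use warnings;

# Hypothetical access module (an assumption for illustration): documents are
# kept in an in-memory hash keyed by document id (did).
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs} }, $class;
}

# Retrieve one document by its document id.
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Start an iteration over all document ids.
sub FIRSTKEY {
    my ($self) = @_;
    my @dids = sort keys %{ $self->{docs} };
    $self->{pending} = \@dids;
    return shift @{ $self->{pending} };
}

# Continue the iteration; returns undef once all ids have been handed out.
sub NEXTKEY {
    my ($self) = @_;
    return shift @{ $self->{pending} };
}

package main;

tie my %collection, 'My::Document::Memory',
    1 => "TI: First title\n",
    2 => "TI: Second title\n";

# An indexing-style loop: keys() drives FIRSTKEY/NEXTKEY, the lookup
# drives FETCH.
my @seen;
for my $did (keys %collection) {
    push @seen, [ $did, $collection{$did} ];
}
```

Because the indexer only relies on these three methods, the same loop would work whether the documents come from memory, files, or a database.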
115    Parse    Parse
116        The task of the parse module is to split the documents into logical
117      The task parse module is to split the documents into logical      parts via the "split" method. E.g. the "WAIT::Parse::Nroff" splits
118      parts via the `split' method. E.g. the `WAIT::Parse::Nroff'      manuals piped through nroff(1) into the sections *name*, *synopsis*,
119      splits manuals piped through nroff(1) into the sections *name*,      *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
120      *synopsis*, *options*, *description*, *author*, *example*,      and *environment*. Here is the implementation of "WAIT::Parse::Base"
121      *bugs*, *text*, *see*, and *environment*. Here is the      which handles documents with a pretty simple tagged format:
     implementation of `WAIT::Parse::Base' which handes documents  
     with a pretty simple tagged format:  
122    
123        AU: Pfeifer, U.; Fuhr, N.; Huynh, T.        AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
124        TI: Searching Structured Documents with the Enhanced Retrieval        TI: Searching Structured Documents with the Enhanced Retrieval
# Line 106  DESCRIPTION Line 134  DESCRIPTION
134        sub split {                     # called as method        sub split {                     # called as method
135          my %result;          my %result;
136          my $fld;          my $fld;
137          
138          for (split /\n/, $_[1]) {          for (split /\n/, $_[1]) {
139            if (s/^(\S+):\s*//) {            if (s/^(\S+):\s*//) {
140              $fld = lc $1;              $fld = lc $1;
# Line 114  DESCRIPTION Line 142  DESCRIPTION
142            $result{$fld} .= $_ if defined $fld;            $result{$fld} .= $_ if defined $fld;
143          }          }
144          return \%result;          return \%result;
145        }        }
146    
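To make the result of "split" concrete, here is a small self-contained sketch of a parser for the tagged format, applied to the first two sample lines. It is written independently of WAIT::Parse::Base: the class name My::Parse::Tagged is hypothetical, and the way the "if" block is closed is an assumption, since the diff does not show that line.

```perl
use strict;
use warnings;

# Hypothetical parser class (an assumption), not WAIT::Parse::Base itself.
package My::Parse::Tagged;

sub new { return bless {}, shift }

sub split {                             # called as method
    my ($self, $text) = @_;
    my (%result, $fld);
    # CORE::split makes explicit that the builtin is meant, not this method.
    for my $line (CORE::split /\n/, $text) {
        if ($line =~ s/^(\S+):\s*//) {
            $fld = lc $1;               # an "XX:" prefix starts a new attribute
        }
        # Lines without a prefix are appended to the current attribute.
        $result{$fld} .= $line if defined $fld;
    }
    return \%result;
}

package main;

my $record = My::Parse::Tagged->new->split(
    "AU: Pfeifer, U.; Fuhr, N.; Huynh, T.\n"
  . "TI: Searching Structured Documents with the Enhanced Retrieval\n"
);
```

The returned hash reference maps the lower-cased tag names ("au", "ti") to the text of each attribute, which is the form the filter pipelines then operate on.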
147      Since the original document cannot be reconstructed from its      Since the original document cannot be reconstructed from its attributes,
148      attributes, we need a second method (*tag*) which marks the      we need a second method (*tag*) which marks the regions of the document
149      regions of the document with tags for the different attributes.      with tags for the different attributes. This tagged form is used by the
150      This tagged form is used by the display module to highlight search      display module to highlight search terms in the documents. Besides the
151      terms in the documents. Besides the tags for the attributes, the      tags for the attributes, the method might assign the special tags "_b"
152      method might assign the special tags `_b' and `_i' for      and "_i" for indicating bold and italic regions.
     indicating bold and italic regions.  
153    
154        sub tag {        sub tag {
155          my @result;          my @result;
156          my $tag;          my $tag;
157            
158          for (split /\n/, $_[1]) {          for (split /\n/, $_[1]) {
159            next if /^\w\w:\s*$/;            next if /^\w\w:\s*$/;
160            if (s/^(\S+)://) {            if (s/^(\S+)://) {
# Line 141  DESCRIPTION Line 168  DESCRIPTION
168            }            }
169          }          }
170          return @result;               # we don't go for speed          return @result;               # we don't go for speed
171        }        }
172    
173      Obviously one could implement `split' via `tag'. The reason for      Obviously one could implement "split" via "tag". The reason for having
174      having two functions is speed. We need to call `split' for each      two functions is speed. We need to call "split" for each document when
175      document when indexing a collection. Therefore speed is      indexing a collection. Therefore speed is essential. On the other hand,
176      essential. On the other hand, `tag' is called in order to      "tag" is called in order to display a single document and may be a
177      display a single document and may be a little slower. It may      little slower. It may care about tagging bold and italic regions. See
178      care about tagging bold and italic regions. See      "WAIT::Parse::Nroff" for how this might decrease performance.
     `WAIT::Parse::Nroff' for how this might decrease performance.  
179    
180    Filter definition    Filter definition
181        From the Information Retrieval perspective, the hardest part of the
182        system is the filter module. The database administrator defines for each
183        attribute, how the contents should be processed before it is stored in
184        the index. Usually the processing contains steps to restrict the
185        character set, case transformation, splitting to words and transforming
186        to word stems. In WAIT these steps are defined naturally as a pipeline
187        of processing steps. The pipelines are made up by functions in the
188        package WAIT::Filter which is pre-populated by the most common functions
189        but may be extended any time.
190    
191      From the Information Retrieval perspective, the hardest part of      The equivalent for a typical freeWAIS-sf processing would be this
192      the system is the filter module. The database administrator      pipeline:
     defines for each attribute, how the contents should be processed  
     before it is stored in the index. Usually the processing  
     contains steps to restrict the character set, case  
     transformation, splitting to words and transforming to word  
     stems. In WAIT these steps are defined naturally as a pipeline  
     of processing steps. The pipelines are made up by functions in  
     the package WAIT::Filter which is pre-populated by the most  
     common functions but may be extended any time.  
   
     The equivalent for a typical freeWAIS-sf processing would be  
     this pipeline:  
193    
194              [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']              [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']
195    
196      The function `isotr' replaces unknown characters by blanks.      The function "isotr" replaces unknown characters by blanks. "isolc"
197      `isolc' transforms to lower case. `split2' splits into words and      transforms to lower case. "split2" splits into words and removes words
198      removes words shorter than two characters. `stop' removes the      shorter than two characters. "stop" removes the freeWAIS-sf stopwords
199      freeWAIS-sf stopwords and `Stem' applies the Porter algorithm      and "Stem" applies the Porter algorithm for computing the stem of the
200      for computing the stem of the words.      words.
201    
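The idea of a pipeline given as a list of function names can be sketched like this. The step functions below are simplified stand-ins written for illustration only; they are not the WAIT::Filter implementations, the stop list is made up, and the helper run_pipeline is a hypothetical name. The "Stem" step is left out because a Porter stemmer would not fit in a short sketch.

```perl
use strict;
use warnings;

# Made-up stop list, an assumption for this sketch.
my %STOP = map { $_ => 1 } qw(the and of);

# Simplified stand-ins for the named steps (NOT the WAIT::Filter code).
my %filter = (
    # Replace every character outside a-z, A-Z, 0-9 by a blank.
    isotr  => sub { map { my $s = $_; $s =~ tr/a-zA-Z0-9/ /c; $s } @_ },
    # Transform to lower case.
    isolc  => sub { map { lc $_ } @_ },
    # Split into words and drop words shorter than two characters.
    split2 => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # Drop stopwords.
    stop   => sub { grep { !$STOP{$_} } @_ },
);

# Run a list of values through each named step in order.
sub run_pipeline {
    my ($pipeline, @values) = @_;
    for my $step (@$pipeline) {
        my $code = $filter{$step} or die "unknown filter '$step'";
        @values = $code->(@values);
    }
    return @values;
}

my @terms = run_pipeline([ 'isotr', 'isolc', 'split2', 'stop' ],
                         'The Design of WAIT, a Perl engine!');
```

Each step takes a list and returns a list, so steps that change the number of values (splitting, stopword removal) compose with steps that transform values one by one.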
202      The filter definition for a collection defines a set of piplines      The filter definition for a collection defines a set of pipelines for
203      for the attributes and modifies the pipelines which should be      the attributes and modifies the pipelines which should be used for
204      used for prefix and interval searches.      prefix and interval searches.
205    
206      Here is a complete example:      Several complete working examples come with WAIT in the script
207        directory. It is recommended to follow the pattern of the scripts
208        my $stem  = [{      smakewhatis and sman.
                     'prefix'    => ['unroff', 'isotr', 'isolc'],  
                     'intervall' => ['unroff', 'isotr', 'isolc'],  
                    },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];  
       my $text  = [{  
                     'prefix'    => ['unroff', 'isotr', 'isolc'],  
                     'intervall' => ['unroff', 'isotr', 'isolc'],  
                    },  
                     'unroff', 'isotr', 'isolc', 'split2', 'stop'];  
       my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];  
         
       my $spec  = [  
           'name'         => $stem,  
           'synopsis'     => $stem,  
           'bugs'         => $stem,  
           'description'  => $stem,  
           'text'         => $stem,  
           'environment'  => $text,  
           'example'      => $text,  'example' => $stem,  
           'author'       => $sound, 'author'  => $stem,  
          ]  
209    
