Diff of /trunk/README

Left:  branches/CPAN/README revision 11 by unknown, Fri Apr 28 15:41:10 2000 UTC
Right: trunk/README revision 88 by dpavlin, Mon May 24 13:44:01 2004 UTC
# Line 1  Line 1 
1                               WAIT 1.6                               WAIT 1.8
2    
3                    Copyright (c) 1996, Ulrich Pfeifer                    Copyright (c) 1996-2000, Ulrich Pfeifer
4    
5  ------------------------------------------------------------------------  ------------------------------------------------------------------------
6      This program is free software; you can redistribute it and/or      This program is free software; you can redistribute it and/or
# Line 11  Line 11 
11      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
12  ------------------------------------------------------------------------  ------------------------------------------------------------------------
13    
14  This software is not actively maintained by it's author.  News:
15    
16  For more two years now I tried to steal some time to clean this up  Locking
17  without any luck. So I decided to pass the baton on. I consider the  =======
18  input part pretty satisfying. The query part - despite being operable  
19  and useful - needs a major overhaul. To provide a forum for further  WAIT now supports some basic locking.
20  discussions an to coordinate further developement, I did setup a  
21  mailinglist.  Drop me a line if you want to participate.  Speed
22    =====
23    
24    Searching large collections is now considerably faster:
25    
26            $table->search({attr  => 'text',
27                            cont  => $query,
28                            top   => 1,
29                            picky => 0});
30    
31    Table indices may now be tuned to improve search performance.  The
32    index tuning can be switched on and off using $table->set(top=>1/0) to
33    allow for bulk inserts.
34    
35    Documentation
36    =============
37    
38    WAIT is still not documented really.  But Andreas König took the
39    trouble to comment the example scripts.  This will help you
40    implementing your own applications.  I added some tiny scripts to
41    index e.g. your .yow file or the fourtune databases.
42    
43    SourceForge
44    ===========
45    
46    WAIT is registered on SourceForge now:
47    
48            http://wait.sourceforge.net/
49            https://sourceforge.net/project/?group_id=4814
50    
51    I will keep the CVS repository up to date.  If you have some spare
52    tuits, feel free to contribute.
53    
54  Ulrich Pfeifer <upf@wait.de>  Ulrich Pfeifer <upf@wait.de>
55    
56  ------------------------------------------------------------------------  ------------------------------------------------------------------------
57  NAME  NAME
58      WAIT - a rewrite of the freeWAIS-sf engine in Perl      WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS
59    
60    SYNOPSIS
61        A Synopsis is not yet available.
62    
63  Status of this document  Status of this document
64      I started writing down some information about the implementation      I started writing down some information about the implementation before
65      before I forget them in my spare time. The stuff is incomplete      I forget them in my spare time. The stuff is incomplete at least. Any
66      at least. Any additions, corrections, ... welcome.      additions, corrections, ... welcome.
67    
68  PURPOSE  PURPOSE
69      As you might know, I developed and maintained freeWAIS-sf (with      As you might know, I developed and maintained freeWAIS-sf (with the help
70      the help of many people in The Net). FreeWAIS-sf is based on      of many people in The Net). FreeWAIS-sf is based on freeWAIS maintained
71      freeWAIS maintained by the Clearing House for Network      by the Clearing House for Network Information Retrieval (CNIDR) which in
72      Information Retrieval (CNIDR) which in turn is based on wais-8-      turn is based on wais-8-b5 implemented by Thinking Machine et al. During
73      b5 implemented by Thinking Machine et al. During this long      this long history - implementation started about 1989 - many people
74      history - implementation started about 1989 - many people      contributed to the distribution and added features not foreseen by the
75      contributed to the distribution and added features not foreseen      original design. While the system fulfills its task now, the code has
76      by the original design. While the system fulfills its task now,      reached a state where adding new features is nearly impossible and even
77      the code has reached a state where adding new features is nearly      fixing longstanding bugs and removing limitations has become a very time
78      impossible and even fixing longstanding bugs and removing      consuming task.
79      limitations has become a very time consuming task.  
80        Therefore I decided to pass the maintenance to WSC Inc. and built a new
81      Therefore I decided to pass the maintenance to WSC Inc. and      system from scratch. For obvious reasons I choosed Perl as
82      built a new system from scratch. For obvious reasons I choosed      implementation language.
     Perl as implementation language.  
83    
84  DESCRIPTION  DESCRIPTION
85      The central idea of the system is to provide a framework and the      The central idea of the system is to provide a framework and the
86      building blocks for any indexing and search system the users      building blocks for any indexing and search system the users might want
87      might want to build. Obviously the framework limits the class of      to build. Obviously the framework limits the class of system which can
88      system which can be build.      be build.
89    
90             +------+     +-----+     +------+             +------+     +-----+     +------+
91         ==> |Access| ==> |Parse| ==> |      |         ==> |Access| ==> |Parse| ==> |      |
# Line 64  DESCRIPTION Line 97  DESCRIPTION
97         <= |Display| <== |Query| <-> |      |         <= |Display| <== |Query| <-> |      |
98            +-------+     +-----+     +------+            +-------+     +-----+     +------+
99    
100      A collection (aka table) is defined by the instances of the      A collection (aka table) is defined by the instances of the access and
101      access and parse module together with the filter definitions. At      parse module together with the filter definitions. At query time in
102      query time in addition a query and a display module must be      addition a query and a display module must be choosen.
     choosen.  
103    
104    Access    Access
105    
106      The access module defines which documents where members of a      The access module defines which documents are members of a database.
107      database. Usually an access module is a tied hash, whose keys      Usually an access module is a tied hash, whose keys are the Ids of the
108      are the Ids of the documents (did = document id) and whose      documents (did = document id) and whose values are the documents
109      values are the documents themselves. The indexing process loops      themselves. The indexing process loops over the keys using `FIRSTKEY'
110      over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are      and `NEXTKEY'. Documents are retrieved with `FETCH'.
111      retrieved with `FETCH'.  
112        By convention access modules should be members of the `WAIT::Document'
113      By convention access modules should be members of the      hierarchy. Have a look at the `WAIT::Document::Split' module to get the
114      `WAIT::Document' hierarchy. Have a look at the      idea.
     `WAIT::Document::Split' module to get the idea.  
115    
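The tied-hash interface described in the Access section can be sketched as follows. This is a minimal illustration, not code from the WAIT distribution: the package name `Sketch::Document::Memory` and the sample documents are made up, and the documents are simply held in memory.

```perl
use strict;
use warnings;

# Hypothetical access module; the package name is illustrative and is
# not part of the WAIT distribution.
package Sketch::Document::Memory;

sub TIEHASH {
  my ($class, %docs) = @_;
  return bless { docs => {%docs}, keys => [] }, $class;
}

sub FETCH {                       # retrieve one document by its did
  my ($self, $did) = @_;
  return $self->{docs}{$did};
}

sub FIRSTKEY {                    # start a new pass over all dids
  my ($self) = @_;
  $self->{keys} = [sort keys %{ $self->{docs} }];
  return shift @{ $self->{keys} };
}

sub NEXTKEY {                     # next did, undef when exhausted
  my ($self) = @_;
  return shift @{ $self->{keys} };
}

package main;

# Made-up sample documents keyed by document id.
tie my %access, 'Sketch::Document::Memory',
  doc1 => "TI: First title\n",
  doc2 => "TI: Second title\n";

# An indexer could loop like this: each() drives FIRSTKEY/NEXTKEY and
# fetches the document text through FETCH.
my @seen;
while (my ($did, $doc) = each %access) {
  push @seen, $did;
}
print join(',', @seen), "\n";
```

A real access module would typically read the documents from files or a database inside `FETCH` rather than keep them in a hash.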
116    Parse    Parse
117    
118      The task parse module is to split the documents into logical      The task of the parse module is to split the documents into logical
119      parts via the `split' method. E.g. the `WAIT::Parse::Nroff'      parts via the `split' method. E.g. the `WAIT::Parse::Nroff' splits
120      splits manuals piped through nroff(1) into the sections *name*,      manuals piped through nroff(1) into the sections *name*, *synopsis*,
121      *synopsis*, *options*, *description*, *author*, *example*,      *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
122      *bugs*, *text*, *see*, and *environment*. Here is the      and *environment*. Here is the implementation of `WAIT::Parse::Base'
123      implementation of `WAIT::Parse::Base' which handes documents      which handles documents with a pretty simple tagged format:
     with a pretty simple tagged format:  
124    
125        AU: Pfeifer, U.; Fuhr, N.; Huynh, T.        AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
126        TI: Searching Structured Documents with the Enhanced Retrieval        TI: Searching Structured Documents with the Enhanced Retrieval
# Line 106  DESCRIPTION Line 136  DESCRIPTION
136        sub split {                     # called as method        sub split {                     # called as method
137          my %result;          my %result;
138          my $fld;          my $fld;
139          
140          for (split /\n/, $_[1]) {          for (split /\n/, $_[1]) {
141            if (s/^(\S+):\s*//) {            if (s/^(\S+):\s*//) {
142              $fld = lc $1;              $fld = lc $1;
# Line 114  DESCRIPTION Line 144  DESCRIPTION
144            $result{$fld} .= $_ if defined $fld;            $result{$fld} .= $_ if defined $fld;
145          }          }
146          return \%result;          return \%result;
147        }        }
148    
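To show how such a `split' method might be called, here is a self-contained sketch in the same spirit. It is not the `WAIT::Parse::Base' source, which is only partly visible in this diff: the package name `Sketch::Parse::Tagged', the constructor, the whitespace handling and the sample record are all assumptions made for illustration.

```perl
use strict;
use warnings;

package Sketch::Parse::Tagged;    # hypothetical name, not WAIT::Parse::Base

sub new { return bless {}, shift }

sub split {                       # called as method: $parser->split($text)
  my ($self, $text) = @_;
  my (%result, $fld);
  for my $line (CORE::split /\n/, $text) {
    if ($line =~ s/^(\S+):\s*//) {
      $fld = lc $1;               # an "XX:" prefix starts a new attribute
    }
    next unless defined $fld;     # ignore text before the first tag
    $line =~ s/^\s+//;            # drop the indent of continuation lines
    $result{$fld} = defined $result{$fld} ? "$result{$fld} $line" : $line;
  }
  return \%result;
}

package main;

# Made-up record in the tagged format; the second TI line is a continuation.
my $doc  = "AU: Doe, J.\nTI: A title that\n    continues here\n";
my $attr = Sketch::Parse::Tagged->new->split($doc);

print "$_ => $attr->{$_}\n" for sort keys %$attr;
```

The method returns a hash reference that maps each lower-cased attribute name to its text, which is the shape the `split' method above returns via `\%result'.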
149      Since the original document cannot be reconstructed from its      Since the original document cannot be reconstructed from its attributes,
150      attributes, we need a second method (*tag*) which marks the      we need a second method (*tag*) which marks the regions of the document
151      regions of the document with tags for the different attributes.      with tags for the different attributes. This tagged form is used by the
152      This tagged form is used by the display module to hilight search      display module to hilight search terms in the documents. Besides the
153      terms in the documents. Besides the tags for the attributes, the      tags for the attributes, the method might assign the special tags `_b'
154      method might assign the special tags `_b' and `_i' for      and `_i' for indicating bold and italic regions.
     indicating bold and italic regions.  
155    
156        sub tag {        sub tag {
157          my @result;          my @result;
158          my $tag;          my $tag;
159            
160          for (split /\n/, $_[1]) {          for (split /\n/, $_[1]) {
161            next if /^\w\w:\s*$/;            next if /^\w\w:\s*$/;
162            if (s/^(\S+)://) {            if (s/^(\S+)://) {
# Line 141  DESCRIPTION Line 170  DESCRIPTION
170            }            }
171          }          }
172          return @result;               # we don't go for speed          return @result;               # we don't go for speed
173        }        }
174    
175      Obviously one could implement `split' via `tag'. The reason for      Obviously one could implement `split' via `tag'. The reason for having
176      having two functions is speed. We need to call `split' for each      two functions is speed. We need to call `split' for each document when
177      document when indexing a collection. Therefore speed is      indexing a collection. Therefore speed is essential. On the other hand,
178      essential. On the other hand, `tag' is called in order to      `tag' is called in order to display a single document and may be a
179      display a single document and may be a little slower. It may      little slower. It may care about tagging bold and italic regions. See
     care about tagging bold and italic regions. See  
180      `WAIT::Parse::Nroff' how this might decrease performance.      `WAIT::Parse::Nroff' how this might decrease performance.
181    
182    Filter definition    Filter definition
183    
184      From the Information Retrieval perspective, the hardest part of      From the Information Retrieval perspective, the hardest part of the
185      the system is the filter module. The database administrator      system is the filter module. The database administrator defines for each
186      defines for each attribute, how the contents should be processed      attribute, how the contents should be processed before it is stored in
187      before it is stored in the index. Usually the processing      the index. Usually the processing contains steps to restrict the
188      contains steps to restrict the character set, case      character set, case transformation, splitting to words and transforming
189      transformation, splitting to words and transforming to word      to word stems. In WAIT these steps are defined naturally as a pipeline
190      stems. In WAIT these steps are defined naturally as a pipeline      of processing steps. The pipelines are made up by functions in the
191      of processing steps. The pipelines are made up by functions in      package WAIT::Filter which is pre-populated by the most common functions
192      the package WAIT::Filter which is pre-populated by the most      but may be extended any time.
     common functions but may be extended any time.  
193    
194      The equivalent for a typical freeWAIS-sf processing would be      The equivalent for a typical freeWAIS-sf processing would be this
195      this pipeline:      pipeline:
196    
197              [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']              [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']
198    
199      The function `isotr' replaces unknown characters by blanks.      The function `isotr' replaces unknown characters by blanks. `isolc'
200      `isolc' transforms to lower case. `split2' splits into words and      transforms to lower case. `split2' splits into words and removes words
201      removes words shorter than two characters. `stop' removes the      shorter than two characters. `stop' removes the freeWAIS-sf stopwords
202      freeWAIS-sf stopwords and `Stem' applies the Porter algorithm      and `Stem' applies the Porter algorithm for computing the stem of the
203      for computing the stem of the words.      words.
204    
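The idea of a pipeline of named processing steps can be sketched like this. The functions below are simplified stand-ins written for this example in a hypothetical package `Sketch::Filter'; they borrow the step names from the text, but the real WAIT::Filter functions may behave differently, the stopword list is made up, and `Stem' is left out.

```perl
use strict;
use warnings;

package Sketch::Filter;           # hypothetical stand-in for WAIT::Filter

# Each step maps a list of strings to a list of strings.

sub isotr {                       # replace non-alphanumerics by blanks
  my @out = @_;
  tr/a-zA-Z0-9/ /c for @out;
  return @out;
}

sub isolc { return map { lc $_ } @_ }

sub split2 {                      # split into words, drop words shorter than 2
  my @words = map { split ' ', $_ } @_;
  return grep { length($_) >= 2 } @words;
}

my %stop = map { $_ => 1 } qw(the and of);   # tiny made-up stopword list

sub stop { return grep { !$stop{$_} } @_ }

# Run the named steps left to right, feeding each output to the next step.
sub apply {
  my ($pipeline, @data) = @_;
  for my $name (@$pipeline) {
    my $step = Sketch::Filter->can($name)
      or die "unknown filter '$name'";
    @data = $step->(@data);
  }
  return @data;
}

package main;

my @terms = Sketch::Filter::apply(
  ['isotr', 'isolc', 'split2', 'stop'],
  'The Art of Indexing, a Primer');
print "@terms\n";
```

Looking the steps up by name keeps the pipeline definition a plain list of strings, as in the array shown above, so new steps only need to be added as functions in the package.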
205      The filter definition for a collection defines a set of piplines      The filter definition for a collection defines a set of pipelines for
206      for the attributes and modifies the pipelines which should be      the attributes and modifies the pipelines which should be used for
207      used for prefix and interval searches.      prefix and interval searches.
208    
209      Here is a complete example:      Several complete working examples come with WAIT in the script
210        directory. It is recommended to follow the pattern of the scripts
211        my $stem  = [{      smakewhatis and sman.
                     'prefix'    => ['unroff', 'isotr', 'isolc'],  
                     'intervall' => ['unroff', 'isotr', 'isolc'],  
                    },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];  
       my $text  = [{  
                     'prefix'    => ['unroff', 'isotr', 'isolc'],  
                     'intervall' => ['unroff', 'isotr', 'isolc'],  
                    },  
                     'unroff', 'isotr', 'isolc', 'split2', 'stop'];  
       my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];  
         
       my $spec  = [  
           'name'         => $stem,  
           'synopsis'     => $stem,  
           'bugs'         => $stem,  
           'description'  => $stem,  
           'text'         => $stem,  
           'environment'  => $text,  
           'example'      => $text,  'example' => $stem,  
           'author'       => $sound, 'author'  => $stem,  
          ]  
212    

