Diff of /trunk/README


-branches/CPAN/README
revision 11 by unknown,
Fri Apr 28 15:41:10 2000 UTC
+trunk/README
revision 88 by dpavlin,
Mon May 24 13:44:01 2004 UTC
 Line 1
-                              WAIT 1.6
+                              WAIT 1.8
-                   Copyright (c) 1996, Ulrich Pfeifer
+                   Copyright (c) 1996-2000, Ulrich Pfeifer
  ------------------------------------------------------------------------
      This program is free software; you can redistribute it and/or
 Line 11
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  ------------------------------------------------------------------------
- This software is not actively maintained by it's author.
+ News:
- For more two years now I tried to steal some time to clean this up
+ Locking
- without any luck. So I decided to pass the baton on. I consider the
+ =======
- input part pretty satisfying. The query part - despite being operable
- and useful - needs a major overhaul. To provide a forum for further
+ WAIT now supports some basic locking.
- discussions an to coordinate further developement, I did setup a
- mailinglist.  Drop me a line if you want to participate.
+ Speed
+ =====
+ Searching large collections is now considerably faster:
+         $table->search({attr  => 'text',
+                         cont  => $query,
+                         top   => 1,
+                         picky => 0});
+ Table indices may now be tuned to improve search performance.  The
+ index tuning can be switched on and off using $table->set(top=>1/0) to
+ allow for bulk inserts.
+ Documentation
+ =============
+ WAIT is still not documented really.  But Andreas König took the
+ trouble to comment the example scripts.  This will help you
+ implementing your own applications.  I added some tiny scripts to
+ index e.g. your .yow file or the fortune databases.
+ SourceForge
+ ===========
+ WAIT is registered on SourceForge now:
+         http://wait.sourceforge.net/
+         https://sourceforge.net/project/?group_id=4814
+ I will keep the CVS repository up to date.  If you have some spare
+ tuits, feel free to contribute.
  Ulrich Pfeifer <upf@wait.de>
  ------------------------------------------------------------------------
  NAME
-     WAIT - a rewrite of the freeWAIS-sf engine in Perl
+     WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS
+ SYNOPSIS
+     A Synopsis is not yet available.
  Status of this document
-     I started writing down some information about the implementation
+     I started writing down some information about the implementation before
-     before I forget them in my spare time. The stuff is incomplete
+     I forget them in my spare time. The stuff is incomplete at least. Any
-     at least. Any additions, corrections, ... welcome.
+     additions, corrections, ... welcome.
  PURPOSE
-     As you might know, I developed and maintained freeWAIS-sf (with
+     As you might know, I developed and maintained freeWAIS-sf (with the help
-     the help of many people in The Net). FreeWAIS-sf is based on
+     of many people in The Net). FreeWAIS-sf is based on freeWAIS maintained
-     freeWAIS maintained by the Clearing House for Network
+     by the Clearing House for Network Information Retrieval (CNIDR) which in
-     Information Retrieval (CNIDR) which in turn is based on wais-8-
+     turn is based on wais-8-b5 implemented by Thinking Machine et al. During
-     b5 implemented by Thinking Machine et al. During this long
+     this long history - implementation started about 1989 - many people
-     history - implementation started about 1989 - many people
+     contributed to the distribution and added features not foreseen by the
-     contributed to the distribution and added features not foreseen
+     original design. While the system fulfills its task now, the code has
-     by the original design. While the system fulfills its task now,
+     reached a state where adding new features is nearly impossible and even
-     the code has reached a state where adding new features is nearly
+     fixing longstanding bugs and removing limitations has become a very time
-     impossible and even fixing longstanding bugs and removing
+     consuming task.
-     limitations has become a very time consuming task.
+     Therefore I decided to pass the maintenance to WSC Inc. and built a new
-     Therefore I decided to pass the maintenance to WSC Inc. and
+     system from scratch. For obvious reasons I chose Perl as
-     built a new system from scratch. For obvious reasons I chose
+     implementation language.
-     Perl as implementation language.
  DESCRIPTION
      The central idea of the system is to provide a framework and the
-     building blocks for any indexing and search system the users
+     building blocks for any indexing and search system the users might want
-     might want to build. Obviously the framework limits the class of
+     to build. Obviously the framework limits the class of systems which can
-     systems which can be built.
+     be built.
             +------+     +-----+     +------+
         ==> |Access| ==> |Parse| ==> |      |
-Line 64 
 DESCRIPTION
+Line 97 
 DESCRIPTION
         <= |Display| <== |Query| <-> |      |
            +-------+     +-----+     +------+
-     A collection (aka table) is defined by the instances of the
+     A collection (aka table) is defined by the instances of the access and
-     access and parse module together with the filter definitions. At
+     parse module together with the filter definitions. At query time in
-     query time in addition a query and a display module must be
+     addition a query and a display module must be chosen.
-     chosen.
    Access
-     The access module defines which documents where members of a
+     The access module defines which documents are members of a database.
-     database. Usually an access module is a tied hash, whose keys
+     Usually an access module is a tied hash, whose keys are the Ids of the
-     are the Ids of the documents (did = document id) and whose
+     documents (did = document id) and whose values are the documents
-     values are the documents themselves. The indexing process loops
+     themselves. The indexing process loops over the keys using `FIRSTKEY'
-     over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
+     and `NEXTKEY'. Documents are retrieved with `FETCH'.
-     retrieved with `FETCH'.
+     By convention access modules should be members of the `WAIT::Document'
-     By convention access modules should be members of the
+     hierarchy. Have a look at the `WAIT::Document::Split' module to get the
-     `WAIT::Document' hierarchy. Have a look at the
+     idea.
-     `WAIT::Document::Split' module to get the idea.
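
The tied-hash convention described for access modules can be illustrated with a minimal sketch. It only assumes Perl's standard tie interface; the package name `My::Document::Memory`, its constructor arguments, and the sample documents are hypothetical and not part of the WAIT distribution.

```perl
use strict;
use warnings;

# Hypothetical access module: documents kept in an in-memory hash.
# The package name and the constructor arguments are illustrative only.
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs} }, $class;
}

# Retrieve one document by its document id (did).
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Start a new iteration over all document ids.
sub FIRSTKEY {
    my ($self) = @_;
    my $count = keys %{ $self->{docs} };    # resets the hash iterator
    my $did   = each %{ $self->{docs} };
    return $did;
}

# Continue the iteration; returns undef when all ids have been seen.
sub NEXTKEY {
    my ($self) = @_;
    my $did = each %{ $self->{docs} };
    return $did;
}

package main;

tie my %collection, 'My::Document::Memory',
    doc1 => "AU: Pfeifer, U.\nTI: First document\n",
    doc2 => "AU: Fuhr, N.\nTI: Second document\n";

# An indexing loop in this style: keys() drives FIRSTKEY/NEXTKEY,
# the hash lookup drives FETCH.
for my $did (sort keys %collection) {
    my $document = $collection{$did};
    printf "%s: %d bytes\n", $did, length $document;
}
```

A real access module would typically read documents from files or a database inside `FETCH` instead of holding them in memory.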
    Parse
-     The task parse module is to split the documents into logical
+     The task of the parse module is to split the documents into logical
-     parts via the `split' method. E.g. the `WAIT::Parse::Nroff'
+     parts via the `split' method. E.g. the `WAIT::Parse::Nroff' splits
-     splits manuals piped through nroff(1) into the sections *name*,
+     manuals piped through nroff(1) into the sections *name*, *synopsis*,
-     *synopsis*, *options*, *description*, *author*, *example*,
+     *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
-     *bugs*, *text*, *see*, and *environment*. Here is the
+     and *environment*. Here is the implementation of `WAIT::Parse::Base'
-     implementation of `WAIT::Parse::Base' which handes documents
+     which handles documents with a pretty simple tagged format:
-     with a pretty simple tagged format:
        AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
        TI: Searching Structured Documents with the Enhanced Retrieval
-Line 106 
 DESCRIPTION
+Line 136 
 DESCRIPTION
        sub split {                     # called as method
          my %result;
          my $fld;
          for (split /\n/, $_[1]) {
            if (s/^(\S+):\s*//) {
              $fld = lc $1;
-Line 114 
 DESCRIPTION
+Line 144 
 DESCRIPTION
            $result{$fld} .= $_ if defined $fld;
          }
          return \%result;
        }
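
Since the diff view elides part of `split`, here is a self-contained sketch of the same idea for experimenting outside WAIT. It is not the `WAIT::Parse::Base` source: the function name `split_tagged`, the handling of continuation lines, and the sample record are assumptions.

```perl
use strict;
use warnings;

# Sketch of a splitter for "XX: value" tagged records.
# Returns a hash reference mapping lower-cased field names to their text.
sub split_tagged {
    my ($text) = @_;
    my %result;
    my $fld;
    for my $line (split /\n/, $text) {
        if ($line =~ s/^(\S+):\s*//) {
            $fld = lc $1;               # a new field starts on this line
        }
        else {
            $line =~ s/^\s+/ /;         # continuation: keep one separating blank
        }
        $result{$fld} .= $line if defined $fld;
    }
    return \%result;
}

# Hypothetical sample record with a wrapped title.
my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: A title that',
    '    wraps onto a second line';

my $fields = split_tagged($record);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

Lines that appear before the first tag are dropped, because no field is defined for them yet.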
-     Since the original document cannot be reconstructed from its
+     Since the original document cannot be reconstructed from its attributes,
-     attributes, we need a second method (*tag*) which marks the
+     we need a second method (*tag*) which marks the regions of the document
-     regions of the document with tags for the different attributes.
+     with tags for the different attributes. This tagged form is used by the
-     This tagged form is used by the display module to hilight search
+     display module to hilight search terms in the documents. Besides the
-     terms in the documents. Besides the tags for the attributes, the
+     tags for the attributes, the method might assign the special tags `_b'
-     method might assign the special tags `_b' and `_i' for
+     and `_i' for indicating bold and italic regions.
-     indicating bold and italic regions.
        sub tag {
          my @result;
          my $tag;
          for (split /\n/, $_[1]) {
            next if /^\w\w:\s*$/;
            if (s/^(\S+)://) {
-Line 141 
 DESCRIPTION
+Line 170 
 DESCRIPTION
            }
          }
          return @result;               # we don't go for speed
        }
-     Obviously one could implement `split' via `tag'. The reason for
+     Obviously one could implement `split' via `tag'. The reason for having
-     having two functions is speed. We need to call `split' for each
+     two functions is speed. We need to call `split' for each document when
-     document when indexing a collection. Therefore speed is
+     indexing a collection. Therefore speed is essential. On the other hand,
-     essential. On the other hand, `tag' is called in order to
+     `tag' is called in order to display a single document and may be a
-     display a single document and may be a little slower. It may
+     little slower. It may care about tagging bold and italic regions. See
-     care about tagging bold and italic regions. See
      `WAIT::Parse::Nroff' how this might decrease performance.
    Filter definition
-     From the Information Retrieval perspective, the hardest part of
+     From the Information Retrieval perspective, the hardest part of the
-     the system is the filter module. The database administrator
+     system is the filter module. The database administrator defines for each
-     defines for each attribute, how the contents should be processed
+     attribute, how the contents should be processed before it is stored in
-     before it is stored in the index. Usually the processing
+     the index. Usually the processing contains steps to restrict the
-     contains steps to restrict the character set, case
+     character set, case transformation, splitting to words and transforming
-     transformation, splitting to words and transforming to word
+     to word stems. In WAIT these steps are defined naturally as a pipeline
-     stems. In WAIT these steps are defined naturally as a pipeline
+     of processing steps. The pipelines are made up by functions in the
-     of processing steps. The pipelines are made up by functions in
+     package WAIT::Filter which is pre-populated by the most common functions
-     the package WAIT::Filter which is pre-populated by the most
+     but may be extended any time.
-     common functions but may be extended any time.
-     The equivalent for a typical freeWAIS-sf processing would be
+     The equivalent for a typical freeWAIS-sf processing would be this
-     this pipeline:
+     pipeline:
              [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']
-     The function `isotr' replaces unknown characters by blanks.
+     The function `isotr' replaces unknown characters by blanks. `isolc'
-     `isolc' transforms to lower case. `split2' splits into words and
+     transforms to lower case. `split2' splits into words and removes words
-     removes words shorter than two characters. `stop' removes the
+     shorter than two characters. `stop' removes the freeWAIS-sf stopwords
-     freeWAIS-sf stopwords and `Stem' applies the Porter algorithm
+     and `Stem' applies the Porter algorithm for computing the stem of the
-     for computing the stem of the words.
+     words.
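
The pipeline idea can be sketched in a few lines of plain Perl. The step functions below are simplified, ASCII-only stand-ins written for this illustration; they are not the `WAIT::Filter` implementations, and the names `ascii_tr`, `lower`, `run_pipeline` as well as the tiny stopword list are assumptions. Stemming is left out to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for filter steps, keyed by name.
my %filter = (
    # replace every character that is not a letter, digit or blank by a blank
    ascii_tr => sub { my @out = @_; s/[^A-Za-z0-9 ]/ /g for @out; return @out },
    # transform to lower case
    lower    => sub { return map { lc $_ } @_ },
    # split into words and drop words shorter than two characters
    split2   => sub { return grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # remove stopwords (illustrative list, not the freeWAIS-sf one)
    stop     => sub {
        my %stop = map { $_ => 1 } qw(the of and is);
        return grep { !$stop{$_} } @_;
    },
);

# Apply the named steps in order; each step maps a list to a list.
sub run_pipeline {
    my ($pipeline, @data) = @_;
    for my $step (@$pipeline) {
        my $code = $filter{$step} or die "unknown filter '$step'";
        @data = $code->(@data);
    }
    return @data;
}

my @terms = run_pipeline([qw(ascii_tr lower split2 stop)],
                         'The Design of a Full-Text Index');
print join(' ', @terms), "\n";    # prints: design full text index
```

Because every step takes a list and returns a list, steps can be reordered or dropped per attribute simply by editing the array of names.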
-     The filter definition for a collection defines a set of piplines
+     The filter definition for a collection defines a set of pipelines for
-     for the attributes and modifies the pipelines which should be
+     the attributes and modifies the pipelines which should be used for
-     used for prefix and interval searches.
+     prefix and interval searches.
-     Here is a complete example:
+     Several complete working examples come with WAIT in the script
+     directory. It is recommended to follow the pattern of the scripts
-       my $stem  = [{
+     smakewhatis and sman.
-                     'prefix'    => ['unroff', 'isotr', 'isolc'],
-                     'intervall' => ['unroff', 'isotr', 'isolc'],
-                    },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
-       my $text  = [{
-                     'prefix'    => ['unroff', 'isotr', 'isolc'],
-                     'intervall' => ['unroff', 'isotr', 'isolc'],
-                    },
-                     'unroff', 'isotr', 'isolc', 'split2', 'stop'];
-       my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];
-       my $spec  = [
-           'name'         => $stem,
-           'synopsis'     => $stem,
-           'bugs'         => $stem,
-           'description'  => $stem,
-           'text'         => $stem,
-           'environment'  => $text,
-           'example'      => $text,  'example' => $stem,
-           'author'       => $sound, 'author'  => $stem,
-          ]

 Legend:
-Removed from v.11
+Added in v.88
