
                             WAIT 1.6

                  Copyright (c) 1996, Ulrich Pfeifer

------------------------------------------------------------------------
    This program is free software; you can redistribute it and/or
    modify it under the same terms as Perl itself.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
------------------------------------------------------------------------

This software is not actively maintained by its author.

For more than two years now I have tried to steal some time to clean
this up, without any luck. So I decided to pass the baton on. I
consider the input part pretty satisfying. The query part, despite
being operable and useful, needs a major overhaul. To provide a forum
for further discussion and to coordinate further development, I have
set up a mailing list. Drop me a line if you want to participate.

Ulrich Pfeifer <upf@wait.de> 

------------------------------------------------------------------------
NAME
    WAIT - a rewrite of the freeWAIS-sf engine in Perl

Status of this document
    In my spare time I started writing down some information about
    the implementation before I forget it. The material is
    incomplete, to say the least. Any additions or corrections are
    welcome.

PURPOSE
    As you might know, I developed and maintained freeWAIS-sf (with
    the help of many people on the Net). FreeWAIS-sf is based on
    freeWAIS, maintained by the Clearinghouse for Networked
    Information Discovery and Retrieval (CNIDR), which in turn is
    based on wais-8-b5, implemented by Thinking Machines et al.
    During this long history (implementation started about 1989)
    many people contributed to the distribution and added features
    not foreseen by the original design. While the system fulfills
    its task now, the code has reached a state where adding new
    features is nearly impossible, and even fixing longstanding bugs
    and removing limitations has become a very time-consuming task.

    Therefore I decided to pass the maintenance on to WSC Inc. and
    build a new system from scratch. For obvious reasons I chose
    Perl as the implementation language.

DESCRIPTION
    The central idea of the system is to provide a framework and the
    building blocks for any indexing and search system the users
    might want to build. Obviously the framework limits the class of
    systems which can be built.

           +------+     +-----+     +------+
       ==> |Access| ==> |Parse| ==> |      |
           +------+     +-----+     |      |
                           ||       |      |     +-----+
                           ||       |Filter| ==> |Index|
                           \/       |      |     +-----+
          +-------+     +-----+     |      |
       <= |Display| <== |Query| <-> |      |
          +-------+     +-----+     +------+

    A collection (aka table) is defined by the instances of the
    access and parse modules together with the filter definitions.
    At query time a query module and a display module must be chosen
    in addition.

  Access

    The access module defines which documents are members of a
    database. Usually an access module is a tied hash, whose keys
    are the Ids of the documents (did = document id) and whose
    values are the documents themselves. The indexing process loops
    over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
    retrieved with `FETCH'.

    By convention access modules should be members of the
    `WAIT::Document' hierarchy. Have a look at the
    `WAIT::Document::Split' module to get the idea.
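
    As a minimal sketch of the tied-hash idea, the following
    hypothetical class keeps its documents in memory. The name
    `My::Document::InMemory' and its constructor arguments are
    assumptions made for illustration; they are not part of WAIT.
    It implements only the methods mentioned above, plus the
    `TIEHASH' constructor that Perl's tie mechanism requires.

```perl
use strict;
use warnings;

# Hypothetical access module, NOT part of WAIT: documents live in memory.
package My::Document::InMemory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep a sorted key list so that iteration order is stable.
    return bless { docs => {%docs}, keys => [sort keys %docs], pos => 0 },
                 $class;
}

# Return the document stored under the given document id (did).
sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

# Restart the iteration and return the first did.
sub FIRSTKEY {
    my ($self) = @_;
    $self->{pos} = 0;
    return $self->{keys}[ $self->{pos}++ ];
}

# Return the next did, or undef once all keys have been visited.
sub NEXTKEY {
    my ($self) = @_;
    return $self->{keys}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'My::Document::InMemory',
    doc1 => "AU: Pfeifer, U.\nPY: 1995\n",
    doc2 => "AU: Fuhr, N.\nPY: 1996\n";

# An indexing loop would walk the collection like this.
my @seen;
while (my ($did, $doc) = each %collection) {
    push @seen, $did;
}
print "visited: @seen\n";
```

    A real access module would typically fetch documents lazily
    (from files or a database) instead of holding them all in
    memory.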

  Parse

    The task of the parse module is to split the documents into
    logical parts via the `split' method. E.g. `WAIT::Parse::Nroff'
    splits manuals piped through nroff(1) into the sections *name*,
    *synopsis*, *options*, *description*, *author*, *example*,
    *bugs*, *text*, *see*, and *environment*. Below are a sample
    document in a pretty simple tagged format and the `split'
    implementation of `WAIT::Parse::Base', which handles such
    documents:

      AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
      TI: Searching Structured Documents with the Enhanced Retrieval
          Functionality of freeWAIS-sf and SFgate
      ER: D. Kroemker
      BT: Computer Networks and ISDN Systems; Proceedings of the third
          International World-Wide Web Conference
      PN: Elsevier
      PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
      PP: 1027-1036
      PY: 1995

      sub split {                     # called as method
        my %result;
        my $fld;
      
        for (split /\n/, $_[1]) {
          if (s/^(\S+):\s*//) {
            $fld = lc $1;
          }
          $result{$fld} .= $_ if defined $fld;
        }
        return \%result;
      } 
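
    To make the return value concrete, here is a small sketch that
    wraps the same logic in a hypothetical package
    (`My::Parse::Tagged' and its `new' constructor are assumptions,
    not WAIT code) and runs it on a shortened record. The builtin
    is written as `CORE::split' only to keep it clearly apart from
    the method of the same name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parse module; not the WAIT class itself.
package My::Parse::Tagged;

sub new { return bless {}, shift }

sub split {                     # called as method
    my %result;
    my $fld;

    for (CORE::split /\n/, $_[1]) {
        # A line starting with "XX:" opens a new field.
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        # Continuation lines are appended to the current field.
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $doc = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

my $attr = My::Parse::Tagged->new->split($doc);
for my $fld (sort keys %$attr) {
    print "$fld => $attr->{$fld}\n";
}
```

    The result is a hash reference keyed by the lower-cased field
    names (`au', `py', `ti'). The continuation line of the title is
    appended to the `ti' value together with its leading blanks.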

    Since the original document cannot be reconstructed from its
    attributes, we need a second method (*tag*) which marks the
    regions of the document with tags for the different attributes.
    This tagged form is used by the display module to highlight search
    terms in the documents. Besides the tags for the attributes, the
    method might assign the special tags `_b' and `_i' for
    indicating bold and italic regions.

      sub tag {
        my @result;
        my $tag;
        
        for (split /\n/, $_[1]) {
          next if /^\w\w:\s*$/;
          if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
          }
          if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
          } else {
            push @result, {}, "$_\n";
          }
        }
        return @result;               # we don't go for speed
      } 
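
    The list returned by `tag' alternates between a hash reference
    of tags and the text that those tags apply to. How a display
    module consumes this list is not described here, so the
    following is only a sketch: `render' is a hypothetical function
    that marks `_b' regions with asterisks and passes all other
    text through unchanged. The input is written by hand in the
    shape that `tag' produces.

```perl
use strict;
use warnings;

# Hypothetical renderer, not WAIT's display module.
sub render {
    my @tagged = @_;
    my $out = '';
    # Consume the list two elements at a time: (tags, text).
    while (my ($tags, $text) = splice @tagged, 0, 2) {
        $out .= $tags->{_b} ? "*$text*" : $text;
    }
    return $out;
}

# What `tag' yields for the lines "AU: Pfeifer, U." and "PY: 1995".
my @tagged = (
    { _b => 1 }, 'AU:', { au => 1 }, " Pfeifer, U.\n",
    { _b => 1 }, 'PY:', { py => 1 }, " 1995\n",
);

my $rendered = render(@tagged);
print $rendered;
```

    A real display module could use the attribute tags (`au', `py')
    in the same way to decide in which regions search terms should
    be highlighted.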

    Obviously one could implement `split' via `tag'. The reason for
    having two functions is speed. We need to call `split' for each
    document when indexing a collection, so its speed is essential.
    `tag', on the other hand, is called only to display a single
    document and may be a little slower. It can therefore afford to
    tag bold and italic regions. See `WAIT::Parse::Nroff' for an
    example of how this might decrease performance.

  Filter definition

    From the Information Retrieval perspective, the hardest part of
    the system is the filter module. The database administrator
    defines, for each attribute, how the contents should be
    processed before they are stored in the index. Usually the
    processing includes steps to restrict the character set,
    transform case, split the text into words, and reduce the words
    to their stems. In WAIT these steps are defined naturally as a
    pipeline of processing steps. The pipelines are made up of
    functions in the package WAIT::Filter, which is pre-populated
    with the most common functions but may be extended at any time.

    The equivalent of typical freeWAIS-sf processing would be this
    pipeline:

            [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

    The function `isotr' replaces unknown characters with blanks.
    `isolc' transforms to lower case. `split2' splits into words and
    removes words shorter than two characters. `stop' removes the
    freeWAIS-sf stopwords, and `Stem' applies the Porter algorithm
    to compute the stems of the words.
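
    The pipeline idea itself can be sketched in a few lines. The
    functions below are simplified, hypothetical stand-ins with the
    same names as three of the steps above; they are not the
    WAIT::Filter implementations (the stopword list in particular
    is invented), and `run_pipeline' is an assumed helper. Each
    step takes a list of strings and returns a list of strings, so
    the steps can be chained in any order.

```perl
use strict;
use warnings;

# Simplified stand-ins; the real functions live in WAIT::Filter.
my %filter = (
    # Lower-case every value.
    isolc  => sub { return map { lc($_) } @_ },
    # Split on whitespace, drop words shorter than two characters.
    split2 => sub {
        return grep { length($_) >= 2 } map { split ' ', $_ } @_;
    },
    # Drop stopwords (tiny invented list).
    stop   => sub {
        my %stop = map { $_ => 1 } qw(the of and);
        return grep { !$stop{$_} } @_;
    },
);

# Apply the named steps in order, feeding each step's output
# into the next one.
sub run_pipeline {
    my ($pipeline, @values) = @_;
    for my $name (@$pipeline) {
        my $step = $filter{$name} or die "unknown filter '$name'";
        @values = $step->(@values);
    }
    return @values;
}

my @terms = run_pipeline(
    [ 'isolc', 'split2', 'stop' ],
    'The Enhanced Retrieval Functionality of freeWAIS-sf',
);
print join(' ', @terms), "\n";
```

    Because a pipeline is just a list of names, adding a processing
    step means adding one function and mentioning it in the list.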

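    The filter definition for a collection defines a set of
    pipelines for the attributes and the modified pipelines which
    should be used for prefix and interval searches. In the example
    below, the latter appear as the hash reference at the head of a
    pipeline (note that the key is spelled `intervall').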

    Here is a complete example:

      my $stem  = [{
                    'prefix'    => ['unroff', 'isotr', 'isolc'],
                    'intervall' => ['unroff', 'isotr', 'isolc'],
                   },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
      my $text  = [{
                    'prefix'    => ['unroff', 'isotr', 'isolc'],
                    'intervall' => ['unroff', 'isotr', 'isolc'],
                   },
                    'unroff', 'isotr', 'isolc', 'split2', 'stop'];
      my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];
      
      my $spec  = [
          'name'         => $stem,
          'synopsis'     => $stem,
          'bugs'         => $stem,
          'description'  => $stem,
          'text'         => $stem,
          'environment'  => $text,
          'example'      => $text,  'example' => $stem,
          'author'       => $sound, 'author'  => $stem,
         ];
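
    Note that `$spec' is an array reference rather than a hash
    reference, which allows an attribute such as `author' to appear
    more than once with different pipelines. How WAIT processes
    this structure internally is not shown here; the following is
    only a sketch, under the assumption that the specification is
    read as attribute/pipeline pairs and that a leading hash
    reference holds the prefix and interval variants. The
    `%by_attr' layout is invented for illustration.

```perl
use strict;
use warnings;

my $stem  = [{
              'prefix'    => ['isotr', 'isolc'],
              'intervall' => ['isotr', 'isolc'],
             }, 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
my $sound = ['isotr', 'isolc', 'split2', 'Soundex'];

my $spec  = [
    'name'   => $stem,
    'author' => $sound, 'author' => $stem,
];

# Group the pipelines by attribute, keeping duplicates.
my %by_attr;
my @pairs = @$spec;
while (my ($attr, $pipeline) = splice @pairs, 0, 2) {
    my @steps = @$pipeline;
    # An optional leading hash ref carries the search variants.
    my $opts = ref($steps[0]) eq 'HASH' ? shift @steps : {};
    push @{ $by_attr{$attr} }, { steps => \@steps, %$opts };
}

for my $attr (sort keys %by_attr) {
    for my $def (@{ $by_attr{$attr} }) {
        print "$attr: @{ $def->{steps} }\n";
    }
}
```

    With a plain hash the second `author' entry would silently
    replace the first, which is presumably why a list of pairs is
    used.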

205