                             WAIT 1.8

                  Copyright (c) 1996-2000, Ulrich Pfeifer

------------------------------------------------------------------------
    This program is free software; you can redistribute it and/or
    modify it under the same terms as Perl itself.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
------------------------------------------------------------------------

News:

Locking
=======

WAIT now supports some basic locking.

Speed
=====

Searching large collections is now considerably faster:

        $table->search({attr  => 'text', 
                        cont  => $query, 
                        top   => 1, 
                        picky => 0});

Table indices may now be tuned to improve search performance.  Index
tuning can be switched on and off using $table->set(top=>1) or
$table->set(top=>0) to allow for bulk inserts.

Documentation
=============

WAIT is still not really documented.  But Andreas König took the
trouble to comment the example scripts, which will help you implement
your own applications.  I added some tiny scripts to index, e.g., your
.yow file or the fortune databases.

SourceForge
===========

WAIT is registered on SourceForge now:

        http://wait.sourceforge.net/
        https://sourceforge.net/project/?group_id=4814

I will keep the CVS repository up to date.  If you have some spare
tuits, feel free to contribute.

Ulrich Pfeifer <upf@wait.de> 

------------------------------------------------------------------------
NAME
    WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

SYNOPSIS
    A Synopsis is not yet available.

Status of this document
    In my spare time I started writing down some information about the
    implementation before I forget it. The material is incomplete, to say
    the least. Any additions or corrections are welcome.

PURPOSE
    As you might know, I developed and maintained freeWAIS-sf (with the help
    of many people on the Net). FreeWAIS-sf is based on freeWAIS, maintained
    by the Clearinghouse for Networked Information Discovery and Retrieval
    (CNIDR), which in turn is based on wais-8-b5, implemented by Thinking
    Machines et al. During this long history (implementation started around
    1989) many people contributed to the distribution and added features
    not foreseen by the original design. While the system fulfills its task
    now, the code has reached a state where adding new features is nearly
    impossible, and even fixing longstanding bugs and removing limitations
    has become a very time-consuming task.

    Therefore I decided to pass the maintenance to WSC Inc. and build a new
    system from scratch. For obvious reasons I chose Perl as the
    implementation language.

DESCRIPTION
    The central idea of the system is to provide a framework and the
    building blocks for any indexing and search system the users might want
    to build. Obviously the framework limits the class of systems which can
    be built.

           +------+     +-----+     +------+
       ==> |Access| ==> |Parse| ==> |      |
           +------+     +-----+     |      |
                           ||       |      |     +-----+
                           ||       |Filter| ==> |Index|
                           \/       |      |     +-----+
          +-------+     +-----+     |      |
       <= |Display| <== |Query| <-> |      |
          +-------+     +-----+     +------+

    A collection (aka table) is defined by the instances of the access and
    parse modules together with the filter definitions. At query time, a
    query module and a display module must be chosen in addition.

  Access
    The access module defines which documents are members of a database.
    Usually an access module is a tied hash whose keys are the IDs of the
    documents (did = document id) and whose values are the documents
    themselves. The indexing process loops over the keys using "FIRSTKEY"
    and "NEXTKEY". Documents are retrieved with "FETCH".

    By convention access modules should be members of the "WAIT::Document"
    hierarchy. Have a look at the "WAIT::Document::Split" module to get the
    idea.
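
    As a rough illustration of the tied-hash convention described above,
    here is a minimal in-memory access module. The package name
    "My::Document::Memory" and the sample documents are made up for this
    sketch and are not part of the WAIT distribution; a real module would
    normally live in the "WAIT::Document" hierarchy and read from files.

```perl
use strict;
use warnings;

# Hypothetical in-memory access module; NOT part of the WAIT distribution.
package My::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Keep a private copy of the documents, keyed by document id (did).
    return bless { docs => {%docs} }, $class;
}

sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub FIRSTKEY {
    my ($self) = @_;
    my $reset = keys %{ $self->{docs} };    # reset the each() iterator
    my ($key) = each %{ $self->{docs} };
    return $key;
}

sub NEXTKEY {
    my ($self) = @_;
    my ($key) = each %{ $self->{docs} };    # undef once all keys are seen
    return $key;
}

package main;

tie my %collection, 'My::Document::Memory',
    1 => "TI: First document\n",
    2 => "TI: Second document\n";

# An indexer can walk the collection like this: keys() goes through
# FIRSTKEY/NEXTKEY, and each hash lookup goes through FETCH.
for my $did (sort keys %collection) {
    print "$did => $collection{$did}";
}
```

    Only the read side of the tie interface is implemented here, since
    the text above mentions nothing beyond iterating and fetching.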

  Parse
    The task of the parse module is to split the documents into logical
    parts via the "split" method. For example, "WAIT::Parse::Nroff" splits
    manuals piped through nroff(1) into the sections *name*, *synopsis*,
    *options*, *description*, *author*, *example*, *bugs*, *text*, *see*,
    and *environment*. Here is the implementation of "WAIT::Parse::Base"
    which handles documents with a pretty simple tagged format:

      AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
      TI: Searching Structured Documents with the Enhanced Retrieval
          Functionality of freeWAIS-sf and SFgate
      ER: D. Kroemker
      BT: Computer Networks and ISDN Systems; Proceedings of the third
          International World-Wide Web Conference
      PN: Elsevier
      PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
      PP: 1027-1036
      PY: 1995

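      sub split {                     # called as method
        my %result;                   # attribute name => accumulated text
        my $fld;                      # attribute currently being filled

        for (split /\n/, $_[1]) {     # $_[1] is the document text
          if (s/^(\S+):\s*//) {       # "XX: ..." starts a new attribute
            $fld = lc $1;
          }
          # continuation lines are appended to the current attribute
          $result{$fld} .= $_ if defined $fld;
        }
        return \%result;
      }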
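
    To see what "split" returns, the method can be tried on a shortened
    version of the record above. This is only a sketch: the package name
    "My::Parse::Tagged" is a placeholder chosen for the example, and the
    builtin is written as CORE::split so that it cannot be confused with
    the method of the same name.

```perl
use strict;
use warnings;

package My::Parse::Tagged;      # hypothetical class name for this sketch

sub split {                     # called as method
    my %result;
    my $fld;
    # CORE::split names the builtin explicitly, because this package
    # also defines a method called split.
    for (CORE::split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $doc = <<'END';
AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
TI: Searching Structured Documents with the Enhanced Retrieval
    Functionality of freeWAIS-sf and SFgate
PY: 1995
END

# The result is a hash reference with one entry per attribute.
my $fields = My::Parse::Tagged->split($doc);
for my $attr (sort keys %$fields) {
    print "$attr: $fields->{$attr}\n";
}
```

    Note that a continuation line is appended to the current attribute
    as it stands, so its leading whitespace is kept and no newline is
    inserted between the two parts of the title.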

    Since the original document cannot be reconstructed from its attributes,
    we need a second method (*tag*) which marks the regions of the document
    with tags for the different attributes. This tagged form is used by the
    display module to highlight search terms in the documents. Besides the
    tags for the attributes, the method might assign the special tags "_b"
    and "_i" for indicating bold and italic regions.

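      sub tag {
        my @result;                   # flat list of (tags, text) pairs
        my $tag;                      # attribute of the current region

        for (split /\n/, $_[1]) {
          next if /^\w\w:\s*$/;       # skip fields without content
          if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";   # field label in bold
            $tag = lc $1;
          }
          if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
          } else {
            push @result, {}, "$_\n";  # text before the first field
          }
        }
        return @result;               # we don't go for speed
      }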

    Obviously one could implement "split" via "tag". The reason for having
    two functions is speed. "split" must be called for each document when
    indexing a collection, so its speed is essential. "tag", on the other
    hand, is called only to display a single document and may be a little
    slower; it can therefore take care of tagging bold and italic regions.
    See "WAIT::Parse::Nroff" for how this might decrease performance.

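    The list returned by "tag" alternates between a hash of tags and the
    text they apply to. The following sketch shows one way a display
    step might consume such a list. The package name "My::Parse::Tagged",
    the "render" function and its output conventions (HTML-style bold,
    upper case for the highlighted attribute) are assumptions made for
    illustration and do not describe WAIT's display modules.

```perl
use strict;
use warnings;

package My::Parse::Tagged;      # hypothetical class name for this sketch

sub tag {                       # same logic as the method shown above
    my @result;
    my $tag;
    for (split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

package main;

# Hypothetical renderer: walks the (tags, text) pairs, wraps bold
# regions in <b>...</b> and upper-cases the text of one attribute.
sub render {
    my ($highlight, @pairs) = @_;
    my $out = '';
    while (my ($tags, $text) = splice @pairs, 0, 2) {
        if    ($tags->{_b})         { $out .= "<b>$text</b>" }
        elsif ($tags->{$highlight}) { $out .= uc $text }
        else                        { $out .= $text }
    }
    return $out;
}

my $doc      = "TI: Hello world\nPY: 1995\n";
my $rendered = render('ti', My::Parse::Tagged->tag($doc));
print $rendered;
```

    For this input the title text is upper-cased, the two field labels
    are wrapped in bold markers, and the year is passed through as is.
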
  Filter definition
    From the Information Retrieval perspective, the hardest part of the
    system is the filter module. For each attribute, the database
    administrator defines how the contents should be processed before they
    are stored in the index. Usually the processing includes steps for
    restricting the character set, case transformation, splitting into
    words, and reducing words to their stems. In WAIT these steps are
    defined naturally as a pipeline of processing steps. The pipelines are
    built from functions in the package WAIT::Filter, which is pre-populated
    with the most common functions but may be extended at any time.

    The equivalent for a typical freeWAIS-sf processing would be this
    pipeline:

            [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

    The function "isotr" replaces unknown characters with blanks. "isolc"
    transforms to lower case. "split2" splits into words and removes words
    shorter than two characters. "stop" removes the freeWAIS-sf stopwords
    and "Stem" applies the Porter algorithm for computing the stem of the
    words.
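
    The idea of a pipeline of named functions can be sketched in a few
    lines. Everything below is a simplified stand-in written for this
    example: the package "My::Filter", the tiny stop list and the
    "run_pipeline" helper are assumptions, not the WAIT::Filter code, and
    the "Stem" step is left out because a Porter stemmer does not fit
    into a short sketch.

```perl
use strict;
use warnings;

# Simplified stand-ins, NOT the WAIT::Filter implementations.
package My::Filter;

# Tiny illustrative stop list; the freeWAIS-sf list is much longer.
my %STOP = map { $_ => 1 } qw(a an and of the with);

# Replace every character outside a-z, A-Z, 0-9 with a blank.
sub isotr  { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ }

# Transform to lower case.
sub isolc  { map { lc $_ } @_ }

# Split into words and drop words shorter than two characters.
sub split2 { grep { length($_) >= 2 } map { split ' ', $_ } @_ }

# Remove stop words.
sub stop   { grep { !$STOP{$_} } @_ }

# Apply the named functions in order; each takes and returns a list.
sub run_pipeline {
    my ($pipeline, @values) = @_;
    for my $name (@$pipeline) {
        my $step = My::Filter->can($name)
            or die "unknown filter '$name'";
        @values = $step->(@values);
    }
    return @values;
}

package main;

my @terms = My::Filter::run_pipeline(
    [ 'isotr', 'isolc', 'split2', 'stop' ],
    'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

    Because each step takes a list and returns a list, steps can be
    reordered or replaced without touching the others, which is what
    makes the pipeline notation convenient.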

    The filter definition for a collection defines a set of pipelines for
    the attributes and modifies the pipelines which should be used for
    prefix and interval searches.

    Several complete working examples come with WAIT in the script
    directory. It is recommended to follow the pattern of the scripts
    smakewhatis and sman.
