#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Mon May 31 22:34:35 1999
# Language        : CPerl
# Update Count    : 5
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

# $Format: "$\VERSION = sprintf '%5.3f', ($ProjectMajorVersion$ * 100 + ($ProjectMinorVersion$-1))/1000;"$
$VERSION = sprintf '%5.3f', (18 * 100 + (1-1))/1000;


bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at
best. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). freeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
which can be built.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> module together with the B<filter definitions>. At query
time a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the IDs of the
documents (did = document ID) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
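
As an illustration of the tied-hash convention, here is a minimal
sketch of an access module that serves documents from memory. The
package name C<My::Document::Memory> and its constructor arguments are
assumptions made for this example; they are not part of WAIT, and the
real modules in the C<WAIT::Document> hierarchy may look different.

```perl

  use strict;
  use warnings;

  package My::Document::Memory;   # hypothetical name, not part of WAIT

  # tie %hash, 'My::Document::Memory', did => document, ...
  sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs}, pending => [] }, $class;
  }

  # Retrieve one document by its document ID.
  sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
  }

  # Start an iteration: remember all IDs, hand out the first one.
  sub FIRSTKEY {
    my ($self) = @_;
    $self->{pending} = [ sort keys %{ $self->{docs} } ];
    return shift @{ $self->{pending} };
  }

  # Continue the iteration; undef signals the end.
  sub NEXTKEY {
    my ($self) = @_;
    return shift @{ $self->{pending} };
  }

  package main;

  tie my %collection, 'My::Document::Memory',
    doc1 => "TI: First document\n",
    doc2 => "TI: Second document\n";

  # An indexer would loop over the keys and fetch each document.
  my @dids = keys %collection;
  print "$_: $collection{$_}" for @dids;

```

Only the three methods named above plus C<TIEHASH> are implemented,
which is enough for read-only iteration and retrieval.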


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is the
implementation of C<WAIT::Parse::Base>, which handles documents with a
pretty simple tagged format:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                     # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }
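
To see what C<split> returns, the following sketch applies the same
logic to a shortened version of the sample record. The package name
C<My::Parse::Tagged> is an assumption made for this example, and the
builtin is written as C<CORE::split> only to keep it visibly distinct
from the method of the same name.

```perl

  use strict;
  use warnings;

  package My::Parse::Tagged;      # hypothetical name, not part of WAIT

  sub split {                     # same logic as the method above
    my %result;
    my $fld;

    for (CORE::split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {       # a new field starts on this line
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

  package main;

  my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

  my $fields = My::Parse::Tagged->split($record);
  printf "%s => %s\n", $_, $fields->{$_} for sort keys %$fields;

```

The result is a hash reference keyed by the lower-cased field names
(I<au>, I<py>, I<ti>). Continuation lines are appended to the field
that precedes them, including their leading whitespace.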

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> for indicating bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;               # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection, so its speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower, so it can afford to tag bold and italic
regions. See C<WAIT::Parse::Nroff> for how this might decrease
performance.
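
For illustration, this is roughly what deriving the attributes from
the output of C<tag> could look like. The method name
C<split_via_tag> and the package name C<My::Parse::Tagged> are
assumptions made for this sketch; WAIT itself does not necessarily
provide such a method.

```perl

  use strict;
  use warnings;

  package My::Parse::Tagged;      # hypothetical name, not part of WAIT

  sub tag {                       # same logic as the method above
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;
  }

  # Hypothetical: collect the text of every attribute tag and ignore
  # presentation tags such as _b and _i.
  sub split_via_tag {
    my ($self, $doc) = @_;
    my %result;
    my @tagged = $self->tag($doc);

    while (my ($tags, $text) = splice @tagged, 0, 2) {
      for my $attr (grep { !/^_/ } keys %$tags) {
        $result{$attr} .= $text;
      }
    }
    return \%result;
  }

  package main;

  my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

  my $fields = My::Parse::Tagged->split_via_tag($record);
  print "$_ =>$fields->{$_}" for sort keys %$fields;

```

Compared with a direct C<split>, this variant builds an intermediate
list of tag/text pairs for every line, and the values keep the blank
after the colon and the trailing newlines.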


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how the contents should be processed before they are
stored in the index. Usually the processing contains steps to restrict
the character set, transform the case, split the text into words, and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which is pre-populated
with the most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters with blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.
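
The pipeline idea itself can be sketched in a few lines of plain
Perl. Everything in the following example is an assumption made for
illustration: the filter names, the toy stopword list, and the crude
plural stripper (which is not the Porter algorithm) are stand-ins,
and the code does not use or reproduce the functions in
B<WAIT::Filter>.

```perl

  use strict;
  use warnings;

  my %STOP = map { $_ => 1 } qw(the of and with);   # toy stopword list

  # Each step takes a list of strings and returns a list of strings.
  sub blank_nonword  { my @out = @_; tr/a-zA-Z0-9/ /c for @out; return @out }
  sub lowercase      { return map { lc($_) } @_ }
  sub drop_stopwords { return grep { !$STOP{$_} } @_ }
  sub strip_plural   { my @out = @_; s/s$// for @out; return @out }

  sub split_min2 {                # split into words of length >= 2
    my @words;
    for my $value (@_) {
      push @words, grep { length($_) >= 2 } split ' ', $value;
    }
    return @words;
  }

  my %filter = (
    blank_nonword  => \&blank_nonword,
    lowercase      => \&lowercase,
    split_min2     => \&split_min2,
    drop_stopwords => \&drop_stopwords,
    strip_plural   => \&strip_plural,
  );

  # Feed the output of each named step into the next one.
  sub run_pipeline {
    my ($pipeline, @values) = @_;
    for my $name (@$pipeline) {
      my $step = $filter{$name} or die "unknown filter '$name'";
      @values = $step->(@values);
    }
    return @values;
  }

  my @pipeline = qw(blank_nonword lowercase split_min2
                    drop_stopwords strip_plural);
  my @terms = run_pipeline(\@pipeline,
    'Searching the Structured Documents of a WWW-Conference');
  print "@terms\n";

```

Naming the steps in an array keeps the processing order visible in
one place, and a new step only needs an entry in the dispatch table.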

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
C<smakewhatis> and C<sman>.

=cut

1	ulpfr	10	#!/usr/bin/perl
2	ulpfr	13	# -- Mode: Cperl --
3	ulpfr	10	# $Basename: WAIT.pm $
4	ulpfr	19	# $Revision: 1.7 $
5	ulpfr	10	# Author : Ulrich Pfeifer
6			# Created On : Wed Nov 5 16:59:32 1997
7			# Last Modified By: Ulrich Pfeifer
8	ulpfr	19	# Last Modified On: Mon May 31 22:34:35 1999
9	ulpfr	10	# Language : CPerl
10	ulpfr	19	# Update Count : 5
11	ulpfr	10	# Status : Unknown, Use with caution!
12	ulpfr	13	#
13	ulpfr	10	# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
14	ulpfr	13	#
15			#
16	ulpfr	10
17			package WAIT;
18			require DynaLoader;
19			use vars qw($VERSION @ISA);
20			@ISA = qw(DynaLoader);
21
22	ulpfr	19	# $Format: "$\VERSION = sprintf '%5.3f', ($ProjectMajorVersion$ * 100 + ($ProjectMinorVersion$-1))/1000;"$
23			$VERSION = sprintf '%5.3f', (18 * 100 + (1-1))/1000;
24	ulpfr	10
25	ulpfr	19
26	ulpfr	10	bootstrap WAIT $VERSION;
27
28			__END__
29
30			=head1 NAME
31
32	ulpfr	13	WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS
33	ulpfr	10
34	ulpfr	13	=head1 SYNOPSIS
35
36			A Synopsis is not yet available.
37
38	ulpfr	10	=head1 Status of this document
39
40			I started writing down some information about the implementation
41			before I forget them in my spare time. The stuff is incomplete at
42			least. Any additions, corrections, ... welcome.
43
44			=head1 PURPOSE
45
46			As you might know, I developed and maintained B<freeWAIS-sf> (with the
47			help of many people in The Net). FreeWAIS-sf is based on B<freeWAIS>
48			maintained by the Clearing House for Network Information Retrieval
49			(CNIDR) which in turn is based on B<wais-8-b5> implemented by Thinking
50			Machine et al. During this long history - implementation started about
51			1989 - many people contributed to the distribution and added features
52			not foreseen by the original design. While the system fulfills its
53			task now, the code has reached a state where adding new features is
54			nearly impossible and even fixing longstanding bugs and removing
55			limitations has become a very time consuming task.
56
57			Therefore I decided to pass the maintenance to WSC Inc. and built a
58			new system from scratch. For obvious reasons I choosed Perl as
59			implementation language.
60
61			=head1 DESCRIPTION
62
63			The central idea of the system is to provide a framework and the
64			building blocks for any indexing and search system the users might
65			want to build. Obviously the framework limits the class of system
66			which can be build.
67
68			+------+ +-----+ +------+
69			==> \|Access\| ==> \|Parse\| ==> \| \|
70			+------+ +-----+ \| \|
71			\|\| \| \| +-----+
72			\|\| \|Filter\| ==> \|Index\|
73			\/ \| \| +-----+
74			+-------+ +-----+ \| \|
75			<= \|Display\| <== \|Query\| <-> \| \|
76			+-------+ +-----+ +------+
77
78			A collection (aka table) is defined by the instances of the B<access>
79			and B<parse> module together with the B<filter definitions>. At query
80			time in addition a B<query> and a B<display> module must be choosen.
81
82			=head2 Access
83
84	ulpfr	13	The access module defines which documents are members of a database.
85			Usually an access module is a tied hash, whose keys are the Ids of the
86			documents (did = document id) and whose values are the documents
87			themselves. The indexing process loops over the keys using C<FIRSTKEY>
88			and C<NEXTKEY>. Documents are retrieved with C<FETCH>.
89	ulpfr	10
90			By convention access modules should be members of the
91			C<WAIT::Document> hierarchy. Have a look at the
92			C<WAIT::Document::Split> module to get the idea.
93
94
95			=head2 Parse
96
97	ulpfr	13	The task of the parse module is to split the documents into logical
98			parts via the C<split> method. E.g. the C<WAIT::Parse::Nroff> splits
99	ulpfr	10	manuals piped through B<nroff>(1) into the sections I<name>,
100			I<synopsis>, I<options>, I<description>, I<author>, I<example>,
101			I<bugs>, I<text>, I<see>, and I<environment>. Here is the
102	ulpfr	13	implementation of C<WAIT::Parse::Base> which handles documents with a
103	ulpfr	10	pretty simple tagged format:
104
105			AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
106			TI: Searching Structured Documents with the Enhanced Retrieval
107			Functionality of freeWAIS-sf and SFgate
108			ER: D. Kroemker
109			BT: Computer Networks and ISDN Systems; Proceedings of the third
110			International World-Wide Web Conference
111			PN: Elsevier
112			PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
113			PP: 1027-1036
114			PY: 1995
115
116			sub split { # called as method
117			my %result;
118			my $fld;
119	ulpfr	13
120	ulpfr	10	for (split /\n/, $_[1]) {
121			if (s/^(\S+):\s*//) {
122			$fld = lc $1;
123			}
124			$result{$fld} .= $_ if defined $fld;
125			}
126			return \%result;
127	ulpfr	13	}
128	ulpfr	10
129			Since the original document cannot be reconstructed from its
130			attributes, we need a second method (I<tag>) which marks the regions
131			of the document with tags for the different attributes. This tagged
132			form is used by the display module to hilight search terms in the
133			documents. Besides the tags for the attributes, the method might assign
134			the special tags C<_b> and C<_i> for indicating bold and italic
135			regions.
136
137			sub tag {
138			my @result;
139			my $tag;
140	ulpfr	13
141	ulpfr	10	for (split /\n/, $_[1]) {
142			next if /^\w\w:\s*$/;
143			if (s/^(\S+)://) {
144			push @result, {_b => 1}, "$1:";
145			$tag = lc $1;
146			}
147			if (defined $tag) {
148			push @result, {$tag => 1}, "$_\n";
149			} else {
150			push @result, {}, "$_\n";
151			}
152			}
153			return @result; # we don't go for speed
154	ulpfr	13	}
155	ulpfr	10
156			Obviously one could implement C<split> via C<tag>. The reason for
157			having two functions is speed. We need to call C<split> for each
158			document when indexing a collection. Therefore speed is essential. On
159			the other hand, C<tag> is called in order to display a single document
160			and may be a little slower. It may care about tagging bold and italic
161			regions. See C<WAIT::Parse::Nroff> how this might decrease
162			performance.
163
164
165			=head2 Filter definition
166
167			From the Information Retrieval perspective, the hardest part of the
168			system is the filter module. The database administrator defines for
169			each attribute, how the contents should be processed before it is
170			stored in the index. Usually the processing contains steps to restrict
171			the character set, case transformation, splitting to words and
172			transforming to word stems. In WAIT these steps are defined naturally
173			as a pipeline of processing steps. The pipelines are made up by
174			functions in the package B<WAIT::Filter> which is pre-populated by the
175			most common functions but may be extended any time.
176
177			The equivalent for a typical freeWAIS-sf processing would be this
178			pipeline:
179
180			[ 'isotr', 'isolc', 'split2', 'stop', 'Stem']
181
182			The function C<isotr> replaces unknown characters by blanks. C<isolc>
183			transforms to lower case. C<split2> splits into words and removes
184			words shorter than two characters. C<stop> removes the freeWAIS-sf
185			stopwords and C<Stem> applies the Porter algorithm for computing the
186			stem of the words.
187
188	ulpfr	13	The filter definition for a collection defines a set of pipelines for
189	ulpfr	10	the attributes and modifies the pipelines which should be used for
190			prefix and interval searches.
191
192	ulpfr	13	Several complete working examples come with WAIT in the script
193			directory. It is recommended to follow the pattern of the scripts
194			smakewhatis and sman.
195	ulpfr	10
196	ulpfr	13	=cut
197	ulpfr	10