CPAN/lib/WAIT.pm

#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.6 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Wed Nov 12 18:26:44 1997
# Language        : CPerl
# Update Count    : 4
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = sprintf '%.4f', map $_/10,'$ProjectVersion: 17.1 $ ' =~ /([\d.]+)/;

bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The document is incomplete at
best. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started about 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the Ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.
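
The tied-hash interface described above can be sketched as follows.
This is a minimal illustration, not code from the WAIT distribution:
the package name C<My::Document::Memory> and its constructor arguments
are hypothetical, and a real access module would typically fetch the
documents from files or a database rather than from memory.

  package My::Document::Memory;   # hypothetical name, not part of WAIT
  use strict;

  # tie %collection, 'My::Document::Memory', did => document, ...
  sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs}, keys => [] }, $class;
  }

  # Return the document for a given document id (did).
  sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
  }

  # Start an iteration over all document ids.
  sub FIRSTKEY {
    my $self = shift;
    $self->{keys} = [ sort keys %{ $self->{docs} } ];
    return shift @{ $self->{keys} };
  }

  # Continue the iteration; undef signals the end.
  sub NEXTKEY {
    my $self = shift;
    return shift @{ $self->{keys} };
  }

  package main;

  tie my %collection, 'My::Document::Memory',
    1 => "AU: Pfeifer, U.\nPY: 1995\n",
    2 => "AU: Fuhr, N.\nPY: 1995\n";

  # An indexer only needs to iterate over the keys and fetch.
  for my $did (keys %collection) {
    my $document = $collection{$did};
    print "did $did: ", length($document), " bytes\n";
  }

Because only C<FIRSTKEY>, C<NEXTKEY> and C<FETCH> are used, the same
loop works regardless of where the documents actually live.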


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is the
implementation of C<WAIT::Parse::Base>, which handles documents in a
pretty simple tagged format:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                     # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }
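
Applied to a record like the one above, the method returns a reference
to a hash keyed by the lower-cased field names. The following sketch
copies C<split> into a hypothetical package C<My::Parse> so that it
runs on its own. Note that continuation lines are appended unchanged,
including their leading whitespace.

  package My::Parse;              # hypothetical stand-in for WAIT::Parse::Base

  sub split {                     # same body as above
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

  package main;

  my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

  my $fields = My::Parse->split($record);

  # Print one line per attribute: au, py, ti.
  for my $name (sort keys %$fields) {
    print "$name => $fields->{$name}\n";
  }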

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method may assign
the special tags C<_b> and C<_i> to indicate bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;               # we don't go for speed
  }
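
The list returned by C<tag> alternates between a hash reference
holding the tags and the text of the region they apply to. A display
module could walk this list as in the following sketch. The function
C<render> and its use of asterisks for bold regions are hypothetical
and only illustrate how the pairs can be consumed; they are not how
the WAIT display modules are implemented.

  use strict;

  # Hypothetical renderer: wraps regions tagged _b in asterisks and
  # passes every other region through unchanged.
  sub render {
    my @regions = @_;
    my $out = '';
    while (@regions) {
      my ($tags, $text) = splice @regions, 0, 2;
      $out .= $tags->{_b} ? "*$text*" : $text;
    }
    return $out;
  }

  # The list that tag returns for the record "AU: Fuhr, N.\nPY: 1995".
  my @regions = (
    {_b => 1}, 'AU:', {au => 1}, " Fuhr, N.\n",
    {_b => 1}, 'PY:', {py => 1}, " 1995\n",
  );

  print render(@regions);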

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection, so its speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower. It can therefore afford to tag bold and
italic regions. See C<WAIT::Parse::Nroff> for how this might decrease
performance.
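
The remark that C<split> could be implemented via C<tag> can be
illustrated as follows. This is a sketch only: the package name
C<My::Parse> and the method name C<split_via_tag> are hypothetical,
and the result differs slightly from that of the real C<split>,
because C<tag> keeps the whitespace after the colon and appends a
newline to each line.

  package My::Parse;              # hypothetical stand-in for WAIT::Parse::Base

  sub tag {                       # same body as above
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;
  }

  # Collect the text of every region under the attributes it is tagged
  # with, ignoring the presentation tags _b and _i.
  sub split_via_tag {
    my $self = shift;
    my @regions = $self->tag(@_);
    my %result;

    while (@regions) {
      my ($tags, $text) = splice @regions, 0, 2;
      for my $attr (grep { !/^_[bi]$/ } keys %$tags) {
        $result{$attr} .= $text;
      }
    }
    return \%result;
  }

  package main;

  my $fields = My::Parse->split_via_tag("AU: Fuhr, N.\nPY: 1995\n");
  print "$_ =>$fields->{$_}" for sort keys %$fields;

The extra pass over the tagged list is exactly the overhead that the
dedicated C<split> method avoids during indexing.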


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how its contents should be processed before they are
stored in the index. Usually the processing includes steps to restrict
the character set, transform case, split the text into words, and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which is pre-populated
with the most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters with blanks. C<isolc>
converts to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm to compute the
stems of the words.
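
Conceptually, such a pipeline is applied by passing the output of each
step to the next one. The sketch below illustrates the idea with
simplified stand-ins. The package C<My::Filter>, its functions and the
helper C<apply_pipeline> are hypothetical and do not reproduce the
interface or behaviour of C<WAIT::Filter>; in particular, no stemming
is done and the stopword list is made up.

  package My::Filter;             # hypothetical, not WAIT::Filter
  use strict;

  my %stopword = map { $_ => 1 } qw(a an and of the with);

  # Each step takes a list of strings and returns a list of strings.
  sub my_tr    { my @out = @_; tr/a-zA-Z0-9/ /c for @out; return @out }
  sub my_lc    { return map { lc($_) } @_ }
  sub my_split { return grep { length($_) >= 2 } map { split(' ', $_) } @_ }
  sub my_stop  { return grep { !$stopword{$_} } @_ }

  # Run the named steps in order, feeding each result into the next step.
  sub apply_pipeline {
    my ($pipeline, @data) = @_;
    for my $name (@$pipeline) {
      my $step = My::Filter->can($name)
        or die "unknown filter '$name'";
      @data = $step->(@data);
    }
    return @data;
  }

  package main;

  my @terms = My::Filter::apply_pipeline(
    [ 'my_tr', 'my_lc', 'my_split', 'my_stop' ],
    'Searching Structured Documents with the Enhanced Retrieval',
  );
  print "@terms\n";   # searching structured documents enhanced retrieval

Looking the steps up by name keeps the pipeline itself a plain list of
strings, like the one shown above.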

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
C<smakewhatis> and C<sman>.

=cut
