cvs-head/lib/WAIT.pm

#!/usr/bin/perl
#                              -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author          : Ulrich Pfeifer
# Created On      : Wed Nov  5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Mon May 31 22:34:35 1999
# Language        : CPerl
# Update Count    : 5
# Status          : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

# $Format: "$\VERSION = sprintf '%5.3f', ($ProjectMajorVersion$ * 100 + ($ProjectMinorVersion$-1))/1000;"$
$VERSION = sprintf '%5.3f', (18 * 100 + (1-1))/1000;


bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

I started writing down some information about the implementation
before I forget them in my spare time. The stuff is incomplete at
least. Any additions, corrections, ... welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people in The Net). FreeWAIS-sf is based on B<freeWAIS>
maintained by the Clearing House for Network Information Retrieval
(CNIDR) which in turn is based on B<wais-8-b5> implemented by Thinking
Machine et al. During this long history - implementation started about
1989 - many people contributed to the distribution and added features
not foreseen by the original design. While the system fulfills its
task now, the code has reached a state where adding new features is
nearly impossible and even fixing longstanding bugs and removing
limitations has become a very time consuming task.

Therefore I decided to pass the maintenance  to WSC Inc. and built a
new system from scratch. For obvious reasons I choosed Perl as
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of system
which can be build.

       +------+     +-----+     +------+
   ==> |Access| ==> |Parse| ==> |      |
       +------+     +-----+     |      |
                       ||       |      |     +-----+
                       ||       |Filter| ==> |Index|
                       \/       |      |     +-----+
      +-------+     +-----+     |      |
   <= |Display| <== |Query| <-> |      |
      +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> module together with the B<filter definitions>. At query
time in addition a B<query> and a B<display> module must be choosen.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the Ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.


=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. E.g. the C<WAIT::Parse::Nroff> splits
manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is the
implementation of C<WAIT::Parse::Base> which handles documents with a
pretty simple tagged format:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                     # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to hilight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> for indicating bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;               # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection. Therefore speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower. It may care about tagging bold and italic
regions. See C<WAIT::Parse::Nroff> how this might decrease
performance.


=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines for
each attribute, how the contents should be processed before it is
stored in the index. Usually the processing contains steps to restrict
the character set, case transformation, splitting to words and
transforming to word stems. In WAIT these steps are defined naturally
as a pipeline of processing steps. The pipelines are made up by
functions in the package B<WAIT::Filter> which is pre-populated by the
most common functions but may be extended any time.

The equivalent for a typical freeWAIS-sf processing would be this
pipeline:

        [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.

The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.

=cut

1	#!/usr/bin/perl
2	# -- Mode: Cperl --
3	# $Basename: WAIT.pm $
4	# $Revision: 1.7 $
5	# Author : Ulrich Pfeifer
6	# Created On : Wed Nov 5 16:59:32 1997
7	# Last Modified By: Ulrich Pfeifer
8	# Last Modified On: Mon May 31 22:34:35 1999
9	# Language : CPerl
10	# Update Count : 5
11	# Status : Unknown, Use with caution!
12	#
13	# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
14	#
15	#
16
17	package WAIT;
18	require DynaLoader;
19	use vars qw($VERSION @ISA);
20	@ISA = qw(DynaLoader);
21
22	# $Format: "$\VERSION = sprintf '%5.3f', ($ProjectMajorVersion$ * 100 + ($ProjectMinorVersion$-1))/1000;"$
23	$VERSION = sprintf '%5.3f', (18 * 100 + (1-1))/1000;
24
25
26	bootstrap WAIT $VERSION;
27
28	__END__
29
30	=head1 NAME
31
32	WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS
33
34	=head1 SYNOPSIS
35
36	A Synopsis is not yet available.
37
38	=head1 Status of this document
39
40	I started writing down some information about the implementation
41	before I forget them in my spare time. The stuff is incomplete at
42	least. Any additions, corrections, ... welcome.
43
44	=head1 PURPOSE
45
46	As you might know, I developed and maintained B<freeWAIS-sf> (with the
47	help of many people in The Net). FreeWAIS-sf is based on B<freeWAIS>
48	maintained by the Clearing House for Network Information Retrieval
49	(CNIDR) which in turn is based on B<wais-8-b5> implemented by Thinking
50	Machine et al. During this long history - implementation started about
51	1989 - many people contributed to the distribution and added features
52	not foreseen by the original design. While the system fulfills its
53	task now, the code has reached a state where adding new features is
54	nearly impossible and even fixing longstanding bugs and removing
55	limitations has become a very time consuming task.
56
57	Therefore I decided to pass the maintenance to WSC Inc. and built a
58	new system from scratch. For obvious reasons I choosed Perl as
59	implementation language.
60
61	=head1 DESCRIPTION
62
63	The central idea of the system is to provide a framework and the
64	building blocks for any indexing and search system the users might
65	want to build. Obviously the framework limits the class of system
66	which can be build.
67
68	+------+ +-----+ +------+
69	==> \|Access\| ==> \|Parse\| ==> \| \|
70	+------+ +-----+ \| \|
71	\|\| \| \| +-----+
72	\|\| \|Filter\| ==> \|Index\|
73	\/ \| \| +-----+
74	+-------+ +-----+ \| \|
75	<= \|Display\| <== \|Query\| <-> \| \|
76	+-------+ +-----+ +------+
77
78	A collection (aka table) is defined by the instances of the B<access>
79	and B<parse> module together with the B<filter definitions>. At query
80	time in addition a B<query> and a B<display> module must be choosen.
81
82	=head2 Access
83
84	The access module defines which documents are members of a database.
85	Usually an access module is a tied hash, whose keys are the Ids of the
86	documents (did = document id) and whose values are the documents
87	themselves. The indexing process loops over the keys using C<FIRSTKEY>
88	and C<NEXTKEY>. Documents are retrieved with C<FETCH>.
89
90	By convention access modules should be members of the
91	C<WAIT::Document> hierarchy. Have a look at the
92	C<WAIT::Document::Split> module to get the idea.
93
94
95	=head2 Parse
96
97	The task of the parse module is to split the documents into logical
98	parts via the C<split> method. E.g. the C<WAIT::Parse::Nroff> splits
99	manuals piped through B<nroff>(1) into the sections I<name>,
100	I<synopsis>, I<options>, I<description>, I<author>, I<example>,
101	I<bugs>, I<text>, I<see>, and I<environment>. Here is the
102	implementation of C<WAIT::Parse::Base> which handles documents with a
103	pretty simple tagged format:
104
105	AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
106	TI: Searching Structured Documents with the Enhanced Retrieval
107	Functionality of freeWAIS-sf and SFgate
108	ER: D. Kroemker
109	BT: Computer Networks and ISDN Systems; Proceedings of the third
110	International World-Wide Web Conference
111	PN: Elsevier
112	PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
113	PP: 1027-1036
114	PY: 1995
115
116	sub split { # called as method
117	my %result;
118	my $fld;
119
120	for (split /\n/, $_[1]) {
121	if (s/^(\S+):\s*//) {
122	$fld = lc $1;
123	}
124	$result{$fld} .= $_ if defined $fld;
125	}
126	return \%result;
127	}
128
129	Since the original document cannot be reconstructed from its
130	attributes, we need a second method (I<tag>) which marks the regions
131	of the document with tags for the different attributes. This tagged
132	form is used by the display module to hilight search terms in the
133	documents. Besides the tags for the attributes, the method might assign
134	the special tags C<_b> and C<_i> for indicating bold and italic
135	regions.
136
137	sub tag {
138	my @result;
139	my $tag;
140
141	for (split /\n/, $_[1]) {
142	next if /^\w\w:\s*$/;
143	if (s/^(\S+)://) {
144	push @result, {_b => 1}, "$1:";
145	$tag = lc $1;
146	}
147	if (defined $tag) {
148	push @result, {$tag => 1}, "$_\n";
149	} else {
150	push @result, {}, "$_\n";
151	}
152	}
153	return @result; # we don't go for speed
154	}
155
156	Obviously one could implement C<split> via C<tag>. The reason for
157	having two functions is speed. We need to call C<split> for each
158	document when indexing a collection. Therefore speed is essential. On
159	the other hand, C<tag> is called in order to display a single document
160	and may be a little slower. It may care about tagging bold and italic
161	regions. See C<WAIT::Parse::Nroff> how this might decrease
162	performance.
163
164
165	=head2 Filter definition
166
167	From the Information Retrieval perspective, the hardest part of the
168	system is the filter module. The database administrator defines for
169	each attribute, how the contents should be processed before it is
170	stored in the index. Usually the processing contains steps to restrict
171	the character set, case transformation, splitting to words and
172	transforming to word stems. In WAIT these steps are defined naturally
173	as a pipeline of processing steps. The pipelines are made up by
174	functions in the package B<WAIT::Filter> which is pre-populated by the
175	most common functions but may be extended any time.
176
177	The equivalent for a typical freeWAIS-sf processing would be this
178	pipeline:
179
180	[ 'isotr', 'isolc', 'split2', 'stop', 'Stem']
181
182	The function C<isotr> replaces unknown characters by blanks. C<isolc>
183	transforms to lower case. C<split2> splits into words and removes
184	words shorter than two characters. C<stop> removes the freeWAIS-sf
185	stopwords and C<Stem> applies the Porter algorithm for computing the
186	stem of the words.
187
188	The filter definition for a collection defines a set of pipelines for
189	the attributes and modifies the pipelines which should be used for
190	prefix and interval searches.
191
192	Several complete working examples come with WAIT in the script
193	directory. It is recommended to follow the pattern of the scripts
194	smakewhatis and sman.
195
196	=cut
197