
Contents of /trunk/lib/WAIT.pm



Revision 108 - (show annotations)
Tue Jul 13 17:41:12 2004 UTC (19 years, 9 months ago) by dpavlin
File size: 6774 byte(s)
beginning of version 2.0 using BerkeleyDB (non-functional for now)

#!/usr/bin/perl
# -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author : Ulrich Pfeifer
# Created On : Wed Nov 5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Tue Apr 16 23:28:52 2002
# Language : CPerl
# Update Count : 8
# Status : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
use XSLoader;
# require DynaLoader;
our $VERSION;
# @ISA = qw(DynaLoader);

$VERSION = '2.000';

XSLoader::load 'WAIT', $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at
best. Additions and corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people in The Net). FreeWAIS-sf is based on B<freeWAIS>
maintained by the Clearing House for Network Information Retrieval
(CNIDR) which in turn is based on B<wais-8-b5> implemented by Thinking
Machines et al. During this long history - implementation started about
1989 - many people contributed to the distribution and added features
not foreseen by the original design. While the system fulfills its
task now, the code has reached a state where adding new features is
nearly impossible and even fixing longstanding bugs and removing
limitations has become a very time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
which can be built.

         +------+      +-----+     +------+
  ==>    |Access|  ==> |Parse| ==> |      |
         +------+      +-----+     |      |
            ||                     |      |     +-----+
            ||                     |Filter| ==> |Index|
            \/                     |      |     +-----+
        +-------+      +-----+     |      |
  <==   |Display| <==  |Query| <-> |      |
        +-------+      +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the IDs of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.

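The tied-hash interface described above can be sketched as follows.
This is only an illustration, not code from WAIT: the package name
C<Example::Document::InMemory> and its constructor arguments are made
up, and a real access module would live in the C<WAIT::Document>
hierarchy and probably read documents from files rather than from
memory.

```perl
use strict;
use warnings;

# Hypothetical access module: a read-only tied hash over in-memory
# documents. Keys are document ids (did), values are the documents.
package Example::Document::InMemory;

sub TIEHASH {
    my ($class, %docs) = @_;
    # Fix the key order once so FIRSTKEY/NEXTKEY iterate predictably.
    my $self = { docs => {%docs}, order => [sort keys %docs], pos => 0 };
    return bless $self, $class;
}

sub FETCH {
    my ($self, $did) = @_;
    return $self->{docs}{$did};
}

sub FIRSTKEY {
    my ($self) = @_;
    $self->{pos} = 0;                          # restart the iteration
    return $self->{order}[ $self->{pos}++ ];
}

sub NEXTKEY {
    my ($self) = @_;
    return $self->{order}[ $self->{pos}++ ];   # undef ends the iteration
}

package main;

tie my %collection, 'Example::Document::InMemory',
    doc1 => "TI: First document\n",
    doc2 => "TI: Second document\n";

# An indexer would walk the collection roughly like this: keys() drives
# FIRSTKEY/NEXTKEY, and the hash lookup calls FETCH.
my @seen;
for my $did (keys %collection) {
    push @seen, $did;
    print "$did => $collection{$did}";
}
```

Only the three methods named above are implemented here; a module
that also supports updates would need C<STORE>, C<DELETE> and friends
as described in L<perltie>.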

=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff> splits
manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. C<WAIT::Parse::Base>
handles documents with a pretty simple tagged format such as this one:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

Here is its implementation of C<split>:

  sub split {                   # called as method
    my %result;                 # attribute name => concatenated text
    my $fld;                    # attribute the current line belongs to

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {     # a "XX:" label starts a new attribute
        $fld = lc $1;
      }
      # lines without a label continue the current attribute
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

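To see what C<split> returns, here is a small self-contained run over
a shortened version of the sample record. The package name
C<Example::Parse::Tagged> and its trivial constructor are made up for
this sketch; the method body follows the code above, except that the
builtin is spelled C<CORE::split> to keep it apart from the method of
the same name.

```perl
use strict;
use warnings;

package Example::Parse::Tagged;   # hypothetical stand-in for WAIT::Parse::Base

sub new { return bless {}, shift }

sub split {
    my ($self, $text) = @_;
    my (%result, $fld);
    for (CORE::split /\n/, $text) {
        $fld = lc $1 if s/^(\S+):\s*//;        # label line starts a field
        $result{$fld} .= $_ if defined $fld;   # other lines continue it
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

my $fields = Example::Parse::Tagged->new->split($record);
print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

Note that the continuation line of the title is appended to the first
line without its line break, so the layout of the original record is
lost at this point.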
Since the original document cannot be reconstructed from its
attributes, we need a second method (C<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> for indicating bold and italic
regions.

  sub tag {
    my @result;                 # alternating list: tag hash, text
    my $tag;                    # attribute the current line belongs to

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;     # skip empty fields
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";   # the label itself is bold
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";         # text before the first label
      }
    }
    return @result;             # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection. Therefore speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower. It may care about tagging bold and italic
regions. See C<WAIT::Parse::Nroff> for how this might decrease
performance.

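To illustrate the remark that C<split> could be implemented via
C<tag>, here is a rough sketch. The helper C<split_via_tag> is made
up for this example and is not part of WAIT. It walks the
alternating list of tag hashes and text chunks, ignores the bold
labels, and collects the remaining text per attribute. It matches
C<split> only for records without empty fields, because C<tag> drops
those lines while C<split> does not.

```perl
use strict;
use warnings;

# Same logic as the tag method above, written as a plain function.
sub tag {
    my ($text) = @_;
    my (@result, $tag);
    for (split /\n/, $text) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

# Hypothetical helper: rebuild the attribute hash from the tagged form.
sub split_via_tag {
    my ($text) = @_;
    my %result;
    my @pairs = tag($text);
    my $after_label = 0;
    while (my ($tags, $chunk) = splice @pairs, 0, 2) {
        if ($tags->{_b}) {              # "XX:" label, not part of the text
            $after_label = 1;
            next;
        }
        my ($fld) = keys %$tags;        # at most one attribute tag per chunk
        if (defined $fld) {
            chomp $chunk;
            $chunk =~ s/^\s*// if $after_label;   # split strips this too
            $result{$fld} .= $chunk;
        }
        $after_label = 0;
    }
    return \%result;
}

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'TI: Searching Structured Documents with the Enhanced Retrieval',
    '    Functionality of freeWAIS-sf and SFgate',
    'PY: 1995';

my $got = split_via_tag($record);
print "$_ = $got->{$_}\n" for sort keys %$got;
```

The extra list building and the second pass over the result give an
idea why indexing uses the direct C<split> instead.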

=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how its contents should be processed before they are
stored in the index. Usually the processing contains steps to restrict
the character set, transform case, split the text into words and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which comes with the
most common functions but may be extended at any time.

The equivalent for a typical freeWAIS-sf processing would be this
pipeline:

  [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.

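The pipeline idea can be sketched in a few lines. Everything in this
example is a toy stand-in written for illustration, not the code in
B<WAIT::Filter>: the filters only know ASCII, the stopword list is
invented, the C<Stem> step is left out, and the helper
C<apply_pipeline> as well as the assumption that each step maps a
list of strings to a list of strings are mine.

```perl
use strict;
use warnings;

# Toy stopword list, not the freeWAIS-sf one.
my %STOP = map { $_ => 1 } qw(a an and of the to with);

# Hypothetical stand-ins for some filter functions. Each takes a list
# of strings and returns a list of strings, so steps chain freely.
my %filter = (
    # replace everything but ASCII letters and digits by blanks
    isotr  => sub { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ },
    # transform to lower case
    isolc  => sub { map { lc $_ } @_ },
    # split into words, drop words shorter than two characters
    split2 => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
    # drop stopwords
    stop   => sub { grep { !$STOP{$_} } @_ },
);

# Feed the output of each step into the next one.
sub apply_pipeline {
    my ($steps, @input) = @_;
    for my $name (@$steps) {
        my $step = $filter{$name} or die "unknown filter '$name'";
        @input = $step->(@input);
    }
    return @input;
}

my @terms = apply_pipeline(
    [qw(isotr isolc split2 stop)],
    'Searching Structured Documents with the Enhanced Retrieval Functionality of freeWAIS-sf',
);
print "@terms\n";
```

Keeping every step as a list-to-list function means a pipeline is
just an array of names, which is what makes per-attribute
definitions like the one above easy to write down.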
The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the F<script>
directory. It is recommended to follow the pattern of the scripts
C<smakewhatis> and C<sman>.

=cut


Properties

Name Value
cvs2svn:cvs-rev 1.3
