
Contents of /branches/CPAN/lib/WAIT.pm



Revision 13 - (show annotations)
Fri Apr 28 15:42:44 2000 UTC (24 years ago) by ulpfr
File size: 6821 byte(s)
Import of WAIT-1.710

#!/usr/bin/perl
# -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.6 $
# Author : Ulrich Pfeifer
# Created On : Wed Nov 5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Wed Nov 12 18:26:44 1997
# Language : CPerl
# Update Count : 4
# Status : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

$VERSION = sprintf '%.4f', map $_/10,'$ProjectVersion: 17.1 $ ' =~ /([\d.]+)/;

bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A Synopsis is not yet available.

=head1 Status of this document

In my spare time I started writing down some information about the
implementation before I forget it. The material is incomplete at
best. Any additions or corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started around 1989) many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance on to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
which can be built.

         +------+     +-----+     +------+
     ==> |Access| ==> |Parse| ==> |      |
         +------+     +-----+     |      |
                        ||        |      |     +-----+
                        ||        |Filter| ==> |Index|
                        \/        |      |     +-----+
        +-------+     +-----+     |      |
     <= |Display| <== |Query| <-> |      |
        +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash whose keys are the IDs of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.


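To make the tied-hash idea concrete, here is a minimal sketch of an
access module that serves documents from memory. The package name
C<My::Document::InMemory> and its constructor arguments are invented
for this illustration and are not part of WAIT; a real module would
presumably live in the C<WAIT::Document> hierarchy and read from files
or a database instead.

```perl
use strict;
use warnings;

# Hypothetical access module (not part of WAIT): a read-only tied hash
# mapping document ids to document texts held in memory.
package My::Document::InMemory;

sub TIEHASH {
  my ($class, %docs) = @_;
  my $self = { docs => {%docs}, order => [sort keys %docs], pos => 0 };
  return bless $self, $class;
}

# Retrieve one document by its id.
sub FETCH {
  my ($self, $did) = @_;
  return $self->{docs}{$did};
}

# Start a new iteration over all document ids.
sub FIRSTKEY {
  my ($self) = @_;
  $self->{pos} = 0;
  return $self->{order}[ $self->{pos}++ ];
}

# Continue the iteration; returns undef once all ids were delivered.
sub NEXTKEY {
  my ($self) = @_;
  return $self->{order}[ $self->{pos}++ ];
}

package main;

tie my %collection, 'My::Document::InMemory',
  1 => "TI: First document\n",
  2 => "TI: Second document\n";

# An indexer would walk the keys and fetch each document in turn.
my @seen;
while (defined(my $did = each %collection)) {
  push @seen, "$did => $collection{$did}";
}
print @seen;
```

The loop at the end only relies on C<FIRSTKEY>, C<NEXTKEY> and
C<FETCH>, which is why any storage backend can hide behind the same
hash interface.
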
=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff> splits
manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is the
implementation of C<WAIT::Parse::Base>, which handles documents with a
pretty simple tagged format:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                   # called as method
    my %result;
    my $fld;

    for (split /\n/, $_[1]) {
      # A leading "XX:" starts a new field; lines without one
      # continue the current field.
      if (s/^(\S+):\s*//) {
        $fld = lc $1;
      }
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;            # hash ref: field name => text
  }

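The following sketch shows what calling such a C<split> method might
look like. It wraps a copy of the method above in a hypothetical
package C<My::Parse::Tagged> (the name is an assumption made for this
example, as is the indentation of the continuation line) and feeds it
a shortened version of the sample record.

```perl
use strict;
use warnings;

package My::Parse::Tagged;      # hypothetical name, not part of WAIT

# Copy of the split method shown above; the builtin is written as
# CORE::split to make clear that it is not the method itself.
sub split {
  my %result;
  my $fld;

  for (CORE::split /\n/, $_[1]) {
    if (s/^(\S+):\s*//) {
      $fld = lc $1;
    }
    $result{$fld} .= $_ if defined $fld;
  }
  return \%result;
}

package main;

my $record = join "\n",
  'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
  'TI: Searching Structured Documents with the Enhanced Retrieval',
  '    Functionality of freeWAIS-sf and SFgate',
  'PY: 1995';

my $fields = My::Parse::Tagged->split($record);

# Field names are lower-cased; continuation lines are appended as-is.
for my $name (sort keys %$fields) {
  print "$name => $fields->{$name}\n";
}
```

Note that continuation lines are appended with their leading
whitespace intact, so the two lines of the title end up in the single
attribute C<ti>.
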
Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> to indicate bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;     # skip fields without content
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";   # field label, marked bold
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;             # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection, so speed is essential there. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower. It may take care of tagging bold and italic
regions. See C<WAIT::Parse::Nroff> for how this might decrease
performance.


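As an illustration of the remark that C<split> could be built on top
of C<tag>, here is a rough sketch. The package name
C<My::Parse::Tagged> and the method name C<split_via_tag> are made up
for this example. The sketch walks the (tags, text) pairs returned by
C<tag> and collects the text of every attribute tag, ignoring the
special tags that start with an underscore.

```perl
use strict;
use warnings;

package My::Parse::Tagged;      # hypothetical name, not part of WAIT

# Copy of the tag method shown above.
sub tag {
  my @result;
  my $tag;

  for (split /\n/, $_[1]) {
    next if /^\w\w:\s*$/;
    if (s/^(\S+)://) {
      push @result, {_b => 1}, "$1:";
      $tag = lc $1;
    }
    if (defined $tag) {
      push @result, {$tag => 1}, "$_\n";
    } else {
      push @result, {}, "$_\n";
    }
  }
  return @result;
}

# Hypothetical helper: derive the attribute hash from the tagged form.
sub split_via_tag {
  my ($self, $doc) = @_;
  my %result;
  my @pairs = $self->tag($doc);

  # tag returns a flat list of (tag hash, text) pairs.
  while (my ($tags, $text) = splice @pairs, 0, 2) {
    for my $attr (grep { !/^_/ } keys %$tags) {
      $result{$attr} .= $text;
    }
  }
  return \%result;
}

package main;

my $doc = "AU: Pfeifer, U.; Fuhr, N.; Huynh, T.\nPY: 1995";
my $fields = My::Parse::Tagged->split_via_tag($doc);
print "$_ =>$fields->{$_}" for sort keys %$fields;
```

The result is close to, but not identical with, the output of
C<split>: the text keeps the blank after the colon and a trailing
newline, because C<tag> preserves the layout needed for display. The
extra list of hashes built for every line also hints at why the
direct implementation is the faster one.
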
=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines for
each attribute how its contents should be processed before they are
stored in the index. Usually the processing contains steps to restrict
the character set, transform case, split the text into words, and
reduce the words to their stems. In WAIT these steps are defined
naturally as a pipeline of processing steps. The pipelines are made up
of functions in the package B<WAIT::Filter>, which is pre-populated
with the most common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

  [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.

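The pipeline idea can be sketched in a few lines of plain Perl. The
functions below are simplified stand-ins written for this example and
are B<not> the functions from B<WAIT::Filter>; their names, the tiny
stopword list, the crude suffix stripping used in place of the Porter
algorithm, and the calling convention (each step maps a list of
strings to a list of strings) are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for filter steps; NOT the WAIT::Filter code.
my %stopword = map { $_ => 1 } qw(the of and with);

my %filter = (
  # replace everything except ASCII letters and digits by blanks
  to_blank  => sub { map { my $s = $_; $s =~ tr/a-zA-Z0-9/ /c; $s } @_ },
  # case transformation
  lowercase => sub { map { lc $_ } @_ },
  # split into words and drop words shorter than two characters
  split2    => sub { grep { length($_) >= 2 } map { split ' ', $_ } @_ },
  # remove stopwords
  stop      => sub { grep { !$stopword{$_} } @_ },
  # crude placeholder for stemming: strip one trailing "s"
  strip_s   => sub { map { my $s = $_; $s =~ s/s$//; $s } @_ },
);

# Apply the named steps one after the other to the data.
sub run_pipeline {
  my ($pipeline, @data) = @_;
  for my $name (@$pipeline) {
    my $step = $filter{$name} or die "unknown filter '$name'";
    @data = $step->(@data);
  }
  return @data;
}

my @terms = run_pipeline(
  [qw(to_blank lowercase split2 stop strip_s)],
  'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";   # prints: searching structured document enhanced retrieval
```

Describing a pipeline as a list of step names keeps the definition
declarative: changing the treatment of an attribute means editing the
list, not the code of the individual steps.
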
The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
C<smakewhatis> and C<sman>.

=cut
