Annotation of /cvs-head/lib/WAIT.pm


Revision 19
Tue May 9 11:29:45 2000 UTC by ulpfr
Original Path: branches/CPAN/lib/WAIT.pm
File size: 6903 byte(s)
Import of WAIT-1.800

#!/usr/bin/perl
# -*- Mode: Cperl -*-
# $Basename: WAIT.pm $
# $Revision: 1.7 $
# Author : Ulrich Pfeifer
# Created On : Wed Nov 5 16:59:32 1997
# Last Modified By: Ulrich Pfeifer
# Last Modified On: Mon May 31 22:34:35 1999
# Language : CPerl
# Update Count : 5
# Status : Unknown, Use with caution!
#
# (C) Copyright 1997, Ulrich Pfeifer, all rights reserved.
#
#

package WAIT;
require DynaLoader;
use vars qw($VERSION @ISA);
@ISA = qw(DynaLoader);

# $Format: "$\VERSION = sprintf '%5.3f', ($ProjectMajorVersion$ * 100 + ($ProjectMinorVersion$-1))/1000;"$
$VERSION = sprintf '%5.3f', (18 * 100 + (1-1))/1000;


bootstrap WAIT $VERSION;

__END__

=head1 NAME

WAIT - a rewrite of the freeWAIS-sf engine in Perl and XS

=head1 SYNOPSIS

A synopsis is not yet available.

=head1 Status of this document

I started writing down some information about the implementation in
my spare time, before I forget it. The material is incomplete, to say
the least. Additions and corrections are welcome.

=head1 PURPOSE

As you might know, I developed and maintained B<freeWAIS-sf> (with the
help of many people on the Net). FreeWAIS-sf is based on B<freeWAIS>,
maintained by the Clearinghouse for Networked Information Discovery
and Retrieval (CNIDR), which in turn is based on B<wais-8-b5>,
implemented by Thinking Machines et al. During this long history
(implementation started around 1989), many people contributed to the
distribution and added features not foreseen by the original design.
While the system fulfills its task now, the code has reached a state
where adding new features is nearly impossible, and even fixing
longstanding bugs and removing limitations has become a very
time-consuming task.

Therefore I decided to pass the maintenance to WSC Inc. and build a
new system from scratch. For obvious reasons I chose Perl as the
implementation language.

=head1 DESCRIPTION

The central idea of the system is to provide a framework and the
building blocks for any indexing and search system the users might
want to build. Obviously the framework limits the class of systems
that can be built.

        +------+     +-----+     +------+
    ==> |Access| ==> |Parse| ==> |      |
        +------+     +-----+     |      |
           ||                    |      |     +-----+
           ||                    |Filter| ==> |Index|
           \/                    |      |     +-----+
       +-------+     +-----+     |      |
    <= |Display| <== |Query| <-> |      |
       +-------+     +-----+     +------+

A collection (aka table) is defined by the instances of the B<access>
and B<parse> modules together with the B<filter definitions>. At query
time, a B<query> and a B<display> module must be chosen in addition.

=head2 Access

The access module defines which documents are members of a database.
Usually an access module is a tied hash, whose keys are the Ids of the
documents (did = document id) and whose values are the documents
themselves. The indexing process loops over the keys using C<FIRSTKEY>
and C<NEXTKEY>. Documents are retrieved with C<FETCH>.

By convention access modules should be members of the
C<WAIT::Document> hierarchy. Have a look at the
C<WAIT::Document::Split> module to get the idea.

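The tied-hash contract described above can be illustrated with a minimal sketch. The class name C<Demo::Document::Memory> and its in-memory storage are assumptions made for this example; it is not one of the access modules shipped with WAIT, and a real module would normally live in the C<WAIT::Document> hierarchy.

```perl
use strict;
use warnings;

# Hypothetical access class (not part of WAIT): documents held in memory.
package Demo::Document::Memory;

sub TIEHASH {
    my ($class, %docs) = @_;
    return bless { docs => {%docs} }, $class;
}

# did => document text
sub FETCH { my ($self, $did) = @_; return $self->{docs}{$did} }

sub FIRSTKEY {
    my ($self) = @_;
    my $reset = keys %{ $self->{docs} };    # reset the internal iterator
    return scalar each %{ $self->{docs} };
}

sub NEXTKEY { my ($self) = @_; return scalar each %{ $self->{docs} } }

package main;

tie my %collection, 'Demo::Document::Memory',
    1 => "TI: First document\n",
    2 => "TI: Second document\n";

# An indexing loop could look like this: keys() drives FIRSTKEY and
# NEXTKEY, and the hash lookup drives FETCH.
my @seen;
for my $did (sort keys %collection) {
    push @seen, "$did => $collection{$did}";
}
print @seen;
```

Only the three methods named in the text plus C<TIEHASH> are implemented here; a complete tied hash would usually also provide C<STORE>, C<EXISTS> and C<DELETE>.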

=head2 Parse

The task of the parse module is to split the documents into logical
parts via the C<split> method. For example, C<WAIT::Parse::Nroff>
splits manuals piped through B<nroff>(1) into the sections I<name>,
I<synopsis>, I<options>, I<description>, I<author>, I<example>,
I<bugs>, I<text>, I<see>, and I<environment>. Here is the
implementation of C<WAIT::Parse::Base>, which handles documents with a
pretty simple tagged format:

  AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
  TI: Searching Structured Documents with the Enhanced Retrieval
      Functionality of freeWAIS-sf and SFgate
  ER: D. Kroemker
  BT: Computer Networks and ISDN Systems; Proceedings of the third
      International World-Wide Web Conference
  PN: Elsevier
  PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
  PP: 1027-1036
  PY: 1995

  sub split {                   # called as method
    my %result;
    my $fld;                    # attribute currently being filled

    for (split /\n/, $_[1]) {
      if (s/^(\S+):\s*//) {     # "XX: value" starts a new attribute
        $fld = lc $1;
      }
      # lines without a tag continue the current attribute
      $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
  }

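To make the shape of the return value concrete, here is a small usage sketch. The package name C<Demo::Parse> is a stand-in invented for this example so that the block runs without WAIT installed; the method body repeats the logic shown above.

```perl
use strict;
use warnings;

package Demo::Parse;    # hypothetical stand-in for WAIT::Parse::Base

sub split {             # same logic as the method shown above
    my %result;
    my $fld;
    for (split /\n/, $_[1]) {
        if (s/^(\S+):\s*//) {
            $fld = lc $1;
        }
        $result{$fld} .= $_ if defined $fld;
    }
    return \%result;
}

package main;

my $record = join "\n",
    'AU: Pfeifer, U.; Fuhr, N.; Huynh, T.',
    'PN: Elsevier',
    'PY: 1995';

# The method returns a hash reference keyed by the lower-cased tag names.
my $fields = Demo::Parse->split($record);
print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

The keys of the returned hash are the attribute names that the filter definitions later refer to.
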
Since the original document cannot be reconstructed from its
attributes, we need a second method (I<tag>) which marks the regions
of the document with tags for the different attributes. This tagged
form is used by the display module to highlight search terms in the
documents. Besides the tags for the attributes, the method might assign
the special tags C<_b> and C<_i> to indicate bold and italic
regions.

  sub tag {
    my @result;
    my $tag;

    for (split /\n/, $_[1]) {
      next if /^\w\w:\s*$/;     # skip attributes without a value
      if (s/^(\S+)://) {
        push @result, {_b => 1}, "$1:";   # attribute name in bold
        $tag = lc $1;
      }
      if (defined $tag) {
        push @result, {$tag => 1}, "$_\n";
      } else {
        push @result, {}, "$_\n";
      }
    }
    return @result;             # we don't go for speed
  }

Obviously one could implement C<split> via C<tag>. The reason for
having two functions is speed. We need to call C<split> for each
document when indexing a collection, so its speed is essential. On
the other hand, C<tag> is called in order to display a single document
and may be a little slower. It may care about tagging bold and italic
regions. See C<WAIT::Parse::Nroff> for an example of how this can
decrease performance.

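As an illustration of the remark that C<split> could be built on top of C<tag>, the following sketch collects the tagged regions back into an attribute hash. The package C<Demo::Parse> and the helper C<split_via_tag> are hypothetical names introduced here; the C<tag> body repeats the method shown above. Note that the values are not identical to those of the real C<split>: they keep the whitespace after the colon and a trailing newline.

```perl
use strict;
use warnings;

package Demo::Parse;    # hypothetical stand-in, not a WAIT class

sub tag {               # same logic as the tag method shown above
    my @result;
    my $tag;
    for (split /\n/, $_[1]) {
        next if /^\w\w:\s*$/;
        if (s/^(\S+)://) {
            push @result, {_b => 1}, "$1:";
            $tag = lc $1;
        }
        if (defined $tag) {
            push @result, {$tag => 1}, "$_\n";
        } else {
            push @result, {}, "$_\n";
        }
    }
    return @result;
}

# Hypothetical helper: rebuild the attribute hash from the tagged form.
sub split_via_tag {
    my ($self, $doc) = @_;
    my %result;
    my @tagged = $self->tag($doc);
    # The tagged form is a flat list of (tag hash, text) pairs.
    while (my ($tags, $text) = splice @tagged, 0, 2) {
        for my $attr (grep { !/^_/ } keys %$tags) {    # skip _b and _i
            $result{$attr} .= $text;
        }
    }
    return \%result;
}

package main;

my $fields = Demo::Parse->split_via_tag("AU: Pfeifer, U.\nPY: 1995\n");
print "$_ =$fields->{$_}" for sort keys %$fields;
```

The extra list construction and the second pass over the pairs show why a dedicated C<split> is preferable when every document of a collection has to be processed.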

=head2 Filter definition

From the Information Retrieval perspective, the hardest part of the
system is the filter module. The database administrator defines, for
each attribute, how its contents should be processed before they are
stored in the index. Usually the processing contains steps to restrict
the character set, transform case, split into words, and reduce words
to their stems. In WAIT these steps are defined naturally as a
pipeline of processing steps. The pipelines are made up of functions
in the package B<WAIT::Filter>, which is pre-populated with the most
common functions but may be extended at any time.

The equivalent of typical freeWAIS-sf processing would be this
pipeline:

  [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

The function C<isotr> replaces unknown characters by blanks. C<isolc>
transforms to lower case. C<split2> splits into words and removes
words shorter than two characters. C<stop> removes the freeWAIS-sf
stopwords and C<Stem> applies the Porter algorithm for computing the
stem of the words.

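The pipeline idea can be sketched as a list of function names applied from left to right, each step consuming the output of the previous one. Everything in the block below is an assumption made for illustration: the package C<Demo::Filter>, the C<demo_*> functions, the C<apply> helper and the four-word stopword list are simplified stand-ins, not the implementations in B<WAIT::Filter>, and the stemming step is left out.

```perl
use strict;
use warnings;

package Demo::Filter;   # hypothetical stand-ins, not the WAIT::Filter code

my %stop = map { $_ => 1 } qw(the of and with);    # toy stopword list

# Replace everything except ASCII letters and digits by blanks.
sub demo_tr     { map { (my $s = $_) =~ tr/a-zA-Z0-9/ /c; $s } @_ }
# Transform to lower case.
sub demo_lc     { map { lc $_ } @_ }
# Split into words and drop words shorter than two characters.
sub demo_split2 { grep { length($_) >= 2 } map { split ' ', $_ } @_ }
# Drop stopwords.
sub demo_stop   { grep { !$stop{$_} } @_ }

# Run the named functions left to right on the data.
sub apply {
    my ($pipeline, @data) = @_;
    for my $name (@$pipeline) {
        my $step = __PACKAGE__->can($name) or die "unknown filter: $name";
        @data = $step->(@data);
    }
    return @data;
}

package main;

my @terms = Demo::Filter::apply(
    [ 'demo_tr', 'demo_lc', 'demo_split2', 'demo_stop' ],
    'Searching Structured Documents with the Enhanced Retrieval',
);
print "@terms\n";
```

Looking the steps up by name keeps the pipeline a plain list of strings, in the same spirit as the array shown above, so a new step only needs a new function in the package.
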
The filter definition for a collection defines a set of pipelines for
the attributes and modifies the pipelines which should be used for
prefix and interval searches.

Several complete working examples come with WAIT in the script
directory. It is recommended to follow the pattern of the scripts
smakewhatis and sman.

=cut

Properties

Name Value
cvs2svn:cvs-rev 1.1.1.3