/[SWISH-PlusPlus]/trunk/PlusPlus.pm
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /trunk/PlusPlus.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 22 - (show annotations)
Tue Dec 7 16:05:43 2004 UTC (19 years, 3 months ago) by dpavlin
File size: 16746 byte(s)
DEVELOPMENT STOPPED: it seems that swish++ isn't faster than swish-e if you
remove clever heuristics about english words. So, this project is abandoned.

1 package SWISH::PlusPlus;
2
3 use 5.008004;
4 use strict;
5 use warnings;
6
7 our $VERSION = '0.20';
8
9 use Carp;
10 use File::Temp qw/ tempdir /;
11 use BerkeleyDB;
12 use Storable qw(store retrieve freeze thaw);
13 use YAML;
14
15 =head1 NAME
16
17 SWISH::PlusPlus - Perl extension for full-text indexer SWISH++ with properties support
18
19 =head1 SYNOPSIS
20
21 use SWISH::PlusPlus;
22
23 my $i = new SWISH::PlusPlus(
24 index_dir => '/tmp/foo',
25 );
26 $i->add( 42 => 'meaning of life' );
27
28 print $i->search("meaning"); # returns 42
29
30 =head1 DESCRIPTION
31
32 This is perl module to use SWISH++ indexer by Paul J. Lucas. SWISH++ is
33 rewrite of swish-e in C++ which is extremely fast (due to mmap usage and
34 clever language heuristics), but without support for properties (which this
35 module tries to fix).
36
37 Implementation of API is something in-between C<SWISH::API> and
38 C<Plucene::Simple>. It should be easy to replace Plucene or swish-e with
39 this module for increased performance. However, this module is not plug-in
40 replacement.
41
42 =head1 METHODS
43
44 =head2 new
45
46 Create new instance for index.
47
48 my $i = SWISH::PlusPlus->new(
49 index_dir => '/path/to/index',
50 index => 'index++',
51 search => 'search++',
52 debug => 1,
53 meta_in_body => 1,
54 use_stopwords => 1,
55 );
56
57 Options are described below:
58
59 =over 5
60
61 =item C<index_dir>
62
63 Path to directory in which index and meta database will be created.
64
65 =item C<index>
66
67 Full or partial path to SWISH++ index executable. By default, it's B<index>
68 for self-compiled version. If you use Debian GNU/Linux package specify
69 B<index++>. See C<Debian>.
70
71 =item C<search>
72
73 Full or partial path to SWISH++ search executable. By default, it's B<search>.
74
75 =item C<debug>
76
77 This option (off by default) will produce a lot of debugging output on
78 C<STDERR> prefixed by C<##>.
79
80 =item C<meta_in_body>
81
82 This option (off by default) enables to search content of meta fields
83 without specifying them (like they are in body of document). This will
84 somewhat increase index size.
85
86 =item C<use_stopwords>
87
88 Use built-in SWISH++ stop words. By default, they are disabled.
89
90 =back
91
92 =cut
93
94 sub new {
95 my $class = shift;
96 my $self = {@_};
97 bless($self, $class);
98
99 foreach (qw(index_dir)) {
100 croak "need $_" unless $self->{$_};
101 }
102
103 my $index_dir = $self->{'index_dir'};
104
105 my $cwd;
106 chomp($cwd = `pwd`);
107 $self->{'cwd'} = $cwd || carp "can't get cwd!";
108
109 if ($index_dir !~ m#^/#) {
110 $index_dir = "$cwd/$index_dir";
111 print STDERR "## full path to index_dir: $index_dir\n" if ($self->{'debug'});
112 $self->{'index_dir'} = $index_dir;
113 }
114
115 if (! -e $index_dir) {
116 mkdir $index_dir || confess "can't create index ",$self->{'index'},": $!";
117 }
118
119 # default executables
120 $self->{'index'} ||= 'index';
121 $self->{'search'} ||= 'search';
122
123 print STDERR "## new index_dir: ",$index_dir," index: ",$self->{'index'}, " search: ",$self->{'search'},"\n" if ($self->{'debug'});
124
125 $self ? return $self : return undef;
126 }
127
128
129 =head2 check_bin
130
131 Check if SWISH++ binaries specified in L<new> are available and verify
132 version signature.
133
134 if ($i->check_bin) {
135 print "swish++ binaries found\n";
136 };
137
138 It will also setup property
139
140 $i->{'version'}
141
142 which you can examined to see numeric version (something like C<6.0.4>).
143
144 =cut
145
146 sub check_bin {
147 my $self = shift;
148
149 my $i = `$self->{'index'} -V 2>&1` || confess "can't find '",$self->{'index'},"' binary";
150 my $s = `$self->{'search'} -V 2>&1` || confess "can't find '",$self->{'search'},"' binary";
151
152 chomp $i;
153 chomp $s;
154
155 confess $self->{'index'}," binary is not SWISH++" unless ($i =~ m/^SWISH\+\+/);
156 confess $self->{'search'}," binary is not SWISH++" unless ($s =~ m/^SWISH\+\+/);
157
158 if ($i eq $s) {
159 $i =~ s/^SWISH\+\+\s+// || confess "can't strip SWISH++ from version";
160 $self->{'version'} = $i;
161 return 1;
162 } else {
163 carp "version difference: index is $i while search is $s";
164 return;
165 }
166
167 }
168
169 =head2 index_document
170
171 Quick way to add simple data to index.
172
173 $i->index_document($path, $data);
174 $i->index_document(
175 42 => 'meaning of life',
176 1984 => 'Oh!',
177 );
178
179 C<$path> value is really path, so you don't want to use directory
180 separators (slashes, /) in it probably.
181
182 =cut
183
184 sub index_document {
185 my $self = shift;
186
187 my %doc = @_;
188
189 foreach my $id (keys %doc) {
190 $self->_create_doc(
191 path => $id,
192 body => $doc{$id},
193 );
194 }
195
196 return 1;
197 }
198
199 =head2 add
200
201 Add document with meta-data to index.
202
203 $i->add(
204 path => 'path/to/document',
205 title => 'this is result title',
206 meta => {
207 description => 'this is description meta tag',
208 date => '2004-11-04',
209 author => 'Dobrica Pavlinusic',
210 }
211 body => 'this is text without meta data',
212 );
213
214 This is thin wrapper round L<_create_doc>.
215
216 =cut
217
218 sub add {
219 my $self = shift;
220
221 return $self->_create_doc(@_);
222 }
223
224
225 =head2 delete
226
227 Delete document from index.
228
229 $i->delete("document/path");
230
231 If deletion is succesfull returns revision of deleted document, otherwise
232 undef.
233
234 =cut
235
236 sub delete {
237 my $self = shift;
238
239 my $path = shift || carp "empty path?";
240
241 print STDERR "## delete: $path\n" if ($self->{'debug'});
242
243 my $rev = $self->{'meta_db'}->{"R$path"};
244 if ($rev) {
245 $self->{'_deleted'}->{$path} = $rev;
246 $self->{'_deleted_counter'}++;
247 print STDERR "## deleted revision $rev, counter: ",$self->{'_deleted_counter'}++,"\n" if ($self->{'debug'});
248 return $rev;
249 }
250
251 return undef;
252 }
253
254
255 =head2 search
256
257 Search your index using any valid SWISH++ query.
258
259 my @results = $i->search("swish query");
260
261 Returns array with elements like this:
262
263 {
264 rank => 10, # rank of result
265 path => 'path to result', # path to result
266 size => 999, # size in bytes
267 title => 'title of result' # title meta property
268 }
269
270 =cut
271
272 sub search {
273 my $self = shift;
274
275 my $query = shift || return;
276
277 $self->finish_update;
278 $self->_tie_meta_db(DB_RDONLY);
279
280 my @results;
281
282 # escape double quotes in query for shell
283 $query =~ s/"/\\"/g;
284
285 my $open_cmd = $self->{'search'} .
286 ' -i ' . $self->{'index_dir'}.'/index' .
287 ' "' . $query . '"'.
288 ' |';
289 print STDERR "## search: $open_cmd\n" if ($self->{'debug'});
290
291 my %r;
292
293 open(SEARCH, $open_cmd) || confess "can't start $open_cmd: $!";
294 my $l;
295 while($l = <SEARCH>) {
296 next if ($l =~ /^#/);
297 chomp($l);
298 print STDERR "## $l\n" if ($self->{'debug'});
299 my ($rank,$path,$size,$rev,$title) = split(/ /,$l,5);
300 $path =~ s#^\./##; # strip from path
301
302 # get current revision
303 $r{$path} = $self->{'meta_db'}->{"R$path"};
304
305 # skip if old revision
306 next if ($r{$path} > $rev);
307
308 print STDERR "## current revision $rev\n" if ($self->{'debug'});
309
310 push @results, {
311 rank => $rank,
312 path => $path,
313 size => $size,
314 title => $title,
315 } unless ($self->{'_deleted'}->{$path} && $self->{'_deleted'}->{$path} <= $rev);
316 }
317
318 close(SEARCH) || confess "can't close search";
319
320 #print STDERR "## results: ",Dump(@results),"\n" if ($self->{'debug'});
321
322 return @results;
323 }
324
325 =head2 property
326
327 Return stored meta property from result or result path.
328
329 print $i->property('path', 'meta name');
330 print $i->property($res->{'path'}, 'meta name');
331 print $i->property('path');
332 print $i->property($res->{'path'});
333
334 Returns one meta property (if meta name is specified) or whole hash with
335 all meta properties.
336
337 =cut
338
339 sub property {
340 my $self = shift;
341
342 my $path = shift || return;
343 my $meta = shift;
344
345 if ($path =~ m/^HASH/) {
346 $path = $path->{'path'} || confess "can't find path in input data";
347 }
348
349 my $val = $self->{'meta_db'}->{"M$path"};
350
351 # FIXME should we die here like swish-e does?
352 return unless ($val);
353
354 $val = thaw($val);
355
356 print STDERR "## property $path $meta: ",(Dump($val) || 'undef'),"\n" if ($self->{'debug'});
357
358 return $val->{$meta} if ($meta);
359
360 return $val;
361 }
362
363 =head2 finish_update
364
365 This method will close index binary and enable search. Searching is not
366 available while indexing is in process.
367
368 $i->finish_update;
369
370 Usually, you don't need to call this method directly. It will be called on
371 DESTROY when $i goes out of scope or when you first call search in session
372 if indexing was started.
373
374 =cut
375
376 sub finish_update {
377 my $self = shift;
378
379 print STDERR "## finish_update\n" if ($self->{'debug'});
380
381 $self->_close_index && $self->_untie_meta_db;
382 }
383
384 sub DESTROY {
385 my $self = shift;
386 $self->finish_update;
387 }
388
389 =head1 PRIVATE METHODS
390
391 Private methods implement internals for creating temporary files needed for
392 SWISH++. You should have no need to call them directly, and they are here
393 just to have documentation.
394
395 =head2 _init_indexer
396
397 Create temporary directory in which files for indexing will be created and
398 start index process.
399
400 my $i->_init_indexer || die "can't start indexer";
401
402 It will also create empty file C<_stopwords_> to disable stop words.
403
404 =cut
405
406 sub _init_indexer {
407 my $self = shift;
408
409 return if ($self->{'_index_fh'});
410
411 my $tmp_dir = tempdir( CLEANUP => 1 ) || confess "can't create temporary directory: $!";
412 $self->{'tmp_dir'} = $tmp_dir;
413
414 chdir $tmp_dir || confess "can't chdir to ".$tmp_dir.": $!";
415
416 print STDERR "## tmp_dir: $tmp_dir\n" if ($self->{'debug'});
417
418 my $opt = "-v " . ($self->{'debug'} || '0');
419
420 my $index_dir = $self->{'index_dir'} || confess "no index_dir?";
421 my $index_file = $index_dir . '/index';
422
423 if (-e $index_file && ! -z $index_file) {
424 $opt .= ' -I ';
425 $self->{'_incremental'} = 1;
426 print STDERR "## using incremental indexing for $index_file\n" if ($self->{'debug'});
427 } else {
428 $self->{'_incremental'} = 0;
429 }
430
431 unless ($self->{'use_stopwrods'}) {
432 open(STOP, '>', "_stopwords_") || carp "can't create empty stopword file, skipping\n";
433 print STOP " ";
434 close(STOP);
435 $opt .= " -s _stopwords_";
436 }
437
438 my $open_cmd = '| '.$self->{'index'}.' '.$opt.' -e "html:*" -i '.$index_file.' -';
439
440 print STDERR "## init_indexer: $open_cmd\n" if ($self->{'debug'});
441
442 open($self->{'_index_fh'}, $open_cmd) || confess "can't start index with $open_cmd: $!";
443
444 chdir $self->{'cwd'} || confess "can't chdir to ".$self->{'cwd'}.": $!";
445
446 $self->_tie_meta_db(DB_CREATE);
447
448 return $self->{'_index_fh'};
449 }
450
451 =head2 _create_doc
452
453 Create temporary file and pass it's name to SWISH++
454
455 $i->_create_doc(
456 path => 'path/to/store/in/index',
457 title => 'this is title in results',
458 body => 'data to story in body tag',
459 meta => {
460 'meta name' => 'data for this meta',
461 'another' => 'again more data',
462 }
463 );
464
465 =cut
466
467 sub _create_doc {
468 my $self = shift;
469
470 my $arg = {@_};
471
472 # open indexer if needed
473 $self->_init_indexer;
474
475 my $path = $self->{'tmp_dir'} || confess "no tmp_dir?";
476 my $id = $arg->{'path'} || confess "no path?";
477 $path .= "/$id";
478
479 my $rev = $self->{'rev'}++;
480
481 print STDERR "## _create_doc: $path [$rev]\n" if ($self->{'debug'});
482
483 open(TMP, '>', $path) || die "can't create temp file $path: $!";
484
485 print TMP '<html><head>';
486
487 my $body = $arg->{'body'};
488
489 if (defined($body)) {
490 $self->{'meta_db'}->{"B$id"} = $body;
491 } else {
492 $body = '';
493 }
494
495 my $title = $arg->{'title'};
496
497 if ($arg->{'meta'}) {
498 foreach my $name (keys %{$arg->{'meta'}}) {
499 my $content = $arg->{'meta'}->{$name};
500 print TMP qq{<meta name="$name" content="$content">};
501 $body .= " $content" if ($self->{'meta_in_body'});
502 }
503 $arg->{'meta'}->{'title'} = $title;
504 $self->{'meta_db'}->{"M$id"} = freeze($arg->{'meta'});
505 }
506
507 if (defined($title)) {
508 $title = "$rev $title";
509 $body .= " $title" if ($self->{'meta_in_body'});
510 } else {
511 $title = "$rev $id";
512 }
513
514 # dump html
515 print TMP "<title>$title</title></head><body>$body</body></html>";
516
517 close(TMP) || confess "can't close tmp file ".$arg->{'path'}.": $!";
518
519 print { $self->{'_index_fh'} } "$id\n" || confess "can't pass document $id to indexer: $!";
520
521 $self->{'meta_db'}->{"R$id"} = $rev;
522
523 # FIXME this is probably not the right place to update global
524 # maximum revision, but it keeps database in sane state
525 $self->{'meta_db'}->{"Crev"} = $rev;
526 }
527
528 =head2 _close_index
529
530 Close index after indexing.
531
532 $i->_close_index;
533
534 You have to close index before searching.
535
536 =cut
537
538 sub _close_index {
539 my $self = shift;
540
541 $self->_store_deleted;
542
543 return unless ($self->{'_index_fh'});
544
545 print STDERR "## close index\n" if ($self->{'debug'});
546
547 close($self->{'_index_fh'}) || confess "can't close index: $!";
548 undef $self->{'_index_fh'};
549
550 if ($self->{'_incremental'}) {
551 print STDERR "## move new index over old\n" if ($self->{'debug'});
552 rename $self->{'index_dir'}.'/index.new',$self->{'index_dir'}.'/index' || die "can't move new index over old one: $!";
553 }
554
555 return 1;
556 }
557
558 =head2 _tie_meta_db
559
560 Open BerkeleyDB database with meta properties.
561
562 $i->_tie_meta_db(DB_CREATE);
563 $i->_tie_meta_db(DB_RDONLY);
564
565 }
566
567 =cut
568
569 sub _tie_meta_db {
570 my $self = shift;
571
572 my $flags = shift || confess "need DB_CREATE or DB_RDONLY";
573
574 return if ($self->{'_meta_db_flags'} && $self->{'_meta_db_flags'} == $flags);
575
576 print STDERR "## _tie_meta_db($flags)\n" if ($self->{'debug'});
577
578 $self->_untie_meta_db;
579 $self->{'_meta_db_flags'} = $flags;
580
581 my $file = $self->{'index_dir'}.'/meta.db';
582
583 tie %{$self->{'meta_db'}}, "BerkeleyDB::Hash",
584 -Filename => $file,
585 -Flags => $flags
586 or confess "cannot open $file: $! $BerkeleyDB::Error\n" ;
587
588 $self->{'rev'} = $self->{'meta_db'}->{'Crev'} || 0;
589
590 my $delref = $self->{'meta_db'}->{'Cdeleted'};
591 if ($delref) {
592 $self->{'_deleted'} = thaw($delref);
593
594 print "## deleted ",keys %{$self->{'_deleted'}}," records\n" if ($self->{'debug'});
595 } else {
596 $self->{'_deleted'} = {};
597 }
598
599 $self->{'_deleted_counter'} = 0;
600 return 1;
601 }
602
603 =head2 _untie_meta_db
604
605 Close BerkeleyDB database with meta properties.
606
607 $i->_untie_meta_db;
608
609 =cut
610
611 sub _untie_meta_db {
612 my $self = shift;
613
614 return unless ($self->{'meta_db'});
615
616 print STDERR "## _untie_meta_db\n" if ($self->{'debug'});
617 untie %{$self->{'meta_db'}} || confess "can't untie!";
618 undef $self->{'meta_db'};
619 undef $self->{'_meta_db_flags'};
620
621 return 1;
622 }
623
624
625 =head2 _store_deleted
626
627 Save hash of deleted files using L<Storable>.
628
629 $i->_store_deleted;
630
631 =cut
632
633 sub _store_deleted {
634 my $self = shift;
635
636 return if (! $self->{'_deleted_counter'});
637
638 print STDERR "## save deleted ",Dump($self->{'_deleted'}) if ($self->{'debug'});
639
640 my $d = freeze($self->{'_deleted'});
641
642 $self->_tie_meta_db(DB_CREATE);
643
644 $self->{'meta_db'}->{'Cdeleted'} = $d ||
645 carp "can't store deleted: $!";
646
647 # reset counter
648 $self->{'_deleted_counter'} = 0;
649 }
650
651 1;
652 __END__
653
654 =head2 EXPORT
655
656 None by default.
657
658 =head1 RELATED
659
660 =head2 Debian
661
662 Debian version of SWISH++ is often old (version 5 at moment of this writing
663 while version 6 is available in source code), so this module by default
664 uses executable names B<index> and B<search> for self-compiled version
665 instead of one from Debian package. See L<new> how to specify Debian
666 default binaries B<index++> and B<search++>.
667
668 =head2 SWISH++
669
670 Aside from very good rewrite in C++, SWISH++ is faster because it uses
671 claver heuristics about which data in input files are words to index and
672 which are not. It's based on English language and might be best choice if
673 you plan to index large amount of long text documents.
674
675 However, if you plan to index all data from structured storage (e.g. RDBMS)
676 you might want B<all> words from data to end up in index as opposed to just
677 those which look like English words. This is especially important if you
678 don't plan to index English texts with this module.
679
680 With distribution build versions of SWISH++ you might have problems with
681 disapearing words. To overcome this problem, you will have to compile and
682 configure SWISH++ yourself (because language characteristics are
683 compilation-time option).
684
685 Compilation of SWISH++ is easy process well described on project's web
686 pages. To see my very relaxed sample configuration take a look at C<swish++>
687 directory included in distribution.
688
689 =head2 SWISH++ config
690
691 C<config.h> located in C<swish++> directory of this distribution is relaxed
692 SWISH++ configuration that will index all words passed to it. This
693 configuration is needed for B<date test> because default configuration
694 doesn't recognize 2004-12-05 as date. Have in mind that your index size
695 might explode.
696
697 =head1 BUGS
698
699 Currently there is no way to specify which meta data will be stored as
700 properties. B<This will be fixed very soon>.
701
702 There is no garbage collection on temporary files created for SWISH++. This
703 means that one run of indexer will take additional disk space for temporary
704 files, which will be removed at end. There should be some way to remove
705 files after they are indexed by SWISH++. However, at this early stage of
706 development it's just not supported yet. Have plenty of disk space!
707
708 =head1 SEE ALSO
709
710 SWISH++ web site L<http://homepage.mac.com/pauljlucas/software/swish/>
711
712 =head1 AUTHOR
713
714 Dobrica Pavlinusic, E<lt>dpavlin@rot13.orgE<gt>
715
716 =head1 COPYRIGHT AND LICENSE
717
718 Copyright (C) 2004 by Dobrica Pavlinusic
719
720 This library is free software; you can redistribute it and/or modify
721 it under the same terms as Perl itself, either Perl version 5.8.4 or,
722 at your option, any later version of Perl 5 you may have available.
723
724
725 =cut

  ViewVC Help
Powered by ViewVC 1.1.26