SWISH-PlusPlus/trunk/PlusPlus.pm

package SWISH::PlusPlus;

use 5.008004;
use strict;
use warnings;

our $VERSION = '0.20';

use Carp;
use File::Temp qw/ tempdir /;
use BerkeleyDB;
use Storable qw(store retrieve freeze thaw);
use YAML;

=head1 NAME

SWISH::PlusPlus - Perl extension for full-text indexer SWISH++ with properties support

=head1 SYNOPSIS

  use SWISH::PlusPlus;

  my $i = new SWISH::PlusPlus(
        index_dir => '/tmp/foo',
  );
  $i->add( 42 => 'meaning of life' );

  print $i->search("meaning");  # returns 42

=head1 DESCRIPTION

This is perl module to use SWISH++ indexer by Paul J. Lucas. SWISH++ is
rewrite of swish-e in C++ which is extremely fast (due to mmap usage and
clever language heuristics), but without support for properties (which this
module tries to fix).

Implementation of API is something in-between C<SWISH::API> and
C<Plucene::Simple>. It should be easy to replace Plucene or swish-e with
this module for increased performance. However, this module is not plug-in
replacement.

=head1 METHODS

=head2 new

Create new instance for index.

  my $i = SWISH::PlusPlus->new(
        index_dir => '/path/to/index',
        index => 'index++',
        search => 'search++',
        debug => 1,
        meta_in_body => 1,
        use_stopwords => 1,
  );

Options are described below:

=over 5

=item C<index_dir>

Path to directory in which index and meta database will be created.

=item C<index>

Full or partial path to SWISH++ index executable. By default, it's B<index>
for self-compiled version. If you use Debian GNU/Linux package specify
B<index++>. See C<Debian>.

=item C<search>

Full or partial path to SWISH++ search executable. By default, it's B<search>.

=item C<debug>

This option (off by default) will produce a lot of debugging output on
C<STDERR> prefixed by C<##>.

=item C<meta_in_body>

This option (off by default) enables to search content of meta fields
without specifying them (like they are in body of document). This will
somewhat increase index size.

=item C<use_stopwords>

Use built-in SWISH++ stop words. By default, they are disabled.

=back

=cut

sub new {
        my $class = shift;
        my $self = {@_};
        bless($self, $class);

        foreach (qw(index_dir)) {
                croak "need $_" unless $self->{$_};
        }

        my $index_dir = $self->{'index_dir'};

        my $cwd;
        chomp($cwd = `pwd`);
        $self->{'cwd'} = $cwd || carp "can't get cwd!";
        
        if ($index_dir !~ m#^/#) {
                $index_dir = "$cwd/$index_dir";
                print STDERR "## full path to index_dir: $index_dir\n" if ($self->{'debug'});
                $self->{'index_dir'} = $index_dir;
        }

        if (! -e $index_dir) {
                mkdir $index_dir || confess "can't create index ",$self->{'index'},": $!";
        }

        # default executables
        $self->{'index'} ||= 'index';
        $self->{'search'} ||= 'search';

        print STDERR "## new index_dir: ",$index_dir," index: ",$self->{'index'}, " search: ",$self->{'search'},"\n" if ($self->{'debug'});

        $self ? return $self : return undef;
}


=head2 check_bin

Check if SWISH++ binaries specified in L<new> are available and verify
version signature.

  if ($i->check_bin) {
        print "swish++ binaries found\n";
  };

It will also setup property 

  $i->{'version'}

which you can examined to see numeric version (something like C<6.0.4>).

=cut

sub check_bin {
        my $self = shift;

        my $i = `$self->{'index'} -V 2>&1` || confess "can't find '",$self->{'index'},"' binary";
        my $s = `$self->{'search'} -V 2>&1` || confess "can't find '",$self->{'search'},"' binary";

        chomp $i;
        chomp $s;

        confess $self->{'index'}," binary is not SWISH++" unless ($i =~ m/^SWISH\+\+/);
        confess $self->{'search'}," binary is not SWISH++" unless ($s =~ m/^SWISH\+\+/);

        if ($i eq $s) {
                $i =~ s/^SWISH\+\+\s+// || confess "can't strip SWISH++ from version";
                $self->{'version'} = $i;
                return 1;
        } else  {
                carp "version difference: index is $i while search is $s";
                return;
        }

}

=head2 index_document

Quick way to add simple data to index.

  $i->index_document($path, $data);
  $i->index_document(
        42 => 'meaning of life',
        1984 => 'Oh!',
  );

C<$path> value is really path, so you don't want to use directory
separators (slashes, /) in it probably.

=cut

sub index_document {
        my $self = shift;

        my %doc = @_;

        foreach my $id (keys %doc) {
                $self->_create_doc(
                        path => $id,
                        body => $doc{$id},
                );
        }

        return 1;
}

=head2 add

Add document with meta-data to index.

  $i->add(
        path => 'path/to/document',
        title => 'this is result title',
        meta => {
                description => 'this is description meta tag',
                date => '2004-11-04',
                author => 'Dobrica Pavlinusic',
        }
        body => 'this is text without meta data',
  );

This is thin wrapper round L<_create_doc>.

=cut

sub add {
        my $self = shift;

        return $self->_create_doc(@_);
}


=head2 delete

Delete document from index.

  $i->delete("document/path");

If deletion is succesfull returns revision of deleted document, otherwise
undef.

=cut

sub delete {
        my $self = shift;

        my $path = shift || carp "empty path?";

        print STDERR "## delete: $path\n" if ($self->{'debug'});

        my $rev = $self->{'meta_db'}->{"R$path"};
        if ($rev) {
                $self->{'_deleted'}->{$path} = $rev;
                $self->{'_deleted_counter'}++;
                print STDERR "## deleted revision $rev, counter: ",$self->{'_deleted_counter'}++,"\n" if ($self->{'debug'});
                return $rev;
        }

        return undef;
}


=head2 search

Search your index using any valid SWISH++ query.

  my @results = $i->search("swish query");

Returns array with elements like this:

  {
   rank => 10,                  # rank of result
   path => 'path to result',    # path to result
   size => 999,                 # size in bytes
   title => 'title of result'   # title meta property
  }

=cut

sub search {
        my $self = shift;

        my $query = shift || return;

        $self->finish_update;
        $self->_tie_meta_db(DB_RDONLY);

        my @results;

        # escape double quotes in query for shell
        $query =~ s/"/\\"/g;

        my $open_cmd = $self->{'search'} .
                ' -i ' . $self->{'index_dir'}.'/index' .
                ' "' . $query . '"'.
                ' |';
        print STDERR "## search: $open_cmd\n" if ($self->{'debug'});

        my %r;

        open(SEARCH, $open_cmd) || confess "can't start $open_cmd: $!";
        my $l;
        while($l = <SEARCH>) {
                next if ($l =~ /^#/);
                chomp($l);
                print STDERR "## $l\n" if ($self->{'debug'});
                my ($rank,$path,$size,$rev,$title) = split(/ /,$l,5);
                $path =~ s#^\./##; # strip from path

                # get current revision
                $r{$path} = $self->{'meta_db'}->{"R$path"};

                # skip if old revision
                next if ($r{$path} > $rev);

                print STDERR "## current revision $rev\n" if ($self->{'debug'});

                push @results, {
                        rank => $rank,
                        path => $path,
                        size => $size,
                        title => $title,
                } unless ($self->{'_deleted'}->{$path} && $self->{'_deleted'}->{$path} <= $rev);
        }

        close(SEARCH) || confess "can't close search";

        #print STDERR "## results: ",Dump(@results),"\n" if ($self->{'debug'});

        return @results;
}

=head2 property

Return stored meta property from result or result path.

  print $i->property('path', 'meta name');
  print $i->property($res->{'path'}, 'meta name');
  print $i->property('path');
  print $i->property($res->{'path'});

Returns one meta property (if meta name is specified) or whole hash with
all meta properties.

=cut

sub property {
        my $self = shift;

        my $path = shift || return;
        my $meta = shift;

        if ($path =~ m/^HASH/) {
                $path = $path->{'path'} || confess "can't find path in input data";
        }

        my $val = $self->{'meta_db'}->{"M$path"};

        # FIXME should we die here like swish-e does?
        return unless ($val);

        $val = thaw($val);

        print STDERR "## property $path $meta: ",(Dump($val) || 'undef'),"\n" if ($self->{'debug'});

        return $val->{$meta} if ($meta);

        return $val;
}

=head2 finish_update

This method will close index binary and enable search. Searching is not
available while indexing is in process.

  $i->finish_update;

Usually, you don't need to call this method directly. It will be called on
DESTROY when $i goes out of scope or when you first call search in session
if indexing was started.

=cut

sub finish_update {
        my $self = shift;

        print STDERR "## finish_update\n" if ($self->{'debug'});

        $self->_close_index && $self->_untie_meta_db;
}

sub DESTROY {
        my $self = shift;
        $self->finish_update;
}

=head1 PRIVATE METHODS

Private methods implement internals for creating temporary files needed for
SWISH++. You should have no need to call them directly, and they are here
just to have documentation.

=head2 _init_indexer

Create temporary directory in which files for indexing will be created and
start index process.

  my $i->_init_indexer || die "can't start indexer";

It will also create empty file C<_stopwords_> to disable stop words.

=cut

sub _init_indexer {
        my $self = shift;

        return if ($self->{'_index_fh'});

        my $tmp_dir = tempdir( CLEANUP => 1 ) || confess "can't create temporary directory: $!";
        $self->{'tmp_dir'} = $tmp_dir;

        chdir $tmp_dir || confess "can't chdir to ".$tmp_dir.": $!";

        print STDERR "## tmp_dir: $tmp_dir\n" if ($self->{'debug'});

        my $opt = "-v " . ($self->{'debug'} || '0');

        my $index_dir = $self->{'index_dir'} || confess "no index_dir?";
        my $index_file = $index_dir . '/index';

        if (-e $index_file && ! -z $index_file) {
                $opt .= ' -I ';
                $self->{'_incremental'} = 1;
                print STDERR "## using incremental indexing for $index_file\n" if ($self->{'debug'});
        } else {
                $self->{'_incremental'} = 0;
        }

        unless ($self->{'use_stopwrods'}) {
                open(STOP, '>', "_stopwords_") || carp "can't create empty stopword file, skipping\n";
                print STOP "  ";
                close(STOP);
                $opt .= " -s _stopwords_";
        }

        my $open_cmd = '| '.$self->{'index'}.' '.$opt.' -e "html:*" -i '.$index_file.' -';

        print STDERR "## init_indexer: $open_cmd\n" if ($self->{'debug'});

        open($self->{'_index_fh'}, $open_cmd) || confess "can't start index with $open_cmd: $!";

        chdir $self->{'cwd'} || confess "can't chdir to ".$self->{'cwd'}.": $!";

        $self->_tie_meta_db(DB_CREATE);

        return $self->{'_index_fh'};
}

=head2 _create_doc

Create temporary file and pass it's name to SWISH++

  $i->_create_doc(
        path => 'path/to/store/in/index',
        title => 'this is title in results',
        body => 'data to story in body tag',
        meta => {
                'meta name' => 'data for this meta',
                'another' => 'again more data',
        }
  );

=cut

sub _create_doc {
        my $self = shift;

        my $arg = {@_};

        # open indexer if needed
        $self->_init_indexer;

        my $path = $self->{'tmp_dir'} || confess "no tmp_dir?";
        my $id = $arg->{'path'} || confess "no path?";
        $path .= "/$id";

        my $rev = $self->{'rev'}++;

        print STDERR "## _create_doc: $path [$rev]\n" if ($self->{'debug'});

        open(TMP, '>', $path) || die "can't create temp file $path: $!";

        print TMP '<html><head>';

        my $body = $arg->{'body'};

        if (defined($body)) {
                $self->{'meta_db'}->{"B$id"} = $body;
        } else {
                $body = '';
        }

        my $title = $arg->{'title'};

        if ($arg->{'meta'}) {
                foreach my $name (keys %{$arg->{'meta'}}) {
                        my $content = $arg->{'meta'}->{$name};
                        print TMP qq{<meta name="$name" content="$content">};
                        $body .= " $content" if ($self->{'meta_in_body'});
                }
                $arg->{'meta'}->{'title'} = $title;
                $self->{'meta_db'}->{"M$id"} = freeze($arg->{'meta'});
        }

        if (defined($title)) {
                $title = "$rev $title";
                $body .= " $title" if ($self->{'meta_in_body'});
        } else {
                $title = "$rev $id";
        }

        # dump html
        print TMP "<title>$title</title></head><body>$body</body></html>";
        
        close(TMP) || confess "can't close tmp file ".$arg->{'path'}.": $!";

        print { $self->{'_index_fh'} } "$id\n" || confess "can't pass document $id to indexer: $!";
        
        $self->{'meta_db'}->{"R$id"} = $rev;

        # FIXME this is probably not the right place to update global
        # maximum revision, but it keeps database in sane state
        $self->{'meta_db'}->{"Crev"} = $rev;
}

=head2 _close_index

Close index after indexing.

  $i->_close_index;

You have to close index before searching.

=cut

sub _close_index {
        my $self = shift;

        $self->_store_deleted;

        return unless ($self->{'_index_fh'});

        print STDERR "## close index\n" if ($self->{'debug'});

        close($self->{'_index_fh'}) || confess "can't close index: $!";
        undef $self->{'_index_fh'};

        if ($self->{'_incremental'}) {
                print STDERR "## move new index over old\n" if ($self->{'debug'});
                rename $self->{'index_dir'}.'/index.new',$self->{'index_dir'}.'/index' || die "can't move new index over old one: $!";
        }

        return 1;
}

=head2 _tie_meta_db

Open BerkeleyDB database with meta properties.

  $i->_tie_meta_db(DB_CREATE);
  $i->_tie_meta_db(DB_RDONLY);

}

=cut

sub _tie_meta_db  {
        my $self = shift;

        my $flags = shift || confess "need DB_CREATE or DB_RDONLY";

        return if ($self->{'_meta_db_flags'} && $self->{'_meta_db_flags'} == $flags);

        print STDERR "## _tie_meta_db($flags)\n" if ($self->{'debug'});

        $self->_untie_meta_db;
        $self->{'_meta_db_flags'} = $flags;

        my $file = $self->{'index_dir'}.'/meta.db';

        tie %{$self->{'meta_db'}}, "BerkeleyDB::Hash",
                -Filename => $file,
                -Flags    => $flags
        or confess "cannot open $file: $! $BerkeleyDB::Error\n" ;

        $self->{'rev'} = $self->{'meta_db'}->{'Crev'} || 0;

        my $delref = $self->{'meta_db'}->{'Cdeleted'};
        if ($delref) {
                $self->{'_deleted'} = thaw($delref);

                print "## deleted ",keys %{$self->{'_deleted'}}," records\n" if ($self->{'debug'});
        } else {
                $self->{'_deleted'} = {};
        }

        $self->{'_deleted_counter'} = 0;
        return 1;
}

=head2 _untie_meta_db

Close BerkeleyDB database with meta properties.

  $i->_untie_meta_db;

=cut

sub _untie_meta_db {
        my $self = shift;

        return unless ($self->{'meta_db'});

        print STDERR "## _untie_meta_db\n" if ($self->{'debug'});
        untie %{$self->{'meta_db'}} || confess "can't untie!";
        undef $self->{'meta_db'};
        undef $self->{'_meta_db_flags'};

        return 1;
}


=head2 _store_deleted

Save hash of deleted files using L<Storable>.

  $i->_store_deleted;

=cut

sub _store_deleted {
        my $self = shift;

        return if (! $self->{'_deleted_counter'});

        print STDERR "## save deleted ",Dump($self->{'_deleted'}) if ($self->{'debug'});

        my $d = freeze($self->{'_deleted'});

        $self->_tie_meta_db(DB_CREATE);

        $self->{'meta_db'}->{'Cdeleted'} = $d ||
                carp "can't store deleted: $!";

        # reset counter
        $self->{'_deleted_counter'} = 0;
}

1;
__END__

=head2 EXPORT

None by default.

=head1 RELATED

=head2 Debian

Debian version of SWISH++ is often old (version 5 at moment of this writing
while version 6 is available in source code), so this module by default
uses executable names B<index> and B<search> for self-compiled version
instead of one from Debian package. See L<new> how to specify Debian
default binaries B<index++> and B<search++>.

=head2 SWISH++

Aside from very good rewrite in C++, SWISH++ is faster because it uses
claver heuristics about which data in input files are words to index and
which are not. It's based on English language and might be best choice if
you plan to index large amount of long text documents.

However, if you plan to index all data from structured storage (e.g. RDBMS)
you might want B<all> words from data to end up in index as opposed to just
those which look like English words. This is especially important if you
don't plan to index English texts with this module.

With distribution build versions of SWISH++ you might have problems with
disapearing words. To overcome this problem, you will have to compile and
configure SWISH++ yourself (because language characteristics are
compilation-time option).

Compilation of SWISH++ is easy process well described on project's web
pages. To see my very relaxed sample configuration take a look at C<swish++>
directory included in distribution.

=head2 SWISH++ config

C<config.h> located in C<swish++> directory of this distribution is relaxed
SWISH++ configuration that will index all words passed to it. This
configuration is needed for B<date test> because default configuration
doesn't recognize 2004-12-05 as date. Have in mind that your index size
might explode.

=head1 BUGS

Currently there is no way to specify which meta data will be stored as
properties. B<This will be fixed very soon>.

There is no garbage collection on temporary files created for SWISH++. This
means that one run of indexer will take additional disk space for temporary
files, which will be removed at end. There should be some way to remove
files after they are indexed by SWISH++. However, at this early stage of
development it's just not supported yet. Have plenty of disk space!

=head1 SEE ALSO

SWISH++ web site L<http://homepage.mac.com/pauljlucas/software/swish/>

=head1 AUTHOR

Dobrica Pavlinusic, E<lt>dpavlin@rot13.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2004 by Dobrica Pavlinusic

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.4 or,
at your option, any later version of Perl 5 you may have available.


=cut
1	package SWISH::PlusPlus;
2
3	use 5.008004;
4	use strict;
5	use warnings;
6
7	our $VERSION = '0.20';
8
9	use Carp;
10	use File::Temp qw/ tempdir /;
11	use BerkeleyDB;
12	use Storable qw(store retrieve freeze thaw);
13	use YAML;
14
15	=head1 NAME
16
17	SWISH::PlusPlus - Perl extension for full-text indexer SWISH++ with properties support
18
19	=head1 SYNOPSIS
20
21	use SWISH::PlusPlus;
22
23	my $i = new SWISH::PlusPlus(
24	index_dir => '/tmp/foo',
25	);
26	$i->add( 42 => 'meaning of life' );
27
28	print $i->search("meaning"); # returns 42
29
30	=head1 DESCRIPTION
31
32	This is perl module to use SWISH++ indexer by Paul J. Lucas. SWISH++ is
33	rewrite of swish-e in C++ which is extremely fast (due to mmap usage and
34	clever language heuristics), but without support for properties (which this
35	module tries to fix).
36
37	Implementation of API is something in-between C<SWISH::API> and
38	C<Plucene::Simple>. It should be easy to replace Plucene or swish-e with
39	this module for increased performance. However, this module is not plug-in
40	replacement.
41
42	=head1 METHODS
43
44	=head2 new
45
46	Create new instance for index.
47
48	my $i = SWISH::PlusPlus->new(
49	index_dir => '/path/to/index',
50	index => 'index++',
51	search => 'search++',
52	debug => 1,
53	meta_in_body => 1,
54	use_stopwords => 1,
55	);
56
57	Options are described below:
58
59	=over 5
60
61	=item C<index_dir>
62
63	Path to directory in which index and meta database will be created.
64
65	=item C<index>
66
67	Full or partial path to SWISH++ index executable. By default, it's B<index>
68	for self-compiled version. If you use Debian GNU/Linux package specify
69	B<index++>. See C<Debian>.
70
71	=item C<search>
72
73	Full or partial path to SWISH++ search executable. By default, it's B<search>.
74
75	=item C<debug>
76
77	This option (off by default) will produce a lot of debugging output on
78	C<STDERR> prefixed by C<##>.
79
80	=item C<meta_in_body>
81
82	This option (off by default) enables to search content of meta fields
83	without specifying them (like they are in body of document). This will
84	somewhat increase index size.
85
86	=item C<use_stopwords>
87
88	Use built-in SWISH++ stop words. By default, they are disabled.
89
90	=back
91
92	=cut
93
94	sub new {
95	my $class = shift;
96	my $self = {@_};
97	bless($self, $class);
98
99	foreach (qw(index_dir)) {
100	croak "need $_" unless $self->{$_};
101	}
102
103	my $index_dir = $self->{'index_dir'};
104
105	my $cwd;
106	chomp($cwd = `pwd`);
107	$self->{'cwd'} = $cwd \|\| carp "can't get cwd!";
108
109	if ($index_dir !~ m#^/#) {
110	$index_dir = "$cwd/$index_dir";
111	print STDERR "## full path to index_dir: $index_dir\n" if ($self->{'debug'});
112	$self->{'index_dir'} = $index_dir;
113	}
114
115	if (! -e $index_dir) {
116	mkdir $index_dir \|\| confess "can't create index ",$self->{'index'},": $!";
117	}
118
119	# default executables
120	$self->{'index'} \|\|= 'index';
121	$self->{'search'} \|\|= 'search';
122
123	print STDERR "## new index_dir: ",$index_dir," index: ",$self->{'index'}, " search: ",$self->{'search'},"\n" if ($self->{'debug'});
124
125	$self ? return $self : return undef;
126	}
127
128
129	=head2 check_bin
130
131	Check if SWISH++ binaries specified in L<new> are available and verify
132	version signature.
133
134	if ($i->check_bin) {
135	print "swish++ binaries found\n";
136	};
137
138	It will also setup property
139
140	$i->{'version'}
141
142	which you can examined to see numeric version (something like C<6.0.4>).
143
144	=cut
145
146	sub check_bin {
147	my $self = shift;
148
149	my $i = `$self->{'index'} -V 2>&1` \|\| confess "can't find '",$self->{'index'},"' binary";
150	my $s = `$self->{'search'} -V 2>&1` \|\| confess "can't find '",$self->{'search'},"' binary";
151
152	chomp $i;
153	chomp $s;
154
155	confess $self->{'index'}," binary is not SWISH++" unless ($i =~ m/^SWISH\+\+/);
156	confess $self->{'search'}," binary is not SWISH++" unless ($s =~ m/^SWISH\+\+/);
157
158	if ($i eq $s) {
159	$i =~ s/^SWISH\+\+\s+// \|\| confess "can't strip SWISH++ from version";
160	$self->{'version'} = $i;
161	return 1;
162	} else {
163	carp "version difference: index is $i while search is $s";
164	return;
165	}
166
167	}
168
169	=head2 index_document
170
171	Quick way to add simple data to index.
172
173	$i->index_document($path, $data);
174	$i->index_document(
175	42 => 'meaning of life',
176	1984 => 'Oh!',
177	);
178
179	C<$path> value is really path, so you don't want to use directory
180	separators (slashes, /) in it probably.
181
182	=cut
183
184	sub index_document {
185	my $self = shift;
186
187	my %doc = @_;
188
189	foreach my $id (keys %doc) {
190	$self->_create_doc(
191	path => $id,
192	body => $doc{$id},
193	);
194	}
195
196	return 1;
197	}
198
199	=head2 add
200
201	Add document with meta-data to index.
202
203	$i->add(
204	path => 'path/to/document',
205	title => 'this is result title',
206	meta => {
207	description => 'this is description meta tag',
208	date => '2004-11-04',
209	author => 'Dobrica Pavlinusic',
210	}
211	body => 'this is text without meta data',
212	);
213
214	This is thin wrapper round L<_create_doc>.
215
216	=cut
217
218	sub add {
219	my $self = shift;
220
221	return $self->_create_doc(@_);
222	}
223
224
225	=head2 delete
226
227	Delete document from index.
228
229	$i->delete("document/path");
230
231	If deletion is succesfull returns revision of deleted document, otherwise
232	undef.
233
234	=cut
235
236	sub delete {
237	my $self = shift;
238
239	my $path = shift \|\| carp "empty path?";
240
241	print STDERR "## delete: $path\n" if ($self->{'debug'});
242
243	my $rev = $self->{'meta_db'}->{"R$path"};
244	if ($rev) {
245	$self->{'_deleted'}->{$path} = $rev;
246	$self->{'_deleted_counter'}++;
247	print STDERR "## deleted revision $rev, counter: ",$self->{'_deleted_counter'}++,"\n" if ($self->{'debug'});
248	return $rev;
249	}
250
251	return undef;
252	}
253
254
255	=head2 search
256
257	Search your index using any valid SWISH++ query.
258
259	my @results = $i->search("swish query");
260
261	Returns array with elements like this:
262
263	{
264	rank => 10, # rank of result
265	path => 'path to result', # path to result
266	size => 999, # size in bytes
267	title => 'title of result' # title meta property
268	}
269
270	=cut
271
272	sub search {
273	my $self = shift;
274
275	my $query = shift \|\| return;
276
277	$self->finish_update;
278	$self->_tie_meta_db(DB_RDONLY);
279
280	my @results;
281
282	# escape double quotes in query for shell
283	$query =~ s/"/\\"/g;
284
285	my $open_cmd = $self->{'search'} .
286	' -i ' . $self->{'index_dir'}.'/index' .
287	' "' . $query . '"'.
288	' \|';
289	print STDERR "## search: $open_cmd\n" if ($self->{'debug'});
290
291	my %r;
292
293	open(SEARCH, $open_cmd) \|\| confess "can't start $open_cmd: $!";
294	my $l;
295	while($l = <SEARCH>) {
296	next if ($l =~ /^#/);
297	chomp($l);
298	print STDERR "## $l\n" if ($self->{'debug'});
299	my ($rank,$path,$size,$rev,$title) = split(/ /,$l,5);
300	$path =~ s#^\./##; # strip from path
301
302	# get current revision
303	$r{$path} = $self->{'meta_db'}->{"R$path"};
304
305	# skip if old revision
306	next if ($r{$path} > $rev);
307
308	print STDERR "## current revision $rev\n" if ($self->{'debug'});
309
310	push @results, {
311	rank => $rank,
312	path => $path,
313	size => $size,
314	title => $title,
315	} unless ($self->{'_deleted'}->{$path} && $self->{'_deleted'}->{$path} <= $rev);
316	}
317
318	close(SEARCH) \|\| confess "can't close search";
319
320	#print STDERR "## results: ",Dump(@results),"\n" if ($self->{'debug'});
321
322	return @results;
323	}
324
325	=head2 property
326
327	Return stored meta property from result or result path.
328
329	print $i->property('path', 'meta name');
330	print $i->property($res->{'path'}, 'meta name');
331	print $i->property('path');
332	print $i->property($res->{'path'});
333
334	Returns one meta property (if meta name is specified) or whole hash with
335	all meta properties.
336
337	=cut
338
339	sub property {
340	my $self = shift;
341
342	my $path = shift \|\| return;
343	my $meta = shift;
344
345	if ($path =~ m/^HASH/) {
346	$path = $path->{'path'} \|\| confess "can't find path in input data";
347	}
348
349	my $val = $self->{'meta_db'}->{"M$path"};
350
351	# FIXME should we die here like swish-e does?
352	return unless ($val);
353
354	$val = thaw($val);
355
356	print STDERR "## property $path $meta: ",(Dump($val) \|\| 'undef'),"\n" if ($self->{'debug'});
357
358	return $val->{$meta} if ($meta);
359
360	return $val;
361	}
362
363	=head2 finish_update
364
365	This method will close index binary and enable search. Searching is not
366	available while indexing is in process.
367
368	$i->finish_update;
369
370	Usually, you don't need to call this method directly. It will be called on
371	DESTROY when $i goes out of scope or when you first call search in session
372	if indexing was started.
373
374	=cut
375
376	sub finish_update {
377	my $self = shift;
378
379	print STDERR "## finish_update\n" if ($self->{'debug'});
380
381	$self->_close_index && $self->_untie_meta_db;
382	}
383
384	sub DESTROY {
385	my $self = shift;
386	$self->finish_update;
387	}
388
389	=head1 PRIVATE METHODS
390
391	Private methods implement internals for creating temporary files needed for
392	SWISH++. You should have no need to call them directly, and they are here
393	just to have documentation.
394
395	=head2 _init_indexer
396
397	Create temporary directory in which files for indexing will be created and
398	start index process.
399
400	my $i->_init_indexer \|\| die "can't start indexer";
401
402	It will also create empty file C<_stopwords_> to disable stop words.
403
404	=cut
405
406	sub _init_indexer {
407	my $self = shift;
408
409	return if ($self->{'_index_fh'});
410
411	my $tmp_dir = tempdir( CLEANUP => 1 ) \|\| confess "can't create temporary directory: $!";
412	$self->{'tmp_dir'} = $tmp_dir;
413
414	chdir $tmp_dir \|\| confess "can't chdir to ".$tmp_dir.": $!";
415
416	print STDERR "## tmp_dir: $tmp_dir\n" if ($self->{'debug'});
417
418	my $opt = "-v " . ($self->{'debug'} \|\| '0');
419
420	my $index_dir = $self->{'index_dir'} \|\| confess "no index_dir?";
421	my $index_file = $index_dir . '/index';
422
423	if (-e $index_file && ! -z $index_file) {
424	$opt .= ' -I ';
425	$self->{'_incremental'} = 1;
426	print STDERR "## using incremental indexing for $index_file\n" if ($self->{'debug'});
427	} else {
428	$self->{'_incremental'} = 0;
429	}
430
431	unless ($self->{'use_stopwrods'}) {
432	open(STOP, '>', "_stopwords_") \|\| carp "can't create empty stopword file, skipping\n";
433	print STOP " ";
434	close(STOP);
435	$opt .= " -s _stopwords_";
436	}
437
438	my $open_cmd = '\| '.$self->{'index'}.' '.$opt.' -e "html:*" -i '.$index_file.' -';
439
440	print STDERR "## init_indexer: $open_cmd\n" if ($self->{'debug'});
441
442	open($self->{'_index_fh'}, $open_cmd) \|\| confess "can't start index with $open_cmd: $!";
443
444	chdir $self->{'cwd'} \|\| confess "can't chdir to ".$self->{'cwd'}.": $!";
445
446	$self->_tie_meta_db(DB_CREATE);
447
448	return $self->{'_index_fh'};
449	}
450
451	=head2 _create_doc
452
453	Create temporary file and pass it's name to SWISH++
454
455	$i->_create_doc(
456	path => 'path/to/store/in/index',
457	title => 'this is title in results',
458	body => 'data to story in body tag',
459	meta => {
460	'meta name' => 'data for this meta',
461	'another' => 'again more data',
462	}
463	);
464
465	=cut
466
467	sub _create_doc {
468	my $self = shift;
469
470	my $arg = {@_};
471
472	# open indexer if needed
473	$self->_init_indexer;
474
475	my $path = $self->{'tmp_dir'} \|\| confess "no tmp_dir?";
476	my $id = $arg->{'path'} \|\| confess "no path?";
477	$path .= "/$id";
478
479	my $rev = $self->{'rev'}++;
480
481	print STDERR "## _create_doc: $path [$rev]\n" if ($self->{'debug'});
482
483	open(TMP, '>', $path) \|\| die "can't create temp file $path: $!";
484
485	print TMP '<html><head>';
486
487	my $body = $arg->{'body'};
488
489	if (defined($body)) {
490	$self->{'meta_db'}->{"B$id"} = $body;
491	} else {
492	$body = '';
493	}
494
495	my $title = $arg->{'title'};
496
497	if ($arg->{'meta'}) {
498	foreach my $name (keys %{$arg->{'meta'}}) {
499	my $content = $arg->{'meta'}->{$name};
500	print TMP qq{<meta name="$name" content="$content">};
501	$body .= " $content" if ($self->{'meta_in_body'});
502	}
503	$arg->{'meta'}->{'title'} = $title;
504	$self->{'meta_db'}->{"M$id"} = freeze($arg->{'meta'});
505	}
506
507	if (defined($title)) {
508	$title = "$rev $title";
509	$body .= " $title" if ($self->{'meta_in_body'});
510	} else {
511	$title = "$rev $id";
512	}
513
514	# dump html
515	print TMP "<title>$title</title></head><body>$body</body></html>";
516
517	close(TMP) \|\| confess "can't close tmp file ".$arg->{'path'}.": $!";
518
519	print { $self->{'_index_fh'} } "$id\n" \|\| confess "can't pass document $id to indexer: $!";
520
521	$self->{'meta_db'}->{"R$id"} = $rev;
522
523	# FIXME this is probably not the right place to update global
524	# maximum revision, but it keeps database in sane state
525	$self->{'meta_db'}->{"Crev"} = $rev;
526	}
527
528	=head2 _close_index
529
530	Close index after indexing.
531
532	$i->_close_index;
533
534	You have to close index before searching.
535
536	=cut
537
538	sub _close_index {
539	my $self = shift;
540
541	$self->_store_deleted;
542
543	return unless ($self->{'_index_fh'});
544
545	print STDERR "## close index\n" if ($self->{'debug'});
546
547	close($self->{'_index_fh'}) \|\| confess "can't close index: $!";
548	undef $self->{'_index_fh'};
549
550	if ($self->{'_incremental'}) {
551	print STDERR "## move new index over old\n" if ($self->{'debug'});
552	rename $self->{'index_dir'}.'/index.new',$self->{'index_dir'}.'/index' \|\| die "can't move new index over old one: $!";
553	}
554
555	return 1;
556	}
557
558	=head2 _tie_meta_db
559
560	Open BerkeleyDB database with meta properties.
561
562	$i->_tie_meta_db(DB_CREATE);
563	$i->_tie_meta_db(DB_RDONLY);
564
565	}
566
567	=cut
568
569	sub _tie_meta_db {
570	my $self = shift;
571
572	my $flags = shift \|\| confess "need DB_CREATE or DB_RDONLY";
573
574	return if ($self->{'_meta_db_flags'} && $self->{'_meta_db_flags'} == $flags);
575
576	print STDERR "## _tie_meta_db($flags)\n" if ($self->{'debug'});
577
578	$self->_untie_meta_db;
579	$self->{'_meta_db_flags'} = $flags;
580
581	my $file = $self->{'index_dir'}.'/meta.db';
582
583	tie %{$self->{'meta_db'}}, "BerkeleyDB::Hash",
584	-Filename => $file,
585	-Flags => $flags
586	or confess "cannot open $file: $! $BerkeleyDB::Error\n" ;
587
588	$self->{'rev'} = $self->{'meta_db'}->{'Crev'} \|\| 0;
589
590	my $delref = $self->{'meta_db'}->{'Cdeleted'};
591	if ($delref) {
592	$self->{'_deleted'} = thaw($delref);
593
594	print "## deleted ",keys %{$self->{'_deleted'}}," records\n" if ($self->{'debug'});
595	} else {
596	$self->{'_deleted'} = {};
597	}
598
599	$self->{'_deleted_counter'} = 0;
600	return 1;
601	}
602
603	=head2 _untie_meta_db
604
605	Close BerkeleyDB database with meta properties.
606
607	$i->_untie_meta_db;
608
609	=cut
610
611	sub _untie_meta_db {
612	my $self = shift;
613
614	return unless ($self->{'meta_db'});
615
616	print STDERR "## _untie_meta_db\n" if ($self->{'debug'});
617	untie %{$self->{'meta_db'}} \|\| confess "can't untie!";
618	undef $self->{'meta_db'};
619	undef $self->{'_meta_db_flags'};
620
621	return 1;
622	}
623
624
625	=head2 _store_deleted
626
627	Save hash of deleted files using L<Storable>.
628
629	$i->_store_deleted;
630
631	=cut
632
633	sub _store_deleted {
634	my $self = shift;
635
636	return if (! $self->{'_deleted_counter'});
637
638	print STDERR "## save deleted ",Dump($self->{'_deleted'}) if ($self->{'debug'});
639
640	my $d = freeze($self->{'_deleted'});
641
642	$self->_tie_meta_db(DB_CREATE);
643
644	$self->{'meta_db'}->{'Cdeleted'} = $d \|\|
645	carp "can't store deleted: $!";
646
647	# reset counter
648	$self->{'_deleted_counter'} = 0;
649	}
650
651	1;
652	__END__
653
654	=head2 EXPORT
655
656	None by default.
657
658	=head1 RELATED
659
660	=head2 Debian
661
662	Debian version of SWISH++ is often old (version 5 at moment of this writing
663	while version 6 is available in source code), so this module by default
664	uses executable names B<index> and B<search> for self-compiled version
665	instead of one from Debian package. See L<new> how to specify Debian
666	default binaries B<index++> and B<search++>.
667
668	=head2 SWISH++
669
670	Aside from very good rewrite in C++, SWISH++ is faster because it uses
671	claver heuristics about which data in input files are words to index and
672	which are not. It's based on English language and might be best choice if
673	you plan to index large amount of long text documents.
674
675	However, if you plan to index all data from structured storage (e.g. RDBMS)
676	you might want B<all> words from data to end up in index as opposed to just
677	those which look like English words. This is especially important if you
678	don't plan to index English texts with this module.
679
680	With distribution build versions of SWISH++ you might have problems with
681	disapearing words. To overcome this problem, you will have to compile and
682	configure SWISH++ yourself (because language characteristics are
683	compilation-time option).
684
685	Compilation of SWISH++ is easy process well described on project's web
686	pages. To see my very relaxed sample configuration take a look at C<swish++>
687	directory included in distribution.
688
689	=head2 SWISH++ config
690
691	C<config.h> located in C<swish++> directory of this distribution is relaxed
692	SWISH++ configuration that will index all words passed to it. This
693	configuration is needed for B<date test> because default configuration
694	doesn't recognize 2004-12-05 as date. Have in mind that your index size
695	might explode.
696
697	=head1 BUGS
698
699	Currently there is no way to specify which meta data will be stored as
700	properties. B<This will be fixed very soon>.
701
702	There is no garbage collection on temporary files created for SWISH++. This
703	means that one run of indexer will take additional disk space for temporary
704	files, which will be removed at end. There should be some way to remove
705	files after they are indexed by SWISH++. However, at this early stage of
706	development it's just not supported yet. Have plenty of disk space!
707
708	=head1 SEE ALSO
709
710	SWISH++ web site L<http://homepage.mac.com/pauljlucas/software/swish/>
711
712	=head1 AUTHOR
713
714	Dobrica Pavlinusic, E<lt>dpavlin@rot13.orgE<gt>
715
716	=head1 COPYRIGHT AND LICENSE
717
718	Copyright (C) 2004 by Dobrica Pavlinusic
719
720	This library is free software; you can redistribute it and/or modify
721	it under the same terms as Perl itself, either Perl version 5.8.4 or,
722	at your option, any later version of Perl 5 you may have available.
723
724
725	=cut