package SWISH::PlusPlus;

use 5.008004;
use strict;
use warnings;

our $VERSION = '0.20';

use Carp;
use File::Temp qw/ tempdir /;
use BerkeleyDB;
use Storable qw(store retrieve freeze thaw);
use YAML;

=head1 NAME

SWISH::PlusPlus - Perl extension for full-text indexer SWISH++ with properties support

=head1 SYNOPSIS

  use SWISH::PlusPlus;

  my $i = SWISH::PlusPlus->new(
        index_dir => '/tmp/foo',
  );
  $i->index_document( 42 => 'meaning of life' );

  my @results = $i->search("meaning");
  print $_->{path}, "\n" for @results;  # path of each hit, here 42

=head1 DESCRIPTION

This is a Perl module for using the SWISH++ indexer by Paul J. Lucas.
SWISH++ is a rewrite of swish-e in C++ which is extremely fast (due to mmap
usage and clever language heuristics), but lacks support for properties
(which this module tries to fix).

The API is somewhere in between C<SWISH::API> and C<Plucene::Simple>. It
should be easy to replace Plucene or swish-e with this module for increased
performance. However, this module is not a drop-in replacement.

=head1 METHODS

=head2 new

Create new instance for index.

  my $i = SWISH::PlusPlus->new(
        index_dir => '/path/to/index',
        index => 'index++',
        search => 'search++',
        debug => 1,
        meta_in_body => 1,
        use_stopwords => 1,
  );

Options are described below:

=over 5

=item C<index_dir>

Path to directory in which index and meta database will be created.

=item C<index>

Full or partial path to the SWISH++ index executable. The default is
B<index>, the name used by a self-compiled version. If you use the Debian
GNU/Linux package, specify B<index++>. See L</Debian>.

=item C<search>

Full or partial path to SWISH++ search executable. By default, it's B<search>.

=item C<debug>

This option (off by default) will produce a lot of debugging output on
C<STDERR> prefixed by C<##>.

=item C<meta_in_body>

This option (off by default) makes the content of meta fields searchable
without naming the field (as if it were part of the document body). This
will somewhat increase index size.

=item C<use_stopwords>

Use built-in SWISH++ stop words. By default, they are disabled.

=back
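
A relative C<index_dir> is resolved against the current working directory.
The following is a minimal sketch of that step using the core modules
C<Cwd> and C<File::Spec> rather than shelling out to C<pwd>. The helper name
C<absolute_index_dir> is hypothetical and is not part of this module.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Spec;

# Hypothetical helper: return an absolute path for index_dir,
# resolving relative paths against $base (default: current directory).
sub absolute_index_dir {
    my ($index_dir, $base) = @_;
    $base = getcwd() unless defined $base;
    return File::Spec->rel2abs($index_dir, $base);
}

print absolute_index_dir('foo', '/tmp'), "\n";
print absolute_index_dir('/var/index', '/tmp'), "\n";
```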

=cut

sub new {
        my $class = shift;
        my $self = {@_};
        bless($self, $class);

        foreach (qw(index_dir)) {
                croak "need $_" unless $self->{$_};
        }

        my $index_dir = $self->{'index_dir'};

        my $cwd;
        chomp($cwd = `pwd`);
        $self->{'cwd'} = $cwd || carp "can't get cwd!";
        
        if ($index_dir !~ m#^/#) {
                $index_dir = "$cwd/$index_dir";
                print STDERR "## full path to index_dir: $index_dir\n" if ($self->{'debug'});
                $self->{'index_dir'} = $index_dir;
        }

        if (! -e $index_dir) {
                mkdir($index_dir) or confess "can't create index_dir $index_dir: $!";
        }

        # default executables
        $self->{'index'} ||= 'index';
        $self->{'search'} ||= 'search';

        print STDERR "## new index_dir: ",$index_dir," index: ",$self->{'index'}, " search: ",$self->{'search'},"\n" if ($self->{'debug'});

        return $self;
}


=head2 check_bin

Checks that the SWISH++ binaries specified in L</new> are available and
verifies their version signature.

  if ($i->check_bin) {
        print "swish++ binaries found\n";
  };

It also sets the property

  $i->{'version'}

which you can examine to see the numeric version (something like C<6.0.4>).
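
As an illustration of the version check, here is a small sketch that
extracts the version from a C<-V> banner. The helper C<swish_version> and
the sample banner string are assumptions made for this example; only the
C<SWISH++> prefix is taken from the check this module performs.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the version from "-V" output, which
# the module expects to start with "SWISH++".
sub swish_version {
    my ($banner) = @_;
    return unless defined $banner;
    chomp $banner;
    return $banner =~ m/^SWISH\+\+\s+(\S.*)$/ ? $1 : undef;
}

# Sample banner is an assumption, modelled on the "6.0.4" mentioned above.
my $v = swish_version("SWISH++ 6.0.4\n");
print defined $v ? "version $v\n" : "not SWISH++\n";
```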

=cut

sub check_bin {
        my $self = shift;

        my $i = `$self->{'index'} -V 2>&1` || confess "can't find '",$self->{'index'},"' binary";
        my $s = `$self->{'search'} -V 2>&1` || confess "can't find '",$self->{'search'},"' binary";

        chomp $i;
        chomp $s;

        confess $self->{'index'}," binary is not SWISH++" unless ($i =~ m/^SWISH\+\+/);
        confess $self->{'search'}," binary is not SWISH++" unless ($s =~ m/^SWISH\+\+/);

        if ($i eq $s) {
                $i =~ s/^SWISH\+\+\s+// || confess "can't strip SWISH++ from version";
                $self->{'version'} = $i;
                return 1;
        } else  {
                carp "version difference: index is $i while search is $s";
                return;
        }

}

=head2 index_document

Quick way to add simple data to index.

  $i->index_document($path, $data);
  $i->index_document(
        42 => 'meaning of life',
        1984 => 'Oh!',
  );

The C<$path> value really is used as a path, so you probably don't want
directory separators (slashes, C</>) in it.

=cut

sub index_document {
        my $self = shift;

        my %doc = @_;

        foreach my $id (keys %doc) {
                $self->_create_doc(
                        path => $id,
                        body => $doc{$id},
                );
        }

        return 1;
}

=head2 add

Add document with meta-data to index.

  $i->add(
        path => 'path/to/document',
        title => 'this is result title',
        meta => {
                description => 'this is description meta tag',
                date => '2004-11-04',
                author => 'Dobrica Pavlinusic',
        },
        body => 'this is text without meta data',
  );

This is a thin wrapper around L</_create_doc>.

=cut

sub add {
        my $self = shift;

        return $self->_create_doc(@_);
}


=head2 delete

Delete document from index.

  $i->delete("document/path");

If deletion is successful, returns the revision of the deleted document,
otherwise undef.

=cut

sub delete {
        my $self = shift;

        my $path = shift;
        unless (defined $path && length $path) {
                carp "empty path?";
                return;
        }

        print STDERR "## delete: $path\n" if ($self->{'debug'});

        my $rev = $self->{'meta_db'}->{"R$path"};
        if ($rev) {
                $self->{'_deleted'}->{$path} = $rev;
                $self->{'_deleted_counter'}++;
                print STDERR "## deleted revision $rev, counter: ",$self->{'_deleted_counter'},"\n" if ($self->{'debug'});
                return $rev;
        }

        return undef;
}


=head2 search

Search your index using any valid SWISH++ query.

  my @results = $i->search("swish query");

Returns array with elements like this:

  {
   rank => 10,                  # rank of result
   path => 'path to result',    # path to result
   size => 999,                 # size in bytes
   title => 'title of result'   # title meta property
  }
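
To show where these fields come from, the following sketch splits one
result line the same way C<search> does. The helper name
C<parse_result_line> and the sample line are assumptions for illustration;
the sketch assumes a line of the form C<rank path size title>, with the
title carrying the revision prefix written by C<_create_doc>.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring how search() splits one result line.
# Assumes "rank path size title", where the title was stored as
# "<revision> <title>".
sub parse_result_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/;    # comment lines carry no result
    my ($rank, $path, $size, $rev, $title) = split(/ /, $line, 5);
    $path =~ s#^\./##;          # strip leading "./"
    return {
        rank  => $rank,
        path  => $path,
        size  => $size,
        rev   => $rev,
        title => $title,
    };
}

my $r = parse_result_line("10 ./42 123 7 meaning of life\n");
print "$r->{path} (rev $r->{rev}): $r->{title}\n";
```

Limiting the split to five fields keeps spaces inside the title intact.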

=cut

sub search {
        my $self = shift;

        my $query = shift || return;

        $self->finish_update;
        $self->_tie_meta_db(DB_RDONLY);

        my @results;

        # escape double quotes in query for shell
        $query =~ s/"/\\"/g;

        my $open_cmd = $self->{'search'} .
                ' -i ' . $self->{'index_dir'}.'/index' .
                ' "' . $query . '"'.
                ' |';
        print STDERR "## search: $open_cmd\n" if ($self->{'debug'});

        my %r;

        open(SEARCH, $open_cmd) || confess "can't start $open_cmd: $!";
        my $l;
        while($l = <SEARCH>) {
                next if ($l =~ /^#/);
                chomp($l);
                print STDERR "## $l\n" if ($self->{'debug'});
                my ($rank,$path,$size,$rev,$title) = split(/ /,$l,5);
                $path =~ s#^\./##; # strip leading ./ from path

                # get current revision
                $r{$path} = $self->{'meta_db'}->{"R$path"};

                # skip if old revision
                next if ($r{$path} > $rev);

                print STDERR "## current revision $rev\n" if ($self->{'debug'});

                push @results, {
                        rank => $rank,
                        path => $path,
                        size => $size,
                        title => $title,
                } unless ($self->{'_deleted'}->{$path} && $self->{'_deleted'}->{$path} <= $rev);
        }

        close(SEARCH) || confess "can't close search";

        #print STDERR "## results: ",Dump(@results),"\n" if ($self->{'debug'});

        return @results;
}

=head2 property

Return stored meta property from result or result path.

  print $i->property('path', 'meta name');
  print $i->property($res->{'path'}, 'meta name');
  print $i->property('path');
  print $i->property($res->{'path'});

Returns one meta property (if meta name is specified) or whole hash with
all meta properties.
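
Properties are kept as a frozen hash under the key C<M> followed by the
path. The sketch below shows that round trip with C<Storable>, using a
plain hash in place of the tied BerkeleyDB database; the helper name
C<lookup_property> is hypothetical.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Plain hash standing in for the tied BerkeleyDB meta database.
my %meta_db;

# Store: meta properties are serialized under the key "M<path>".
my %meta = ( author => 'Dobrica Pavlinusic', date => '2004-11-04' );
$meta_db{'Mpath/to/document'} = freeze(\%meta);

# Fetch: thaw the record, then return one property or the whole hash.
sub lookup_property {
    my ($db, $path, $name) = @_;
    my $frozen = $db->{"M$path"};
    return unless defined $frozen;
    my $val = thaw($frozen);
    return defined $name ? $val->{$name} : $val;
}

print lookup_property(\%meta_db, 'path/to/document', 'date'), "\n";
```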

=cut

sub property {
        my $self = shift;

        my $path = shift || return;
        my $meta = shift;

        if (ref($path) eq 'HASH') {
                $path = $path->{'path'} || confess "can't find path in input data";
        }

        my $val = $self->{'meta_db'}->{"M$path"};

        # FIXME should we die here like swish-e does?
        return unless ($val);

        $val = thaw($val);

        print STDERR "## property $path $meta: ",(Dump($val) || 'undef'),"\n" if ($self->{'debug'});

        return $val->{$meta} if ($meta);

        return $val;
}

=head2 finish_update

This method closes the index binary and enables searching. Searching is not
available while indexing is in progress.

  $i->finish_update;

Usually, you don't need to call this method directly. It is called from
C<DESTROY> when C<$i> goes out of scope, or the first time you call
C<search> in a session in which indexing was started.

=cut

sub finish_update {
        my $self = shift;

        print STDERR "## finish_update\n" if ($self->{'debug'});

        $self->_close_index && $self->_untie_meta_db;
}

sub DESTROY {
        my $self = shift;
        $self->finish_update;
}

=head1 PRIVATE METHODS

Private methods implement the internals for creating the temporary files
needed by SWISH++. You should have no need to call them directly; they are
listed here only for documentation.

=head2 _init_indexer

Creates the temporary directory in which files for indexing will be created
and starts the index process.

  $i->_init_indexer || die "can't start indexer";

Unless C<use_stopwords> is set, it also creates a blank file C<_stopwords_>
to disable stop words.
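
The indexer options depend on the debug level, on whether an index already
exists, and on the stop-word setting. As a sketch, they can be assembled as
a list instead of a shell string. The helper name C<index_command> and its
parameter names are hypothetical; the flags are the ones C<_init_indexer>
passes.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble the indexer arguments as a list.
# The flags (-v, -I, -s, -e, -i, -) are the ones _init_indexer uses.
sub index_command {
    my (%o) = @_;
    my @cmd = ($o{index} || 'index', '-v', $o{debug} || 0);
    push @cmd, '-I' if $o{incremental};    # update an existing index
    push @cmd, '-s', '_stopwords_' unless $o{use_stopwords};
    push @cmd, '-e', 'html:*', '-i', $o{index_file}, '-';
    return @cmd;
}

my @cmd = index_command(index_file => '/tmp/foo/index', incremental => 1);
print join(' ', @cmd), "\n";
```

A list like this can be handed to the list form of a pipe open,
C<open($fh, '|-', @cmd)>, which bypasses the shell and so needs no quoting.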

=cut

sub _init_indexer {
        my $self = shift;

        return if ($self->{'_index_fh'});

        my $tmp_dir = tempdir( CLEANUP => 1 ) || confess "can't create temporary directory: $!";
        $self->{'tmp_dir'} = $tmp_dir;

        chdir($tmp_dir) or confess "can't chdir to ".$tmp_dir.": $!";

        print STDERR "## tmp_dir: $tmp_dir\n" if ($self->{'debug'});

        my $opt = "-v " . ($self->{'debug'} || '0');

        my $index_dir = $self->{'index_dir'} || confess "no index_dir?";
        my $index_file = $index_dir . '/index';

        if (-e $index_file && ! -z $index_file) {
                $opt .= ' -I ';
                $self->{'_incremental'} = 1;
                print STDERR "## using incremental indexing for $index_file\n" if ($self->{'debug'});
        } else {
                $self->{'_incremental'} = 0;
        }

        unless ($self->{'use_stopwords'}) {
                open(STOP, '>', "_stopwords_") || carp "can't create empty stopword file, skipping\n";
                print STOP "  ";
                close(STOP);
                $opt .= " -s _stopwords_";
        }

        my $open_cmd = '| '.$self->{'index'}.' '.$opt.' -e "html:*" -i '.$index_file.' -';

        print STDERR "## init_indexer: $open_cmd\n" if ($self->{'debug'});

        open($self->{'_index_fh'}, $open_cmd) || confess "can't start index with $open_cmd: $!";

        chdir($self->{'cwd'}) or confess "can't chdir to ".$self->{'cwd'}.": $!";

        $self->_tie_meta_db(DB_CREATE);

        return $self->{'_index_fh'};
}

=head2 _create_doc

Creates a temporary file and passes its name to SWISH++.

  $i->_create_doc(
        path => 'path/to/store/in/index',
        title => 'this is title in results',
        body => 'data to store in body tag',
        meta => {
                'meta name' => 'data for this meta',
                'another' => 'again more data',
        }
  );
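
Each document is handed to SWISH++ as a small HTML file. The following
sketch builds that HTML as a string, following the same steps as
C<_create_doc>: meta tags first, then the title prefixed with the revision.
The helper name C<build_doc_html> and its C<rev> argument are hypothetical;
in the module the revision comes from the meta database.

```perl
use strict;
use warnings;

# Hypothetical helper: build the HTML that stands in for one document.
sub build_doc_html {
    my (%arg) = @_;
    my $rev  = $arg{rev};
    my $body = defined $arg{body} ? $arg{body} : '';
    my $html = '<html><head>';
    foreach my $name (sort keys %{ $arg{meta} || {} }) {
        my $content = $arg{meta}{$name};
        $html .= qq{<meta name="$name" content="$content">};
        # With meta_in_body, meta content is searchable as body text.
        $body .= " $content" if $arg{meta_in_body};
    }
    # The revision is prefixed to the title so search can recover it.
    my $title = defined $arg{title} ? "$rev $arg{title}" : "$rev $arg{path}";
    $body .= " $title" if $arg{meta_in_body} && defined $arg{title};
    $html .= "<title>$title</title></head><body>$body</body></html>";
    return $html;
}

print build_doc_html(
    rev   => 3,
    path  => '42',
    title => 'meaning of life',
    meta  => { author => 'Dobrica Pavlinusic' },
    body  => 'forty-two',
    meta_in_body => 1,
), "\n";
```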

=cut

sub _create_doc {
        my $self = shift;

        my $arg = {@_};

        # open indexer if needed
        $self->_init_indexer;

        my $path = $self->{'tmp_dir'} || confess "no tmp_dir?";
        my $id = $arg->{'path'} || confess "no path?";
        $path .= "/$id";

        my $rev = $self->{'rev'}++;

        print STDERR "## _create_doc: $path [$rev]\n" if ($self->{'debug'});

        open(TMP, '>', $path) || die "can't create temp file $path: $!";

        print TMP '<html><head>';

        my $body = $arg->{'body'};

        if (defined($body)) {
                $self->{'meta_db'}->{"B$id"} = $body;
        } else {
                $body = '';
        }

        my $title = $arg->{'title'};

        if ($arg->{'meta'}) {
                foreach my $name (keys %{$arg->{'meta'}}) {
                        my $content = $arg->{'meta'}->{$name};
                        print TMP qq{<meta name="$name" content="$content">};
                        $body .= " $content" if ($self->{'meta_in_body'});
                }
                $arg->{'meta'}->{'title'} = $title;
                $self->{'meta_db'}->{"M$id"} = freeze($arg->{'meta'});
        }

        if (defined($title)) {
                $title = "$rev $title";
                $body .= " $title" if ($self->{'meta_in_body'});
        } else {
                $title = "$rev $id";
        }

        # dump html
        print TMP "<title>$title</title></head><body>$body</body></html>";
        
        close(TMP) || confess "can't close tmp file ".$arg->{'path'}.": $!";

        print { $self->{'_index_fh'} } "$id\n" or confess "can't pass document $id to indexer: $!";
        
        $self->{'meta_db'}->{"R$id"} = $rev;

        # FIXME this is probably not the right place to update global
        # maximum revision, but it keeps database in sane state
        $self->{'meta_db'}->{"Crev"} = $rev;
}

=head2 _close_index

Close index after indexing.

  $i->_close_index;

You have to close index before searching.

=cut

sub _close_index {
        my $self = shift;

        $self->_store_deleted;

        return unless ($self->{'_index_fh'});

        print STDERR "## close index\n" if ($self->{'debug'});

        close($self->{'_index_fh'}) || confess "can't close index: $!";
        undef $self->{'_index_fh'};

        if ($self->{'_incremental'}) {
                print STDERR "## move new index over old\n" if ($self->{'debug'});
                rename($self->{'index_dir'}.'/index.new', $self->{'index_dir'}.'/index') or die "can't move new index over old one: $!";
        }

        return 1;
}

=head2 _tie_meta_db

Open BerkeleyDB database with meta properties.

  $i->_tie_meta_db(DB_CREATE);
  $i->_tie_meta_db(DB_RDONLY);

=cut

sub _tie_meta_db  {
        my $self = shift;

        my $flags = shift || confess "need DB_CREATE or DB_RDONLY";

        return if ($self->{'_meta_db_flags'} && $self->{'_meta_db_flags'} == $flags);

        print STDERR "## _tie_meta_db($flags)\n" if ($self->{'debug'});

        $self->_untie_meta_db;
        $self->{'_meta_db_flags'} = $flags;

        my $file = $self->{'index_dir'}.'/meta.db';

        tie %{$self->{'meta_db'}}, "BerkeleyDB::Hash",
                -Filename => $file,
                -Flags    => $flags
        or confess "cannot open $file: $! $BerkeleyDB::Error\n" ;

        $self->{'rev'} = $self->{'meta_db'}->{'Crev'} || 0;

        my $delref = $self->{'meta_db'}->{'Cdeleted'};
        if ($delref) {
                $self->{'_deleted'} = thaw($delref);

                print STDERR "## deleted ",scalar(keys %{$self->{'_deleted'}})," records\n" if ($self->{'debug'});
        } else {
                $self->{'_deleted'} = {};
        }

        $self->{'_deleted_counter'} = 0;
        return 1;
}

=head2 _untie_meta_db

Close BerkeleyDB database with meta properties.

  $i->_untie_meta_db;

=cut

sub _untie_meta_db {
        my $self = shift;

        return unless ($self->{'meta_db'});

        print STDERR "## _untie_meta_db\n" if ($self->{'debug'});
        untie %{$self->{'meta_db'}} || confess "can't untie!";
        undef $self->{'meta_db'};
        undef $self->{'_meta_db_flags'};

        return 1;
}


=head2 _store_deleted

Save hash of deleted files using L<Storable>.

  $i->_store_deleted;

=cut

sub _store_deleted {
        my $self = shift;

        return if (! $self->{'_deleted_counter'});

        print STDERR "## save deleted ",Dump($self->{'_deleted'}) if ($self->{'debug'});

        my $d = freeze($self->{'_deleted'});

        $self->_tie_meta_db(DB_CREATE);

        $self->{'meta_db'}->{'Cdeleted'} = $d ||
                carp "can't store deleted: $!";

        # reset counter
        $self->{'_deleted_counter'} = 0;
}

1;
__END__

=head2 EXPORT

None by default.

=head1 RELATED

=head2 Debian

The Debian version of SWISH++ is often old (version 5 at the time of
writing, while version 6 is available as source code), so by default this
module uses the executable names B<index> and B<search> of a self-compiled
version instead of those from the Debian package. See L</new> for how to
specify the Debian binaries B<index++> and B<search++>.

=head2 SWISH++

Aside from being a very good rewrite in C++, SWISH++ is faster because it
uses clever heuristics to decide which data in input files are words to
index and which are not. The heuristics are based on the English language,
so it might be the best choice if you plan to index a large amount of long
text documents.

However, if you plan to index all data from structured storage (e.g. an
RDBMS), you might want B<all> words from the data to end up in the index, as
opposed to just those which look like English words. This is especially
important if the texts you index are not in English.

With distribution-built versions of SWISH++ you might have problems with
disappearing words. To overcome this, you will have to compile and
configure SWISH++ yourself (because the language characteristics are a
compile-time option).

Compiling SWISH++ is an easy process, well described on the project's web
pages. To see my very relaxed sample configuration, take a look at the
C<swish++> directory included in the distribution.

=head2 SWISH++ config

C<config.h>, located in the C<swish++> directory of this distribution, is a
relaxed SWISH++ configuration that will index all words passed to it. This
configuration is needed for the B<date test>, because the default
configuration doesn't recognize 2004-12-05 as a date. Keep in mind that
your index size might explode.

=head1 BUGS

Currently there is no way to specify which meta data will be stored as
properties. B<This will be fixed very soon>.

There is no garbage collection of the temporary files created for SWISH++.
This means that one run of the indexer will take additional disk space for
temporary files, which are removed only at the end. There should be some way
to remove files after they have been indexed by SWISH++, but at this early
stage of development it is not supported yet. Have plenty of disk space!

=head1 SEE ALSO

SWISH++ web site L<http://homepage.mac.com/pauljlucas/software/swish/>

=head1 AUTHOR

Dobrica Pavlinusic, E<lt>dpavlin@rot13.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2004 by Dobrica Pavlinusic

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.4 or,
at your option, any later version of Perl 5 you may have available.


=cut
1	dpavlin	1	package SWISH::PlusPlus;
2
3			use 5.008004;
4			use strict;
5			use warnings;
6
7	dpavlin	22	our $VERSION = '0.20';
8	dpavlin	1
9			use Carp;
10	dpavlin	4	use File::Temp qw/ tempdir /;
11	dpavlin	16	use BerkeleyDB;
12	dpavlin	22	use Storable qw(store retrieve freeze thaw);
13			use YAML;
14	dpavlin	1
15			=head1 NAME
16
17	dpavlin	21	SWISH::PlusPlus - Perl extension for full-text indexer SWISH++ with properties support
18	dpavlin	1
19			=head1 SYNOPSIS
20
21			use SWISH::PlusPlus;
22
23	dpavlin	22	my $i = new SWISH::PlusPlus(
24			index_dir => '/tmp/foo',
25			);
26			$i->add( 42 => 'meaning of life' );
27
28			print $i->search("meaning"); # returns 42
29
30	dpavlin	1	=head1 DESCRIPTION
31
32			This is perl module to use SWISH++ indexer by Paul J. Lucas. SWISH++ is
33	dpavlin	21	rewrite of swish-e in C++ which is extremely fast (due to mmap usage and
34			clever language heuristics), but without support for properties (which this
35			module tries to fix).
36	dpavlin	1
37	dpavlin	21	Implementation of API is something in-between C<SWISH::API> and
38			C<Plucene::Simple>. It should be easy to replace Plucene or swish-e with
39			this module for increased performance. However, this module is not plug-in
40			replacement.
41	dpavlin	3
42	dpavlin	1	=head1 METHODS
43
44	dpavlin	10	=head2 new
45	dpavlin	1
46	dpavlin	21	Create new instance for index.
47	dpavlin	1
48	dpavlin	10	my $i = SWISH::PlusPlus->new(
49	dpavlin	3	index_dir => '/path/to/index',
50			index => 'index++',
51			search => 'search++',
52	dpavlin	8	debug => 1,
53	dpavlin	9	meta_in_body => 1,
54			use_stopwords => 1,
55	dpavlin	1	);
56
57	dpavlin	21	Options are described below:
58	dpavlin	1
59			=over 5
60
61	dpavlin	3	=item C<index_dir>
62
63	dpavlin	21	Path to directory in which index and meta database will be created.
64	dpavlin	3
65	dpavlin	1	=item C<index>
66
67	dpavlin	3	Full or partial path to SWISH++ index executable. By default, it's B<index>
68			for self-compiled version. If you use Debian GNU/Linux package specify
69			B<index++>. See C<Debian>.
70	dpavlin	1
71	dpavlin	3	=item C<search>
72
73			Full or partial path to SWISH++ search executable. By default, it's B<search>.
74
75	dpavlin	8	=item C<debug>
76
77			This option (off by default) will produce a lot of debugging output on
78			C<STDERR> prefixed by C<##>.
79
80	dpavlin	9	=item C<meta_in_body>
81
82			This option (off by default) enables to search content of meta fields
83	dpavlin	21	without specifying them (like they are in body of document). This will
84			somewhat increase index size.
85	dpavlin	9
86			=item C<use_stopwords>
87
88			Use built-in SWISH++ stop words. By default, they are disabled.
89
90	dpavlin	1	=back
91
92			=cut
93
94	dpavlin	10	sub new {
95	dpavlin	1	my $class = shift;
96			my $self = {@_};
97			bless($self, $class);
98
99	dpavlin	3	foreach (qw(index_dir)) {
100	dpavlin	1	croak "need $_" unless $self->{$_};
101			}
102
103	dpavlin	13	my $index_dir = $self->{'index_dir'};
104
105	dpavlin	14	my $cwd;
106			chomp($cwd = `pwd`);
107			$self->{'cwd'} = $cwd \|\| carp "can't get cwd!";
108
109	dpavlin	13	if ($index_dir !~ m#^/#) {
110			$index_dir = "$cwd/$index_dir";
111			print STDERR "## full path to index_dir: $index_dir\n" if ($self->{'debug'});
112			$self->{'index_dir'} = $index_dir;
113	dpavlin	1	}
114
115	dpavlin	13	if (! -e $index_dir) {
116			mkdir $index_dir \|\| confess "can't create index ",$self->{'index'},": $!";
117			}
118
119	dpavlin	3	# default executables
120			$self->{'index'} \|\|= 'index';
121			$self->{'search'} \|\|= 'search';
122
123	dpavlin	13	print STDERR "## new index_dir: ",$index_dir," index: ",$self->{'index'}, " search: ",$self->{'search'},"\n" if ($self->{'debug'});
124	dpavlin	8
125	dpavlin	1	$self ? return $self : return undef;
126			}
127
128
129	dpavlin	3	=head2 check_bin
130
131	dpavlin	21	Check if SWISH++ binaries specified in L<new> are available and verify
132	dpavlin	3	version signature.
133
134			if ($i->check_bin) {
135			print "swish++ binaries found\n";
136			};
137
138			It will also setup property
139
140			$i->{'version'}
141
142	dpavlin	21	which you can examined to see numeric version (something like C<6.0.4>).
143	dpavlin	3
144			=cut
145
146			sub check_bin {
147			my $self = shift;
148
149			my $i = `$self->{'index'} -V 2>&1` \|\| confess "can't find '",$self->{'index'},"' binary";
150			my $s = `$self->{'search'} -V 2>&1` \|\| confess "can't find '",$self->{'search'},"' binary";
151
152			chomp $i;
153			chomp $s;
154
155			confess $self->{'index'}," binary is not SWISH++" unless ($i =~ m/^SWISH\+\+/);
156			confess $self->{'search'}," binary is not SWISH++" unless ($s =~ m/^SWISH\+\+/);
157
158			if ($i eq $s) {
159	dpavlin	14	$i =~ s/^SWISH\+\+\s+// \|\| confess "can't strip SWISH++ from version";
160	dpavlin	3	$self->{'version'} = $i;
161			return 1;
162			} else {
163			carp "version difference: index is $i while search is $s";
164			return;
165			}
166
167			}
168
169	dpavlin	4	=head2 index_document
170
171			Quick way to add simple data to index.
172
173	dpavlin	21	$i->index_document($path, $data);
174	dpavlin	22	$i->index_document(
175			42 => 'meaning of life',
176			1984 => 'Oh!',
177			);
178	dpavlin	4
179	dpavlin	21	C<$path> value is really path, so you don't want to use directory
180			separators (slashes, /) in it probably.
181
182	dpavlin	4	=cut
183
184			sub index_document {
185			my $self = shift;
186
187			my %doc = @_;
188
189			foreach my $id (keys %doc) {
190			$self->_create_doc(
191			path => $id,
192			body => $doc{$id},
193			);
194			}
195
196			return 1;
197			}
198
199	dpavlin	9	=head2 add
200
201	dpavlin	21	Add document with meta-data to index.
202	dpavlin	9
203			$i->add(
204			path => 'path/to/document',
205			title => 'this is result title',
206			meta => {
207			description => 'this is description meta tag',
208			date => '2004-11-04',
209			author => 'Dobrica Pavlinusic',
210			}
211			body => 'this is text without meta data',
212			);
213
214			This is thin wrapper round L<_create_doc>.
215
216			=cut
217
218			sub add {
219			my $self = shift;
220
221	dpavlin	22	return $self->_create_doc(@_);
222			}
223	dpavlin	9
224	dpavlin	22
225			=head2 delete
226
227			Delete document from index.
228
229			$i->delete("document/path");
230
231			If deletion is succesfull returns revision of deleted document, otherwise
232			undef.
233
234			=cut
235
236			sub delete {
237			my $self = shift;
238
239			my $path = shift \|\| carp "empty path?";
240
241			print STDERR "## delete: $path\n" if ($self->{'debug'});
242
243			my $rev = $self->{'meta_db'}->{"R$path"};
244			if ($rev) {
245			$self->{'_deleted'}->{$path} = $rev;
246			$self->{'_deleted_counter'}++;
247			print STDERR "## deleted revision $rev, counter: ",$self->{'_deleted_counter'}++,"\n" if ($self->{'debug'});
248			return $rev;
249			}
250
251			return undef;
252	dpavlin	9	}
253	dpavlin	21
254	dpavlin	22
255	dpavlin	8	=head2 search
256
257	dpavlin	21	Search your index using any valid SWISH++ query.
258	dpavlin	8
259	dpavlin	21	my @results = $i->search("swish query");
260	dpavlin	8
261	dpavlin	21	Returns array with elements like this:
262	dpavlin	8
263	dpavlin	21	{
264			rank => 10, # rank of result
265			path => 'path to result', # path to result
266			size => 999, # size in bytes
267			title => 'title of result' # title meta property
268			}
269
270	dpavlin	8	=cut
271
272			sub search {
273			my $self = shift;
274
275			my $query = shift \|\| return;
276
277	dpavlin	14	$self->finish_update;
278	dpavlin	16	$self->_tie_meta_db(DB_RDONLY);
279	dpavlin	8
280			my @results;
281
282			# escape double quotes in query for shell
283			$query =~ s/"/\\"/g;
284
285	dpavlin	16	my $open_cmd = $self->{'search'} .
286			' -i ' . $self->{'index_dir'}.'/index' .
287			' "' . $query . '"'.
288			' \|';
289			print STDERR "## search: $open_cmd\n" if ($self->{'debug'});
290	dpavlin	8
291	dpavlin	22	my %r;
292
293	dpavlin	10	open(SEARCH, $open_cmd) \|\| confess "can't start $open_cmd: $!";
294	dpavlin	16	my $l;
295			while($l = <SEARCH>) {
296			next if ($l =~ /^#/);
297			chomp($l);
298			print STDERR "## $l\n" if ($self->{'debug'});
299	dpavlin	22	my ($rank,$path,$size,$rev,$title) = split(/ /,$l,5);
300	dpavlin	16	$path =~ s#^\./##; # strip from path
301	dpavlin	22
302			# get current revision
303			$r{$path} = $self->{'meta_db'}->{"R$path"};
304
305			# skip if old revision
306			next if ($r{$path} > $rev);
307
308			print STDERR "## current revision $rev\n" if ($self->{'debug'});
309
310	dpavlin	8	push @results, {
311			rank => $rank,
312			path => $path,
313			size => $size,
314			title => $title,
315	dpavlin	22	} unless ($self->{'_deleted'}->{$path} && $self->{'_deleted'}->{$path} <= $rev);
316	dpavlin	8	}
317
318			close(SEARCH) \|\| confess "can't close search";
319
320			#print STDERR "## results: ",Dump(@results),"\n" if ($self->{'debug'});
321
322			return @results;
323			}
324
325	dpavlin	16	=head2 property
326
327			Return stored meta property from result or result path.
328
329	dpavlin	22	print $i->property('path', 'meta name');
330			print $i->property($res->{'path'}, 'meta name');
331			print $i->property('path');
332			print $i->property($res->{'path'});
333	dpavlin	16
334	dpavlin	22	Returns one meta property (if meta name is specified) or whole hash with
335			all meta properties.
336
337	dpavlin	16	=cut
338
339			sub property {
340			my $self = shift;
341
342	dpavlin	22	my $path = shift \|\| return;
343			my $meta = shift;
344	dpavlin	16
345			if ($path =~ m/^HASH/) {
346			$path = $path->{'path'} \|\| confess "can't find path in input data";
347			}
348
349	dpavlin	22	my $val = $self->{'meta_db'}->{"M$path"};
350	dpavlin	16
351	dpavlin	22	# FIXME should we die here like swish-e does?
352			return unless ($val);
353
354			$val = thaw($val);
355
356			print STDERR "## property $path $meta: ",(Dump($val) \|\| 'undef'),"\n" if ($self->{'debug'});
357
358			return $val->{$meta} if ($meta);
359
360	dpavlin	16	return $val;
361			}
362
363	dpavlin	13	=head2 finish_update
364
365	dpavlin	21	This method will close index binary and enable search. Searching is not
366			available while indexing is in process.
367	dpavlin	13
368			$i->finish_update;
369
370	dpavlin	21	Usually, you don't need to call this method directly. It will be called on
371			DESTROY when $i goes out of scope or when you first call search in session
372			if indexing was started.
373	dpavlin	13
374			=cut
375
376			sub finish_update {
377			my $self = shift;
378
379	dpavlin	14	print STDERR "## finish_update\n" if ($self->{'debug'});
380
381	dpavlin	16	$self->_close_index && $self->_untie_meta_db;
382	dpavlin	13	}
383
384			sub DESTROY {
385			my $self = shift;
386			$self->finish_update;
387			}
388
389	dpavlin	4	=head1 PRIVATE METHODS
390
391	dpavlin	21	Private methods implement internals for creating temporary files needed for
392			SWISH++. You should have no need to call them directly, and they are here
393	dpavlin	4	just to have documentation.
394
395	dpavlin	9	=head2 _init_indexer
396	dpavlin	4
397			Create temporary directory in which files for indexing will be created and
398			start index process.
399
400	dpavlin	9	my $i->_init_indexer \|\| die "can't start indexer";
401	dpavlin	4
402	dpavlin	9	It will also create empty file C<_stopwords_> to disable stop words.

=cut

sub _init_indexer {
	my $self = shift;

	return if ($self->{'_index_fh'});

	my $tmp_dir = tempdir( CLEANUP => 1 ) || confess "can't create temporary directory: $!";
	$self->{'tmp_dir'} = $tmp_dir;

	chdir $tmp_dir || confess "can't chdir to ".$tmp_dir.": $!";

	print STDERR "## tmp_dir: $tmp_dir\n" if ($self->{'debug'});

	my $opt = "-v " . ($self->{'debug'} || '0');

	my $index_dir = $self->{'index_dir'} || confess "no index_dir?";
	my $index_file = $index_dir . '/index';

	if (-e $index_file && ! -z $index_file) {
		$opt .= ' -I ';
		$self->{'_incremental'} = 1;
		print STDERR "## using incremental indexing for $index_file\n" if ($self->{'debug'});
	} else {
		$self->{'_incremental'} = 0;
	}

	# option name is use_stopwords, as documented in new()
	unless ($self->{'use_stopwords'}) {
		open(STOP, '>', "_stopwords_") || carp "can't create empty stopword file, skipping\n";
		print STOP " ";
		close(STOP);
		$opt .= " -s _stopwords_";
	}

	my $open_cmd = '| '.$self->{'index'}.' '.$opt.' -e "html:*" -i '.$index_file.' -';

	print STDERR "## init_indexer: $open_cmd\n" if ($self->{'debug'});

	open($self->{'_index_fh'}, $open_cmd) || confess "can't start index with $open_cmd: $!";

	chdir $self->{'cwd'} || confess "can't chdir to ".$self->{'cwd'}.": $!";

	$self->_tie_meta_db(DB_CREATE);

	return $self->{'_index_fh'};
}

=head2 _create_doc

Create temporary file and pass its name to SWISH++.

  $i->_create_doc(
      path => 'path/to/store/in/index',
      title => 'this is title in results',
      body => 'data to store in body tag',
      meta => {
          'meta name' => 'data for this meta',
          'another' => 'again more data',
      }
  );
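
The temporary file is a small HTML document. A minimal sketch of how such
a document is assembled is shown below; the argument values and the
revision number are made up for illustration, and the real method writes
to a file instead of building a string.

  use strict;
  use warnings;

  my %arg = (
      path  => 'doc/1',                  # hypothetical document path
      title => 'Example',
      body  => 'some text',
      meta  => { author => 'someone' },
  );
  my $rev = 1;                           # assumed revision counter value

  my $html = '<html><head>';
  foreach my $name ( sort keys %{ $arg{meta} } ) {
      # each meta field becomes a <meta> tag in the head
      $html .= qq{<meta name="$name" content="$arg{meta}{$name}">};
  }
  # title is prefixed with the revision number
  $html .= "<title>$rev $arg{title}</title></head>";
  $html .= "<body>$arg{body}</body></html>";

  print "$html\n";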

=cut

sub _create_doc {
	my $self = shift;

	my $arg = {@_};

	# open indexer if needed
	$self->_init_indexer;

	my $path = $self->{'tmp_dir'} || confess "no tmp_dir?";
	my $id = $arg->{'path'} || confess "no path?";
	$path .= "/$id";

	my $rev = $self->{'rev'}++;

	print STDERR "## _create_doc: $path [$rev]\n" if ($self->{'debug'});

	open(TMP, '>', $path) || die "can't create temp file $path: $!";

	print TMP '<html><head>';

	my $body = $arg->{'body'};

	if (defined($body)) {
		$self->{'meta_db'}->{"B$id"} = $body;
	} else {
		$body = '';
	}

	my $title = $arg->{'title'};

	if ($arg->{'meta'}) {
		foreach my $name (keys %{$arg->{'meta'}}) {
			my $content = $arg->{'meta'}->{$name};
			print TMP qq{<meta name="$name" content="$content">};
			$body .= " $content" if ($self->{'meta_in_body'});
		}
		$arg->{'meta'}->{'title'} = $title;
		$self->{'meta_db'}->{"M$id"} = freeze($arg->{'meta'});
	}

	if (defined($title)) {
		$title = "$rev $title";
		$body .= " $title" if ($self->{'meta_in_body'});
	} else {
		$title = "$rev $id";
	}

	# dump html
	print TMP "<title>$title</title></head><body>$body</body></html>";

	close(TMP) || confess "can't close tmp file ".$arg->{'path'}.": $!";

	# "or" instead of "||": print is a list operator, so "||" would bind
	# to the string argument and the error check would never run
	print { $self->{'_index_fh'} } "$id\n" or confess "can't pass document $id to indexer: $!";

	$self->{'meta_db'}->{"R$id"} = $rev;

	# FIXME this is probably not the right place to update global
	# maximum revision, but it keeps database in sane state
	$self->{'meta_db'}->{"Crev"} = $rev;
}

=head2 _close_index

Close index after indexing.

  $i->_close_index;

You have to close the index before searching.

=cut

sub _close_index {
	my $self = shift;

	$self->_store_deleted;

	return unless ($self->{'_index_fh'});

	print STDERR "## close index\n" if ($self->{'debug'});

	close($self->{'_index_fh'}) || confess "can't close index: $!";
	undef $self->{'_index_fh'};

	if ($self->{'_incremental'}) {
		print STDERR "## move new index over old\n" if ($self->{'debug'});
		# "or" instead of "||": rename is a list operator, so "||" would
		# bind to the second argument and the error check would never run
		rename $self->{'index_dir'}.'/index.new', $self->{'index_dir'}.'/index' or die "can't move new index over old one: $!";
	}

	return 1;
}

=head2 _tie_meta_db

Open BerkeleyDB database with meta properties.

  $i->_tie_meta_db(DB_CREATE);
  $i->_tie_meta_db(DB_RDONLY);
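
The keys in this database carry a one-letter prefix, as used by the other
private methods in this module. The sketch below shows that layout with a
plain hash standing in for the tied BerkeleyDB hash; the document path and
values are made up for illustration.

  use strict;
  use warnings;
  use Storable qw(freeze thaw);

  my %meta_db;                 # plain hash standing in for the tied one
  my $id = 'doc/1';            # hypothetical document path

  $meta_db{"B$id"} = 'body text';                       # B: document body
  $meta_db{"M$id"} = freeze( { title => 'Example' } );  # M: frozen meta hash
  $meta_db{"R$id"} = 3;                                 # R: document revision
  $meta_db{'Crev'} = 3;                                 # C: global values

  # meta data has to be thawed before use
  my $meta = thaw( $meta_db{"M$id"} );
  print $meta->{title}, "\n";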

=cut

sub _tie_meta_db {
	my $self = shift;

	my $flags = shift || confess "need DB_CREATE or DB_RDONLY";

	return if ($self->{'_meta_db_flags'} && $self->{'_meta_db_flags'} == $flags);

	print STDERR "## _tie_meta_db($flags)\n" if ($self->{'debug'});

	$self->_untie_meta_db;
	$self->{'_meta_db_flags'} = $flags;

	my $file = $self->{'index_dir'}.'/meta.db';

	tie %{$self->{'meta_db'}}, "BerkeleyDB::Hash",
		-Filename => $file,
		-Flags => $flags
	or confess "cannot open $file: $! $BerkeleyDB::Error\n" ;

	$self->{'rev'} = $self->{'meta_db'}->{'Crev'} || 0;

	my $delref = $self->{'meta_db'}->{'Cdeleted'};
	if ($delref) {
		$self->{'_deleted'} = thaw($delref);

		# debug output goes to STDERR like everywhere else; report the count
		print STDERR "## deleted ", scalar(keys %{$self->{'_deleted'}}), " records\n" if ($self->{'debug'});
	} else {
		$self->{'_deleted'} = {};
	}

	$self->{'_deleted_counter'} = 0;
	return 1;
}

=head2 _untie_meta_db

Close BerkeleyDB database with meta properties.

  $i->_untie_meta_db;

=cut

sub _untie_meta_db {
	my $self = shift;

	return unless ($self->{'meta_db'});

	print STDERR "## _untie_meta_db\n" if ($self->{'debug'});
	untie %{$self->{'meta_db'}} || confess "can't untie!";
	undef $self->{'meta_db'};
	undef $self->{'_meta_db_flags'};

	return 1;
}


=head2 _store_deleted

Save hash of deleted files using L<Storable>.

  $i->_store_deleted;
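
The hash is written only when something was deleted since the last save.
A minimal sketch of that idea follows, assuming a plain hash in place of
the tied database; the contents of C<_deleted> are made up for
illustration.

  use strict;
  use warnings;
  use Storable qw(freeze thaw);

  my %meta_db;                 # plain hash standing in for the tied one
  my $self = {
      _deleted         => { 'doc/7' => 1 },   # hypothetical deleted path
      _deleted_counter => 1,                  # changes since last save
  };

  if ( $self->{_deleted_counter} ) {          # nothing to do when zero
      $meta_db{'Cdeleted'} = freeze( $self->{_deleted} );
      $self->{_deleted_counter} = 0;          # reset counter
  }

  # reading it back gives a hash reference again
  my $restored = thaw( $meta_db{'Cdeleted'} );
  print join( ',', sort keys %$restored ), "\n";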

=cut

sub _store_deleted {
	my $self = shift;

	return if (! $self->{'_deleted_counter'});

	print STDERR "## save deleted ",Dump($self->{'_deleted'}) if ($self->{'debug'});

	my $d = freeze($self->{'_deleted'});

	$self->_tie_meta_db(DB_CREATE);

	$self->{'meta_db'}->{'Cdeleted'} = $d ||
		carp "can't store deleted: $!";

	# reset counter
	$self->{'_deleted_counter'} = 0;
}

1;
__END__

=head2 EXPORT

None by default.

=head1 RELATED

=head2 Debian

The Debian version of SWISH++ is often old (version 5 at the moment of this
writing, while version 6 is available as source code), so this module by
default uses executable names B<index> and B<search> for a self-compiled
version instead of the ones from the Debian package. See L</new> for how to
specify the Debian default binaries B<index++> and B<search++>.

=head2 SWISH++

Aside from being a very good rewrite in C++, SWISH++ is faster because it
uses clever heuristics to decide which data in input files are words to
index and which are not. These heuristics are based on the English language,
so SWISH++ might be the best choice if you plan to index a large number of
long text documents.

However, if you plan to index all data from structured storage (e.g. RDBMS)
you might want B<all> words from the data to end up in the index, as opposed
to just those which look like English words. This is especially important if
you don't plan to index English texts with this module.

With distribution-built versions of SWISH++ you might have problems with
disappearing words. To overcome this problem, you will have to compile and
configure SWISH++ yourself (because language characteristics are a
compile-time option).

Compiling SWISH++ is an easy process, well described on the project's web
pages. To see my very relaxed sample configuration take a look at the
C<swish++> directory included in the distribution.

=head2 SWISH++ config

C<config.h> located in the C<swish++> directory of this distribution is a
relaxed SWISH++ configuration that will index all words passed to it. This
configuration is needed for the B<date test> because the default
configuration doesn't recognize 2004-12-05 as a date. Keep in mind that your
index size might explode.

=head1 BUGS

Currently there is no way to specify which meta data will be stored as
properties. B<This will be fixed very soon>.

There is no garbage collection of temporary files created for SWISH++. This
means that one run of the indexer will take additional disk space for
temporary files, which are removed only at the end. There should be some way
to remove files after they are indexed by SWISH++. However, at this early
stage of development it's just not supported yet. Have plenty of disk space!

=head1 SEE ALSO

SWISH++ web site L<http://homepage.mac.com/pauljlucas/software/swish/>

=head1 AUTHOR

Dobrica Pavlinusic, E<lt>dpavlin@rot13.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2004 by Dobrica Pavlinusic

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.4 or,
at your option, any later version of Perl 5 you may have available.


=cut