Line # Revision Author
1 64 dpavlin =head1 pgestraier - PostgreSQL full-text search using Hyper Estraier
2 23 dpavlin
3 64 dpavlin This package is essentially composed of two different parts:
4 23 dpavlin
5 64 dpavlin =over 4
6
7 =item search function
8
9 PostgreSQL function to search Hyper Estraier full-text index, using
10 full-text queries and attribute filtering to return user-specified
11 table of results.
12
13 This function can mimic SQL C<LIMIT>, C<OFFSET> and C<ORDER BY>
14 functionality much faster than using those SQL constructs on search
15 results.
16
17 =item trigger function
18
19 PostgreSQL trigger function to keep Hyper Estraier in sync with PostgreSQL.
20 It triggers after insert, update or delete and update full-text index
21 accordingly.
22
23 =back
24
25 Both functions are written in C, while test framework and supporting
26 utilities are written in perl.
27
28 You can use just one of those functions. If you want just to search existing
29 Hyper Estraier index or generate it off-line (after nightly batch jobs, for
30 example), just use search function.
31
32 On the other hand, if you want just to keep your Hyper Estraier index in
33 sync with PostgreSQL data, you can use just trigger function to achieve that.
34
35 33 dpavlin =head1 Why is it written?
36 23 dpavlin
37 64 dpavlin Aside from providing single query language (SQL) to RDBMS and full text index
38 23 dpavlin (using any language that has PostgreSQL client libraries), real power is
39 hidden in ability to join results from full text index and structured data
40 in RDBMS.
41
42 59 dpavlin For simple real-life example which address problem
43 64 dpavlin C<< WHERE name LIKE '%foo%' OR surname LIKE '%foo%' >>
44 75 dpavlin is slow see L<Tutorial> and L<pgest-index> documentation.
45 59 dpavlin
46 33 dpavlin =head1 How to install
47 23 dpavlin
48 Installation should be simple. However, you will have to have following
49 64 dpavlin software already installed before you try this functions:
50 23 dpavlin
51 33 dpavlin =over
52 23 dpavlin
53 33 dpavlin =item *
54
55 64 dpavlin PostgreSQL (tested with versions 7.4, 8.0 and 8.1) with development libraries
56 33 dpavlin
57 =item *
58
59 64 dpavlin Hyper Estraier (tested with various versions, recommended 1.2.4 or newer)
60 with development headers
61 33 dpavlin
62 64 dpavlin =item *
63
64 working C compiler (tested with gcc)
65
66 33 dpavlin =back
67
68 64 dpavlin If you want to use helper script to create consistency triggers to keep
69 Hyper Estraier in sync with PostgreSQL database, you will also need:
70 23 dpavlin
71 33 dpavlin =over
72 23 dpavlin
73 33 dpavlin =item *
74
75 working perl installation
76
77 =item *
78
79 64 dpavlin perl modules C<Search::Estraier>, C<DBI> and C<DBD::Pg>
80 33 dpavlin
81 64 dpavlin =back
82
83 To run tests you will also need:
84
85 =over
86
87 33 dpavlin =item *
88
89 64 dpavlin perl module C<Test::More>
90
91 =item *
92
93 87 dpavlin C<trivia.list.gz> from Internet Movie Database in C<data/> directory.
94 You can download it from L<http://www.imdb.com/interfaces>
95 33 dpavlin
96 =item *
97
98 43 dpavlin PostgreSQL database C<test> with permissions for current user
99 33 dpavlin
100 43 dpavlin =item *
101
102 59 dpavlin Hyper Estraier C<estmaster> running with permissions for C<admin> user
103 to create C<trivia> node.
104 43 dpavlin
105 33 dpavlin =back
106
107 23 dpavlin If you have all that, you should be able to type
108
109 make
110
111 and see sample results. You will be asked your password once (via sudo) to
112 43 dpavlin install C<pgest.so> shared library in system-wide location so that PostgreSQL
113 23 dpavlin could access it.
114
115 54 dpavlin =head2 Create sample index using Hyper Estraier perl bindings
116 29 dpavlin
117 54 dpavlin Perl bindings for Hyper Estraier are available at CPAN:
118 29 dpavlin
119 54 dpavlin L<http://search.cpan.org/~dpavlin/Search-Estraier/>
120 29 dpavlin
121
122 54 dpavlin After installing C<Search::Estraier> you can create index using following commands:
123 42 dpavlin
124 29 dpavlin cd data
125 54 dpavlin make index
126 29 dpavlin cd ..
127
128 23 dpavlin To run tests (which require that you have estcmd in your $PATH) issue
129
130 make test
131
132 54 dpavlin See also included file C<test.sql> for more examples of usage.
133 23 dpavlin
134 64 dpavlin =head1 Usage of search function pgest from SQL
135 42 dpavlin
136 54 dpavlin C<pgest> PostgreSQL function tries to mimic usage of normal database tables (with support for attribute filtering, limit and offset) in following way:
137 42 dpavlin
138 SELECT
139 -- columns to return (defined later)
140 id,title,size
141 FROM pgest(
142 54 dpavlin -- node URI, login, password and depth of search
143 49 dpavlin 'http://localhost:1978/node/trivia', 'admin', 'admin', 42,
144 42 dpavlin -- query
145 'blade runner',
146 -- additional attributes, use NULL or '' to disable
147 -- multiple attributes conditions can be separated by {{!}}
148 '@title ISTRINC blade',
149 -- order results by
150 '@title STRA',
151 -- limit, use NULL or 0 to disable
152 null,
153 -- offset, use NULL or 0 to disable
154 null,
155 -- attributes to return as columns
156 ARRAY['@id','@title','@size']
157 ) AS (
158 -- specify names and types of returned attributes
159 id text, title text, size text
160 );
161
162 54 dpavlin You should note that Hyper Estraier uses UTF-8 encoding, while your
163 PostgreSQL installation might use different encoding. To fix that, use
164 C<convert> function in PostgreSQL to convert encodings.
165
166 =head2 Using index via C<estmaster> server process
167
168 This is default and recommended way to use C<pgest> functionality. In this
169 case, C<pgest> will use node API and access index through C<estmaster>
170 process which should be running on (local or remote) machine.
171
172 This will remove database opening overhead, at a cost of (small) additional network
173 traffic. However, you can have Hyper Estraier C<estmaster> process running on
174 different machine or update index while doing searches, so benefits of this
175 approach are obvious.
176
177 42 dpavlin =head2 Accessing database directly
178
179 54 dpavlin B<Please note that direct access to database is depreciated.> As such, it's
180 not stated in example, and it's kept just for backward compatibility, but it
181 will probably be removed in future versions of C<pgest>.
182 42 dpavlin
183 54 dpavlin If you want to access database directly (without running C<estmaster> process), you
184 have to replace node URI, login, password and depth with full path to database file.
185
186 42 dpavlin Have in mind that C<postgres> user under which PostgreSQL is running must
187 have read permission on Hyper Estraier database files.
188
189 This will work a bit faster on really small indexes. However, when your
190 index grows bigger, you might consider using node API to remove overhead of
191 database opening on each query.
192
193 64 dpavlin =head1 Usage of trigger function pgest_trigger from SQL
194
195 Let's first say that I really suggest that you use C<dbi-index.pl> helper script to
196 create triggers because it already supports following steps automatically:
197
198 =over
199
200 =item begin transaction
201
202 Transaction is needed to catch updates which might happen while creation
203 of full-text index is in progress (and on huge collections this can take a while,
204 just like normal index creation in PostgreSQL).
205
206 =item insert all existing data in full-text index
207
208 This will be done directly from PostgreSQL database to Hyper Estraier index.
209 This is somewhat faster than waiting for trigger to fire for each existing
210 row.
211
212 =item create insert, update and delete triggers
213
214 Which will keep data in sync later
215
216 =item commit transaction
217
218 =back
219
220 If you still want to do that manually, you will need to know format of
221 C<pgest_trigger> function:
222
223 CREATE TRIGGER pgest_trigger_insert AFTER INSERT
224 ON table FOR EACH ROW
225 EXECUTE PROCEDURE pgest_trigger(
226 -- node URI, login and password
227 'http://localhost:1978/node/trivia', 'admin', 'admin',
228 -- name of primary key column
229 'id',
230 -- names of all other columns to index (one or more)
231 'column', 'another_one', 'and_another'
232 )
233
234 You have to create triggers for C<UPDATE> and C<DELETE> in similar way.
235
236 33 dpavlin =head1 Who wrote this?
237 23 dpavlin
238 Hyper Estraier is written by Mikio Hirabayashi.
239
240 PostgreSQL is written by hackers calling themselves PostgreSQL Global
241 Development Group.
242
243 75 dpavlin This small C functions are written by L<Dobrica Pavlinusic|http://www.rot13.org/~dpavlin/>, dpavlin@rot13.org.
244
245 =head1 See also
246
247 81 dpavlin =over
248
249 =item *
250
251 75 dpavlin L<Tutorial> - how to create first full-text index in under 10 minutes!
252
253 81 dpavlin =item *
254
255 75 dpavlin L<ChangeLog> - what has changed since last version
256
257 81 dpavlin =item *
258
259 75 dpavlin L<pgest-index> - helper script to create index and triggers
260 81 dpavlin
261 =item *
262
263 L<pgFoundry|http://pgfoundry.org/projects/pgestraier/> hosts home page of this project
264
265 =item *
266
267 L<Hyper Estraier user guide|http://hyperestraier.sourceforge.net/uguide-en.html#searchcond>
268 has a documentaton about query format. C<pgestraier> is using noraml queries (with
269 C<AND>, C<OR> etc.) and not simplified queryies (with C<|>).
270
271 =back