
Contents of /upstream/0.5.2/doc/uguide-en.html



Revision 9 - (show annotations)
Wed Aug 3 15:21:15 2005 UTC (18 years, 9 months ago) by dpavlin
File MIME type: text/html
File size: 29118 byte(s)
import upstream version 0.5.2

1 <?xml version="1.0" encoding="UTF-8"?>
2
3 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
4
5 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
6
7 <head>
8 <meta http-equiv="Content-Language" content="en" />
9 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
10 <meta http-equiv="Content-Style-Type" content="text/css" />
11 <meta name="author" content="Mikio Hirabayashi" />
12 <meta name="keywords" content="Hyper Estraier, Estraier, full-text search, API" />
13 <meta name="description" content="User's Guide of Hyper Estraier" />
14 <link rel="contents" href="./" />
15 <link rel="alternate" href="uguide-ja.html" hreflang="ja" title="the Japanese version" />
16 <link rel="stylesheet" href="common.css" />
17 <link rel="icon" href="icon16.png" />
18 <link rev="made" href="mailto:mikio@users.sourceforge.net" />
19 <title>User's Guide of Hyper Estraier Version 1</title>
20 </head>
21
22 <body>
23
24 <h1>User's Guide</h1>
25
26 <div class="note">Copyright (C) 2004-2005 Mikio Hirabayashi</div>
27 <div class="note">Last Update: Tue, 07 Jun 2005 06:17:00 +0900</div>
28 <div class="navi">[<a href="uguide-ja.html" hreflang="ja">Japanese</a>] [<a href="index.html">HOME</a>]</div>
29
30 <hr />
31
32 <h2 id="tableofcontents">Table of Contents</h2>
33
34 <ol>
35 <li><a href="#introduction">Introduction</a></li>
36 <li><a href="#attributes">Attributes</a></li>
37 <li><a href="#formats">File Formats</a></li>
38 <li><a href="#searchcond">Search Conditions</a></li>
39 <li><a href="#estcmd">Administration Command</a></li>
40 <li><a href="#estseek">CGI Script for Search</a></li>
41 </ol>
42
43 <hr />
44
45 <h2 id="introduction">Introduction</h2>
46
47 <p>This document describes in detail how to use the applications of Hyper Estraier. If you have not yet read <a href="intro-en.html">the introduction document</a>, please read it beforehand.</p>
48
49 <p>Hyper Estraier is a full-text search system that uses an index database. So, before searching, you need to prepare an index in which the target documents have been registered. Hyper Estraier provides the administration command `estcmd' and the CGI script `estseek.cgi' for search. The former is used to administer the index through a command line interface. The latter is used to search the index for documents with a web browser.</p>
50
51 <p>estcmd can handle various file formats and provides various operations to administer the index. How to use it is described in this document.</p>
52
53 <p>Hyper Estraier supports various search methods, such as combining several search phrases and searching by attributes of documents. Moreover, the presentation can be customized through the configuration of estseek.cgi. How to do so is described in this document.</p>
54
55 <hr />
56
57 <h2 id="attributes">Attributes</h2>
58
59 <p>Documents handled by Hyper Estraier can carry not only the body text but also attributes such as the title, the modification date, and so on. Attributes are used for various purposes, such as attribute search and deciding whether a document has changed when updating.</p>
60
61 <h3>Attribute Name</h3>
62
63 <p>Every attribute has a name. Although names can be chosen arbitrarily, some names are reserved for system attributes. Names of system attributes begin with `@'. The following system attributes exist.</p>
64
65 <ul>
66 <li><kbd>@id</kbd> : the ID number determined automatically when the document is registered.</li>
67 <li><kbd>@uri</kbd> : the location of the document. Every document should have this attribute.</li>
68 <li><kbd>@cdate</kbd> : the creation date.</li>
69 <li><kbd>@mdate</kbd> : the modification date.</li>
70 <li><kbd>@title</kbd> : the title used as a headline in the search result.</li>
71 <li><kbd>@author</kbd> : the author.</li>
72 <li><kbd>@type</kbd> : the media type.</li>
73 <li><kbd>@lang</kbd> : the language.</li>
74 <li><kbd>@size</kbd> : the size.</li>
75 </ul>
76
77 <p>Attributes other than system attributes are called user-defined attributes. They can be defined in a document draft, which is described later. Meta attributes in HTML and headers of MIME are also treated as user-defined attributes.</p>
78
79 <h3>Attribute Type</h3>
80
81 <p>There are two data types for attributes: string and number. Data of the string type are arbitrary strings. They support such operations as full matching, forward matching, backward matching, and partial matching. Data of the number type are numbers or dates. The string value of a number-type attribute is converted into a number according to the following formats. If the format is a date, the value is computed based on the UNIX epoch (1 Jan 1970).</p>
82
83 <ul>
84 <li>If all characters are digits, it is computed as a decimal number.</li>
85 <li>If it begins with "0x", it is computed as a hexadecimal number.</li>
86 <li>If it conforms to W3CDTF (e.g. 1978-02-11T18:05:32+09:00), it is computed as a date.</li>
87 <li>If it conforms to RFC822 (e.g. Sat, 11 Feb 1978 18:05:32 +0900), it is computed as a date.</li>
88 <li>If it is in YYYY/MM/DD format (e.g. 1978/02/11 18:05:32), it is computed as a date.</li>
89 <li>Else, it is computed as -1.</li>
90 </ul>
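<p>The conversion rules above can be sketched in Python as follows. This is only an illustration, not the code used by Hyper Estraier. The function name <kbd>attr_to_number</kbd> is hypothetical. The sketch assumes that dates are counted in seconds since the epoch and that a date without a time zone is in UTC; neither point is stated in this guide.</p>

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def attr_to_number(value: str) -> int:
    """Convert an attribute string to a number, following the rules listed above."""
    value = value.strip()
    # All characters are digits: decimal number.
    if re.fullmatch(r"[0-9]+", value):
        return int(value)
    # Leading "0x": hexadecimal number.
    if value.startswith("0x"):
        try:
            return int(value, 16)
        except ValueError:
            return -1
    parsed = None
    # YYYY/MM/DD, with or without a time part.
    for fmt in ("%Y/%m/%d %H:%M:%S", "%Y/%m/%d"):
        try:
            parsed = datetime.strptime(value, fmt)
            break
        except ValueError:
            pass
    # W3CDTF, e.g. 1978-02-11T18:05:32+09:00.
    if parsed is None:
        try:
            parsed = datetime.fromisoformat(value)
        except ValueError:
            pass
    # RFC822, e.g. Sat, 11 Feb 1978 18:05:32 +0900.
    if parsed is None:
        try:
            parsed = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return -1  # no format matched
    # Assumption: a date without a time zone is interpreted as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

<p>With this sketch, the W3CDTF and RFC822 examples above yield the same number because they denote the same instant.</p>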
91
92 <p>The data type is not determined at registration. It is determined at search time. The length of an attribute value is not limited.</p>
93
94 <p>Attributes and the body text of a document should be expressed in UTF-8 encoding. If another encoding is used, it should be converted into UTF-8. Note that estcmd detects the encoding automatically if it is not explicitly specified.</p>
95
96 <p>estcmd defines a URI attribute beginning with "file://" for each document. However, if a document defines its own URI, that one takes precedence. The path on the local file system is recorded as an attribute whose name is "_lpath".</p>
97
98 <hr />
99
100 <h2 id="formats">File Formats</h2>
101
102 <p>estcmd handles four file formats. This section describes how the four are processed.</p>
103
104 <h3>Plain Text</h3>
105
106 <p>A plain-text document is composed of strings with no structure. By default, files whose names end with ".txt", ".text", or ".asc" are treated as plain text.</p>
107
108 <ul>
109 <li>The character encoding is detected automatically.</li>
110 <li>"text/plain" is recorded as the "@type" attribute.</li>
111 <li>The file size is recorded as the "@size" attribute.</li>
112 </ul>
113
114 <h3>HTML</h3>
115
116 <p>An HTML document is used as hypertext on the Web. By default, files whose names end with ".html", ".htm", ".xhtml", or ".xht" are treated as HTML.</p>
117
118 <ul>
119 <li>The character encoding is detected automatically. However, if the encoding is specified by a "meta" element, that one takes precedence.</li>
120 <li>If there is a "title" element, its content is recorded as the "@title" attribute.</li>
121 <li>If the "name" attribute of a "meta" element specifies "author", the value of the "content" attribute is recorded as the "@author" attribute.</li>
122 <li>"text/html" is recorded as the "@type" attribute.</li>
123 <li>The file size is recorded as the "@size" attribute.</li>
124 <li>If "name" or "http-equiv" is specified in a "meta" element, the value of the "content" attribute is recorded as an attribute whose name is the value of "name" or "http-equiv" converted into lower case.</li>
125 <li>The value of the "@title" attribute is treated as hidden text.</li>
126 </ul>
127
128 <h3>MIME (e-mail)</h3>
129
130 <p>MIME is used for communication by e-mail based on RFC822 and so on. By default, files whose names end with ".eml", ".mime", ".mht", or ".mhtml" are treated as MIME.</p>
131
132 <ul>
133 <li>The character encoding is detected automatically. However, if the encoding is specified by the "Content-Type" header, that one takes precedence.</li>
134 <li>If the "Subject" header is present, its value is recorded as the "@title" attribute.</li>
135 <li>If the "From" header is present, its value is recorded as the "@author" attribute.</li>
136 <li>If the "Date" header is present, its value is recorded as the "@cdate" attribute and the "@mdate" attribute.</li>
137 <li>"message/rfc822" is recorded as the "@type" attribute.</li>
138 <li>The file size is recorded as the "@size" attribute.</li>
139 <li>The value of each header is recorded as an attribute whose name is the header name converted into lower case.</li>
140 <li>The value of the "@title" attribute is treated as hidden text.</li>
141 </ul>
142
143 <p>If the content type of a part of a multipart message is "text/plain", "text/html", or "message/rfc822", the content is treated as a part of the body text, so that web archives can be supported.</p>
144
145 <h3>Document Draft</h3>
146
147 <p>Document draft is an original format of Hyper Estraier. Various formats can be handled in a unified way by using document draft as an intermediate format. By default, files whose names end with ".est" are treated as document draft.</p>
148
149 <p>Though the format of document draft is similar to RFC822, the details differ. The delimiter for headers is not ":" but "=". Moreover, no space character is needed after "=". The following is example data for handling a MIDI document.</p>
150
151 <pre>@uri=http://www.music-estraier.com/mididb/t/tw/twinkle.kar
152 @title=Twinkle Twinkle Little Star
153 @author=Jane Taylor
154 @cdate=2004-11-01T23:11:18+09:00
155 @mdate=2005-03-21T08:07:45+09:00
156 category=chorus,dance
157
158 Twinkle, twinkle, little star,
159 How I wonder what you are.
160 Up above the world so high,
161 Like a diamond in the sky.
162 Twinkle, twinkle, little star,
163 How I wonder what you are!
164 Twinkle Twinkle Little Star
165 Jane Taylor
166 </pre>
167
168 <p>A document draft must satisfy the following requirements.</p>
169
170 <ul>
171 <li>It is composed of valid UTF-8 strings.</li>
172 <li>Line breaks are either UNIX style (LF) or DOS style (CR+LF).</li>
173 <li>There are an attribute section and a text section, separated by the first empty line.</li>
174 <li>Each line in the attribute section specifies an attribute. The name and the value are separated by the first "=".</li>
175 <li>Each line in the text section specifies a sentence of the body text. If a line begins with a tab character, the line is treated as hidden text.</li>
176 </ul>
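<p>As an illustration of these requirements, the following Python sketch splits a document draft into its attributes, its normal text lines, and its hidden text lines. The function name <kbd>parse_draft</kbd> is hypothetical, and this is not the parser used by estcmd. Skipping empty lines inside the text section is an assumption made here for simplicity.</p>

```python
def parse_draft(data: str):
    """Split a document draft into (attributes, text lines, hidden text lines)."""
    # Accept both LF and CR+LF line breaks.
    lines = data.replace("\r\n", "\n").split("\n")
    attrs = {}
    texts = []
    hidden = []
    in_text = False
    for line in lines:
        if not in_text:
            if line == "":
                in_text = True  # the first empty line ends the attribute section
                continue
            # The name and the value are separated by the first "=".
            name, sep, value = line.partition("=")
            if sep:
                attrs[name] = value
            continue
        if line.startswith("\t"):
            hidden.append(line[1:])  # a leading tab marks hidden text
        elif line:
            texts.append(line)  # assumption: empty text lines are skipped
    return attrs, texts, hidden
```

<p>Because only the first "=" is a delimiter, a line such as <kbd>note=a=b</kbd> gives the attribute "note" the value "a=b".</p>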
177
178 <p>Hidden text is the same as normal text except that it is not displayed in the snippet of the result. It is useful when you want the values of some attributes to be found by full-text search.</p>
179
180 <hr />
181
182 <h2 id="searchcond">Search Conditions</h2>
183
184 <p>Two kinds of search conditions are supported. One is for full-text search and the other is for attribute search. If both are specified at the same time, documents matching both are searched for. Moreover, a full-text search condition can be written in the usual form or in the simplified form.</p>
185
186 <h3>Full-text Search Conditions</h3>
187
188 <p>The purpose of full-text search is to search for documents including some specified words. For example, if you search for documents including a word "computer", specify "computer" in the search phrase as it is.</p>
189
190 <p>You can specify two or more words. For example, if you specify "United Nations", documents including "united" followed by "nations" are searched for. In the simplified form, specify the following.</p>
191
192 <pre>"united nations"
193 </pre>
194
195 <p>The intersection operation is supported by the "AND" operator. For example, if you specify "internet AND security", documents including both "internet" and "security" are searched for. In the simplified form, specify the following.</p>
196
197 <pre>internet security
198 </pre>
199
200 <p>The difference operation is supported by the "ANDNOT" operator. For example, if you specify "hacker ANDNOT cracker", documents including "hacker" but not including "cracker" are searched for. In the simplified form, specify the following.</p>
201
202 <pre>hacker ! cracker
203 </pre>
204
205 <p>The union operation is supported by the "OR" operator. For example, if you specify "proxy OR firewall", documents including one or both of "proxy" and "firewall" are searched for. In the simplified form, specify the following.</p>
206
207 <pre>proxy | firewall
208 </pre>
209
210 <p>Note that the priority of "OR" is higher than that of "AND" and "ANDNOT". For example, if you specify "F1 OR F-1 OR Formula One AND Champion OR Victory", documents including at least one of "f1", "f-1", and "formula one", and also including one or both of "champion" and "victory", are searched for. In the simplified form, specify the following.</p>
211
212 <pre>F1 | F-1 | "Formula One" Champion | Victory
213 </pre>
214
215 <p>Search words are case insensitive. However, operators are case sensitive. If you want to search for documents including "AND", specify "and" instead.</p>
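<p>The correspondence between the two forms shown in the examples above can be sketched in Python as follows. The function name <kbd>simplified_to_usual</kbd> is hypothetical, and the sketch covers only the cases shown in this section (quoted phrases, spaces, "!" and "|"); it is not the parser used by Hyper Estraier.</p>

```python
import re

# A token is either a double-quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')
_OPERATORS = {"|": "OR", "!": "ANDNOT"}


def simplified_to_usual(phrase: str) -> str:
    """Rewrite a simplified-form phrase into the usual form."""
    out = []
    prev_was_word = False
    for quoted, bare in _TOKEN.findall(phrase):
        if bare in _OPERATORS:
            out.append(_OPERATORS[bare])
            prev_was_word = False
            continue
        word = quoted or bare
        if not word:
            continue  # ignore an empty pair of quotes
        if prev_was_word:
            out.append("AND")  # adjacent words are joined by AND
        out.append(word)
        prev_was_word = True
    return " ".join(out)
```

<p>For example, passing the last simplified phrase above to this function gives back the usual-form phrase quoted in the paragraph before it.</p>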
216
217 <h3>Attribute Search Conditions</h3>
218
219 <p>The purpose of attribute search is to search for documents whose attributes match the specified expression. An attribute search expression is composed of an attribute name, an operator, and a value, separated by space characters. For example, if you specify "@title STRINC IMPORTANT", documents whose title includes "IMPORTANT" are searched for. The following operators for attribute search are supported.</p>
220
221 <ul>
222 <li>STREQ : is equal to the string</li>
223 <li>STRNE : is not equal to the string</li>
224 <li>STRINC : includes the string</li>
225 <li>STRBW : begins with the string</li>
226 <li>STREW : ends with the string</li>
227 <li>NUMEQ : is equal to the number or date</li>
228 <li>NUMNE : is not equal to the number or date</li>
229 <li>NUMLT : is less than the number or date</li>
230 <li>NUMLE : is less than or equal to the number or date</li>
231 <li>NUMGT : is greater than the number or date</li>
232 <li>NUMGE : is greater than or equal to the number or date</li>
233 </ul>
234
235 <p>If an operator is preceded by "!", the meaning is inverted. If an operator is preceded by "I", the case of the value is ignored.</p>
236
237 <h3>Order of the Result</h3>
238
239 <p>You can specify the order of the result by an expression. An ordering expression is composed of an attribute name and an operator. For example, if you specify "@size NUMA", documents in the result are in ascending order of the size. The following operators for ordering are supported.</p>
240
241 <ul>
242 <li>STRA : ascending by string</li>
243 <li>STRD : descending by string</li>
244 <li>NUMA : ascending by number or date</li>
245 <li>NUMD : descending by number or date</li>
246 </ul>
247
248 <p>By default, the result is in descending order of score. The score is calculated from the number of occurrences of the specified words in each document.</p>
249
250 <hr />
251
252 <h2 id="estcmd">Administration Command</h2>
253
254 <p>This section describes the specification of estcmd. estcmd can perform not only indexing but also search.</p>
255
256 <h3>Synopsis and Description</h3>
257
258 <p>estcmd is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument <var>db</var> specifies the path of an index.</p>
259
260 <dl>
261 <dt><kbd>estcmd put [-cl] <var>db</var> [<var>file</var>]</kbd></dt>
262 <dd>Register a document in document draft format into an index.</dd>
263 <dd><var>file</var> specifies a target file. If it is omitted, the standard input is read.</dd>
264 <dd>If -cl is specified, regions of an overwritten document are cleaned up.</dd>
265 </dl>
266
267 <dl>
268 <dt><kbd>estcmd out [-cl] <var>db</var> <var>expr</var></kbd></dt>
269 <dd>Remove information of a document from an index.</dd>
270 <dd><var>expr</var> specifies the ID number or the URI of a document.</dd>
271 <dd>If -cl is specified, regions of the document are cleaned up.</dd>
272 </dl>
273
274 <dl>
275 <dt><kbd>estcmd get <var>db</var> <var>expr</var> [<var>attr</var>]</kbd></dt>
276 <dd>Output document draft of a document in an index.</dd>
277 <dd><var>expr</var> specifies the ID number or the URI of a document.</dd>
278 <dd>If <var>attr</var> is specified, only the value of the attribute is output.</dd>
279 </dl>
280
281 <dl>
282 <dt><kbd>estcmd list <var>db</var></kbd></dt>
283 <dd>Output a list of all documents in an index.</dd>
284 </dl>
285
286 <dl>
287 <dt><kbd>estcmd uriid <var>db</var> <var>uri</var></kbd></dt>
288 <dd>Output the ID number of a document specified by URI.</dd>
289 <dd><var>uri</var> specifies the URI of a document.</dd>
290 </dl>
291
292 <dl>
293 <dt><kbd>estcmd meta <var>db</var> [<var>name</var> [<var>value</var>]]</kbd></dt>
294 <dd>Handle meta data.</dd>
295 <dd><var>name</var> specifies the name of a piece of meta data. If it is omitted, a list of all names is output.</dd>
296 <dd><var>value</var> specifies the value of the meta data to be recorded. If it is omitted, the current value is output. If it is an empty string, the meta data is removed.</dd>
297 </dl>
298
299
300 <dl>
301 <dt><kbd>estcmd inform <var>db</var></kbd></dt>
302 <dd>Output the number of documents and the number of unique words in an index.</dd>
303 </dl>
304
305 <dl>
306 <dt><kbd>estcmd optimize [-onp] [-ond] <var>db</var></kbd></dt>
307 <dd>Optimize an index and clean up dispensable regions.</dd>
308 <dd>If -onp is specified, cleaning up dispensable regions is skipped.</dd>
309 <dd>If -ond is specified, optimizing the database files is skipped.</dd>
310 </dl>
311
312 <dl>
313 <dt><kbd>estcmd search [-ic <var>enc</var>] [-vu|-va|-vf|-vs|-vh|-vx|-dd] [-gs|-gf|-ga] [-ni] [-sf] [-hs] [-attr <var>expr</var>] [-ord <var>expr</var>] [-max <var>num</var>] [-sim <var>id</var>] <var>db</var> [<var>phrase</var>]</kbd></dt>
314 <dd>Search an index for documents.</dd>
315 <dd><var>phrase</var> specifies the search phrase.</dd>
316 <dd>-ic specifies the input encoding. By default, it is UTF-8.</dd>
317 <dd>If -vu is specified, TSV of ID number and URI are output.</dd>
318 <dd>If -va is specified, multipart format including attributes is output.</dd>
319 <dd>If -vf is specified, multipart format including document draft is output.</dd>
320 <dd>If -vs is specified, multipart format including attributes and snippets is output.</dd>
321 <dd>If -vh is specified, human readable format including attributes and snippets is output.</dd>
322 <dd>If -vx is specified, XML including attributes and snippets is output.</dd>
323 <dd>If -dd is specified, document draft data are dumped and saved into separate files.</dd>
324 <dd>If -gs is specified, every key of N-gram is checked. By default, every second key is checked.</dd>
325 <dd>If -gf is specified, keys of N-gram are checked every three.</dd>
326 <dd>If -ga is specified, keys of N-gram are checked every four.</dd>
327 <dd>If -ni is specified, TF-IDF tuning is omitted.</dd>
328 <dd>If -sf is specified, the phrase is treated as the simplified form.</dd>
329 <dd>If -hs is specified, score information is output as a hint.</dd>
330 <dd>-attr specifies an attribute search condition. This option can be specified multiple times.</dd>
331 <dd>-ord specifies the order expression. By default, it is descending by score.</dd>
332 <dd>-max specifies the maximum number of shown documents. Negative means unlimited. By default, it is 10.</dd>
333 <dd>-sim specifies the ID number of the seed document for similarity search.</dd>
334 </dl>
335
336 <dl>
337 <dt><kbd>estcmd gather [-cl] [-fe|-ft|-fh|-fm] [-fx <var>sufs</var> <var>cmd</var>] [-fz] [-fo] [-ic <var>enc</var>] [-il <var>lang</var>] [-pc <var>enc</var>] [-pf] [-apn] [-sd] [-cm] [-cs <var>num</var>] <var>db</var> [<var>file</var>|<var>dir</var>]</kbd></dt>
338 <dd>Scan the local file system and register documents into an index.</dd>
339 <dd>If the third argument is the name of a file, a list of paths of target documents is read from it. If it is "-", the standard input is read.</dd>
340 <dd>If the third argument is the name of a directory, all files under the directory are treated as target documents.</dd>
341 <dd>If -cl is specified, regions of overwritten documents are cleaned up.</dd>
342 <dd>If -fe is specified, target files are treated as document draft. By default, the format is detected by the suffix of each document.</dd>
343 <dd>If -ft is specified, target files are treated as plain text.</dd>
344 <dd>If -fh is specified, target files are treated as HTML.</dd>
345 <dd>If -fm is specified, target files are treated as MIME.</dd>
346 <dd>If -fx is specified, target files with the specified suffixes are processed by the specified outer command. If the command is preceded by "T@", the output of the command is treated as plain text. If the command is preceded by "H@", the output of the command is treated as HTML. If the command is preceded by "M@", the output of the command is treated as MIME. Otherwise, the output is treated as document draft. This option can be specified multiple times.</dd>
347 <dd>If -fz is specified, documents which do not correspond to the condition of -fx are ignored.</dd>
348 <dd>If -fo is specified, target files are not read.</dd>
349 <dd>-ic specifies the input encoding. By default, it is detected automatically.</dd>
350 <dd>-il specifies the preferred input language. By default, English is preferred.</dd>
351 <dd>-pc specifies the encoding of file paths. By default, it is ISO-8859-1.</dd>
352 <dd>If -pf is specified, the full path is recorded as an attribute instead of the file name.</dd>
353 <dd>If -apn is specified, N-gram analysis is performed against European text also.</dd>
354 <dd>If -sd is specified, the creation date and the modification date of each file are recorded as attributes.</dd>
355 <dd>If -cm is specified, documents whose modification date has not changed are ignored.</dd>
356 <dd>-cs specifies the size of cache memory in megabytes. By default, it is 64MB.</dd>
357 </dl>
358
359 <dl>
360 <dt><kbd>estcmd purge [-cl] [-fc] <var>db</var> [<var>prefix</var>]</kbd></dt>
361 <dd>Purge information of documents which do not exist on the file system.</dd>
362 <dd>If <var>prefix</var> is specified, only documents whose URIs begin with it are processed.</dd>
363 <dd>If -cl is specified, regions of the deleted documents are cleaned up.</dd>
364 <dd>If -fc is specified, information of all target documents is deleted.</dd>
365 </dl>
366
367 <dl>
368 <dt><kbd>estcmd extkeys [-fc] [-ni] [-kn <var>num</var>] <var>db</var> [<var>prefix</var>]</kbd></dt>
369 <dd>Create a database of keywords extracted from documents.</dd>
370 <dd>If <var>prefix</var> is specified, only documents whose URIs begin with it are processed.</dd>
371 <dd>If -fc is specified, all target documents are processed whether or not they have existing records.</dd>
372 <dd>If -ni is specified, TF-IDF tuning is omitted.</dd>
373 <dd>-kn specifies the number of keywords to be extracted.</dd>
374 </dl>
375
376 <dl>
377 <dt><kbd>estcmd draft [-ft|-fh|-fm] [-ic <var>enc</var>] [-il <var>lang</var>] [<var>file</var>]</kbd></dt>
378 <dd>For test and debug.</dd>
379 </dl>
380
381 <dl>
382 <dt><kbd>estcmd break [-ic <var>enc</var>] [-il <var>lang</var>] [-apn] [-wt] [<var>file</var>]</kbd></dt>
383 <dd>For test and debug.</dd>
384 </dl>
385
386 <dl>
387 <dt><kbd>estcmd randput [-ren|-rla|-reu|-ror|-rjp|-rch] [-cs num] <var>db</var> <var>dnum</var></kbd></dt>
388 <dd>For test and debug.</dd>
389 </dl>
390
391 <dl>
392 <dt><kbd>estcmd wicked <var>db</var> <var>dnum</var></kbd></dt>
393 <dd>For test and debug.</dd>
394 </dl>
395
396 <dl>
397 <dt><kbd>estcmd regression <var>db</var></kbd></dt>
398 <dd>For test and debug.</dd>
399 </dl>
400
401 <dl>
402 <dt><kbd>estcmd version</kbd></dt>
403 <dd>Show the version information.</dd>
404 </dl>
405
406 <p>All sub commands return 0 if the operation succeeds, and 1 otherwise. The sub commands put, out, gather, purge, randput, wicked, and regression close the database and finish when they catch signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), 13 (SIGPIPE), or 15 (SIGTERM).</p>
407
408 <p>The encoding name specified by the -ic option should be a name registered to IETF, such as UTF-8, ISO-8859-1, and so on. The language name specified by the -il option should be one of "en" (English), "ja" (Japanese), "zh" (Chinese), and "ko" (Korean).</p>
409
410 <p>An outer command specified by the -fx option of gather receives the path of the target document as the first argument and the path for output as the second argument. The original path of the target document is given as the value of the environment variable `ESTORIGFILE'.</p>
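<p>A minimal sketch of such an outer command, written in Python, might look as follows. It is not one of the filters shipped with Hyper Estraier. The function name <kbd>convert</kbd> and the choice of using the file name as the title are assumptions made for illustration. The sketch reads the target file as text and writes a document draft, which is the format expected when the command has no "T@", "H@", or "M@" prefix.</p>

```python
import os
import sys


def convert(src_path: str, dst_path: str) -> None:
    """Read src_path as text and write a document draft to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src:
        body = src.read()
    # ESTORIGFILE holds the original path; fall back to src_path when unset.
    orig = os.environ.get("ESTORIGFILE", src_path)
    title = os.path.basename(orig)  # assumption: use the file name as the title
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        dst.write(f"@title={title}\n")
        dst.write("\n")  # the first empty line separates attributes from text
        for line in body.splitlines():
            line = line.strip()  # also removes a leading tab, which would mark hidden text
            if line:
                dst.write(line + "\n")


# estcmd passes the target path and the output path as two arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```

<p>If the script were saved as an executable file on the command search path, its name could be given as the <var>cmd</var> argument of -fx.</p>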
411
412 <h3>Examples</h3>
413
414 <p>The following registers mail files in MH format.</p>
415
416 <pre>find /home/mikio/Mail -type f | egrep 'inbox/(business|friends)/[0-9]+$' |
417 estcmd gather -cl -fm -cm casket -
418 </pre>
419
420 <p>The following registers MS-Office files. estfxmsotohtml requires wvWare and xlhtml.</p>
421
422 <pre>PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
423 estcmd gather -cl -fx ".doc,.xls,.ppt" "H@estfxmsotohtml" -fz -sd -cm casket .
424 </pre>
425
426 <p>The following registers PDF files. estfxpdftohtml requires pdftotext.</p>
427
428 <pre>PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
429 estcmd gather -cl -fx ".pdf" "H@estfxpdftohtml" -fz -sd -cm casket .
430 </pre>
431
432 <p>The following outputs the search result as XML.</p>
433
434 <pre>estcmd search -vx -max 8 casket 'socket AND shutdown'
435 </pre>
436
437 <hr />
438
439 <h2 id="estseek">CGI Script for Search</h2>
440
441 <p>This section describes the specification of estseek.cgi. The main topic is how to write the configuration files.</p>
442
443 <h3>Composition</h3>
444
445 <p>estseek.cgi needs three configuration files: the prime configuration file, the template file, and the top page file. Their default names are `estseek.conf', `estseek.tmpl', and `estseek.top'.</p>
446
447 <p>The name of the prime configuration file is determined by changing the suffix of the CGI script to ".conf". If you change the name of `estseek.cgi' to `estsearch.cgi', `estsearch.conf' is read. The names of the template file and the top page file are specified in the prime configuration file. So, you can install several sets of search scripts in one directory.</p>
448
449 <p>`estseek.cgi' is installed as `/usr/local/libexec/estseek.cgi', so copy it to a directory for CGI scripts. Sample configuration files are installed in `/usr/local/share/hyperestraier/', so copy and modify them.</p>
450
451 <h3>Prime Configuration File</h3>
452
453 <p>The prime configuration file is composed of lines, each of which contains the name of a variable and its value separated by `:'. By default, the following configuration is provided.</p>
454
455 <pre>indexname: casket
456 tmplfile: estseek.tmpl
457 topfile: estseek.top
458 logfile:
459 lprefix: file:///home/mikio/public_html/
460 gprefix: http://localhost/
461 gsuffix:
462 dirindex: index.html
463 replace: //localhost/{{!}}//127.0.0.1/
464 replace: //127.0.0.1:80/{{!}}//127.0.0.1/
465 perpage: 10,20,30,40,50,100
466 attrselect: false
467 showscore: false
468 extattr: author|Author
469 extattr: from|From
470 extattr: to|To
471 extattr: cc|Cc
472 extattr: date|Date
473 snipwwidth: 480
474 sniphwidth: 96
475 snipawidth: 96
476 condgstep: 2
477 dotfidf: true
478 smplphrase: true
479 candetail: true
480 smlrvnum: 0
481 spcache:
482 </pre>
483
484 <p>The meaning of each variable is as follows.</p>
485
486 <ul>
487 <li><kbd>indexname</kbd> : specifies the name of the index.</li>
488 <li><kbd>tmplfile</kbd> : specifies the path of the template file.</li>
489 <li><kbd>topfile</kbd> : specifies the path of the top page file.</li>
490 <li><kbd>logfile</kbd> : specifies the path of the log file.</li>
491 <li><kbd>lprefix</kbd> : specifies the prefix of the local URI of each document.</li>
492 <li><kbd>gprefix</kbd> : specifies the prefix of the global URI of each document.</li>
493 <li><kbd>gsuffix</kbd> : specifies the suffix added to the global URI of each document.</li>
494 <li><kbd>dirindex</kbd> : specifies the name of the directory index file.</li>
495 <li><kbd>replace</kbd> : specifies an expression to replace the URI of each document. The string before replacement and the string after replacement are separated by "{{!}}". This variable can be specified more than once.</li>
496 <li><kbd>perpage</kbd> : specifies the choices for the number of documents shown in a page.</li>
497 <li><kbd>attrselect</kbd> : specifies whether to use select boxes for attribute conditions.</li>
498 <li><kbd>showscore</kbd> : specifies whether to show scores.</li>
499 <li><kbd>extattr</kbd> : specifies an attribute to be shown. The name and the label are separated by "|". This variable can be specified more than once.</li>
500 <li><kbd>snipwwidth</kbd> : specifies the whole width of the snippet of each shown document.</li>
501 <li><kbd>sniphwidth</kbd> : specifies the width of strings picked up from the beginning of the text.</li>
502 <li><kbd>snipawidth</kbd> : specifies the width of strings picked up around each highlighted word.</li>
503 <li><kbd>condgstep</kbd> : specifies the accuracy of N-gram checking. "1" is to check every key. "2" is to check every second key. "3" is to check every third key. "4" is to check every fourth key.</li>
504 <li><kbd>dotfidf</kbd> : specifies whether to do TF-IDF score tuning. "true" or "false".</li>
505 <li><kbd>smplphrase</kbd> : specifies whether to use the simplified form for search phrases. "true" or "false".</li>
506 <li><kbd>candetail</kbd> : specifies whether to enable detail display of a document. "true" or "false".</li>
507 <li><kbd>smlrvnum</kbd> : specifies the number of keywords for similarity search. If it is less than 1, similarity search is disabled.</li>
508 <li><kbd>spcache</kbd> : specifies the name of an attribute of the special cache.</li>
509 </ul>
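<p>The line format of the prime configuration file can be illustrated with the following Python sketch. The function name <kbd>parse_conf</kbd> is hypothetical, and this is not the code of estseek.cgi. It assumes that the name ends at the first `:', so that values such as URIs may themselves contain `:', and that surrounding spaces are not part of the value. Because variables such as <kbd>replace</kbd> and <kbd>extattr</kbd> can appear more than once, every variable is mapped to a list of values.</p>

```python
from collections import defaultdict


def parse_conf(text: str) -> dict:
    """Parse 'name: value' lines into a dict mapping each name to a list of values."""
    conf = defaultdict(list)
    for line in text.splitlines():
        # Split at the first ':' only; the value may contain further colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        conf[name.strip()].append(value.strip())
    return dict(conf)
```

<p>A value of <kbd>replace</kbd> can then be divided with <kbd>value.split("{{!}}")</kbd>, and a value of <kbd>extattr</kbd> with <kbd>value.split("|")</kbd>.</p>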
510
511 <h3>Template File</h3>
512
513 <p>The template file determines the appearance of the page. It is written in HTML and its content is output as it is. However, "&lt;!--ESTFORM--&gt;" is replaced by the form to input search conditions. "&lt;!--ESTRESULT--&gt;" is replaced by the search result. "&lt;!--ESTINFO--&gt;" is replaced by information of the index.</p>
514
515 <h3>Top Page File</h3>
516
517 <p>When a user accesses the CGI script for the first time or if no search condition is input, the content of the top page file is displayed instead of the search result. By default, the usage of the CGI script is described there.</p>
518
519 <h3>Search Form</h3>
520
521 <p>If you want to place the search form in another page, write the following HTML.</p>
522
523 <pre>&lt;form method="get" action="estseek.cgi"&gt;
524 &lt;div&gt;
525 &lt;input type="text" name="phrase" value="" size="32" /&gt;
526 &lt;input type="submit" value="Search" /&gt;
527 &lt;input type="hidden" name="enc" value="UTF-8" /&gt;
528 &lt;/div&gt;
529 &lt;/form&gt;
530 </pre>
531
532 <p>Change "estseek.cgi" to the URI of estseek.cgi. Change "UTF-8" to the encoding name of the page.</p>
533
534 <hr />
535
536 </body>
537
538 </html>
539
540 <!-- END OF FILE -->
