1 |
dpavlin |
237 |
This document discusses various issues of indexing |
2 |
|
|
(creating entries for the inverted file). |
3 |
|
|
|
4 |
|
|
|
5 |
|
|
* the inverted file |
6 |
|
|
|
7 |
|
|
The inverted file maps keys to (sorted) sequences of values. |
8 |
|
|
Although an application can store arbitrary information in values, |
9 |
|
|
in standard use, they denote a hit (where the key was found in a record). |
10 |
|
|
A hit is a structure of five numbers, specifying (in that order) |
11 |
|
|
a database number, a rowid (mfn), tag, occurence (of this tag in it's record) |
12 |
|
|
and position (of this "word" in it's field). |
13 |
|
|
|
14 |
|
|
The inverted file is implemented as a B-Link-Tree (Lehmann/Yao) |
15 |
|
|
with variable length keys of up to 255 bytes. |
16 |
|
|
The length of values is fixed for a given index, but can be configured |
17 |
|
|
on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n). |
18 |
|
|
The 8 byte minimum configuration (corresponding to the traditional IFP) uses |
19 |
|
|
- 2 bytes pos |
20 |
|
|
- 1 byte occ |
21 |
|
|
- 2 bytes tag |
22 |
|
|
- 3 bytes mfn |
23 |
|
|
A ninth byte is used for the mfn, bytes 10 and 11 for a database number, |
24 |
|
|
additional bytes have no standard usage in hits (by now). |
25 |
|
|
|
26 |
|
|
Occurence and position are typically used for "near" conditions in queries. |
27 |
|
|
Where this is not needed, they could also be used, for example, |
28 |
|
|
to apply weights to the hits. |
29 |
|
|
|
30 |
|
|
|
31 |
|
|
* creation of index entries |
32 |
|
|
|
33 |
|
|
Traditional ISIS systems based indexing on |
34 |
|
|
- a "print format" to create text from a record |
35 |
|
|
- an "indexing technique" to extract entries. |
36 |
|
|
The various indexing techniques split the text at newlines, subfields, |
37 |
|
|
word boundaries, angle brackets or other delimiters. |
38 |
|
|
|
39 |
|
|
OpenIsis uses a different approach based on an "index record", |
40 |
|
|
which is created either explicitly by application code or using |
41 |
|
|
> Views a view. |
42 |
|
|
Most split up like extracting keywords from between angle brackets |
43 |
|
|
or other delimiters is done during creation of this record, |
44 |
|
|
containing, in the simplest case, one field per index entry. |
45 |
|
|
|
46 |
|
|
The only additional transformation that may be applied |
47 |
|
|
is word split to be done during |
48 |
|
|
> Collating |
49 |
|
|
, because the definition of which characters constitute words is |
50 |
|
|
configured together with the collation info. |
51 |
|
|
|
52 |
|
|
|
53 |
|
|
In the index record, fields with a non-negative tag are treated as |
54 |
|
|
index entries according to the current mode. |
55 |
|
|
All fields with negative tags give processing instructions as follows |
56 |
|
|
(denoted with symbolic tag names): |
57 |
|
|
- index f[ields] [startocc] |
58 |
|
|
an xmode field with value 'f' (or some prefix of 'fields') sets default |
59 |
|
|
indexing mode, where each value (of a non-negative field) is used as |
60 |
|
|
literal index entry. The occurence is set to startocc or 0, |
61 |
|
|
incremented for runs of fields with the same tag, |
62 |
|
|
and reset to 0 on tag change. |
63 |
|
|
- index w[ords] [startpos] |
64 |
|
|
sets word mode, assuming fields to contain pre-split words. |
65 |
|
|
Instead of the occ, the pos is set, incremented and reset. |
66 |
|
|
- index s[plit] |
67 |
|
|
sets word autosplit mode. Index entries are split into words according |
68 |
|
|
to collation info; each word increments the pos, each field resets the |
69 |
|
|
pos and increments or resets the occ as in fields mode. |
70 |
|
|
If the index has no collation info, all characters but the well-known |
71 |
|
|
ASCII non-letters are assumed to be word characters. |
72 |
|
|
- index a[dd]|d[el] |
73 |
|
|
sets add (default) or del mode for index entries. |
74 |
|
|
- index m[fn] mfn |
75 |
|
|
sets the mfn to use |
76 |
|
|
- index n |
77 |
|
|
where n is a number, sets max key length to n (default is 30) |
78 |
|
|
- hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos] key |
79 |
|
|
declares explicit hit data for key. |
80 |
|
|
With a leading + or -, explicitly add or del the key regardless of mode. |
81 |
|
|
Unspecified fields are determined as usual from the environment. |
82 |
|
|
Since search results can be sorted by increasing pos, |
83 |
|
|
this field can also be used to specify a weight |
84 |
|
|
(starting from 0 for a 100% match). |
85 |
|
|
- fst viewspec |
86 |
|
|
declares a view line to create fields from the current data record, see |
87 |
|
|
> Views |
88 |
|
|
|
89 |
|
|
The index fields can also be used with a tag and value in their field value, |
90 |
|
|
in which case the given mode applies to that tag and value only. |
91 |
|
|
(TODO) |
92 |
|
|
|
93 |
|
|
While view-based indexing is yet to be done (as of May 03), |
94 |
|
|
application controlled indexing is working and robust |
95 |
|
|
(see the save method in openIsis.tcl and sample.fsp). |
96 |
|
|
|
97 |
|
|
|
98 |
|
|
* special values |
99 |
|
|
|
100 |
|
|
The entries created according to the index record are then subject to a |
101 |
|
|
> Collating collation transformation |
102 |
|
|
to create index keys. |
103 |
|
|
|
104 |
|
|
Finally they are handed to the B-Tree for insertion. |
105 |
|
|
If the key is already there, it may have an associated special value, |
106 |
|
|
which does not denote a hit, but rather processing instructions. |
107 |
|
|
Special values are recognized by a mfn of 0. |
108 |
|
|
|
109 |
|
|
- a completely blank value of all zeroes marks a stopword. |
110 |
|
|
no other values for this key are added to the index. |
111 |
|
|
(This is implemented). |
112 |
|
|
- a value of mfn 0, first byte of tag 1, marks a ref. |
113 |
|
|
The rest of the value and some successive values contain an alternate key, |
114 |
|
|
under which the entry is made (and searched). |
115 |
|
|
Since this has some difficulties, it will probably take some time |
116 |
|
|
until this is implemented... |
117 |
|
|
|
118 |
|
|
A multilingual (English, French, Italian, Spanish and Portuguese) |
119 |
|
|
> http://www.bib.wau.nl/isis/docum.txt stopword file |
120 |
|
|
can be loaded like |
121 |
|
|
$ |
122 |
|
|
tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload |
123 |
|
|
$ |
124 |
|
|
|
125 |
|
|
--- |
126 |
|
|
$Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $ |