1 |
This document discusses various issues of indexing |
2 |
(creating entries for the inverted file). |
3 |
|
4 |
|
5 |
* the inverted file |
6 |
|
7 |
The inverted file maps keys to (sorted) sequences of values. |
8 |
Although an application can store arbitrary information in values, |
9 |
in standard use, they denote a hit (where the key was found in a record). |
10 |
A hit is a structure of five numbers, specifying (in that order) |
11 |
a database number, a rowid (mfn), tag, occurence (of this tag in it's record) |
12 |
and position (of this "word" in it's field). |
13 |
|
14 |
The inverted file is implemented as a B-Link-Tree (Lehmann/Yao) |
15 |
with variable length keys of up to 255 bytes. |
16 |
The length of values is fixed for a given index, but can be configured |
17 |
on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n). |
18 |
The 8 byte minimum configuration (corresponding to the traditional IFP) uses |
19 |
- 2 bytes pos |
20 |
- 1 byte occ |
21 |
- 2 bytes tag |
22 |
- 3 bytes mfn |
23 |
A ninth byte is used for the mfn, bytes 10 and 11 for a database number, |
24 |
additional bytes have no standard usage in hits (by now). |
25 |
|
26 |
Occurence and position are typically used for "near" conditions in queries. |
27 |
Where this is not needed, they could also be used, for example, |
28 |
to apply weights to the hits. |
29 |
|
30 |
|
31 |
* creation of index entries |
32 |
|
33 |
Traditional ISIS systems based indexing on |
34 |
- a "print format" to create text from a record |
35 |
- an "indexing technique" to extract entries. |
36 |
The various indexing techniques split the text at newlines, subfields, |
37 |
word boundaries, angle brackets or other delimiters. |
38 |
|
39 |
OpenIsis uses a different approach based on an "index record", |
40 |
which is created either explicitly by application code or using |
41 |
> Views a view. |
42 |
Most split up like extracting keywords from between angle brackets |
43 |
or other delimiters is done during creation of this record, |
44 |
containing, in the simplest case, one field per index entry. |
45 |
|
46 |
The only additional transformation that may be applied |
47 |
is word split to be done during |
48 |
> Collating |
49 |
, because the definition of which characters constitute words is |
50 |
configured together with the collation info. |
51 |
|
52 |
|
53 |
In the index record, fields with a non-negative tag are treated as |
54 |
index entries according to the current mode. |
55 |
All fields with negative tags give processing instructions as follows |
56 |
(denoted with symbolic tag names): |
57 |
- index f[ields] [startocc] |
58 |
an xmode field with value 'f' (or some prefix of 'fields') sets default |
59 |
indexing mode, where each value (of a non-negative field) is used as |
60 |
literal index entry. The occurence is set to startocc or 0, |
61 |
incremented for runs of fields with the same tag, |
62 |
and reset to 0 on tag change. |
63 |
- index w[ords] [startpos] |
64 |
sets word mode, assuming fields to contain pre-split words. |
65 |
Instead of the occ, the pos is set, incremented and reset. |
66 |
- index s[plit] |
67 |
sets word autosplit mode. Index entries are split into words according |
68 |
to collation info; each word increments the pos, each field resets the |
69 |
pos and increments or resets the occ as in fields mode. |
70 |
If the index has no collation info, all characters but the well-known |
71 |
ASCII non-letters are assumed to be word characters. |
72 |
- index a[dd]|d[el] |
73 |
sets add (default) or del mode for index entries. |
74 |
- index m[fn] mfn |
75 |
sets the mfn to use |
76 |
- index n |
77 |
where n is a number, sets max key length to n (default is 30) |
78 |
- hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos] key |
79 |
declares explicit hit data for key. |
80 |
With a leading + or -, explicitly add or del the key regardless of mode. |
81 |
Unspecified fields are determined as usual from the environment. |
82 |
Since search results can be sorted by increasing pos, |
83 |
this field can also be used to specify a weight |
84 |
(starting from 0 for a 100% match). |
85 |
- fst viewspec |
86 |
declares a view line to create fields from the current data record, see |
87 |
> Views |
88 |
|
89 |
The index fields can also be used with a tag and value in their field value, |
90 |
in which case the given mode applies to that tag and value only. |
91 |
(TODO) |
92 |
|
93 |
While view-based indexing is yet to be done (as of May 03), |
94 |
application controlled indexing is working and robust |
95 |
(see the save method in openIsis.tcl and sample.fsp). |
96 |
|
97 |
|
98 |
* special values |
99 |
|
100 |
The entries created according to the index record are then subject to a |
101 |
> Collating collation transformation |
102 |
to create index keys. |
103 |
|
104 |
Finally they are handed to the B-Tree for insertion. |
105 |
If the key is already there, it may have an associated special value, |
106 |
which does not denote a hit, but rather processing instructions. |
107 |
Special values are recognized by a mfn of 0. |
108 |
|
109 |
- a completely blank value of all zeroes marks a stopword. |
110 |
no other values for this key are added to the index. |
111 |
(This is implemented). |
112 |
- a value of mfn 0, first byte of tag 1, marks a ref. |
113 |
The rest of the value and some successive values contain an alternate key, |
114 |
under which the entry is made (and searched). |
115 |
Since this has some difficulties, it will probably take some time |
116 |
until this is implemented... |
117 |
|
118 |
A multilingual (English, French, Italian, Spanish and Portuguese) |
119 |
> http://www.bib.wau.nl/isis/docum.txt stopword file |
120 |
can be loaded like |
121 |
$ |
122 |
tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload |
123 |
$ |
124 |
|
125 |
--- |
126 |
$Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $ |