openisis/doc/Inverting.txt

This document discusses various issues of indexing
(creating entries for the inverted file).


*       the inverted file

The inverted file maps keys to (sorted) sequences of values.
Although an application can store arbitrary information in values,
in standard use, they denote a hit (where the key was found in a record).
A hit is a structure of five numbers, specifying (in that order)
a database number, a rowid (mfn), tag, occurence (of this tag in it's record)
and position (of this "word" in it's field).

The inverted file is implemented as a B-Link-Tree (Lehmann/Yao)
with variable length keys of up to 255 bytes.
The length of values is fixed for a given index, but can be configured
on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n).
The 8 byte minimum configuration (corresponding to the traditional IFP) uses
-       2 bytes pos
-       1 byte occ
-       2 bytes tag
-       3 bytes mfn
A ninth byte is used for the mfn, bytes 10 and 11 for a database number,
additional bytes have no standard usage in hits (by now).

Occurence and position are typically used for "near" conditions in queries.
Where this is not needed, they could also be used, for example,
to apply weights to the hits.


*       creation of index entries

Traditional ISIS systems based indexing on
-       a "print format" to create text from a record
-       an "indexing technique" to extract entries.
        The various indexing techniques split the text at newlines, subfields,
        word boundaries, angle brackets or other delimiters.

OpenIsis uses a different approach based on an "index record",
which is created either explicitly by application code or using
>       Views   a view.
Most split up like extracting keywords from between angle brackets
or other delimiters is done during creation of this record,
containing, in the simplest case, one field per index entry.

The only additional transformation that may be applied
is word split to be done during
>       Collating
, because the definition of which characters constitute words is
configured together with the collation info.


In the index record, fields with a non-negative tag are treated as
index entries according to the current mode.
All fields with negative tags give processing instructions as follows
(denoted with symbolic tag names):
-       index f[ields]  [startocc]
        an xmode field with value 'f' (or some prefix of 'fields') sets default
        indexing mode, where each value (of a non-negative field) is used as
        literal index entry. The occurence is set to startocc or 0,
        incremented for runs of fields with the same tag,
        and reset to 0 on tag change.
-       index w[ords]   [startpos]
        sets word mode, assuming fields to contain pre-split words.
        Instead of the occ, the pos is set, incremented and reset.
-       index s[plit]
        sets word autosplit mode. Index entries are split into words according
        to collation info; each word increments the pos, each field resets the
        pos and increments or resets the occ as in fields mode.
        If the index has no collation info, all characters but the well-known
        ASCII non-letters are assumed to be word characters.
-       index a[dd]|d[el]
        sets add (default) or del mode for index entries.
-       index m[fn]     mfn
        sets the mfn to use
-       index n
        where n is a number, sets max key length to n (default is 30)
-       hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos]  key
        declares explicit hit data for key.
        With a leading + or -, explicitly add or del the key regardless of mode.
        Unspecified fields are determined as usual from the environment.
        Since search results can be sorted by increasing pos,
        this field can also be used to specify a weight
        (starting from 0 for a 100% match).
-       fst     viewspec
        declares a view line to create fields from the current data record, see
>       Views

The index fields can also be used with a tag and value in their field value,
in which case the given mode applies to that tag and value only.
(TODO)

While view-based indexing is yet to be done (as of May 03),
application controlled indexing is working and robust
(see the save method in openIsis.tcl and sample.fsp).


*       special values

The entries created according to the index record are then subject to a
>       Collating       collation transformation
to create index keys.

Finally they are handed to the B-Tree for insertion.
If the key is already there, it may have an associated special value,
which does not denote a hit, but rather processing instructions.
Special values are recognized by a mfn of 0.

-       a completely blank value of all zeroes marks a stopword.
        no other values for this key are added to the index.
        (This is implemented).
-       a value of mfn 0, first byte of tag 1, marks a ref.
        The rest of the value and some successive values contain an alternate key,
        under which the entry is made (and searched).
        Since this has some difficulties, it will probably take some time
        until this is implemented...

A multilingual (English, French, Italian, Spanish and Portuguese)
>       http://www.bib.wau.nl/isis/docum.txt    stopword file
can be loaded like
$
tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload
$

---
        $Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $
1	This document discusses various issues of indexing
2	(creating entries for the inverted file).
3
4
5	* the inverted file
6
7	The inverted file maps keys to (sorted) sequences of values.
8	Although an application can store arbitrary information in values,
9	in standard use, they denote a hit (where the key was found in a record).
10	A hit is a structure of five numbers, specifying (in that order)
11	a database number, a rowid (mfn), tag, occurence (of this tag in it's record)
12	and position (of this "word" in it's field).
13
14	The inverted file is implemented as a B-Link-Tree (Lehmann/Yao)
15	with variable length keys of up to 255 bytes.
16	The length of values is fixed for a given index, but can be configured
17	on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n).
18	The 8 byte minimum configuration (corresponding to the traditional IFP) uses
19	- 2 bytes pos
20	- 1 byte occ
21	- 2 bytes tag
22	- 3 bytes mfn
23	A ninth byte is used for the mfn, bytes 10 and 11 for a database number,
24	additional bytes have no standard usage in hits (by now).
25
26	Occurence and position are typically used for "near" conditions in queries.
27	Where this is not needed, they could also be used, for example,
28	to apply weights to the hits.
29
30
31	* creation of index entries
32
33	Traditional ISIS systems based indexing on
34	- a "print format" to create text from a record
35	- an "indexing technique" to extract entries.
36	The various indexing techniques split the text at newlines, subfields,
37	word boundaries, angle brackets or other delimiters.
38
39	OpenIsis uses a different approach based on an "index record",
40	which is created either explicitly by application code or using
41	> Views a view.
42	Most split up like extracting keywords from between angle brackets
43	or other delimiters is done during creation of this record,
44	containing, in the simplest case, one field per index entry.
45
46	The only additional transformation that may be applied
47	is word split to be done during
48	> Collating
49	, because the definition of which characters constitute words is
50	configured together with the collation info.
51
52
53	In the index record, fields with a non-negative tag are treated as
54	index entries according to the current mode.
55	All fields with negative tags give processing instructions as follows
56	(denoted with symbolic tag names):
57	- index f[ields] [startocc]
58	an xmode field with value 'f' (or some prefix of 'fields') sets default
59	indexing mode, where each value (of a non-negative field) is used as
60	literal index entry. The occurence is set to startocc or 0,
61	incremented for runs of fields with the same tag,
62	and reset to 0 on tag change.
63	- index w[ords] [startpos]
64	sets word mode, assuming fields to contain pre-split words.
65	Instead of the occ, the pos is set, incremented and reset.
66	- index s[plit]
67	sets word autosplit mode. Index entries are split into words according
68	to collation info; each word increments the pos, each field resets the
69	pos and increments or resets the occ as in fields mode.
70	If the index has no collation info, all characters but the well-known
71	ASCII non-letters are assumed to be word characters.
72	- index a[dd]\|d[el]
73	sets add (default) or del mode for index entries.
74	- index m[fn] mfn
75	sets the mfn to use
76	- index n
77	where n is a number, sets max key length to n (default is 30)
78	- hit [+\|-][[dbn.]mfn.]tag.[occ].[pos]\|tag[.pos] key
79	declares explicit hit data for key.
80	With a leading + or -, explicitly add or del the key regardless of mode.
81	Unspecified fields are determined as usual from the environment.
82	Since search results can be sorted by increasing pos,
83	this field can also be used to specify a weight
84	(starting from 0 for a 100% match).
85	- fst viewspec
86	declares a view line to create fields from the current data record, see
87	> Views
88
89	The index fields can also be used with a tag and value in their field value,
90	in which case the given mode applies to that tag and value only.
91	(TODO)
92
93	While view-based indexing is yet to be done (as of May 03),
94	application controlled indexing is working and robust
95	(see the save method in openIsis.tcl and sample.fsp).
96
97
98	* special values
99
100	The entries created according to the index record are then subject to a
101	> Collating collation transformation
102	to create index keys.
103
104	Finally they are handed to the B-Tree for insertion.
105	If the key is already there, it may have an associated special value,
106	which does not denote a hit, but rather processing instructions.
107	Special values are recognized by a mfn of 0.
108
109	- a completely blank value of all zeroes marks a stopword.
110	no other values for this key are added to the index.
111	(This is implemented).
112	- a value of mfn 0, first byte of tag 1, marks a ref.
113	The rest of the value and some successive values contain an alternate key,
114	under which the entry is made (and searched).
115	Since this has some difficulties, it will probably take some time
116	until this is implemented...
117
118	A multilingual (English, French, Italian, Spanish and Portuguese)
119	> http://www.bib.wau.nl/isis/docum.txt stopword file
120	can be loaded like
121	$
122	tr '[a-z]' '[A-Z]' <docum.txt \| sort \| sto/openisis -db db/test/test -swload
123	$
124
125	---
126	$Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $