/[webpac]/openisis/0.9.0/doc/Inverting.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /openisis/0.9.0/doc/Inverting.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 238 - (show annotations)
Mon Mar 8 17:46:16 2004 UTC (20 years ago) by dpavlin
File MIME type: text/plain
File size: 5102 byte(s)
tagging openisis 0.9.0

1 This document discusses various issues of indexing
2 (creating entries for the inverted file).
3
4
5 * the inverted file
6
7 The inverted file maps keys to (sorted) sequences of values.
8 Although an application can store arbitrary information in values,
9 in standard use, they denote a hit (where the key was found in a record).
10 A hit is a structure of five numbers, specifying (in that order)
11 a database number, a rowid (mfn), tag, occurence (of this tag in it's record)
12 and position (of this "word" in it's field).
13
14 The inverted file is implemented as a B-Link-Tree (Lehmann/Yao)
15 with variable length keys of up to 255 bytes.
16 The length of values is fixed for a given index, but can be configured
17 on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n).
18 The 8 byte minimum configuration (corresponding to the traditional IFP) uses
19 - 2 bytes pos
20 - 1 byte occ
21 - 2 bytes tag
22 - 3 bytes mfn
23 A ninth byte is used for the mfn, bytes 10 and 11 for a database number,
24 additional bytes have no standard usage in hits (by now).
25
26 Occurence and position are typically used for "near" conditions in queries.
27 Where this is not needed, they could also be used, for example,
28 to apply weights to the hits.
29
30
31 * creation of index entries
32
33 Traditional ISIS systems based indexing on
34 - a "print format" to create text from a record
35 - an "indexing technique" to extract entries.
36 The various indexing techniques split the text at newlines, subfields,
37 word boundaries, angle brackets or other delimiters.
38
39 OpenIsis uses a different approach based on an "index record",
40 which is created either explicitly by application code or using
41 > Views a view.
42 Most split up like extracting keywords from between angle brackets
43 or other delimiters is done during creation of this record,
44 containing, in the simplest case, one field per index entry.
45
46 The only additional transformation that may be applied
47 is word split to be done during
48 > Collating
49 , because the definition of which characters constitute words is
50 configured together with the collation info.
51
52
53 In the index record, fields with a non-negative tag are treated as
54 index entries according to the current mode.
55 All fields with negative tags give processing instructions as follows
56 (denoted with symbolic tag names):
57 - index f[ields] [startocc]
58 an xmode field with value 'f' (or some prefix of 'fields') sets default
59 indexing mode, where each value (of a non-negative field) is used as
60 literal index entry. The occurence is set to startocc or 0,
61 incremented for runs of fields with the same tag,
62 and reset to 0 on tag change.
63 - index w[ords] [startpos]
64 sets word mode, assuming fields to contain pre-split words.
65 Instead of the occ, the pos is set, incremented and reset.
66 - index s[plit]
67 sets word autosplit mode. Index entries are split into words according
68 to collation info; each word increments the pos, each field resets the
69 pos and increments or resets the occ as in fields mode.
70 If the index has no collation info, all characters but the well-known
71 ASCII non-letters are assumed to be word characters.
72 - index a[dd]|d[el]
73 sets add (default) or del mode for index entries.
74 - index m[fn] mfn
75 sets the mfn to use
76 - index n
77 where n is a number, sets max key length to n (default is 30)
78 - hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos] key
79 declares explicit hit data for key.
80 With a leading + or -, explicitly add or del the key regardless of mode.
81 Unspecified fields are determined as usual from the environment.
82 Since search results can be sorted by increasing pos,
83 this field can also be used to specify a weight
84 (starting from 0 for a 100% match).
85 - fst viewspec
86 declares a view line to create fields from the current data record, see
87 > Views
88
89 The index fields can also be used with a tag and value in their field value,
90 in which case the given mode applies to that tag and value only.
91 (TODO)
92
93 While view-based indexing is yet to be done (as of May 03),
94 application controlled indexing is working and robust
95 (see the save method in openIsis.tcl and sample.fsp).
96
97
98 * special values
99
100 The entries created according to the index record are then subject to a
101 > Collating collation transformation
102 to create index keys.
103
104 Finally they are handed to the B-Tree for insertion.
105 If the key is already there, it may have an associated special value,
106 which does not denote a hit, but rather processing instructions.
107 Special values are recognized by a mfn of 0.
108
109 - a completely blank value of all zeroes marks a stopword.
110 no other values for this key are added to the index.
111 (This is implemented).
112 - a value of mfn 0, first byte of tag 1, marks a ref.
113 The rest of the value and some successive values contain an alternate key,
114 under which the entry is made (and searched).
115 Since this has some difficulties, it will probably take some time
116 until this is implemented...
117
118 A multilingual (English, French, Italian, Spanish and Portuguese)
119 > http://www.bib.wau.nl/isis/docum.txt stopword file
120 can be loaded like
121 $
122 tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload
123 $
124
125 ---
126 $Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $

  ViewVC Help
Powered by ViewVC 1.1.26