/[webpac]/openisis/current/doc/Inverting.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Annotation of /openisis/current/doc/Inverting.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 237 - (hide annotations)
Mon Mar 8 17:43:12 2004 UTC (20 years, 1 month ago) by dpavlin
File MIME type: text/plain
File size: 5102 byte(s)
initial import of openisis 0.9.0 vendor drop

1 dpavlin 237 This document discusses various issues of indexing
2     (creating entries for the inverted file).
3    
4    
5     * the inverted file
6    
7     The inverted file maps keys to (sorted) sequences of values.
8     Although an application can store arbitrary information in values,
9     in standard use, they denote a hit (where the key was found in a record).
10     A hit is a structure of five numbers, specifying (in that order)
11     a database number, a rowid (mfn), tag, occurence (of this tag in it's record)
12     and position (of this "word" in it's field).
13    
14     The inverted file is implemented as a B-Link-Tree (Lehmann/Yao)
15     with variable length keys of up to 255 bytes.
16     The length of values is fixed for a given index, but can be configured
17     on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n).
18     The 8 byte minimum configuration (corresponding to the traditional IFP) uses
19     - 2 bytes pos
20     - 1 byte occ
21     - 2 bytes tag
22     - 3 bytes mfn
23     A ninth byte is used for the mfn, bytes 10 and 11 for a database number,
24     additional bytes have no standard usage in hits (by now).
25    
26     Occurence and position are typically used for "near" conditions in queries.
27     Where this is not needed, they could also be used, for example,
28     to apply weights to the hits.
29    
30    
31     * creation of index entries
32    
33     Traditional ISIS systems based indexing on
34     - a "print format" to create text from a record
35     - an "indexing technique" to extract entries.
36     The various indexing techniques split the text at newlines, subfields,
37     word boundaries, angle brackets or other delimiters.
38    
39     OpenIsis uses a different approach based on an "index record",
40     which is created either explicitly by application code or using
41     > Views a view.
42     Most split up like extracting keywords from between angle brackets
43     or other delimiters is done during creation of this record,
44     containing, in the simplest case, one field per index entry.
45    
46     The only additional transformation that may be applied
47     is word split to be done during
48     > Collating
49     , because the definition of which characters constitute words is
50     configured together with the collation info.
51    
52    
53     In the index record, fields with a non-negative tag are treated as
54     index entries according to the current mode.
55     All fields with negative tags give processing instructions as follows
56     (denoted with symbolic tag names):
57     - index f[ields] [startocc]
58     an xmode field with value 'f' (or some prefix of 'fields') sets default
59     indexing mode, where each value (of a non-negative field) is used as
60     literal index entry. The occurence is set to startocc or 0,
61     incremented for runs of fields with the same tag,
62     and reset to 0 on tag change.
63     - index w[ords] [startpos]
64     sets word mode, assuming fields to contain pre-split words.
65     Instead of the occ, the pos is set, incremented and reset.
66     - index s[plit]
67     sets word autosplit mode. Index entries are split into words according
68     to collation info; each word increments the pos, each field resets the
69     pos and increments or resets the occ as in fields mode.
70     If the index has no collation info, all characters but the well-known
71     ASCII non-letters are assumed to be word characters.
72     - index a[dd]|d[el]
73     sets add (default) or del mode for index entries.
74     - index m[fn] mfn
75     sets the mfn to use
76     - index n
77     where n is a number, sets max key length to n (default is 30)
78     - hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos] key
79     declares explicit hit data for key.
80     With a leading + or -, explicitly add or del the key regardless of mode.
81     Unspecified fields are determined as usual from the environment.
82     Since search results can be sorted by increasing pos,
83     this field can also be used to specify a weight
84     (starting from 0 for a 100% match).
85     - fst viewspec
86     declares a view line to create fields from the current data record, see
87     > Views
88    
89     The index fields can also be used with a tag and value in their field value,
90     in which case the given mode applies to that tag and value only.
91     (TODO)
92    
93     While view-based indexing is yet to be done (as of May 03),
94     application controlled indexing is working and robust
95     (see the save method in openIsis.tcl and sample.fsp).
96    
97    
98     * special values
99    
100     The entries created according to the index record are then subject to a
101     > Collating collation transformation
102     to create index keys.
103    
104     Finally they are handed to the B-Tree for insertion.
105     If the key is already there, it may have an associated special value,
106     which does not denote a hit, but rather processing instructions.
107     Special values are recognized by a mfn of 0.
108    
109     - a completely blank value of all zeroes marks a stopword.
110     no other values for this key are added to the index.
111     (This is implemented).
112     - a value of mfn 0, first byte of tag 1, marks a ref.
113     The rest of the value and some successive values contain an alternate key,
114     under which the entry is made (and searched).
115     Since this has some difficulties, it will probably take some time
116     until this is implemented...
117    
118     A multilingual (English, French, Italian, Spanish and Portuguese)
119     > http://www.bib.wau.nl/isis/docum.txt stopword file
120     can be loaded like
121     $
122     tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload
123     $
124    
125     ---
126     $Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $

  ViewVC Help
Powered by ViewVC 1.1.26