/[webpac]/trunk/openisis/doc/Collating.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /trunk/openisis/doc/Collating.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 239 - (show annotations)
Mon Mar 8 17:49:13 2004 UTC (20 years, 1 month ago) by dpavlin
File MIME type: text/plain
File size: 6656 byte(s)
including openisis 0.9.0 into webpac tree

1 This document discusses the issues involved in collating entries
2 for the index (inverted file).
3
4 Collating starts after preliminary steps of inverting like
5 selecting data from a record (or other sources), including
6 extraction of keywords and the like, which are described in
7 > Inverting
8 .
9
10 Collating involves a string transformation, so that the resulting strings
11 - are collation keys,
12 i.e. yield the desired ordering on binary comparision (a la strcmp).
13 - are normalized,
14 i.e. some variants of the same contents are mapped to a canonical base
15
16 The same transformation is applied to both entries before written to the
17 index and keys before searched in the index.
18
19
20 A desired property of this transformation (mapping) is reversibility
21 to a legible representation of the original key,
22 so that index contents can be listed.
23 However, an alternative is to use the original record contents
24 for listing, where this can be easily identified (e.g. from a thesaurus).
25
26 Since collating involves looking up characters, it can also detect
27 word boundaries.
28
29
30 The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables
31 do not extend well to multibyte character sets like UNICODE,
32 so we explore alternatives.
33
34
35 * basic collation
36
37 A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be
38 - using toupper for normalization
39 - using isalpha to detect word boundaries
40 - using strxfrm to create collation keys
41
42 This fallback approach has several disadvantages:
43 - normalization is very limited
44 - it depends on a "locale" which is a global state of a process,
45 so working with databases with different settings is not multithreading safe.
46 - the dependency from the locale is not specified precisely
47 - there is no standard way to customize settings
48 - it is not revertible
49
50
51 * full unicode collation
52
53 Using a library such as Big Blue's
54 > http://oss.software.ibm.com/icu/userguide/Collate_Intro.html ICU
55 , we can use the full unicode collation algorithm (
56 > http://unicode.org/unicode/reports/tr10/ UCA
57 ), including the customization options described there.
58
59 - use the given encoding to convert input to unicode
60 - uppercase and canonicalize input, especially decompose composite characters
61 - remove most diacritics (but not the ring on the A when in sweden)
62 - use the alphabetic character property to detect word boundaries
63 - get a collation key
64
65 This approach is very powerful. It provides prepared data for virtually
66 all scripts, encodings and locales without the need for customization.
67 It provides for three or four levels of sorting
68 and can even handle french accent ordering
69 (where accents later in the otherwise same word have higher preceedence).
70 Sooner or later, OpenIsis will provide a means to link to ICU.
71
72 However, this still has a couple of disadvantages
73 - it requires a library that alone is several times larger than OpenIsis.
74 The used algorithms are very complex, so it can not be expected
75 to perform all too well on low hardware ressources.
76 - customization for special normalizations is difficult
77 - reverting a compressed collation key might not always be possible,
78 at least it is a pretty non-trivial task (and not supported by ICU)
79
80
81 * isis custom collation
82
83 We devise a relatively simple scheme, that combines the mappings used for
84 normalization, collation key creation and word boundary detection into one step.
85 It does (by now) not provide multi-level ordering, since the discriminators
86 for secondary levels (like case and diacritics) are typically discarded anyway
87 during normalization.
88
89
90 Features are
91 - efficiency
92 - easy customization
93 - independence of encoding
94 - mapping of sequences
95
96
97 The mapping is specified by an isis record,
98 that could be stored in an separate file or be part of embedded header data.
99 In any way, the mappings are described using the actual characters,
100 in the encoding of the database, rather than code numbers.
101 This way, simple explicit mappings will survive a recoding
102 (ranges, however, may be broken).
103
104
105 Fields used in the customization are mapping rules,
106 containing one or more "items" separated by TAB characters.
107 An item is
108 - a single character
109 if it is a single character (sic!)
110 - a literal sequence
111 if the first one is a quote '"' (34)
112 - a range
113 is it consists of three characters, and the second one is a dash '-' (45)
114 - an enumeration
115 else. A dash at second position or quote at first position can be
116 avoided by permutation or by using multiple rules.
117
118
119 The fields in the mapping (record) map the second and following items
120 to the first one. Let n the number of characters in the target (the first item),
121 m those of the source, where a sequence is regarded as a single character.
122 - if n >= m,
123 the source characters are mapped to the first m corresponding
124 target characters
125 - if n < m,
126 the first n source characters are mapped to the corresponding
127 target characters, other source characters are mapped to the last
128 character of the target (or a blank, if n=0).
129 Sequences of characters mapping to the last one are squeezed as with
130 the -s option of the "tr" tool, i.e. produce only one character.
131
132 All but "map" fields describe a final rule,
133 which assigns a character class (letter,word,other,null)
134 and one or a range of collation values to the listed items.
135 Collation values are assigned in the order of rules (fields).
136
137 - map
138 do NOT assign a collation value, but process the result recursively.
139 - ignore
140 assigns character class null and no collation value.
141 ignored characters are silently discarded and don't break words
142 (like soft hyphen, combining diacritical marks).
143 - letter
144 assigns character class letter, i.e. characters that are part of words.
145 - word
146 assigns character class word, i.e. characters that are themselves words.
147 - other
148 assigns character class other, i.e. characters that are not part of words
149 and are discarded in word-split mode.
150
151 The builtin default rule maps all characters to a blank,
152 which thus always has the lowest collation value.
153
154 In any case, the longest matching sequence is used,
155 and map rules should (?) take preceedence over final rules.
156 TODO: detailled preceedence of conflicting rules.
157
158
159 The actual collation key is a sequence of collation values,
160 encoded as unsigned characters, if less than 256, or UTF-8 else.
161 (A fixed-number-of-bits encoding could be used alternatively).
162 The reverse mapping is simple: for each collation value,
163 there is a corresponding final rule that produced this value
164 or a range containing this value. The corresponding character
165 (or sequence) in the first item of this rule is used.
166
167
168 Simple example for traditional spanish collation:
169 $
170 letter A-L a-l
171 letter LL ll
172 letter M-Z m-z
173 map AEIOUN ÁÉÍÓÚÑ áéíóúñ
174 $

  ViewVC Help
Powered by ViewVC 1.1.26