1 |
This document discusses the issues involved in collating entries |
2 |
for the index (inverted file). |
3 |
|
4 |
Collating starts after preliminary steps of inverting like |
5 |
selecting data from a record (or other sources), including |
6 |
extraction of keywords and the like, which are described in |
7 |
> Inverting |
8 |
. |
9 |
|
10 |
Collating involves a string transformation, so that the resulting strings |
11 |
- are collation keys, |
12 |
i.e. yield the desired ordering on binary comparision (a la strcmp). |
13 |
- are normalized, |
14 |
i.e. some variants of the same contents are mapped to a canonical base |
15 |
|
16 |
The same transformation is applied to both entries before written to the |
17 |
index and keys before searched in the index. |
18 |
|
19 |
|
20 |
A desired property of this transformation (mapping) is reversibility |
21 |
to a legible representation of the original key, |
22 |
so that index contents can be listed. |
23 |
However, an alternative is to use the original record contents |
24 |
for listing, where this can be easily identified (e.g. from a thesaurus). |
25 |
|
26 |
Since collating involves looking up characters, it can also detect |
27 |
word boundaries. |
28 |
|
29 |
|
30 |
The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables |
31 |
do not extend well to multibyte character sets like UNICODE, |
32 |
so we explore alternatives. |
33 |
|
34 |
|
35 |
* basic collation |
36 |
|
37 |
A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be |
38 |
- using toupper for normalization |
39 |
- using isalpha to detect word boundaries |
40 |
- using strxfrm to create collation keys |
41 |
|
42 |
This fallback approach has several disadvantages: |
43 |
- normalization is very limited |
44 |
- it depends on a "locale" which is a global state of a process, |
45 |
so working with databases with different settings is not multithreading safe. |
46 |
- the dependency from the locale is not specified precisely |
47 |
- there is no standard way to customize settings |
48 |
- it is not revertible |
49 |
|
50 |
|
51 |
* full unicode collation |
52 |
|
53 |
Using a library such as Big Blue's |
54 |
> http://oss.software.ibm.com/icu/userguide/Collate_Intro.html ICU |
55 |
, we can use the full unicode collation algorithm ( |
56 |
> http://unicode.org/unicode/reports/tr10/ UCA |
57 |
), including the customization options described there. |
58 |
|
59 |
- use the given encoding to convert input to unicode |
60 |
- uppercase and canonicalize input, especially decompose composite characters |
61 |
- remove most diacritics (but not the ring on the A when in sweden) |
62 |
- use the alphabetic character property to detect word boundaries |
63 |
- get a collation key |
64 |
|
65 |
This approach is very powerful. It provides prepared data for virtually |
66 |
all scripts, encodings and locales without the need for customization. |
67 |
It provides for three or four levels of sorting |
68 |
and can even handle french accent ordering |
69 |
(where accents later in the otherwise same word have higher preceedence). |
70 |
Sooner or later, OpenIsis will provide a means to link to ICU. |
71 |
|
72 |
However, this still has a couple of disadvantages |
73 |
- it requires a library that alone is several times larger than OpenIsis. |
74 |
The used algorithms are very complex, so it can not be expected |
75 |
to perform all too well on low hardware ressources. |
76 |
- customization for special normalizations is difficult |
77 |
- reverting a compressed collation key might not always be possible, |
78 |
at least it is a pretty non-trivial task (and not supported by ICU) |
79 |
|
80 |
|
81 |
* isis custom collation |
82 |
|
83 |
We devise a relatively simple scheme, that combines the mappings used for |
84 |
normalization, collation key creation and word boundary detection into one step. |
85 |
It does (by now) not provide multi-level ordering, since the discriminators |
86 |
for secondary levels (like case and diacritics) are typically discarded anyway |
87 |
during normalization. |
88 |
|
89 |
|
90 |
Features are |
91 |
- efficiency |
92 |
- easy customization |
93 |
- independence of encoding |
94 |
- mapping of sequences |
95 |
|
96 |
|
97 |
The mapping is specified by an isis record, |
98 |
that could be stored in an separate file or be part of embedded header data. |
99 |
In any way, the mappings are described using the actual characters, |
100 |
in the encoding of the database, rather than code numbers. |
101 |
This way, simple explicit mappings will survive a recoding |
102 |
(ranges, however, may be broken). |
103 |
|
104 |
|
105 |
Fields used in the customization are mapping rules, |
106 |
containing one or more "items" separated by TAB characters. |
107 |
An item is |
108 |
- a single character |
109 |
if it is a single character (sic!) |
110 |
- a literal sequence |
111 |
if the first one is a quote '"' (34) |
112 |
- a range |
113 |
is it consists of three characters, and the second one is a dash '-' (45) |
114 |
- an enumeration |
115 |
else. A dash at second position or quote at first position can be |
116 |
avoided by permutation or by using multiple rules. |
117 |
|
118 |
|
119 |
The fields in the mapping (record) map the second and following items |
120 |
to the first one. Let n the number of characters in the target (the first item), |
121 |
m those of the source, where a sequence is regarded as a single character. |
122 |
- if n >= m, |
123 |
the source characters are mapped to the first m corresponding |
124 |
target characters |
125 |
- if n < m, |
126 |
the first n source characters are mapped to the corresponding |
127 |
target characters, other source characters are mapped to the last |
128 |
character of the target (or a blank, if n=0). |
129 |
Sequences of characters mapping to the last one are squeezed as with |
130 |
the -s option of the "tr" tool, i.e. produce only one character. |
131 |
|
132 |
All but "map" fields describe a final rule, |
133 |
which assigns a character class (letter,word,other,null) |
134 |
and one or a range of collation values to the listed items. |
135 |
Collation values are assigned in the order of rules (fields). |
136 |
|
137 |
- map |
138 |
do NOT assign a collation value, but process the result recursively. |
139 |
- ignore |
140 |
assigns character class null and no collation value. |
141 |
ignored characters are silently discarded and don't break words |
142 |
(like soft hyphen, combining diacritical marks). |
143 |
- letter |
144 |
assigns character class letter, i.e. characters that are part of words. |
145 |
- word |
146 |
assigns character class word, i.e. characters that are themselves words. |
147 |
- other |
148 |
assigns character class other, i.e. characters that are not part of words |
149 |
and are discarded in word-split mode. |
150 |
|
151 |
The builtin default rule maps all characters to a blank, |
152 |
which thus always has the lowest collation value. |
153 |
|
154 |
In any case, the longest matching sequence is used, |
155 |
and map rules should (?) take preceedence over final rules. |
156 |
TODO: detailled preceedence of conflicting rules. |
157 |
|
158 |
|
159 |
The actual collation key is a sequence of collation values, |
160 |
encoded as unsigned characters, if less than 256, or UTF-8 else. |
161 |
(A fixed-number-of-bits encoding could be used alternatively). |
162 |
The reverse mapping is simple: for each collation value, |
163 |
there is a corresponding final rule that produced this value |
164 |
or a range containing this value. The corresponding character |
165 |
(or sequence) in the first item of this rule is used. |
166 |
|
167 |
|
168 |
Simple example for traditional spanish collation: |
169 |
$ |
170 |
letter A-L a-l |
171 |
letter LL ll |
172 |
letter M-Z m-z |
173 |
map AEIOUN ÁÉÍÓÚÑ áéíóúñ |
174 |
$ |