1 |
dpavlin |
237 |
This document discusses the issues involved in collating entries |
2 |
|
|
for the index (inverted file). |
3 |
|
|
|
4 |
|
|
Collating starts after preliminary steps of inverting like |
5 |
|
|
selecting data from a record (or other sources), including |
6 |
|
|
extraction of keywords and the like, which are described in |
7 |
|
|
> Inverting |
8 |
|
|
. |
9 |
|
|
|
10 |
|
|
Collating involves a string transformation, so that the resulting strings |
11 |
|
|
- are collation keys, |
12 |
|
|
i.e. yield the desired ordering on binary comparision (a la strcmp). |
13 |
|
|
- are normalized, |
14 |
|
|
i.e. some variants of the same contents are mapped to a canonical base |
15 |
|
|
|
16 |
|
|
The same transformation is applied to both entries before written to the |
17 |
|
|
index and keys before searched in the index. |
18 |
|
|
|
19 |
|
|
|
20 |
|
|
A desired property of this transformation (mapping) is reversibility |
21 |
|
|
to a legible representation of the original key, |
22 |
|
|
so that index contents can be listed. |
23 |
|
|
However, an alternative is to use the original record contents |
24 |
|
|
for listing, where this can be easily identified (e.g. from a thesaurus). |
25 |
|
|
|
26 |
|
|
Since collating involves looking up characters, it can also detect |
27 |
|
|
word boundaries. |
28 |
|
|
|
29 |
|
|
|
30 |
|
|
The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables |
31 |
|
|
do not extend well to multibyte character sets like UNICODE, |
32 |
|
|
so we explore alternatives. |
33 |
|
|
|
34 |
|
|
|
35 |
|
|
* basic collation |
36 |
|
|
|
37 |
|
|
A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be |
38 |
|
|
- using toupper for normalization |
39 |
|
|
- using isalpha to detect word boundaries |
40 |
|
|
- using strxfrm to create collation keys |
41 |
|
|
|
42 |
|
|
This fallback approach has several disadvantages: |
43 |
|
|
- normalization is very limited |
44 |
|
|
- it depends on a "locale" which is a global state of a process, |
45 |
|
|
so working with databases with different settings is not multithreading safe. |
46 |
|
|
- the dependency from the locale is not specified precisely |
47 |
|
|
- there is no standard way to customize settings |
48 |
|
|
- it is not revertible |
49 |
|
|
|
50 |
|
|
|
51 |
|
|
* full unicode collation |
52 |
|
|
|
53 |
|
|
Using a library such as Big Blue's |
54 |
|
|
> http://oss.software.ibm.com/icu/userguide/Collate_Intro.html ICU |
55 |
|
|
, we can use the full unicode collation algorithm ( |
56 |
|
|
> http://unicode.org/unicode/reports/tr10/ UCA |
57 |
|
|
), including the customization options described there. |
58 |
|
|
|
59 |
|
|
- use the given encoding to convert input to unicode |
60 |
|
|
- uppercase and canonicalize input, especially decompose composite characters |
61 |
|
|
- remove most diacritics (but not the ring on the A when in sweden) |
62 |
|
|
- use the alphabetic character property to detect word boundaries |
63 |
|
|
- get a collation key |
64 |
|
|
|
65 |
|
|
This approach is very powerful. It provides prepared data for virtually |
66 |
|
|
all scripts, encodings and locales without the need for customization. |
67 |
|
|
It provides for three or four levels of sorting |
68 |
|
|
and can even handle french accent ordering |
69 |
|
|
(where accents later in the otherwise same word have higher preceedence). |
70 |
|
|
Sooner or later, OpenIsis will provide a means to link to ICU. |
71 |
|
|
|
72 |
|
|
However, this still has a couple of disadvantages |
73 |
|
|
- it requires a library that alone is several times larger than OpenIsis. |
74 |
|
|
The used algorithms are very complex, so it can not be expected |
75 |
|
|
to perform all too well on low hardware ressources. |
76 |
|
|
- customization for special normalizations is difficult |
77 |
|
|
- reverting a compressed collation key might not always be possible, |
78 |
|
|
at least it is a pretty non-trivial task (and not supported by ICU) |
79 |
|
|
|
80 |
|
|
|
81 |
|
|
* isis custom collation |
82 |
|
|
|
83 |
|
|
We devise a relatively simple scheme, that combines the mappings used for |
84 |
|
|
normalization, collation key creation and word boundary detection into one step. |
85 |
|
|
It does (by now) not provide multi-level ordering, since the discriminators |
86 |
|
|
for secondary levels (like case and diacritics) are typically discarded anyway |
87 |
|
|
during normalization. |
88 |
|
|
|
89 |
|
|
|
90 |
|
|
Features are |
91 |
|
|
- efficiency |
92 |
|
|
- easy customization |
93 |
|
|
- independence of encoding |
94 |
|
|
- mapping of sequences |
95 |
|
|
|
96 |
|
|
|
97 |
|
|
The mapping is specified by an isis record, |
98 |
|
|
that could be stored in an separate file or be part of embedded header data. |
99 |
|
|
In any way, the mappings are described using the actual characters, |
100 |
|
|
in the encoding of the database, rather than code numbers. |
101 |
|
|
This way, simple explicit mappings will survive a recoding |
102 |
|
|
(ranges, however, may be broken). |
103 |
|
|
|
104 |
|
|
|
105 |
|
|
Fields used in the customization are mapping rules, |
106 |
|
|
containing one or more "items" separated by TAB characters. |
107 |
|
|
An item is |
108 |
|
|
- a single character |
109 |
|
|
if it is a single character (sic!) |
110 |
|
|
- a literal sequence |
111 |
|
|
if the first one is a quote '"' (34) |
112 |
|
|
- a range |
113 |
|
|
is it consists of three characters, and the second one is a dash '-' (45) |
114 |
|
|
- an enumeration |
115 |
|
|
else. A dash at second position or quote at first position can be |
116 |
|
|
avoided by permutation or by using multiple rules. |
117 |
|
|
|
118 |
|
|
|
119 |
|
|
The fields in the mapping (record) map the second and following items |
120 |
|
|
to the first one. Let n the number of characters in the target (the first item), |
121 |
|
|
m those of the source, where a sequence is regarded as a single character. |
122 |
|
|
- if n >= m, |
123 |
|
|
the source characters are mapped to the first m corresponding |
124 |
|
|
target characters |
125 |
|
|
- if n < m, |
126 |
|
|
the first n source characters are mapped to the corresponding |
127 |
|
|
target characters, other source characters are mapped to the last |
128 |
|
|
character of the target (or a blank, if n=0). |
129 |
|
|
Sequences of characters mapping to the last one are squeezed as with |
130 |
|
|
the -s option of the "tr" tool, i.e. produce only one character. |
131 |
|
|
|
132 |
|
|
All but "map" fields describe a final rule, |
133 |
|
|
which assigns a character class (letter,word,other,null) |
134 |
|
|
and one or a range of collation values to the listed items. |
135 |
|
|
Collation values are assigned in the order of rules (fields). |
136 |
|
|
|
137 |
|
|
- map |
138 |
|
|
do NOT assign a collation value, but process the result recursively. |
139 |
|
|
- ignore |
140 |
|
|
assigns character class null and no collation value. |
141 |
|
|
ignored characters are silently discarded and don't break words |
142 |
|
|
(like soft hyphen, combining diacritical marks). |
143 |
|
|
- letter |
144 |
|
|
assigns character class letter, i.e. characters that are part of words. |
145 |
|
|
- word |
146 |
|
|
assigns character class word, i.e. characters that are themselves words. |
147 |
|
|
- other |
148 |
|
|
assigns character class other, i.e. characters that are not part of words |
149 |
|
|
and are discarded in word-split mode. |
150 |
|
|
|
151 |
|
|
The builtin default rule maps all characters to a blank, |
152 |
|
|
which thus always has the lowest collation value. |
153 |
|
|
|
154 |
|
|
In any case, the longest matching sequence is used, |
155 |
|
|
and map rules should (?) take preceedence over final rules. |
156 |
|
|
TODO: detailled preceedence of conflicting rules. |
157 |
|
|
|
158 |
|
|
|
159 |
|
|
The actual collation key is a sequence of collation values, |
160 |
|
|
encoded as unsigned characters, if less than 256, or UTF-8 else. |
161 |
|
|
(A fixed-number-of-bits encoding could be used alternatively). |
162 |
|
|
The reverse mapping is simple: for each collation value, |
163 |
|
|
there is a corresponding final rule that produced this value |
164 |
|
|
or a range containing this value. The corresponding character |
165 |
|
|
(or sequence) in the first item of this rule is used. |
166 |
|
|
|
167 |
|
|
|
168 |
|
|
Simple example for traditional spanish collation: |
169 |
|
|
$ |
170 |
|
|
letter A-L a-l |
171 |
|
|
letter LL ll |
172 |
|
|
letter M-Z m-z |
173 |
|
|
map AEIOUN ÁÉÍÓÚÑ áéíóúñ |
174 |
|
|
$ |