current/doc/Collating.txt

This document discusses the issues involved in collating entries
for the index (inverted file).

Collating starts after preliminary steps of inverting like
selecting data from a record (or other sources), including
extraction of keywords and the like, which are described in
>       Inverting
.

Collating involves a string transformation, so that the resulting strings
-       are collation keys,
        i.e. yield the desired ordering on binary comparision (a la strcmp).
-       are normalized,
        i.e. some variants of the same contents are mapped to a canonical base

The same transformation is applied to both entries before written to the
index and keys before searched in the index.


A desired property of this transformation (mapping) is reversibility
to a legible representation of the original key,
so that index contents can be listed.
However, an alternative is to use the original record contents
for listing, where this can be easily identified (e.g. from a thesaurus).

Since collating involves looking up characters, it can also detect
word boundaries.


The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables
do not extend well to multibyte character sets like UNICODE,
so we explore alternatives.


*       basic collation

A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be
-       using toupper for normalization
-       using isalpha to detect word boundaries
-       using strxfrm to create collation keys

This fallback approach has several disadvantages:
-       normalization is very limited
-       it depends on a "locale" which is a global state of a process,
        so working with databases with different settings is not multithreading safe.
-       the dependency from the locale is not specified precisely
-       there is no standard way to customize settings
-       it is not revertible


*       full unicode collation

Using a library such as Big Blue's
>       http://oss.software.ibm.com/icu/userguide/Collate_Intro.html    ICU
, we can use the full unicode collation algorithm (
>       http://unicode.org/unicode/reports/tr10/        UCA
), including the customization options described there.

-       use the given encoding to convert input to unicode
-       uppercase and canonicalize input, especially decompose composite characters
-       remove most diacritics (but not the ring on the A when in sweden)
-       use the alphabetic character property to detect word boundaries
-       get a collation key

This approach is very powerful. It provides prepared data for virtually
all scripts, encodings and locales without the need for customization.
It provides for three or four levels of sorting
and can even handle french accent ordering
(where accents later in the otherwise same word have higher preceedence).
Sooner or later, OpenIsis will provide a means to link to ICU.

However, this still has a couple of disadvantages
-       it requires a library that alone is several times larger than OpenIsis.
        The used algorithms are very complex, so it can not be expected
        to perform all too well on low hardware ressources.
-       customization for special normalizations is difficult
-       reverting a compressed collation key might not always be possible,
        at least it is a pretty non-trivial task (and not supported by ICU)


*       isis custom collation

We devise a relatively simple scheme, that combines the mappings used for
normalization, collation key creation and word boundary detection into one step.
It does (by now) not provide multi-level ordering, since the discriminators
for secondary levels (like case and diacritics) are typically discarded anyway
during normalization.


Features are
-       efficiency
-       easy customization
-       independence of encoding
-       mapping of sequences


The mapping is specified by an isis record,
that could be stored in an separate file or be part of embedded header data.
In any way, the mappings are described using the actual characters,
in the encoding of the database, rather than code numbers.
This way, simple explicit mappings will survive a recoding
(ranges, however, may be broken).


Fields used in the customization are mapping rules,
containing one or more "items" separated by TAB characters.
An item is
-       a single character
        if it is a single character (sic!)
-       a literal sequence
        if the first one is a quote '"' (34)
-       a range
        is it consists of three characters, and the second one is a dash '-' (45)
-       an enumeration
        else. A dash at second position or quote at first position can be
        avoided by permutation or by using multiple rules.


The fields in the mapping (record) map the second and following items
to the first one. Let n the number of characters in the target (the first item),
m those of the source, where a sequence is regarded as a single character.
-       if n &gt;= m,
        the source characters are mapped to the first m corresponding
        target characters
-       if n &lt; m,
        the first n source characters are mapped to the corresponding
        target characters, other source characters are mapped to the last
        character of the target (or a blank, if n=0).
        Sequences of characters mapping to the last one are squeezed as with
        the -s option of the "tr" tool, i.e. produce only one character.

All but "map" fields describe a final rule,
which assigns a character class (letter,word,other,null)
and one or a range of collation values to the listed items.
Collation values are assigned in the order of rules (fields).

-       map
        do NOT assign a collation value, but process the result recursively. 
-       ignore
        assigns character class null and no collation value.
        ignored characters are silently discarded and don't break words
        (like soft hyphen, combining diacritical marks).
-       letter
        assigns character class letter, i.e. characters that are part of words.
-       word
        assigns character class word, i.e. characters that are themselves words.
-       other
        assigns character class other, i.e. characters that are not part of words
        and are discarded in word-split mode.

The builtin default rule maps all characters to a blank,
which thus always has the lowest collation value.

In any case, the longest matching sequence is used,
and map rules should (?) take preceedence over final rules.
TODO: detailled preceedence of conflicting rules.


The actual collation key is a sequence of collation values,
encoded as unsigned characters, if less than 256, or UTF-8 else.
(A fixed-number-of-bits encoding could be used alternatively).
The reverse mapping is simple: for each collation value,
there is a corresponding final rule that produced this value
or a range containing this value. The corresponding character
(or sequence) in the first item of this rule is used.


Simple example for traditional spanish collation:
$
letter  A-L     a-l
letter  LL      ll
letter  M-Z     m-z
map     AEIOUN  ÁÉÍÓÚÑ  áéíóúñ
$
1	dpavlin	237	This document discusses the issues involved in collating entries
2			for the index (inverted file).
3
4			Collating starts after preliminary steps of inverting like
5			selecting data from a record (or other sources), including
6			extraction of keywords and the like, which are described in
7			> Inverting
8			.
9
10			Collating involves a string transformation, so that the resulting strings
11			- are collation keys,
12			i.e. yield the desired ordering on binary comparision (a la strcmp).
13			- are normalized,
14			i.e. some variants of the same contents are mapped to a canonical base
15
16			The same transformation is applied to both entries before written to the
17			index and keys before searched in the index.
18
19
20			A desired property of this transformation (mapping) is reversibility
21			to a legible representation of the original key,
22			so that index contents can be listed.
23			However, an alternative is to use the original record contents
24			for listing, where this can be easily identified (e.g. from a thesaurus).
25
26			Since collating involves looking up characters, it can also detect
27			word boundaries.
28
29
30			The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables
31			do not extend well to multibyte character sets like UNICODE,
32			so we explore alternatives.
33
34
35			* basic collation
36
37			A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be
38			- using toupper for normalization
39			- using isalpha to detect word boundaries
40			- using strxfrm to create collation keys
41
42			This fallback approach has several disadvantages:
43			- normalization is very limited
44			- it depends on a "locale" which is a global state of a process,
45			so working with databases with different settings is not multithreading safe.
46			- the dependency from the locale is not specified precisely
47			- there is no standard way to customize settings
48			- it is not revertible
49
50
51			* full unicode collation
52
53			Using a library such as Big Blue's
54			> http://oss.software.ibm.com/icu/userguide/Collate_Intro.html ICU
55			, we can use the full unicode collation algorithm (
56			> http://unicode.org/unicode/reports/tr10/ UCA
57			), including the customization options described there.
58
59			- use the given encoding to convert input to unicode
60			- uppercase and canonicalize input, especially decompose composite characters
61			- remove most diacritics (but not the ring on the A when in sweden)
62			- use the alphabetic character property to detect word boundaries
63			- get a collation key
64
65			This approach is very powerful. It provides prepared data for virtually
66			all scripts, encodings and locales without the need for customization.
67			It provides for three or four levels of sorting
68			and can even handle french accent ordering
69			(where accents later in the otherwise same word have higher preceedence).
70			Sooner or later, OpenIsis will provide a means to link to ICU.
71
72			However, this still has a couple of disadvantages
73			- it requires a library that alone is several times larger than OpenIsis.
74			The used algorithms are very complex, so it can not be expected
75			to perform all too well on low hardware ressources.
76			- customization for special normalizations is difficult
77			- reverting a compressed collation key might not always be possible,
78			at least it is a pretty non-trivial task (and not supported by ICU)
79
80
81			* isis custom collation
82
83			We devise a relatively simple scheme, that combines the mappings used for
84			normalization, collation key creation and word boundary detection into one step.
85			It does (by now) not provide multi-level ordering, since the discriminators
86			for secondary levels (like case and diacritics) are typically discarded anyway
87			during normalization.
88
89
90			Features are
91			- efficiency
92			- easy customization
93			- independence of encoding
94			- mapping of sequences
95
96
97			The mapping is specified by an isis record,
98			that could be stored in an separate file or be part of embedded header data.
99			In any way, the mappings are described using the actual characters,
100			in the encoding of the database, rather than code numbers.
101			This way, simple explicit mappings will survive a recoding
102			(ranges, however, may be broken).
103
104
105			Fields used in the customization are mapping rules,
106			containing one or more "items" separated by TAB characters.
107			An item is
108			- a single character
109			if it is a single character (sic!)
110			- a literal sequence
111			if the first one is a quote '"' (34)
112			- a range
113			is it consists of three characters, and the second one is a dash '-' (45)
114			- an enumeration
115			else. A dash at second position or quote at first position can be
116			avoided by permutation or by using multiple rules.
117
118
119			The fields in the mapping (record) map the second and following items
120			to the first one. Let n the number of characters in the target (the first item),
121			m those of the source, where a sequence is regarded as a single character.
122			- if n >= m,
123			the source characters are mapped to the first m corresponding
124			target characters
125			- if n < m,
126			the first n source characters are mapped to the corresponding
127			target characters, other source characters are mapped to the last
128			character of the target (or a blank, if n=0).
129			Sequences of characters mapping to the last one are squeezed as with
130			the -s option of the "tr" tool, i.e. produce only one character.
131
132			All but "map" fields describe a final rule,
133			which assigns a character class (letter,word,other,null)
134			and one or a range of collation values to the listed items.
135			Collation values are assigned in the order of rules (fields).
136
137			- map
138			do NOT assign a collation value, but process the result recursively.
139			- ignore
140			assigns character class null and no collation value.
141			ignored characters are silently discarded and don't break words
142			(like soft hyphen, combining diacritical marks).
143			- letter
144			assigns character class letter, i.e. characters that are part of words.
145			- word
146			assigns character class word, i.e. characters that are themselves words.
147			- other
148			assigns character class other, i.e. characters that are not part of words
149			and are discarded in word-split mode.
150
151			The builtin default rule maps all characters to a blank,
152			which thus always has the lowest collation value.
153
154			In any case, the longest matching sequence is used,
155			and map rules should (?) take preceedence over final rules.
156			TODO: detailled preceedence of conflicting rules.
157
158
159			The actual collation key is a sequence of collation values,
160			encoded as unsigned characters, if less than 256, or UTF-8 else.
161			(A fixed-number-of-bits encoding could be used alternatively).
162			The reverse mapping is simple: for each collation value,
163			there is a corresponding final rule that produced this value
164			or a range containing this value. The corresponding character
165			(or sequence) in the first item of this rule is used.
166
167
168			Simple example for traditional spanish collation:
169			$
170			letter A-L a-l
171			letter LL ll
172			letter M-Z m-z
173			map AEIOUN ÁÉÍÓÚÑ áéíóúñ
174			$