openisis/doc/Collating.txt

This document discusses the issues involved in collating entries
for the index (inverted file).

Collating starts after preliminary steps of inverting like
selecting data from a record (or other sources), including
extraction of keywords and the like, which are described in
>       Inverting
.

Collating involves a string transformation, so that the resulting strings
-       are collation keys,
        i.e. yield the desired ordering under binary comparison (a la strcmp).
-       are normalized,
        i.e. variants of the same content are mapped to a canonical base form.

The same transformation is applied both to entries before they are written
to the index and to keys before they are looked up in the index.


A desired property of this transformation (mapping) is reversibility
to a legible representation of the original key,
so that index contents can be listed.
However, an alternative is to use the original record contents
for listing, where this can be easily identified (e.g. from a thesaurus).

Since collating involves looking up characters, it can also detect
word boundaries.


The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables
do not extend well to multibyte character sets like Unicode,
so we explore alternatives.


*       basic collation

A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be
-       using toupper for normalization
-       using isalpha to detect word boundaries
-       using strxfrm to create collation keys

This fallback approach has several disadvantages:
-       normalization is very limited
-       it depends on a "locale" which is a global state of a process,
        so working with databases with different settings is not thread-safe.
-       the dependency on the locale is not specified precisely
-       there is no standard way to customize settings
-       it is not reversible
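
As a rough illustration of the three steps listed above, the following
Python sketch uses the standard locale module, whose strxfrm wraps the
C function of the same name. It is only a sketch under assumptions:
Python's str.upper and str.isalpha stand in for toupper and isalpha and,
unlike their C counterparts, do not depend on the locale, and the name
basic_keys is made up for this example.

```python
import locale

def basic_keys(text):
    """Split text into words and return (word, collation key) pairs."""
    words, current = [], []
    for ch in text:
        if ch.isalpha():                  # letters belong to a word
            current.append(ch.upper())    # uppercasing is the only normalization
        elif current:                     # any other character ends the word
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    # strxfrm yields keys that order like strcoll under the current locale
    return [(w, locale.strxfrm(w)) for w in words]

# The locale is process-wide state, which is the thread-safety issue noted above.
locale.setlocale(locale.LC_COLLATE, "C")
pairs = basic_keys("zebra, Apple; mango")
ordered = [word for word, key in sorted(pairs, key=lambda p: p[1])]
```

Note that the keys returned by strxfrm are opaque: nothing in the
interface allows the original word to be recovered from them.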


*       full unicode collation

Using a library such as Big Blue's
>       http://oss.software.ibm.com/icu/userguide/Collate_Intro.html    ICU
, we can use the full unicode collation algorithm (
>       http://unicode.org/unicode/reports/tr10/        UCA
), including the customization options described there.

-       use the given encoding to convert input to Unicode
-       uppercase and canonicalize input, especially decompose composite characters
-       remove most diacritics (but not the ring on the A when in Sweden)
-       use the alphabetic character property to detect word boundaries
-       get a collation key

This approach is very powerful. It provides prepared data for virtually
all scripts, encodings and locales without the need for customization.
It provides for three or four levels of sorting
and can even handle French accent ordering
(where accents later in the otherwise same word have higher precedence).
Sooner or later, OpenIsis will provide a means to link to ICU.
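
The normalization and word detection steps listed above can be
approximated with Python's standard unicodedata module, as in the sketch
below. This is not the Unicode collation algorithm and does not use ICU:
it produces no collation keys, the "alphabetic" property is approximated
by the general categories starting with L, and the function names and the
keep parameter are assumptions made for this example.

```python
import unicodedata

def normalize(text, keep=("Å",)):
    """Uppercase, decompose, and drop combining marks except for kept letters."""
    out = []
    # NFC first, so that kept letters are recognized in precomposed form
    for ch in unicodedata.normalize("NFC", text.upper()):
        if ch in keep:
            out.append(ch)                # e.g. keep the ring on the A for Swedish
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # combining() is non-zero for combining diacritical marks
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

def words(text):
    """Split text at every character that is not a letter (category L*)."""
    result, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            current.append(ch)
        elif current:
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result
```

For example, words(normalize("Ångström café")) keeps the ring on the A
but drops the diaeresis and the acute accent.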

However, the ICU approach still has a couple of disadvantages:
-       it requires a library that alone is several times larger than OpenIsis.
        The algorithms used are very complex, so it cannot be expected
        to perform well on limited hardware resources.
-       customization for special normalizations is difficult
-       reverting a compressed collation key might not always be possible,
        at least it is a pretty non-trivial task (and not supported by ICU)


*       isis custom collation

We devise a relatively simple scheme that combines the mappings used for
normalization, collation key creation and word boundary detection into one step.
It does not (for now) provide multi-level ordering, since the discriminators
for secondary levels (like case and diacritics) are typically discarded anyway
during normalization.


Features are
-       efficiency
-       easy customization
-       independence of encoding
-       mapping of sequences


The mapping is specified by an isis record,
which could be stored in a separate file or be part of embedded header data.
In either case, the mappings are described using the actual characters,
in the encoding of the database, rather than code numbers.
This way, simple explicit mappings will survive a recoding
(ranges, however, may be broken).


Fields used in the customization are mapping rules,
containing one or more "items" separated by TAB characters.
An item is
-       a single character
        if it is a single character (sic!)
-       a literal sequence
        if the first character is a quote '"' (34)
-       a range
        if it consists of three characters, and the second one is a dash '-' (45)
-       an enumeration
        otherwise. A dash at second position or quote at first position can be
        avoided by permutation or by using multiple rules.
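
The item classification above can be sketched in a few lines of Python.
The function names are hypothetical, and two points are assumptions rather
than part of the specification: only a leading quote marks a sequence
(a closing quote is not handled), and a range is expanded by code value.

```python
def parse_item(item):
    """Classify one item and expand it into a list of units."""
    if len(item) == 1:
        return "char", [item]
    if item.startswith('"'):
        return "sequence", [item[1:]]     # the whole sequence is one unit
    if len(item) == 3 and item[1] == "-":
        lo, hi = ord(item[0]), ord(item[2])
        return "range", [chr(c) for c in range(lo, hi + 1)]
    return "enumeration", list(item)      # every character is a unit

def parse_rule(field):
    """Split the contents of a rule field into its TAB-separated items."""
    return [parse_item(item) for item in field.split("\t")]
```

Because the single-character test comes first, a lone quote or dash is
an ordinary character.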


The fields in the mapping (record) map the second and following items
to the first one. Let n be the number of characters in the target (the first item)
and m the number in the source, where a sequence counts as a single character.
-       if n >= m,
        the source characters are mapped to the first m corresponding
        target characters
-       if n < m,
        the first n source characters are mapped to the corresponding
        target characters; the remaining source characters are mapped to the last
        character of the target (or a blank, if n=0).
        Sequences of characters mapping to the last one are squeezed as with
        the -s option of the "tr" tool, i.e. produce only one character.
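
The following Python sketch shows one possible reading of these two cases
for a single source item. The names build_map and apply_map are made up
for illustration, target and source are lists of units (characters or
sequences), and two choices are assumptions: only the surplus source
characters (those beyond position n) take part in squeezing, and
characters without a mapping are passed through unchanged.

```python
def build_map(target, source):
    """Return (mapping, overflow) for one target item and one source item."""
    n = len(target)
    last = target[-1] if n else " "       # blank when the target is empty
    mapping = {u: (target[i] if i < n else last) for i, u in enumerate(source)}
    overflow = set(source[n:])            # surplus units, all mapped to `last`
    return mapping, overflow

def apply_map(text, mapping, overflow):
    """Translate text, preferring the longest matching source unit."""
    out, i, in_run = [], 0, False
    longest = max((len(u) for u in mapping), default=1)
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            unit = text[i:i + size]
            if unit in mapping:
                break
        else:
            unit = None                   # nothing matches at this position
        if unit is None:
            out.append(text[i])           # assumption: pass through unchanged
            i, in_run = i + 1, False
            continue
        if unit in overflow:
            if not in_run:                # squeeze a run into one character
                out.append(mapping[unit])
            in_run = True
        else:
            out.append(mapping[unit])
            in_run = False
        i += len(unit)
    return "".join(out)
```

With target ["A", "B"] and source ["a", "b", "c", "d"], the characters
c and d are surplus, so a run such as "cdc" yields a single B.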

All but "map" fields describe a final rule,
which assigns a character class (letter, word, other, null)
and a single collation value or a range of collation values to the listed items.
Collation values are assigned in the order of rules (fields).

-       map
        does NOT assign a collation value, but processes the result recursively.
-       ignore
        assigns character class null and no collation value.
        Ignored characters are silently discarded and don't break words
        (like soft hyphen, combining diacritical marks).
-       letter
        assigns character class letter, i.e. characters that are part of words.
-       word
        assigns character class word, i.e. characters that are themselves words.
-       other
        assigns character class other, i.e. characters that are not part of words
        and are discarded in word-split mode.

The builtin default rule maps all characters to a blank,
which thus always has the lowest collation value.

In any case, the longest matching sequence is used,
and map rules should (?) take precedence over final rules.
TODO: detailed precedence of conflicting rules.


The actual collation key is a sequence of collation values,
encoded as unsigned characters if less than 256, or as UTF-8 otherwise.
(Alternatively, a fixed-number-of-bits encoding could be used.)
The reverse mapping is simple: for each collation value,
there is a corresponding final rule that produced this value
or a range containing this value. The corresponding character
(or sequence) in the first item of this rule is used.
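
A minimal Python sketch of key encoding and the reverse mapping follows.
It rests on several assumptions that the text above leaves open: the
choice between bytes and UTF-8 is made once per table (the wide flag)
rather than per value, the blank gets value 1 so that keys contain no
zero byte, and input is already split into units. All function names are
hypothetical.

```python
def letters(first, last):
    """Expand a range such as A-L into a list of characters."""
    return [chr(c) for c in range(ord(first), ord(last) + 1)]

def assign_values(final_rules):
    """Number target units in rule order; sources share their target's value."""
    value_of, unit_of = {" ": 1}, {1: " "}    # the default blank sorts lowest
    for targets, sources in final_rules:
        for i, target in enumerate(targets):
            value = len(unit_of) + 1
            unit_of[value] = target
            value_of[target] = value
            if i < len(sources):
                value_of[sources[i]] = value
    return value_of, unit_of

def encode_key(values, wide):
    """One byte per value, or UTF-8; both order correctly bytewise."""
    if not wide:
        return bytes(values)
    return "".join(map(chr, values)).encode("utf-8")

def decode_key(key, wide, unit_of):
    """Reverse mapping: each value yields the first item of its final rule."""
    values = [ord(c) for c in key.decode("utf-8")] if wide else list(key)
    return "".join(unit_of[v] for v in values)

# Final rules in the spirit of the Spanish example: LL sorts between L and M.
rules = [(letters("A", "L"), letters("a", "l")),
         (["LL"], ["ll"]),
         (letters("M", "Z"), letters("m", "z"))]
value_of, unit_of = assign_values(rules)
llama = encode_key([value_of[u] for u in ["ll", "a", "m", "a"]], wide=False)
luz = encode_key([value_of[u] for u in ["l", "u", "z"]], wide=False)
```

Here llama compares greater than luz, as traditional Spanish ordering
requires, and decode_key(llama, False, unit_of) gives back "LLAMA".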


Simple example for traditional Spanish collation:
$
letter  A-L     a-l
letter  LL      ll
letter  M-Z     m-z
map     AEIOUN  ÁÉÍÓÚÑ  áéíóúñ
$