/[webpac]/openisis/current/doc/Collating.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Annotation of /openisis/current/doc/Collating.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 237 - (hide annotations)
Mon Mar 8 17:43:12 2004 UTC (20 years, 1 month ago) by dpavlin
File MIME type: text/plain
File size: 6656 byte(s)
initial import of openisis 0.9.0 vendor drop

1 dpavlin 237 This document discusses the issues involved in collating entries
2     for the index (inverted file).
3    
4     Collating starts after preliminary steps of inverting like
5     selecting data from a record (or other sources), including
6     extraction of keywords and the like, which are described in
7     > Inverting
8     .
9    
10     Collating involves a string transformation, so that the resulting strings
11     - are collation keys,
12     i.e. yield the desired ordering on binary comparision (a la strcmp).
13     - are normalized,
14     i.e. some variants of the same contents are mapped to a canonical base
15    
16     The same transformation is applied to both entries before written to the
17     index and keys before searched in the index.
18    
19    
20     A desired property of this transformation (mapping) is reversibility
21     to a legible representation of the original key,
22     so that index contents can be listed.
23     However, an alternative is to use the original record contents
24     for listing, where this can be easily identified (e.g. from a thesaurus).
25    
26     Since collating involves looking up characters, it can also detect
27     word boundaries.
28    
29    
30     The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables
31     do not extend well to multibyte character sets like UNICODE,
32     so we explore alternatives.
33    
34    
35     * basic collation
36    
37     A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be
38     - using toupper for normalization
39     - using isalpha to detect word boundaries
40     - using strxfrm to create collation keys
41    
42     This fallback approach has several disadvantages:
43     - normalization is very limited
44     - it depends on a "locale" which is a global state of a process,
45     so working with databases with different settings is not multithreading safe.
46     - the dependency from the locale is not specified precisely
47     - there is no standard way to customize settings
48     - it is not revertible
49    
50    
51     * full unicode collation
52    
53     Using a library such as Big Blue's
54     > http://oss.software.ibm.com/icu/userguide/Collate_Intro.html ICU
55     , we can use the full unicode collation algorithm (
56     > http://unicode.org/unicode/reports/tr10/ UCA
57     ), including the customization options described there.
58    
59     - use the given encoding to convert input to unicode
60     - uppercase and canonicalize input, especially decompose composite characters
61     - remove most diacritics (but not the ring on the A when in sweden)
62     - use the alphabetic character property to detect word boundaries
63     - get a collation key
64    
65     This approach is very powerful. It provides prepared data for virtually
66     all scripts, encodings and locales without the need for customization.
67     It provides for three or four levels of sorting
68     and can even handle french accent ordering
69     (where accents later in the otherwise same word have higher preceedence).
70     Sooner or later, OpenIsis will provide a means to link to ICU.
71    
72     However, this still has a couple of disadvantages
73     - it requires a library that alone is several times larger than OpenIsis.
74     The used algorithms are very complex, so it can not be expected
75     to perform all too well on low hardware ressources.
76     - customization for special normalizations is difficult
77     - reverting a compressed collation key might not always be possible,
78     at least it is a pretty non-trivial task (and not supported by ICU)
79    
80    
81     * isis custom collation
82    
83     We devise a relatively simple scheme, that combines the mappings used for
84     normalization, collation key creation and word boundary detection into one step.
85     It does (by now) not provide multi-level ordering, since the discriminators
86     for secondary levels (like case and diacritics) are typically discarded anyway
87     during normalization.
88    
89    
90     Features are
91     - efficiency
92     - easy customization
93     - independence of encoding
94     - mapping of sequences
95    
96    
97     The mapping is specified by an isis record,
98     that could be stored in an separate file or be part of embedded header data.
99     In any way, the mappings are described using the actual characters,
100     in the encoding of the database, rather than code numbers.
101     This way, simple explicit mappings will survive a recoding
102     (ranges, however, may be broken).
103    
104    
105     Fields used in the customization are mapping rules,
106     containing one or more "items" separated by TAB characters.
107     An item is
108     - a single character
109     if it is a single character (sic!)
110     - a literal sequence
111     if the first one is a quote '"' (34)
112     - a range
113     is it consists of three characters, and the second one is a dash '-' (45)
114     - an enumeration
115     else. A dash at second position or quote at first position can be
116     avoided by permutation or by using multiple rules.
117    
118    
119     The fields in the mapping (record) map the second and following items
120     to the first one. Let n the number of characters in the target (the first item),
121     m those of the source, where a sequence is regarded as a single character.
122     - if n >= m,
123     the source characters are mapped to the first m corresponding
124     target characters
125     - if n < m,
126     the first n source characters are mapped to the corresponding
127     target characters, other source characters are mapped to the last
128     character of the target (or a blank, if n=0).
129     Sequences of characters mapping to the last one are squeezed as with
130     the -s option of the "tr" tool, i.e. produce only one character.
131    
132     All but "map" fields describe a final rule,
133     which assigns a character class (letter,word,other,null)
134     and one or a range of collation values to the listed items.
135     Collation values are assigned in the order of rules (fields).
136    
137     - map
138     do NOT assign a collation value, but process the result recursively.
139     - ignore
140     assigns character class null and no collation value.
141     ignored characters are silently discarded and don't break words
142     (like soft hyphen, combining diacritical marks).
143     - letter
144     assigns character class letter, i.e. characters that are part of words.
145     - word
146     assigns character class word, i.e. characters that are themselves words.
147     - other
148     assigns character class other, i.e. characters that are not part of words
149     and are discarded in word-split mode.
150    
151     The builtin default rule maps all characters to a blank,
152     which thus always has the lowest collation value.
153    
154     In any case, the longest matching sequence is used,
155     and map rules should (?) take preceedence over final rules.
156     TODO: detailled preceedence of conflicting rules.
157    
158    
159     The actual collation key is a sequence of collation values,
160     encoded as unsigned characters, if less than 256, or UTF-8 else.
161     (A fixed-number-of-bits encoding could be used alternatively).
162     The reverse mapping is simple: for each collation value,
163     there is a corresponding final rule that produced this value
164     or a range containing this value. The corresponding character
165     (or sequence) in the first item of this rule is used.
166    
167    
168     Simple example for traditional spanish collation:
169     $
170     letter A-L a-l
171     letter LL ll
172     letter M-Z m-z
173     map AEIOUN ÁÉÍÓÚÑ áéíóúñ
174     $

  ViewVC Help
Powered by ViewVC 1.1.26