character set support in Malete

See end for usage note.


* overview

Malete supports any "character set" (in the MIME sense of "charset" =
> http://ietf.org/rfc/rfc2278.txt CCS + CES
) which is compatible with
> http://ietf.org/rfc/rfc0020.txt ASCII
so that
- every character defined by ASCII is encoded by the same byte value
- every byte with a value in the range 0-127 inclusive encodes the
character as specified by ASCII

This
- includes
> http://aspell.net/charsets/iso8859.html ISO-8859-*
, Unicode in the
> http://www.ietf.org/rfc/rfc2279.txt UTF-8
encoding, various far east
> http://web.archive.org/web/czyborra.com/utf/#EUC EUC
and similar encodings using pairs of bytes greater than 127
and works well with most (but not all) IBM/M$
> http://aspell.net/charsets/codepages.html codepages
unless you really abused the control characters 0-31 for graphics
- excludes
> http://web.archive.org/web/czyborra.com/utf/#UTF-16 UTF-16
and other formats using bytes 0-127 as part of multibyte sequences
and, of course, the anti-ASCII
> http://aspell.net/charsets/iso646.html#EBCDIC EBCDIC
- should work with some restrictions on searching for encodings which at
least preserve linefeed (10), horizontal TAB (9) and the digits (48-57)
like the Unicode standard compression scheme
> http://www.unicode.org/unicode/reports/tr6/ SCSU
,
> http://aspell.net/charsets/vietnamese.html VISCII
and even old
> http://aspell.net/charsets/iso646.html#i18n ISO-646-*
or Cyrillic
> http://aspell.net/charsets/cyrillic.html KOI
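
The two compatibility rules above can be spot-checked for any codec known to Python. This is only a rough sketch and not part of Malete: the function name is_ascii_compatible and the sample parameter are made up for illustration, and the second rule is tested only against the sample characters you pass in, so a True result is a hint rather than a proof.

```python
def is_ascii_compatible(codec, sample=""):
    """Heuristic check of the two ASCII-compatibility rules for a Python codec."""
    # Rule 1: every ASCII character is encoded by the same single byte value.
    for value in range(128):
        try:
            if chr(value).encode(codec) != bytes([value]):
                return False
        except UnicodeEncodeError:
            return False
    # Rule 2 (spot check only): non-ASCII characters from the sample
    # must not produce any byte in the range 0-127.
    for char in sample:
        if ord(char) < 128:
            continue
        try:
            encoded = char.encode(codec)
        except UnicodeEncodeError:
            continue  # character not available in this charset, nothing to check
        if any(byte < 128 for byte in encoded):
            return False
    return True


if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "utf-16", "cp037"):
        print(name, is_ascii_compatible(name, "äöü"))
```

UTF-16 fails the first rule because even plain ASCII characters take two bytes, and cp037 (an EBCDIC codepage) fails it because letters and digits are assigned different byte values.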

In order to store and retrieve (by record id) data,
Malete does not need to know anything about the character set.
However, the
> MetaData content-type
may contain a charset name using a preferred MIME name from the
> http://www.iana.org/assignments/character-sets iana registered charsets.
The basic server does not support character set conversion, since many
client environments like Java or Tcl are well prepared to handle this.
A Tcl based server may support charset conversion.


* indexing

For indexing, we use quite a lot of information about characters similar to,
but extending the traditional
> http://www.cindoc.csic.es/isis/21-5.htm ISISUC.TAB
and
> http://www.cindoc.csic.es/isis/21-6.htm ISISAC.TAB
- a sort order (collation sequence),
possibly using
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html multilevel comparison
according to the
> http://unicode.org/unicode/reports/tr10/ UCA
- a mapping of certain code sequences to others,
for example to use uppercased versions in the index
- some notion of which characters are "alphabetic"
(parts of words) for indexing in word split mode

This data is configured by a sequence of strings,
which may be obtained from the collation (4) fields of a database's
> MetaData metadata
record (in _database_.m0d) or as the lines of a
textfile (_collation_.mcd, not implemented).


These strings start with a one or two character mnemonic
which is followed by a tab-separated list of byte sequences,
each representing a single character or a sequence of characters
in the database's encoding.
Unlike CDS/ISIS, Malete always deals with multibyte entities
and does not use explicit codes as decimal numbers.
Consequently, the collation configuration can be converted between
charsets just like the database itself (e.g. using recode or iconv).


Malete does not care whether a multibyte sequence holds the two ASCII
characters 'C' and 'H' in order to assign 'CH' a separate rank between
'C' and 'D' in Spanish collation (a
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Contractions Contraction
) or an
> http://www.loc.gov/marc/specifications/speccharmarc8.html ANSEL
(
> http://lcweb2.loc.gov/cocoon/codetables/45.html codes
,
> http://www.niso.org/standards/resources/Z39-47.pdf Z39.47
) style composition or
UTF-8 using two or more bytes to encode a character with a diacritical mark
(in precomposed or decomposed form).


Configuration entries supported in the initial version are:
- collation C _name_ [_options_]
assigns a name to this collation or refers to an external collation.
Only the first 31 bytes in _name_ are considered.
Should be a C identifier (plain ASCII) for best interoperability.
Proposed (but probably not implemented) options are 'c' for compression
and 'f' for French (reverse) secondaries (see below).
- word W _entities_
specifies that the listed entities are considered parts of words
and assigns sort ranks in ascending order to them
- nonword N _entities_
like W, but the entities separate "words" in word split mode.
Multiple W and N entries can be used to assign successive sort ranks.
- alias A _entities_
the entities are assigned as aliases to the corresponding entities
of the last seen W or N, e.g. a sequence of lowercase characters
to their uppercase equivalents
- map M _entities_
the second and following entities are mapped to the first one,
which will be iteratively checked for other rules (but not maps).
This can be used to map entities to empty (completely discarding them)
or to multiple entities as an
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Expansions Expansion

Example:
$
4 C spanish_de_PHONEBOOK
4 N . , ; - _
4 N 0 1 2 3 4 5 6 7 8 9
4 W a b c ch d e f g h ...
4 A A B C CH D E F G H ...
4 A á é
4 A Á É
4 M ae Ä ä
4 M oe Ö ö
$

Here, 'coche' sorts exactly like 'COCHE' after 'cocina',
since the ch sorts after the c (and the i is not even considered).
'König' in German phonebooks sorts exactly like 'Koenig',
and in a terms listing, both will be displayed as 'koenig'.
(Unlike
> http://oss.software.ibm.com/icu/charts/collation/de__PHONEBOOK.html "de_PHONEBOOK"
, the
> http://oss.software.ibm.com/icu/charts/collation/de.html "de"
collation has the o-umlaut as secondary to o).


Note that as a possibly confusing, although correct side effect,
a prefix search for 'coc' will NOT match 'coche',
since it does not begin with the codes for 'c'-'o'-'c',
but with those for 'c'-'o'-'ch'.
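
The entry format described above (a mnemonic followed by tab-separated byte sequences) can be read with a few lines of code. The following is a minimal sketch and not Malete's own parser: the function name parse_entry is hypothetical, and we assume that any options of a C entry follow the name as further tab-separated items, which the text does not specify.

```python
MAX_NAME_BYTES = 31  # only the first 31 bytes of a collation name are considered
MNEMONICS = {b"C", b"W", b"N", b"A", b"M"}  # entries of the initial version


def parse_entry(value):
    """Split one collation string (bytes) into (mnemonic, list of entities)."""
    mnemonic, *entities = value.split(b"\t")
    if mnemonic not in MNEMONICS:
        raise ValueError("unsupported mnemonic: %r" % mnemonic)
    if mnemonic == b"C" and entities:
        # Assumption: the name is the first item, options (if any) follow it.
        entities[0] = entities[0][:MAX_NAME_BYTES]
    return mnemonic.decode("ascii"), entities


if __name__ == "__main__":
    print(parse_entry(b"W\ta\tb\tc\tch"))
    print(parse_entry("M\tae\tÄ\tä".encode("latin-1")))
```

Since the entities stay plain bytes, the same parser works unchanged for a Latin-1 or a UTF-8 database. An empty first entity in an M entry, as in b"M\t\t-", represents a map to empty.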


* implementation

The collation is basically implemented by means of a recoding.
Every W and N entity (byte sequence) corresponds to one code number,
increasing from 2 in the order of their definition.
Every unrecognized byte value, and especially the TAB (9), which cannot
be redefined, maps to code number 1 and runs of those are squeezed
(i.e. only one 1 for a sequence of unrecognized bytes).
Aliases use the same code number as their corresponding W or N entity.

The recoded key is then a sequence of code numbers corresponding to
the recognized entities. Depending on the highest code number,
one or two bytes (big endian) are used for every number.
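
To make the recoding more tangible, here is a small sketch of it. It is not the Malete implementation, and several points are our assumptions: input is matched against the longest known entity first, aliases correspond to W or N entities by position, one byte per code is used while the highest code is below 256, and all function names are hypothetical. The alphabet is spelled out in full for the sketch and the accent aliases are left out.

```python
def recode_codes(data, table):
    """Translate bytes to code numbers, trying the longest known entity first."""
    longest = max((len(seq) for seq in table), default=1)
    codes, i = [], 0
    while i < len(data):
        for n in range(min(longest, len(data) - i), 0, -1):
            hit = table.get(data[i:i + n])
            if hit is not None:
                codes.extend(hit)
                i += n
                break
        else:
            # Unrecognized byte: code 1, runs are squeezed into a single 1.
            if not codes or codes[-1] != 1:
                codes.append(1)
            i += 1
    return codes


def build_collation(entries, encoding="utf-8"):
    """Build {byte sequence: tuple of codes} from (mnemonic, entities) pairs."""
    table, maps = {}, []
    next_code = 2   # code 1 is reserved for unrecognized bytes
    last = []       # codes of the last seen W or N entry, by position
    for mnemonic, entities in entries:
        seqs = [entity.encode(encoding) for entity in entities]
        if mnemonic in ("W", "N"):
            last = []
            for seq in seqs:
                table[seq] = (next_code,)
                last.append(next_code)
                next_code += 1
        elif mnemonic == "A":
            for seq, code in zip(seqs, last):
                table[seq] = (code,)   # same code as the aliased entity
        elif mnemonic == "M" and seqs:
            maps.append((seqs[0], seqs[1:]))
    # Resolve maps last: the target is recoded by the W, N and A rules only.
    for target, sources in maps:
        codes = tuple(recode_codes(target, table))
        for seq in sources:
            table[seq] = codes
    return table, next_code - 1


def recode(text, table, highest, encoding="utf-8"):
    """Return the recoded key, one or two big endian bytes per code."""
    width = 1 if highest < 256 else 2   # assumed threshold
    codes = recode_codes(text.encode(encoding), table)
    return b"".join(code.to_bytes(width, "big") for code in codes)


letters = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")
entries = [
    ("N", [".", ",", ";", "-", "_"]),
    ("N", list("0123456789")),
    ("W", letters),
    ("A", [entity.upper() for entity in letters]),
    ("M", ["ae", "Ä", "ä"]),
    ("M", ["oe", "Ö", "ö"]),
]
table, highest = build_collation(entries)


def key(term):
    return recode(term, table, highest)
```

With this table, key('coche') equals key('COCHE') and compares greater than key('cocina'), key('König') equals key('Koenig'), and key('coche') does not start with key('coc').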


This transformation is applied to every index entry before storing it,
as well as to every term before lookup. From the table of entities,
the original term (in W and N entities, not aliases or maps) can be
reconstructed for display of indexed terms.

Note that the term decoded from a collation key does not necessarily map
to the same key. Where their byte sequences overlap with others,
they may become parts of other contractions.
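
A tiny, purely hypothetical table shows how a decoded term can map to a different key. We assume W entities 'c', 'ch' and 'h' with codes 2, 3 and 4, uppercase aliases 'C', 'CH' and 'H', and longest-match recoding; none of these names or numbers come from Malete itself.

```python
# Hypothetical table: W entities c=2, ch=3, h=4 plus uppercase aliases.
ENCODE = {b"c": 2, b"ch": 3, b"h": 4, b"C": 2, b"CH": 3, b"H": 4}
# For display only the W and N entities are known; aliases are not reversible.
DECODE = {2: b"c", 3: b"ch", 4: b"h"}


def recode(data):
    """Translate bytes to code numbers, longest entity first."""
    codes, i = [], 0
    while i < len(data):
        for n in (2, 1):
            chunk = data[i:i + n]
            if len(chunk) == n and chunk in ENCODE:
                codes.append(ENCODE[chunk])
                i += n
                break
        else:
            if not codes or codes[-1] != 1:
                codes.append(1)   # unrecognized bytes, squeezed
            i += 1
    return codes


def decode(codes):
    """Rebuild a displayable term; b'?' for code 1 is a placeholder we chose."""
    return b"".join(DECODE.get(code, b"?") for code in codes)


if __name__ == "__main__":
    key = recode(b"Ch")            # 'C' and 'h' are recognized separately
    term = decode(key)             # displayed as b"ch"
    print(key, term, recode(term))
```

The mixed-case input 'Ch' is not a contraction here, so it yields the codes for 'c' and 'h'. The decoded term 'ch', however, is recognized as the contraction and yields a single, different code.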


The implementation limits both the length (in bytes) of sequences
and the number of codes of a map target to 15.


* compression

With the compression option implemented and enabled, the number of bits used
per code is the minimal number of bits needed to represent the highest
code number, and the bitstring is padded with 0 bits to the full byte.
In the Spanish environment one would need 29 alphabetic codes
(including CH, LL and Ñ), 10 digits and some punctuation, so six bits
(codes 2-63) are sufficient and we can reduce key size by up to 25%.

This probably is especially interesting for databases integrating a lot of
phoenician and/or brahmi scripts, using more than 256 but fewer than 512 codes.
Here one would need only 9 instead of 16 bits, saving more than 40%.
In a CJK environment, you will need at least 15 bits anyway.
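
The arithmetic behind these numbers can be checked with a short sketch of such a bit packing. The compression option is described as proposed only, so this is merely one way it could look; the names bits_per_code and pack are hypothetical.

```python
def bits_per_code(highest):
    """Minimal number of bits needed to represent the highest code number."""
    return max(1, highest.bit_length())


def pack(codes, highest):
    """Concatenate fixed-width codes and pad with 0 bits to a full byte."""
    width = bits_per_code(highest)
    value = 0
    for code in codes:
        value = (value << width) | code
    total_bits = width * len(codes)
    padding = -total_bits % 8
    return (value << padding).to_bytes((total_bits + padding) // 8, "big")


if __name__ == "__main__":
    # Four codes at 6 bits take 3 bytes instead of 4 (25% saved).
    print(len(pack([2, 3, 4, 5], 63)))
    # Sixteen codes at 9 bits take 18 bytes instead of 32 (43.75% saved).
    print(len(pack([300] * 16, 511)))
```

The saving of 25% (or 43.75%) is only reached where the packed bits fill whole bytes; a single 6 bit code still occupies one byte after padding, hence "up to".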


Do not confuse this compression of single keys with the option of
> FileFormats compressing the index
based on common prefixes between adjacent keys.


* multilevel comparison

Some future version will also support S and T entries to support
secondary (optionally French) and tertiary levels, possibly one day even
quaternary and identical levels Q and I, should there be demand.


Example (cf. the
> http://oss.software.ibm.com/icu/charts/collation/es.html es chart
, lacking the ch contraction):
$
4 C spanish
4 W a b c ch d e f
4 T A B C CH D E F
4 S á é
4 T Á É
4 S ä
4 T Ä
$

Here, 'coche' sorts before 'Coche',
since on the third level the 'c' sorts before 'C'.
(Unlike plain ASCII sorting, most collations sort lowercase before uppercase).
Still 'coche' sorts after 'Cocina', since the primary difference
between 'ch' and 'c' takes precedence over the tertiary difference,
although the latter occurs earlier in the word.
Just for the fun of it, the a-umlaut is not expanded here,
but listed as another secondary variant of 'a' with its own tertiary.


For multilevel comparison, a 0 code plus additional bits are appended to
the recoded key. First, for every character some bits are appended to code
its secondary variant, depending on how many variants are defined for the
character, then likewise for tertiary variants.

In Latin scripts, typically every alphabetical character has one
tertiary variant (its lowercase equivalent, using one bit)
and some or all vowels can have one or more diacritical marks.


By appending additional bits not only do terms sort properly,
but moreover we have the option for an exact match sensitive to all levels
or a match insensitive to third or second and third level
very similar to a prefix match (since the first level IS a prefix).

An actual prefix match should usually be done using only the first level bits;
checking for second and third level prefix is a little bit more complicated.


For French secondary sorting, the second level bits are appended in reverse
order. Must not be used together with left-to-right secondary sorting.
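
The effect of the levels can be illustrated with a simplified model. Instead of appending a 0 code and variant bits to the recoded key, the sketch below keeps the three levels as separate tuples, which compares the same way for this purpose. The weights, the small alphabet and the name sort_key are all hypothetical, and the sketch works on decoded text rather than bytes.

```python
# Hypothetical weights: entity -> (primary rank, secondary variant, tertiary variant).
# Variant 0 is the plain W entity; accents are secondary, uppercase is tertiary.
WEIGHTS = {}
for rank, entity in enumerate(["a", "c", "ch", "e", "i", "n", "o", "t"], start=2):
    WEIGHTS[entity] = (rank, 0, 0)
    WEIGHTS[entity.upper()] = (rank, 0, 1)
for base, accented in [("e", "é"), ("o", "ô")]:
    rank = WEIGHTS[base][0]
    WEIGHTS[accented] = (rank, 1, 0)
    WEIGHTS[accented.upper()] = (rank, 1, 1)


def sort_key(term, french=False):
    """Return (primary, secondary, tertiary) tuples for a term."""
    weights, i = [], 0
    while i < len(term):
        for n in (2, 1):   # try the contraction 'ch' before single letters
            chunk = term[i:i + n]
            if len(chunk) == n and chunk in WEIGHTS:
                weights.append(WEIGHTS[chunk])
                i += n
                break
        else:
            i += 1         # this sketch simply skips unknown characters
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    tertiary = tuple(w[2] for w in weights)
    if french:
        secondary = secondary[::-1]   # French: secondary level in reverse order
    return primary, secondary, tertiary


if __name__ == "__main__":
    print(sorted(["Coche", "cocina", "coche", "Cocina"], key=sort_key))
    words = ["côté", "cote", "côte", "coté"]
    print(sorted(words, key=sort_key))
    print(sorted(words, key=lambda term: sort_key(term, french=True)))
```

Comparing only the first element of the key gives the match that is insensitive to the second and third level, so 'Coché' and 'coche' are then considered equal.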


Using the additional bits, a terms listing can reconstruct the input with
regard to all variants, i.e. with proper case and diacritics.

However, aliases and mappings cannot be reversed:
where the a umlaut should sort *exactly*
like an a followed by an e, it uses exactly the same bytes and we cannot
tell from the index that once there was an a umlaut in the input.


* east asian word indexing

Segmentation of
> http://www.multilingual.com/FMPro?-db=archives&-format=ad%2fselected%5fresults.html&-find= Chinese text
is, in general, not a trivial task.

Of course, the use of spaces to explicitly separate words is an option.
The usual word split will just work as for any other script.


Since Malete supports full application controlled indexing, it is also possible
to use any existing segmentation algorithm on the application level.


Where this is unwanted or not possible, a somewhat brute force,
yet feasible approach is to put every single character or
every digraph or trigraph in the index.
(I.e. every character together with the one or two following characters).
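
Since indexing can be controlled by the application, such terms could be generated there. The following sketch is not a Malete feature; the name m_graphs is made up, and it assumes the text has already been decoded from the database's charset into characters.

```python
def m_graphs(text, m):
    """Return every run of m consecutive characters in text."""
    return [text[i:i + m] for i in range(len(text) - m + 1)]


if __name__ == "__main__":
    text = "中文分词"
    print(m_graphs(text, 1))   # single characters
    print(m_graphs(text, 2))   # digraphs
    print(m_graphs(text, 3))   # trigraphs
```

A text shorter than m yields no terms at all, so an application would probably index single characters in addition to digraphs or trigraphs.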

Please contact us, should there be demand for such a character or
m-graph indexing method as an alternative to "word" split.


* using collation

When a database is opened, Malete looks for the file _database_.m0d
(where _database_ is the name of your database, e.g. cds.m0d).
If this exists, it is scanned for collation (4) fields.

If there is a "4 C" naming the collation _name_,
and there either is a compiled collation file _name_.mcx
or another database is already using a collation of that name,
the existing collation info is used.

Else a new collation (with the given name or anonymous)
is created from the description in the metadata and,
if it is named, saved as file _name_.mcx (in the current directory).


The distribution contains two sample collation definitions, a Latin-1
based one as test/cds.m0d and a UTF-8 (Unicode) based one as test/unicode.m0d.
Use a UTF-8 capable editor like "vim '+set encoding=utf-8'" in a "xterm -u8".

Creating a collation definition from existing ISISAC.TAB and ISISUC.TAB
is a straightforward exercise left to the reader.


- *WARNING*:
Adding or changing the collation in _database_.m0d will render your
encoded index data garbage!
Make sure to save the plaintext index data using
"malete qdump _database_ >_database_.mqt" before
and reencode the index with "malete qload _database_ <_database_.mqt"
after editing the m0d file.
- *WARNING*:
While named (and thus shared) collations are much more efficient,
multiple databases using a different specification for the same
collation name will not properly coexist.
When changing a named collation, be sure to remove the .mcx file
and reload the indexes for all affected databases.
When in doubt, remove or change the collation name.


* links

> http://openisis.org/Doc/encoding list of encodings supported by Java

> http://openisis.org/Doc/Unicode notes on charsets and unicode with ISIS

> http://openisis.org/Doc/CsTables tables of some western latin sets

> http://openisis.org/Doc/UniStats statistics on unicode character properties

> http://openisis.org/Doc/Collating approaches to collation

> http://www.iana.org/assignments/character-sets iana registered charsets

> http://web.archive.org/web/czyborra.com/unicode/characters.html#scripts unicode characters and scripts

ICU Collation
> http://oss.software.ibm.com/icu/userguide/Collate_Intro.html Introduction
and
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html Concepts

Collation charts (without contractions) by
> http://oss.software.ibm.com/icu/charts/collation/ locale
or
> http://www.unicode.org/charts/collation/ script
(according to Unicode
> http://www.unicode.org/unicode/reports/tr10/tr10-11.html#Default_Unicode_Collation_Element_Table default collation
)

---
$Id: CharSet.txt,v 1.12 2004/11/12 11:18:23 kripke Exp $