character set support in Malete

See end for usage note.

* overview

Malete supports any "character set" (in the MIME sense of "charset" =
> http://ietf.org/rfc/rfc2278.txt CCS + CES
) which is compatible with
> http://ietf.org/rfc/rfc0020.txt ASCII
so that
- every character defined by ASCII is encoded by the same byte value
- every byte with a value in the range 0-127 inclusive encodes the
character as specified by ASCII

This
- includes
> http://aspell.net/charsets/iso8859.html ISO-8859-*
, Unicode in the
> http://www.ietf.org/rfc/rfc2279.txt UTF-8
encoding, various Far East
> http://web.archive.org/web/czyborra.com/utf/#EUC EUC
and similar encodings using pairs of bytes greater than 127
and works well with most (but not all) IBM/M$
> http://aspell.net/charsets/codepages.html codepages
unless you really abused the control characters 0-31 for graphics
- excludes
> http://web.archive.org/web/czyborra.com/utf/#UTF-16 UTF-16
and other formats using bytes 0-127 as part of multibyte sequences
and, of course, the anti-ASCII
> http://aspell.net/charsets/iso646.html#EBCDIC EBCDIC
- should work, with some restrictions on searching, for encodings which at
least preserve linefeed (10), horizontal TAB (9) and the digits (48-57)
like the Unicode standard compression scheme
> http://www.unicode.org/unicode/reports/tr6/ SCSU
,
> http://aspell.net/charsets/vietnamese.html VISCII
and even old
> http://aspell.net/charsets/iso646.html#i18n ISO-646-*
or Cyrillic
> http://aspell.net/charsets/cyrillic.html KOI

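The two ASCII conditions can be spot-checked for a given encoding.
The following Python sketch is only an illustration and not part of Malete:
ascii_transparent is a hypothetical helper, the codec names are those of
Python's standard library, and the second condition is verified only for
the sample characters passed in.

```python
def ascii_transparent(codec, sample=""):
    """Check the two ASCII conditions for a Python codec name.

    `sample` is optional non-ASCII text; the second condition is only
    verified for the characters it contains.
    """
    try:
        # Condition 1: every ASCII character is encoded by its own byte value.
        for code in range(128):
            if chr(code).encode(codec) != bytes([code]):
                return False
    except UnicodeEncodeError:
        return False
    # Condition 2: no byte in the range 0-127 may occur inside the
    # encoding of a non-ASCII character.
    for ch in sample:
        if ord(ch) < 128:
            continue
        try:
            encoded = ch.encode(codec)
        except UnicodeEncodeError:
            continue  # character not available in this charset
        if any(b < 128 for b in encoded):
            return False
    return True


for name in ("latin-1", "utf-8", "euc_jp", "utf-16-le", "cp500", "shift_jis"):
    print(name, ascii_transparent(name, "\u00e9\u8868"))
```

The sample contains U+8868, whose Shift_JIS encoding ends in the byte 0x5C;
that is the kind of multibyte sequence the second condition rules out.
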
In order to store and retrieve (by record id) data,
Malete does not need to know anything about the character set.
However, the
> MetaData content-type
may contain a charset name using a preferred MIME name from the
> http://www.iana.org/assignments/character-sets IANA registered charsets.
The basic server does not support character set conversion, since many
client environments like Java or Tcl are well prepared to handle this.
A Tcl-based server may support charset conversion.

* indexing

For indexing, we use quite a lot of information about characters similar to,
but extending the traditional
> http://www.cindoc.csic.es/isis/21-5.htm ISISUC.TAB
and
> http://www.cindoc.csic.es/isis/21-6.htm ISISAC.TAB
- a sort order (collation sequence),
possibly using
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html multilevel comparison
according to the
> http://unicode.org/unicode/reports/tr10/ UCA
- a mapping of certain code sequences to others,
for example to use uppercased versions in the index
- some notion of which characters are "alphabetic"
(parts of words) for indexing in word split mode

This data is configured by a sequence of strings,
which may be obtained from the collation (4) fields of a database's
> MetaData metadata
record (in _database_.m0d) or as the lines of a
text file (_collation_.mcd, not implemented).

These strings start with a one or two character mnemonic,
which is followed by a tab-separated sequence of byte strings,
each representing a single character or a sequence of characters
in the database's encoding.
Unlike CDS/ISIS, Malete always deals with multibyte entities
and does not use explicit codes as decimal numbers.
Consequently, the collation configuration can be converted between
charsets just like the database itself (e.g. using recode or iconv).

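As a rough illustration of this layout, the following Python sketch splits
one such string into its mnemonic and entities. It is not Malete code:
parse_collation_entry is a hypothetical name, and the sketch assumes that
the mnemonic and the entities are all separated by single TAB bytes.

```python
def parse_collation_entry(value):
    """Split one collation string (bytes) into (mnemonic, [entity, ...])."""
    mnemonic, *entities = value.split(b"\t")
    if not 1 <= len(mnemonic) <= 2:
        raise ValueError("expected a one or two character mnemonic: %r" % mnemonic)
    return mnemonic.decode("ascii"), entities


# An alias entry for a-acute and e-acute, first in Latin-1 ...
latin1 = "A\t\u00e1\t\u00e9".encode("latin-1")
# ... then converted the way recode or iconv would convert the database.
utf8 = latin1.decode("latin-1").encode("utf-8")

print(parse_collation_entry(latin1))
print(parse_collation_entry(utf8))
```

Since the entities are plain byte strings, the converted entry parses just
like the original one; only the byte length of each entity changes.
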
Malete does not care whether a multibyte sequence holds the two ASCII
characters 'C' and 'H' in order to assign 'CH' a separate rank between
'C' and 'D' in Spanish collation (a
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Contractions Contraction
) or an
> http://www.loc.gov/marc/specifications/speccharmarc8.html ANSEL
(
> http://lcweb2.loc.gov/cocoon/codetables/45.html codes
,
> http://www.niso.org/standards/resources/Z39-47.pdf Z39.47
) style composition or
UTF-8 using two or more bytes to encode a character with a diacritical mark
(in precomposed or decomposed form).

Configuration entries supported in the initial version are:
- collation C _name_ [_options_]
assigns a name to this collation or refers to an external collation.
Only the first 31 bytes in _name_ are considered.
The name should be a C identifier (plain ASCII) for best interoperability.
Proposed (but probably not implemented) options are 'c' for compression
and 'f' for French (reverse) secondaries (see below).
- word W _entities_
specifies that the listed entities are considered parts of words
and assigns sort ranks in ascending order to them
- nonword N _entities_
like W, but the entities separate "words" in word split mode.
Multiple W and N entries can be used to assign successive sort ranks.
- alias A _entities_
the entities are assigned as aliases to the corresponding entities
of the last seen W or N, e.g. a sequence of lowercase characters
to their uppercase equivalents
- map M _entities_
the second and following entities are mapped to the first one,
which will be iteratively checked for other rules (but not maps).
This can be used to map entities to empty (completely discarding them)
or to multiple entities as an
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Expansions Expansion

Example:
$
4 C spanish_de_PHONEBOOK
4 N . , ; - _
4 N 0 1 2 3 4 5 6 7 8 9
4 W a b c ch d e f g h ...
4 A A B C CH D E F G H ...
4 A á é
4 A Á É
4 M ae Ä ä
4 M oe Ö ö
$

Here, 'coche' sorts exactly like 'COCHE' after 'cocina',
since the ch sorts after the c (and the i is not even considered).
'König' in German phonebooks sorts exactly like 'Koenig',
and in a terms listing, both will be displayed as 'koenig'.
(Unlike
> http://oss.software.ibm.com/icu/charts/collation/de__PHONEBOOK.html "de_PHONEBOOK"
, the
> http://oss.software.ibm.com/icu/charts/collation/de.html "de"
collation has the o-umlaut as secondary to o).

Note that as a possibly confusing, although correct, side effect,
a prefix search for 'coc' will NOT match 'coche',
since it does not begin with the codes for 'c'-'o'-'c',
but with those for 'c'-'o'-'ch'.

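The behaviour described for the example can be reproduced with a small
Python sketch. This is a simplified model, not Malete's implementation:
build_collation and collation_key are hypothetical names, the sketch works
on characters instead of bytes, and it assumes that the longest matching
entity wins at each position. The accent aliases of the example are left out.

```python
def build_collation(entries):
    """Turn (mnemonic, entities) pairs into a code table and a map table."""
    codes, maps, last, next_code = {}, {}, [], 2
    for mnemonic, entities in entries:
        if mnemonic in ("W", "N"):          # successive sort ranks
            for entity in entities:
                codes[entity] = next_code
                next_code += 1
            last = entities
        elif mnemonic == "A":               # aliases share the code of W/N
            for alias, target in zip(entities, last):
                codes[alias] = codes[target]
        elif mnemonic == "M":               # map to the first entity
            for source in entities[1:]:
                maps[source] = entities[0]
    return codes, maps


def collation_key(text, codes, maps, use_maps=True):
    """Recode text into a list of code numbers (longest match assumed)."""
    candidates = set(codes) | (set(maps) if use_maps else set())
    ordered = sorted((e for e in candidates if e), key=len, reverse=True)
    key, i = [], 0
    while i < len(text):
        match = next((e for e in ordered if text.startswith(e, i)), None)
        if match is None:                   # unrecognized: squeezed code 1
            if not key or key[-1] != 1:
                key.append(1)
            i += 1
        elif match in codes:
            key.append(codes[match])
            i += len(match)
        else:                               # map target: other rules, no maps
            key.extend(collation_key(maps[match], codes, maps, use_maps=False))
            i += len(match)
    return key


WORD = list("ab") + ["c", "ch"] + list("defghijklmnopqrstuvwxyz")
CODES, MAPS = build_collation([
    ("N", [".", ",", ";", "-", "_"]),
    ("W", WORD),
    ("A", [e.upper() for e in WORD]),
    ("M", ["ae", "\u00c4", "\u00e4"]),      # A-umlaut, a-umlaut
    ("M", ["oe", "\u00d6", "\u00f6"]),      # O-umlaut, o-umlaut
])


def key(text):
    return collation_key(text, CODES, MAPS)


print(key("coche") == key("COCHE"))         # True
print(key("cocina") < key("coche"))         # True
print(key("K\u00f6nig") == key("Koenig"))   # True
print(key("coche")[:3] == key("coc"))       # False
```

The last line shows the prefix side effect: the key of 'coc' ends in the
code for 'c', while the key of 'coche' has the code for 'ch' in that place.
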
* implementation

The collation is basically implemented by means of a recoding.
Every W and N entity (byte sequence) corresponds to one code number,
increasing from 2 in the order of their definition.
Every unrecognized byte value, and especially the TAB (9), which cannot
be redefined, maps to code number 1, and runs of those are squeezed
(i.e. only one 1 for a sequence of unrecognized bytes).
Aliases use the same code number as their corresponding W or N entity.

The recoded key is then a sequence of code numbers corresponding to
the recognized entities. Depending on the highest code number,
one or two bytes (big endian) are used for every number.

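A minimal Python sketch of this recoding follows. It is an illustration
under stated assumptions, not the actual implementation: recode, TABLE and
BIG are hypothetical names, the longest matching entity is assumed to win,
and the switch to two bytes is assumed to happen once the highest code
number no longer fits into one byte.

```python
def recode(data, codes):
    """Recode bytes into a key of code numbers, serialized big endian."""
    longest = max(len(entity) for entity in codes)
    width = 1 if max(codes.values()) < 256 else 2   # assumed threshold
    numbers, i = [], 0
    while i < len(data):
        for n in range(min(longest, len(data) - i), 0, -1):
            code = codes.get(data[i:i + n])
            if code is not None:
                numbers.append(code)
                i += n
                break
        else:
            # Unrecognized byte: code 1, squeezing runs into a single 1.
            if not numbers or numbers[-1] != 1:
                numbers.append(1)
            i += 1
    return b"".join(n.to_bytes(width, "big") for n in numbers)


# W entities get codes from 2 upwards; uppercase aliases share them.
TABLE = {e: n for n, e in enumerate(
    [b"a", b"b", b"c", b"ch", b"d", b"e", b"h", b"o"], start=2)}
TABLE.update({e.upper(): n for e, n in list(TABLE.items())})

print(list(recode(b"COCHE", TABLE)))     # [4, 9, 5, 7]
print(list(recode(b"a \t!b", TABLE)))    # [2, 1, 3]

# A table with more than 255 codes switches to two bytes per number.
BIG = {("x%d;" % k).encode("ascii"): k + 2 for k in range(300)}
print(list(recode(b"x0;x299;", BIG)))    # [0, 2, 1, 45]
```
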
This transformation is applied to every index entry before storing it,
as well as to every term before lookup. From the table of entities,
the original term (in W and N entities, not aliases or maps) can be
reconstructed for display of indexed terms.

Note that the term decoded from a collation key does not necessarily map
to the same key. Where their byte sequences overlap with others,
they may become parts of other contractions.

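The following Python sketch illustrates why decoding and recoding need not
round-trip. It is a toy model with hypothetical names (ENTITIES, encode,
decode), it assumes longest-match recoding, and it simply posits a key that
holds the separate codes for 'c' and 'h' without showing how it came about.

```python
ENTITIES = ["c", "ch", "e", "h", "o"]                # W entities in rank order
CODES = {e: n for n, e in enumerate(ENTITIES, start=2)}
NAMES = {n: e for e, n in CODES.items()}


def encode(term):
    """Recode a term, preferring the two-character contraction."""
    key, i = [], 0
    while i < len(term):
        two = term[i:i + 2]
        if len(two) == 2 and two in CODES:
            key.append(CODES[two])
            i += 2
        elif term[i] in CODES:
            key.append(CODES[term[i]])
            i += 1
        else:
            if not key or key[-1] != 1:
                key.append(1)
            i += 1
    return key


def decode(key):
    """Rebuild a display term from the W entities; code 1 shows as a blank."""
    return "".join(NAMES.get(code, " ") for code in key)


key = [CODES["c"], CODES["h"]]      # two separate codes
term = decode(key)                  # displayed as 'ch'
print(term, encode(term) == key)    # ch False
```

Recoding the displayed term yields the single code of the 'ch' contraction,
which differs from the key it was decoded from.
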
The implementation limits both the length (in bytes) of sequences
and the number of codes of a map target to 15.

* compression

With the compression option implemented and enabled, the number of bits used
per code is the minimal number of bits needed to represent the highest
code number, and the bitstring is padded with 0 bits to the full byte.
In the Spanish environment one would need 29 alphabetic codes
(including CH, LL and Ñ), 10 digits and some punctuation, so six bits
(codes 2-63) are sufficient and we can reduce key size by up to 25%.

This probably is especially interesting for databases integrating a lot of
Phoenician and/or Brahmi scripts, using more than 256 but fewer than 512 codes.
Here one would need only 9 instead of 16 bits, saving more than 40%.
In a CJK environment, you will need at least 15 bits anyway.

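The arithmetic can be checked with a short Python sketch. The helper names
bits_per_code and pack are hypothetical, and the bit layout (most
significant bit first) is an assumption made for the illustration.

```python
def bits_per_code(highest_code):
    """Minimal number of bits needed to represent the highest code number."""
    return highest_code.bit_length()


def pack(codes, highest_code):
    """Concatenate fixed-width codes and pad with 0 bits to a full byte."""
    width = bits_per_code(highest_code)
    bits = "".join(format(code, "0%db" % width) for code in codes)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(bits_per_code(63), 1 - bits_per_code(63) / 8)      # 6 0.25
print(bits_per_code(511), 1 - bits_per_code(511) / 16)   # 9 0.4375
print(len(pack([4, 9, 5, 7], 63)))                       # 3 bytes for 4 codes
```

Because of the padding, the 25% saving is only reached when the number of
codes in a key is a multiple of four, hence "up to".
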
Do not confuse this compression of single keys with the option of
> FileFormats compressing the index
based on common prefixes between adjacent keys.

* multilevel comparison

Some future version will also support S and T entries to support
secondary (optionally French) and tertiary levels, possibly one day even
quaternary and identical levels Q and I, should there be demand.

Example (cf. the
> http://oss.software.ibm.com/icu/charts/collation/es.html es chart
, lacking the ch contraction):
$
4 C spanish
4 W a b c ch d e f
4 T A B C CH D E F
4 S á é
4 T Á É
4 S ä
4 T Ä
$

Here, 'coche' sorts before 'Coche',
since on the third level the 'c' sorts before 'C'.
(Unlike plain ASCII sorting, most collations sort lowercase before uppercase).
Still 'coche' sorts after 'Cocina', since the primary difference
between 'ch' and 'c' takes precedence over the tertiary difference,
although the latter occurs earlier in the word.
Just for the fun of it, the a-umlaut is not expanded here,
but listed as another secondary variant of 'a' with its own tertiary.

For multilevel comparison, a 0 code plus additional bits are appended to
the recoded key. First, for every character some bits are appended to code
its secondary variant, depending on how many variants are defined for the
character, then likewise for tertiary variants.

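The ordering described for this example can be modelled in Python by
comparing the levels one after the other. This is a sketch only: it uses
tuples instead of the appended 0 code and bits, sort_key and the tables are
hypothetical, and the W entry is extended by 'i', 'n' and 'o' so that the
sample words can be recoded.

```python
BASE = ["a", "b", "c", "ch", "d", "e", "f", "i", "n", "o"]
PRIMARY = {e: rank for rank, e in enumerate(BASE, start=2)}

# entity -> (base entity, secondary variant, tertiary variant)
VARIANT = {e: (e, 0, 0) for e in BASE}
VARIANT.update({e.upper(): (e, 0, 1) for e in BASE})
VARIANT.update({
    "\u00e1": ("a", 1, 0), "\u00c1": ("a", 1, 1),   # a-acute
    "\u00e9": ("e", 1, 0), "\u00c9": ("e", 1, 1),   # e-acute
    "\u00e4": ("a", 2, 0), "\u00c4": ("a", 2, 1),   # a-umlaut
})


def sort_key(term):
    """Return (primary, secondary, tertiary) tuples for a term."""
    primary, secondary, tertiary = [], [], []
    i = 0
    while i < len(term):
        two = term[i:i + 2]
        entity = two if len(two) == 2 and two in VARIANT else term[i]
        base, second, third = VARIANT.get(entity, (None, 0, 0))
        primary.append(PRIMARY.get(base, 1))        # 1 = unrecognized
        secondary.append(second)
        tertiary.append(third)
        i += len(entity)
    return tuple(primary), tuple(secondary), tuple(tertiary)


words = ["Coche", "coche", "Cocina", "cocina"]
print(sorted(words, key=sort_key))
# ['cocina', 'Cocina', 'coche', 'Coche']
```

Comparing only the first tuple gives a match that is insensitive to case
and diacritics.
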
In Latin scripts, typically every alphabetical character has one
tertiary variant (its uppercase equivalent, using one bit)
and some or all vowels can have one or more diacritical marks.

By appending additional bits, not only do terms sort properly,
but we also have the option of an exact match sensitive to all levels,
or of a match insensitive to the third, or to the second and third, level,
very similar to a prefix match (since the first level IS a prefix).

An actual prefix match should usually be done using only the first level bits;
checking for second and third level prefix is a little bit more complicated.

For French secondary sorting, the second level bits are appended in reverse
order. This must not be used together with left-to-right secondary sorting.

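The effect of reversing the second level can be seen with the classic
French example. The Python sketch below is an illustration with
hypothetical names (ACCENTS, level_key) and arbitrary secondary weights;
it compares tuples rather than appended bits.

```python
import unicodedata

# precomposed character -> (base letter, secondary weight)
ACCENTS = {"\u00e9": ("e", 1), "\u00f4": ("o", 2)}   # e-acute, o-circumflex


def level_key(word, french=False):
    """Return (primary, secondary); French reverses the secondary level."""
    word = unicodedata.normalize("NFC", word)
    pairs = [ACCENTS.get(ch, (ch, 0)) for ch in word]
    primary = tuple(base for base, _ in pairs)
    secondary = tuple(weight for _, weight in pairs)
    if french:
        secondary = secondary[::-1]
    return primary, secondary


words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=level_key))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=lambda w: level_key(w, french=True)))
# ['cote', 'côte', 'coté', 'côté']
```

With the reversed order, the last accent in the word decides first.
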
Using the additional bits, a terms listing can reconstruct the input with
regard to all variants, i.e. with proper case and diacritics.

However, aliases and mappings cannot be reversed:
where the a-umlaut should sort *exactly*
like an a followed by an e, it uses exactly the same bytes and we cannot
tell from the index that once there was an a-umlaut in the input.

* east asian word indexing

Segmentation of
> http://www.multilingual.com/FMPro?-db=archives&-format=ad%2fselected%5fresults.html&-find= Chinese text
is, in general, not a trivial task.

Of course, the use of spaces to explicitly separate words is an option.
The usual word split will just work as for any other scripts.

Since Malete supports full application-controlled indexing, it is also possible
to use any existing segmentation algorithm on the application level.

Where this is unwanted or not possible, a somewhat brute force,
yet feasible approach is to put every single character or
every digraph or trigraph in the index
(i.e. every character together with the one or two following characters).

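On the application level, such entries could be generated as in the
following Python sketch. The function name mgraphs and its parameter are
hypothetical; Malete itself does not offer this method.

```python
def mgraphs(text, m_max=3):
    """List every character with up to m_max - 1 following characters."""
    return [
        text[i:i + m]
        for i in range(len(text))
        for m in range(1, m_max + 1)
        if i + m <= len(text)
    ]


# Four characters give 4 single characters, 3 digraphs and 2 trigraphs.
print(mgraphs("中文信息"))
```

Each resulting string would then be handed to the index as a separate term.
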
Please contact us, should there be demand for such a character or
m-graph indexing method as an alternative to "word" split.

* using collation

When a database is opened, Malete looks for the file _database_.m0d
(where _database_ is the name of your database, e.g. cds.m0d).
If this exists, it is scanned for collation (4) fields.

If there is a "4 C" naming the collation _name_,
and there either is a compiled collation file _name_.mcx
or another database is already using a collation of that name,
the existing collation info is used.

Else a new collation (with the given name or anonymous)
is created from the description in the metadata and,
if it is named, saved as file _name_.mcx (in the current directory).

The distribution contains two sample collation definitions, a Latin-1
based one as test/cds.m0d and a UTF-8 (Unicode) based one as test/unicode.m0d.
Use a UTF-8 capable editor like "vim '+set encoding=utf-8'" in an "xterm -u8".

Creating a collation definition from existing ISISAC.TAB and ISISUC.TAB
is a straightforward exercise left to the reader.

- *WARNING*:
Adding or changing the collation in _database_.m0d will render your
encoded index data garbage!
Make sure to save the plaintext index data using
"malete qdump _database_ >_database_.mqt" before
and reencode the index with "malete qload _database_ <_database_.mqt"
after editing the m0d file.
- *WARNING*:
While named (and thus shared) collations are much more efficient,
multiple databases using a different specification for the same
collation name will not properly coexist.
When changing a named collation, be sure to remove the .mcx file
and reload the indexes for all affected databases.
When in doubt, remove or change the collation name.

* links

> http://openisis.org/Doc/encoding list of encodings supported by Java

> http://openisis.org/Doc/Unicode notes on charsets and unicode with ISIS

> http://openisis.org/Doc/CsTables tables of some western latin sets

> http://openisis.org/Doc/UniStats statistics on unicode character properties

> http://openisis.org/Doc/Collating approaches to collation

> http://www.iana.org/assignments/character-sets iana registered charsets

> http://web.archive.org/web/czyborra.com/unicode/characters.html#scripts unicode characters and scripts

ICU Collation
> http://oss.software.ibm.com/icu/userguide/Collate_Intro.html Introduction
and
> http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html Concepts

Collation charts (without contractions) by
> http://oss.software.ibm.com/icu/charts/collation/ locale
or
> http://www.unicode.org/charts/collation/ script
(according to Unicode
> http://www.unicode.org/unicode/reports/tr10/tr10-11.html#Default_Unicode_Collation_Element_Table default collation
)

---
$Id: CharSet.txt,v 1.12 2004/11/12 11:18:23 kripke Exp $