0.9.9e/doc/CharSet.txt

character set support in Malete

See end for usage note.


*       overview

Malete supports any "character set" (in the MIME sense of "charset" =
>       http://ietf.org/rfc/rfc2278.txt CCS + CES
) which is compatible with
>       http://ietf.org/rfc/rfc0020.txt ASCII
so that
-       every character defined by ASCII is encoded by the same byte value
-       every byte with a value in the range 0-127 inclusive encodes the
        character as specified by ASCII

This
-       includes
>       http://aspell.net/charsets/iso8859.html ISO-8859-*
        , Unicode in the
>       http://www.ietf.org/rfc/rfc2279.txt     UTF-8
        encoding, various far east
>       http://web.archive.org/web/czyborra.com/utf/#EUC        EUC
        and similar encodings using pairs of bytes greater than 127
        and works well with most (but not all) IBM/M$
>       http://aspell.net/charsets/codepages.html       codepages
        unless you really abused the control characters 0-31 for graphics
-       excludes
>       http://web.archive.org/web/czyborra.com/utf/#UTF-16     UTF-16
        and other formats using bytes 0-127 as part of multibyte sequences
        and, of course, the anti-ASCII
>       http://aspell.net/charsets/iso646.html#EBCDIC   EBCDIC
-       should work with some restrictions on searching for encodings which at
        least preserve linefeed (10), horizontal TAB (9) and the digits (48-57)
        like the Unicode standard compression scheme
>       http://www.unicode.org/unicode/reports/tr6/     SCSU
        ,
>       http://aspell.net/charsets/vietnamese.html      VISCII
        and even old
>       http://aspell.net/charsets/iso646.html#i18n     ISO-646-*
        or Cyrillic
>       http://aspell.net/charsets/cyrillic.html        KOI
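
The two ASCII conditions above can be spot-checked for a given charset.
The following is a minimal sketch, not part of Malete: it relies on
Python's built-in codecs, and the function name is made up for
illustration. It tests only a necessary condition, since it cannot
detect encodings that reuse bytes 0-127 as trailing bytes of
multibyte sequences.

```python
ASCII_BYTES = bytes(range(128))
ASCII_TEXT = ASCII_BYTES.decode("ascii")


def looks_ascii_compatible(codec_name: str) -> bool:
    """Check that ASCII characters and bytes 0-127 map onto each other.

    This is a necessary, not a sufficient, test: a charset that also uses
    bytes 0-127 inside multibyte sequences would still pass.
    """
    try:
        return (ASCII_TEXT.encode(codec_name) == ASCII_BYTES
                and ASCII_BYTES.decode(codec_name) == ASCII_TEXT)
    except (UnicodeError, LookupError):
        # Unknown codec, or the codec rejects some of the 128 values.
        return False


if __name__ == "__main__":
    for name in ("latin-1", "utf-8", "koi8_r", "utf-16", "cp037"):
        print(name, looks_ascii_compatible(name))
```

In this check, UTF-16 and the EBCDIC codepage cp037 fail, while
Latin-1, UTF-8 and KOI8-R pass.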

In order to store and retrieve (by record id) data,
Malete does not need to know anything about the character set.
However, the
>       MetaData        content-type
may contain a charset name using a preferred MIME name from the
>       http://www.iana.org/assignments/character-sets  iana registered charsets.
The basic server does not support character set conversion, since many
client environments like Java or Tcl are well prepared to handle this.
A Tcl based server may support charset conversion.


*       indexing

For indexing, we use quite a lot of information about characters similar to,
but extending the traditional
>       http://www.cindoc.csic.es/isis/21-5.htm ISISUC.TAB
and
>       http://www.cindoc.csic.es/isis/21-6.htm ISISAC.TAB
-       a sort order (collation sequence),
        possibly using
>       http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html multilevel comparison
        according to the
>       http://unicode.org/unicode/reports/tr10/        UCA
-       a mapping of certain code sequences to others,
        for example to use uppercased versions in the index
-       some notion of which characters are "alphabetic"
        (parts of words) for indexing in word split mode

This data is configured by a sequence of strings,
which may be obtained from the collation (4) fields of a database's
>       MetaData        metadata
record (in _database_.m0d) or as the lines of a
textfile (_collation_.mcd, not implemented).


These strings start with a one or two character mnemonic,
which is followed by a tab-separated sequence of byte strings,
each representing a single character or a sequence of characters
in the database's encoding.
Unlike CDS/ISIS, Malete always deals with multibyte entities
and does not use explicit codes as decimal numbers.
Consequently, the collation configuration can be converted between
charsets just like the database itself (e.g. using recode or iconv).


Malete does not care whether a multibyte sequence holds the two ASCII
characters 'C' and 'H' in order to assign 'CH' a separate rank between
'C' and 'D' in Spanish collation (a
>       http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Contractions    Contraction
) or an
>       http://www.loc.gov/marc/specifications/speccharmarc8.html       ANSEL
(
>       http://lcweb2.loc.gov/cocoon/codetables/45.html codes
,
>       http://www.niso.org/standards/resources/Z39-47.pdf      Z39.47
) style composition or
UTF-8 using two or more bytes to encode a character with a diacritical mark
(in precomposed or decomposed form).


Configuration entries supported in the initial version are:
-       collation C _name_ [_options_]
        assigns a name to this collation or refers to an external collation.
        Only the first 31 bytes in _name_ are considered.
        Should be a C identifier (plain ASCII) for best interoperability.
        Proposed (but probably not implemented) options are 'c' for compression
        and 'f' for french (reverse) secondaries (see below).
-       word W _entities_
        specifies that the listed entities are considered parts of words
        and assigns sort ranks in ascending order to them
-       nonword N _entities_
        like W, but the entities separate "words" in word split mode.
        Multiple W and N entries can be used to assign successive sort ranks.
-       alias A _entities_
        the entities are assigned as aliases to the corresponding entities
        of the last seen W or N, e.g. a sequence of lowercase characters
        to their uppercase equivalents
-       map M _entities_
        the second and following entities are mapped to the first one,
        which will be iteratively checked for other rules (but not maps).
        This can be used to map entities to empty (completely discarding them)
        or to multiple entities as an
>       http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Expansions      Expansion

Example:
$
4       C       spanish_de_PHONEBOOK
4       N       .       ,       ;       -       _
4       N       0       1       2       3       4       5       6       7       8       9
4       W       a       b       c       ch      d       e       f       g       h       ...
4       A       A       B       C       CH      D       E       F       G       H       ...
4       A       á                                       é
4       A       Á                                       É
4       M       ae      Ä       ä
4       M       oe      Ö       ö
$

Here, 'coche' sorts exactly like 'COCHE' after 'cocina',
since the ch sorts after the c (and the i is not even considered).
'König' in German phonebooks sorts exactly like 'Koenig',
and in a terms listing, both will be displayed as 'koenig'.
(Unlike
>       http://oss.software.ibm.com/icu/charts/collation/de__PHONEBOOK.html     "de_PHONEBOOK"
, the
>       http://oss.software.ibm.com/icu/charts/collation/de.html        "de"
collation has the o-umlaut as secondary to o).


Note that, as a possibly confusing although correct side effect,
a prefix search for 'coc' will NOT match 'coche',
since the latter does not begin with the codes for 'c'-'o'-'c',
but with those for 'c'-'o'-'ch'.
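
The entry format described above could be read along the following lines.
This is only a sketch under assumptions: the function name is hypothetical,
the input is taken to be tab-separated text lines whose first column is the
field tag 4, and empty columns (as in the accented 'A' lines of the example)
are assumed to be placeholders that keep later entities aligned with the
last W or N entry.

```python
from typing import List, Tuple


def parse_collation_fields(lines: List[str]) -> List[Tuple[str, List[str]]]:
    """Return (mnemonic, entities) pairs for all tag-4 lines.

    Empty entities are kept, since they carry positional information.
    """
    entries = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or parts[0] != "4":
            continue  # not a collation field
        entries.append((parts[1], parts[2:]))
    return entries


if __name__ == "__main__":
    sample = [
        "4\tC\tspanish",
        "4\tW\ta\tb\tc\tch",
        "4\tA\tA\tB\tC\tCH",
        "4\tA\t\u00e1\t\t\t",
    ]
    for mnemonic, entities in parse_collation_fields(sample):
        print(mnemonic, entities)
```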


*       implementation

The collation is basically implemented by means of a recoding.
Every W and N entity (byte sequence) corresponds to one code number,
increasing from 2 in the order of their definition. 
Every unrecognized byte value, and especially the TAB (9), which can not
be redefined, maps to code number 1 and runs of those are squeezed
(i.e. only one 1 for a sequence of unrecognized bytes).
Aliases use the same code number as their corresponding W or N entity.

The recoded key is then a sequence of code numbers corresponding to
the recognized entities. Depending on the highest code number,
one or two bytes (big endian) are used for every number.


This transformation is applied to every index entry before storing it,
as well as to every term before lookup. From the table of entities,
the original term (in W and N entities, not aliases or maps) can be
reconstructed for display of indexed terms.

Note that the term decoded from a collation key does not necessarily map
back to the same key. Where the byte sequences of the decoded entities
overlap with others, they may become parts of other contractions.


The implementation limits both the length (in bytes) of sequences
and the number of codes of a map target to 15.
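

The recoding described in this section can be sketched as follows.
This is an illustration, not Malete's code: it works on Python strings
instead of bytes, leaves out M entries, uses a shortened entity table,
and assumes that the longest matching entity wins (which is what makes
'ch' take precedence over 'c' followed by 'h'). All names are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, List[str]]


def build_table(entries: List[Entry]) -> Dict[str, int]:
    """Map every W/N entity to a code from 2 upwards, aliases to the same code."""
    codes: Dict[str, int] = {}
    last: List[str] = []      # entities of the last seen W or N entry
    next_code = 2             # code 1 is reserved for unrecognized input
    for mnemonic, entities in entries:
        if mnemonic in ("W", "N"):
            last = entities
            for ent in entities:
                codes[ent] = next_code
                next_code += 1
        elif mnemonic == "A":
            for pos, ent in enumerate(entities):
                if ent and pos < len(last):   # empty fields only hold a position
                    codes[ent] = codes[last[pos]]
    return codes


def recode(term: str, codes: Dict[str, int]) -> List[int]:
    """Turn a term into code numbers, squeezing runs of unrecognized input."""
    longest = max(map(len, codes))
    key: List[int] = []
    i = 0
    while i < len(term):
        for n in range(min(longest, len(term) - i), 0, -1):
            code = codes.get(term[i:i + n])
            if code is not None:
                key.append(code)
                i += n
                break
        else:
            if not key or key[-1] != 1:
                key.append(1)             # only one 1 per unrecognized run
            i += 1
    return key


def key_bytes(key: List[int], codes: Dict[str, int]) -> bytes:
    """Store each code in one or two big-endian bytes, by highest code."""
    width = 1 if max(codes.values()) < 256 else 2
    return b"".join(code.to_bytes(width, "big") for code in key)


ENTRIES = [
    ("N", [".", ",", ";", "-", "_"]),
    ("W", ["a", "b", "c", "ch", "d", "e", "h", "i", "n", "o"]),
    ("A", ["A", "B", "C", "CH", "D", "E", "H", "I", "N", "O"]),
]
CODES = build_table(ENTRIES)

if __name__ == "__main__":
    for term in ("cocina", "coche", "COCHE", "coc"):
        print(term, recode(term, CODES))
```

With this table, 'coche' and 'COCHE' yield the same key, which compares
greater than the key of 'cocina', and the key of 'coc' is a prefix of
the key of 'cocina' but not of 'coche'.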


*       compression

With the compression option implemented and enabled, the number of bits used
per code is the minimal number of bits needed to represent the highest
code number, and the bitstring is padded with 0 bits to the full byte.
In a Spanish environment one would need 29 alphabetic codes
(including CH, LL and Ñ), 10 digits and some punctuation, so six bits
(codes 2-63) are sufficient and we can reduce key size by up to 25%
(6 instead of 8 bits per code).

This is probably most interesting for databases integrating a lot of
Phoenician and/or Brahmi scripts, using more than 256 but fewer than 512 codes.
Here one would need only 9 instead of 16 bits, saving more than 40%.
In a CJK environment, you will need at least 15 bits anyway.
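
The arithmetic above can be illustrated with a small bit-packing sketch.
It is an assumption about how the proposed option might work, not the
actual implementation; the function names and the sample code numbers
are made up.

```python
from typing import List


def bits_per_code(highest_code: int) -> int:
    """Minimal number of bits needed to represent the highest code number."""
    return max(1, highest_code.bit_length())


def pack(key: List[int], highest_code: int) -> bytes:
    """Concatenate fixed-width codes and pad with 0 bits to a full byte."""
    width = bits_per_code(highest_code)
    acc = 0
    for code in key:
        acc = (acc << width) | code
    total = width * len(key)
    pad = -total % 8                      # 0 bits needed to fill the last byte
    return (acc << pad).to_bytes((total + pad) // 8, "big")


if __name__ == "__main__":
    print(bits_per_code(63), 1 - 6 / 8)      # 6 bits, 25% saved vs. 8 bits
    print(bits_per_code(511), 1 - 9 / 16)    # 9 bits, 43.75% saved vs. 16 bits
    print(pack([19, 26, 20, 22], 63).hex())  # four codes fit in three bytes
```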


Do not confuse this compression of single keys with the option of
>       FileFormats     compressing the index
based on common prefixes between adjacent keys.


*       multilevel comparison

Some future version will also support S and T entries to support
secondary (optionally french) and tertiary levels, possibly one day even
quaternary and identical levels Q and I, should there be demand.


Example (cf. the
>       http://oss.software.ibm.com/icu/charts/collation/es.html        es chart
, lacking the ch contraction):
$
4       C       spanish
4       W       a       b       c       ch      d       e       f
4       T       A       B       C       CH      D       E       F
4       S       á                                       é
4       T       Á                                       É
4       S       ä
4       T       Ä
$

Here, 'coche' sorts before 'Coche',
since on the third level the 'c' sorts before 'C'.
(Unlike plain ASCII sorting, most collations sort lowercase before uppercase).
Still 'coche' sorts after 'Cocina', since the primary difference
between 'ch' and 'c' takes precedence over the tertiary difference,
although the latter occurs earlier in the word.
Just for the fun of it, the a-umlaut is not expanded here,
but listed as another secondary variant of 'a' with its own tertiary.
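
The ordering in this example can be reproduced with a small sketch.
It does not use the bit layout planned for Malete; instead it swaps in
plain Python tuples of (primary, secondary, tertiary) lists, which compare
level by level in the same way. The table is hand-built and extended by
'i', 'n' and 'o' (not part of the example) so that the sample words can be
keyed; all names are hypothetical.

```python
from typing import Dict, List, Tuple

LETTERS = ["a", "b", "c", "ch", "d", "e", "f", "i", "n", "o"]
ACCENTS = {"a": ["\u00e1", "\u00e4"], "e": ["\u00e9"]}   # a: á ä, e: é

# entity -> (primary rank, secondary variant, tertiary variant)
TABLE: Dict[str, Tuple[int, int, int]] = {}
for rank, low in enumerate(LETTERS, start=2):
    TABLE[low] = (rank, 0, 0)
    TABLE[low.upper()] = (rank, 0, 1)          # uppercase is tertiary variant 1
    for variant, accented in enumerate(ACCENTS.get(low, []), start=1):
        TABLE[accented] = (rank, variant, 0)
        TABLE[accented.upper()] = (rank, variant, 1)


def sort_key(word: str, french: bool = False) -> Tuple[List[int], List[int], List[int]]:
    """Build a key that compares primaries first, then secondaries, then tertiaries."""
    prim: List[int] = []
    sec: List[int] = []
    ter: List[int] = []
    i = 0
    while i < len(word):
        chunk = word[i:i + 2]                  # try the contraction first
        if chunk not in TABLE:
            chunk = word[i]
        p, s, t = TABLE.get(chunk, (1, 0, 0))  # 1 stands for unrecognized input
        prim.append(p)
        sec.append(s)
        ter.append(t)
        i += len(chunk)
    if french:
        sec.reverse()                          # french secondaries: reverse order
    return prim, sec, ter


if __name__ == "__main__":
    print(sorted(["Coche", "coche", "Cocina", "cocina"], key=sort_key))
```

Sorting with this key puts 'cocina' and 'Cocina' before 'coche' and
'Coche', and within each pair the lowercase form first.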


For multilevel comparison, a 0 code plus additional bits are appended to
the recoded key. First, for every character some bits are appended to encode
its secondary variant, depending on how many variants are defined for the
character, then likewise for tertiary variants.

In Latin scripts, typically every alphabetical character has one
tertiary variant (its lowercase equivalent, using one bit)
and some or all vowels can have one or more diacritical marks.


By appending additional bits not only do terms sort properly,
but moreover we have the option for an exact match sensitive to all levels
or a match insensitive to third or second and third level
very similar to a prefix match (since the first level IS a prefix).

An actual prefix match should usually be done using only the first level bits;
checking for a second and third level prefix is a little more complicated.


For French secondary sorting, the second level bits are appended in reverse
order. This must not be used together with left-to-right secondary sorting.


Using the additional bits, a terms listing can reconstruct the input with
regard to all variants, i.e. with proper case and diacritics.

However, aliases and mappings can not be reversed:
where the a umlaut should sort *exactly*
like an a followed by an e, it uses exactly the same bytes and we can not
tell from the index that once there was an a umlaut in the input.


*       east asian word indexing

Segmentation of
>       http://www.multilingual.com/FMPro?-db=archives&-format=ad%2fselected%5fresults.html&-find=      Chinese text
is, in general, not a trivial task.

Of course, the use of spaces to explicitly separate words is an option.
The usual word split will just work as for any other scripts.


Since Malete supports full application controlled indexing, it is also possible
to use any existing segmentation algorithm on the application level.


Where this is unwanted or not possible, a somewhat brute force,
yet feasible approach is to put every single character or
every digraph or trigraph in the index.
(I.e. every character together with the one or two following characters.)
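
As an illustration of this brute force approach, overlapping m-graphs
could be produced on the application level like this. The sketch is not
part of Malete, the function name is made up, and it works on Python
strings (code points) rather than on the encoded bytes.

```python
from typing import List


def m_graphs(text: str, m: int) -> List[str]:
    """Return every run of m consecutive characters in text."""
    return [text[i:i + m] for i in range(len(text) - m + 1)]


if __name__ == "__main__":
    sample = "\u4e2d\u6587\u4fe1\u606f"   # four Chinese characters
    print(m_graphs(sample, 1))            # single characters
    print(m_graphs(sample, 2))            # digraphs
    print(m_graphs(sample, 3))            # trigraphs
```

Each resulting string would then be passed to the index as one term.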

Please contact us, should there be demand for such a character or
m-graph indexing method as alternative to "word" split.


*       using collation

When a database is opened, Malete looks for the file _database_.m0d
(where _database_ is the name of your database, e.g. cds.m0d).
If this exists, it is scanned for collation (4) fields.

If there is a "4 C" naming the collation _name_,
and there either is a compiled collation file _name_.mcx
or another database is already using a collation of that name,
the existing collation info is used.

Else a new collation (with the given name or anonymous)
is created from the description in the metadata and,
if it is named, saved as file _name_.mcx (in the current directory).


The distribution contains two sample collation definitions: a Latin-1
based one in test/cds.m0d and a UTF-8 (Unicode) based one in test/unicode.m0d.
For the latter, use a UTF-8 capable editor like "vim '+set encoding=utf-8'" in an "xterm -u8".

Creating a collation definition from existing ISISAC.TAB and ISISUC.TAB
is a straightforward exercise left to the reader.


-       *WARNING*:
        Adding or changing the collation in _database_.m0d will render your
        encoded index data garbage!
        Make sure to save the plaintext index data using
        "malete qdump _database_ >_database_.mqt" before
        and reencode the index with "malete qload _database_ <_database_.mqt"
        after editing the m0d file.
-       *WARNING*:
        While named (and thus shared) collations are much more efficient,
        multiple databases using a different specification for the same
        collation name will not properly coexist.
        When changing a named collation, be sure to remove the .mcx file
        and reload the indexes for all affected databases.
        When in doubt, remove or change the collation name.


*       links

>       http://openisis.org/Doc/encoding        list of encodings supported by Java

>       http://openisis.org/Doc/Unicode notes on charsets and unicode with ISIS

>       http://openisis.org/Doc/CsTables        tables of some western latin sets

>       http://openisis.org/Doc/UniStats        statistics on unicode character properties

>       http://openisis.org/Doc/Collating       approaches to collation

>       http://www.iana.org/assignments/character-sets  iana registered charsets

>       http://web.archive.org/web/czyborra.com/unicode/characters.html#scripts unicode characters and scripts

ICU Collation
>       http://oss.software.ibm.com/icu/userguide/Collate_Intro.html    Introduction
and
>       http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html Concepts

Collation charts (without contractions) by
>       http://oss.software.ibm.com/icu/charts/collation/       locale
or
>       http://www.unicode.org/charts/collation/        script
(according to Unicode
>       http://www.unicode.org/unicode/reports/tr10/tr10-11.html#Default_Unicode_Collation_Element_Table        default collation
)

---
        $Id: CharSet.txt,v 1.12 2004/11/12 11:18:23 kripke Exp $