Annotation of /openisis/0.9.9e/doc/CharSet.txt

Revision 604
Mon Dec 27 21:49:01 2004 UTC by dpavlin
File MIME type: text/plain
File size: 13929 byte(s)
import of new openisis release, 0.9.9e

1     character set support in Malete
2    
3     See end for usage note.
4    
5    
6     * overview
7    
8     Malete supports any "character set" (in the MIME sense of "charset" =
9     > http://ietf.org/rfc/rfc2278.txt CCS + CES
10     ) which is compatible with
11     > http://ietf.org/rfc/rfc0020.txt ASCII
12     so that
13     - every character defined by ASCII is encoded by the same byte value
14     - every byte with a value in the range 0-127 inclusive encodes the
15     character as specified by ASCII
16    
17     This
18     - includes
19     > http://aspell.net/charsets/iso8859.html ISO-8859-*
20     , Unicode in the
21     > http://www.ietf.org/rfc/rfc2279.txt UTF-8
22     encoding, various far east
23     > http://web.archive.org/web/czyborra.com/utf/#EUC EUC
24     and similar encodings using pairs of bytes greater than 127
25     and works well with most (but not all) IBM/M$
26     > http://aspell.net/charsets/codepages.html codepages
27     unless you really abused the control characters 0-31 for graphics
28     - excludes
29     > http://web.archive.org/web/czyborra.com/utf/#UTF-16 UTF-16
30     and other formats using bytes 0-127 as part of multibyte sequences
31     and, of course, the anti-ASCII
32     > http://aspell.net/charsets/iso646.html#EBCDIC EBCDIC
33     - should work with some restrictions on searching for encodings which at
34     least preserve linefeed (10), horizontal TAB (9) and the digits (48-57)
35     like the Unicode standard compression scheme
36     > http://www.unicode.org/unicode/reports/tr6/ SCSU
37     ,
38     > http://aspell.net/charsets/vietnamese.html VISCII
39     and even old
40     > http://aspell.net/charsets/iso646.html#i18n ISO-646-*
41     or Cyrillic
42     > http://aspell.net/charsets/cyrillic.html KOI
43    
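The two ASCII conditions above can be checked mechanically for any codec a scripting environment knows. The following is a minimal sketch in Python, not part of Malete; the function name is made up for illustration. It only tests the one-to-one mapping of the values 0-127 and does not detect encodings that reuse those bytes inside multibyte sequences.

```python
def is_ascii_transparent(codec: str) -> bool:
    """Check that every ASCII character and every byte 0-127 map to each other."""
    for value in range(128):
        char, byte = chr(value), bytes([value])
        try:
            # Condition 1: the ASCII character is encoded by the same byte value.
            # Condition 2: the single byte decodes to the same ASCII character.
            if char.encode(codec) != byte or byte.decode(codec) != char:
                return False
        except UnicodeError:
            return False
    return True


if __name__ == "__main__":
    for name in ("iso-8859-1", "utf-8", "koi8-r", "utf-16", "cp037"):
        print(name, is_ascii_transparent(name))
```

Here cp037 is Python's name for one EBCDIC code page; it and UTF-16 fail the check, while the ISO-8859, UTF-8 and KOI8 codecs pass.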
44     In order to store and retrieve (by record id) data,
45     Malete does not need to know anything about the character set.
46     However, the
47     > MetaData content-type
48     may contain a charset name using a preferred MIME name from the
49     > http://www.iana.org/assignments/character-sets iana registered charsets.
50     The basic server does not support character set conversion, since many
51     client environments like Java or Tcl are well prepared to handle this.
52     A Tcl based server may support charset conversion.
53    
54    
55     * indexing
56    
57     For indexing, we use quite a lot of information about characters similar to,
58     but extending the traditional
59     > http://www.cindoc.csic.es/isis/21-5.htm ISISUC.TAB
60     and
61     > http://www.cindoc.csic.es/isis/21-6.htm ISISAC.TAB
62     - a sort order (collation sequence),
63     possibly using
64     > http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html multilevel comparison
65     according to the
66     > http://unicode.org/unicode/reports/tr10/ UCA
67     - a mapping of certain code sequences to others,
68     for example to use uppercased versions in the index
69     - some notion of which characters are "alphabetic"
70     (parts of words) for indexing in word split mode
71    
72     This data is configured by a sequence of strings,
73     which may be obtained from the collation (4) fields of a database's
74     > MetaData metadata
75     record (in _database_.m0d) or as the lines of a
76     textfile (_collation_.mcd, not implemented).
77    
78    
79     These strings start with a one or two character mnemonic
80     which is followed by tab-separated sequences of bytes,
81     representing single characters or sequences of characters
82     in the database's encoding.
83     Unlike CDS/ISIS, Malete always deals with multibyte entities
84     and does not use explicit codes as decimal numbers.
85     Consequently, the collation configuration can be converted between
86     charsets just like the database itself (e.g. using recode or iconv).
87    
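To illustrate the shape of such a string, here is a small Python sketch that splits one collation entry into its mnemonic and its entities. It is an illustration only: the function name is hypothetical, and it assumes that a tab also separates the mnemonic from the first entity. Entities are kept as raw bytes, so the same code works for any database encoding.

```python
def parse_collation_entry(line: bytes):
    """Split one collation string into (mnemonic, [entity, ...])."""
    fields = line.rstrip(b"\r\n").split(b"\t")
    mnemonic = fields[0].decode("ascii")
    if not 1 <= len(mnemonic) <= 2:
        raise ValueError("mnemonic must be one or two characters")
    # Entities stay undecoded: Malete treats them as plain byte sequences.
    return mnemonic, fields[1:]


if __name__ == "__main__":
    entry = "M\tae\tÄ\tä"
    print(parse_collation_entry(entry.encode("utf-8")))
    print(parse_collation_entry(entry.encode("iso-8859-1")))
```

The two printed results differ only in the bytes of the umlauts, which is why the configuration can be run through recode or iconv together with the data.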
88    
89     Malete does not care whether a multibyte sequence holds the two ASCII
90     characters 'C' and 'H' in order to assign 'CH' a separate rank between
91     'C' and 'D' in Spanish collation (a
92     > http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Contractions Contraction
93     ) or an
94     > http://www.loc.gov/marc/specifications/speccharmarc8.html ANSEL
95     (
96     > http://lcweb2.loc.gov/cocoon/codetables/45.html codes
97     ,
98     > http://www.niso.org/standards/resources/Z39-47.pdf Z39.47
99     ) style composition or
100     UTF-8 using two or more bytes to encode a character with a diacritical mark
101     (in precomposed or decomposed form).
102    
103    
104     Configuration entries supported in the initial version are:
105     - collation C _name_ [_options_]
106     assigns a name to this collation or refers to an external collation.
107     Only the first 31 bytes in _name_ are considered.
108     Should be a C identifier (plain ASCII) for best interoperability.
109     Proposed (but probably not implemented) options are 'c' for compression
110     and 'f' for French (reverse) secondaries (see below).
111     - word W _entities_
112     specifies that the listed entities are considered parts of words
113     and assigns sort ranks in ascending order to them
114     - nonword N _entities_
115     like W, but the entities separate "words" in word split mode.
116     Multiple W and N entries can be used to assign successive sort ranks.
117     - alias A _entities_
118     the entities are assigned as aliases to the corresponding entities
119     of the last seen W or N, e.g. a sequence of lowercase characters
120     to their uppercase equivalents
121     - map M _entities_
122     the second and following entities are mapped to the first one,
123     which will be iteratively checked for other rules (but not maps).
124     This can be used to map entities to empty (completely discarding them)
125     or to multiple entities as an
126     > http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html#Expansions Expansion
127    
128     Example:
129     $
130     4 C spanish_de_PHONEBOOK
131     4 N . , ; - _
132     4 N 0 1 2 3 4 5 6 7 8 9
133     4 W a b c ch d e f g h ...
134     4 A A B C CH D E F G H ...
135     4 A á é
136     4 A Á É
137     4 M ae Ä ä
138     4 M oe Ö ö
139     $
140    
141     Here, 'coche' sorts exactly like 'COCHE' after 'cocina',
142     since the ch sorts after the c (and the i is not even considered).
143     'König' in German phonebooks sorts exactly like 'Koenig',
144     and in a terms listing, both will be displayed as 'koenig'.
145     (Unlike
146     > http://oss.software.ibm.com/icu/charts/collation/de__PHONEBOOK.html "de_PHONEBOOK"
147     , the
148     > http://oss.software.ibm.com/icu/charts/collation/de.html "de"
149     collation has the o-umlaut as secondary to o).
150    
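The effect of the M entries on 'König' can be sketched as a simple byte-level substitution. This is an illustrative Python model, not Malete's code; it assumes that the longest matching source entity wins and, as stated for maps, that a map target is not mapped again.

```python
def apply_maps(data: bytes, maps):
    """Apply M entries, each given as [target, source1, source2, ...]."""
    table = {source: entry[0] for entry in maps for source in entry[1:]}
    longest = max((len(source) for source in table), default=0)
    out, i = bytearray(), 0
    while i < len(data):
        # Try the longest candidate first (an assumption of this sketch).
        for n in range(min(longest, len(data) - i), 0, -1):
            target = table.get(data[i:i + n])
            if target is not None:
                out += target  # the target is emitted as is, never re-mapped
                i += n
                break
        else:
            out.append(data[i])  # no map applies, copy the byte unchanged
            i += 1
    return bytes(out)


MAPS = [
    [b"ae", "Ä".encode("utf-8"), "ä".encode("utf-8")],
    [b"oe", "Ö".encode("utf-8"), "ö".encode("utf-8")],
]

if __name__ == "__main__":
    print(apply_maps("König".encode("utf-8"), MAPS))
```

An entry with an empty target, such as [b"", b"-"], discards the listed entities completely.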
151    
152     Note that as a possibly confusing, although correct side effect,
153     a prefix search for 'coc' will NOT match 'coche',
154     since it does not begin with the codes for 'c'-'o'-'c',
155     but with those for 'c'-'o'-'ch'.
156    
157    
158     * implementation
159    
160     The collation is basically implemented by means of a recoding.
161     Every W and N entity (byte sequence) corresponds to one code number,
162     increasing from 2 in the order of their definition.
163     Every unrecognized byte value, and especially the TAB (9), which cannot
164     be redefined, maps to code number 1 and runs of those are squeezed
165     (i.e. only one 1 for a sequence of unrecognized bytes).
166     Aliases use the same code number as their corresponding W or N entity.
167    
168     The recoded key is then a sequence of code numbers corresponding to
169     the recognized entities. Depending on the highest code number,
170     one or two bytes (big endian) are used for every number.
171    
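The recoding described above can be modelled in a few lines of Python. This is a sketch under assumptions, not the actual implementation: the entity table is a shortened, hypothetical variant of the Spanish example, aliases are assigned by position, contractions are resolved by greedy longest match, and two bytes per code are used as soon as the highest code exceeds 255.

```python
# Hypothetical table in the spirit of the example; entities are byte strings.
ENTRIES = [
    ("W", b"a b c ch d e h i n o".split()),
    ("A", b"A B C CH D E H I N O".split()),
]


def build_codes(entries):
    """Assign code numbers from 2 upwards to W/N entities, reuse them for A."""
    codes, next_code, last = {}, 2, []
    for mnemonic, entities in entries:
        if mnemonic in ("W", "N"):
            last = list(range(next_code, next_code + len(entities)))
            next_code += len(entities)
        if mnemonic in ("W", "N", "A"):
            codes.update(zip(entities, last))  # aliases pair up by position
    return codes


def recode(data, codes):
    """Turn a byte string into a key of big-endian code numbers."""
    width = 1 if max(codes.values()) < 256 else 2
    longest = max(len(entity) for entity in codes)
    numbers, i = [], 0
    while i < len(data):
        for n in range(min(longest, len(data) - i), 0, -1):
            code = codes.get(data[i:i + n])
            if code is not None:
                numbers.append(code)
                i += n
                break
        else:
            # Unrecognized byte: code 1, with runs squeezed into a single 1.
            if not numbers or numbers[-1] != 1:
                numbers.append(1)
            i += 1
    return b"".join(number.to_bytes(width, "big") for number in numbers)


CODES = build_codes(ENTRIES)

if __name__ == "__main__":
    for term in (b"cocina", b"coche", b"COCHE", b"coc"):
        print(term, list(recode(term, CODES)))
```

With this table 'coche' and 'COCHE' yield the same key, 'cocina' sorts before both, and the key for 'coc' is not a prefix of the key for 'coche'.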
172    
173     This transformation is applied to every index entry before storing it,
174     as well as to every term before lookup. From the table of entities,
175     the original term (in W and N entities, not aliases or maps) can be
176     reconstructed for display of indexed terms.
177    
178     Note that the term decoded from a collation key does not necessarily map
179     to the same key. Where their byte sequences overlap with others,
180     they may become parts of other contractions.
181    
182    
183     The implementation limits both the length (in bytes) of sequences
184     and the number of codes of a map target to 15.
185    
186    
187     * compression
188    
189     With the compression option implemented and enabled, the number of bits used
190     per code is the minimal number of bits needed to represent the highest
191     code number, and the bitstring is padded with 0 bits to the full byte.
192     In the Spanish environment one would need 29 alphabetic codes
193     (including CH, LL and Ñ), 10 digits and some punctuation, so six bits
194     (codes 2-63) are sufficient and we can reduce key size by up to 25%.
195    
196     This probably is especially interesting for databases integrating a lot of
197     phoenician and/or brahmi scripts, using more than 256 but less than 512 codes.
198     Here one would need only 9 instead of 16 bits, saving more than 40%.
199     In a CJK environment, you will need at least 15 bits anyway.
200    
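The arithmetic behind these figures can be reproduced with a small bit-packing sketch in Python. It is only a model of the proposed option (which may not be implemented); the function name and the choice to fill bits from the most significant end are assumptions.

```python
def pack_codes(codes, highest):
    """Pack code numbers using the minimal bit width for `highest`."""
    bits = highest.bit_length()        # 63 -> 6 bits, 511 -> 9 bits
    value = 0
    for code in codes:
        value = (value << bits) | code
    total = bits * len(codes)
    padding = -total % 8               # pad with 0 bits to a full byte
    return (value << padding).to_bytes((total + padding) // 8, "big")


if __name__ == "__main__":
    key = [4, 11, 5, 7]                       # four codes, one byte each uncompressed
    print(len(pack_codes(key, 63)))           # 3 bytes: 25% smaller
    print(len(pack_codes([300] * 16, 511)))   # 18 bytes instead of 32
```

Six bits instead of eight save 25%, and nine bits instead of sixteen save 7/16, a little under 44%.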
201    
202     Do not confuse this compression of single keys with the option of
203     > FileFormats compressing the index
204     based on common prefixes between adjacent keys.
205    
206    
207     * multilevel comparison
208    
209     Some future version will also support S and T entries to support
210     secondary (optionally french) and tertiary levels, possibly one day even
211     quaternary and identical levels Q and I, should there be demand.
212    
213    
214     Example (cf. the
215     > http://oss.software.ibm.com/icu/charts/collation/es.html es chart
216     , lacking the ch contraction):
217     $
218     4 C spanish
219     4 W a b c ch d e f
220     4 T A B C CH D E F
221     4 S á é
222     4 T Á É
223     4 S ä
224     4 T Ä
225     $
226    
227     Here, 'coche' sorts before 'Coche',
228     since on the third level the 'c' sorts before 'C'.
229     (Unlike plain ASCII sorting, most collations sort lowercase before uppercase).
230     Still 'coche' sorts after 'Cocina', since the primary difference
231     between 'ch' and 'c' takes precedence over the tertiary difference,
232     although the latter occurs earlier in the word.
233     Just for the fun of it, the a-umlaut is not expanded here,
234     but listed as another secondary variant of 'a' with its own tertiary.
235    
236    
237     For multilevel comparison, a 0 code plus additional bits are appended to
238     the recoded key. First, for every character some bits are appended to code
239     its secondary variant, depending on how many variants are defined for the
240     character, then likewise for tertiary variants.
241    
242     In Latin scripts, typically every alphabetical character has one
243     tertiary variant (its uppercase equivalent, using one bit)
244     and some or all vowels can have one or more diacritical marks.
245    
246    
247     By appending additional bits not only do terms sort properly,
248     but moreover we have the option for an exact match sensitive to all levels
249     or a match insensitive to third or second and third level
250     very similar to a prefix match (since the first level IS a prefix).
251    
252     An actual prefix match should usually be done using only the first level bits;
253     checking for second and third level prefix is a little bit more complicated.
254    
255    
256     For French secondary sorting, the second level bits are appended in reverse
257     order. This must not be combined with left-to-right secondary sorting.
258    
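The ordering rules of this section can be tried out with a simplified model in Python. It is a sketch only: instead of the appended bit strings it uses a tuple of three lists (primary codes, secondary variants, tertiary variants), which compares in the same level-by-level way. The entity table, the variant numbering and all names are assumptions made for illustration.

```python
BASE = ["a", "b", "c", "ch", "d", "e", "h", "i", "n", "o", "t"]
ACCENTS = {"á": "a", "é": "e", "ô": "o"}  # each is secondary variant 1 of its base


def build_table():
    """Map each entity to (primary code, secondary variant, tertiary variant)."""
    table = {}
    for rank, entity in enumerate(BASE, start=2):
        table[entity] = (rank, 0, 0)
        table[entity.upper()] = (rank, 0, 1)      # uppercase is the tertiary variant
    for accented, base in ACCENTS.items():
        rank = table[base][0]
        table[accented] = (rank, 1, 0)
        table[accented.upper()] = (rank, 1, 1)
    return table


TABLE = build_table()


def sort_key(word, french=False):
    primary, secondary, tertiary = [], [], []
    i = 0
    while i < len(word):
        for n in (2, 1):                          # prefer the contraction 'ch'
            chunk = word[i:i + n]
            entry = TABLE.get(chunk)
            if entry is not None:
                break
        else:
            chunk, entry = word[i], (1, 0, 0)     # unrecognized character
        primary.append(entry[0])
        secondary.append(entry[1])
        tertiary.append(entry[2])
        i += len(chunk)
    if french:
        secondary.reverse()                       # French: last accent decides first
    return primary, secondary, tertiary


if __name__ == "__main__":
    print(sorted(["Coche", "cocina", "coche", "Cocina"], key=sort_key))
    words = ["côté", "coté", "côte", "cote"]
    print(sorted(words, key=sort_key))
    print(sorted(words, key=lambda w: sort_key(w, french=True)))
```

Comparing only the first element of the key gives a match insensitive to the second and third level; comparing the first two ignores case only.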
259    
260     Using the additional bits, a terms listing can reconstruct the input with
261     regard to all variants, i.e. with proper case and diacritics.
262    
263     However, aliases and mappings can not be reversed:
264     where the a umlaut should sort *exactly*
265     like an a followed by an e, it uses exactly the same bytes and we can not
266     tell from the index that once there was an a umlaut in the input.
267    
268    
269     * east asian word indexing
270    
271     Segmentation of
272     > http://www.multilingual.com/FMPro?-db=archives&-format=ad%2fselected%5fresults.html&-find= Chinese text
273     is, in general, not a trivial task.
274    
275     Of course, the use of spaces to explicitly separate words is an option.
276     The usual word split will just work as for any other scripts.
277    
278    
279     Since Malete supports full application controlled indexing, it is also possible
280     to use any existing segmentation algorithm on the application level.
281    
282    
283     Where this is unwanted or not possible, a somewhat brute force,
284     yet feasible approach is to put every single character or
285     every digraph or trigraph in the index.
286     (I.e. every character together with the one or two following characters).
287    
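On the application level, such a brute force split is only a few lines. The following Python sketch (the function name is hypothetical) produces all overlapping runs of n characters; the application would then encode each run in the database's charset and add it to the index.

```python
def ngrams(text: str, n: int):
    """Return all overlapping runs of n characters (n=2 digraphs, n=3 trigraphs)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(ngrams("中文信息", 1))
    print(ngrams("中文信息", 2))
    print(ngrams("中文信息", 3))
```

Texts shorter than n simply yield no entries, so single characters may need to be indexed in addition.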
288     Please contact us, should there be demand for such a character or
289     m-graph indexing method as alternative to "word" split.
290    
291    
292     * using collation
293    
294     When a database is opened, Malete looks for the file _database_.m0d
295     (where _database_ is the name of your database, e.g. cds.m0d).
296     If this exists, it is scanned for fields with tag 4 (collation).
297    
298     If there is a "4 C" naming the collation _name_,
299     and there either is a compiled collation file _name_.mcx
300     or another database is already using a collation of that name,
301     the existing collation info is used.
302    
303     Else a new collation (with the given name or anonymous)
304     is created from the description in the metadata and,
305     if it is named, saved as file _name_.mcx (in the current directory).
306    
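The lookup order just described can be summarized in a short Python sketch. It is a model of the decision logic only: all names are hypothetical, and compile_definition is a placeholder that says nothing about the real .mcx file format.

```python
import os


def compile_definition(definition):
    """Placeholder for compiling the '4' fields; the real format is not modelled."""
    return b"\n".join(definition)


def resolve_collation(name, definition, loaded, directory="."):
    """Return (collation, source) following the lookup order described above."""
    if name:
        if name in loaded:                        # another database already uses it
            return loaded[name], "shared"
        path = os.path.join(directory, name + ".mcx")
        if os.path.exists(path):                  # compiled collation file exists
            with open(path, "rb") as handle:
                loaded[name] = handle.read()
            return loaded[name], "file"
    collation = compile_definition(definition)    # new, possibly anonymous
    if name:
        with open(os.path.join(directory, name + ".mcx"), "wb") as handle:
            handle.write(collation)
        loaded[name] = collation
    return collation, "created"
```

Note that in this model a second database naming the same collation gets the existing one, whatever its own metadata says.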
307    
308     The distribution contains two sample collation definitions, a Latin-1
309     based one in test/cds.m0d and a UTF-8 (Unicode) based one in test/unicode.m0d.
310     Use a UTF-8 capable editor like "vim '+set encoding=utf-8'" in a "xterm -u8".
311    
312     Creating a collation definition from existing ISISAC.TAB and ISISUC.TAB
313     is a straightforward exercise left to the reader.
314    
315    
316     - *WARNING*:
317     Adding or changing the collation in _database_.m0d will render your
318     encoded index data garbage!
319     Make sure to save the plaintext index data using
320     "malete qdump _database_ >_database_.mqt" before
321     and reencode the index with "malete qload _database_ <_database_.mqt"
322     after editing the m0d file.
323     - *WARNING*:
324     While named (and thus shared) collations are much more efficient,
325     multiple databases using a different specification for the same
326     collation name will not properly coexist.
327     When changing a named collation, be sure to remove the .mcx file
328     and reload the indexes for all affected databases.
329     When in doubt, remove or change the collation name.
330    
331    
332     * links
333    
334     > http://openisis.org/Doc/encoding list of encodings supported by Java
335    
336     > http://openisis.org/Doc/Unicode notes on charsets and unicode with ISIS
337    
338     > http://openisis.org/Doc/CsTables tables of some western latin sets
339    
340     > http://openisis.org/Doc/UniStats statistics on unicode character properties
341    
342     > http://openisis.org/Doc/Collating approaches to collation
343    
344     > http://www.iana.org/assignments/character-sets iana registered charsets
345    
346     > http://web.archive.org/web/czyborra.com/unicode/characters.html#scripts unicode characters and scripts
347    
348     ICU Collation
349     > http://oss.software.ibm.com/icu/userguide/Collate_Intro.html Introduction
350     and
351     > http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html Concepts
352    
353     Collation charts (without contractions) by
354     > http://oss.software.ibm.com/icu/charts/collation/ locale
355     or
356     > http://www.unicode.org/charts/collation/ script
357     (according to Unicode
358     > http://www.unicode.org/unicode/reports/tr10/tr10-11.html#Default_Unicode_Collation_Element_Table default collation
359     )
360    
361     ---
362     $Id: CharSet.txt,v 1.12 2004/11/12 11:18:23 kripke Exp $
