some notes on the use of charsets with ISIS


* what are charsets?

Since computers can store nothing but numbers, but we want them to
store characters, there has to be a table telling which character is
stored as which number, or, vice versa, which number is to display and
print as which character. Such tables are called charsets.
Since the smallest unit of number storage is a byte, which can hold
256 different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
one-byte-charsets.
For many scripts, like the various versions of Latin, Greek, Cyrillic,
Hebrew and Arabic, 256 characters are more than enough.
For others, namely the Chinese, Japanese and Korean (
> http://czyborra.com/charsets/cjk.html CJK
) scripts with several thousand characters, it's not enough.
The modern
> http://czyborra.com/charsets/vietnamese.html Vietnamese
script is based on Latin letters but needs a vast number
of accented letters, so 256 isn't enough. Those scripts don't get by
with one byte per character, so they need multi-byte-charsets, where
two or more bytes are needed to encode one character.
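
To make the "table" idea concrete, here is a small Python sketch (Python
and its codec names are an assumption of this example, nothing ISIS ships
with). It reads the same byte through three one-byte-charsets and shows
that a CJK character needs more than one byte.

```python
# One byte, three different tables: the number 228 (0xE4) stands for a
# different character in each of these one-byte-charsets.
raw = bytes([228])
as_latin1 = raw.decode("iso8859_1")    # Western European table
as_greek = raw.decode("iso8859_7")     # Greek table
as_cyrillic = raw.decode("iso8859_5")  # Cyrillic table
print(as_latin1, as_greek, as_cyrillic)

# A CJK character does not fit into one byte; Big5 uses two bytes for it.
han = "\u4e2d"                         # the Chinese character for "middle"
han_bytes = han.encode("big5")
print(len(han_bytes))
```

The byte itself never changes; only the table used to read it does.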


* what is UNICODE?

> http://czyborra.com/unicode/standard.html UNICODE
is a big multi-byte-charset designed to include all
> http://czyborra.com/unicode/characters.html characters
needed in the world (over 40,000 by now), even for some
ancient languages. The problems with having several charsets are a) you
have to know which charset is used in a given text, b) computer
systems need to be aware of all possible charsets and c) it's not
possible to have a text or database contain characters which are
encoded in different charsets. Having all text in Unicode solves those
problems. Check out
> http://www.unicode.org/iuc/iuc10/x-utf8.html this sample page
- with a 21st century browser
like Mozilla 5 (Netscape 6) you will see most or all of the letters.
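
Problem c) can be tried out with a few lines of Python (again only a
sketch; the helper name fits() and the sample text are made up for this
example). A text mixing four scripts has one Unicode number per character,
but none of the tested one-byte-charsets can hold all of it.

```python
import unicodedata

# One text mixing a Latin, a Greek, a Cyrillic and a Hebrew letter.
text = "a\u03b4\u0434\u05d0"

for ch in text:
    # Every character has exactly one Unicode number (code point).
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

def fits(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

one_byte = ["iso8859_1", "iso8859_5", "iso8859_7", "iso8859_8"]
print([fits(text, cs) for cs in one_byte])  # each table lacks some letter
print(fits(text, "utf_8"))                  # Unicode holds all of them
```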


* ASCII-compatible charsets and encodings

Many charsets use the numbers 0 to 127 in the same way: to represent
the basic set of Latin characters defined by
> http://czyborra.com/charsets/iso646.html ASCII
. In such an ASCII-compatible charset, whenever
there's a byte with a number in that range, this byte has the meaning
of the corresponding ASCII character. For example, the number 43
always is a plus sign +, which is important if a query expression is
scanned for such characters.
All
> http://czyborra.com/charsets/iso8859.html ISO-8859-x
charsets are ASCII-compatible. Some older
> http://czyborra.com/charsets/cyrillic.html Cyrillic
charsets are NOT compatible with ASCII. Some of the eastern
multi-byte-charsets are, some are not.
Some of the multi-byte-charsets have different encodings, that is,
there is only one table mapping numbers to letters, but distinct ways
to use multiple bytes to express such a number, some of which use the
numbers in the ASCII range only for ASCII characters, others don't.
UNICODE has two widely used encodings,
> http://czyborra.com/utf/#UTF-8 UTF-8
and UTF-16 (an extension of the older UCS-2). UTF-8 is ASCII-compatible,
UTF-16 is not.
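
The difference between the two Unicode encodings can be checked byte by
byte. The following Python sketch (the sample characters are chosen for
this example only) looks for the number 43 in text that contains no plus
sign at all.

```python
PLUS = 43  # the ASCII number of '+'

word = "\u012b"  # LATIN SMALL LETTER I WITH MACRON, no plus sign in it
plus_in_utf8 = PLUS in word.encode("utf_8")
plus_in_utf16 = PLUS in word.encode("utf_16_le")
print(plus_in_utf8)    # False: UTF-8 uses only bytes >= 128 for this letter
print(plus_in_utf16)   # True: one of its two bytes happens to be 43

ascii_text = "a+b"
same_in_utf8 = ascii_text.encode("utf_8") == b"a+b"
zero_in_utf16 = 0 in ascii_text.encode("utf_16_le")
print(same_in_utf8)    # True: plain ASCII text is unchanged in UTF-8
print(zero_in_utf16)   # True: UTF-16 pads ASCII letters with zero bytes
```

A scanner looking for byte 43 is therefore safe on UTF-8 data, but would
find plus signs that are not there in UTF-16 data.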


* so what about ISIS?

- the ISIS database format itself is capable of storing anything and
thus can store text in any charset/encoding.
Tools like BIREME's mx may store and retrieve (by MFN) text in
nearly any encoding (but depending on how the programming is done,
UTF-16 may not work because it may use bytes with value 0).
- the ISIS query and formatting language depends on certain
ASCII characters having special meaning and therefore requires
an ASCII-compatible encoding. All the ISO-8859-x charsets will do,
as will UTF-8 encoded Unicode (although some care must be taken
that the multiple bytes representing one character are not cut off
in the middle). At least in theory, mx and wwwisis are able to search for
records in any ASCII-compatible encoding including UTF-8 Unicode
(given careful web programming).
- winisis doesn't know about the possibility of one character having
multiple bytes. It will work with any ASCII-compatible
one-byte-charset, as long as it doesn't have to know what the bytes
mean. That is, if your computer has some preferred charset
installed, you will see all characters displayed according to that
charset, and a character possibly entered as the German ä could
show up as Greek delta :). No support for multi-byte-charsets,
especially not Unicode.
- Like any Java software,
> http://web.tiscali.it/javaisis/ JavaISIS
is - in theory - able to
handle Unicode characters and even to do the transformation
between Unicode and most of the other charsets. Some limitations
may result from the underlying wwwisis. In practice, version 3.5
claims to give "Multi-language encoding support", but
unfortunately it has been in beta since March 2001 (sources made
available in Feb 2002).
- openisis supports any charset and, with its Java binding,
especially Unicode and all the conversions. openisis alone can do
it on the web, and in combination with JavaISIS (once new sources
are available) also with a winisis-like interface.
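
The "cut off in the middle" problem with UTF-8 mentioned above can be
avoided by stepping back to a character boundary before cutting. This is
only a sketch in Python: the function name truncate_utf8() and the sample
field are made up here and are not part of mx, wwwisis or any other ISIS
tool.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 data to at most limit bytes without splitting a character."""
    if len(data) <= limit:
        return data
    cut = limit
    # Continuation bytes look like 10xxxxxx; step back until the byte at
    # the cut position starts a new character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

field = "M\u00fcller".encode("utf_8")  # 7 bytes, the u-umlaut takes two
naive = field[:2]                      # ends with a lone lead byte
safe = truncate_utf8(field, 2)         # cut moved back before the umlaut
print(naive, safe)
```

The naive cut leaves bytes that no longer decode as UTF-8; the safe cut
is one byte shorter but still valid text.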


* some other resources on unicode

To see all those characters, you need fonts to tell your display or
printer what they look like. Here's a
> http://www.hclrss.demon.co.uk/unicode/fonts.html very fine page
on how to
acquire and install those fonts (and some more advice). James Kass has
a
> http://home.att.net/~jameskass/scriptlinks.htm long list
of high quality links related to Unicode. If you for some reason have to
waste your time with M$ products, you may want to check out
> http://www.microsoft.com/typography/fonts/ this page
. In particular, there's the one-size(23 MB)-fits-all fat font
> http://office.microsoft.com/downloads/2000/aruniupd.aspx Arial Unicode MS
(TM, (c), ... expect the worst) containing nearly all Unicode glyphs,
which is also included with newer Windoze and/or Ophice versions.


See
> UniStat statistics
about how characters are distributed across Unicode.
For example, the only scripts using uppercase/lowercase are those
derived from Greek (i.e. Latin, Cyrillic, Armenian and Georgian).

---
$Id: Unicode.txt,v 1.2 2003/05/08 14:04:39 kripke Exp $