some notes on the use of charsets with ISIS


* what are charsets?

Since computers can store nothing but numbers, but we want them to
store characters, there has to be a table telling which character is
stored as which number, or, vice versa, which number is to be displayed
and printed as which character. Such tables are called charsets.
Since the smallest unit of number storage is a byte, which can hold
256 different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
one-byte-charsets.
For many scripts, like the various versions of Latin, Greek, Cyrillic,
Hebrew and Arabic, 256 characters are more than enough.
For others, namely the Chinese, Japanese and Korean (
> http://czyborra.com/charsets/cjk.html CJK
) scripts with several thousand characters, it's not enough.
The modern
> http://czyborra.com/charsets/vietnamese.html Vietnamese
script is based on Latin letters but needs a vast number
of accented letters, so 256 isn't enough. Those scripts don't get by
with one byte per character, so they need multi-byte-charsets, where
two or more bytes are needed to encode one character.
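
The character/number tables described above can be tried out directly.
The following is only a small sketch in Python; the helper name
byte_values and the sample words are made up for illustration, and
ISO-8859-1 and Big5 are just two charsets picked as examples.

```python
# Illustrative sketch only: byte_values is a hypothetical helper and the
# sample text is arbitrary.

def byte_values(text, charset):
    """Return the numbers (byte values) used to store text in charset."""
    return list(text.encode(charset))

# One-byte-charset: every character is stored as exactly one number.
latin = byte_values("Käse", "iso-8859-1")

# The reverse direction: a stored number is displayed as a character.
letter = bytes([228]).decode("iso-8859-1")

# Multi-byte-charset: a single Chinese character takes two bytes in Big5.
cjk = byte_values("中", "big5")

print(latin, letter, len(cjk))
```

Here the four letters of "Käse" become four numbers, while the single
Chinese character needs more than one byte.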


* what is UNICODE

> http://czyborra.com/unicode/standard.html UNICODE
is a big multi-byte-charset designed to include all
> http://czyborra.com/unicode/characters.html characters
needed in the world (over 40,000 by now), even for some
ancient languages. The problems with having several charsets are a) you
have to know which charset is used in a given text, b) computer
systems need to be aware of all possible charsets and c) it's not
possible to have a text or database contain characters which are
encoded in different charsets. Having all text in unicode solves those
problems. Check out
> http://www.unicode.org/iuc/iuc10/x-utf8.html this sample page
- with a 21st century browser
like Mozilla 5 (Netscape 6) you will see most or all of the letters.
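
Problems a) and c) can be reproduced in a few lines. This is a sketch
in Python under assumptions of our own: the helper fits and the sample
strings are hypothetical, and ISO-8859-1 (Latin) and ISO-8859-7 (Greek)
merely stand in for "several charsets".

```python
# Illustrative sketch only: fits is a hypothetical helper.

# a) The same stored bytes mean different text under different charsets.
raw = "Bär".encode("iso-8859-1")      # byte values 66, 228, 114
as_latin = raw.decode("iso-8859-1")   # read with the charset it was written in
as_greek = raw.decode("iso-8859-7")   # read with the wrong charset

def fits(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# c) A text mixing a German and a Greek letter fits neither one-byte
#    charset, but it does fit unicode (here in its UTF-8 encoding).
mixed = "ä and δ"
results = {cs: fits(mixed, cs) for cs in ("iso-8859-1", "iso-8859-7", "utf-8")}

print(as_latin, as_greek, results)
```

The byte 228 comes back as the German a-umlaut in one case and as the
Greek delta in the other, although nothing in the stored data changed.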


* ASCII-compatible charsets and encodings

Many charsets use the numbers 0 to 127 in the same way: to represent
the basic set of Latin characters defined by
> http://czyborra.com/charsets/iso646.html ASCII
. In such a charset, whenever
there's a byte with a number in that range, this byte has the meaning
of the corresponding ASCII-character. For example, the number 43
is always a plus sign +, which is important if a query expression is
scanned for such characters.
All
> http://czyborra.com/charsets/iso8859.html ISO-8859-x
charsets are ASCII-compatible. Some older
> http://czyborra.com/charsets/cyrillic.html Cyrillic
charsets are NOT compatible with ASCII. Some of the East Asian
multi-byte-charsets are, some are not.
Some of the multi-byte-charsets have different encodings, that is,
there is only one table mapping numbers to letters, but distinct ways
to use multiple bytes to express such a number. Some of these use the
numbers in the ASCII-range only for ASCII characters, others don't.
UNICODE has two widely used encodings,
> http://czyborra.com/utf/#UTF-8 UTF-8
and UTF-16 (UCS-2). UTF-8 is ASCII-compatible, UTF-16 is not.
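
What ASCII-compatibility means for a program scanning for the plus sign
(number 43) can be sketched as follows. The sample text is an arbitrary
choice of ours: a real plus sign followed by the letter U+012B, whose
number happens to contain the byte 43 when written in UTF-16.

```python
# Illustrative sketch only: the sample text is arbitrary.
text = "a+\u012b"   # 'a', a real plus sign, then the letter U+012B

utf8 = text.encode("utf-8")        # bytes 61 2B C4 AB (hex)
utf16 = text.encode("utf-16-be")   # bytes 00 61 00 2B 01 2B (hex)

# Count the bytes with value 43 (hex 2B), i.e. what a byte-wise scanner
# would take for a plus sign.
plus_in_utf8 = utf8.count(b"+")
plus_in_utf16 = utf16.count(b"+")

# In UTF-8, every byte of a non-ASCII character lies outside the ASCII range.
non_ascii_safe = all(b >= 128 for b in "\u012b".encode("utf-8"))

# UTF-16 also contains bytes with value 0, which programs treating 0 as
# "end of string" will stumble over.
has_zero_byte = 0 in utf16

print(plus_in_utf8, plus_in_utf16, non_ascii_safe, has_zero_byte)
```

In the UTF-8 bytes the scanner finds exactly the one real plus sign; in
the UTF-16 bytes it finds a second, false one inside another character.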


* so what about ISIS

- the ISIS database format itself is capable of storing anything and
  thus can store text in any charset/encoding.
  Tools like BIREME's mx may store and retrieve (by MFN) text in
  nearly any encoding (but depending on how the programming is done,
  UTF-16 may not work because it may use bytes with value 0).
- the ISIS query and formatting language depends on certain
  ASCII-characters having special meaning and therefore requires
  an ASCII-compatible encoding. All the ISO-8859-x charsets will do,
  as will UTF-8 encoded unicode (although some care must be taken
  that the multiple bytes representing one character are not cut off
  in the middle). At least in theory, mx and wwwisis are able to search
  for records in any ASCII-compatible encoding including UTF-8 unicode
  (given careful web-programming).
- winisis doesn't know about the possibility of one character having
  multiple bytes. It will work with any ASCII-compatible
  one-byte-charset, as long as it doesn't have to know what the bytes
  mean. That is, if your computer has some preferred charset
  installed, you will see all characters displayed according to that
  charset, and a character possibly entered as the German ä could
  show up as Greek delta :). No support for multi-byte-charsets,
  especially not unicode.
- Like any Java software,
> http://web.tiscali.it/javaisis/ JavaISIS
  is - in theory - able to
  handle unicode characters and even to do the transformation
  between unicode and most of the other charsets. Some limitations
  may result from the underlying wwwisis. In practice, version 3.5
  claims to give "Multi-language encoding support", but
  unfortunately it has been in beta since March 2001 (sources made
  available in Feb 2002).
- openisis supports any charset and, with its Java-binding,
  especially unicode and all the conversions. openisis alone can do
  it on the web, and in combination with JavaISIS (once new sources
  are available) also with a winisis-like interface.
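
The "care" needed when UTF-8 text is cut to a fixed length (for example
to fit a field or a display column) can be sketched like this. The
function truncate_utf8 is a hypothetical helper written for this note,
not part of mx, wwwisis or openisis, and it assumes its input is valid
UTF-8.

```python
# Illustrative sketch only: truncate_utf8 is a hypothetical helper.

def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most limit bytes without splitting a character."""
    if len(data) <= limit:
        return data
    cut = limit
    # In UTF-8, the second and later bytes of a character have the bit
    # pattern 10xxxxxx. If the first dropped byte is one of those, the
    # cut falls inside a character, so step back to that character's start.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

data = "Bär".encode("utf-8")      # 4 bytes: the a-umlaut takes two

naive = data[:2]                  # cuts the a-umlaut in the middle
careful = truncate_utf8(data, 2)  # drops the incomplete character

print(naive, careful)
```

The naive cut leaves a dangling first byte that no longer decodes as
UTF-8, while the careful cut returns a shorter but still valid text.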


* some other resources on unicode

To see all those characters, you need fonts to tell your display or
printer what they look like. Here's a
> http://www.hclrss.demon.co.uk/unicode/fonts.html very fine page
on how to
acquire and install those fonts (and some more advice). James Kass has
a
> http://home.att.net/~jameskass/scriptlinks.htm long list
of high quality links related to Unicode. If you for some reason have to
waste your time with M$ products, you may want to check out
> http://www.microsoft.com/typography/fonts/ this page
. In particular, there's the one-size(23 MB)-fits-all fat font
> http://office.microsoft.com/downloads/2000/aruniupd.aspx Arial Unicode MS
(TM, (c), ... expect the worst) containing nearly all unicode glyphs,
which is also included with newer Windoze and/or Ophice versions.


See
> UniStat statistics
about how characters are distributed across Unicode.
For example, the only scripts using uppercase/lowercase are Greek and
those derived from it (i.e. Latin, Cyrillic, Armenian and Georgian).

---
$Id: Unicode.txt,v 1.2 2003/05/08 14:04:39 kripke Exp $