<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<title>ISIS charsets and unicode</title>
</head>
<body>

<h2> some notes on the use of charsets with ISIS </h2>

<h3>what are charsets?</h3>
Since computers can store nothing but numbers, but we want them to store
characters, there has to be a table telling which character is stored as which
number, or, vice versa, which number is to be displayed and printed as which character.
Such tables are called <b>charsets</b>.<br>
Since the smallest addressable unit of storage is a byte, which can hold 256
different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
<b>one-byte-charsets</b>.<br>
For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew
and Arabic, 256 characters are more than enough.<br>
For others, namely the Chinese, Japanese and Korean
(<a href="http://czyborra.com/charsets/cjk.html">CJK</a>) scripts with several
thousand characters, it's not enough. The modern
<a href="http://czyborra.com/charsets/vietnamese.html">Vietnamese</a>
script is based on Latin letters but needs a vast number of accented letters,
so 256 isn't enough either. Those scripts don't get by with one byte per character,
so they need <b>multi-byte-charsets</b>, where two or more bytes are needed
to encode one character.<br>
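The difference between one-byte and multi-byte charsets can be tried out in a few lines of Python. This is only a sketch: the helper name <code>byte_lengths</code> is made up for illustration, and Big5 is just one example of a multi-byte CJK charset, picked because Python ships a codec for it. The characters are written as escapes so that the example survives in this ISO-8859-1 page.<br>
<pre>
```python
# Hypothetical helper (not part of any ISIS tool): how many bytes does
# each character need in a given charset?
def byte_lengths(text, encoding):
    """Return a list of (character, number of bytes) pairs."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


# "B", a-umlaut, "r": every Latin letter fits into one byte of ISO-8859-1.
latin1_sizes = byte_lengths("B\u00e4r", "iso8859_1")

# Two Chinese characters: Big5 needs two bytes for each of them.
big5_sizes = byte_lengths("\u4e2d\u6587", "big5")

# A one-byte charset simply has no number for a Chinese character.
try:
    "\u4e2d".encode("iso8859_1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(latin1_sizes)
print(big5_sizes)
print(fits_in_latin1)
```
</pre>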

<h3>what is UNICODE</h3>
<a href="http://czyborra.com/unicode/standard.html">UNICODE</a>
is a big multi-byte-charset designed to include all
<a href="http://czyborra.com/unicode/characters.html">characters</a>
needed in the world (over 40,000 by now), even for some ancient languages.
The problems with having several charsets are a) you have to know which charset
is used in a given text, b) computer systems need to be aware of all possible
charsets and c) it's not possible to have a text or database contain characters
which are encoded in different charsets. Having all text in unicode solves
those problems. Check out
<a href="http://www.unicode.org/iuc/iuc10/x-utf8.html">this sample page</a>
- with a 21st century browser like Mozilla 5 (Netscape 6) you will see most
or all of the letters.<br>
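Problems a) and c) are easy to reproduce. The following Python sketch (variable names are made up for illustration) reads one and the same byte under two different one-byte charsets, and then stores both resulting letters side by side in unicode, here using the UTF-8 encoding.<br>
<pre>
```python
# One stored byte, value 228 (hex E4).
raw = bytes([0xE4])

# Problem a): without knowing the charset, the byte is ambiguous.
as_latin1 = raw.decode("iso8859_1")  # Latin-1 table: a-umlaut (U+00E4)
as_greek = raw.decode("iso8859_7")   # Greek table: small delta (U+03B4)

# Problem c): no one-byte charset holds both letters, so mixing them
# in one text fails ...
try:
    (as_latin1 + as_greek).encode("iso8859_1")
    mixed_fits_latin1 = True
except UnicodeEncodeError:
    mixed_fits_latin1 = False

# ... while unicode gives each letter its own number, so both can live
# in the same text.
mixed_utf8 = (as_latin1 + as_greek).encode("utf-8")

print(hex(ord(as_latin1)), hex(ord(as_greek)))
print(mixed_fits_latin1)
print(mixed_utf8)
```
</pre>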

<h3>ASCII-compatible charsets and encodings</h3>
Many charsets use the numbers 0 to 127 in the same way: to represent the
basic set of Latin characters defined by
<a href="http://czyborra.com/charsets/iso646.html">ASCII</a>.
Whenever there's a byte with a number in that range, this byte has the
meaning of the corresponding ASCII character. For example, the number 43 always
is a plus sign +, which is important if a query expression is scanned for
such characters.<br>
All <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-x</a>
charsets are ASCII-compatible. Some older
<a href="http://czyborra.com/charsets/cyrillic.html">Cyrillic</a>
charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets
are, some are not.<br>
Some of the multi-byte-charsets have different <b>encodings</b>, that is,
there is only one table mapping numbers to letters, but distinct ways to use
multiple bytes to express such a number. Some of these use the numbers in
the ASCII range only for ASCII characters, others don't. UNICODE has two widely
used encodings, <a href="http://czyborra.com/utf/#UTF-8">UTF-8</a>
and UTF-16 (an extension of the older UCS-2). <b>UTF-8 is ASCII-compatible</b>, UTF-16 is not.<br>
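What ASCII compatibility means for a program that scans bytes for the plus sign can be shown with a small Python sketch. The function name <code>plus_offsets</code> and the sample text are assumptions made up for this illustration; no ISIS tool is involved.<br>
<pre>
```python
PLUS = 43  # the ASCII number of the plus sign


def plus_offsets(data):
    """Return the positions of all bytes with the value 43."""
    return [i for i, byte in enumerate(data) if byte == PLUS]


# Sample text: "a", a real plus sign, and U+012B (i with macron).
text = "a+\u012b"

utf8 = text.encode("utf-8")       # U+012B becomes the bytes C4 AB
utf16 = text.encode("utf-16-le")  # U+012B becomes the bytes 2B 01

# In UTF-8 a byte 43 is always a real plus sign.
utf8_hits = plus_offsets(utf8)

# In UTF-16 the scan also hits the first byte of U+012B, which is no
# plus sign at all, and the ASCII letters drag along bytes of value 0.
utf16_hits = plus_offsets(utf16)
utf16_has_zero_bytes = 0 in utf16

print(utf8_hits, utf16_hits, utf16_has_zero_bytes)
```
</pre>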

<h3>so what about ISIS</h3>

<ul>
<li>the ISIS database format itself is capable of storing anything and
thus can store text in <b>any</b> charset/encoding.<br>
tools like BIREME's mx can store and retrieve (by MFN) text in nearly any
encoding (but depending on how the programming is done, UTF-16 may not work
because it may use bytes with value 0, which many programs treat as the end of a string).<br>
</li>
<li>the ISIS query and formatting language depends on certain ASCII characters
having special meaning and therefore requires an <b>ASCII-compatible
encoding</b>. All the ISO-8859-x charsets will do, as will <b>UTF-8 encoded
unicode</b> (although some care must be taken that the multiple bytes representing
one character are not cut off in the middle). At least in theory, <b>mx</b> and
<b>wwwisis</b> are able to search for records in any ASCII-compatible
encoding including UTF-8 unicode (given careful web programming).</li>
<li><b>winisis</b> doesn't know about the possibility of one character
having multiple bytes. It will work with any <b>ASCII-compatible</b>
<b>one-byte-charset</b>, as long as it doesn't have to know which charset it is dealing with. That is,
whatever charset your computer prefers, you will see all characters displayed
according to that charset, and a character entered as the German
ä could show up as Greek delta :). No support for multi-byte-charsets,
especially <b>not unicode</b>.</li>
<li>Like any Java software,
<b><a href="http://web.tiscali.it/javaisis/">JavaISIS</a></b>
is - in theory - able to handle unicode characters and even to do
the transformation between <b>unicode and most of the other</b> charsets.
Some limitations may result from the underlying wwwisis. In practice, version
3.5 claims to give "Multi-language encoding support", but unfortunately it
has been in beta since March 2001 (sources made available in Feb 2002).</li>
<li><b>openisis</b> supports <b>any charset</b> and, with its Java binding,
<b>especially unicode</b> and all the conversions. openisis alone can
do it on the web, and in combination with JavaISIS (once new sources are
available) also with a winisis-like interface.<br>
</li>

</ul>
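The care needed when UTF-8 text is cut to a fixed number of bytes, as mentioned in the list above, can be sketched in Python. The function <code>truncate_utf8</code> is a hypothetical helper written for this page, not something mx or wwwisis provide. It assumes valid UTF-8 input and a non-negative limit, and relies on the fact that all follow-up bytes of a UTF-8 character have the bit pattern 10xxxxxx.<br>
<pre>
```python
def truncate_utf8(data, max_bytes):
    """Cut UTF-8 bytes to at most max_bytes without splitting a character."""
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # data[end] is the first byte that gets dropped. If it is a follow-up
    # byte (10xxxxxx), the cut falls inside a character, so step back
    # until the cut sits in front of a start byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# "B", a-umlaut, "r" takes four bytes in UTF-8: 42 C3 A4 72.
field = "B\u00e4r".encode("utf-8")

cut_inside = truncate_utf8(field, 2)  # a naive cut would keep a lone C3
cut_between = truncate_utf8(field, 3)
uncut = truncate_utf8(field, 10)

print(cut_inside, cut_between, uncut)
```
</pre>
A naive <code>field[:2]</code> would leave half a character behind, which a UTF-8 aware reader can no longer decode.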
<br>

<h2> some other resources on unicode </h2>

To see all those characters, you need fonts to tell your display
or printer what they look like.
Here's a
<a href="http://www.hclrss.demon.co.uk/unicode/fonts.html"> very fine page </a>
on how to acquire and install those fonts (and some more advice).
James Kass has a
<a href="http://home.att.net/~jameskass/scriptlinks.htm">long list</a>
of high quality links related to Unicode.


If you for some reason have to waste your time with M$ products,
you may want to check out
<a href="http://www.microsoft.com/typography/fonts/"> this page </a>.
In particular, there's the one-size(23 MB)-fits-all fat font
<a href="http://office.microsoft.com/downloads/2000/aruniupd.aspx">
Arial Unicode MS </a> (TM, (c), ... expect the worst)
containing nearly all unicode glyphs, which is also included
with newer Windoze and/or Ophice versions.

</body>
</html>