current/doc/Unicode.txt


some notes on the use of charsets with ISIS


*       what are charsets?

Since computers can store nothing but numbers, but we want them to
store characters, there has to be a table telling which character is
stored as which number, or, vice versa, which number is to be displayed
and printed as which character. Such tables are called charsets.
Since the smallest unit of number storage is a byte, which can hold
256 different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
one-byte-charsets.
For many scripts, like the various versions of Latin, Greek, Cyrillic,
Hebrew and Arabic, 256 characters are more than enough.
For others, namely Chinese, Japanese and Korean (
>       http://czyborra.com/charsets/cjk.html   CJK
) scripts with several thousand characters, it's not enough.
The modern
>       http://czyborra.com/charsets/vietnamese.html    vietnamese
script is based on Latin letters but needs a vast number
of accented letters, so 256 isn't enough. Those scripts don't get by
with one byte per character, so they need multi-byte-charsets, where
two or more bytes are needed to encode one character.
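
As a rough illustration of the difference between one-byte and
multi-byte charsets, the following Python sketch counts how many bytes
a character needs in a few charsets. Python, the sample characters and
the helper name byte_length are assumptions made for this example only;
none of them is part of ISIS.

```python
# Count how many bytes one character needs in different charsets.
# The helper name and the sample characters are illustrative only.

def byte_length(char, charset):
    """Return the number of bytes `char` needs in `charset`,
    or None if the charset has no slot for that character."""
    try:
        return len(char.encode(charset))
    except UnicodeEncodeError:
        return None

samples = {
    "A": "latin capital A",
    "ä": "german a-umlaut",
    "δ": "greek delta",
    "中": "chinese character",
}

for char, name in samples.items():
    print(name,
          "| iso-8859-1:", byte_length(char, "iso-8859-1"),
          "| iso-8859-7:", byte_length(char, "iso-8859-7"),
          "| big5:", byte_length(char, "big5"),
          "| utf-8:", byte_length(char, "utf-8"))
```

A result of None means the charset cannot represent the character at
all, which is exactly the situation a one-byte-charset runs into with
CJK text.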


*       what is UNICODE?

>       http://czyborra.com/unicode/standard.html       UNICODE
is a big multi-byte-charset designed to include all
>       http://czyborra.com/unicode/characters.html     characters 
needed in the world (over 40,000 by now), even for some
ancient languages. The problems with having several charsets are a) you
have to know which charset is used in a given text, b) computer
systems need to be aware of all possible charsets and c) it's not
possible to have a text or database contain characters which are
encoded in different charsets. Having all text in unicode solves those
problems. Check out
>       http://www.unicode.org/iuc/iuc10/x-utf8.html    this sample page
 - with a 21st century browser
like Mozilla 5 (Netscape 6) you will see most or all of the letters.


*       ASCII-compatible charsets and encodings

Many charsets use the numbers 0 to 127 in the same way: to represent
the basic set of latin characters defined by
>       http://czyborra.com/charsets/iso646.html        ASCII
. In an ASCII-compatible charset, whenever
there's a byte with a number in that range, this byte has the meaning
of the corresponding ASCII character. For example, the number 43
is always a plus sign +, which is important if a query expression is
scanned for such characters.
All 
>       http://czyborra.com/charsets/iso8859.html       ISO-8859-x 
charsets are ASCII-compatible. Older
>       http://czyborra.com/charsets/cyrillic.html      Cyrillic
charsets are NOT compatible with ASCII. Some of the eastern
multi-byte-charsets are, some are not.
Some of the multi-byte-charsets have different encodings, that is,
there is only one table mapping numbers to letters, but distinct ways
to use multiple bytes to express such a number, some of which use the
numbers in the ASCII-range only for ASCII characters, others don't.
UNICODE has two widely used encodings,
>       http://czyborra.com/utf/#UTF-8  UTF-8
and UTF-16 (an extension of the older UCS-2). UTF-8 is ASCII-compatible,
UTF-16 is not.
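
Whether an encoding is ASCII-compatible can be checked directly. The
Python sketch below looks for the byte 43 (the plus sign) and the byte 0
in encoded text that contains neither a plus sign nor a NUL character.
The helper name has_stray_byte and the sample character are made up for
this illustration.

```python
PLUS = 43  # ASCII code of '+'

def has_stray_byte(text, encoding, byte):
    """True if the encoded text contains `byte` although the text
    itself does not contain the corresponding ASCII character."""
    return byte in text.encode(encoding) and chr(byte) not in text

# LATIN SMALL LETTER I WITH MACRON, code point 0x012B
letter = "\u012b"

print(letter.encode("utf-8"))      # two bytes, both above 127
print(letter.encode("utf-16-be"))  # bytes 0x01 0x2B, the second one is 43

print(has_stray_byte(letter, "utf-8", PLUS))      # False
print(has_stray_byte(letter, "utf-16-be", PLUS))  # True
print(has_stray_byte("ISIS", "utf-16-be", 0))     # True
```

A scanner that treats every byte 43 as a plus sign would therefore be
safe on UTF-8 input but would misread this letter in UTF-16.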


*       so what about ISIS

-       the ISIS database format itself is capable of storing anything and
        thus can store text in any charset/encoding.
        Tools like BIREME's mx may store and retrieve (by MFN) text in
        nearly any encoding (but depending on how the programming is done,
        UTF-16 may not work because it may use bytes with value 0, which
        C programs often treat as the end of a string).
-       the ISIS query and formatting language depends on special
        ASCII-characters having special meaning and therefore will require
        an ASCII-compatible encoding. All the ISO-8859-x charsets will do
        as will UTF-8 encoded unicode (although some care must be taken
        that the multiple bytes representing one character are not cut
        off in the middle). At least in theory, mx and wwwisis are able to
        search for records in any ASCII-compatible encoding including
        UTF-8 unicode (given careful web programming).
-       winisis doesn't know about the possibility of one character having
        multiple bytes. It will work with any ASCII-compatible
        one-byte-charset, as long as it doesn't have to know which
        charset that is. That is, whatever charset your computer prefers,
        you will see all characters displayed according to that
        charset, and a character entered as the German ä could
        show up as Greek delta :). There is no support for
        multi-byte-charsets, and especially not for unicode.
-       Like any Java software, 
>       http://web.tiscali.it/javaisis/ JavaISIS
        is - in theory - able to
        handle unicode characters and even to do the transformation
        between unicode and most of the other charsets. Some limitations
        may result from the underlying wwwisis. In practice, version 3.5
        claims to give "Multi-language encoding support", but
        unfortunately it has been in beta since March 2001 (sources made
        available in Feb 2002).
-       openisis supports any charset and, with its Java binding,
        especially unicode and all the conversions. openisis alone can do
        it on the web, and in combination with JavaISIS (once new sources
        are available) also with a winisis-like interface.
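
Two of the points above can be sketched in a few lines of Python: the
same byte showing up as different letters depending on the charset used
for display, and cutting UTF-8 text without splitting a character. The
function name truncate_utf8 is hypothetical and not part of any ISIS
tool; the sketch assumes valid UTF-8 input and a non-negative limit.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 encoded `data` to at most `limit` bytes without
    splitting a multi-byte character (assumes valid UTF-8)."""
    if len(data) <= limit:
        return data
    end = limit
    # Continuation bytes have the bit pattern 10xxxxxx. If the first
    # byte to be dropped is one of them, step back to the start of
    # that character so it is dropped as a whole.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

# One stored byte, two different letters depending on the charset
# that is used to display it.
stored = bytes([0xE4])
print(stored.decode("iso-8859-1"))  # shown as a-umlaut
print(stored.decode("iso-8859-7"))  # shown as greek delta

cheese = "Käse".encode("utf-8")     # 5 bytes, the umlaut takes two
print(truncate_utf8(cheese, 2))     # the half umlaut is dropped
print(truncate_utf8(cheese, 3).decode("utf-8"))
```

A plain cut after two bytes would leave half of the ä behind, and the
result would no longer be valid UTF-8.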


*       some other resources on unicode

To see all those characters, you need fonts to tell your display or
printer what they look like. Here's a
>       http://www.hclrss.demon.co.uk/unicode/fonts.html        very fine page
on how to
acquire and install those fonts (and some more advice). James Kass has
a 
>       http://home.att.net/~jameskass/scriptlinks.htm  long list
of high-quality links related to Unicode. If for some reason you have to
waste your time with M$ products, you may want to check out 
>       http://www.microsoft.com/typography/fonts/      this page 
. In particular, there's the one-size(23 MB)-fits-all fat font 
>       http://office.microsoft.com/downloads/2000/aruniupd.aspx        Arial Unicode MS
(TM, (c), ... expect the worst) containing nearly all unicode glyphs,
which is also included with newer Windoze and/or Ophice versions.


See
>       UniStat statistics
about how characters are distributed across Unicode.
For example, the only scripts using uppercase/lowercase are those
derived from Greek (i.e. Latin, Cyrillic, Armenian and Georgian).

---
        $Id: Unicode.txt,v 1.2 2003/05/08 14:04:39 kripke Exp $