<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
       
  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
  <title>ISIS charsets and unicode</title>
</head>
  <body>
 
<h2> some notes on the use of charsets with ISIS  </h2>
 
<h3>what are charsets?</h3>
 Since computers can store nothing but numbers, but we want them to store 
characters, there has to be a table telling which character is stored as which 
number, or, vice versa, which number is to be displayed and printed as which character.
 Such tables are called <b>charsets</b>.<br>
 Since the smallest unit of number storage is a byte, which can hold 256
different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called <b>
one-byte-charsets</b>.<br>
 For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew 
and Arabic, 256 characters are more than enough.<br>
 For others, namely the Chinese, Japanese and Korean (<a href="http://czyborra.com/charsets/cjk.html">
 CJK</a>
 ) scripts with several thousand characters, it's not enough. The modern
<a href="http://czyborra.com/charsets/vietnamese.html"> Vietnamese </a>
 script is based on Latin letters but needs a large number of accented letters, 
so 256 isn't enough either. Those scripts don't get by with one byte per character,
 so they need <b>multi-byte-charsets</b>, where two or more bytes are needed 
to encode one character.<br>
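 As a small illustration of the table idea, here is a Python sketch (not part
of ISIS; the variable names are made up for this example, and it assumes the
codecs that ship with a standard Python 3). It reads the same stored byte
through two different one-byte tables and then checks how many bytes a CJK
character needs.<br>
<pre>
```python
# One byte, two tables: the stored number 228 (hex E4) maps to a different
# character depending on which one-byte charset is used to read it.
stored = bytes([0xE4])
as_latin1 = stored.decode("iso-8859-1")  # looked up in the Latin-1 table
as_greek = stored.decode("iso-8859-7")   # looked up in the Greek table

# A CJK character has no slot in a 256-entry table ...
han = "\u4e2d"  # a common Chinese character, written as an escape
try:
    han.encode("iso-8859-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False

# ... but a multi-byte charset such as Big5 has room for it.
big5_bytes = han.encode("big5")

print(repr(as_latin1), repr(as_greek), fits_in_one_byte, len(big5_bytes))
```
</pre>
 The first two values should come out as the German &auml; and the Greek delta,
although both were read from the very same byte.<br>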
 
<h3>what is UNICODE?</h3>
 <a href="http://czyborra.com/unicode/standard.html">UNICODE</a>
  is a big multi-byte-charset designed to include all <a href="http://czyborra.com/unicode/characters.html">
 characters</a>
  needed in the world (over 40,000 by now), even for some ancient languages. 
The problems with having several charsets are that a) you have to know which charset 
is used in a given text, b) computer systems need to be aware of all possible 
charsets and c) it's not possible to have a text or database contain characters 
which are encoded in different charsets. Having all text in Unicode solves 
those problems. Check out <a href="http://www.unicode.org/iuc/iuc10/x-utf8.html">
this sample page</a>
 - with a 21st century browser like Mozilla 5 (Netscape 6) you will see most
or all of the letters.<br>
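 To make point c) concrete, the following Python sketch (an illustration only,
with made-up sample data and variable names) takes two texts stored in two
different one-byte charsets and brings them together in one Unicode text,
which is then stored as UTF-8.<br>
<pre>
```python
# Two legacy texts, each stored in its own one-byte charset (sample data).
german_bytes = bytes([0x4B, 0xE4, 0x73, 0x65])       # a German word, ISO-8859-1
greek_bytes = bytes([0xE4, 0xE5, 0xEB, 0xF4, 0xE1])  # a Greek word, ISO-8859-7

# Note that byte E4 occurs in both, with a different meaning in each.
# Decoding with the right table turns every byte into a Unicode character,
# after which both texts can live in the same string.
combined = german_bytes.decode("iso-8859-1") + " " + greek_bytes.decode("iso-8859-7")

# Every character now has one unambiguous number (its code point).
code_points = [hex(ord(ch)) for ch in combined]

# Stored as UTF-8, the text survives a round trip unchanged.
utf8_bytes = combined.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(code_points)
print(len(combined), "characters in", len(utf8_bytes), "bytes")
```
</pre>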
 
<h3>ASCII-compatible charsets and encodings</h3>
 Many charsets use the numbers 0 to 127 in the same way: to represent the 
basic set of Latin characters defined by <a href="http://czyborra.com/charsets/iso646.html">
 ASCII</a>. Whenever there's a byte with a number in that range, this byte has the 
meaning of the corresponding ASCII character. For example, the number 43 is always
a plus sign +, which is important if a query expression is scanned for
such characters.<br>
 All <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-x</a>
  charsets are ASCII-compatible. Older <a href="http://czyborra.com/charsets/cyrillic.html">
 Cyrillic</a>
  charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets 
are, some are not.<br>
Some of the multi-byte-charsets have different <b>encodings</b>, that is, 
there is only one table mapping numbers to letters, but there are distinct ways to use
multiple bytes to express such a number. Some of these encodings use the numbers in
the ASCII range only for ASCII characters, others don't. UNICODE has two widely
used encodings, <a href="http://czyborra.com/utf/#UTF-8">UTF-8</a>
  and UTF-16 (UCS-2). <b>UTF-8 is ASCII-compatible</b>, UTF-16 is not.<br>
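 The difference can be checked with a few lines of Python. This is only a
sketch: <code>contains_plus_byte</code> is a helper invented for this example,
and UTF-16 is used in its big-endian form without a byte order mark.<br>
<pre>
```python
PLUS = 43  # the ASCII number of the plus sign

def contains_plus_byte(text: str, encoding: str) -> bool:
    """Hypothetical helper: does the encoded text contain a byte with value 43?"""
    return PLUS in text.encode(encoding)

yeru = "\u042b"  # a Cyrillic letter; there is no plus sign anywhere in it

# UTF-8 encodes it as the bytes D0 AB, both outside the ASCII range.
utf8_hit = contains_plus_byte(yeru, "utf-8")

# UTF-16 encodes it as the bytes 04 2B, and 2B is the number 43.
utf16_hit = contains_plus_byte(yeru, "utf-16-be")

# Plain ASCII text picks up bytes with value 0 in UTF-16 (here: 00 61).
zero_in_utf16 = 0 in "a".encode("utf-16-be")

print(utf8_hit, utf16_hit, zero_in_utf16)
```
</pre>
 A scanner that looks for byte 43 would therefore find a plus sign in the
UTF-16 text that is not there, while in UTF-8 a byte 43 can only ever be a
real plus sign.<br>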
 
<h3>so what about ISIS?</h3>
 
<ul>
   <li>The ISIS database format itself is capable of storing anything and 
thus can store text in <b>any</b> charset/encoding.<br>
 Tools like BIREME's mx may store and retrieve (by MFN) text in nearly any 
encoding (but depending on how the programming is done, UTF-16 may not work 
because it may use bytes with value 0).<br>
   </li>
   <li>The ISIS query and formatting language depends on certain ASCII characters 
 having special meaning and therefore requires an <b>ASCII-compatible
encoding</b>. All the ISO-8859-x charsets will do, as will <b>UTF-8 encoded
Unicode</b> (although care must be taken that the multiple bytes representing
one character are not cut off in the middle). At least in theory, <b>mx</b> and
    <b>wwwisis</b> are able to search for records in any&nbsp;ASCII-compatible
encoding including UTF-8 Unicode (given careful web programming).</li>
   <li><b>winisis</b> doesn't know about the possibility of one character
having multiple bytes. It will work with any <b>ASCII-compatible</b> <b>one-byte-charset</b>,
as long as it doesn't need to know which charset that is. That is, your computer
displays all characters according to whatever charset it has installed as its
preferred one, so a character entered as the German
&auml; could show up as a Greek delta :). There is no support for multi-byte-charsets,
and in particular <b>not for Unicode</b>.</li>
   <li>Like any Java software, <b><a href="http://web.tiscali.it/javaisis/">
JavaISIS</a>
    </b> is - in theory - able to handle Unicode characters and even to do
the transformation between <b>Unicode and most of the other</b> charsets.
 Some limitations may result from the underlying wwwisis. In practice, version
3.5 claims to give "Multi-language encoding support", but unfortunately it has
been in beta since March 2001 (sources made available in Feb 2002).</li>
   <li><b>openisis</b> supports <b>any charset</b> and, with its Java binding,
    <b>especially Unicode</b> and all the conversions. openisis alone can
do it on the web, and in combination with JavaISIS (once new sources are
available) also with a winisis-like interface.<br>
   </li>
 
</ul>
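 The warning above about cutting a UTF-8 character off in the middle can be
illustrated with a short Python sketch. It is not taken from any of the ISIS
tools; <code>truncate_utf8</code> and the sample field are made up for this
example, and the input is assumed to be valid UTF-8.<br>
<pre>
```python
# In UTF-8, every byte of a multi-byte character except the first one lies
# in the range 80..BF (hex). These are the "continuation bytes".
CONTINUATION = range(0x80, 0xC0)

def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Hypothetical helper: cut data to at most limit bytes, never inside a character."""
    if limit >= len(data):
        return data
    cut = max(limit, 0)
    # If the first dropped byte is a continuation byte, the cut would split
    # a character, so move the cut back to the start of that character.
    while cut > 0 and data[cut] in CONTINUATION:
        cut -= 1
    return data[:cut]

field = "K\xe4se".encode("utf-8")  # bytes 4B C3 A4 73 65
naive = field[:2]                  # ends with C3, half of a character
safe = truncate_utf8(field, 2)     # backs up to the character boundary

print(naive, safe)
```
</pre>
 The naive cut leaves a lone lead byte that no longer decodes as UTF-8, while
the careful cut gives up one more byte and stays valid.<br>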
 <br>
 
<h2> some other resources on unicode </h2>

To see all those characters, you need fonts to tell your display
or printer what they look like.
Here's a
<a href="http://www.hclrss.demon.co.uk/unicode/fonts.html"> very fine page </a>
on how to acquire and install those fonts (and some more advice).
James Kass has a
<a href="http://home.att.net/~jameskass/scriptlinks.htm">long list</a>
of high quality links related to Unicode.


If you for some reason have to waste your time with M$ products,
you may want to check out
<a href="http://www.microsoft.com/typography/fonts/"> this page </a>.
In particular, there's the one-size(23 MB)-fits-all fat font
<a href="http://office.microsoft.com/downloads/2000/aruniupd.aspx">
Arial Unicode MS </a> (TM, (c), ... expect the worst)
containing nearly all Unicode glyphs, which is also included
with newer Windoze and/or Ophice versions.

</body>
</html>