Contents of /openisis/current/doc/charsets.html

Revision 237 - (show annotations)
Mon Mar 8 17:43:12 2004 UTC (16 years, 7 months ago) by dpavlin
File MIME type: text/html
File size: 6572 byte(s)
initial import of openisis 0.9.0 vendor drop

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<title>ISIS charsets and unicode</title>
</head>
<body>
<h2> some notes on the use of charsets with ISIS </h2>
<h3>what are charsets?</h3>
Since computers can store nothing but numbers, but we want them to store
characters, there has to be a table telling which character is stored as which
number, or, vice versa, which number is to be displayed and printed as which character.
Such tables are called <b>charsets</b>.<br>
Since the smallest unit of number storage is a byte, which can hold 256
different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called <b>
one-byte-charsets</b>.<br>
For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew
and Arabic, 256 characters are more than enough.<br>
For others, namely the Chinese, Japanese and Korean (<a href="http://czyborra.com/charsets/cjk.html">
CJK</a>
) scripts with several thousand characters, it's not enough. The modern
<a href="http://czyborra.com/charsets/vietnamese.html"> Vietnamese </a>
script is based on Latin letters but needs a vast number of accented letters,
so 256 isn't enough. Those scripts don't get by with one byte per character,
so they need <b>multi-byte-charsets</b>, where two or more bytes are needed
to encode one character.<br>
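As a small illustration of the table idea, here is a minimal Python sketch (not part of ISIS or openisis). It decodes one and the same byte with two different one-byte-charsets and then encodes a CJK character with a multi-byte-charset. The helper name decode_byte is made up for this example; the charset names are the codec names of Python's standard library.

```python
# One byte, two charsets: the number stays the same, the character changes.
def decode_byte(value: int, charset: str) -> str:
    """Look up a single byte value in the given one-byte-charset."""
    return bytes([value]).decode(charset)

latin = decode_byte(0xE4, "iso-8859-1")  # U+00E4, Latin small a with diaeresis
greek = decode_byte(0xE4, "iso-8859-7")  # U+03B4, Greek small delta

# A CJK character does not fit into one byte:
# U+4E2D takes two bytes in the GB2312 charset.
cjk_bytes = "\u4e2d".encode("gb2312")

print(hex(ord(latin)), hex(ord(greek)), len(cjk_bytes))
```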
<h3>what is UNICODE?</h3>
<a href="http://czyborra.com/unicode/standard.html">UNICODE</a>
is a big multi-byte-charset designed to include all <a href="http://czyborra.com/unicode/characters.html">
characters</a>
needed in the world (over 40,000 by now), even for some ancient languages.
The problems with having several charsets are that a) you have to know which charset
is used in a given text, b) computer systems need to be aware of all possible
charsets and c) a single text or database cannot contain characters
that are encoded in different charsets. Having all text in unicode solves
those problems. Check out <a href="http://www.unicode.org/iuc/iuc10/x-utf8.html">
this sample page</a>
- with a 21st century browser like Mozilla 5 (Netscape 6) you will see most
or all of the letters.<br>
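Problem c) can be sketched in a few lines of Python (an illustration only, the helper name fits is hypothetical): a text mixing a German and a Greek letter fits into neither of the two matching ISO-8859 charsets, but it does fit into a unicode encoding.

```python
# A text mixing a German a-umlaut (U+00E4) and a Greek delta (U+03B4).
mixed = "\u00e4\u03b4"

def fits(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Neither one-byte-charset holds both letters; UTF-8 encoded unicode does.
for charset in ("iso-8859-1", "iso-8859-7", "utf-8"):
    print(charset, fits(mixed, charset))
```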
<h3>ASCII-compatible charsets and encodings</h3>
Many charsets use the numbers 0 to 127 in the same way: to represent the
basic set of Latin characters defined by <a href="http://czyborra.com/charsets/iso646.html">
ASCII</a>.
Whenever there's a byte with a number in that range, this byte has the
meaning of the corresponding ASCII character. For example, the number 43
is always a plus sign +, which is important if a query expression is scanned for
such characters.<br>
All <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-x</a>
charsets are ASCII-compatible. Some older <a href="http://czyborra.com/charsets/cyrillic.html">
Cyrillic</a>
charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets
are, some are not.<br>
Some of the multi-byte-charsets have different <b>encodings</b>, that is,
there is only one table mapping numbers to letters, but there are distinct ways to use
multiple bytes to express such a number. Some of these use the numbers in
the ASCII range only for ASCII characters, others don't. UNICODE has two widely
used encodings, <a href="http://czyborra.com/utf/#UTF-8">UTF-8</a>
and UTF-16 (which extends the older UCS-2). <b>UTF-8 is ASCII-compatible</b>, UTF-16 is not.<br>
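Why this matters for a scanner looking for the plus sign can be shown with a short Python sketch. It is an illustration under the assumption of a naive byte-wise scan, not the way any ISIS tool actually parses queries; the variable names are made up.

```python
# U+042B (a Cyrillic capital letter) happens to contain the byte 0x2B in UTF-16.
text = "\u042b"
plus = b"+"  # byte value 43

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8: every byte of a non-ASCII character is 128 or above,
# so a byte-wise scan for "+" cannot produce a false hit.
print(plus in utf8)

# UTF-16: the same scan finds a "+" that is not in the text.
print(plus in utf16)

# UTF-16 also puts zero bytes into plain ASCII text, which breaks
# programs that treat byte 0 as the end of a string.
ascii_as_utf16 = "a+b".encode("utf-16-le")
print(ascii_as_utf16)
```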
<h3>so what about ISIS?</h3>
<ul>
<li>the ISIS database format itself is capable of storing anything and
thus can store text in <b>any</b> charset/encoding.<br>
tools like BIREME's mx may store and retrieve (by MFN) text in nearly any
encoding (but depending on how the programming is done, UTF-16 may not work
because it may use bytes with value 0).<br>
</li>
<li>the ISIS query and formatting language depends on certain ASCII characters
having special meaning and therefore requires an <b>ASCII-compatible
encoding</b>. All the ISO-8859-x charsets will do, as will <b>UTF-8 encoded
unicode</b> (although some care must be taken that the multiple bytes representing
one character are not cut off in the middle). At least in theory, <b>mx</b> and
<b>wwwisis</b> are able to search for records in any&nbsp;ASCII-compatible
encoding including UTF-8 unicode (given careful web programming).</li>
<li><b>winisis</b> doesn't know about the possibility of one character
having multiple bytes. It will work with any <b>ASCII-compatible</b> <b>one-byte-charset</b>,
as long as it doesn't have to know what the bytes mean. That is, if your computer
has some preferred charset installed, you will see all characters displayed
according to that charset, and a character possibly entered as the German
&auml; could show up as a Greek delta :). There is no support for multi-byte-charsets,
especially <b>not unicode</b>.</li>
<li>Like any Java software, <b><a href="http://web.tiscali.it/javaisis/">
JavaISIS</a>
</b> is - in theory - able to handle unicode characters and even to
convert between <b>unicode and most of the other</b> charsets.
Some limitations may result from the underlying wwwisis. In practice, version
3.5 claims to give "Multi-language encoding support", but unfortunately it has
been in beta since March 2001 (sources made available in Feb 2002).</li>
<li><b>openisis</b> supports <b>any charset</b> and, with its Java binding,
<b>especially unicode</b> and all the conversions. openisis alone can
do it on the web, and in combination with JavaISIS (once new sources are
available) also with a winisis-like interface.<br>
</li>
</ul>
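The care needed when UTF-8 text is cut to a fixed number of bytes, as mentioned in the list above, can be sketched like this in Python. This is a minimal illustration, not code from mx, wwwisis or openisis; the function name truncate_utf8 is hypothetical and the input is assumed to be valid UTF-8.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 data to at most limit bytes without splitting a character."""
    if len(data) <= limit:
        return data
    cut = limit
    # Bytes of the form 10xxxxxx continue a multi-byte character.
    # If the first dropped byte is one of them, the cut is in the middle
    # of a character, so move it back to where that character starts.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

# "a" followed by a-umlaut is three bytes in UTF-8: 0x61 0xC3 0xA4.
sample = "a\u00e4".encode("utf-8")
print(truncate_utf8(sample, 2))  # only the complete first character survives
```

A plain slice sample[:2] would keep the lone byte 0xC3, which is no longer valid UTF-8 on its own.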
<br>
<h2> some other resources on unicode </h2>
To see all those characters, you need fonts to tell your display
or printer what they look like.
Here's a
<a href="http://www.hclrss.demon.co.uk/unicode/fonts.html"> very fine page </a>
on how to acquire and install those fonts (and some more advice).
James Kass has a
<a href="http://home.att.net/~jameskass/scriptlinks.htm">long list</a>
of high quality links related to Unicode.
If you for some reason have to waste your time with M$ products,
you may want to check out
<a href="http://www.microsoft.com/typography/fonts/"> this page </a>.
In particular, there's the one-size(23 MB)-fits-all fat font
<a href="http://office.microsoft.com/downloads/2000/aruniupd.aspx">
Arial Unicode MS </a> (TM, (c), ... expect the worst)
containing nearly all unicode glyphs, which is also included
with newer Windoze and/or Ophice versions.
</body>
</html>
