Annotation of /openisis/current/doc/charsets.html

Revision 237
Mon Mar 8 17:43:12 2004 UTC (20 years, 1 month ago) by dpavlin
File MIME type: text/html
File size: 6572 byte(s)
initial import of openisis 0.9.0 vendor drop

1 dpavlin 237 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2     <html>
3     <head>
4    
5     <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
6     <title>ISIS charsets and unicode</title>
7     </head>
8     <body>
9    
10     <h2> some notes on the use of charsets with ISIS </h2>
11    
12     <h3>what are charsets?</h3>
13     Since computers can store nothing but numbers, but we want them to store
14     characters, there has to be a table telling which character is stored as which
15     number, or, vice versa, which number is to be displayed and printed as which character.
16     Such tables are called <b>charsets</b>.<br>
17     Since the smallest unit of number storage is a byte, which can hold 256
18     different numbers from 0 to 255, many charsets are based on one
19     byte and thus can hold up to 256 characters. Such charsets are called
20     <b>one-byte-charsets</b>.<br>
21     For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew
22     and Arabic, 256 characters are more than enough.<br>
23     For others, namely the Chinese, Japanese and Korean
24     (<a href="http://czyborra.com/charsets/cjk.html">CJK</a>)
25     scripts with several thousand characters, it's not enough. The modern
26     <a href="http://czyborra.com/charsets/vietnamese.html">Vietnamese</a>
27     script is based on Latin letters but needs a large number of accented letters,
28     so 256 isn't enough either. Those scripts don't get by with one byte per character,
29     so they need <b>multi-byte-charsets</b>, where two or more bytes are needed
30     to encode one character.<br>
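As a small illustration of the points above, here is a sketch in Python. It is not part of ISIS or openisis; the codec names are simply those of Python's standard library, and the sample characters are chosen for illustration only. It looks up one and the same byte in two one-byte-charsets and shows that a Chinese character needs more than one byte.<br>

```python
# One byte, two charsets: the number 228 (0xE4) means a different
# character depending on which table is used to look it up.
raw = bytes([228])

as_latin1 = raw.decode("iso-8859-1")   # Latin small letter a with diaeresis
as_greek = raw.decode("iso-8859-7")    # Greek small letter delta

print(repr(as_latin1), repr(as_greek))

# A one-byte-charset has at most 256 entries, so a Chinese character
# cannot be encoded with it at all ...
han = "\u4e2d"
try:
    han.encode("iso-8859-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False

# ... while a multi-byte-charset such as Big5 spends two bytes on it.
big5_bytes = han.encode("big5")
print(fits_in_one_byte, len(big5_bytes))
```

The text itself never changes here, only the table used to interpret the stored numbers, which is why you have to know the charset of a given text.<br>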
31    
32     <h3>what is UNICODE?</h3>
33     <a href="http://czyborra.com/unicode/standard.html">UNICODE</a>
34     is a big multi-byte-charset designed to include all <a href="http://czyborra.com/unicode/characters.html">
35     characters</a>
36     needed in the world (over 40,000 by now), even for some ancient languages.
37     The problems with having several charsets are a) you have to know which charset
38     is used in a given text, b) computer systems need to be aware of all possible
39     charsets and c) it's not possible to have a text or database contain characters
40     which are encoded in different charsets. Having all text in unicode solves
41     those problems. Check out <a href="http://www.unicode.org/iuc/iuc10/x-utf8.html">
42     this sample page</a>
43     - with a 21st century browser like Mozilla 5 (Netscape 6) you will see most
44     or all of the letters.<br>
45    
46     <h3>ASCII-compatible charsets and encodings</h3>
47     Many charsets use the numbers 0 to 127 in the same way: to represent the
48     basic set of Latin characters defined by
49     <a href="http://czyborra.com/charsets/iso646.html">ASCII</a>.
50     Whenever there's a byte with a number in that range, this byte has the
51     meaning of the corresponding ASCII character. For example, the number 43 is always
52     a plus sign +, which is important if a query expression is scanned for
53     such characters.<br>
54     All <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-x</a>
55     charsets are ASCII-compatible. Older <a href="http://czyborra.com/charsets/cyrillic.html">
56     Cyrillic</a>
57     charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets
58     are, some are not.<br>
59     Some of the multi-byte-charsets have different <b>encodings</b>, that is,
60     there is only one table mapping numbers to letters, but distinct ways to use
61     multiple bytes to express such a number, some of which use the numbers in
62     the ASCII-range only for ASCII characters, others don't. UNICODE has two widely
63     used encodings, <a href="http://czyborra.com/utf/#UTF-8">UTF-8</a>
64     and UTF-16 (an extension of the older UCS-2). <b>UTF-8 is ASCII-compatible</b>, UTF-16 is not.<br>
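The difference between the two unicode encodings can be checked with a few lines of Python. This is only a sketch using Python's standard codecs; the sample text and the variable names are made up for the illustration.<br>

```python
text = "a+\u00e4"   # 'a', a plus sign, and a with diaeresis

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # little-endian, without byte order mark

# In UTF-8 the byte 43 occurs exactly where the plus sign is ...
plus_positions_utf8 = [i for i, b in enumerate(utf8) if b == 43]

# ... and every byte of a multi-byte character is 128 or above,
# so it can never be mistaken for an ASCII character.
all_high = all(b >= 128 for b in "\u00e4".encode("utf-8"))

# UTF-16 spends two bytes on every character here, so even the plain
# ASCII letters are interleaved with zero bytes.
has_zero_bytes = 0 in utf16

# Worse, a non-ASCII character can contain the byte 43 in UTF-16:
# U+012B (i with macron) is stored as the two bytes 0x2B 0x01.
spurious_plus_utf16 = 43 in "\u012b".encode("utf-16-le")
spurious_plus_utf8 = 43 in "\u012b".encode("utf-8")

print(plus_positions_utf8, all_high, has_zero_bytes)
print(spurious_plus_utf16, spurious_plus_utf8)
```

A scanner that searches the raw bytes for a plus sign therefore gives correct results on UTF-8 text, but may report a plus sign that isn't there on UTF-16 text.<br>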
65    
66     <h3>so what about ISIS?</h3>
67    
68     <ul>
69     <li>the ISIS database format itself is capable of storing anything and
70     thus can store text in <b>any</b> charset/encoding.<br>
71     tools like BIREME's mx may store and retrieve (by MFN) text in nearly any
72     encoding (but depending on how the programming is done, UTF-16 may not work
73     because it may use bytes with value 0).<br>
74     </li>
75     <li>the ISIS query and formatting language depends on special ASCII-characters
76     having special meaning and therefore will require an <b>ASCII-compatible
77     encoding</b>. All the ISO-8859-x charsets will do as will <b>UTF-8 encoded
78     unicode</b> (although some care must be taken that multiple bytes representing
79     one character are not cut off in the middle). At least in theory, <b>mx</b> and
80     <b>wwwisis</b> are able to search for records in any&nbsp;ASCII-compatible
81     encoding including UTF-8 unicode (given careful web programming).</li>
82     <li><b>winisis</b> doesn't know about the possibility of one character
83     having multiple bytes. It will work with any <b>ASCII-compatible</b> <b>one-byte-charset</b>,
84     as long as it doesn't need to know which characters the bytes stand for. That is, if your computer
85     has some preferred charset installed, you will see all characters displayed
86     according to that charset, and a character possibly entered as the German
87     &auml; could show up as a Greek delta :). No support for multi-byte-charsets,
88     especially <b>not unicode</b>.</li>
89     <li>Like any Java software, <b><a href="http://web.tiscali.it/javaisis/">
90     JavaISIS</a>
91     </b> is - in theory - able to handle unicode characters and even to do
92     the transformation between <b>unicode and most of the other</b> charsets.
93     Some limitations may result from the underlying wwwisis. In practice, version
94     3.5 claims to give "Multi-language encoding support", but unfortunately it has
95     been in beta since March 2001 (sources made available in Feb 2002).</li>
96     <li><b>openisis</b> supports <b>any charset</b> and, with its Java binding,
97     <b>especially unicode</b> and all the conversions. openisis alone can
98     do it on the web, and in combination with JavaISIS (once new sources are
99     available) also with a winisis-like interface.<br>
100     </li>
101    
102     </ul>
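The list above warns that UTF-8 text must not be cut off in the middle of a character, for example when a field is shortened to a fixed number of bytes. One possible way to take that care is sketched below in Python. The function name <b>truncate_utf8</b> is hypothetical and not part of mx, wwwisis or openisis, and the sketch assumes that the input is valid UTF-8.<br>

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 data to at most max_bytes without splitting a character.

    Hypothetical helper for illustration; assumes data is valid UTF-8.
    """
    if len(data) <= max_bytes:
        return data
    end = max(max_bytes, 0)
    # Continuation bytes have the bit pattern 10xxxxxx. If the first byte
    # after the cut is one of them, the cut falls inside a character, so
    # move the cut back to the start byte of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


word = "H\u00e4user".encode("utf-8")   # the a with diaeresis takes two bytes

# A plain slice after two bytes splits the character and cannot be decoded.
try:
    word[:2].decode("utf-8")
    naive_cut_ok = True
except UnicodeDecodeError:
    naive_cut_ok = False

safe = truncate_utf8(word, 2)
print(naive_cut_ok, safe)
```

The helper returns fewer bytes than allowed rather than keep half a character, so the result can always be decoded again.<br>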
103     <br>
104    
105     <h2> some other resources on unicode </h2>
106    
107     To see all those characters, you need fonts to tell your display
108     or printer what they look like.
109     Here's a
110     <a href="http://www.hclrss.demon.co.uk/unicode/fonts.html"> very fine page </a>
111     on how to acquire and install those fonts (and some more advice).
112     James Kass has a
113     <a href="http://home.att.net/~jameskass/scriptlinks.htm">long list</a>
114     of high quality links related to Unicode.
115    
116    
117     If you for some reason have to waste your time with M$ products,
118     you may want to check out
119     <a href="http://www.microsoft.com/typography/fonts/"> this page </a>.
120     Especially there's the one-size(23 MB)-fits-all fat font
121     <a href="http://office.microsoft.com/downloads/2000/aruniupd.aspx">
122     Arial Unicode MS </a> (TM, (c), ... expect the worst)
123     containing nearly all unicode glyphs, which is also included
124     with newer Windoze and/or Ophice versions.
125    
126     </body>
127     </html>
