current/doc/longidx.txt


With the emerging use of unicode in ISIS databases,
the 30 byte maximum on keys might become a somewhat
narrow limit for languages using 2 or 3 UTF-8 bytes per
character while still having lots of characters per word.
Where UTF-16/UCS-2 is used, all languages would be
equally affected by an effective maximum key of
15 characters.

Therefore, there is a clear need for a new index
structure supporting longer keys for some environments.
This would in itself not raise any compatibility issues,
since an index is always redundant and can be reconstructed
by any plattform in a suitable format for this plattform
(of course according to the plattform's builtin limits).

While it would be straightforward to just create some n03/l03
pair of files or change 01/02 parameters, and some tools
might even care for the control-file contents and work without
modification using such indexes, some other tools wouldn't.


In this situation, one may want to spend some time thinking
about how an index should be organised. For programmers having
coped with ISAM and databases in theory and practice,
the reasons for some design decisions of traditional
ISIS index structures are less than obvious.


*       blank padding
a typical index-block lookup routine does not benefit in speed
from blank-padding. while the fixed-array organisation allows
for very easy Pascal implementation (where you can't cast
some part of a block to a struct), C code doesn't care.
instead, the additional space accounts for
additional IO and therefore slows down index access.
considering that many DBs even allow for compressed index files,
where every key is stored as a number of leading bytes it
shares with it's predecessor followed by the trailing difference,
one should expect some performance improvement from at least
eliminating all those blanks.

*       block alignment
the node and leaves files are made up of blocks not aligned
at any page boundaries. Every B*tree in textbooks as well as
in real-world databases uses page-aligned blocks for best IO
performance. potential reader/writer concurrency problems
(readers reading partially updated blocks) are also reduced
for most systems when IO occurs at filesystem page boundaries.

*       multiple files
The separation of one index structures nodes and leaves is
due to the fact that they do have different block sizes.
The obvious advantage of having two index file structures is to
reduce the space wasted by blank padding.
Using page-aligned blocks of fix size with unpadded keys
would allow the complete index to go into one single file.
Moreover, the postings could be stored together with the
leaves, further reducing IO.

---
        $Id: longidx.txt,v 1.2 2002/11/13 18:52:55 kripke Exp $
1	dpavlin	237
2			With the emerging use of unicode in ISIS databases,
3			the 30 byte maximum on keys might become a somewhat
4			narrow limit for languages using 2 or 3 UTF-8 bytes per
5			character while still having lots of characters per word.
6			Where UTF-16/UCS-2 is used, all languages would be
7			equally affected by an effective maximum key of
8			15 characters.
9
10			Therefore, there is a clear need for a new index
11			structure supporting longer keys for some environments.
12			This would in itself not raise any compatibility issues,
13			since an index is always redundant and can be reconstructed
14			by any plattform in a suitable format for this plattform
15			(of course according to the plattform's builtin limits).
16
17			While it would be straightforward to just create some n03/l03
18			pair of files or change 01/02 parameters, and some tools
19			might even care for the control-file contents and work without
20			modification using such indexes, some other tools wouldn't.
21
22
23			In this situation, one may want to spend some time thinking
24			about how an index should be organised. For programmers having
25			coped with ISAM and databases in theory and practice,
26			the reasons for some design decisions of traditional
27			ISIS index structures are less than obvious.
28
29
30			* blank padding
31			a typical index-block lookup routine does not benefit in speed
32			from blank-padding. while the fixed-array organisation allows
33			for very easy Pascal implementation (where you can't cast
34			some part of a block to a struct), C code doesn't care.
35			instead, the additional space accounts for
36			additional IO and therefore slows down index access.
37			considering that many DBs even allow for compressed index files,
38			where every key is stored as a number of leading bytes it
39			shares with it's predecessor followed by the trailing difference,
40			one should expect some performance improvement from at least
41			eliminating all those blanks.
42
43			* block alignment
44			the node and leaves files are made up of blocks not aligned
45			at any page boundaries. Every B*tree in textbooks as well as
46			in real-world databases uses page-aligned blocks for best IO
47			performance. potential reader/writer concurrency problems
48			(readers reading partially updated blocks) are also reduced
49			for most systems when IO occurs at filesystem page boundaries.
50
51			* multiple files
52			The separation of one index structures nodes and leaves is
53			due to the fact that they do have different block sizes.
54			The obvious advantage of having two index file structures is to
55			reduce the space wasted by blank padding.
56			Using page-aligned blocks of fix size with unpadded keys
57			would allow the complete index to go into one single file.
58			Moreover, the postings could be stored together with the
59			leaves, further reducing IO.
60
61			---
62			$Id: longidx.txt,v 1.2 2002/11/13 18:52:55 kripke Exp $