OpenIsis deploys a B-Tree structure similar to that of CDS/ISIS.

There are several differences to the original CDS/ISIS file layout:
- The complete index is kept in one file, including
  control information (.CNT), inner nodes (.N0x) and leaf nodes (.L0x)
  of any key length, and "inverted file pointers" (.IFP, see below).
- The maximum key length is configurable up to 255 bytes (CDS/ISIS: 30).
- The maximum mfn is configurable up to 6 bytes (CDS/ISIS: 3).
- The keys are not blank-padded.
- The blocks have no fixed spread factor (CDS/ISIS: 10),
  but a fixed size of mK (where m is 1, 2, 4 or 8),
  allowing for about 200 keys of 30 bytes.
- A custom string comparison function may be used,
  allowing for proper Unicode collation.
- To support safe, highly concurrent access, it is a B-Link-Tree
  as used in PostgreSQL (see Lehman/Yao 1981),
  i.e. each block has a link to its right sibling.

Advantages:
- The number of necessary I/O operations is dramatically reduced,
  typically to 1. Since the invention of B-Trees by Bayer, the
  ratio of available main memory to disk storage has been about 1:100.
  Since the ratio of the total index size to the total size of the
  inner nodes is approximately equal to the spread factor, it is crucial
  that this factor is at least 100, allowing all inner nodes to be in RAM.
- The price paid for this is that, since the inner nodes locate the
  wanted leaf less precisely (up to a block of 200 entries instead of 10),
  more unwanted data is read from disk.
  However, all I/O is in mK blocks on mK boundaries,
  matching the "page size" of most modern operating systems.
  This is the basic idea of B-Trees in the first place.
- The maximum database size is raised considerably by the larger mfn.
  This allows for an index spanning several databases, where the
  higher mfn bytes indicate the relevant master file.
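The RAM argument in the first advantage can be checked with a toy calculation. The sketch below is illustrative only (the function name and figures are not from OpenIsis): with one parent entry per `spread` children, the total number of inner blocks above a given number of leaves shrinks roughly by the spread factor.

```c
#include <assert.h>

/* count all inner blocks above `leaves` leaf blocks, assuming each
 * inner block spans `spread` children; with spread s the inner nodes
 * total roughly 1/s of the whole index */
static long inner_blocks(long leaves, int spread) {
    long total = 0;
    while (leaves > 1) {
        leaves = (leaves + spread - 1) / spread; /* one level up */
        total += leaves;
    }
    return total;
}
```

For 40000 leaf blocks, a spread of 200 needs only 201 inner blocks (easily cached in RAM), while a spread of 10 needs 4445.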

Limits:
The theoretical maximum number of terms is also somewhat raised, from 20G
(2G leaf blocks at 10 terms each) to about 400G (4G blocks at up to 200 entries).
Without support for large files (> 2G), however,
the limit of the one-file design may actually be even lower,
since the 2G file limit caps the total size of terms plus IFPs.


* Basic operation:

The B-Link-Tree relates keys to "inverted file pointers" (IFPs).
A key is a byte sequence of a length between 0 and 255.
The IFPs are 8 to 11 bytes long, with 3 to 6 bytes for an mfn (rowid)
and 5 bytes describing where the key occurs within the given record:
2 bytes field tag, 1 byte occ (repetition of the field), 2 bytes word position.
These numbers are in big-endian byte order for easy sorting.
The mfn/value length is fixed for a given B-Tree.
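A minimal sketch of the IFP encoding, assuming the shortest form (3-byte mfn, 8 bytes total); the struct and function names are illustrative, not OpenIsis identifiers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* one occurrence of a key within a record */
typedef struct {
    uint32_t mfn;  /* record number (rowid), here only 3 bytes used */
    uint16_t tag;  /* field tag */
    uint8_t  occ;  /* repetition of the field */
    uint16_t pos;  /* word position within the occurrence */
} Ifp;

/* pack into the 8-byte on-disk form, most significant byte first */
static void ifp_pack(const Ifp *i, unsigned char out[8]) {
    out[0] = (i->mfn >> 16) & 0xff;
    out[1] = (i->mfn >>  8) & 0xff;
    out[2] =  i->mfn        & 0xff;
    out[3] = (i->tag >>  8) & 0xff;
    out[4] =  i->tag        & 0xff;
    out[5] =  i->occ;
    out[6] = (i->pos >>  8) & 0xff;
    out[7] =  i->pos        & 0xff;
}
```

Because every number is big-endian, a plain memcmp on the packed form sorts IFPs by (mfn, tag, occ, pos), which is the point of the encoding.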

File structure:
The index file consists of consecutive mK blocks,
with block number n at file offset n*mK.
Block 0 is the root, block 1 the leftmost leaf.

Block structure:
The inner structure of a block is similar to an ISIS record:
some header fields followed by a dictionary of entries,
which is an array of positions (offsets into the block) and lengths.
The dictionary is sorted according to the entries' keys.
The actual entry data is located at the end of the block,
growing towards the dictionary as entries are added.
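The layout might be sketched as follows; field names, the flexible-array dictionary, and the 4K size are illustrative assumptions, not the actual OpenIsis structures:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 4096  /* one of the configurable mK sizes */

/* one dictionary slot: where an entry sits and how long its key is */
typedef struct {
    uint16_t pos;   /* offset of the entry within the block */
    uint16_t len;   /* key length; a spare bit could flag an inner-node IFP */
} DictSlot;

/* block header followed by the dictionary, sorted by key;
 * entry data grows down from the block end towards the dictionary */
typedef struct {
    uint32_t right;   /* block number of the right sibling (B-Link) */
    uint16_t nslots;  /* number of dictionary entries */
    uint16_t free;    /* offset where the entry data currently begins */
    DictSlot dict[];
} BlockHeader;

/* free space is the gap between the dictionary end and the entry data */
static int block_free_space(const BlockHeader *b) {
    unsigned dict_end = sizeof(BlockHeader) + b->nslots * sizeof(DictSlot);
    return (int)(b->free - dict_end);
}
```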

Entry structure:
Each entry starts with its key, of the length stated in the dictionary.
In an inner node, the key is optionally followed by an IFP,
as indicated by a dictionary flag, then by a 4 byte block number.
In a leaf node, it is followed by a sorted array of IFPs.

Searching:
Since it is not uncommon for a key to relate to more distinct IFPs
than would fit within one block, we actually treat the pair of
key and IFP as the search key when traversing the tree.
When looking for all IFPs of a key, we start with an empty (nulled) IFP.
Upon modification (insert/delete), we use the given IFP as second key segment.
Where the IFPs for one key span several leaf nodes,
the separating inner node entry will have its key augmented
with the IFP, so we can quickly locate the proper leaf node.
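A comparison over such (key, IFP) pairs might look like the sketch below. It uses plain memcmp ordering for the key part (the text notes a custom collation may replace this), and relies on the IFP being fixed-length and big-endian; the function name and signature are assumptions:

```c
#include <assert.h>
#include <string.h>

/* compare two (key, IFP) pairs: the key bytes decide first, then the
 * key length, then the fixed-length big-endian IFP, so memcmp works
 * for both parts. An all-zero (nulled) IFP sorts before any real
 * occurrence, which is why a lookup for "all IFPs of a key" can start
 * with an empty IFP. */
static int pair_cmp(const unsigned char *k1, int l1, const unsigned char *ifp1,
                    const unsigned char *k2, int l2, const unsigned char *ifp2,
                    int ifplen) {
    int min = l1 < l2 ? l1 : l2;
    int c = memcmp(k1, k2, min);
    if (c) return c;
    if (l1 != l2) return l1 - l2;       /* shorter key sorts first */
    return memcmp(ifp1, ifp2, ifplen);  /* byte order = numeric order */
}
```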

Deleting:
The entry is searched and deleted; there is no restructuring.
(Background cleanup may be implemented some day.)

Inserting:
If a new entry fits within the target leaf, everything is fine.
Otherwise either the new entry or one or more other entries may have
to be shifted to the right sibling node.
If the right sibling is non-existent or too full
(or we didn't yet implement checking it),
a new right sibling is created.
The new separating key, which is lower than the one we stopped
at while locating our node, is then updated or inserted in the parent.

Multi-Process Concurrency:
There *must not* be more than one process having the index open for writing.
Readers, however, should work correctly even in the presence of a writer.

Multi-Thread Concurrency:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read or written from/to disk.
A thread wishing to access a block being read, or wishing to write
to a block while it is written to disk, must wait on the monitor.

Lehman/Yao Concurrency:
While this design makes sure that each block is always
in a consistent state, the interrelation between blocks may be
disturbed by node splitting or shifting:
a key may be found to be no longer in the expected block.
Thus, while the search key is greater than the greatest key
in the current block, one has to try the right sibling.


* Concurrency Details:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read or written from/to disk.

This design will give the best possible overall utilization on a
single-CPU system: letting a given thread run exclusively while it can
utilize the CPU. (I.e. unless it's waiting for I/O, swapping shouldn't
be an issue, if you're seriously running a db server.)
If you have a multi-CPU system for nothing but OpenIsis,
run secondary read-only servers, each bound to one CPU.
This will be far more efficient than sharing multiple CPUs
among one process's threads.

Actually, we deploy one mutex (LOCK/UNLOCK) and one condition
(WAIT/WAKE) bound to it. Optionally a set of conditions bound
to the same mutex may be used.
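With POSIX threads, such a monitor reduces to one mutex and one condition variable bound to it; the following is a minimal sketch under that assumption, not the actual OpenIsis code:

```c
#include <assert.h>
#include <pthread.h>

/* one mutex plus one condition bound to it form the monitor */
static pthread_mutex_t mon_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  mon_cond  = PTHREAD_COND_INITIALIZER;

static void LOCK(void)   { pthread_mutex_lock(&mon_mutex); }
static void UNLOCK(void) { pthread_mutex_unlock(&mon_mutex); }
/* WAIT atomically releases the mutex and re-acquires it when woken */
static void WAIT(void)   { pthread_cond_wait(&mon_cond, &mon_mutex); }
/* WAKE broadcasts: with one condition shared by many blocks, every
 * woken thread must recheck whether it was the one meant */
static void WAKE(void)   { pthread_cond_broadcast(&mon_cond); }
```

A set of conditions bound to the same mutex, as the text suggests, would reduce the spurious wake-ups of the shared-condition variant.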

All access to B-Tree blocks is via a common cache, so that there
is never more than one copy of a block in memory.
During I/O, the cache block is flagged as being read or written, respectively.
The read flag is held only temporarily during the I/O;
it marks the block contents as being asynchronously changed
(by the read call) and thus completely useless.
Any thread wishing to access the block has to wait on it.
After the read call returns, the thread initiating the read will
re-acquire the monitor, clear the flag and wake any waiting threads.

The write flag, on the other hand, acts as a WRITE LOCK on the block.
Threads wishing read-only access to the block ignore it completely,
since the block content is valid and will NOT change asynchronously:
it is changed by the writing thread only while it holds the monitor,
thus no other thread will ever observe a change.
A thread wishing to modify the block, however, has to wait,
since a modification during an ongoing write could corrupt data.
Moreover, the lock need not be released immediately after the write
call returns, but can be held until a second block is also written.
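The asymmetry between the two flags can be condensed into two predicates, both evaluated under the monitor; flag and function names are illustrative:

```c
#include <assert.h>

enum { READING = 1, WRITING = 2 };  /* per-block state flags */

/* a reader waits only while the block's bytes are invalid;
 * the WRITE LOCK is ignored since the content stays valid */
static int reader_must_wait(int flags) {
    return (flags & READING) != 0;
}

/* a modifier waits for both: invalid bytes, or an ongoing write
 * that a concurrent modification could corrupt */
static int writer_must_wait(int flags) {
    return (flags & (READING | WRITING)) != 0;
}
```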


* The block reading sequence is as follows:
$
r0 find the block in the cache.
   if it is not found, set it reading, release the monitor, read.
   else, if it is being read, increase the waiter count, wait (releases the monitor).
   else use it (skip the next state).
[r1] SUSPENDED
   wait for either our own read to return or to be woken by another reader.
   if we return from read, re-acquire the monitor.
   (if we are woken, the monitor is re-acquired implicitly.)
r2 if we return from read, clear the reading flag, wake waiting threads.
   if we are woken, decrease the waiter count.
   (Since we might have fewer distinct conditions than blocks,
   we have to check the reading flag to see whether we were meant
   and possibly wait again.)
   we have the block wanted.
   on searching, if it's not the leaf we wanted, start over.
$
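The r0/r1/r2 steps can be unfolded into C roughly as below. This is a single-file sketch: the fixed-size cache, the stand-in for the real block read, and all names are assumptions for illustration (eviction and error handling are elided):

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

enum { READING = 1 };
enum { NSLOTS = 8, BLK = 4096 };

typedef struct {
    int  blkno;      /* block number, -1 = slot empty */
    int  flags;      /* READING while the bytes are invalid */
    int  waiters;    /* threads waiting for this block */
    char data[BLK];
} CacheBlock;

static pthread_mutex_t mon = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;
static CacheBlock cache[NSLOTS];

static void cache_init(void) {
    for (int i = 0; i < NSLOTS; i++) cache[i].blkno = -1;
}

/* stand-in for the real pread of block blkno */
static void io_read(int blkno, char *buf) { memset(buf, blkno & 0xff, BLK); }

static CacheBlock *get_block(int blkno) {
    pthread_mutex_lock(&mon);                 /* enter the monitor */
    CacheBlock *b = NULL;
    for (int i = 0; i < NSLOTS; i++)          /* r0: find in the cache */
        if (cache[i].blkno == blkno) { b = &cache[i]; break; }
    if (!b) {                                 /* miss: take a free slot */
        for (int i = 0; i < NSLOTS; i++)
            if (cache[i].blkno < 0) { b = &cache[i]; break; }
        b->blkno = blkno;
        b->flags |= READING;                  /* bytes now invalid */
        pthread_mutex_unlock(&mon);           /* release during I/O */
        io_read(blkno, b->data);              /* [r1] SUSPENDED in read */
        pthread_mutex_lock(&mon);             /* r2: re-acquire monitor */
        b->flags &= ~READING;
        pthread_cond_broadcast(&cnd);         /* wake waiting threads */
    } else {
        while (b->flags & READING) {          /* being read by someone else */
            b->waiters++;
            pthread_cond_wait(&cnd, &mon);    /* [r1] SUSPENDED on condition */
            b->waiters--;                     /* recheck: shared condition */
        }
    }
    pthread_mutex_unlock(&mon);
    return b;                                 /* block contents are valid */
}
```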

NOTE:
- the thread starting the read on a block is always the first to get it,
  since waiters can proceed only after this thread has cleared the flag.
- for a block that already is in the cache, the reader proceeds without
  releasing the monitor. This especially holds for blocks being written.
  These properties can be used for a safe background cleanup.
- a reader does not need to hold a block in the cache while it is
  suspended in I/O: all processing happens immediately after reading.
  For a secondary reader on a database being written,
  at least leaf blocks should be invalidated asap.
  (Otherwise it will obviously miss some updates.)

* The overall search algorithm:
$
s1 starting from the root, locate the first entry with key
   greater than our key, read the block pointed to by the previous entry.
   if no such entry is found, and the block has a right sibling,
   check it for a possible split or shift (Lehman/Yao).
s2 when reaching a matching entry in a leaf, read subsequent right siblings
   according to the search range criteria and buffer them.
$
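The Lehman/Yao "move right" step at the heart of s1/s2 can be sketched in isolation. Blocks are abstracted to a high key plus a right link; the struct and function names are illustrative assumptions:

```c
#include <assert.h>
#include <string.h>

/* a block reduced to what move-right needs: its greatest key and
 * its right sibling link */
typedef struct Node {
    const char  *highkey;  /* greatest key in this block, NULL = rightmost */
    struct Node *right;    /* B-Link right sibling */
} Node;

/* after a concurrent split or shift, a key can only have moved right,
 * never left: while the search key exceeds the block's greatest key,
 * follow the right link */
static Node *move_right(Node *n, const char *key) {
    while (n->highkey && strcmp(key, n->highkey) > 0 && n->right)
        n = n->right;
    return n;
}
```

This is why missing or too-high keys in inner nodes cost only extra hops, never wrong results.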

NOTE:
- due to the right sibling checking, missing or too-high keys in inner
  nodes for some children do not lead to inaccurate search results;
  they only incur a performance penalty
  (dropping the tree altogether means linear search).
203 |
|
204 |
|
205 |
* The general block writing sequence: |
206 |
$ |
207 |
w0 reading "for update" as r0, but with additional condition: |
208 |
if it has the WRITE LOCK flag set, wait, else use it (skip next state). |
209 |
[w1] SUSPENDED |
210 |
wait for read or to be waked by other writer. |
211 |
on return from read, re-acquire the monitor. |
212 |
w2 if we are awaked and there's a WRITE LOCK, wait again. |
213 |
(the original reader or some other waiting thread might have set it). |
214 |
after returning from read, set the WRITE LOCK flag, wake waiting readers. |
215 |
if we need to write another block, start over. |
216 |
process block(s). |
217 |
release the monitor, start write call on first block to write. |
218 |
[w3] SUSPENDED |
219 |
wait for the write call to return. |
220 |
repeat while there are more blocks to be written. |
221 |
re-acquire the monitor. |
222 |
w4 clear the WRITE LOCK flag on some blocks, wake waiting threads. |
223 |
if there are more blocks to be processed, start over. |
224 |
$ |


* The insert algorithm:
$
i0 locate the leaf L where the entry is to be inserted.
   if it has enough room, process it and we're done.
i1 if it has a right sibling S, read it for update.
   if S hasn't enough room to take some entries from L,
   release the write lock and use a new block S as L's right sibling instead.
   shift entries to S, insert the new entry into L (or S).
i2 write first S, then L to disk.
   release the write lock on L.
i3 read the parent P of S for update,
   i.e. the parent of L or some block to the right.
   if we used an old S, delete its old entry.
   insert the new lower bound on the keys of S.
   if the parent hasn't enough room, start over at i1 (with P for L).
$
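The i0/i1 split decision can be sketched in memory, with entry payloads, disk writes and parent maintenance elided; capacities, names and the "shift half" policy are illustrative assumptions:

```c
#include <assert.h>
#include <stdlib.h>

#define CAP 4               /* tiny block capacity for illustration */

typedef struct Blk {
    int n;                  /* number of entries used */
    struct Blk *right;      /* B-Link right sibling */
} Blk;

/* make room in leaf L, returning the block that receives the
 * shifted entries (L itself if nothing had to move) */
static Blk *make_room(Blk *L) {
    if (L->n < CAP) return L;          /* i0: the new entry fits */
    Blk *S = L->right;                 /* i1: try the right sibling */
    if (!S || S->n + L->n / 2 > CAP) { /* absent or too full: */
        S = calloc(1, sizeof *S);      /* chain in a new sibling */
        S->right = L->right;
        L->right = S;
    }
    S->n += L->n / 2;                  /* shift the upper half of L */
    L->n -= L->n / 2;
    /* i2: write first S, then L; i3: give the parent S's new
     * lower bound (elided here) */
    return S;
}
```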

When unfolded into synchronized/unsynchronized steps,
a simplified version of this might look like:
$
read your way down to L until it is read for update...
u0 not enough space in L, read S for update
[u1] wait for S (L locked)
v2 modify both blocks, initiate read of P for update
[v3] wait for P, write S then L (L, S locked)
v4 release locks on L and S, update P or go to u1 (P locked)
$


NOTE:
- this is deadlock-free, since locking is ordered left to right,
  bottom to top.
- by always keeping one lock, we avoid conflicts with other writers
  for the same entries, since those have to follow the same lock path,
  and thus always work in a serializable manner.
- if we are the original reader of P,
  other threads wanting to read P have to wait for three I/Os.
  The detailed variant below avoids this.
- reading the sibling, and therefore the [u1] suspend, is optional.
  Depending on the key distribution, it might not hurt much
  or even be more efficient to always create a new block.
- the only inconsistency a reader thread might notice is during v3
  (or a later suspend in order to lock P's sibling):
  there are new versions of L and S, but P is in the old state.
  The key of S (or L's former sibling) is too high and thus an entry
  now in S may be searched for in L. In this situation,
  the reader has to follow L's right link to find the key.

There is a slightly more complex variant of insert,
which allows for somewhat enhanced concurrency
but with higher synchronization effort:
$
u2 modify both blocks (L, S locked)
[u3] write S then L
u4 release lock on L (S locked), read P for update
[u5] wait for P
u6 release lock on S, update P or go to u1 (P locked)
$
287 |
|
288 |
|
289 |
* Multi-Process NOTEs: |
290 |
- we have to assume, that block writes |
291 |
* are atomic (seen by others completely or not at all) and |
292 |
* are ordered (if a later write is seen, an earlier is also). |
293 |
This does NOT hold for the file state on disk, |
294 |
but should hold for the operating system's file cache. |
295 |
- a reader in another process might also read the new version of |
296 |
S but the old version of L and thus has to deal with duplicate entries |
297 |
when searching ranges. |
298 |
- if a reader is caching, and thus write order is not in effect, |
299 |
it might miss entries completely. |
300 |
This does still give correct results, if |
301 |
* no leaves are cached |
302 |
* for all cached blocks, the parent is older |
303 |
This is ok for one-shot readers, or if caching is strictly from top |
304 |
with forced re-read after cache miss on search path. |

Deleting an entry is trivial, since it affects one block only (if we're lazy).

In a single-process multi-threaded environment, however,
we can take advantage of certain properties to provide
relatively simple and safe cleanup, synchronously or in the background.