OpenIsis deploys a B-Tree structure similar to that of CDS/ISIS.

There are several differences to the original CDS/ISIS file layout:
- The complete index is kept in one file, including
control information (.CNT), inner nodes (.N0x) and leaf nodes (.L0x)
of any key length, and "inverted file pointers" (.IFP, see below).
- The maximum key length is configurable up to 255 bytes (CDS/ISIS: 30).
- The maximum mfn is configurable up to 6 bytes (CDS/ISIS: 3).
- The keys are not blank padded.
- The blocks have no fixed spread factor (CDS/ISIS: 10),
but a fixed size of mK (where m is 1, 2, 4 or 8),
allowing for about 200 keys of 30 bytes.
- A custom string comparison function may be used,
allowing for proper unicode collation.
- To allow for safe, highly concurrent access, it is a B-Link-Tree as used
in postgres (see Lehman/Yao 1981), i.e. each block has a link to its
right sibling.

Advantages:
- The number of necessary I/O operations is dramatically reduced,
typically to 1. Since the invention of B-Trees by Bayer, the
ratio of available main memory to disk storage has stayed at about 1:100.
Since the ratio of the total size of the inner nodes to the total
index size is approximately the inverse of the spread factor, it is crucial
that the spread factor is at least 100, allowing all the inner nodes
to be held in RAM.
- The price paid for this is that, since the inner nodes locate the
wanted leaf less precisely (within a block of up to 200 entries instead of 10),
more unwanted data is read from disk.
However, all I/O is done in mK blocks on mK boundaries,
matching the "page size" of most modern operating systems.
This is the basic idea of B-Trees in the first place.
- The maximum database size is greatly raised with regard to the mfn.
This allows for an index spanning several databases, where the
higher mfn bytes indicate the relevant master file.

Limits:
The theoretical maximum number of terms is also somewhat raised, from 20G
(2G leaf blocks at 10 terms each) to about 400G (4G blocks of up to
200 entries each; at a typical fill of about half that, this yields
some 400G).
Without support for large files (> 2G), however,
the limit of the one-file design may actually even be lower,
since the 2G file size limits the total size of terms + IFPs.



* Basic operation:

The B-L-Tree relates keys to "inverted file pointers" (IFP).
The key is a byte sequence with a length between 0 and 255.
The IFPs are 8 to 11 bytes long, with 3 to 6 bytes for an mfn (rowid),
and 5 bytes describing where the key occurs within the given record:
2 bytes field tag, 1 byte occ (repetition of the field), 2 bytes word position.
These numbers are in big-endian byte order for easy sorting.
The mfn/value length is fixed for a given B-Tree.
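
A minimal C sketch of how such an IFP might be packed; the struct and
function names are illustrative assumptions, not the actual OpenIsis
identifiers, and the widest layout (6 byte mfn, 11 bytes total) is used:
$
#include <stdint.h>

/* hypothetical unpacked form of an inverted file pointer */
typedef struct {
    uint64_t mfn; /* row id; only the low 6 bytes are stored */
    uint16_t tag; /* field tag */
    uint8_t  occ; /* occurrence (repetition of the field) */
    uint16_t pos; /* word position within the occurrence */
} Ifp;

/* pack into big-endian bytes, so plain memcmp() yields sort order */
static void ifp_pack( unsigned char *buf, const Ifp *ifp )
{
    int i;
    for ( i = 0; i < 6; i++ ) /* 6 byte mfn, most significant first */
        buf[i] = (unsigned char)( ifp->mfn >> ( 8 * ( 5 - i ) ) );
    buf[6]  = (unsigned char)( ifp->tag >> 8 );
    buf[7]  = (unsigned char)( ifp->tag );
    buf[8]  = ifp->occ;
    buf[9]  = (unsigned char)( ifp->pos >> 8 );
    buf[10] = (unsigned char)( ifp->pos );
}
$
Since all fields are big-endian, two packed IFPs compare correctly with
memcmp(a, b, 11), which is what makes the IFP usable as a key segment.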

File structure:
The index file consists of consecutive mK blocks,
with block number n at file offset n*mK.
Block 0 is the root, Block 1 the leftmost leaf.

Block structure:
The inner structure of a block is similar to an ISIS record:
some header fields followed by a dictionary of entries,
which is an array of positions (offsets into the block) and lengths.
The dictionary is sorted according to the entries' keys.
The actual entry data is located at the end of the block,
growing towards the dictionary as entries are added.

Entry structure:
Each entry starts with the len bytes of its key, as stated in the dictionary.
For an inner node, the key is optionally followed by an IFP,
as indicated by a dictionary flag, and then by a 4 byte block number.
For a leaf node, the key is followed by a sorted array of IFPs.
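
The block and entry layout described above might be sketched in C as
follows; all field names and widths are assumptions for illustration,
not the actual on-disk format:
$
#include <stdint.h>

/* one dictionary slot, describing where an entry lives in the block */
typedef struct {
    uint16_t pos;   /* offset of the entry data within the block */
    uint16_t len;   /* length of the key */
    uint16_t flags; /* e.g. "inner entry carries an IFP" */
} DictSlot;

/* header of an mK block; the sorted dictionary follows the header,
 * while entry data grows from the end of the block towards it */
typedef struct {
    uint32_t right;  /* block number of the right sibling, 0 if none */
    uint16_t nslots; /* number of dictionary entries */
    uint16_t free;   /* offset where the entry data currently begins */
    DictSlot dict[]; /* sorted by key (and, if present, IFP) */
} BlockHead;
$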

Searching:
Since it is not uncommon for a key to relate to more distinct IFPs
than would fit within one block, we actually consider the pair of
key and IFP as the key when searching the tree.
When looking for all IFPs of a key, we start with an empty (nulled) IFP.
Upon modification (insert/delete), we use the given IFP as the second key segment.
Where the IFPs for one key span several leaf nodes,
the separating inner node entry will have its key augmented
with the IFP, so we may quickly locate the proper leaf node.
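
A sketch of this composite comparison in C, assuming a configurable
collation function for the key part (all names are illustrative):
$
#include <string.h>

/* compare a (key,ifp) search tuple against an entry's (key,ifp);
 * a nulled ifp sorts before every real occurrence of the same key,
 * so a search for "all IFPs of a key" starts at the first one */
static int entry_cmp(
    const unsigned char *akey, int alen, const unsigned char *aifp,
    const unsigned char *bkey, int blen, const unsigned char *bifp,
    int ifplen,
    int (*cmp)( const unsigned char*, int, const unsigned char*, int ) )
{
    int c = cmp( akey, alen, bkey, blen ); /* collation on the keys */
    if ( c )
        return c;
    return memcmp( aifp, bifp, ifplen ); /* big-endian IFP bytes */
}
$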

Deleting:
The entry is searched and deleted; there is no restructuring.
(Background cleanup may be implemented some day.)

Inserting:
If a new entry fits within the target leaf, everything is fine.
Otherwise either the new entry, or one or more other entries, may have
to be shifted to the right sibling node.
If the right sibling is non-existent or too full
(or we didn't yet implement checking it),
a new right sibling is created.
The new separating key, which is lower than the one we stopped
at while locating our node, is then updated or inserted in the parent.

Multi-Process Concurrency:
There *must not* be more than 1 process having the index open for writing.
Readers, however, should work correctly, even in the presence of a writer.

Multi-Thread Concurrency:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read from or written to disk.
A thread wishing to access a block that is being read, or to write
to a block while it is being written to disk, must wait on the monitor.

Lehman/Yao Concurrency:
While this design should make sure that each block is always
in a consistent state, the interrelation between blocks may be
disturbed by node splitting or shifting:
a key may be found to be no longer in the expected block.
Thus, while the search key is greater than the greatest key
in the current block, one has to try the right sibling.



* Concurrency Details:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read from or written to disk.

This design will give the best possible overall utilization on a
single-CPU system: a given thread runs exclusively as long as it can
utilize the CPU. (I.e. unless it is waiting for I/O; swapping shouldn't
be an issue if you're seriously running a db server.)
If you have a multi-CPU system for nothing but OpenIsis,
run secondary read-only servers, each bound to one CPU.
This will be far more efficient than sharing multiple CPUs
among one process's threads.

Actually, we deploy one mutex (LOCK/UNLOCK) and one condition
(WAIT/WAKE) bound to it. Optionally a set of conditions bound
to the same mutex may be used.
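
In POSIX terms, this monitor could be built as in the following minimal
sketch; the mapping of the LOCK/UNLOCK and WAIT/WAKE macros onto
pthreads is an assumption for illustration:
$
#include <pthread.h>

static pthread_mutex_t monitor = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond    = PTHREAD_COND_INITIALIZER;

#define LOCK()   pthread_mutex_lock( &monitor )
#define UNLOCK() pthread_mutex_unlock( &monitor )
/* WAIT atomically releases the monitor and re-acquires it when woken */
#define WAIT()   pthread_cond_wait( &cond, &monitor )
/* WAKE wakes all waiters; each re-checks its block's flags */
#define WAKE()   pthread_cond_broadcast( &cond )
$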

All access to btree blocks is via a common cache, so that there
is never more than one copy of a block in memory.
During I/O, the cache block is flagged as being read or written, respectively.
The read flag is held only temporarily during the I/O;
it marks the block contents as being asynchronously changed
(by the read call) and thus completely useless.
Any thread wishing to access the block has to wait on it.
After the read call returns, the thread that initiated the read will
re-acquire the monitor, clear the flag and wake any waiting threads.

The write flag, on the other hand, acts as a WRITE LOCK on the block.
Threads wishing read-only access to the block completely ignore it,
since the block content is valid and will NOT asynchronously change:
it is changed by the writing thread only while it has the monitor,
thus no other thread will ever observe a change.
A thread wishing to modify the block, however, has to wait,
since a modification during an ongoing write could corrupt data.
Moreover, the lock need not be released immediately after the write
call returns, but can be held until a second block is also written.


* The block reading sequence is as follows:
$
r0 find the block in the cache.
if it is not found, set it reading, release the monitor, read.
else, if it is being read, increase the waiter count, wait (releases the monitor).
else use it (skip the next state).
[r1] SUSPENDED
wait for either our own read to return or to be woken by another reader.
if we return from read, re-acquire the monitor.
(if we are woken, the monitor is re-acquired implicitly).
r2 if we return from read, clear the reading flag, wake waiting threads.
if we are woken, decrease the waiter count.
(Since we might have fewer distinct conditions than blocks,
we have to check the reading flag to see whether we were meant
and possibly wait again.)
we have the block wanted.
on searching, if it is not the leaf we wanted, start over.
$
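
A condensed C sketch of this sequence, using the hypothetical
LOCK/WAIT/WAKE macros from above and an invented cache interface
(the waiter count is omitted for brevity):
$
typedef struct CacheBlk CacheBlk; /* cached block: flags plus mK data */
enum { READING = 1, WRITELOCK = 2 };

/* hypothetical cache interface */
CacheBlk *cache_find( unsigned blkno );
CacheBlk *cache_add( unsigned blkno );
int  blk_flags( CacheBlk *b );
void blk_set_flags( CacheBlk *b, int set, int clear );
void read_from_disk( unsigned blkno, CacheBlk *b );

CacheBlk *blk_read( unsigned blkno )
{
    CacheBlk *b;
    LOCK();
    for (;;) {
        b = cache_find( blkno );            /* r0 */
        if ( !b ) {
            b = cache_add( blkno );
            blk_set_flags( b, READING, 0 );
            UNLOCK();
            read_from_disk( blkno, b );     /* [r1] SUSPENDED */
            LOCK();
            blk_set_flags( b, 0, READING ); /* r2 */
            WAKE();
            break;
        }
        if ( !( blk_flags( b ) & READING ) )
            break; /* block is usable; a WRITE LOCK is ignored here */
        WAIT();    /* releases the monitor; re-check the flag when woken */
    }
    UNLOCK();
    return b;
}
$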

NOTE:
- the thread starting the read on a block is always the first to get it,
since waiters can proceed only after this thread has cleared the flag.
- for a block that already is in the cache, the reader proceeds without
releasing the monitor. This especially holds for blocks being written.
These properties can be used for a safe background cleanup.
- a reader does not need to hold a block in the cache while it is
suspended in I/O: all processing is done immediately after reading.
For a secondary reader on a database being written,
at least leaf blocks should be invalidated asap
(otherwise it will obviously miss some updates).

* The overall search algorithm:
$
s1 starting from the root, locate the first entry with key
greater than our key; read the block pointed to by the previous entry.
if there is no such entry found, and the block has a right sibling,
check it for a possible split or shift (Lehman/Yao).
s2 when reaching a matching entry in a leaf, read subsequent right siblings
according to the search range criteria and buffer.
$
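
A sketch of the descent in C; blk_read is the hypothetical helper from
above, and the blk_* accessors are likewise invented for illustration:
$
int blk_is_leaf( CacheBlk *b );
unsigned blk_right( CacheBlk *b ); /* right sibling, 0 if none */
int blk_key_gt_max( CacheBlk *b, const unsigned char *key, int len );
unsigned blk_child( CacheBlk *b, const unsigned char *key, int len );

/* descend from the root to the leaf that should hold the key */
CacheBlk *find_leaf( const unsigned char *key, int len )
{
    CacheBlk *b = blk_read( 0 ); /* block 0 is the root */
    for (;;) {
        /* Lehman/Yao: if the key exceeds the greatest key in this
         * block, it may have been split or shifted; move right */
        while ( blk_key_gt_max( b, key, len ) && blk_right( b ) )
            b = blk_read( blk_right( b ) );
        if ( blk_is_leaf( b ) )
            return b; /* s2 continues from here */
        /* s1: follow the child pointer of the last entry with a key
         * not greater than ours */
        b = blk_read( blk_child( b, key, len ) );
    }
}
$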

NOTE:
- due to the right sibling checking, it does not lead to inaccurate
search results if inner nodes have missing or too high keys
for some children; it only has a performance penalty
(dropping the tree altogether means linear search).


* The general block writing sequence:
$
w0 reading "for update" as in r0, but with an additional condition:
if it has the WRITE LOCK flag set, wait; else use it (skip the next state).
[w1] SUSPENDED
wait for the read or to be woken by another writer.
on return from read, re-acquire the monitor.
w2 if we are woken and there is a WRITE LOCK, wait again.
(the original reader or some other waiting thread might have set it).
after returning from read, set the WRITE LOCK flag, wake waiting readers.
if we need to write another block, start over.
process block(s).
release the monitor, start the write call on the first block to write.
[w3] SUSPENDED
wait for the write call to return.
repeat while there are more blocks to be written.
re-acquire the monitor.
w4 clear the WRITE LOCK flag on some blocks, wake waiting threads.
if there are more blocks to be processed, start over.
$
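
The corresponding lock acquisition might look like this sketch; the
blk_read_locked helper (blk_read returning with the monitor still held)
is invented for illustration:
$
CacheBlk *blk_read_locked( unsigned blkno );

CacheBlk *blk_read_for_update( unsigned blkno )
{
    CacheBlk *b;
    LOCK();
    for (;;) {
        b = blk_read_locked( blkno ); /* w0/[w1]: may wait for a read */
        if ( !( blk_flags( b ) & WRITELOCK ) )
            break;
        WAIT(); /* w2: another writer holds the block; re-check */
    }
    /* readers may still use the block; only writers must wait */
    blk_set_flags( b, WRITELOCK, 0 );
    UNLOCK();
    return b;
}
$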


* The insert algorithm:
$
i0 locate the leaf L where the entry is to be inserted.
if it has enough room, process it and we're done.
i1 if it has a right sibling S, read it for update.
if S hasn't enough room to take some entries from L,
release the write lock and use a new block S as L's right sibling instead.
shift entries to S, insert the new entry into L (or S).
i2 write first S, then L to disk.
release the write lock on L.
i3 read the parent P of S for update,
i.e. the parent of L or some block to the right.
if we used an existing S, delete its old entry.
insert the new lower bound on the keys of S.
if the parent hasn't enough room, start over at i1 (with P for L).
$

When unfolded into synchronized/unsynchronized steps,
a simplified version of this might look like:
$
read your way down to L until it is read for update...
u0 not enough space in L, read S for update
[u1] wait for S (L locked)
v2 modify both blocks, initiate read of P for update
[v3] wait for P, write S then L (L,S locked)
v4 release locks on L and S, update P or go to u1 (P locked)
$
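
Pulled together, i0-i3 might look like the following C sketch, glossing
over the locking details shown above (all helpers are hypothetical):
$
CacheBlk *find_leaf_for_update( const unsigned char *key, int len );
CacheBlk *sibling_for_update( CacheBlk *L ); /* old or fresh sibling */
int  blk_fits( CacheBlk *b, int len );
int  blk_covers( CacheBlk *b, const unsigned char *key, int len );
void blk_put( CacheBlk *b, const unsigned char *key, int len,
              const unsigned char *ifp );
void blk_shift_right( CacheBlk *L, CacheBlk *S );
void blk_write( CacheBlk *b );
void blk_unlock( CacheBlk *b );
void update_parent( CacheBlk *S );

void btree_insert( const unsigned char *key, int len,
                   const unsigned char *ifp )
{
    CacheBlk *L, *S;
    L = find_leaf_for_update( key, len );   /* i0 */
    if ( blk_fits( L, len ) ) {
        blk_put( L, key, len, ifp );
        blk_write( L );
        blk_unlock( L );
        return;
    }
    S = sibling_for_update( L );  /* i1: right sibling, old or new */
    blk_shift_right( L, S );      /* move some entries over to S */
    blk_put( blk_covers( L, key, len ) ? L : S, key, len, ifp );
    blk_write( S );               /* i2: S strictly before L */
    blk_write( L );
    blk_unlock( L );
    update_parent( S );           /* i3: may split P, recursing upwards */
    blk_unlock( S );
}
$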



NOTE:
- this is deadlock-free, since locking is ordered left to right,
bottom to top.
- by always keeping one lock, we avoid conflicts with other writers
for the same entries, since those have to follow the same lock path,
and thus we always work in a serializable manner.
- if we are the original reader for P,
other threads wanting to read P have to wait for three I/Os.
The detailed variant below avoids this.
- reading the sibling, and therefore the [u1] suspend, is optional:
depending on the key distribution, it might not hurt much,
or might even be more efficient, to always create a new block.
- the only inconsistency a reader thread might notice is during v3
(or a later suspend in order to lock P's sibling):
there are new versions of L and S, but P is in the old state.
The key of S (or L's former sibling) is too high, and thus an entry
now in S may be searched for in L. In this situation,
the reader has to follow L's right link to find the key.


There is a slightly more complex variant of insert,
which allows for somewhat enhanced concurrency,
but with higher synchronization effort:
$
u2 modify both blocks (L,S locked)
[u3] write S then L
u4 release lock on L (S locked), read P for update
[u5] wait for P
u6 release lock on S, update P or go to u1 (P locked)
$


* Multi-Process NOTEs:
- we have to assume that block writes
* are atomic (seen by others completely or not at all) and
* are ordered (if a later write is seen, an earlier one is also).
This does NOT hold for the file state on disk,
but should hold for the operating system's file cache
(see the sketch after this list).
- a reader in another process might also read the new version of
S but the old version of L, and thus has to deal with duplicate entries
when searching ranges.
- if a reader is caching, and thus write order is not in effect,
it might miss entries completely.
This does still give correct results, if
* no leaves are cached
* for all cached blocks, the parent is older
This is ok for one-shot readers, or if caching is strictly from the top
with a forced re-read after a cache miss on the search path.
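
The ordering assumption is the reason i2 above writes S strictly before
L. A sketch of the two ordered writes, assuming plain pwrite() into the
operating system's file cache (no fsync; all names are illustrative):
$
#include <unistd.h>

/* write the new right sibling before the split block: a concurrent
 * reader then sees either the old L, or both new blocks (possibly
 * with duplicates), but never a right link to a missing block;
 * error handling is omitted for brevity */
static void write_split( int fd, unsigned mk,
                         unsigned sno, const void *S,
                         unsigned lno, const void *L )
{
    pwrite( fd, S, mk, (off_t)sno * mk );
    pwrite( fd, L, mk, (off_t)lno * mk );
}
$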


Deleting an entry is trivial, since it affects one block only (if we're lazy).

In a single-process multi-threaded environment, however,
we can take advantage of certain properties to provide
relatively simple and safe cleanup, synchronously or in the background.