OpenIsis deploys a B-Tree structure similar to that of CDS/ISIS.

There are several differences to the original CDS/ISIS file layout:
- The complete index is kept in one file, including
  control information (.CNT), inner nodes (.N0x) and leaf nodes (.L0x)
  of any key length, and "inverted file pointers" (.IFP, see below).
- The maximum key length is configurable up to 255 bytes (CDS/ISIS: 30).
- The maximum mfn is configurable up to 6 bytes (CDS/ISIS: 3).
- The keys are not blank padded.
- The blocks have no fixed spread factor (CDS/ISIS: 10),
  but a fixed size of mK (where m is 1, 2, 4 or 8),
  allowing for about 200 keys of 30 bytes.
- A custom string comparison function may be used,
  allowing for proper unicode collation.
- To support safe high concurrency, it is a B-Link-Tree as used in
  Postgres (see Lehman/Yao 1981), i.e. each block has a link to its
  right sibling.

Advantages:
- The number of necessary I/O operations is dramatically reduced,
  typically to 1. Since the invention of B-Trees by Bayer, the
  ratio of available main memory to disk storage has stayed at
  roughly 1:100. Since the total size of the inner nodes is about
  the total index size divided by the spread factor, it is crucial
  that the spread factor be at least 100, allowing all the inner
  nodes to be kept in RAM.
- The price paid for this is that, since the inner nodes locate the
  wanted leaf less precisely (within a block of up to 200 entries
  instead of 10), more unwanted data is read from disk.
  However, all I/O is done in mK blocks on mK boundaries,
  matching the "page size" of most modern operating systems.
  This is the basic idea of B-Trees in the first place.
- The maximum database size is greatly raised with regard to the mfn.
  This allows for an index spanning several databases, where the
  higher mfn bytes indicate the relevant master file.

Limits:
The theoretical maximum number of terms is also somewhat raised,
from 20G (2G leaf blocks at 10 terms each) to about 400G
(4G blocks at 200 entries each).
Without support for large files (> 2G), however,
the limit of the one-file design may actually even be lower,
since the 2G file limit bounds the total size of terms + IFPs.


* Basic operation:

The B-Link-Tree relates keys to "inverted file pointers" (IFP).
The key is a byte sequence of a length between 0 and 255.
The IFPs are 8 to 11 bytes long, with 3 to 6 bytes for an mfn (rowid)
and 5 bytes describing where the key occurs within the given record:
2 bytes field tag, 1 byte occ (repetition of field), 2 bytes word position.
These numbers are in big-endian byte order for easy sorting.
The mfn/value length is fixed for a given B-Tree.
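
For illustration, packing an IFP into its byte representation might
look like the following (a hypothetical sketch, not taken from the
OpenIsis sources; the function name and the mfnLen parameter are
assumptions):
$
#include <stdint.h>

/* sketch: pack an IFP into buf in big-endian byte order;
   mfnLen is the per-tree mfn width (3..6 bytes),
   so the total IFP size is mfnLen + 5 bytes */
static void packIfp ( uint8_t *buf, int mfnLen,
    uint64_t mfn, unsigned tag, unsigned occ, unsigned pos )
{
    int i;
    for ( i = mfnLen - 1; i >= 0; --i ) {  /* mfn, high byte first */
        buf[i] = (uint8_t)(mfn & 0xff);
        mfn >>= 8;
    }
    buf[mfnLen  ] = (uint8_t)(tag >> 8);   /* 2 bytes field tag */
    buf[mfnLen+1] = (uint8_t)(tag & 0xff);
    buf[mfnLen+2] = (uint8_t)occ;          /* 1 byte occurrence */
    buf[mfnLen+3] = (uint8_t)(pos >> 8);   /* 2 bytes word position */
    buf[mfnLen+4] = (uint8_t)(pos & 0xff);
}
$
Since all numbers are stored high byte first, a plain memcmp on two
such IFPs yields their numeric sort order.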

File structure:
The index file consists of consecutive mK blocks,
with block number n at file offset n*mK.
Block 0 is the root, Block 1 the leftmost leaf.
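
Fetching a block thus is a single positioned read;
a minimal sketch (function and parameter names are made up here):
$
#include <stdint.h>
#include <unistd.h>

/* sketch: read block n of an index file open as fd;
   blksz is the configured mK block size, e.g. 4096 */
static int readBlock ( int fd, uint32_t n, void *buf, size_t blksz )
{
    ssize_t got = pread( fd, buf, blksz, (off_t)n * (off_t)blksz );
    return got == (ssize_t)blksz ? 0 : -1;
}
$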

Block structure:
The inner structure of a block is similar to an ISIS record:
some header fields followed by a dictionary of entries,
which is an array of positions (offsets into the block) and lengths.
The dictionary is sorted according to the entries' keys.
The actual entry data is located at the end of the block,
growing towards the dictionary as entries are added.

Entry structure:
Each entry starts with its key, whose length is stated in the dictionary.
For an inner node, the key is optionally followed by an IFP,
as indicated by a dictionary flag, and then by a 4 byte block number.
For a leaf node, it is followed by a sorted array of IFPs.
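
The layout described above might be sketched in C like this
(all type and field names are illustrative, not the actual
OpenIsis declarations):
$
#include <stdint.h>

typedef struct {        /* one dictionary slot */
    uint16_t pos;       /* offset of the entry data within the block */
    uint16_t len;       /* key length; a flag bit may mark an inner
                           entry whose key is augmented by an IFP */
} DictSlot;

typedef struct {        /* block header */
    uint32_t right;     /* block number of the right sibling (B-Link) */
    uint16_t nslots;    /* number of dictionary entries */
    uint16_t free;      /* offset where the entry data begins */
    uint8_t  leaf;      /* leaf or inner node */
    /* DictSlot dict[nslots] follows, sorted by key;
       entry data grows from the end of the block downwards:
       inner node: key [+ IFP] + 4 byte child block number
       leaf node:  key + sorted array of IFPs */
} BlockHead;
$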

Searching:
Since it is not uncommon for a key to relate to more distinct IFPs
than would fit within one block, we actually consider the pair of
key and IFP as the key when searching the tree.
When looking for all IFPs for a key, we start with an empty (nulled) IFP.
Upon modification (insert/delete), we use the given IFP as second key segment.
Where the IFPs for one key span several leaf nodes,
the separating inner node entry will have its key augmented
with the IFP, so we may quickly locate the proper leaf node.
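
In other words, the tree is ordered by the pair (key, IFP).
A sketch of such a comparison (the helper itself is made up;
cmp stands for the configurable key comparison function):
$
#include <stdint.h>
#include <string.h>

/* sketch: compare a (key,IFP) search pair against an entry's pair;
   an all-zero IFP sorts before any real one, so starting a lookup
   with a nulled IFP finds the first IFP stored for a key */
static int pairCmp (
    const uint8_t *key,  int klen,  const uint8_t *ifp, int ilen,
    const uint8_t *ekey, int eklen, const uint8_t *eifp,
    int (*cmp)( const uint8_t*, int, const uint8_t*, int ) )
{
    int c = cmp( key, klen, ekey, eklen );
    return c ? c : memcmp( ifp, eifp, ilen );
}
$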

Deleting:
The entry is searched and deleted, with no restructuring.
(Background cleanup may be implemented some day.)

Inserting:
If a new entry fits within the target leaf, everything is fine.
Otherwise either the new entry or one or more other entries may have
to be shifted to the right sibling node.
If the right sibling is non-existent or too full
(or we didn't yet implement checking it),
a new right sibling is created.
The new separating key, which is lower than the one we stopped
at while locating our node, is then updated or inserted in the parent.

Multi-Process Concurrency:
There *must not* be more than one process having the index open for writing.
Readers, however, should work correctly, even in the presence of a writer.

Multi-Thread Concurrency:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read or written from/to disk.
A thread wishing to access a block that is being read, or to write
to a block while it is being written to disk, must wait on the monitor.

Lehman/Yao Concurrency:
While this design should make sure that each block is always
in a consistent state, the interrelation between blocks may be
disturbed by node splitting or shifting:
a key may be found to be no longer in the expected block.
Thus, while the search key is greater than the greatest key
in the current block, one has to try the right sibling.
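
This "move right" rule might be sketched as follows, reusing the
BlockHead sketch from above; getBlock and maxKeyCmp are assumed
helpers (fetch a block; compare a key against the greatest key
in a block, returning > 0 if the key is greater):
$
/* sketch: after landing on block b, compensate for concurrent
   splits or shifts by following right links while our key
   exceeds the greatest key in b */
static BlockHead *moveRight ( BlockHead *b, const uint8_t *key, int klen )
{
    while ( b->right != 0 && maxKeyCmp( b, key, klen ) > 0 )
        b = getBlock( b->right );  /* key can only be further right */
    return b;
}
$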


* Concurrency Details:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read or written from/to disk.

This design gives the best possible overall utilization on a
single-CPU system: a given thread runs exclusively for as long as it
can utilize the CPU. (I.e. it runs unless it is waiting for I/O;
swapping shouldn't be an issue if you're seriously running a db server.)
If you have a multi-CPU system for nothing but OpenIsis,
run secondary read-only servers, each bound to one CPU.
This will be far more efficient than sharing multiple CPUs
among one process's threads.

Actually, we deploy one mutex (LOCK/UNLOCK) and one condition
(WAIT/WAKE) bound to it. Optionally a set of conditions bound
to the same mutex may be used.
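
With POSIX threads, this maps naturally onto one pthread mutex and
condition; a sketch of one possible mapping (not the actual OpenIsis
macros):
$
#include <pthread.h>

static pthread_mutex_t monitor = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  change  = PTHREAD_COND_INITIALIZER;

#define LOCK()   pthread_mutex_lock( &monitor )
#define UNLOCK() pthread_mutex_unlock( &monitor )
#define WAIT()   pthread_cond_wait( &change, &monitor ) /* releases monitor */
#define WAKE()   pthread_cond_broadcast( &change )      /* wake all waiters */
$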

All access to btree blocks is via a common cache, so that there
is never more than one copy of a block in memory.
During I/O, the cache block is flagged as being read or written,
respectively.
The read flag is held only temporarily during the I/O;
it marks the block contents as being asynchronously changed
(by the read call) and thus completely useless.
Any thread wishing to access the block has to wait on it.
After the read call returns, the thread initiating the read will
re-acquire the monitor, clear the flag and wake any waiting threads.

The write flag, on the other hand, acts as a WRITE LOCK on the block.
Threads wishing read-only access to the block ignore it completely,
since the block content is valid and will NOT change asynchronously:
it is changed by the writing thread only while it has the monitor,
thus no other thread will ever observe a change.
A thread wishing to modify the block, however, has to wait,
since a modification during an ongoing write could corrupt data.
Moreover, the lock need not be released immediately after the write
call returns, but can be held until a second block is also written.


* The block reading sequence is as follows:
$
r0 find the block in the cache.
if it is not found, set it reading, release the monitor, read.
else, if it is being read, increase the waiter count, wait
(this releases the monitor).
else use it (skip the next state).
[r1] SUSPENDED
wait for either our own read to return or to be woken by another reader.
if we return from read, re-acquire the monitor.
(if we are woken, the monitor is re-acquired implicitly.)
r2 if we return from read, clear the reading flag, wake waiting threads.
if we are woken, decrease the waiter count.
(since we might have fewer distinct conditions than blocks,
we have to check the reading flag to see whether we were meant,
and possibly wait again.)
we have the block wanted.
when searching, if it's not the leaf we wanted, start over.
$
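
Condensed into C, this might read as below, reusing readBlock and the
monitor macros from above; the CacheBlock type, cacheFind/cacheAlloc
and the fd/blksz globals are made up for illustration, and the waiter
counting of r1/r2 is left to the condition variable:
$
/* CacheBlock: hypothetical in-memory cache slot, wrapping the
   block buffer with bookkeeping the on-disk layout doesn't carry */
typedef struct {
    uint32_t   no;      /* block number */
    unsigned   flags;   /* READING / WRITING bits */
    BlockHead *data;    /* the mK block buffer itself */
} CacheBlock;

enum { READING = 1, WRITING = 2 };

/* sketch of r0..r2: get block n, reading it synchronously on a miss */
static CacheBlock *getBlockSync ( uint32_t n )
{
    CacheBlock *b;
    LOCK();
    for (;;) {
        b = cacheFind( n );            /* r0: look in the cache */
        if ( ! b ) {                   /* miss: read it ourselves */
            b = cacheAlloc( n );
            b->flags |= READING;
            UNLOCK();                  /* [r1]: no monitor during I/O */
            readBlock( fd, n, b->data, blksz );
            LOCK();
            b->flags &= ~READING;      /* r2: clear the flag */
            WAKE();                    /* wake any waiting threads */
            break;
        }
        if ( ! ( b->flags & READING ) )
            break;                     /* valid copy: use it */
        WAIT();                        /* being read: wait, re-check */
    }
    UNLOCK();
    return b;
}
$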

NOTE:
- the thread starting the read on a block is always the first to get it,
since waiters can proceed only after this thread has cleared the flag.
- for a block that already is in the cache, the reader proceeds without
releasing the monitor. This especially holds for blocks being written.
These properties can be used for a safe background cleanup.
- a reader does not need to hold a block in the cache while it is
suspended in I/O: all processing happens immediately after reading.
For a secondary reader on a database being written,
at least leaf blocks should be invalidated asap
(otherwise it will obviously miss some updates).

* The overall search algorithm:
$
s1 starting from the root, locate the first entry with a key
greater than our key, read the block pointed to by the previous entry.
if no such entry is found, and the block has a right sibling,
check it for a possible split or shift (Lehman/Yao).
s2 when reaching a matching entry in a leaf, read subsequent right siblings
according to the search range criteria, and buffer them.
$
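
A sketch of the descent, built on the getBlockSync and maxKeyCmp
sketches from above; findChild is another assumed helper, returning
the child block number stored with the last entry whose key is not
greater than the search key:
$
/* sketch of s1: descend from the root to the wanted leaf */
static CacheBlock *findLeaf ( const uint8_t *key, int klen )
{
    CacheBlock *b = getBlockSync( 0 );  /* block 0 is the root */
    for (;;) {
        /* Lehman/Yao: move right while our key exceeds the block */
        while ( b->data->right != 0
                && maxKeyCmp( b->data, key, klen ) > 0 )
            b = getBlockSync( b->data->right );
        if ( b->data->leaf )
            return b;                   /* s2 continues from here */
        b = getBlockSync( findChild( b->data, key, klen ) );
    }
}
$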

NOTE:
- due to the right sibling checking, missing or too high keys
in inner nodes for some children do not lead to inaccurate
search results, they only incur a performance penalty
(dropping the tree altogether means linear search).


* The general block writing sequence:
$
w0 reading "for update" as in r0, but with an additional condition:
if it has the WRITE LOCK flag set, wait, else use it (skip the next state).
[w1] SUSPENDED
wait for the read or to be woken by another writer.
on return from read, re-acquire the monitor.
w2 if we are woken and there's a WRITE LOCK, wait again.
(the original reader or some other waiting thread might have set it.)
after returning from read, set the WRITE LOCK flag, wake waiting readers.
if we need to write another block, start over.
process the block(s).
release the monitor, start the write call on the first block to write.
[w3] SUSPENDED
wait for the write call to return.
repeat while there are more blocks to be written.
re-acquire the monitor.
w4 clear the WRITE LOCK flag on some blocks, wake waiting threads.
if there are more blocks to be processed, start over.
$
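
The WRITE LOCK handling of this sequence might be sketched as follows,
using the monitor and the (hypothetical) flags word from above; the
caller is assumed to hold the monitor, with the block already in cache:
$
/* sketch of w0/w2: acquire the WRITE LOCK on a cached block */
static void writeLock ( CacheBlock *b )
{
    while ( b->flags & ( READING | WRITING ) )
        WAIT();                /* w1/w2: re-check after waking */
    b->flags |= WRITING;       /* readers ignore this, writers wait */
}

/* sketch of w4: release the lock after the write call returned */
static void writeUnlock ( CacheBlock *b )
{
    b->flags &= ~WRITING;
    WAKE();                    /* wake waiting writers */
}
$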


* The insert algorithm:
$
i0 locate the leaf L where the entry is to be inserted.
if it has enough room, process it and we're done.
i1 if it has a right sibling S, read it for update.
if it hasn't enough room to take some entries from L,
release the write lock and use a new block S as L's right sibling instead.
shift entries to S, insert the new entry into L (or S).
i2 write first S, then L to disk.
release the write lock on L.
i3 read the parent P of S for update,
i.e. the parent of L or some block to the right.
if we reused the old S, delete its old entry.
insert the new lower bound on the keys of S.
if the parent hasn't enough room, start over at i1 (with P for L).
$
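
The lock and write ordering of i0..i3 might be outlined like this;
everything here is hypothetical shorthand (Entry, blockFits, blockAdd,
siblingForUpdate, shiftEntries, belongsLeft, writeBack, updateParent
merely name the steps above), and error handling is omitted:
$
/* sketch: insert entry e into leaf L, shifting to the right sibling */
static void insertEntry ( CacheBlock *L, const Entry *e )
{
    if ( blockFits( L, e ) ) {              /* i0: room in the leaf */
        blockAdd( L, e );
        writeBack( L );
        return;
    }
    CacheBlock *S = siblingForUpdate( L );  /* i1: old or new sibling */
    shiftEntries( L, S );                   /* move upper entries to S */
    blockAdd( belongsLeft( e, S ) ? L : S, e );
    writeBack( S );                         /* i2: S strictly before L */
    writeBack( L );
    writeUnlock( L );                       /* S stays locked for i3 */
    updateParent( S );                      /* i3: may recurse via i1 */
}
$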

When unfolded into synchronized/unsynchronized steps,
a simplified version of this might look like:
$
read your way down to L until it's read for update...
u0 not enough space in L, read S for update
[u1] wait for S (L locked)
v2 modify both blocks, initiate read of P for update
[v3] wait for P, write S then L (L,S locked)
v4 release locks on L and S, update P or go to u1 (P locked)
$


NOTE:
- this is deadlock-free, since locking is ordered left to right,
bottom to top.
- by always keeping one lock, we avoid conflicts with other writers
for the same entries, since those have to follow the same lock path
and thus always work in a serializable manner.
- if we are the original reader for P,
other threads wanting to read P have to wait for three I/Os.
The detailed variant below avoids this.
- reading the sibling, and therefore the [u1] suspend, is optional.
depending on the key distribution, it might not hurt much,
or might even be more efficient, to always create a new block.
- the only inconsistency a reader thread might notice is during v3
(or a later suspend in order to lock P's sibling):
there are new versions of L and S, but P is in the old state.
The key for S (or L's former sibling) is too high, and thus an entry
now in S may be searched for in L. In this situation,
the reader has to follow L's right link to find the key.


There is a slightly more complex variant of insert,
which allows for somewhat enhanced concurrency,
but with higher synchronization effort:
$
u2 modify both blocks (L,S locked)
[u3] write S then L
u4 release lock on L (S locked), read P for update
[u5] wait for P
u6 release lock on S, update P or go to u1 (P locked)
$


* Multi-Process NOTEs:
- we have to assume that block writes
* are atomic (seen by others completely or not at all) and
* are ordered (if a later write is seen, any earlier one is, too).
This does NOT hold for the file state on disk,
but should hold for the operating system's file cache.
- a reader in another process might also read the new version of
S but the old version of L and thus has to deal with duplicate entries
when searching ranges.
- if a reader is caching, and thus the write order is not in effect,
it might miss entries completely.
This does still give correct results, if
* no leaves are cached
* for all cached blocks, the parent is older.
This is ok for one-shot readers, or if caching is strictly from the top,
with a forced re-read after a cache miss on the search path.


Deleting an entry is trivial, since it affects one block only (if we're lazy).

In a single-process multi-threaded environment, however,
we can take advantage of certain properties to provide
relatively simple and safe cleanup, synchronously or in the background.