OpenIsis deploys a B-Tree structure similar to that of CDS/ISIS.

There are several differences to the original CDS/ISIS file layout:
- The complete index is kept in one file, including
control information (.CNT), inner nodes (.N0x) and leaf nodes (.L0x)
of any key length, and "inverted file pointers" (.IFP, see below).
- The maximum key length is configurable up to 255 bytes (CDS/ISIS: 30).
- The maximum mfn is configurable up to 6 bytes (CDS/ISIS: 3).
- The keys are not blank padded.
- The blocks have no fixed spread factor (CDS/ISIS: 10),
but a fixed size of mK (where m is 1, 2, 4 or 8),
allowing for about 200 keys of 30 bytes.
- A custom string comparison function may be used,
allowing for proper unicode collation.
- To allow for safe, highly concurrent access, it is a B-Link-Tree as used
in postgres (see Lehman/Yao 1981), i.e. each block has a link to its
right sibling.

Advantages:
- The number of necessary I/O operations is dramatically reduced,
typically to 1. Since the invention of B-Trees by Bayer, the
ratio of available main memory to disk storage has stayed at about 1:100.
Since the ratio of the total size of the inner nodes to the total
index size is approximately the inverse of the spread factor, it is crucial
that the spread factor is at least 100, allowing all the inner nodes
to be held in RAM.
- The price paid for this is that, since the inner nodes locate the
wanted leaf less precisely (within a block of up to 200 entries instead of 10),
more unwanted data is read from disk.
However, all I/O is done in mK blocks on mK boundaries,
matching the "page size" of most modern operating systems.
This is the basic idea of B-Trees in the first place.
- The maximum database size is greatly raised with regard to the mfn.
This allows for an index spanning several databases, where the
higher mfn bytes indicate the relevant master file.

Limits:
The theoretical maximum number of terms is also somewhat raised, from 20G
(2G leaf blocks at 10 terms each) to about 400G (4G blocks of up to
200 entries each; at a typical fill of about half that, this yields
some 400G).
Without support for large files (> 2G), however,
the limit of the one-file design may actually even be lower,
since the 2G file size limits the total size of terms + IFPs.



* Basic operation:

The B-L-Tree relates keys to "inverted file pointers" (IFP).
The key is a byte sequence with a length between 0 and 255.
The IFPs are 8 to 11 bytes long, with 3 to 6 bytes for an mfn (rowid),
and 5 bytes describing where the key occurs within the given record:
2 bytes field tag, 1 byte occ (repetition of the field), 2 bytes word position.
These numbers are in big-endian byte order for easy sorting.
The mfn/value length is fixed for a given B-Tree.
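
A minimal C sketch of how such an IFP might be packed; the struct and
function names are illustrative assumptions, not the actual OpenIsis
identifiers, and the widest layout (6 byte mfn, 11 bytes total) is used:
$
#include <stdint.h>

/* hypothetical unpacked form of an inverted file pointer */
typedef struct {
    uint64_t mfn; /* row id; only the low 6 bytes are stored */
    uint16_t tag; /* field tag */
    uint8_t  occ; /* occurrence (repetition of the field) */
    uint16_t pos; /* word position within the occurrence */
} Ifp;

/* pack into big-endian bytes, so plain memcmp() yields sort order */
static void ifp_pack( unsigned char *buf, const Ifp *ifp )
{
    int i;
    for ( i = 0; i < 6; i++ ) /* 6 byte mfn, most significant first */
        buf[i] = (unsigned char)( ifp->mfn >> ( 8 * ( 5 - i ) ) );
    buf[6]  = (unsigned char)( ifp->tag >> 8 );
    buf[7]  = (unsigned char)( ifp->tag );
    buf[8]  = ifp->occ;
    buf[9]  = (unsigned char)( ifp->pos >> 8 );
    buf[10] = (unsigned char)( ifp->pos );
}
$
Since all fields are big-endian, two packed IFPs compare correctly with
memcmp(a, b, 11), which is what makes the IFP usable as a key segment.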

File structure:
The index file consists of consecutive mK blocks,
with block number n at file offset n*mK.
Block 0 is the root, Block 1 the leftmost leaf.

Block structure:
The inner structure of a block is similar to an ISIS record:
some header fields followed by a dictionary of entries,
which is an array of positions (offsets into the block) and lengths.
The dictionary is sorted according to the entries' keys.
The actual entry data is located at the end of the block,
growing towards the dictionary as entries are added.

Entry structure:
Each entry starts with the len bytes of its key, as stated in the dictionary.
For an inner node, the key is optionally followed by an IFP,
as indicated by a dictionary flag, and then by a 4 byte block number.
For a leaf node, the key is followed by a sorted array of IFPs.
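
The block and entry layout described above might be sketched in C as
follows; all field names and widths are assumptions for illustration,
not the actual on-disk format:
$
#include <stdint.h>

/* one dictionary slot, describing where an entry lives in the block */
typedef struct {
    uint16_t pos;   /* offset of the entry data within the block */
    uint16_t len;   /* length of the key */
    uint16_t flags; /* e.g. "inner entry carries an IFP" */
} DictSlot;

/* header of an mK block; the sorted dictionary follows the header,
 * while entry data grows from the end of the block towards it */
typedef struct {
    uint32_t right;  /* block number of the right sibling, 0 if none */
    uint16_t nslots; /* number of dictionary entries */
    uint16_t free;   /* offset where the entry data currently begins */
    DictSlot dict[]; /* sorted by key (and, if present, IFP) */
} BlockHead;
$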

Searching:
Since it is not uncommon for a key to relate to more distinct IFPs
than would fit within one block, we actually consider the pair of
key and IFP as the key when searching the tree.
When looking for all IFPs of a key, we start with an empty (nulled) IFP.
Upon modification (insert/delete), we use the given IFP as the second key segment.
Where the IFPs for one key span several leaf nodes,
the separating inner node entry will have its key augmented
with the IFP, so we may quickly locate the proper leaf node.
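
A sketch of this composite comparison in C, assuming a configurable
collation function for the key part (all names are illustrative):
$
#include <string.h>

/* compare a (key,ifp) search tuple against an entry's (key,ifp);
 * a nulled ifp sorts before every real occurrence of the same key,
 * so a search for "all IFPs of a key" starts at the first one */
static int entry_cmp(
    const unsigned char *akey, int alen, const unsigned char *aifp,
    const unsigned char *bkey, int blen, const unsigned char *bifp,
    int ifplen,
    int (*cmp)( const unsigned char*, int, const unsigned char*, int ) )
{
    int c = cmp( akey, alen, bkey, blen ); /* collation on the keys */
    if ( c )
        return c;
    return memcmp( aifp, bifp, ifplen ); /* big-endian IFP bytes */
}
$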

Deleting:
The entry is searched and deleted; there is no restructuring.
(Background cleanup may be implemented some day.)

Inserting:
If a new entry fits within the target leaf, everything is fine.
Otherwise either the new entry, or one or more other entries, may have
to be shifted to the right sibling node.
If the right sibling is non-existent or too full
(or we didn't yet implement checking it),
a new right sibling is created.
The new separating key, which is lower than the one we stopped
at while locating our node, is then updated or inserted in the parent.

Multi-Process Concurrency:
There *must not* be more than 1 process having the index open for writing.
Readers, however, should work correctly, even in the presence of a writer.

Multi-Thread Concurrency:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read from or written to disk.
A thread wishing to access a block that is being read, or to write
to a block while it is being written to disk, must wait on the monitor.

Lehman/Yao Concurrency:
While this design should make sure that each block is always
in a consistent state, the interrelation between blocks may be
disturbed by node splitting or shifting:
a key may be found to be no longer in the expected block.
Thus, while the search key is greater than the greatest key
in the current block, one has to try the right sibling.



* Concurrency Details:
Within a multi-threaded process, all access to internal structures
is interlocked within a monitor. The monitor is released during I/O,
with the block marked as being read from or written to disk.

This design will give the best possible overall utilization on a
single-CPU system: a given thread runs exclusively as long as it can
utilize the CPU. (I.e. unless it is waiting for I/O; swapping shouldn't
be an issue if you're seriously running a db server.)
If you have a multi-CPU system for nothing but OpenIsis,
run secondary read-only servers, each bound to one CPU.
This will be far more efficient than sharing multiple CPUs
among one process's threads.

Actually, we deploy one mutex (LOCK/UNLOCK) and one condition
(WAIT/WAKE) bound to it. Optionally a set of conditions bound
to the same mutex may be used.
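
In POSIX terms, this monitor could be built as in the following minimal
sketch; the mapping of the LOCK/UNLOCK and WAIT/WAKE macros onto
pthreads is an assumption for illustration:
$
#include <pthread.h>

static pthread_mutex_t monitor = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond    = PTHREAD_COND_INITIALIZER;

#define LOCK()   pthread_mutex_lock( &monitor )
#define UNLOCK() pthread_mutex_unlock( &monitor )
/* WAIT atomically releases the monitor and re-acquires it when woken */
#define WAIT()   pthread_cond_wait( &cond, &monitor )
/* WAKE wakes all waiters; each re-checks its block's flags */
#define WAKE()   pthread_cond_broadcast( &cond )
$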

All access to btree blocks is via a common cache, so that there
is never more than one copy of a block in memory.
During I/O, the cache block is flagged as being read or written, respectively.
The read flag is held only temporarily during the I/O;
it marks the block contents as being asynchronously changed
(by the read call) and thus completely useless.
Any thread wishing to access the block has to wait on it.
After the read call returns, the thread that initiated the read will
re-acquire the monitor, clear the flag and wake any waiting threads.

The write flag, on the other hand, acts as a WRITE LOCK on the block.
Threads wishing read-only access to the block completely ignore it,
since the block content is valid and will NOT asynchronously change:
it is changed by the writing thread only while it has the monitor,
thus no other thread will ever observe a change.
A thread wishing to modify the block, however, has to wait,
since a modification during an ongoing write could corrupt data.
Moreover, the lock need not be released immediately after the write
call returns, but can be held until a second block is also written.


* The block reading sequence is as follows:
$
r0 find the block in the cache.
if it is not found, set it reading, release the monitor, read.
else, if it is being read, increase the waiter count, wait (releases the monitor).
else use it (skip the next state).
[r1] SUSPENDED
wait for either our own read to return or to be woken by another reader.
if we return from read, re-acquire the monitor.
(if we are woken, the monitor is re-acquired implicitly).
r2 if we return from read, clear the reading flag, wake waiting threads.
if we are woken, decrease the waiter count.
(Since we might have fewer distinct conditions than blocks,
we have to check the reading flag to see whether we were meant
and possibly wait again.)
we have the block wanted.
on searching, if it is not the leaf we wanted, start over.
$
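
A condensed C sketch of this sequence, using the hypothetical
LOCK/WAIT/WAKE macros from above and an invented cache interface
(the waiter count is omitted for brevity):
$
typedef struct CacheBlk CacheBlk; /* cached block: flags plus mK data */
enum { READING = 1, WRITELOCK = 2 };

/* hypothetical cache interface */
CacheBlk *cache_find( unsigned blkno );
CacheBlk *cache_add( unsigned blkno );
int  blk_flags( CacheBlk *b );
void blk_set_flags( CacheBlk *b, int set, int clear );
void read_from_disk( unsigned blkno, CacheBlk *b );

CacheBlk *blk_read( unsigned blkno )
{
    CacheBlk *b;
    LOCK();
    for (;;) {
        b = cache_find( blkno );            /* r0 */
        if ( !b ) {
            b = cache_add( blkno );
            blk_set_flags( b, READING, 0 );
            UNLOCK();
            read_from_disk( blkno, b );     /* [r1] SUSPENDED */
            LOCK();
            blk_set_flags( b, 0, READING ); /* r2 */
            WAKE();
            break;
        }
        if ( !( blk_flags( b ) & READING ) )
            break; /* block is usable; a WRITE LOCK is ignored here */
        WAIT();    /* releases the monitor; re-check the flag when woken */
    }
    UNLOCK();
    return b;
}
$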

NOTE:
- the thread starting the read on a block is always the first to get it,
since waiters can proceed only after this thread has cleared the flag.
- for a block that already is in the cache, the reader proceeds without
releasing the monitor. This especially holds for blocks being written.
These properties can be used for a safe background cleanup.
- a reader does not need to hold a block in the cache while it is
suspended in I/O: all processing is done immediately after reading.
For a secondary reader on a database being written,
at least leaf blocks should be invalidated asap
(otherwise it will obviously miss some updates).

* The overall search algorithm:
$
s1 starting from the root, locate the first entry with key
greater than our key; read the block pointed to by the previous entry.
if there is no such entry found, and the block has a right sibling,
check it for a possible split or shift (Lehman/Yao).
s2 when reaching a matching entry in a leaf, read subsequent right siblings
according to the search range criteria and buffer.
$
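
A sketch of the descent in C; blk_read is the hypothetical helper from
above, and the blk_* accessors are likewise invented for illustration:
$
int blk_is_leaf( CacheBlk *b );
unsigned blk_right( CacheBlk *b ); /* right sibling, 0 if none */
int blk_key_gt_max( CacheBlk *b, const unsigned char *key, int len );
unsigned blk_child( CacheBlk *b, const unsigned char *key, int len );

/* descend from the root to the leaf that should hold the key */
CacheBlk *find_leaf( const unsigned char *key, int len )
{
    CacheBlk *b = blk_read( 0 ); /* block 0 is the root */
    for (;;) {
        /* Lehman/Yao: if the key exceeds the greatest key in this
         * block, it may have been split or shifted; move right */
        while ( blk_key_gt_max( b, key, len ) && blk_right( b ) )
            b = blk_read( blk_right( b ) );
        if ( blk_is_leaf( b ) )
            return b; /* s2 continues from here */
        /* s1: follow the child pointer of the last entry with a key
         * not greater than ours */
        b = blk_read( blk_child( b, key, len ) );
    }
}
$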

NOTE:
- due to the right sibling checking, it does not lead to inaccurate
search results if inner nodes have missing or too high keys
for some children; it only has a performance penalty
(dropping the tree altogether means linear search).


* The general block writing sequence:
$
w0 reading "for update" as in r0, but with an additional condition:
if it has the WRITE LOCK flag set, wait; else use it (skip the next state).
[w1] SUSPENDED
wait for the read or to be woken by another writer.
on return from read, re-acquire the monitor.
w2 if we are woken and there is a WRITE LOCK, wait again.
(the original reader or some other waiting thread might have set it).
after returning from read, set the WRITE LOCK flag, wake waiting readers.
if we need to write another block, start over.
process block(s).
release the monitor, start the write call on the first block to write.
[w3] SUSPENDED
wait for the write call to return.
repeat while there are more blocks to be written.
re-acquire the monitor.
w4 clear the WRITE LOCK flag on some blocks, wake waiting threads.
if there are more blocks to be processed, start over.
$
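
The corresponding lock acquisition might look like this sketch; the
blk_read_locked helper (blk_read returning with the monitor still held)
is invented for illustration:
$
CacheBlk *blk_read_locked( unsigned blkno );

CacheBlk *blk_read_for_update( unsigned blkno )
{
    CacheBlk *b;
    LOCK();
    for (;;) {
        b = blk_read_locked( blkno ); /* w0/[w1]: may wait for a read */
        if ( !( blk_flags( b ) & WRITELOCK ) )
            break;
        WAIT(); /* w2: another writer holds the block; re-check */
    }
    /* readers may still use the block; only writers must wait */
    blk_set_flags( b, WRITELOCK, 0 );
    UNLOCK();
    return b;
}
$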


* The insert algorithm:
$
i0 locate the leaf L where the entry is to be inserted.
if it has enough room, process it and we're done.
i1 if it has a right sibling S, read it for update.
if S hasn't enough room to take some entries from L,
release the write lock and use a new block S as L's right sibling instead.
shift entries to S, insert the new entry into L (or S).
i2 write first S, then L to disk.
release the write lock on L.
i3 read the parent P of S for update,
i.e. the parent of L or some block to the right.
if we used an existing S, delete its old entry.
insert the new lower bound on the keys of S.
if the parent hasn't enough room, start over at i1 (with P for L).
$

When unfolded into synchronized/unsynchronized steps,
a simplified version of this might look like:
$
read your way down to L until it is read for update...
u0 not enough space in L, read S for update
[u1] wait for S (L locked)
v2 modify both blocks, initiate read of P for update
[v3] wait for P, write S then L (L,S locked)
v4 release locks on L and S, update P or go to u1 (P locked)
$
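
Pulled together, i0-i3 might look like the following C sketch, glossing
over the locking details shown above (all helpers are hypothetical):
$
CacheBlk *find_leaf_for_update( const unsigned char *key, int len );
CacheBlk *sibling_for_update( CacheBlk *L ); /* old or fresh sibling */
int  blk_fits( CacheBlk *b, int len );
int  blk_covers( CacheBlk *b, const unsigned char *key, int len );
void blk_put( CacheBlk *b, const unsigned char *key, int len,
              const unsigned char *ifp );
void blk_shift_right( CacheBlk *L, CacheBlk *S );
void blk_write( CacheBlk *b );
void blk_unlock( CacheBlk *b );
void update_parent( CacheBlk *S );

void btree_insert( const unsigned char *key, int len,
                   const unsigned char *ifp )
{
    CacheBlk *L, *S;
    L = find_leaf_for_update( key, len );   /* i0 */
    if ( blk_fits( L, len ) ) {
        blk_put( L, key, len, ifp );
        blk_write( L );
        blk_unlock( L );
        return;
    }
    S = sibling_for_update( L );  /* i1: right sibling, old or new */
    blk_shift_right( L, S );      /* move some entries over to S */
    blk_put( blk_covers( L, key, len ) ? L : S, key, len, ifp );
    blk_write( S );               /* i2: S strictly before L */
    blk_write( L );
    blk_unlock( L );
    update_parent( S );           /* i3: may split P, recursing upwards */
    blk_unlock( S );
}
$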



NOTE:
- this is deadlock-free, since locking is ordered left to right,
bottom to top.
- by always keeping one lock, we avoid conflicts with other writers
for the same entries, since those have to follow the same lock path,
and thus we always work in a serializable manner.
- if we are the original reader for P,
other threads wanting to read P have to wait for three I/Os.
The detailed variant below avoids this.
- reading the sibling, and therefore the [u1] suspend, is optional:
depending on the key distribution, it might not hurt much,
or might even be more efficient, to always create a new block.
- the only inconsistency a reader thread might notice is during v3
(or a later suspend in order to lock P's sibling):
there are new versions of L and S, but P is in the old state.
The key of S (or L's former sibling) is too high, and thus an entry
now in S may be searched for in L. In this situation,
the reader has to follow L's right link to find the key.


There is a slightly more complex variant of insert,
which allows for somewhat enhanced concurrency,
but with higher synchronization effort:
$
u2 modify both blocks (L,S locked)
[u3] write S then L
u4 release lock on L (S locked), read P for update
[u5] wait for P
u6 release lock on S, update P or go to u1 (P locked)
$


* Multi-Process NOTEs:
- we have to assume that block writes
* are atomic (seen by others completely or not at all) and
* are ordered (if a later write is seen, an earlier one is also).
This does NOT hold for the file state on disk,
but should hold for the operating system's file cache
(see the sketch after this list).
- a reader in another process might also read the new version of
S but the old version of L, and thus has to deal with duplicate entries
when searching ranges.
- if a reader is caching, and thus write order is not in effect,
it might miss entries completely.
This does still give correct results, if
* no leaves are cached
* for all cached blocks, the parent is older
This is ok for one-shot readers, or if caching is strictly from the top
with a forced re-read after a cache miss on the search path.
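
The ordering assumption is the reason i2 above writes S strictly before
L. A sketch of the two ordered writes, assuming plain pwrite() into the
operating system's file cache (no fsync; all names are illustrative):
$
#include <unistd.h>

/* write the new right sibling before the split block: a concurrent
 * reader then sees either the old L, or both new blocks (possibly
 * with duplicates), but never a right link to a missing block;
 * error handling is omitted for brevity */
static void write_split( int fd, unsigned mk,
                         unsigned sno, const void *S,
                         unsigned lno, const void *L )
{
    pwrite( fd, S, mk, (off_t)sno * mk );
    pwrite( fd, L, mk, (off_t)lno * mk );
}
$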


Deleting an entry is trivial, since it affects one block only (if we're lazy).

In a single-process multi-threaded environment, however,
we can take advantage of certain properties to provide
relatively simple and safe cleanup, synchronously or in the background.