/[webpac]/openisis/0.9.9e/doc/MultiProcess.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /openisis/0.9.9e/doc/MultiProcess.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 604 - (show annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 3 months ago) by dpavlin
File MIME type: text/plain
File size: 16451 byte(s)
import of new openisis release, 0.9.9e

1 Concurrency and locking in the Malete core.
2
3
4 * MP vs. MT
5
6 Unlike the OpenIsis versions up to 0.9, which where built for multithreading,
7 the Malete core is designed to run in multiple processes in parallel.
8
9 The major reason for this shift is ongoing problems with several
10 MT environments:
11 - multithreading is in a big move towards compiler supported
12 thread local storage (TLS). TLS routines like pthread_getspecific
13 or TlsGetValue are replaced by the __thread attribute on variables.
14 While this is nice and efficient, it is incompatible
15 (it does change the behaviour of libc in subtle ways).
16 - on Linux, the 2.2/2.4 LinuxThreads are now superseded by NPTL
17 in the 2.6 kernels (and some backported 2.4.20s, watch out!),
18 which uses compiler supported TLS. While this offers a couple of advantages,
19 it is not worth the effort to support both during the transition.
20 - on BSD thread support is only emerging. On MacOS X, the implementation
21 has yet to prove its stability. Even on Solaris with a long track
22 of reliable MT support, they are switching the threading model.
23 - on Windows the 9x and Me versions are still in widespread use,
24 which lack important features to leverage the benefits of MT
25 (OVERLAPPED_IO and SignalObjectAndWait). While they also lack
26 support for synchronization of multiple read/write processes,
27 at least read-only multiprocessing is feasible.
28 - while Java can be considered by far the most stable environment
29 for MT programming, this only works out when actually writing pure Java.
30 Embedding MT C-code via JNI is not well defined and pretty intricate,
31 especially where the underlying MT implementation is on the move.
32
33 On the other hand, there is some demand for MP support in certain
34 environments like e.g. MP PHP or a parallel server running from
35 wrappers like tcpserver.
36
37 Thus Malete by now focuses on MP and leaves a reworked MT version
38 for better times with stable and widely available TLS support.
39
40
41 * shared ressources
42
43 The only ressources shared are (regions of) files,
44 i.e. there is no shared memory (besides mmapped files),
45 queues, pipes, sockets or other ressources.
46 In typical usage read access is much more frequent than writing.
47
48 Therefore the means of coordinating access to shared ressources
49 - must support multiple processes
50 - should support read/write locks
51 (i.e. shared vs. exclusive locking modes) to allow for concurrent readers
52 - should support locking and unlocking of regions of open files
53 to support concurrent writers
54
55 With some limitations to be discussed further below,
56 this can be implementated based on file locking.
57
58
59 Features we do NOT need:
60 - locking over network file systems means looking for trouble.
61 While this MAY work in simple cases, it is NOT RECOMMENDED!
62 If you really can't avoid accessing remote database files via NFS or SMB,
63 the only reasonable use of locks is to "protect" read-only access.
64 DO NOT even consider writing your valuable data to remote storage.
65 Better run a database server where the disks are.
66 - mandatory locking means looking for trouble,
67 c.f. /usr/src/linux/Documentation/mandatory.txt.
68 DO NOT mount with mandatory locking support,
69 DO NOT set the mandatory locking file mode!
70 - deadlock detection.
71 The usage of locking is deadlock free.
72 (Unless mandatory locking is enabled, so don't do that).
73
74
75 * concurrency modes
76
77 There are three modes of concurrency to distinguish:
78 - read-only mode:
79 any number of processes holding shared locks on whole files.
80 Every process may read at any time, no process may write.
81 This is the most efficient mode for high volume query processing.
82 - exclusive mode:
83 a single process holding exclusive locks on whole files.
84 The process may or may not write to the files.
85 A simple, safe and portable mode to allow writing.
86 - shared mode:
87 multiple processes using temporary locks on regions of files.
88 Every process may read and write. Supported on UNIX-style systems only.
89
90 The first two modes using locks held during process lifetime
91 are trivially correct. Actually older OpenIsis versions used such locks.
92 Shared mode is much more complicated and deserves more detailled inspection,
93 which is done in the remainder of this document.
94
95
96 * shared mode
97
98 First of all: try to avoid it.
99
100 For development and typical data entry use, an exclusive mode single process
101 server will do perfectly well. For high volume query processing,
102 use a separate copy of the database in read-only mode.
103 Malete databases are designed to be easily copied and backed up.
104 This is intrinsically much more efficient and reliable than combining
105 high read and write load. Moreover, cleanup tasks like data and tree
106 compaction can be safely done in an exclusive writer,
107 but would require unduly synchronization effort in shared mode.
108
109 Having that said, here are the gory details.
110
111
112 Both the records (r) and the index entries (q) use two files each:
113 - a "data" (d) file
114 (the plain masterfile .mrd and the BTree leaves chain .mqd, resp.)
115 - an "access" (x) file for faster lookup
116 (the record xref file .mrx and BTree inner nodes .mqx, resp.)
117 The access and data files need to be in sync,
118 thus writing access needs to be synchronized to some extend.
119 Typically, but not necessarily, the application uses index entries
120 in turn as pointers for the records; see below.
121
122 Basically, all files are accessed using (unbuffered) native file IO.
123 The access files, however, are memory mapped where possible.
124
125
126 In summary, we use
127 - one "record lock" per database guarding any record write
128 - a "xref lock" for every record
129 - one "tree lock" per database guarding BTree inner node access
130 - a "leaf lock" for every BTree leaf block
131
132
133 To read a record:
134 - obtain the shared xref lock
135 - look up the xref
136 - release the xref lock
137 - read the record
138 (does not need a lock since records are immutable)
139
140 Writing a new or changed record is done by:
141 - acquiring the exclusive record lock
142 - appending to the end of the masterfile
143 - obtain the exclusive xref lock
144 - writing to the xref file (using msync where mmapped)
145 - release the xref lock
146 - release the record lock
147
148 When searching the index
149 - the shared tree lock is acquired, the tree is read to find the leaf number,
150 and the tree lock is released
151 - a shared lock on the leaf block is acquired and the leaf is read.
152 The shared lock is released immediatly after reading.
153 - on split detection (Lehmann/Yao) successive leaf blocks are read alike
154
155 The steps to write an index entry are:
156 - find the target leaf block, searching as above,
157 but using exclusive leaf locks. On split detection,
158 an ex leaf lock is released only after locking and reading a successor.
159 The final ex lock is held until after the leaf block has been written.
160 - if the write involves a split, lock the block after the end
161 of the leaves file and double check you really got a brand new block
162 (it must not be readable, else, repeat)
163 - write the block or blocks
164 - if the write involved a split, obtain the exclusive tree lock.
165 Iteratively insert pointers to any newly created blocks.
166 - release all locks
167
168 To support a unique key, the record lock is held while writing / looking up
169 the record's key in the index and until after the record was written.
170
171
172 * operating system considerations
173
174 Locking an mmapped tree file may not work on some systems.
175 The rationale is that accesses to mapped memory can not be
176 checked against arbitrary mandatory locked regions;
177 c.f. /usr/src/linux/Documentation/mandatory.txt.
178 On some systems locking mapped files is completely ruled out,
179 others, like Linux, deny only mandatory locks.
180 Solaris allows locks on the whole file only (not regions).
181
182 Since all locks are used advisory, however,
183 there is no need to put the lock on the actual bytes affected by an operation.
184 - The record lock is on byte 0 of the masterfile .mrd,
185 and the xref on record id n (1..) locks byte n.
186 - The tree lock is on byte 1 of the leaves file .mqd,
187 and the leaf block n (0..) locks byte 2*n.
188
189 Consequently, in read-only and exclusive mode, full locks on the
190 data files .m?d are sufficient to prevent conflicts with other writers.
191 While this statement with regard to read-only and exclusive mode processes
192 can be considered to constitute an interface,
193 the details of shared mode coordination may change
194 (i.e. which bytes are locked in which order, see considerations below).
195
196
197 Since M$ Windows does not offer support for serious programming,
198 we are limited to the exclusive and read-only modes:
199
200 The 9x/Me family is even lacking shared file region locks and locks that can
201 be waited upon, not to mention any reasonable means of signaling.
202 Memory mappings here are of limited use since they might copy the file to swap.
203
204 The NT-based versions have a couple of the necessary calls like LockFileEx;
205 Still, all this is pretty tedious and the semantics of memory mappings
206 (CreateFileMapping) in the presence of writers is problematic at best.
207
208
209 * optimizations to consider
210
211 - where accessing a memory mapped xref can be considered atomic,
212 xref locks are not needed, meaning readers do not need any lock at all.
213 For 8 byte xrefs, we can easily avoid being interrupted by page faults,
214 so some architectures may support this.
215 - where the effect of writing a leaf block is atomic,
216 i.e. seen completely or not at all by any read,
217 readers do not need to use locks on leaves.
218 According to
219 > http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 Linus
220 this may hold only under very specific conditions:
221 We must have neither SMP nor the new kernel preemption enabled,
222 must have ext2 or similar filesystem, the file region must fit
223 one cache page and the userspace region must not pagefault
224 (i.e. we need a piece of locked memory).
225 We always fit a cache page, if our blocksize is
226 a power of two of at most the cache page size.
227 While 4K is a usual value for the cache page size
228 (as well as the pagesize as well as the filesystem block size),
229 we use 1K (which, under Linux, is the minimum for all of these)
230 leaf blocks to be reasonably safe.
231 - for writers, there is next to no more concurrency we could achieve.
232 The minimum about the changed masterfile structure that needs to be
233 visible is its new size, and basically we are just setting this
234 with the one write operation. With any other scheme along the lines
235 of "first extend, then lock only that area" and so on we would only
236 get a couple of additional syscalls plus integrity issues
237 with much more difficult crash recovery.
238 Especially we don't use a separate lock on the access file.
239 - one might consider delaying a sync-to-disk (fsync/msync)
240 until after the lock is released. However, at least a
241 msync using MS_ASYNC and MS_INVALIDATE should be issued.
242 (On Linux, MS_INVALIDATE does nothing, since maps are always coherent).
243 - locking the whole tree might seem pretty restrictive,
244 however, it saves a lot of syscalls and the operations on the
245 mmap should we very fast. Concurrency is affected only by
246 writes to the tree, which are fairly rare
247 (on about 3% of inserts, depending on sizes and fill level).
248 - for non mmapped tree files, however, the classical per-block
249 locking scheme could be considered to reduce worst case delays.
250 Locks on the odd leaves file bytes are reserved for this purpose.
251 Yet this involves substantial overhead on every read.
252
253
254 * considering writer starvation
255
256 In the presence of readers holding shared locks,
257 we can not expect fcntl to avoid writer starvation.
258 If at every time there is always at least one process having a
259 shared lock, a writer may wait indefinitely for an exclusive lock.
260
261
262 This can trivially be avoided by only using exclusive locks,
263 which typically will have a simple and fair FIFO implementation.
264 Clearly this sacrifices most concurrency in a situation where
265 it is obviously demanded.
266
267 A more sophisticated aproach is to guard the acquisition of a lock A,
268 whether shared or exclusive, by an additional ex "guard" lock Ag
269 in a sequence of get Ag - get A - release Ag.
270 That way no process can get A while another process is waiting for A.
271
272
273 This double locking should not be used for leaf reads,
274 because continued overlapping of shared locks on leaves means such
275 a pathological congestion that the system will croak anyway.
276
277 The record and tree locks, on the other hand, are held by readers
278 only while they are inspecting memory mapped files, which should
279 involve no disk accesses if you are going for high throughput.
280 More precisely, we should only very rarely see a process suspended for
281 any reason while holding such a shared lock,
282 and thus overlapped shared locks should be rare.
283
284
285 To summarize, in almost any case of high volume querying it is advisable to
286 set up a separate read-only copy of the database instead of increasing
287 system load by doubling the number of locks.
288
289 According to these considerations,
290 we might use the aforementioned optimizations to reduce writer delays,
291 but do not care about absolutely avoiding writer starvation.
292
293
294 * unmounting databases
295
296 Whenever a process created a new database, other processes may
297 access it by opening it on demand just as any old database.
298 There is no additional interprocess communication needed to make
299 a database available.
300
301
302 Since the shared mode is meant to be used in high volume production
303 environments, we assume any available database to remain available without
304 structural changes, and support such changes only in the exclusive server,
305 where they can be handled internally.
306
307 For completeness, however, here goes an outline of the steps that would
308 be needed to support unmounting databases in a shared environment.
309
310
311 Changes to a database, including making it unavailable,
312 are handled by one process obtaining the exclusive "options" lock.
313 Conceptually this can be regarded as lock on the options file,
314 with a process using the database holding a read lock on the options
315 and a process changing database options requiring a write lock.
316
317
318 In such an environment,
319 - any process using a database for read a/o write without changing
320 options holds the shared options lock.
321 - a process that wants to change a database must obtain the exclusive
322 options lock to mount the database in single process mode.
323 - before waiting on the exclusive options lock,
324 the process acquires guard locks
325 and notifies other processes to release their locks
326
327 For synchronization issues, actually three locks are used:
328 - the "options lock" is held shared or exclusive while the database is in use
329 - the shared or exclusive "use lock" guards any attempt to obtain
330 the options lock
331 - the exclusive "change lock" guards any attempt to obtain an exclusive
332 use and options lock
333
334 A reader (shared user)
335 - asks for the shared use lock without waiting.
336 - if this fails, another process is attempting
337 modification and the reader must close the database.
338 - else, no process must have the exclusive options lock.
339 The reader obtains a shared options lock and releases the use lock.
340
341 A writer (about to structurally change the database)
342 - first asks for the exclusive change lock without waiting.
343 - if this fails, another process is attempting modification
344 and the writer must close the database and bail out.
345 - else, no process must have the exclusive use lock.
346 Then the exclusive use lock is acquired with waiting.
347 (Since there are shared use locks held for very short times,
348 this has a slight risk of writer starvation and could,
349 for paranoia's sake, be guarded by yet another lock).
350 - finally other processes are notified and the writer
351 waits for the exclusive options lock (probably using a timeout).
352 - once done with changes, the writer releases locks in reverse order.
353
354 Notifying other processes:
355 - for the unix multi process server,
356 we have all processes share the same process group,
357 so they can be signalled by kill(0,sig).
358 On most systems we should use SIGWINCH and SIGURG,
359 because they are ignored by default (so we don't kill our tcpserver).
360 - while SIGURG will have any running request aborted,
361 processes block SIGWINCH during normal request processing
362 and receive it only when about to read a new request.
363
364 To make a database finally unavailable,
365 a process should move or remove the files while holding the exclusive locks.
366 Since locks are obtained on open files, other processes may have opened
367 the files before this and finally succeed in obtaining the options lock.
368 To finally detect this race condition,
369 a process must use stat and fstat to ensure, after getting the options lock,
370 that the file it opened is still the same it would open then.

  ViewVC Help
Powered by ViewVC 1.1.26