
Annotation of /openisis/0.9.9e/doc/MultiProcess.txt



Revision 604
Mon Dec 27 21:49:01 2004 UTC by dpavlin
File MIME type: text/plain
File size: 16451 byte(s)
import of new openisis release, 0.9.9e

Concurrency and locking in the Malete core.


* MP vs. MT

Unlike the OpenIsis versions up to 0.9, which were built for multithreading,
the Malete core is designed to run in multiple processes in parallel.

The major reason for this shift is ongoing trouble with several
MT environments:
- multithreading is moving towards compiler-supported
thread local storage (TLS). TLS routines like pthread_getspecific
or TlsGetValue are replaced by the __thread attribute on variables.
While this is nice and efficient, it is incompatible
(it changes the behaviour of libc in subtle ways).
- on Linux, the 2.2/2.4 LinuxThreads are now superseded by NPTL
in the 2.6 kernels (and some backported 2.4.20s, watch out!),
which uses compiler-supported TLS. While this offers a couple of advantages,
it is not worth the effort to support both during the transition.
- on BSD, thread support is only emerging. On MacOS X, the implementation
has yet to prove its stability. Even Solaris, with a long track record
of reliable MT support, is switching its threading model.
- on Windows, the 9x and Me versions are still in widespread use,
which lack important features to leverage the benefits of MT
(overlapped I/O and SignalObjectAndWait). While they also lack
support for synchronization of multiple read/write processes,
at least read-only multiprocessing is feasible.
- while Java can be considered by far the most stable environment
for MT programming, this only works out when writing pure Java.
Embedding MT C code via JNI is not well defined and pretty intricate,
especially where the underlying MT implementation is on the move.

On the other hand, there is some demand for MP support in certain
environments, e.g. MP PHP or a parallel server running from
wrappers like tcpserver.

Thus Malete by now focuses on MP and leaves a reworked MT version
for better times with stable and widely available TLS support.


* shared resources

The only resources shared are (regions of) files,
i.e. there is no shared memory (besides mmapped files),
queues, pipes, sockets or other resources.
In typical usage, read access is much more frequent than writing.

Therefore the means of coordinating access to shared resources
- must support multiple processes
- should support read/write locks
(i.e. shared vs. exclusive locking modes) to allow for concurrent readers
- should support locking and unlocking of regions of open files
to support concurrent writers

With some limitations to be discussed further below,
this can be implemented based on file locking.


Features we do NOT need:
- locking over network file systems means looking for trouble.
While this MAY work in simple cases, it is NOT RECOMMENDED!
If you really can't avoid accessing remote database files via NFS or SMB,
the only reasonable use of locks is to "protect" read-only access.
DO NOT even consider writing your valuable data to remote storage.
Better run a database server where the disks are.
- mandatory locking means looking for trouble,
cf. /usr/src/linux/Documentation/mandatory.txt.
DO NOT mount with mandatory locking support,
DO NOT set the mandatory locking file mode!
- deadlock detection.
Our usage of locking is deadlock free.
(Unless mandatory locking is enabled, so don't do that.)


* concurrency modes

There are three modes of concurrency to distinguish:
- read-only mode:
any number of processes holding shared locks on whole files.
Every process may read at any time, no process may write.
This is the most efficient mode for high volume query processing.
- exclusive mode:
a single process holding exclusive locks on whole files.
The process may or may not write to the files.
A simple, safe and portable mode to allow writing.
- shared mode:
multiple processes using temporary locks on regions of files.
Every process may read and write. Supported on UNIX-style systems only.

The first two modes, using locks held during process lifetime,
are trivially correct. Actually, older OpenIsis versions used such locks.
Shared mode is much more complicated and deserves more detailed inspection,
which is done in the remainder of this document.


* shared mode

First of all: try to avoid it.

For development and typical data entry use, an exclusive mode single process
server will do perfectly well. For high volume query processing,
use a separate copy of the database in read-only mode.
Malete databases are designed to be easily copied and backed up.
This is intrinsically much more efficient and reliable than combining
high read and write load. Moreover, cleanup tasks like data and tree
compaction can be safely done in an exclusive writer,
but would require undue synchronization effort in shared mode.

That said, here are the gory details.


Both the records (r) and the index entries (q) use two files each:
- a "data" (d) file
(the plain masterfile .mrd and the BTree leaves chain .mqd, resp.)
- an "access" (x) file for faster lookup
(the record xref file .mrx and BTree inner nodes .mqx, resp.)
The access and data files need to be in sync,
thus writing access needs to be synchronized to some extent.
Typically, but not necessarily, the application uses index entries
in turn as pointers for the records; see below.

Basically, all files are accessed using (unbuffered) native file IO.
The access files, however, are memory mapped where possible.


In summary, we use
- one "record lock" per database guarding any record write
- an "xref lock" for every record
- one "tree lock" per database guarding BTree inner node access
- a "leaf lock" for every BTree leaf block

To read a record:
- obtain the shared xref lock
- look up the xref
- release the xref lock
- read the record
(does not need a lock since records are immutable)

Writing a new or changed record is done by:
- acquiring the exclusive record lock
- appending to the end of the masterfile
- obtaining the exclusive xref lock
- writing to the xref file (using msync where mmapped)
- releasing the xref lock
- releasing the record lock

When searching the index
- the shared tree lock is acquired, the tree is read to find the leaf number,
and the tree lock is released
- a shared lock on the leaf block is acquired and the leaf is read.
The shared lock is released immediately after reading.
- on split detection (Lehman/Yao) successive leaf blocks are read alike

The steps to write an index entry are:
- find the target leaf block, searching as above,
but using exclusive leaf locks. On split detection,
an ex leaf lock is released only after locking and reading a successor.
The final ex lock is held until after the leaf block has been written.
- if the write involves a split, lock the block after the end
of the leaves file and double check that you really got a brand new block
(it must not be readable; otherwise, repeat)
- write the block or blocks
- if the write involved a split, obtain the exclusive tree lock.
Iteratively insert pointers to any newly created blocks.
- release all locks

To support a unique key, the record lock is held while writing / looking up
the record's key in the index and until after the record has been written.


* operating system considerations

Locking an mmapped tree file may not work on some systems.
The rationale is that accesses to mapped memory cannot be
checked against arbitrary mandatory locked regions;
cf. /usr/src/linux/Documentation/mandatory.txt.
On some systems locking mapped files is completely ruled out;
others, like Linux, deny only mandatory locks.
Solaris allows locks on the whole file only (not regions).

Since all locks are advisory, however,
there is no need to put the lock on the actual bytes affected by an operation.
- The record lock is on byte 0 of the masterfile .mrd,
and the xref on record id n (1..) locks byte n.
- The tree lock is on byte 1 of the leaves file .mqd,
and the leaf block n (0..) locks byte 2*n.

Consequently, in read-only and exclusive mode, full locks on the
data files .m?d are sufficient to prevent conflicts with other writers.
While this statement with regard to read-only and exclusive mode processes
can be considered to constitute an interface,
the details of shared mode coordination may change
(i.e. which bytes are locked in which order, see considerations below).


Since M$ Windows does not offer support for serious programming,
we are limited to the exclusive and read-only modes:

The 9x/Me family even lacks shared file region locks and locks that can
be waited upon, not to mention any reasonable means of signaling.
Memory mappings are of limited use here, since they might copy the file to swap.

The NT-based versions have a couple of the necessary calls like LockFileEx;
still, all this is pretty tedious, and the semantics of memory mappings
(CreateFileMapping) in the presence of writers are problematic at best.


* optimizations to consider

- where accessing a memory mapped xref can be considered atomic,
xref locks are not needed, meaning readers do not need any lock at all.
For 8 byte xrefs, we can easily avoid being interrupted by page faults,
so some architectures may support this.
- where the effect of writing a leaf block is atomic,
i.e. seen completely or not at all by any read,
readers do not need to use locks on leaves.
According to
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 Linus
this may hold only under very specific conditions:
We must have neither SMP nor the new kernel preemption enabled,
must have ext2 or a similar filesystem, the file region must fit
one cache page and the userspace region must not pagefault
(i.e. we need a piece of locked memory).
We always fit a cache page if our blocksize is
a power of two no larger than the cache page size.
While 4K is a usual value for the cache page size
(as well as the pagesize and the filesystem block size),
we use 1K (which, under Linux, is the minimum for all of these)
leaf blocks to be reasonably safe.
- for writers, there is next to no more concurrency we could achieve.
The minimum about the changed masterfile structure that needs to be
visible is its new size, and basically we are just setting this
with the one write operation. With any other scheme along the lines
of "first extend, then lock only that area" and so on we would only
get a couple of additional syscalls plus integrity issues
with much more difficult crash recovery.
In particular, we don't use a separate lock on the access file.
- one might consider delaying a sync-to-disk (fsync/msync)
until after the lock is released. However, at least an
msync using MS_ASYNC and MS_INVALIDATE should be issued.
(On Linux, MS_INVALIDATE does nothing, since maps are always coherent.)
- locking the whole tree might seem pretty restrictive;
however, it saves a lot of syscalls, and the operations on the
mmap should be very fast. Concurrency is affected only by
writes to the tree, which are fairly rare
(on about 3% of inserts, depending on sizes and fill level).
- for non mmapped tree files, however, the classical per-block
locking scheme could be considered to reduce worst case delays.
Locks on the odd leaves file bytes are reserved for this purpose.
Yet this involves substantial overhead on every read.


* considering writer starvation

In the presence of readers holding shared locks,
we cannot expect fcntl to avoid writer starvation.
If at any time there is always at least one process holding a
shared lock, a writer may wait indefinitely for an exclusive lock.


This can trivially be avoided by using exclusive locks only,
which typically have a simple and fair FIFO implementation.
Clearly this sacrifices most concurrency in a situation where
it is obviously demanded.

A more sophisticated approach is to guard the acquisition of a lock A,
whether shared or exclusive, by an additional exclusive "guard" lock Ag
in a sequence of get Ag - get A - release Ag.
That way no process can get A while another process is waiting for A.


This double locking should not be used for leaf reads,
because continued overlapping of shared locks on leaves indicates such
a pathological congestion that the system will croak anyway.

The record and tree locks, on the other hand, are held by readers
only while they are inspecting memory mapped files, which should
involve no disk accesses if you are going for high throughput.
More precisely, we should only very rarely see a process suspended for
any reason while holding such a shared lock,
and thus overlapped shared locks should be rare.


To summarize, in almost any case of high volume querying it is advisable to
set up a separate read-only copy of the database instead of increasing
system load by doubling the number of locks.

According to these considerations,
we might use the aforementioned optimizations to reduce writer delays,
but do not care about absolutely avoiding writer starvation.


* unmounting databases

Whenever a process creates a new database, other processes may
access it by opening it on demand just as any old database.
No additional interprocess communication is needed to make
a database available.


Since the shared mode is meant to be used in high volume production
environments, we assume any available database to remain available without
structural changes, and support such changes only in the exclusive server,
where they can be handled internally.

For completeness, however, here is an outline of the steps that would
be needed to support unmounting databases in a shared environment.


Changes to a database, including making it unavailable,
are handled by one process obtaining the exclusive "options" lock.
Conceptually this can be regarded as a lock on the options file,
with a process using the database holding a read lock on the options
and a process changing database options requiring a write lock.


In such an environment,
- any process using a database for reading and/or writing without changing
options holds the shared options lock.
- a process that wants to change a database must obtain the exclusive
options lock to mount the database in single process mode.
- before waiting on the exclusive options lock,
the process acquires guard locks
and notifies other processes to release their locks.

For synchronization purposes, actually three locks are used:
- the "options lock" is held shared or exclusive while the database is in use
- the shared or exclusive "use lock" guards any attempt to obtain
the options lock
- the exclusive "change lock" guards any attempt to obtain an exclusive
use and options lock

A reader (shared user)
- asks for the shared use lock without waiting.
- if this fails, another process is attempting
modification and the reader must close the database.
- else, no process can hold the exclusive options lock.
The reader obtains a shared options lock and releases the use lock.

A writer (about to structurally change the database)
- first asks for the exclusive change lock without waiting.
- if this fails, another process is attempting modification
and the writer must close the database and bail out.
- else, no process can hold the exclusive use lock.
The exclusive use lock is then acquired with waiting.
(Since shared use locks are held for very short times only,
this has a slight risk of writer starvation and could,
for paranoia's sake, be guarded by yet another lock.)
- finally, other processes are notified and the writer
waits for the exclusive options lock (probably using a timeout).
- once done with changes, the writer releases the locks in reverse order.

Notifying other processes:
- for the unix multi process server,
we have all processes share the same process group,
so they can be signalled by kill(0,sig).
On most systems we should use SIGWINCH and SIGURG,
because they are ignored by default (so we don't kill our tcpserver).
- while SIGURG will have any running request aborted,
processes block SIGWINCH during normal request processing
and receive it only when about to read a new request.

To make a database finally unavailable,
a process should move or remove the files while holding the exclusive locks.
Since locks are obtained on open files, other processes may have opened
the files before this and still succeed in obtaining the options lock.
To detect this race condition,
a process must use stat and fstat to ensure, after getting the options lock,
that the file it opened is still the same file the path refers to.
