Concurrency and locking in the Malete core.


* MP vs. MT

Unlike the OpenIsis versions up to 0.9, which were built for multithreading,
the Malete core is designed to run in multiple processes in parallel.

The major reason for this shift is ongoing problems with several
MT environments:
- multithreading is in a big move towards compiler supported
  thread local storage (TLS). TLS routines like pthread_getspecific
  or TlsGetValue are being replaced by the __thread attribute on variables.
  While this is nice and efficient, it is incompatible
  (it changes the behaviour of libc in subtle ways).
- on Linux, the 2.2/2.4 LinuxThreads are now superseded by NPTL
  in the 2.6 kernels (and some backported 2.4.20s, watch out!),
  which uses compiler supported TLS. While this offers a couple of advantages,
  it is not worth the effort to support both during the transition.
- on BSD, thread support is only emerging. On MacOS X, the implementation
  has yet to prove its stability. Even Solaris, with its long track
  record of reliable MT support, is switching its threading model.
- on Windows, the 9x and Me versions are still in widespread use,
  which lack features important for leveraging the benefits of MT
  (OVERLAPPED_IO and SignalObjectAndWait). While they also lack
  support for synchronization of multiple read/write processes,
  at least read-only multiprocessing is feasible.
- while Java can be considered by far the most stable environment
  for MT programming, this only works out when writing pure Java.
  Embedding MT C code via JNI is not well defined and pretty intricate,
  especially while the underlying MT implementation is itself in transition.
|
On the other hand, there is some demand for MP support in certain
environments, such as MP PHP or a parallel server running from
wrappers like tcpserver.

Thus Malete for now focuses on MP and leaves a reworked MT version
for better times with stable and widely available TLS support.


* shared resources

The only resources shared are (regions of) files,
i.e. there is no shared memory (besides mmapped files),
queues, pipes, sockets or other resources.
In typical usage, read access is much more frequent than writing.

Therefore the means of coordinating access to shared resources
- must support multiple processes
- should support read/write locks
  (i.e. shared vs. exclusive locking modes) to allow for concurrent readers
- should support locking and unlocking of regions of open files
  to support concurrent writers

With some limitations to be discussed further below,
this can be implemented based on file locking.
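
These requirements map directly onto POSIX fcntl record locking, which
provides advisory shared/exclusive locks on byte ranges of open files
across processes. A minimal sketch (the helper name is ours, not part
of the Malete API):

```c
#include <fcntl.h>
#include <unistd.h>

/* Lock len bytes at off: type is F_RDLCK (shared), F_WRLCK (exclusive)
 * or F_UNLCK (release). Returns 0 on success, -1 on error.
 * With wait != 0, F_SETLKW blocks until the lock can be granted. */
static int region_lock( int fd, short type, off_t off, off_t len, int wait )
{
	struct flock fl;
	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = off;
	fl.l_len = len;
	return fcntl( fd, wait ? F_SETLKW : F_SETLK, &fl );
}
```

A len of 0 locks "to end of file", which covers the whole-file locks
used by the read-only and exclusive modes described below.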


Features we do NOT need:
- locking over network file systems means looking for trouble.
  While this MAY work in simple cases, it is NOT RECOMMENDED!
  If you really can't avoid accessing remote database files via NFS or SMB,
  the only reasonable use of locks is to "protect" read-only access.
  DO NOT even consider writing your valuable data to remote storage.
  Better run a database server where the disks are.
- mandatory locking means looking for trouble,
  c.f. /usr/src/linux/Documentation/mandatory.txt.
  DO NOT mount with mandatory locking support,
  DO NOT set the mandatory locking file mode!
- deadlock detection.
  Our usage of locking is deadlock free
  (unless mandatory locking is enabled, so don't do that).
|

* concurrency modes

There are three modes of concurrency to distinguish:
- read-only mode:
  any number of processes holding shared locks on whole files.
  Every process may read at any time, no process may write.
  This is the most efficient mode for high volume query processing.
- exclusive mode:
  a single process holding exclusive locks on whole files.
  The process may or may not write to the files.
  A simple, safe and portable mode to allow writing.
- shared mode:
  multiple processes using temporary locks on regions of files.
  Every process may read and write. Supported on UNIX-style systems only.

The first two modes, using locks held during process lifetime,
are trivially correct. Actually, older OpenIsis versions used such locks.
Shared mode is much more complicated and deserves more detailed inspection,
which is done in the remainder of this document.
|

* shared mode

First of all: try to avoid it.

For development and typical data entry use, an exclusive mode single process
server will do perfectly well. For high volume query processing,
use a separate copy of the database in read-only mode.
Malete databases are designed to be easily copied and backed up.
This is intrinsically much more efficient and reliable than combining
high read and write load. Moreover, cleanup tasks like data and tree
compaction can be safely done in an exclusive writer,
but would require undue synchronization effort in shared mode.

That said, here are the gory details.


Both the records (r) and the index entries (q) use two files each:
- a "data" (d) file
  (the plain masterfile .mrd and the BTree leaves chain .mqd, resp.)
- an "access" (x) file for faster lookup
  (the record xref file .mrx and the BTree inner nodes .mqx, resp.)
The access and data files need to be in sync,
thus write access needs to be synchronized to some extent.
Typically, but not necessarily, the application uses index entries
in turn as pointers to the records; see below.

Basically, all files are accessed using (unbuffered) native file IO.
The access files, however, are memory mapped where possible.


In summary, we use
- one "record lock" per database guarding any record write
- an "xref lock" for every record
- one "tree lock" per database guarding BTree inner node access
- a "leaf lock" for every BTree leaf block


To read a record:
- obtain the shared xref lock
- look up the xref
- release the xref lock
- read the record
  (this does not need a lock, since records are immutable)
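
The read path can be sketched roughly as follows. The names and the
8-byte xref layout here are illustrative assumptions, not the actual
Malete structures:

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Hypothetical xref entry: file offset and length of a record. */
typedef struct { uint32_t off; uint32_t len; } Xref;

/* Advisory lock helper: type F_RDLCK/F_WRLCK/F_UNLCK on one byte at off. */
static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Read record id (1..) into buf of size max; returns bytes read or -1. */
static ssize_t read_record( int xfd, int dfd, uint32_t id,
	char *buf, size_t max )
{
	Xref x;
	if ( lock1( xfd, F_RDLCK, id ) )  /* shared xref lock on byte id */
		return -1;
	ssize_t n = pread( xfd, &x, sizeof x, (off_t)id * sizeof x );
	lock1( xfd, F_UNLCK, id );        /* release the xref lock */
	if ( n != (ssize_t)sizeof x || x.len > max )
		return -1;
	/* no lock needed for the data read: records are immutable */
	return pread( dfd, buf, x.len, x.off );
}
```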

Writing a new or changed record is done by:
- acquiring the exclusive record lock
- appending to the end of the masterfile
- obtaining the exclusive xref lock
- writing to the xref file (using msync where mmapped)
- releasing the xref lock
- releasing the record lock
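
In code, the write path might look like the sketch below, reusing the
illustrative Xref layout from above and the lock byte assignments given
in the operating system section (record lock on byte 0):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

typedef struct { uint32_t off; uint32_t len; } Xref;  /* illustrative */

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Append a new version of record id to the masterfile (dfd),
 * then publish it by updating the xref (xfd). */
static int write_record( int xfd, int dfd, uint32_t id,
	const char *data, uint32_t len )
{
	int rc = -1;
	if ( lock1( dfd, F_WRLCK, 0 ) )  /* record lock: byte 0 of .mrd */
		return -1;
	off_t end = lseek( dfd, 0, SEEK_END );
	if ( end < 0 || (ssize_t)len != pwrite( dfd, data, len, end ) )
		goto out;
	if ( lock1( xfd, F_WRLCK, id ) ) /* exclusive xref lock: byte id */
		goto out;
	Xref x = { (uint32_t)end, len };
	if ( (ssize_t)sizeof x == pwrite( xfd, &x, sizeof x, (off_t)id * sizeof x ) )
		rc = 0;
	lock1( xfd, F_UNLCK, id );       /* release the xref lock */
out:
	lock1( dfd, F_UNLCK, 0 );        /* release the record lock */
	return rc;
}
```

Since records are only appended and the xref is updated last, a reader
never sees a half-published record.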

When searching the index,
- the shared tree lock is acquired, the tree is read to find the leaf number,
  and the tree lock is released
- a shared lock on the leaf block is acquired and the leaf is read.
  The shared lock is released immediately after reading.
- on split detection (Lehman/Yao), successive leaf blocks are read alike
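
The leaf read loop under a Lehman/Yao-style scheme can be sketched as
below. The leaf layout with a high key and right-link is hypothetical;
Malete's actual block format may differ:

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

#define BLK 1024  /* leaf block size, cf. the optimizations section */

/* Hypothetical leaf header: right-link and high key for split detection. */
typedef struct {
	uint32_t next;        /* number of the right sibling leaf, 0 if none */
	char high[32];        /* upper bound of keys in this leaf */
	char data[BLK - 36];  /* entries */
} Leaf;

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Read the leaf that must contain key, starting from the leaf number n
 * found in the tree. Moves right while the key exceeds the high key,
 * i.e. while the leaf was split after the tree lookup. */
static int read_leaf( int qfd, uint32_t n, const char *key, Leaf *out )
{
	for ( ;; ) {
		if ( lock1( qfd, F_RDLCK, 2 * (off_t)n ) ) /* leaf n locks byte 2n */
			return -1;
		ssize_t r = pread( qfd, out, sizeof *out, (off_t)n * BLK );
		lock1( qfd, F_UNLCK, 2 * (off_t)n );       /* release immediately */
		if ( r != (ssize_t)sizeof *out )
			return -1;
		if ( !out->next || strcmp( key, out->high ) <= 0 )
			return 0;   /* key is covered by this leaf */
		n = out->next;  /* split detected: follow the right-link */
	}
}
```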

The steps to write an index entry are:
- find the target leaf block, searching as above,
  but using exclusive leaf locks. On split detection,
  an exclusive leaf lock is released only after locking and reading a successor.
  The final exclusive lock is held until after the leaf block has been written.
- if the write involves a split, lock the block after the end
  of the leaves file and double check that you really got a brand new block
  (it must not be readable; else repeat)
- write the block or blocks
- if the write involved a split, obtain the exclusive tree lock.
  Iteratively insert pointers to any newly created blocks.
- release all locks
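
The "brand new block" check in the split step can be sketched like this
(a hypothetical helper; the real allocation is interleaved with the
leaf write):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

#define BLK 1024

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Grab an exclusive lock on the first block past EOF and verify it is
 * still unreadable, i.e. no competing writer allocated it meanwhile.
 * Returns the new block number (lock on byte 2n kept), or -1 on error. */
static long alloc_block( int qfd )
{
	char probe;
	for ( ;; ) {
		off_t end = lseek( qfd, 0, SEEK_END );
		if ( end < 0 )
			return -1;
		uint32_t n = (uint32_t)( end / BLK );
		if ( lock1( qfd, F_WRLCK, 2 * (off_t)n ) )
			return -1;
		/* double check: the block must not be readable */
		if ( 0 >= pread( qfd, &probe, 1, (off_t)n * BLK ) )
			return n;  /* brand new block, lock held */
		lock1( qfd, F_UNLCK, 2 * (off_t)n );  /* lost the race: repeat */
	}
}
```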

To support a unique key, the record lock is held while writing / looking up
the record's key in the index and until after the record has been written.


* operating system considerations

Locking an mmapped tree file may not work on some systems.
The rationale is that accesses to mapped memory cannot be
checked against arbitrary mandatory locked regions;
c.f. /usr/src/linux/Documentation/mandatory.txt.
On some systems locking mapped files is completely ruled out;
others, like Linux, deny only mandatory locks.
Solaris allows locks on the whole file only (not regions).

Since all locks are advisory, however,
there is no need to put the lock on the actual bytes affected by an operation.
- The record lock is on byte 0 of the masterfile .mrd,
  and the xref lock for record id n (1..) is on byte n.
- The tree lock is on byte 1 of the leaves file .mqd,
  and the leaf lock for block n (0..) is on byte 2*n.
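
These byte assignments can be captured in a small set of offset helpers
(a sketch following the layout above; the function names are ours):

```c
#include <stdint.h>

/* Lock byte offsets as assigned above; purely advisory, so the bytes
 * need not coincide with the data they guard. */
static int64_t record_lock_byte( void )      { return 0; }  /* .mrd */
static int64_t xref_lock_byte( uint32_t id ) { return id; } /* .mrd, id >= 1 */
static int64_t tree_lock_byte( void )        { return 1; }  /* .mqd */
static int64_t leaf_lock_byte( uint32_t n )  { return 2 * (int64_t)n; } /* .mqd */
```

Since record ids start at 1 and leaf locks use even bytes only, the
per-database locks never collide with the per-record and per-leaf locks.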

Consequently, in read-only and exclusive mode, full locks on the
data files .m?d are sufficient to prevent conflicts with other writers.
While this statement with regard to read-only and exclusive mode processes
can be considered to constitute an interface,
the details of shared mode coordination may change
(i.e. which bytes are locked in which order, see considerations below).


Since M$ Windows does not offer support for serious programming,
we are limited to the exclusive and read-only modes:

The 9x/Me family even lacks shared file region locks and locks that can
be waited upon, not to mention any reasonable means of signaling.
Memory mappings here are of limited use, since they might copy the file to swap.

The NT-based versions have a couple of the necessary calls like LockFileEx;
still, all this is pretty tedious, and the semantics of memory mappings
(CreateFileMapping) in the presence of writers are problematic at best.
|

* optimizations to consider

- where accessing a memory mapped xref can be considered atomic,
  xref locks are not needed, meaning readers do not need any lock at all.
  For 8 byte xrefs, we can easily avoid being interrupted by page faults,
  so some architectures may support this.
- where the effect of writing a leaf block is atomic,
  i.e. seen completely or not at all by any read,
  readers do not need to use locks on leaves.
  According to
  > http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 Linus
  this may hold only under very specific conditions:
  we must have neither SMP nor the new kernel preemption enabled,
  must have ext2 or a similar filesystem, the file region must fit
  one cache page and the userspace region must not pagefault
  (i.e. we need a piece of locked memory).
  We always fit a cache page if our blocksize is
  a power of two not exceeding the cache page size.
  While 4K is a usual value for the cache page size
  (as well as the pagesize and the filesystem block size),
  we use 1K leaf blocks (which, under Linux, is the minimum for all of these)
  to be reasonably safe.
- for writers, there is next to no additional concurrency we could achieve.
  The minimum of the changed masterfile structure that needs to be
  visible is its new size, and basically we are just setting this
  with the one write operation. With any other scheme along the lines
  of "first extend, then lock only that area" and so on, we would only
  get a couple of additional syscalls plus integrity issues
  with much more difficult crash recovery.
  In particular, we do not use a separate lock on the access file.
- one might consider delaying a sync-to-disk (fsync/msync)
  until after the lock is released. However, at least an
  msync using MS_ASYNC and MS_INVALIDATE should be issued.
  (On Linux, MS_INVALIDATE does nothing, since maps are always coherent.)
- locking the whole tree might seem pretty restrictive;
  however, it saves a lot of syscalls, and the operations on the
  mmap should be very fast. Concurrency is affected only by
  writes to the tree, which are fairly rare
  (on about 3% of inserts, depending on sizes and fill level).
- for non mmapped tree files, however, the classical per-block
  locking scheme could be considered to reduce worst case delays.
  Locks on the odd leaves file bytes are reserved for this purpose.
  Yet this involves substantial overhead on every read.


* considering writer starvation

In the presence of readers holding shared locks,
we cannot expect fcntl to avoid writer starvation:
if at every point in time there is at least one process holding a
shared lock, a writer may wait indefinitely for an exclusive lock.


This can trivially be avoided by only using exclusive locks,
which typically have a simple and fair FIFO implementation.
Clearly, this sacrifices most concurrency in a situation where
it is obviously demanded.

A more sophisticated approach is to guard the acquisition of a lock A,
whether shared or exclusive, by an additional exclusive "guard" lock Ag,
in the sequence: get Ag - get A - release Ag.
That way no process can get A while another process is waiting for A.
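
This guarded acquisition can be sketched with fcntl as follows (guard
and target on separate bytes of the same file; the byte assignments
here are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Acquire lock A (byte a) in mode type (F_RDLCK or F_WRLCK), guarded
 * by the exclusive lock Ag (byte g). While any process waits for A,
 * it holds Ag, so no newcomer can overtake it and starve it. */
static int guarded_lock( int fd, short type, off_t a, off_t g )
{
	if ( lock1( fd, F_WRLCK, g ) )  /* get Ag */
		return -1;
	int rc = lock1( fd, type, a );  /* get A (may wait) */
	lock1( fd, F_UNLCK, g );        /* release Ag */
	return rc;
}
```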


This double locking should not be used for leaf reads,
because continued overlapping of shared locks on leaves would indicate
such pathological congestion that the system will croak anyway.

The record and tree locks, on the other hand, are held by readers
only while they are inspecting memory mapped files, which should
involve no disk accesses if you are going for high throughput.
More precisely, we should only very rarely see a process suspended for
any reason while holding such a shared lock,
and thus overlapping shared locks should be rare.


To summarize: in almost any case of high volume querying, it is advisable to
set up a separate read-only copy of the database instead of increasing
system load by doubling the number of locks.

According to these considerations,
we might use the aforementioned optimizations to reduce writer delays,
but do not care about absolutely avoiding writer starvation.
292 |
|
293 |
|
294 |
* unmounting databases |
295 |
|
296 |
Whenever a process created a new database, other processes may |
297 |
access it by opening it on demand just as any old database. |
298 |
There is no additional interprocess communication needed to make |
299 |
a database available. |
300 |
|
301 |
|
302 |
Since the shared mode is meant to be used in high volume production |
303 |
environments, we assume any available database to remain available without |
304 |
structural changes, and support such changes only in the exclusive server, |
305 |
where they can be handled internally. |
306 |
|
307 |
For completeness, however, here goes an outline of the steps that would |
308 |
be needed to support unmounting databases in a shared environment. |
309 |
|
310 |
|
311 |
Changes to a database, including making it unavailable, |
312 |
are handled by one process obtaining the exclusive "options" lock. |
313 |
Conceptually this can be regarded as lock on the options file, |
314 |
with a process using the database holding a read lock on the options |
315 |
and a process changing database options requiring a write lock. |
316 |
|
317 |
|
318 |
In such an environment, |
319 |
- any process using a database for read a/o write without changing |
320 |
options holds the shared options lock. |
321 |
- a process that wants to change a database must obtain the exclusive |
322 |
options lock to mount the database in single process mode. |
323 |
- before waiting on the exclusive options lock, |
324 |
the process acquires guard locks |
325 |
and notifies other processes to release their locks |
326 |
|
To handle these synchronization issues, three locks are actually used:
- the "options lock" is held shared or exclusive while the database is in use
- the shared or exclusive "use lock" guards any attempt to obtain
  the options lock
- the exclusive "change lock" guards any attempt to obtain an exclusive
  use and options lock

A reader (shared user)
- asks for the shared use lock without waiting.
- if this fails, another process is attempting
  modification and the reader must close the database.
- else, no process can hold the exclusive options lock.
  The reader obtains a shared options lock and releases the use lock.

A writer (about to structurally change the database)
- first asks for the exclusive change lock without waiting.
- if this fails, another process is attempting modification
  and the writer must close the database and bail out.
- else, no process can hold the exclusive use lock.
  The exclusive use lock is then acquired with waiting.
  (Since there are shared use locks held for very short times,
  this has a slight risk of writer starvation and could,
  for paranoia's sake, be guarded by yet another lock.)
- finally, other processes are notified and the writer
  waits for the exclusive options lock (probably using a timeout).
- once done with changes, the writer releases the locks in reverse order.

Notifying other processes:
- for the unix multi process server,
  we have all processes share the same process group,
  so they can be signalled by kill(0,sig).
  On most systems we should use SIGWINCH and SIGURG,
  because they are ignored by default (so we don't kill our tcpserver).
- while SIGURG will have any running request aborted,
  processes block SIGWINCH during normal request processing
  and receive it only when about to read a new request.

To make a database finally unavailable,
a process should move or remove the files while holding the exclusive locks.
Since locks are obtained on open files, other processes may have opened
the files before this and still succeed in obtaining the options lock.
To detect this race condition,
a process must use stat and fstat to ensure, after getting the options lock,
that the file it opened is still the same file it would open now.
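
The identity check compares the device and inode of the open descriptor
against a fresh stat of the path (a sketch; error handling kept minimal):

```c
#include <sys/stat.h>

/* After obtaining the options lock, verify that the file we opened
 * (fd) is still the one the path names, i.e. it was not removed or
 * replaced meanwhile. Returns 1 if identical, 0 otherwise. */
static int same_file( int fd, const char *path )
{
	struct stat byfd, bypath;
	if ( fstat( fd, &byfd ) || stat( path, &bypath ) )
		return 0;  /* e.g. the path is gone: treat as not the same */
	return byfd.st_dev == bypath.st_dev && byfd.st_ino == bypath.st_ino;
}
```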