
Annotation of /openisis/0.9.9e/doc/MultiProcess.txt



Revision 604
Mon Dec 27 21:49:01 2004 UTC by dpavlin
File MIME type: text/plain
File size: 16451 byte(s)
import of new openisis release, 0.9.9e

Concurrency and locking in the Malete core.


* MP vs. MT

Unlike the OpenIsis versions up to 0.9, which were built for multithreading,
the Malete core is designed to run in multiple processes in parallel.

The major reason for this shift is ongoing trouble with several
MT environments:
- multithreading is moving towards compiler-supported
thread local storage (TLS). TLS routines like pthread_getspecific
or TlsGetValue are replaced by the __thread attribute on variables.
While this is nice and efficient, it is incompatible
(it changes the behaviour of libc in subtle ways).
- on Linux, the 2.2/2.4 LinuxThreads are now superseded by NPTL
in the 2.6 kernels (and some backported 2.4.20s, watch out!),
which uses compiler-supported TLS. While this offers a couple of advantages,
it is not worth the effort to support both during the transition.
- on BSD, thread support is only emerging. On MacOS X, the implementation
has yet to prove its stability. Even Solaris, with a long track record
of reliable MT support, is switching its threading model.
- on Windows, the 9x and Me versions are still in widespread use,
which lack important features to leverage the benefits of MT
(overlapped I/O and SignalObjectAndWait). While they also lack
support for synchronization of multiple read/write processes,
at least read-only multiprocessing is feasible.
- while Java can be considered by far the most stable environment
for MT programming, this only works out when writing pure Java.
Embedding MT C code via JNI is not well defined and pretty intricate,
especially where the underlying MT implementation is on the move.

On the other hand, there is some demand for MP support in certain
environments, e.g. MP PHP or a parallel server running from
wrappers like tcpserver.

Thus Malete by now focuses on MP and leaves a reworked MT version
for better times with stable and widely available TLS support.


* shared resources

The only resources shared are (regions of) files,
i.e. there is no shared memory (besides mmapped files),
queues, pipes, sockets or other resources.
In typical usage, read access is much more frequent than writing.

Therefore the means of coordinating access to shared resources
- must support multiple processes
- should support read/write locks
(i.e. shared vs. exclusive locking modes) to allow for concurrent readers
- should support locking and unlocking of regions of open files
to support concurrent writers

With some limitations to be discussed further below,
this can be implemented based on file locking.


Features we do NOT need:
- locking over network file systems means looking for trouble.
While this MAY work in simple cases, it is NOT RECOMMENDED!
If you really can't avoid accessing remote database files via NFS or SMB,
the only reasonable use of locks is to "protect" read-only access.
DO NOT even consider writing your valuable data to remote storage.
Better run a database server where the disks are.
- mandatory locking means looking for trouble,
cf. /usr/src/linux/Documentation/mandatory.txt.
DO NOT mount with mandatory locking support,
DO NOT set the mandatory locking file mode!
- deadlock detection.
Our usage of locking is deadlock free.
(Unless mandatory locking is enabled, so don't do that.)


* concurrency modes

There are three modes of concurrency to distinguish:
- read-only mode:
any number of processes holding shared locks on whole files.
Every process may read at any time, no process may write.
This is the most efficient mode for high volume query processing.
- exclusive mode:
a single process holding exclusive locks on whole files.
The process may or may not write to the files.
A simple, safe and portable mode to allow writing.
- shared mode:
multiple processes using temporary locks on regions of files.
Every process may read and write. Supported on UNIX-style systems only.

The first two modes, using locks held during process lifetime,
are trivially correct. Actually, older OpenIsis versions used such locks.
Shared mode is much more complicated and deserves more detailed inspection,
which is done in the remainder of this document.


* shared mode

First of all: try to avoid it.

For development and typical data entry use, an exclusive mode single process
server will do perfectly well. For high volume query processing,
use a separate copy of the database in read-only mode.
Malete databases are designed to be easily copied and backed up.
This is intrinsically much more efficient and reliable than combining
high read and write load. Moreover, cleanup tasks like data and tree
compaction can be safely done in an exclusive writer,
but would require undue synchronization effort in shared mode.

That said, here are the gory details.


Both the records (r) and the index entries (q) use two files each:
- a "data" (d) file
(the plain masterfile .mrd and the BTree leaves chain .mqd, resp.)
- an "access" (x) file for faster lookup
(the record xref file .mrx and BTree inner nodes .mqx, resp.)
The access and data files need to be in sync,
thus writing access needs to be synchronized to some extent.
Typically, but not necessarily, the application uses index entries
in turn as pointers for the records; see below.

Basically, all files are accessed using (unbuffered) native file IO.
The access files, however, are memory mapped where possible.


In summary, we use
- one "record lock" per database guarding any record write
- an "xref lock" for every record
- one "tree lock" per database guarding BTree inner node access
- a "leaf lock" for every BTree leaf block

To read a record:
- obtain the shared xref lock
- look up the xref
- release the xref lock
- read the record
(does not need a lock since records are immutable)

Writing a new or changed record is done by:
- acquiring the exclusive record lock
- appending to the end of the masterfile
- obtaining the exclusive xref lock
- writing to the xref file (using msync where mmapped)
- releasing the xref lock
- releasing the record lock

When searching the index
- the shared tree lock is acquired, the tree is read to find the leaf number,
and the tree lock is released
- a shared lock on the leaf block is acquired and the leaf is read.
The shared lock is released immediately after reading.
- on split detection (Lehman/Yao) successive leaf blocks are read alike

The steps to write an index entry are:
- find the target leaf block, searching as above,
but using exclusive leaf locks. On split detection,
an ex leaf lock is released only after locking and reading a successor.
The final ex lock is held until after the leaf block has been written.
- if the write involves a split, lock the block after the end
of the leaves file and double check that you really got a brand new block
(it must not be readable; otherwise, repeat)
- write the block or blocks
- if the write involved a split, obtain the exclusive tree lock.
Iteratively insert pointers to any newly created blocks.
- release all locks

To support a unique key, the record lock is held while writing / looking up
the record's key in the index and until after the record has been written.


* operating system considerations

Locking an mmapped tree file may not work on some systems.
The rationale is that accesses to mapped memory cannot be
checked against arbitrary mandatory locked regions;
cf. /usr/src/linux/Documentation/mandatory.txt.
On some systems locking mapped files is completely ruled out;
others, like Linux, deny only mandatory locks.
Solaris allows locks on the whole file only (not regions).

Since all locks are advisory, however,
there is no need to put the lock on the actual bytes affected by an operation.
- The record lock is on byte 0 of the masterfile .mrd,
and the xref on record id n (1..) locks byte n.
- The tree lock is on byte 1 of the leaves file .mqd,
and the leaf block n (0..) locks byte 2*n.

Consequently, in read-only and exclusive mode, full locks on the
data files .m?d are sufficient to prevent conflicts with other writers.
While this statement with regard to read-only and exclusive mode processes
can be considered to constitute an interface,
the details of shared mode coordination may change
(i.e. which bytes are locked in which order, see considerations below).


Since M$ Windows does not offer support for serious programming,
we are limited to the exclusive and read-only modes:

The 9x/Me family even lacks shared file region locks and locks that can
be waited upon, not to mention any reasonable means of signaling.
Memory mappings are of limited use here, since they might copy the file to swap.

The NT-based versions have a couple of the necessary calls like LockFileEx;
still, all this is pretty tedious, and the semantics of memory mappings
(CreateFileMapping) in the presence of writers are problematic at best.


* optimizations to consider

- where accessing a memory mapped xref can be considered atomic,
xref locks are not needed, meaning readers do not need any lock at all.
For 8 byte xrefs, we can easily avoid being interrupted by page faults,
so some architectures may support this.
- where the effect of writing a leaf block is atomic,
i.e. seen completely or not at all by any read,
readers do not need to use locks on leaves.
According to
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 Linus
this may hold only under very specific conditions:
We must have neither SMP nor the new kernel preemption enabled,
must have ext2 or a similar filesystem, the file region must fit
one cache page and the userspace region must not pagefault
(i.e. we need a piece of locked memory).
We always fit a cache page if our blocksize is
a power of two no larger than the cache page size.
While 4K is a usual value for the cache page size
(as well as the pagesize and the filesystem block size),
we use 1K (which, under Linux, is the minimum for all of these)
leaf blocks to be reasonably safe.
- for writers, there is next to no more concurrency we could achieve.
The minimum about the changed masterfile structure that needs to be
visible is its new size, and basically we are just setting this
with the one write operation. With any other scheme along the lines
of "first extend, then lock only that area" and so on we would only
get a couple of additional syscalls plus integrity issues
with much more difficult crash recovery.
In particular, we don't use a separate lock on the access file.
- one might consider delaying a sync-to-disk (fsync/msync)
until after the lock is released. However, at least an
msync using MS_ASYNC and MS_INVALIDATE should be issued.
(On Linux, MS_INVALIDATE does nothing, since maps are always coherent.)
- locking the whole tree might seem pretty restrictive;
however, it saves a lot of syscalls, and the operations on the
mmap should be very fast. Concurrency is affected only by
writes to the tree, which are fairly rare
(on about 3% of inserts, depending on sizes and fill level).
- for non mmapped tree files, however, the classical per-block
locking scheme could be considered to reduce worst case delays.
Locks on the odd leaves file bytes are reserved for this purpose.
Yet this involves substantial overhead on every read.


* considering writer starvation

In the presence of readers holding shared locks,
we cannot expect fcntl to avoid writer starvation.
If at any time there is always at least one process holding a
shared lock, a writer may wait indefinitely for an exclusive lock.


This can trivially be avoided by using exclusive locks only,
which typically have a simple and fair FIFO implementation.
Clearly this sacrifices most concurrency in a situation where
it is obviously demanded.

A more sophisticated approach is to guard the acquisition of a lock A,
whether shared or exclusive, by an additional exclusive "guard" lock Ag
in a sequence of get Ag - get A - release Ag.
That way no process can get A while another process is waiting for A.


This double locking should not be used for leaf reads,
because continued overlapping of shared locks on leaves indicates such
a pathological congestion that the system will croak anyway.

The record and tree locks, on the other hand, are held by readers
only while they are inspecting memory mapped files, which should
involve no disk accesses if you are going for high throughput.
More precisely, we should only very rarely see a process suspended for
any reason while holding such a shared lock,
and thus overlapped shared locks should be rare.


To summarize, in almost any case of high volume querying it is advisable to
set up a separate read-only copy of the database instead of increasing
system load by doubling the number of locks.

According to these considerations,
we might use the aforementioned optimizations to reduce writer delays,
but do not care about absolutely avoiding writer starvation.


* unmounting databases

Whenever a process creates a new database, other processes may
access it by opening it on demand just as any old database.
No additional interprocess communication is needed to make
a database available.


Since the shared mode is meant to be used in high volume production
environments, we assume any available database to remain available without
structural changes, and support such changes only in the exclusive server,
where they can be handled internally.

For completeness, however, here is an outline of the steps that would
be needed to support unmounting databases in a shared environment.


Changes to a database, including making it unavailable,
are handled by one process obtaining the exclusive "options" lock.
Conceptually this can be regarded as a lock on the options file,
with a process using the database holding a read lock on the options
and a process changing database options requiring a write lock.


In such an environment,
- any process using a database for reading and/or writing without changing
options holds the shared options lock.
- a process that wants to change a database must obtain the exclusive
options lock to mount the database in single process mode.
- before waiting on the exclusive options lock,
the process acquires guard locks
and notifies other processes to release their locks.

For synchronization purposes, actually three locks are used:
- the "options lock" is held shared or exclusive while the database is in use
- the shared or exclusive "use lock" guards any attempt to obtain
the options lock
- the exclusive "change lock" guards any attempt to obtain an exclusive
use and options lock

A reader (shared user)
- asks for the shared use lock without waiting.
- if this fails, another process is attempting
modification and the reader must close the database.
- else, no process can hold the exclusive options lock.
The reader obtains a shared options lock and releases the use lock.

A writer (about to structurally change the database)
- first asks for the exclusive change lock without waiting.
- if this fails, another process is attempting modification
and the writer must close the database and bail out.
- else, no process can hold the exclusive use lock.
The exclusive use lock is then acquired with waiting.
(Since shared use locks are held for very short times only,
this has a slight risk of writer starvation and could,
for paranoia's sake, be guarded by yet another lock.)
- finally, other processes are notified and the writer
waits for the exclusive options lock (probably using a timeout).
- once done with changes, the writer releases the locks in reverse order.

Notifying other processes:
- for the unix multi process server,
we have all processes share the same process group,
so they can be signalled by kill(0,sig).
On most systems we should use SIGWINCH and SIGURG,
because they are ignored by default (so we don't kill our tcpserver).
- while SIGURG will have any running request aborted,
processes block SIGWINCH during normal request processing
and receive it only when about to read a new request.

To make a database finally unavailable,
a process should move or remove the files while holding the exclusive locks.
Since locks are obtained on open files, other processes may have opened
the files before this and still succeed in obtaining the options lock.
To detect this race condition,
a process must use stat and fstat to ensure, after getting the options lock,
that the file it opened is still the same file the path refers to.
