Concurrency and locking in the Malete core.


* MP vs. MT

Unlike the OpenIsis versions up to 0.9, which were built for multithreading,
the Malete core is designed to run in multiple processes in parallel.

The major reason for this shift is ongoing problems with several
MT environments:
- multithreading is in a big move towards compiler supported
  thread local storage (TLS). TLS routines like pthread_getspecific
  or TlsGetValue are being replaced by the __thread attribute on variables.
  While this is nice and efficient, it is incompatible
  (it changes the behaviour of libc in subtle ways).
- on Linux, the 2.2/2.4 LinuxThreads are now superseded by NPTL
  in the 2.6 kernels (and some backported 2.4.20s, watch out!),
  which uses compiler supported TLS. While this offers a couple of advantages,
  it is not worth the effort to support both during the transition.
- on BSD, thread support is only emerging. On MacOS X, the implementation
  has yet to prove its stability. Even Solaris, with its long track
  record of reliable MT support, is switching its threading model.
- on Windows, the 9x and Me versions are still in widespread use,
  which lack features important for leveraging the benefits of MT
  (OVERLAPPED_IO and SignalObjectAndWait). While they also lack
  support for synchronization of multiple read/write processes,
  at least read-only multiprocessing is feasible.
- while Java can be considered by far the most stable environment
  for MT programming, this only works out when writing pure Java.
  Embedding MT C code via JNI is not well defined and pretty intricate,
  especially while the underlying MT implementation is itself in transition.
|
On the other hand, there is some demand for MP support in certain
environments, such as MP PHP or a parallel server running from
wrappers like tcpserver.

Thus Malete for now focuses on MP and leaves a reworked MT version
for better times with stable and widely available TLS support.


* shared resources

The only resources shared are (regions of) files,
i.e. there is no shared memory (besides mmapped files),
queues, pipes, sockets or other resources.
In typical usage, read access is much more frequent than writing.

Therefore the means of coordinating access to shared resources
- must support multiple processes
- should support read/write locks
  (i.e. shared vs. exclusive locking modes) to allow for concurrent readers
- should support locking and unlocking of regions of open files
  to support concurrent writers

With some limitations to be discussed further below,
this can be implemented based on file locking.
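
These requirements map directly onto POSIX fcntl record locking, which
provides advisory shared/exclusive locks on byte ranges of open files
across processes. A minimal sketch (the helper name is ours, not part
of the Malete API):

```c
#include <fcntl.h>
#include <unistd.h>

/* Lock len bytes at off: type is F_RDLCK (shared), F_WRLCK (exclusive)
 * or F_UNLCK (release). Returns 0 on success, -1 on error.
 * With wait != 0, F_SETLKW blocks until the lock can be granted. */
static int region_lock( int fd, short type, off_t off, off_t len, int wait )
{
	struct flock fl;
	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = off;
	fl.l_len = len;
	return fcntl( fd, wait ? F_SETLKW : F_SETLK, &fl );
}
```

A len of 0 locks "to end of file", which covers the whole-file locks
used by the read-only and exclusive modes described below.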


Features we do NOT need:
- locking over network file systems means looking for trouble.
  While this MAY work in simple cases, it is NOT RECOMMENDED!
  If you really can't avoid accessing remote database files via NFS or SMB,
  the only reasonable use of locks is to "protect" read-only access.
  DO NOT even consider writing your valuable data to remote storage.
  Better run a database server where the disks are.
- mandatory locking means looking for trouble,
  c.f. /usr/src/linux/Documentation/mandatory.txt.
  DO NOT mount with mandatory locking support,
  DO NOT set the mandatory locking file mode!
- deadlock detection.
  Our usage of locking is deadlock free
  (unless mandatory locking is enabled, so don't do that).
|

* concurrency modes

There are three modes of concurrency to distinguish:
- read-only mode:
  any number of processes holding shared locks on whole files.
  Every process may read at any time, no process may write.
  This is the most efficient mode for high volume query processing.
- exclusive mode:
  a single process holding exclusive locks on whole files.
  The process may or may not write to the files.
  A simple, safe and portable mode to allow writing.
- shared mode:
  multiple processes using temporary locks on regions of files.
  Every process may read and write. Supported on UNIX-style systems only.

The first two modes, using locks held during process lifetime,
are trivially correct. Actually, older OpenIsis versions used such locks.
Shared mode is much more complicated and deserves more detailed inspection,
which is done in the remainder of this document.
|

* shared mode

First of all: try to avoid it.

For development and typical data entry use, an exclusive mode single process
server will do perfectly well. For high volume query processing,
use a separate copy of the database in read-only mode.
Malete databases are designed to be easily copied and backed up.
This is intrinsically much more efficient and reliable than combining
high read and write load. Moreover, cleanup tasks like data and tree
compaction can be safely done in an exclusive writer,
but would require undue synchronization effort in shared mode.

That said, here are the gory details.


Both the records (r) and the index entries (q) use two files each:
- a "data" (d) file
  (the plain masterfile .mrd and the BTree leaves chain .mqd, resp.)
- an "access" (x) file for faster lookup
  (the record xref file .mrx and the BTree inner nodes .mqx, resp.)
The access and data files need to be in sync,
thus write access needs to be synchronized to some extent.
Typically, but not necessarily, the application uses index entries
in turn as pointers to the records; see below.

Basically, all files are accessed using (unbuffered) native file IO.
The access files, however, are memory mapped where possible.


In summary, we use
- one "record lock" per database guarding any record write
- an "xref lock" for every record
- one "tree lock" per database guarding BTree inner node access
- a "leaf lock" for every BTree leaf block


To read a record:
- obtain the shared xref lock
- look up the xref
- release the xref lock
- read the record
  (this does not need a lock, since records are immutable)
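
The read path can be sketched roughly as follows. The names and the
8-byte xref layout here are illustrative assumptions, not the actual
Malete structures:

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Hypothetical xref entry: file offset and length of a record. */
typedef struct { uint32_t off; uint32_t len; } Xref;

/* Advisory lock helper: type F_RDLCK/F_WRLCK/F_UNLCK on one byte at off. */
static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Read record id (1..) into buf of size max; returns bytes read or -1. */
static ssize_t read_record( int xfd, int dfd, uint32_t id,
	char *buf, size_t max )
{
	Xref x;
	if ( lock1( xfd, F_RDLCK, id ) )  /* shared xref lock on byte id */
		return -1;
	ssize_t n = pread( xfd, &x, sizeof x, (off_t)id * sizeof x );
	lock1( xfd, F_UNLCK, id );        /* release the xref lock */
	if ( n != (ssize_t)sizeof x || x.len > max )
		return -1;
	/* no lock needed for the data read: records are immutable */
	return pread( dfd, buf, x.len, x.off );
}
```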

Writing a new or changed record is done by:
- acquiring the exclusive record lock
- appending to the end of the masterfile
- obtaining the exclusive xref lock
- writing to the xref file (using msync where mmapped)
- releasing the xref lock
- releasing the record lock
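
In code, the write path might look like the sketch below, reusing the
illustrative Xref layout from above and the lock byte assignments given
in the operating system section (record lock on byte 0):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

typedef struct { uint32_t off; uint32_t len; } Xref;  /* illustrative */

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Append a new version of record id to the masterfile (dfd),
 * then publish it by updating the xref (xfd). */
static int write_record( int xfd, int dfd, uint32_t id,
	const char *data, uint32_t len )
{
	int rc = -1;
	if ( lock1( dfd, F_WRLCK, 0 ) )  /* record lock: byte 0 of .mrd */
		return -1;
	off_t end = lseek( dfd, 0, SEEK_END );
	if ( end < 0 || (ssize_t)len != pwrite( dfd, data, len, end ) )
		goto out;
	if ( lock1( xfd, F_WRLCK, id ) ) /* exclusive xref lock: byte id */
		goto out;
	Xref x = { (uint32_t)end, len };
	if ( (ssize_t)sizeof x == pwrite( xfd, &x, sizeof x, (off_t)id * sizeof x ) )
		rc = 0;
	lock1( xfd, F_UNLCK, id );       /* release the xref lock */
out:
	lock1( dfd, F_UNLCK, 0 );        /* release the record lock */
	return rc;
}
```

Since records are only appended and the xref is updated last, a reader
never sees a half-published record.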

When searching the index,
- the shared tree lock is acquired, the tree is read to find the leaf number,
  and the tree lock is released
- a shared lock on the leaf block is acquired and the leaf is read.
  The shared lock is released immediately after reading.
- on split detection (Lehman/Yao), successive leaf blocks are read alike
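
The leaf read loop under a Lehman/Yao-style scheme can be sketched as
below. The leaf layout with a high key and right-link is hypothetical;
Malete's actual block format may differ:

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

#define BLK 1024  /* leaf block size, cf. the optimizations section */

/* Hypothetical leaf header: right-link and high key for split detection. */
typedef struct {
	uint32_t next;        /* number of the right sibling leaf, 0 if none */
	char high[32];        /* upper bound of keys in this leaf */
	char data[BLK - 36];  /* entries */
} Leaf;

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Read the leaf that must contain key, starting from the leaf number n
 * found in the tree. Moves right while the key exceeds the high key,
 * i.e. while the leaf was split after the tree lookup. */
static int read_leaf( int qfd, uint32_t n, const char *key, Leaf *out )
{
	for ( ;; ) {
		if ( lock1( qfd, F_RDLCK, 2 * (off_t)n ) ) /* leaf n locks byte 2n */
			return -1;
		ssize_t r = pread( qfd, out, sizeof *out, (off_t)n * BLK );
		lock1( qfd, F_UNLCK, 2 * (off_t)n );       /* release immediately */
		if ( r != (ssize_t)sizeof *out )
			return -1;
		if ( !out->next || strcmp( key, out->high ) <= 0 )
			return 0;   /* key is covered by this leaf */
		n = out->next;  /* split detected: follow the right-link */
	}
}
```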

The steps to write an index entry are:
- find the target leaf block, searching as above,
  but using exclusive leaf locks. On split detection,
  an exclusive leaf lock is released only after locking and reading a successor.
  The final exclusive lock is held until after the leaf block has been written.
- if the write involves a split, lock the block after the end
  of the leaves file and double check that you really got a brand new block
  (it must not be readable; else repeat)
- write the block or blocks
- if the write involved a split, obtain the exclusive tree lock.
  Iteratively insert pointers to any newly created blocks.
- release all locks
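
The "brand new block" check in the split step can be sketched like this
(a hypothetical helper; the real allocation is interleaved with the
leaf write):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

#define BLK 1024

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Grab an exclusive lock on the first block past EOF and verify it is
 * still unreadable, i.e. no competing writer allocated it meanwhile.
 * Returns the new block number (lock on byte 2n kept), or -1 on error. */
static long alloc_block( int qfd )
{
	char probe;
	for ( ;; ) {
		off_t end = lseek( qfd, 0, SEEK_END );
		if ( end < 0 )
			return -1;
		uint32_t n = (uint32_t)( end / BLK );
		if ( lock1( qfd, F_WRLCK, 2 * (off_t)n ) )
			return -1;
		/* double check: the block must not be readable */
		if ( 0 >= pread( qfd, &probe, 1, (off_t)n * BLK ) )
			return n;  /* brand new block, lock held */
		lock1( qfd, F_UNLCK, 2 * (off_t)n );  /* lost the race: repeat */
	}
}
```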

To support a unique key, the record lock is held while writing / looking up
the record's key in the index and until after the record has been written.


* operating system considerations

Locking an mmapped tree file may not work on some systems.
The rationale is that accesses to mapped memory cannot be
checked against arbitrary mandatory locked regions;
c.f. /usr/src/linux/Documentation/mandatory.txt.
On some systems locking mapped files is completely ruled out;
others, like Linux, deny only mandatory locks.
Solaris allows locks on the whole file only (not regions).

Since all locks are advisory, however,
there is no need to put the lock on the actual bytes affected by an operation.
- The record lock is on byte 0 of the masterfile .mrd,
  and the xref lock for record id n (1..) is on byte n.
- The tree lock is on byte 1 of the leaves file .mqd,
  and the leaf lock for block n (0..) is on byte 2*n.
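
These byte assignments can be captured in a small set of offset helpers
(a sketch following the layout above; the function names are ours):

```c
#include <stdint.h>

/* Lock byte offsets as assigned above; purely advisory, so the bytes
 * need not coincide with the data they guard. */
static int64_t record_lock_byte( void )      { return 0; }  /* .mrd */
static int64_t xref_lock_byte( uint32_t id ) { return id; } /* .mrd, id >= 1 */
static int64_t tree_lock_byte( void )        { return 1; }  /* .mqd */
static int64_t leaf_lock_byte( uint32_t n )  { return 2 * (int64_t)n; } /* .mqd */
```

Since record ids start at 1 and leaf locks use even bytes only, the
per-database locks never collide with the per-record and per-leaf locks.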

Consequently, in read-only and exclusive mode, full locks on the
data files .m?d are sufficient to prevent conflicts with other writers.
While this statement with regard to read-only and exclusive mode processes
can be considered to constitute an interface,
the details of shared mode coordination may change
(i.e. which bytes are locked in which order, see considerations below).


Since M$ Windows does not offer support for serious programming,
we are limited to the exclusive and read-only modes:

The 9x/Me family even lacks shared file region locks and locks that can
be waited upon, not to mention any reasonable means of signaling.
Memory mappings here are of limited use, since they might copy the file to swap.

The NT-based versions have a couple of the necessary calls like LockFileEx;
still, all this is pretty tedious, and the semantics of memory mappings
(CreateFileMapping) in the presence of writers are problematic at best.
|

* optimizations to consider

- where accessing a memory mapped xref can be considered atomic,
  xref locks are not needed, meaning readers do not need any lock at all.
  For 8 byte xrefs, we can easily avoid being interrupted by page faults,
  so some architectures may support this.
- where the effect of writing a leaf block is atomic,
  i.e. seen completely or not at all by any read,
  readers do not need to use locks on leaves.
  According to
  > http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 Linus
  this may hold only under very specific conditions:
  we must have neither SMP nor the new kernel preemption enabled,
  must have ext2 or a similar filesystem, the file region must fit
  one cache page and the userspace region must not pagefault
  (i.e. we need a piece of locked memory).
  We always fit a cache page if our blocksize is
  a power of two not exceeding the cache page size.
  While 4K is a usual value for the cache page size
  (as well as the pagesize and the filesystem block size),
  we use 1K leaf blocks (which, under Linux, is the minimum for all of these)
  to be reasonably safe.
- for writers, there is next to no additional concurrency we could achieve.
  The minimum of the changed masterfile structure that needs to be
  visible is its new size, and basically we are just setting this
  with the one write operation. With any other scheme along the lines
  of "first extend, then lock only that area" and so on, we would only
  get a couple of additional syscalls plus integrity issues
  with much more difficult crash recovery.
  In particular, we do not use a separate lock on the access file.
- one might consider delaying a sync-to-disk (fsync/msync)
  until after the lock is released. However, at least an
  msync using MS_ASYNC and MS_INVALIDATE should be issued.
  (On Linux, MS_INVALIDATE does nothing, since maps are always coherent.)
- locking the whole tree might seem pretty restrictive;
  however, it saves a lot of syscalls, and the operations on the
  mmap should be very fast. Concurrency is affected only by
  writes to the tree, which are fairly rare
  (on about 3% of inserts, depending on sizes and fill level).
- for non mmapped tree files, however, the classical per-block
  locking scheme could be considered to reduce worst case delays.
  Locks on the odd leaves file bytes are reserved for this purpose.
  Yet this involves substantial overhead on every read.


* considering writer starvation

In the presence of readers holding shared locks,
we cannot expect fcntl to avoid writer starvation:
if at every point in time there is at least one process holding a
shared lock, a writer may wait indefinitely for an exclusive lock.


This can trivially be avoided by only using exclusive locks,
which typically have a simple and fair FIFO implementation.
Clearly, this sacrifices most concurrency in a situation where
it is obviously demanded.

A more sophisticated approach is to guard the acquisition of a lock A,
whether shared or exclusive, by an additional exclusive "guard" lock Ag,
in the sequence: get Ag - get A - release Ag.
That way no process can get A while another process is waiting for A.
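
This guarded acquisition can be sketched with fcntl as follows (guard
and target on separate bytes of the same file; the byte assignments
here are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

static int lock1( int fd, short type, off_t off )
{
	struct flock fl;
	fl.l_type = type; fl.l_whence = SEEK_SET;
	fl.l_start = off; fl.l_len = 1;
	return fcntl( fd, F_SETLKW, &fl );
}

/* Acquire lock A (byte a) in mode type (F_RDLCK or F_WRLCK), guarded
 * by the exclusive lock Ag (byte g). While any process waits for A,
 * it holds Ag, so no newcomer can overtake it and starve it. */
static int guarded_lock( int fd, short type, off_t a, off_t g )
{
	if ( lock1( fd, F_WRLCK, g ) )  /* get Ag */
		return -1;
	int rc = lock1( fd, type, a );  /* get A (may wait) */
	lock1( fd, F_UNLCK, g );        /* release Ag */
	return rc;
}
```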


This double locking should not be used for leaf reads,
because continued overlapping of shared locks on leaves would indicate
such pathological congestion that the system will croak anyway.

The record and tree locks, on the other hand, are held by readers
only while they are inspecting memory mapped files, which should
involve no disk accesses if you are going for high throughput.
More precisely, we should only very rarely see a process suspended for
any reason while holding such a shared lock,
and thus overlapping shared locks should be rare.


To summarize: in almost any case of high volume querying, it is advisable to
set up a separate read-only copy of the database instead of increasing
system load by doubling the number of locks.

According to these considerations,
we might use the aforementioned optimizations to reduce writer delays,
but do not care about absolutely avoiding writer starvation.
292 |
|
293 |
|
294 |
* unmounting databases |
295 |
|
296 |
Whenever a process created a new database, other processes may |
297 |
access it by opening it on demand just as any old database. |
298 |
There is no additional interprocess communication needed to make |
299 |
a database available. |
300 |
|
301 |
|
302 |
Since the shared mode is meant to be used in high volume production |
303 |
environments, we assume any available database to remain available without |
304 |
structural changes, and support such changes only in the exclusive server, |
305 |
where they can be handled internally. |
306 |
|
307 |
For completeness, however, here goes an outline of the steps that would |
308 |
be needed to support unmounting databases in a shared environment. |
309 |
|
310 |
|
311 |
Changes to a database, including making it unavailable, |
312 |
are handled by one process obtaining the exclusive "options" lock. |
313 |
Conceptually this can be regarded as lock on the options file, |
314 |
with a process using the database holding a read lock on the options |
315 |
and a process changing database options requiring a write lock. |
316 |
|
317 |
|
318 |
In such an environment, |
319 |
- any process using a database for read a/o write without changing |
320 |
options holds the shared options lock. |
321 |
- a process that wants to change a database must obtain the exclusive |
322 |
options lock to mount the database in single process mode. |
323 |
- before waiting on the exclusive options lock, |
324 |
the process acquires guard locks |
325 |
and notifies other processes to release their locks |
326 |
|
To handle these synchronization issues, three locks are actually used:
- the "options lock" is held shared or exclusive while the database is in use
- the shared or exclusive "use lock" guards any attempt to obtain
  the options lock
- the exclusive "change lock" guards any attempt to obtain an exclusive
  use and options lock

A reader (shared user)
- asks for the shared use lock without waiting.
- if this fails, another process is attempting
  modification and the reader must close the database.
- else, no process can hold the exclusive options lock.
  The reader obtains a shared options lock and releases the use lock.

A writer (about to structurally change the database)
- first asks for the exclusive change lock without waiting.
- if this fails, another process is attempting modification
  and the writer must close the database and bail out.
- else, no process can hold the exclusive use lock.
  The exclusive use lock is then acquired with waiting.
  (Since there are shared use locks held for very short times,
  this has a slight risk of writer starvation and could,
  for paranoia's sake, be guarded by yet another lock.)
- finally, other processes are notified and the writer
  waits for the exclusive options lock (probably using a timeout).
- once done with changes, the writer releases the locks in reverse order.

Notifying other processes:
- for the unix multi process server,
  we have all processes share the same process group,
  so they can be signalled by kill(0,sig).
  On most systems we should use SIGWINCH and SIGURG,
  because they are ignored by default (so we don't kill our tcpserver).
- while SIGURG will have any running request aborted,
  processes block SIGWINCH during normal request processing
  and receive it only when about to read a new request.

To make a database finally unavailable,
a process should move or remove the files while holding the exclusive locks.
Since locks are obtained on open files, other processes may have opened
the files before this and still succeed in obtaining the options lock.
To detect this race condition,
a process must use stat and fstat to ensure, after getting the options lock,
that the file it opened is still the same file it would open now.
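
The identity check compares the device and inode of the open descriptor
against a fresh stat of the path (a sketch; error handling kept minimal):

```c
#include <sys/stat.h>

/* After obtaining the options lock, verify that the file we opened
 * (fd) is still the one the path names, i.e. it was not removed or
 * replaced meanwhile. Returns 1 if identical, 0 otherwise. */
static int same_file( int fd, const char *path )
{
	struct stat byfd, bypath;
	if ( fstat( fd, &byfd ) || stat( path, &bypath ) )
		return 0;  /* e.g. the path is gone: treat as not the same */
	return byfd.st_dev == bypath.st_dev && byfd.st_ino == bypath.st_ino;
}
```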