1 |
;man malete 5 $Date: 2004/11/12 11:18:23 $ |
2 |
Malete file formats |
3 |
|
4 |
|
5 |
* meta data |
6 |
|
7 |
The Malete options (record 0) file .m0d (with a zero; .mod used otherwise), |
8 |
if present, contains a single serialized record containing metadata, |
9 |
including |
10 |
> RecStruct field definitions |
11 |
|
12 |
Named |
13 |
> CharSet collation info |
14 |
is compiled on demand to byte-order dependent .mcx files with |
15 |
magic 'mcx' or 'MCX' for little-/big-endian, resp. |
16 |
|
17 |
|
18 |
* record data |
19 |
|
20 |
The Malete record data file .mrd, a.k.a masterfile, is a plain text file. |
21 |
Lines are terminated by the ASCII newline character with byte value 10. |
22 |
Any superset of ASCII - including UTF-8 - may be used as character encoding. |
23 |
In other words, byte values 0-127 are always interpreted according to ASCII, |
24 |
values 128-255 have no special meaning. |
25 |
More precisely, the byte values 10 (newline), 9 (tabulator), |
26 |
45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W') |
27 |
have structural significance. |
28 |
|
29 |
|
30 |
Logically, the masterfile is a sequence of records, |
31 |
where every record is terminated by an empty line. |
32 |
A record is substructured as a sequence of non-empty lines, |
33 |
each terminated by a newline. |
34 |
|
35 |
Field lines are substructured as a number |
36 |
(optional minus followed by digits) followed by a tabulator |
37 |
followed by any characters but a newline as field values. |
38 |
If the first line of a record does not start with a number, |
39 |
it is used as header line. All other lines must be field lines. |
40 |
|
41 |
|
42 |
As of now, there is only one valid header line format defined, |
43 |
consisting of the letter 'W', followed by a tabulator and a |
44 |
data record header. The data record header is of the form |
45 |
'rid[@pos][*TAB*leader]'. Here |
46 |
- rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits) |
47 |
- if @pos is present, pos must be the position of the previous version |
48 |
of this record as byte offset from beginning of file in ASCII digits. |
49 |
Malete software supports retrieval of old versions based on this. |
50 |
- the optional leader is arbitrary leader data for the record. |
51 |
It can hold a MARC leader or a unique key ("name") for the record. |
52 |
|
53 |
If a record has no header line, its record id is one plus the highest |
54 |
record id used up to that point in the file. |
55 |
|
56 |
Using header lines, records can represent updated versions of |
57 |
previously stored records. There is no way to delete records, |
58 |
the operation next to it is to write an empty record. |
59 |
|
60 |
|
61 |
During normal operation, data in the masterfile is never updated, |
62 |
all writes are appends. Thus live backups and replications can be |
63 |
based on 'tail -f' writing to a tape or piped through netcat. |
64 |
A consistent snapshot can be made by recording the current file size |
65 |
and only accessing record versions at lower positions. |
66 |
|
67 |
|
68 |
Compacting a masterfile by purging old versions is straightforward, |
69 |
but considered an offline operation |
70 |
(basically printing current versions to a new masterfile). |
71 |
Compacting a mounted database is feasible for the exclusive mode server, |
72 |
but currently not supported. |
73 |
|
74 |
|
75 |
The masterfile is a special case of the serialized |
76 |
> Protocol |
77 |
, see there for details. |
78 |
The suggested handling of malformed records should not be relied upon. |
79 |
|
80 |
|
81 |
* record access |
82 |
|
83 |
The Malete record access file .mrx, a.k.a cross reference (xref) file, |
84 |
is a binary file dependent on the platform byte order and pagesize. |
85 |
Since the xref file is accessed by memory mapping, |
86 |
its size must be a multiple of the pagesize |
87 |
(which is 4 KB for most Intel based systems). |
88 |
Where the xref is found to be missing, have the wrong byte order or size |
89 |
or be otherwise invalid, it is rebuilt from the masterfile. |
90 |
|
91 |
|
92 |
The xref is organized in units of a fixed size ranging from 8 to 16 bytes. |
93 |
The unit at offset rid*size is a pointer to the current version of record |
94 |
with id rid in the masterfile (rid starting from 1). |
95 |
|
96 |
This pointer consists of |
97 |
- 4 to 8 bytes for the record's position |
98 |
(including any header line) |
99 |
- 3 to 6 bytes for the record's length |
100 |
(including the terminating empty line) |
101 |
- 0 to 2 bytes for the record's number of fields |
102 |
(including 1 header field, even if it's empty) |
103 |
When not compiled with support for large files, |
104 |
only the combinations 4,3,1 and 4,4,0 are implemented. |
105 |
|
106 |
If a record's number of fields exceeds the representable range, |
107 |
the corresponding pointer bytes (if any) are set to 0 and the number |
108 |
of field must be determined while reading the record. |
109 |
This number is also 0 for "deleted" records (no fields, empty leader). |
110 |
|
111 |
|
112 |
The first such unit (at mrx offset 0) consists of |
113 |
- 3 bytes magic number, which is "MRX" for big endian, "mrx" else |
114 |
- 1 byte type, computed as |
115 |
(bytes for pos-4)*16+(bytes for length-3)*4+(bytes for fields) |
116 |
- a 4 byte integer in native byte order holding the max used rid |
117 |
- if the unit size is 10 or greater (i.e. with largefile support), |
118 |
the following 2 (for 10,11) or 4 bytes are an integer in native byte order |
119 |
holding high order bytes of the max record id. |
120 |
|
121 |
|
122 |
* query data and access |
123 |
|
124 |
The malete query data .mqd and query access .mqx files hold |
125 |
the leaves and forks (inner nodes) of a B-Link-Tree, resp., |
126 |
which is usually used as index associating keys generated from masterfile |
127 |
records with pointers to such records. |
128 |
|
129 |
|
130 |
Both files are organized in fixed size blocks containing binary integer numbers |
131 |
and arbitrary key strings (encoded according to the collation configuration). |
132 |
Blocks contain |
133 |
- 16 bytes of "header" data |
134 |
- a "dictionary", |
135 |
which is an array of 4 byte units describing entries in the block |
136 |
with position, length of key and number of values. |
137 |
- a "stack" holding the entries, |
138 |
growing downwards from the end of the block. |
139 |
For dictionary slots d[0]..d[n], describing entries of size s0..sn, |
140 |
entry i occupies si consecutive bytes starting at (block size)-(s0+...+si). |
141 |
|
142 |
The header contains 8 unsigned binary numbers: |
143 |
- num 4 bytes block number |
144 |
- typ 1 byte block type (bitmask; see below) |
145 |
- ksz 1 byte maximum key length |
146 |
(default 0 treated like the maximum 255; CDS/ISIS uses 30) |
147 |
- ptr 1 byte pointer type (bitmask; see below) |
148 |
- lev 1 byte level over bottom (0 for leaves) |
149 |
- nxt 4 bytes number of right sibling block (0 if none) |
150 |
- ent 2 bytes number of entries (length of dictionary) |
151 |
- stk 2 bytes offset of 1st byte used by the stack |
152 |
The first 5 numbers are actually redundant since the block number |
153 |
must match the blocks file position, the three configurable bytes |
154 |
must be the same among all leaves and forks, resp., |
155 |
and the level is not really needed. |
156 |
However, wasting this 8 bytes serves as a minimal check |
157 |
and makes handling much easier. |
158 |
|
159 |
In leaf blocks, the byte order of all header numbers is little endian |
160 |
(so big endian machines have to swap bytes when reading/writing leaf blocks). |
161 |
In fork blocks, the header numbers are in native byte order |
162 |
(since they are usually not accessed by explicit reading and writing, |
163 |
but through memory mapping). |
164 |
|
165 |
|
166 |
The block type is: |
167 |
- upper 2 bits 0xC0 give the basic block type: |
168 |
0x00 for a standard leaf block, |
169 |
0x40 and 0x80 for little and big endian forks, |
170 |
0xC0 for leaves with unstructured values. |
171 |
- next two bits are clear and reserved for future extensions; |
172 |
should software see such a bit set, |
173 |
it must not assume anything about the index structure. |
174 |
- the bit 0x08, if set, will indicate that the index is compressed. |
175 |
Each key is stored as one byte giving the length |
176 |
of the common prefix to its predecessor followed by changed bytes. |
177 |
This is not yet supported and current software will refuse to |
178 |
access such an index. |
179 |
Also do not confuse this with compressing each key individually based on |
180 |
> CharSet collation recoding |
181 |
- lowest 3 bits 0x07 give the block size. |
182 |
For leaves, this is freely configurable as bitcount-9, |
183 |
from 0 for 512 (2^9) bits to 4 for 8KB (2^13). |
184 |
For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16), |
185 |
and must match the system's pagesize (should there be a system with |
186 |
a pagesize of less than 4 KB, 4KB is used). |
187 |
|
188 |
The pointer type describes the structure of leaf values: |
189 |
- upper two bits 0xC0 give the number of bytes to hold the tag |
190 |
from 0 (0x00) to 2 (0x80). |
191 |
- next two bits 0x30 give the number of bytes to hold the record id - 3, |
192 |
from 0x00 for 3 bytes to 0x30 for 6 bytes |
193 |
- the bit 0x08, if set, indicates that the tag is stored after |
194 |
the record id (as in CDS/ISIS), else the tag bytes precede the record id |
195 |
- lowest 3 bits 0x07 give the number of bytes to hold the position |
196 |
(occurence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes. |
197 |
For a value of 5 or 6, we should assume 4 byte position and one or two |
198 |
additional bytes for the record id (currently not supported). |
199 |
Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes. |
200 |
Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos) |
201 |
is the same as used by CDS/ISIS. |
202 |
All numbers in the pointer are stored in big endian byte order |
203 |
(most significant byte first) for lexical sorting. |
204 |
|
205 |
|
206 |
The dictionary contains 4 byte units, describing the position pos of an entry, |
207 |
its number of values vln and the length of its key kln, |
208 |
which is stored in the 4th byte (0 here is actually the empty key, |
209 |
which is always the first in the leftmost block of every fork level). |
210 |
In a fork block, the first 2 bytes store the pos in native byte order, |
211 |
and the 3rd byte is vln. |
212 |
In a leaf block (max size 8KB), the pos has 13bit and the vln 11bit. |
213 |
The first 3 bytes in a dictionary unit are, independent of platform, |
214 |
0: pos mod 256 (lower 8 bits), |
215 |
1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits) |
216 |
and 2: vln mod 256 (lower 8 bits). |
217 |
|
218 |
|
219 |
Entries in the stack always start at pos with kln bytes holding the key |
220 |
(as the actual key bytes, or, in a future version, compressed). |
221 |
|
222 |
In a leaf block, after the key there are vln values sorted in increasing order |
223 |
as of memcmp. Values have fixed size and structure as described by the |
224 |
pointer type (typically 8 byte each). |
225 |
Should the leaf block type be unstructured values, ptr is the actual |
226 |
value size and no assumptions are made about the structure of values |
227 |
(to be used independent of a masterfile as general purpose B-Tree). |
228 |
|
229 |
A special value of all 0 bytes is used for stopwords; |
230 |
no other value will be associated with the key. |
231 |
For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0 |
232 |
will be the first and is reserved to store unique keys: |
233 |
at most one pointer to tag 0 will be associated with a key. |
234 |
|
235 |
A leaf key currently always has at least one value; |
236 |
should this key-value association be deleted, the entry is deleted completely. |
237 |
(With index compression vln 0 will denote a stopword). |
238 |
|
239 |
Should a key be associated with more values than fit within a block, |
240 |
the following block starts with an entry with the same key |
241 |
and next higher value. |
242 |
(With index compression, we might consider using the empty key there; |
243 |
to be defined). |
244 |
|
245 |
|
246 |
In a fork block, currently (no index compression), only vln 0 and 1 |
247 |
are used. With vln 0, the key is directly followed by a 4 byte native |
248 |
child block number (which is a leaf block number, if the block's level is 1, |
249 |
else a fork block number). With vln not 0, they key is followed by |
250 |
vln pairs of size (value size + 4), containing value bytes as in the leaves |
251 |
followed by the 4 byte child block number. This is used where a key |
252 |
spans leave blocks: we have a fork entry with vln 0 pointing to the key's |
253 |
first block, followed by an entry with the same key plus the starting value |
254 |
of the next block and so on. |
255 |
(With index compression, we will use one entry with multiple values). |
256 |
|
257 |
|
258 |
Fork block number 0 is always the root, and leaf block number 0 |
259 |
is always the leftmost leaf. All keys and values can be looped in order |
260 |
by starting from leaf 0 and following the nxt pointers. |
261 |
|
262 |
Note that the layout of leaf blocks is fully platform independent: |
263 |
record pointers are organized big endian as in CDS/ISIS, |
264 |
header numbers are little endian, and the dictionary is defined per byte. |
265 |
|
266 |
|
267 |
* shared B-L-Tree access |
268 |
|
269 |
A process accessing the index in exclusive mode may actually shift |
270 |
keys and values around according to the specified layout |
271 |
as it seems fit to minimize unused space in blocks. |
272 |
|
273 |
With shared access, changes are very limited: |
274 |
- Deleting a key-value association is done in the leaf block only, |
275 |
there are no updates to any fork or other key block. |
276 |
- The same holds where an inserted key-value association fits within |
277 |
the target leaf block. |
278 |
- The only case where data is moved between blocks is when a new pair |
279 |
does not fit in its target block: a new block is allocated to become |
280 |
the new right sibling of the target block and as many entries are moved |
281 |
from the block's end to its new sibling so that the new pair can be |
282 |
inserted in one of the blocks. On such a block split, a new entry |
283 |
is made in the block's parent fork (which might trigger a split there). |
284 |
- Should a process find that the key expected in a block is greater |
285 |
than the greatest key there, it must assume that a block split |
286 |
occurred and follow the nxt link to inspect the block's successor. |
287 |
(That's why it's called B-Link-Tree; invented by Lehmann and Yao). |
288 |
Actually, as we use full fork file locking, this can only happen |
289 |
in leave blocks. |
290 |
|
291 |
|
292 |
* limits according to file formats |
293 |
|
294 |
The masterfile obviously has no limits but filesize. |
295 |
To break the custom 2GB (32bit signed) barrier there are two approaches, |
296 |
compiling with large file support and splitting files, |
297 |
discussed further below. |
298 |
|
299 |
|
300 |
In general, most 4 and 8 byte numbers are assumed signed entities, |
301 |
since several system calls handle them that way. |
302 |
Any 2^x here is to be understood as 2^x-1. |
303 |
|
304 |
|
305 |
The xref in the small file implementation (8 byte pointers) can handle |
306 |
- any legal (i.e. up to 2GB) masterfile size. |
307 |
This limit can be broken by masterfile splitting. |
308 |
- any record size fitting in there using 4,4,0 pointers |
309 |
or records up to 16MB size using 4,3,1 pointers |
310 |
(which have a slight performance advantage) |
311 |
- any number of fields |
312 |
- record ids up to 2^31 |
313 |
|
314 |
The full xref spec can handle |
315 |
- large file sizes up to 2^63 (about 8.000.000.000.000.000.000) |
316 |
- record sizes up to 2^48 (256 TB) |
317 |
- record ids up to 2^63 |
318 |
|
319 |
The index can deal with |
320 |
- general purpose key-value pairs of a combined length of up to 255 bytes |
321 |
(might be limited to 127 with extensions like index compression) |
322 |
- standard values (hit pointers) of 4 to 14 bytes (default: 8) |
323 |
allow for keys of up to 241 to 251 bytes (default: 247) |
324 |
- pointers to record ids up to 2^63 (currently impl. up to 2^48) |
325 |
- pointers to non-negative tags up to 2^16 |
326 |
- pointers to position information up to 2^31 |
327 |
(by convention used as 1 or 2 bytes for field occurence, |
328 |
last 2 bytes for word position). |
329 |
- up to 2^32 leaf blocks of up to 8KB, totalling to a leaf file |
330 |
size of up to 32 TB (with large file support). |
331 |
Assuming an average key length of 16 and on average 10 values of 8 bytes |
332 |
per key, each block can hold 81 keys, totalling about 320 billion keys. |
333 |
The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB. |
334 |
This limit can be broken by index splitting. |
335 |
|
336 |
|
337 |
* limits of the current implementation |
338 |
|
339 |
Records have to fit into available memory, which, even when using ridiculous |
340 |
amounts of RAM and/or large swapfiles, is bound by addressable memory. |
341 |
On 32 bit architectures, this is usually 1 or 2 GB (on the heap). |
342 |
|
343 |
Also bound by addressable memory is the possibility to memory map |
344 |
the access files. While they work without memory mapping, |
345 |
performance degrades substantially. |
346 |
|
347 |
|
348 |
So, to use very large databases, the system should be compiled |
349 |
for a 64 bit machine, which are luckily becoming affordable these |
350 |
days and we hope to get us such a box in the near future. |
351 |
|
352 |
|
353 |
* extending file size limits |
354 |
|
355 |
Neither large file support nor file splitting are currently implemented. |
356 |
However, large file support is pretty straightforward. |
357 |
|
358 |
|
359 |
File splitting for masterfiles works by configuring a number n |
360 |
and using a series of masterfiles f0, f1 ... |
361 |
so that records with ids in the range i*n... (i+1)*n-1 |
362 |
are stored in masterfile i. Likewise a series of xref files |
363 |
is used based on a number m, which should be some multiple of n. |
364 |
|
365 |
|
366 |
File splitting for the B-Tree is based on configuring a sequence of keys |
367 |
k0=empty, k1 ... and using a series of leaf files l0, l1 ... |
368 |
so that keys less than ki and not less than k(i-1) are stored in li. |
369 |
Fork files could be split on the same keys or some subsequence of those. |
370 |
|
371 |
|
372 |
The advantage of file splitting over large file support is that |
373 |
it is more portable, saves some bytes on file positions and could aid backup, |
374 |
especially where records are mostly appended to the last masterfile. |
375 |
Moreover it can extend indexes even beyond 32 TB. |
376 |
The total database size need not even be addressable with 64 bits. |
377 |
|
378 |
The disadvantage is that it is a little bit more complicated |
379 |
and when used exhaustively could make the system's open files limit |
380 |
become a problem. Also tracking and snapshotting masterfiles |
381 |
is a little bit less trivial. |
382 |
|
383 |
|
384 |
--- |
385 |
$Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $ |