This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /openisis/0.9.9e/doc/FileFormats.txt

Parent Directory Parent Directory | Revision Log Revision Log

Revision 604 - (show annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 6 months ago) by dpavlin
File MIME type: text/plain
File size: 16732 byte(s)
import of new openisis release, 0.9.9e

1 ;man malete 5 $Date: 2004/11/12 11:18:23 $
2 Malete file formats
5 * meta data
7 The Malete options (record 0) file .m0d (with a zero; .mod used otherwise),
8 if present, contains a single serialized record containing metadata,
9 including
10 > RecStruct field definitions
12 Named
13 > CharSet collation info
14 is compiled on demand to byte-order dependent .mcx files with
15 magic 'mcx' or 'MCX' for little-/big-endian, resp.
18 * record data
20 The Malete record data file .mrd, a.k.a masterfile, is a plain text file.
21 Lines are terminated by the ASCII newline character with byte value 10.
22 Any superset of ASCII - including UTF-8 - may be used as character encoding.
23 In other words, byte values 0-127 are always interpreted according to ASCII,
24 values 128-255 have no special meaning.
25 More precisely, the byte values 10 (newline), 9 (tabulator),
26 45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W')
27 have structural significance.
30 Logically, the masterfile is a sequence of records,
31 where every record is terminated by an empty line.
32 A record is substructured as a sequence of non-empty lines,
33 each terminated by a newline.
35 Field lines are substructured as a number
36 (optional minus followed by digits) followed by a tabulator
37 followed by any characters but a newline as field values.
38 If the first line of a record does not start with a number,
39 it is used as header line. All other lines must be field lines.
42 As of now, there is only one valid header line format defined,
43 consisting of the letter 'W', followed by a tabulator and a
44 data record header. The data record header is of the form
45 'rid[@pos][*TAB*leader]'. Here
46 - rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits)
47 - if @pos is present, pos must be the position of the previous version
48 of this record as byte offset from beginning of file in ASCII digits.
49 Malete software supports retrieval of old versions based on this.
50 - the optional leader is arbitrary leader data for the record.
51 It can hold a MARC leader or a unique key ("name") for the record.
53 If a record has no header line, its record id is one plus the highest
54 record id used up to that point in the file.
56 Using header lines, records can represent updated versions of
57 previously stored records. There is no way to delete records,
58 the operation next to it is to write an empty record.
61 During normal operation, data in the masterfile is never updated,
62 all writes are appends. Thus live backups and replications can be
63 based on 'tail -f' writing to a tape or piped through netcat.
64 A consistent snapshot can be made by recording the current file size
65 and only accessing record versions at lower positions.
68 Compacting a masterfile by purging old versions is straightforward,
69 but considered an offline operation
70 (basically printing current versions to a new masterfile).
71 Compacting a mounted database is feasible for the exclusive mode server,
72 but currently not supported.
75 The masterfile is a special case of the serialized
76 > Protocol
77 , see there for details.
78 The suggested handling of malformed records should not be relied upon.
81 * record access
83 The Malete record access file .mrx, a.k.a cross reference (xref) file,
84 is a binary file dependent on the platform byte order and pagesize.
85 Since the xref file is accessed by memory mapping,
86 its size must be a multiple of the pagesize
87 (which is 4 KB for most Intel based systems).
88 Where the xref is found to be missing, have the wrong byte order or size
89 or be otherwise invalid, it is rebuilt from the masterfile.
92 The xref is organized in units of a fixed size ranging from 8 to 16 bytes.
93 The unit at offset rid*size is a pointer to the current version of record
94 with id rid in the masterfile (rid starting from 1).
96 This pointer consists of
97 - 4 to 8 bytes for the record's position
98 (including any header line)
99 - 3 to 6 bytes for the record's length
100 (including the terminating empty line)
101 - 0 to 2 bytes for the record's number of fields
102 (including 1 header field, even if it's empty)
103 When not compiled with support for large files,
104 only the combinations 4,3,1 and 4,4,0 are implemented.
106 If a record's number of fields exceeds the representable range,
107 the corresponding pointer bytes (if any) are set to 0 and the number
108 of field must be determined while reading the record.
109 This number is also 0 for "deleted" records (no fields, empty leader).
112 The first such unit (at mrx offset 0) consists of
113 - 3 bytes magic number, which is "MRX" for big endian, "mrx" else
114 - 1 byte type, computed as
115 (bytes for pos-4)*16+(bytes for length-3)*4+(bytes for fields)
116 - a 4 byte integer in native byte order holding the max used rid
117 - if the unit size is 10 or greater (i.e. with largefile support),
118 the following 2 (for 10,11) or 4 bytes are an integer in native byte order
119 holding high order bytes of the max record id.
122 * query data and access
124 The malete query data .mqd and query access .mqx files hold
125 the leaves and forks (inner nodes) of a B-Link-Tree, resp.,
126 which is usually used as index associating keys generated from masterfile
127 records with pointers to such records.
130 Both files are organized in fixed size blocks containing binary integer numbers
131 and arbitrary key strings (encoded according to the collation configuration).
132 Blocks contain
133 - 16 bytes of "header" data
134 - a "dictionary",
135 which is an array of 4 byte units describing entries in the block
136 with position, length of key and number of values.
137 - a "stack" holding the entries,
138 growing downwards from the end of the block.
139 For dictionary slots d[0]..d[n], describing entries of size s0..sn,
140 entry i occupies si consecutive bytes starting at (block size)-(s0+...+si).
142 The header contains 8 unsigned binary numbers:
143 - num 4 bytes block number
144 - typ 1 byte block type (bitmask; see below)
145 - ksz 1 byte maximum key length
146 (default 0 treated like the maximum 255; CDS/ISIS uses 30)
147 - ptr 1 byte pointer type (bitmask; see below)
148 - lev 1 byte level over bottom (0 for leaves)
149 - nxt 4 bytes number of right sibling block (0 if none)
150 - ent 2 bytes number of entries (length of dictionary)
151 - stk 2 bytes offset of 1st byte used by the stack
152 The first 5 numbers are actually redundant since the block number
153 must match the blocks file position, the three configurable bytes
154 must be the same among all leaves and forks, resp.,
155 and the level is not really needed.
156 However, wasting this 8 bytes serves as a minimal check
157 and makes handling much easier.
159 In leaf blocks, the byte order of all header numbers is little endian
160 (so big endian machines have to swap bytes when reading/writing leaf blocks).
161 In fork blocks, the header numbers are in native byte order
162 (since they are usually not accessed by explicit reading and writing,
163 but through memory mapping).
166 The block type is:
167 - upper 2 bits 0xC0 give the basic block type:
168 0x00 for a standard leaf block,
169 0x40 and 0x80 for little and big endian forks,
170 0xC0 for leaves with unstructured values.
171 - next two bits are clear and reserved for future extensions;
172 should software see such a bit set,
173 it must not assume anything about the index structure.
174 - the bit 0x08, if set, will indicate that the index is compressed.
175 Each key is stored as one byte giving the length
176 of the common prefix to its predecessor followed by changed bytes.
177 This is not yet supported and current software will refuse to
178 access such an index.
179 Also do not confuse this with compressing each key individually based on
180 > CharSet collation recoding
181 - lowest 3 bits 0x07 give the block size.
182 For leaves, this is freely configurable as bitcount-9,
183 from 0 for 512 (2^9) bits to 4 for 8KB (2^13).
184 For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16),
185 and must match the system's pagesize (should there be a system with
186 a pagesize of less than 4 KB, 4KB is used).
188 The pointer type describes the structure of leaf values:
189 - upper two bits 0xC0 give the number of bytes to hold the tag
190 from 0 (0x00) to 2 (0x80).
191 - next two bits 0x30 give the number of bytes to hold the record id - 3,
192 from 0x00 for 3 bytes to 0x30 for 6 bytes
193 - the bit 0x08, if set, indicates that the tag is stored after
194 the record id (as in CDS/ISIS), else the tag bytes precede the record id
195 - lowest 3 bits 0x07 give the number of bytes to hold the position
196 (occurence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes.
197 For a value of 5 or 6, we should assume 4 byte position and one or two
198 additional bytes for the record id (currently not supported).
199 Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes.
200 Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos)
201 is the same as used by CDS/ISIS.
202 All numbers in the pointer are stored in big endian byte order
203 (most significant byte first) for lexical sorting.
206 The dictionary contains 4 byte units, describing the position pos of an entry,
207 its number of values vln and the length of its key kln,
208 which is stored in the 4th byte (0 here is actually the empty key,
209 which is always the first in the leftmost block of every fork level).
210 In a fork block, the first 2 bytes store the pos in native byte order,
211 and the 3rd byte is vln.
212 In a leaf block (max size 8KB), the pos has 13bit and the vln 11bit.
213 The first 3 bytes in a dictionary unit are, independent of platform,
214 0: pos mod 256 (lower 8 bits),
215 1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits)
216 and 2: vln mod 256 (lower 8 bits).
219 Entries in the stack always start at pos with kln bytes holding the key
220 (as the actual key bytes, or, in a future version, compressed).
222 In a leaf block, after the key there are vln values sorted in increasing order
223 as of memcmp. Values have fixed size and structure as described by the
224 pointer type (typically 8 byte each).
225 Should the leaf block type be unstructured values, ptr is the actual
226 value size and no assumptions are made about the structure of values
227 (to be used independent of a masterfile as general purpose B-Tree).
229 A special value of all 0 bytes is used for stopwords;
230 no other value will be associated with the key.
231 For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0
232 will be the first and is reserved to store unique keys:
233 at most one pointer to tag 0 will be associated with a key.
235 A leaf key currently always has at least one value;
236 should this key-value association be deleted, the entry is deleted completely.
237 (With index compression vln 0 will denote a stopword).
239 Should a key be associated with more values than fit within a block,
240 the following block starts with an entry with the same key
241 and next higher value.
242 (With index compression, we might consider using the empty key there;
243 to be defined).
246 In a fork block, currently (no index compression), only vln 0 and 1
247 are used. With vln 0, the key is directly followed by a 4 byte native
248 child block number (which is a leaf block number, if the block's level is 1,
249 else a fork block number). With vln not 0, they key is followed by
250 vln pairs of size (value size + 4), containing value bytes as in the leaves
251 followed by the 4 byte child block number. This is used where a key
252 spans leave blocks: we have a fork entry with vln 0 pointing to the key's
253 first block, followed by an entry with the same key plus the starting value
254 of the next block and so on.
255 (With index compression, we will use one entry with multiple values).
258 Fork block number 0 is always the root, and leaf block number 0
259 is always the leftmost leaf. All keys and values can be looped in order
260 by starting from leaf 0 and following the nxt pointers.
262 Note that the layout of leaf blocks is fully platform independent:
263 record pointers are organized big endian as in CDS/ISIS,
264 header numbers are little endian, and the dictionary is defined per byte.
267 * shared B-L-Tree access
269 A process accessing the index in exclusive mode may actually shift
270 keys and values around according to the specified layout
271 as it seems fit to minimize unused space in blocks.
273 With shared access, changes are very limited:
274 - Deleting a key-value association is done in the leaf block only,
275 there are no updates to any fork or other key block.
276 - The same holds where an inserted key-value association fits within
277 the target leaf block.
278 - The only case where data is moved between blocks is when a new pair
279 does not fit in its target block: a new block is allocated to become
280 the new right sibling of the target block and as many entries are moved
281 from the block's end to its new sibling so that the new pair can be
282 inserted in one of the blocks. On such a block split, a new entry
283 is made in the block's parent fork (which might trigger a split there).
284 - Should a process find that the key expected in a block is greater
285 than the greatest key there, it must assume that a block split
286 occurred and follow the nxt link to inspect the block's successor.
287 (That's why it's called B-Link-Tree; invented by Lehmann and Yao).
288 Actually, as we use full fork file locking, this can only happen
289 in leave blocks.
292 * limits according to file formats
294 The masterfile obviously has no limits but filesize.
295 To break the custom 2GB (32bit signed) barrier there are two approaches,
296 compiling with large file support and splitting files,
297 discussed further below.
300 In general, most 4 and 8 byte numbers are assumed signed entities,
301 since several system calls handle them that way.
302 Any 2^x here is to be understood as 2^x-1.
305 The xref in the small file implementation (8 byte pointers) can handle
306 - any legal (i.e. up to 2GB) masterfile size.
307 This limit can be broken by masterfile splitting.
308 - any record size fitting in there using 4,4,0 pointers
309 or records up to 16MB size using 4,3,1 pointers
310 (which have a slight performance advantage)
311 - any number of fields
312 - record ids up to 2^31
314 The full xref spec can handle
315 - large file sizes up to 2^63 (about
316 - record sizes up to 2^48 (256 TB)
317 - record ids up to 2^63
319 The index can deal with
320 - general purpose key-value pairs of a combined length of up to 255 bytes
321 (might be limited to 127 with extensions like index compression)
322 - standard values (hit pointers) of 4 to 14 bytes (default: 8)
323 allow for keys of up to 241 to 251 bytes (default: 247)
324 - pointers to record ids up to 2^63 (currently impl. up to 2^48)
325 - pointers to non-negative tags up to 2^16
326 - pointers to position information up to 2^31
327 (by convention used as 1 or 2 bytes for field occurence,
328 last 2 bytes for word position).
329 - up to 2^32 leaf blocks of up to 8KB, totalling to a leaf file
330 size of up to 32 TB (with large file support).
331 Assuming an average key length of 16 and on average 10 values of 8 bytes
332 per key, each block can hold 81 keys, totalling about 320 billion keys.
333 The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB.
334 This limit can be broken by index splitting.
337 * limits of the current implementation
339 Records have to fit into available memory, which, even when using ridiculous
340 amounts of RAM and/or large swapfiles, is bound by addressable memory.
341 On 32 bit architectures, this is usually 1 or 2 GB (on the heap).
343 Also bound by addressable memory is the possibility to memory map
344 the access files. While they work without memory mapping,
345 performance degrades substantially.
348 So, to use very large databases, the system should be compiled
349 for a 64 bit machine, which are luckily becoming affordable these
350 days and we hope to get us such a box in the near future.
353 * extending file size limits
355 Neither large file support nor file splitting are currently implemented.
356 However, large file support is pretty straightforward.
359 File splitting for masterfiles works by configuring a number n
360 and using a series of masterfiles f0, f1 ...
361 so that records with ids in the range i*n... (i+1)*n-1
362 are stored in masterfile i. Likewise a series of xref files
363 is used based on a number m, which should be some multiple of n.
366 File splitting for the B-Tree is based on configuring a sequence of keys
367 k0=empty, k1 ... and using a series of leaf files l0, l1 ...
368 so that keys less than ki and not less than k(i-1) are stored in li.
369 Fork files could be split on the same keys or some subsequence of those.
372 The advantage of file splitting over large file support is that
373 it is more portable, saves some bytes on file positions and could aid backup,
374 especially where records are mostly appended to the last masterfile.
375 Moreover it can extend indexes even beyond 32 TB.
376 The total database size need not even be addressable with 64 bits.
378 The disadvantage is that it is a little bit more complicated
379 and when used exhaustively could make the system's open files limit
380 become a problem. Also tracking and snapshotting masterfiles
381 is a little bit less trivial.
384 ---
385 $Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $

  ViewVC Help
Powered by ViewVC 1.1.26