/[webpac]/openisis/0.9.9e/doc/FileFormats.txt
This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Annotation of /openisis/0.9.9e/doc/FileFormats.txt

Parent Directory Parent Directory | Revision Log Revision Log


Revision 604 - (hide annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 4 months ago) by dpavlin
File MIME type: text/plain
File size: 16732 byte(s)
import of new openisis release, 0.9.9e

1 dpavlin 604 ;man malete 5 $Date: 2004/11/12 11:18:23 $
2     Malete file formats
3    
4    
5     * meta data
6    
7     The Malete options (record 0) file .m0d (with a zero; .mod used otherwise),
8     if present, contains a single serialized record containing metadata,
9     including
10     > RecStruct field definitions
11    
12     Named
13     > CharSet collation info
14     is compiled on demand to byte-order dependent .mcx files with
15     magic 'mcx' or 'MCX' for little-/big-endian, resp.
16    
17    
18     * record data
19    
20     The Malete record data file .mrd, a.k.a masterfile, is a plain text file.
21     Lines are terminated by the ASCII newline character with byte value 10.
22     Any superset of ASCII - including UTF-8 - may be used as character encoding.
23     In other words, byte values 0-127 are always interpreted according to ASCII,
24     values 128-255 have no special meaning.
25     More precisely, the byte values 10 (newline), 9 (tabulator),
26     45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W')
27     have structural significance.
28    
29    
30     Logically, the masterfile is a sequence of records,
31     where every record is terminated by an empty line.
32     A record is substructured as a sequence of non-empty lines,
33     each terminated by a newline.
34    
35     Field lines are substructured as a number
36     (optional minus followed by digits) followed by a tabulator
37     followed by any characters but a newline as field values.
38     If the first line of a record does not start with a number,
39     it is used as header line. All other lines must be field lines.
40    
41    
42     As of now, there is only one valid header line format defined,
43     consisting of the letter 'W', followed by a tabulator and a
44     data record header. The data record header is of the form
45     'rid[@pos][*TAB*leader]'. Here
46     - rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits)
47     - if @pos is present, pos must be the position of the previous version
48     of this record as byte offset from beginning of file in ASCII digits.
49     Malete software supports retrieval of old versions based on this.
50     - the optional leader is arbitrary leader data for the record.
51     It can hold a MARC leader or a unique key ("name") for the record.
52    
53     If a record has no header line, its record id is one plus the highest
54     record id used up to that point in the file.
55    
56     Using header lines, records can represent updated versions of
57     previously stored records. There is no way to delete records,
58     the operation next to it is to write an empty record.
59    
60    
61     During normal operation, data in the masterfile is never updated,
62     all writes are appends. Thus live backups and replications can be
63     based on 'tail -f' writing to a tape or piped through netcat.
64     A consistent snapshot can be made by recording the current file size
65     and only accessing record versions at lower positions.
66    
67    
68     Compacting a masterfile by purging old versions is straightforward,
69     but considered an offline operation
70     (basically printing current versions to a new masterfile).
71     Compacting a mounted database is feasible for the exclusive mode server,
72     but currently not supported.
73    
74    
75     The masterfile is a special case of the serialized
76     > Protocol
77     , see there for details.
78     The suggested handling of malformed records should not be relied upon.
79    
80    
81     * record access
82    
83     The Malete record access file .mrx, a.k.a cross reference (xref) file,
84     is a binary file dependent on the platform byte order and pagesize.
85     Since the xref file is accessed by memory mapping,
86     its size must be a multiple of the pagesize
87     (which is 4 KB for most Intel based systems).
88     Where the xref is found to be missing, have the wrong byte order or size
89     or be otherwise invalid, it is rebuilt from the masterfile.
90    
91    
92     The xref is organized in units of a fixed size ranging from 8 to 16 bytes.
93     The unit at offset rid*size is a pointer to the current version of record
94     with id rid in the masterfile (rid starting from 1).
95    
96     This pointer consists of
97     - 4 to 8 bytes for the record's position
98     (including any header line)
99     - 3 to 6 bytes for the record's length
100     (including the terminating empty line)
101     - 0 to 2 bytes for the record's number of fields
102     (including 1 header field, even if it's empty)
103     When not compiled with support for large files,
104     only the combinations 4,3,1 and 4,4,0 are implemented.
105    
106     If a record's number of fields exceeds the representable range,
107     the corresponding pointer bytes (if any) are set to 0 and the number
108     of field must be determined while reading the record.
109     This number is also 0 for "deleted" records (no fields, empty leader).
110    
111    
112     The first such unit (at mrx offset 0) consists of
113     - 3 bytes magic number, which is "MRX" for big endian, "mrx" else
114     - 1 byte type, computed as
115     (bytes for pos-4)*16+(bytes for length-3)*4+(bytes for fields)
116     - a 4 byte integer in native byte order holding the max used rid
117     - if the unit size is 10 or greater (i.e. with largefile support),
118     the following 2 (for 10,11) or 4 bytes are an integer in native byte order
119     holding high order bytes of the max record id.
120    
121    
122     * query data and access
123    
124     The malete query data .mqd and query access .mqx files hold
125     the leaves and forks (inner nodes) of a B-Link-Tree, resp.,
126     which is usually used as index associating keys generated from masterfile
127     records with pointers to such records.
128    
129    
130     Both files are organized in fixed size blocks containing binary integer numbers
131     and arbitrary key strings (encoded according to the collation configuration).
132     Blocks contain
133     - 16 bytes of "header" data
134     - a "dictionary",
135     which is an array of 4 byte units describing entries in the block
136     with position, length of key and number of values.
137     - a "stack" holding the entries,
138     growing downwards from the end of the block.
139     For dictionary slots d[0]..d[n], describing entries of size s0..sn,
140     entry i occupies si consecutive bytes starting at (block size)-(s0+...+si).
141    
142     The header contains 8 unsigned binary numbers:
143     - num 4 bytes block number
144     - typ 1 byte block type (bitmask; see below)
145     - ksz 1 byte maximum key length
146     (default 0 treated like the maximum 255; CDS/ISIS uses 30)
147     - ptr 1 byte pointer type (bitmask; see below)
148     - lev 1 byte level over bottom (0 for leaves)
149     - nxt 4 bytes number of right sibling block (0 if none)
150     - ent 2 bytes number of entries (length of dictionary)
151     - stk 2 bytes offset of 1st byte used by the stack
152     The first 5 numbers are actually redundant since the block number
153     must match the blocks file position, the three configurable bytes
154     must be the same among all leaves and forks, resp.,
155     and the level is not really needed.
156     However, wasting this 8 bytes serves as a minimal check
157     and makes handling much easier.
158    
159     In leaf blocks, the byte order of all header numbers is little endian
160     (so big endian machines have to swap bytes when reading/writing leaf blocks).
161     In fork blocks, the header numbers are in native byte order
162     (since they are usually not accessed by explicit reading and writing,
163     but through memory mapping).
164    
165    
166     The block type is:
167     - upper 2 bits 0xC0 give the basic block type:
168     0x00 for a standard leaf block,
169     0x40 and 0x80 for little and big endian forks,
170     0xC0 for leaves with unstructured values.
171     - next two bits are clear and reserved for future extensions;
172     should software see such a bit set,
173     it must not assume anything about the index structure.
174     - the bit 0x08, if set, will indicate that the index is compressed.
175     Each key is stored as one byte giving the length
176     of the common prefix to its predecessor followed by changed bytes.
177     This is not yet supported and current software will refuse to
178     access such an index.
179     Also do not confuse this with compressing each key individually based on
180     > CharSet collation recoding
181     - lowest 3 bits 0x07 give the block size.
182     For leaves, this is freely configurable as bitcount-9,
183     from 0 for 512 (2^9) bits to 4 for 8KB (2^13).
184     For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16),
185     and must match the system's pagesize (should there be a system with
186     a pagesize of less than 4 KB, 4KB is used).
187    
188     The pointer type describes the structure of leaf values:
189     - upper two bits 0xC0 give the number of bytes to hold the tag
190     from 0 (0x00) to 2 (0x80).
191     - next two bits 0x30 give the number of bytes to hold the record id - 3,
192     from 0x00 for 3 bytes to 0x30 for 6 bytes
193     - the bit 0x08, if set, indicates that the tag is stored after
194     the record id (as in CDS/ISIS), else the tag bytes precede the record id
195     - lowest 3 bits 0x07 give the number of bytes to hold the position
196     (occurence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes.
197     For a value of 5 or 6, we should assume 4 byte position and one or two
198     additional bytes for the record id (currently not supported).
199     Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes.
200     Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos)
201     is the same as used by CDS/ISIS.
202     All numbers in the pointer are stored in big endian byte order
203     (most significant byte first) for lexical sorting.
204    
205    
206     The dictionary contains 4 byte units, describing the position pos of an entry,
207     its number of values vln and the length of its key kln,
208     which is stored in the 4th byte (0 here is actually the empty key,
209     which is always the first in the leftmost block of every fork level).
210     In a fork block, the first 2 bytes store the pos in native byte order,
211     and the 3rd byte is vln.
212     In a leaf block (max size 8KB), the pos has 13bit and the vln 11bit.
213     The first 3 bytes in a dictionary unit are, independent of platform,
214     0: pos mod 256 (lower 8 bits),
215     1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits)
216     and 2: vln mod 256 (lower 8 bits).
217    
218    
219     Entries in the stack always start at pos with kln bytes holding the key
220     (as the actual key bytes, or, in a future version, compressed).
221    
222     In a leaf block, after the key there are vln values sorted in increasing order
223     as of memcmp. Values have fixed size and structure as described by the
224     pointer type (typically 8 byte each).
225     Should the leaf block type be unstructured values, ptr is the actual
226     value size and no assumptions are made about the structure of values
227     (to be used independent of a masterfile as general purpose B-Tree).
228    
229     A special value of all 0 bytes is used for stopwords;
230     no other value will be associated with the key.
231     For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0
232     will be the first and is reserved to store unique keys:
233     at most one pointer to tag 0 will be associated with a key.
234    
235     A leaf key currently always has at least one value;
236     should this key-value association be deleted, the entry is deleted completely.
237     (With index compression vln 0 will denote a stopword).
238    
239     Should a key be associated with more values than fit within a block,
240     the following block starts with an entry with the same key
241     and next higher value.
242     (With index compression, we might consider using the empty key there;
243     to be defined).
244    
245    
246     In a fork block, currently (no index compression), only vln 0 and 1
247     are used. With vln 0, the key is directly followed by a 4 byte native
248     child block number (which is a leaf block number, if the block's level is 1,
249     else a fork block number). With vln not 0, they key is followed by
250     vln pairs of size (value size + 4), containing value bytes as in the leaves
251     followed by the 4 byte child block number. This is used where a key
252     spans leave blocks: we have a fork entry with vln 0 pointing to the key's
253     first block, followed by an entry with the same key plus the starting value
254     of the next block and so on.
255     (With index compression, we will use one entry with multiple values).
256    
257    
258     Fork block number 0 is always the root, and leaf block number 0
259     is always the leftmost leaf. All keys and values can be looped in order
260     by starting from leaf 0 and following the nxt pointers.
261    
262     Note that the layout of leaf blocks is fully platform independent:
263     record pointers are organized big endian as in CDS/ISIS,
264     header numbers are little endian, and the dictionary is defined per byte.
265    
266    
267     * shared B-L-Tree access
268    
269     A process accessing the index in exclusive mode may actually shift
270     keys and values around according to the specified layout
271     as it seems fit to minimize unused space in blocks.
272    
273     With shared access, changes are very limited:
274     - Deleting a key-value association is done in the leaf block only,
275     there are no updates to any fork or other key block.
276     - The same holds where an inserted key-value association fits within
277     the target leaf block.
278     - The only case where data is moved between blocks is when a new pair
279     does not fit in its target block: a new block is allocated to become
280     the new right sibling of the target block and as many entries are moved
281     from the block's end to its new sibling so that the new pair can be
282     inserted in one of the blocks. On such a block split, a new entry
283     is made in the block's parent fork (which might trigger a split there).
284     - Should a process find that the key expected in a block is greater
285     than the greatest key there, it must assume that a block split
286     occurred and follow the nxt link to inspect the block's successor.
287     (That's why it's called B-Link-Tree; invented by Lehmann and Yao).
288     Actually, as we use full fork file locking, this can only happen
289     in leave blocks.
290    
291    
292     * limits according to file formats
293    
294     The masterfile obviously has no limits but filesize.
295     To break the custom 2GB (32bit signed) barrier there are two approaches,
296     compiling with large file support and splitting files,
297     discussed further below.
298    
299    
300     In general, most 4 and 8 byte numbers are assumed signed entities,
301     since several system calls handle them that way.
302     Any 2^x here is to be understood as 2^x-1.
303    
304    
305     The xref in the small file implementation (8 byte pointers) can handle
306     - any legal (i.e. up to 2GB) masterfile size.
307     This limit can be broken by masterfile splitting.
308     - any record size fitting in there using 4,4,0 pointers
309     or records up to 16MB size using 4,3,1 pointers
310     (which have a slight performance advantage)
311     - any number of fields
312     - record ids up to 2^31
313    
314     The full xref spec can handle
315     - large file sizes up to 2^63 (about 8.000.000.000.000.000.000)
316     - record sizes up to 2^48 (256 TB)
317     - record ids up to 2^63
318    
319     The index can deal with
320     - general purpose key-value pairs of a combined length of up to 255 bytes
321     (might be limited to 127 with extensions like index compression)
322     - standard values (hit pointers) of 4 to 14 bytes (default: 8)
323     allow for keys of up to 241 to 251 bytes (default: 247)
324     - pointers to record ids up to 2^63 (currently impl. up to 2^48)
325     - pointers to non-negative tags up to 2^16
326     - pointers to position information up to 2^31
327     (by convention used as 1 or 2 bytes for field occurence,
328     last 2 bytes for word position).
329     - up to 2^32 leaf blocks of up to 8KB, totalling to a leaf file
330     size of up to 32 TB (with large file support).
331     Assuming an average key length of 16 and on average 10 values of 8 bytes
332     per key, each block can hold 81 keys, totalling about 320 billion keys.
333     The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB.
334     This limit can be broken by index splitting.
335    
336    
337     * limits of the current implementation
338    
339     Records have to fit into available memory, which, even when using ridiculous
340     amounts of RAM and/or large swapfiles, is bound by addressable memory.
341     On 32 bit architectures, this is usually 1 or 2 GB (on the heap).
342    
343     Also bound by addressable memory is the possibility to memory map
344     the access files. While they work without memory mapping,
345     performance degrades substantially.
346    
347    
348     So, to use very large databases, the system should be compiled
349     for a 64 bit machine, which are luckily becoming affordable these
350     days and we hope to get us such a box in the near future.
351    
352    
353     * extending file size limits
354    
355     Neither large file support nor file splitting are currently implemented.
356     However, large file support is pretty straightforward.
357    
358    
359     File splitting for masterfiles works by configuring a number n
360     and using a series of masterfiles f0, f1 ...
361     so that records with ids in the range i*n... (i+1)*n-1
362     are stored in masterfile i. Likewise a series of xref files
363     is used based on a number m, which should be some multiple of n.
364    
365    
366     File splitting for the B-Tree is based on configuring a sequence of keys
367     k0=empty, k1 ... and using a series of leaf files l0, l1 ...
368     so that keys less than ki and not less than k(i-1) are stored in li.
369     Fork files could be split on the same keys or some subsequence of those.
370    
371    
372     The advantage of file splitting over large file support is that
373     it is more portable, saves some bytes on file positions and could aid backup,
374     especially where records are mostly appended to the last masterfile.
375     Moreover it can extend indexes even beyond 32 TB.
376     The total database size need not even be addressable with 64 bits.
377    
378     The disadvantage is that it is a little bit more complicated
379     and when used exhaustively could make the system's open files limit
380     become a problem. Also tracking and snapshotting masterfiles
381     is a little bit less trivial.
382    
383    
384     ---
385     $Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $

  ViewVC Help
Powered by ViewVC 1.1.26