0.9.9e/doc/FileFormats.txt

;man    malete  5       $Date: 2004/11/12 11:18:23 $
Malete file formats


*       meta data

The Malete options (record 0) file .m0d (with a zero; .mod used otherwise),
if present, contains a single serialized record containing metadata,
including
>       RecStruct       field definitions

Named
>       CharSet collation info
is compiled on demand to byte-order dependent .mcx files with
magic 'mcx' or 'MCX' for little-/big-endian, resp.


*       record data

The Malete record data file .mrd, a.k.a. masterfile, is a plain text file.
Lines are terminated by the ASCII newline character with byte value 10.
Any superset of ASCII - including UTF-8 - may be used as character encoding.
In other words, byte values 0-127 are always interpreted according to ASCII,
values 128-255 have no special meaning.
More precisely, the byte values 10 (newline), 9 (tabulator),
45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W')
have structural significance.


Logically, the masterfile is a sequence of records,
where every record is terminated by an empty line.
A record is substructured as a sequence of non-empty lines,
each terminated by a newline.

Field lines are substructured as a number
(optional minus followed by digits) followed by a tabulator
followed by any characters but a newline as field values.
If the first line of a record does not start with a number,
it is used as header line. All other lines must be field lines.


As of now, there is only one valid header line format defined,
consisting of the letter 'W', followed by a tabulator and a
data record header. The data record header is of the form
'rid[@pos][*TAB*leader]'. Here
-       rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits)
-       if @pos is present, pos must be the position of the previous version
        of this record as byte offset from beginning of file in ASCII digits.
        Malete software supports retrieval of old versions based on this.
-       the optional leader is arbitrary leader data for the record.
        It can hold a MARC leader or a unique key ("name") for the record.

If a record has no header line, its record id is one plus the highest
record id used up to that point in the file.
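
The record rules above can be sketched as a small reader. This is a
minimal illustration, not the Malete parser: the function name
parse_masterfile is hypothetical, stray empty lines and an unterminated
last record are simply ignored (an assumption), and malformed lines are
not handled at all.

```python
def parse_masterfile(data: bytes):
    """Return a list of (rid, prev_pos, leader, fields) tuples."""
    records = []
    max_rid = 0
    lines = []
    for line in data.split(b"\n"):
        if line:
            lines.append(line)
            continue
        # An empty line terminates the current record.
        if not lines:
            continue  # assumption: stray empty lines are ignored
        rid = prev_pos = leader = None
        first = lines[0]
        if first.startswith(b"W\t"):
            # Header line: 'W' TAB rid[@pos][TAB leader]
            parts = first[2:].split(b"\t", 1)
            leader = parts[1] if len(parts) > 1 else None
            rid_part, _, pos_part = parts[0].partition(b"@")
            rid = int(rid_part)
            prev_pos = int(pos_part) if pos_part else None
            lines = lines[1:]
        if rid is None:
            # No header: one plus the highest record id seen so far.
            rid = max_rid + 1
        max_rid = max(max_rid, rid)
        fields = []
        for field_line in lines:
            tag, _, value = field_line.partition(b"\t")
            fields.append((int(tag), value))
        records.append((rid, prev_pos, leader, fields))
        lines = []
    return records
```

A later tuple with the same rid stands for a newer version of that record,
and its prev_pos (if any) points back at the older one.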

Using header lines, records can represent updated versions of
previously stored records. There is no way to delete records;
the closest operation is to write an empty record.


During normal operation, data in the masterfile is never updated in place;
all writes are appends. Thus live backups and replications can be
based on 'tail -f' writing to a tape or piped through netcat.
A consistent snapshot can be made by recording the current file size
and only accessing record versions at lower positions.


Compacting a masterfile by purging old versions is straightforward,
but considered an offline operation
(basically printing current versions to a new masterfile).
Compacting a mounted database is feasible for the exclusive mode server,
but currently not supported.


The masterfile is a special case of the serialized
>       Protocol
, see there for details.
The suggested handling of malformed records should not be relied upon.


*       record access

The Malete record access file .mrx, a.k.a. cross reference (xref) file,
is a binary file dependent on the platform byte order and pagesize.
Since the xref file is accessed by memory mapping,
its size must be a multiple of the pagesize
(which is 4 KB for most Intel based systems).
Where the xref is found to be missing, have the wrong byte order or size
or be otherwise invalid, it is rebuilt from the masterfile.


The xref is organized in units of a fixed size ranging from 8 to 16 bytes.
The unit at offset rid*size is a pointer to the current version of record
with id rid in the masterfile (rid starting from 1).

This pointer consists of
-       4 to 8 bytes for the record's position
        (including any header line)
-       3 to 6 bytes for the record's length
        (including the terminating empty line)
-       0 to 2 bytes for the record's number of fields
        (including 1 header field, even if it's empty)
When not compiled with support for large files,
only the combinations 4,3,1 and 4,4,0 are implemented.

If a record's number of fields exceeds the representable range,
the corresponding pointer bytes (if any) are set to 0 and the number
of fields must be determined while reading the record.
This number is also 0 for "deleted" records (no fields, empty leader).


The first such unit (at mrx offset 0) consists of
-       3 bytes magic number, which is "MRX" for big endian, "mrx" else
-       1 byte type, computed as
        (bytes for pos-4)*16+(bytes for length-3)*4+(bytes for fields)
-       a 4 byte integer in native byte order holding the max used rid
-       if the unit size is 10 or greater (i.e. with largefile support),
        the following 2 (for 10,11) or 4 bytes are an integer in native byte order
        holding high order bytes of the max record id.
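
The type byte formula and the first unit can be illustrated as follows.
This is a sketch under assumptions: the function names are hypothetical,
the parts are taken to be laid out in the order listed without padding,
the max rid is read as a signed integer, and the high order bytes of
larger units are not read. The decoding in xref_layout is simply the
inverse of the formula given above.

```python
import struct


def xref_type(pos_bytes, len_bytes, field_bytes):
    """Type byte as (pos-4)*16 + (length-3)*4 + fields."""
    return (pos_bytes - 4) * 16 + (len_bytes - 3) * 4 + field_bytes


def xref_layout(type_byte):
    """Invert xref_type: return (pos_bytes, len_bytes, field_bytes)."""
    return 4 + (type_byte >> 4), 3 + ((type_byte >> 2) & 3), type_byte & 3


def parse_xref_header(unit: bytes):
    """Read magic, type and max rid from the first 8 bytes of an xref."""
    magic = unit[:3]
    if magic == b"MRX":
        order = ">"  # written on a big endian machine
    elif magic == b"mrx":
        order = "<"  # written on a little endian machine
    else:
        raise ValueError("not an xref file")
    layout = xref_layout(unit[3])
    (max_rid,) = struct.unpack(order + "i", unit[4:8])
    return order, layout, max_rid
```

The two combinations available without large file support give type 1
(4,3,1) and type 4 (4,4,0), both with a unit size of 8 bytes.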


*       query data and access

The Malete query data .mqd and query access .mqx files hold
the leaves and forks (inner nodes) of a B-Link-Tree, resp.,
which is usually used as index associating keys generated from masterfile
records with pointers to such records.


Both files are organized in fixed size blocks containing binary integer numbers
and arbitrary key strings (encoded according to the collation configuration).
Blocks contain
-       16 bytes of "header" data
-       a "dictionary",
        which is an array of 4 byte units describing entries in the block
        with position, length of key and number of values.
-       a "stack" holding the entries,
        growing downwards from the end of the block.
        For dictionary slots d[0]..d[n], describing entries of size s0..sn,
        entry i occupies si consecutive bytes starting at (block size)-(s0+...+si).
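
The stack rule can be checked with a few lines of arithmetic. A minimal
sketch; entry_positions is a hypothetical helper, not part of Malete.

```python
def entry_positions(block_size, sizes):
    """Start offset of each stack entry, given entry sizes s0..sn."""
    positions = []
    used = 0
    for size in sizes:
        used += size  # s0 + ... + si
        positions.append(block_size - used)
    return positions
```

For a 512 byte block with entries of 30, 20 and 50 bytes this yields the
offsets 482, 462 and 412, so the stack grows downwards as described.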

The header contains 8 unsigned binary numbers:
-       num 4 bytes block number
-       typ 1 byte block type (bitmask; see below)
-       ksz 1 byte maximum key length
        (default 0 treated like the maximum 255; CDS/ISIS uses 30)
-       ptr 1 byte pointer type (bitmask; see below)
-       lev 1 byte level over bottom (0 for leaves)
-       nxt 4 bytes number of right sibling block (0 if none)
-       ent 2 bytes number of entries (length of dictionary)
-       stk 2 bytes offset of 1st byte used by the stack
The first 5 numbers are actually redundant since the block number
must match the block's file position, the three configurable bytes
must be the same among all leaves and forks, resp.,
and the level is not really needed.
However, wasting these 8 bytes serves as a minimal check
and makes handling much easier.

In leaf blocks, the byte order of all header numbers is little endian
(so big endian machines have to swap bytes when reading/writing leaf blocks).
In fork blocks, the header numbers are in native byte order
(since they are usually not accessed by explicit reading and writing,
but through memory mapping).
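
Reading a leaf block header might look like the following sketch. It
assumes the eight numbers are stored in the order listed, without padding;
the BlockHeader type and the function name are hypothetical.

```python
import struct
from collections import namedtuple

BlockHeader = namedtuple("BlockHeader", "num typ ksz ptr lev nxt ent stk")

# '<' selects little endian without padding: 4+1+1+1+1+4+2+2 = 16 bytes.
LEAF_HEADER = struct.Struct("<IBBBBIHH")


def parse_leaf_header(block: bytes) -> BlockHeader:
    """Decode the 16 byte header at the start of a leaf block."""
    return BlockHeader(*LEAF_HEADER.unpack(block[:16]))
```

For fork blocks the same layout would be read with the native byte order
instead of '<'.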


The block type is:
-       upper 2 bits 0xC0 give the basic block type:
        0x00 for a standard leaf block,
        0x40 and 0x80 for little and big endian forks,
        0xC0 for leaves with unstructured values.
-       next two bits are clear and reserved for future extensions;
        should software see such a bit set,
        it must not assume anything about the index structure.
-       the bit 0x08, if set, will indicate that the index is compressed.
        Each key is stored as one byte giving the length
        of the common prefix to its predecessor followed by changed bytes.
        This is not yet supported and current software will refuse to
        access such an index.
        Also do not confuse this with compressing each key individually based on
>       CharSet collation recoding
-       lowest 3 bits 0x07 give the block size.
        For leaves, this is freely configurable as bitcount-9,
        from 0 for 512 bytes (2^9) to 4 for 8KB (2^13).
        For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16),
        and must match the system's pagesize (should there be a system with
        a pagesize of less than 4 KB, 4KB is used).
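
Decoding the block type byte can be sketched as below. The function name
and the kind labels are hypothetical; rejecting size codes above 4 is an
assumption drawn from the ranges given above.

```python
def decode_block_type(typ):
    """Return (kind, block size in bytes) for a block type byte."""
    if typ & 0x30:
        raise ValueError("reserved bits set, unknown index structure")
    if typ & 0x08:
        raise ValueError("compressed index not supported")
    kinds = {
        0x00: "leaf",
        0x40: "fork-le",
        0x80: "fork-be",
        0xC0: "leaf-unstructured",
    }
    kind = kinds[typ & 0xC0]
    code = typ & 0x07
    if code > 4:
        raise ValueError("block size code out of range")
    # Leaves count from 2^9 bytes, forks from 2^12 bytes.
    base = 12 if kind.startswith("fork") else 9
    return kind, 1 << (base + code)
```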

The pointer type describes the structure of leaf values:
-       upper two bits 0xC0 give the number of bytes to hold the tag
        from 0 (0x00) to 2 (0x80).
-       next two bits 0x30 give the number of bytes to hold the record id - 3,
        from 0x00 for 3 bytes to 0x30 for 6 bytes
-       the bit 0x08, if set, indicates that the tag is stored after
        the record id (as in CDS/ISIS), else the tag bytes precede the record id
-       lowest 3 bits 0x07 give the number of bytes to hold the position
        (occurrence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes.
        For a value of 5 or 6, we should assume 4 byte position and one or two
        additional bytes for the record id (currently not supported).
Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes.
Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos)
is the same as used by CDS/ISIS.
All numbers in the pointer are stored in big endian byte order
(most significant byte first) for lexical sorting.
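
As an illustration, the pointer type byte and a pointer value can be
taken apart as follows. This is a sketch, not the Malete code: the
function names are hypothetical, and it assumes that the position bytes
always come last, after tag and record id.

```python
def decode_pointer_type(ptr):
    """Return (tag_bytes, rid_bytes, tag_after_rid, pos_bytes)."""
    tag_bytes = ptr >> 6
    if tag_bytes > 2:
        raise ValueError("tag size out of range")
    rid_bytes = 3 + ((ptr >> 4) & 3)
    tag_after_rid = bool(ptr & 0x08)
    pos_bytes = ptr & 0x07
    if pos_bytes > 4:
        raise ValueError("extended record id not supported")
    return tag_bytes, rid_bytes, tag_after_rid, pos_bytes


def unpack_pointer(ptr, value: bytes):
    """Split a pointer value into (rid, tag, pos); all parts big endian."""
    tag_bytes, rid_bytes, tag_after_rid, pos_bytes = decode_pointer_type(ptr)
    if len(value) != tag_bytes + rid_bytes + pos_bytes:
        raise ValueError("value size does not match pointer type")
    if tag_after_rid:
        rid = int.from_bytes(value[:rid_bytes], "big")
        tag = int.from_bytes(value[rid_bytes:rid_bytes + tag_bytes], "big")
    else:
        tag = int.from_bytes(value[:tag_bytes], "big")
        rid = int.from_bytes(value[tag_bytes:tag_bytes + rid_bytes], "big")
    pos = int.from_bytes(value[tag_bytes + rid_bytes:], "big")
    return rid, tag, pos
```

For the CDS/ISIS compatible type 0x8B this gives 2 tag bytes, 3 rid bytes,
tag after rid, and 3 position bytes, i.e. 8 bytes per pointer.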


The dictionary contains 4 byte units, describing the position pos of an entry,
its number of values vln and the length of its key kln,
which is stored in the 4th byte (0 here is actually the empty key,
which is always the first in the leftmost block of every fork level).
In a fork block, the first 2 bytes store the pos in native byte order,
and the 3rd byte is vln.
In a leaf block (max size 8KB), the pos has 13 bits and the vln 11 bits.
The first 3 bytes in a dictionary unit are, independent of platform,
0: pos mod 256 (lower 8 bits),
1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits)
and 2: vln mod 256 (lower 8 bits).
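
The per-byte definition of a leaf dictionary unit translates directly
into code. A minimal sketch with hypothetical function names:

```python
def pack_leaf_slot(pos, vln, kln):
    """Encode a leaf dictionary unit: 13 bit pos, 11 bit vln, 8 bit kln."""
    if not (0 <= pos < 1 << 13 and 0 <= vln < 1 << 11 and 0 <= kln < 256):
        raise ValueError("value out of range")
    return bytes([pos % 256, pos // 256 + 32 * (vln // 256), vln % 256, kln])


def unpack_leaf_slot(unit: bytes):
    """Decode a 4 byte leaf dictionary unit into (pos, vln, kln)."""
    b0, b1, b2, kln = unit[:4]
    pos = b0 + 256 * (b1 & 0x1F)  # lower 5 bits of byte 1 extend pos
    vln = b2 + 256 * (b1 >> 5)    # upper 3 bits of byte 1 extend vln
    return pos, vln, kln
```

Since every byte is defined explicitly, the result is the same on little
and big endian machines.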


Entries in the stack always start at pos with kln bytes holding the key
(as the actual key bytes, or, in a future version, compressed).

In a leaf block, after the key there are vln values sorted in increasing order
as of memcmp. Values have fixed size and structure as described by the
pointer type (typically 8 byte each).
Should the leaf block type be unstructured values, ptr is the actual
value size and no assumptions are made about the structure of values
(to be used independent of a masterfile as general purpose B-Tree).

A special value of all 0 bytes is used for stopwords;
no other value will be associated with the key.
For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0
will be the first and is reserved to store unique keys:
at most one pointer to tag 0 will be associated with a key.

A leaf key currently always has at least one value;
should this key-value association be deleted, the entry is deleted completely.
(With index compression vln 0 will denote a stopword).

Should a key be associated with more values than fit within a block,
the following block starts with an entry with the same key
and next higher value.
(With index compression, we might consider using the empty key there;
to be defined).


In a fork block, currently (no index compression), only vln 0 and 1
are used. With vln 0, the key is directly followed by a 4 byte native
child block number (which is a leaf block number, if the block's level is 1,
else a fork block number). With vln not 0, the key is followed by
vln pairs of size (value size + 4), containing value bytes as in the leaves
followed by the 4 byte child block number. This is used where a key
spans leaf blocks: we have a fork entry with vln 0 pointing to the key's
first block, followed by an entry with the same key plus the starting value
of the next block and so on.
(With index compression, we will use one entry with multiple values).


Fork block number 0 is always the root, and leaf block number 0
is always the leftmost leaf. All keys and values can be looped in order
by starting from leaf 0 and following the nxt pointers.

Note that the layout of leaf blocks is fully platform independent:
record pointers are organized big endian as in CDS/ISIS,
header numbers are little endian, and the dictionary is defined per byte.


*       shared B-L-Tree access

A process accessing the index in exclusive mode may actually shift
keys and values around according to the specified layout
as it sees fit to minimize unused space in blocks.

With shared access, changes are very limited:
-       Deleting a key-value association is done in the leaf block only,
        there are no updates to any fork or other key block.
-       The same holds where an inserted key-value association fits within
        the target leaf block.
-       The only case where data is moved between blocks is when a new pair
        does not fit in its target block: a new block is allocated to become
        the new right sibling of the target block and as many entries are moved
        from the block's end to its new sibling so that the new pair can be
        inserted in one of the blocks. On such a block split, a new entry
        is made in the block's parent fork (which might trigger a split there).
-       Should a process find that the key expected in a block is greater
        than the greatest key there, it must assume that a block split
        occurred and follow the nxt link to inspect the block's successor.
        (That's why it's called B-Link-Tree; invented by Lehman and Yao).
        Actually, as we use full fork file locking, this can only happen
        in leaf blocks.
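
The rule of following the nxt link can be sketched on an in-memory model.
This is only an illustration of the lookup side: blocks are modelled as a
hypothetical dict of block number to sorted keys plus nxt, with nxt 0
meaning no right sibling as in the block header, and locking is left out.

```python
def chase_right(blocks, num, key):
    """Follow nxt links while key is greater than the block's greatest key."""
    while True:
        block = blocks[num]
        keys = block["keys"]
        if not keys or key <= keys[-1] or block["nxt"] == 0:
            return num
        # Key lies beyond this block: assume a split and move right.
        num = block["nxt"]


def lookup(blocks, num, key):
    """Return (block number, found) for key, starting at block num."""
    num = chase_right(blocks, num, key)
    return num, key in blocks[num]["keys"]
```

Keys are compared as byte strings here, which matches memcmp ordering.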


*       limits according to file formats

The masterfile obviously has no limits but filesize.
To break the customary 2GB (32bit signed) barrier there are two approaches,
compiling with large file support and splitting files,
discussed further below.


In general, most 4 and 8 byte numbers are assumed signed entities,
since several system calls handle them that way.
Any 2^x here is to be understood as 2^x-1.


The xref in the small file implementation (8 byte pointers) can handle
-       any legal (i.e. up to 2GB) masterfile size.
        This limit can be broken by masterfile splitting.
-       any record size fitting in there using 4,4,0 pointers
        or records up to 16MB size using 4,3,1 pointers
        (which have a slight performance advantage)
-       any number of fields
-       record ids up to 2^31

The full xref spec can handle
-       large file sizes up to 2^63 (8 EB, about 9.200.000.000.000.000.000 bytes)
-       record sizes up to 2^48 (256 TB)
-       record ids up to 2^63

The index can deal with
-       general purpose key-value pairs of a combined length of up to 255 bytes
        (might be limited to 127 with extensions like index compression)
-       standard values (hit pointers) of 4 to 14 bytes (default: 8)
        allow for keys of up to 241 to 251 bytes (default: 247)
-       pointers to record ids up to 2^63 (currently impl. up to 2^48)
-       pointers to non-negative tags up to 2^16
-       pointers to position information up to 2^31
        (by convention used as 1 or 2 bytes for field occurrence,
        last 2 bytes for word position).
-       up to 2^32 leaf blocks of up to 8KB, totalling to a leaf file
        size of up to 32 TB (with large file support).
        Assuming an average key length of 16 and on average 10 values of 8 bytes
        per key, each block can hold 81 keys, totalling about 320 billion keys.
        The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB.
        This limit can be broken by index splitting.
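
The leaf estimate can be re-derived with a little arithmetic. This is a
sketch assuming that each entry costs one 4 byte dictionary unit plus its
key and values, and that only the 16 byte header is overhead; the function
names are hypothetical.

```python
TB = 2 ** 40


def keys_per_block(block_size=8192, key_len=16, values=10, value_size=8):
    """Entries fitting in one leaf block under the stated averages."""
    entry = 4 + key_len + values * value_size  # dictionary unit + key + values
    return (block_size - 16) // entry


def max_file_size(block_size, block_numbers=2 ** 32):
    """Largest file addressable with 4 byte block numbers."""
    return block_numbers * block_size
```

With the defaults, keys_per_block() gives 81, max_file_size(8192) is 32 TB
and max_file_size(65536) is 256 TB. Multiplying 81 by the full 2^32 block
numbers gives roughly 348 billion keys, the same order of magnitude as the
rounded figure quoted above.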


*       limits of the current implementation

Records have to fit into available memory, which, even when using ridiculous
amounts of RAM and/or large swapfiles, is bound by addressable memory.
On 32 bit architectures, this is usually 1 or 2 GB (on the heap).

Also bound by addressable memory is the possibility to memory map
the access files. While they work without memory mapping,
performance degrades substantially.


So, to use very large databases, the system should be compiled
for a 64 bit machine. Such machines are luckily becoming affordable these
days, and we hope to get such a box in the near future.


*       extending file size limits

Neither large file support nor file splitting is currently implemented.
However, large file support is pretty straightforward.


File splitting for masterfiles works by configuring a number n
and using a series of masterfiles f0, f1 ...
so that records with ids in the range i*n... (i+1)*n-1
are stored in masterfile i. Likewise a series of xref files
is used based on a number m, which should be some multiple of n.
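
Since splitting is not implemented, the following is only a sketch of the
mapping described above; the function names are hypothetical.

```python
def masterfile_index(rid, n):
    """Index i of the masterfile holding rid: i*n <= rid <= (i+1)*n - 1."""
    return rid // n


def xref_index(rid, m):
    """Index of the xref file holding rid, for an xref split size m."""
    return rid // m
```

With n = 1000 and m = 4000, record 2500 would live in masterfile 2 and be
referenced from xref file 0.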


File splitting for the B-Tree is based on configuring a sequence of keys
k0=empty, k1 ... and using a series of leaf files l0, l1 ...
so that keys less than ki and not less than k(i-1) are stored in li.
Fork files could be split on the same keys or some subsequence of those.


The advantage of file splitting over large file support is that
it is more portable, saves some bytes on file positions and could aid backup,
especially where records are mostly appended to the last masterfile.
Moreover it can extend indexes even beyond 32 TB.
The total database size need not even be addressable with 64 bits.

The disadvantage is that it is a little bit more complicated
and when used exhaustively could make the system's open files limit
become a problem. Also tracking and snapshotting masterfiles
is a little bit less trivial.


---
        $Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $
1	dpavlin	604	;man malete 5 $Date: 2004/11/12 11:18:23 $
2			Malete file formats
3
4
5			* meta data
6
7			The Malete options (record 0) file .m0d (with a zero; .mod used otherwise),
8			if present, contains a single serialized record containing metadata,
9			including
10			> RecStruct field definitions
11
12			Named
13			> CharSet collation info
14			is compiled on demand to byte-order dependent .mcx files with
15			magic 'mcx' or 'MCX' for little-/big-endian, resp.
16
17
18			* record data
19
20			The Malete record data file .mrd, a.k.a masterfile, is a plain text file.
21			Lines are terminated by the ASCII newline character with byte value 10.
22			Any superset of ASCII - including UTF-8 - may be used as character encoding.
23			In other words, byte values 0-127 are always interpreted according to ASCII,
24			values 128-255 have no special meaning.
25			More precisely, the byte values 10 (newline), 9 (tabulator),
26			45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W')
27			have structural significance.
28
29
30			Logically, the masterfile is a sequence of records,
31			where every record is terminated by an empty line.
32			A record is substructured as a sequence of non-empty lines,
33			each terminated by a newline.
34
35			Field lines are substructured as a number
36			(optional minus followed by digits) followed by a tabulator
37			followed by any characters but a newline as field values.
38			If the first line of a record does not start with a number,
39			it is used as header line. All other lines must be field lines.
40
41
42			As of now, there is only one valid header line format defined,
43			consisting of the letter 'W', followed by a tabulator and a
44			data record header. The data record header is of the form
45			'rid[@pos][TABleader]'. Here
46			- rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits)
47			- if @pos is present, pos must be the position of the previous version
48			of this record as byte offset from beginning of file in ASCII digits.
49			Malete software supports retrieval of old versions based on this.
50			- the optional leader is arbitrary leader data for the record.
51			It can hold a MARC leader or a unique key ("name") for the record.
52
53			If a record has no header line, its record id is one plus the highest
54			record id used up to that point in the file.
55
56			Using header lines, records can represent updated versions of
57			previously stored records. There is no way to delete records,
58			the operation next to it is to write an empty record.
59
60
61			During normal operation, data in the masterfile is never updated,
62			all writes are appends. Thus live backups and replications can be
63			based on 'tail -f' writing to a tape or piped through netcat.
64			A consistent snapshot can be made by recording the current file size
65			and only accessing record versions at lower positions.
66
67
68			Compacting a masterfile by purging old versions is straightforward,
69			but considered an offline operation
70			(basically printing current versions to a new masterfile).
71			Compacting a mounted database is feasible for the exclusive mode server,
72			but currently not supported.
73
74
75			The masterfile is a special case of the serialized
76			> Protocol
77			, see there for details.
78			The suggested handling of malformed records should not be relied upon.
79
80
81			* record access
82
83			The Malete record access file .mrx, a.k.a cross reference (xref) file,
84			is a binary file dependent on the platform byte order and pagesize.
85			Since the xref file is accessed by memory mapping,
86			its size must be a multiple of the pagesize
87			(which is 4 KB for most Intel based systems).
88			Where the xref is found to be missing, have the wrong byte order or size
89			or be otherwise invalid, it is rebuilt from the masterfile.
90
91
92			The xref is organized in units of a fixed size ranging from 8 to 16 bytes.
93			The unit at offset rid*size is a pointer to the current version of record
94			with id rid in the masterfile (rid starting from 1).
95
96			This pointer consists of
97			- 4 to 8 bytes for the record's position
98			(including any header line)
99			- 3 to 6 bytes for the record's length
100			(including the terminating empty line)
101			- 0 to 2 bytes for the record's number of fields
102			(including 1 header field, even if it's empty)
103			When not compiled with support for large files,
104			only the combinations 4,3,1 and 4,4,0 are implemented.
105
106			If a record's number of fields exceeds the representable range,
107			the corresponding pointer bytes (if any) are set to 0 and the number
108			of field must be determined while reading the record.
109			This number is also 0 for "deleted" records (no fields, empty leader).
110
111
112			The first such unit (at mrx offset 0) consists of
113			- 3 bytes magic number, which is "MRX" for big endian, "mrx" else
114			- 1 byte type, computed as
115			(bytes for pos-4)16+(bytes for length-3)4+(bytes for fields)
116			- a 4 byte integer in native byte order holding the max used rid
117			- if the unit size is 10 or greater (i.e. with largefile support),
118			the following 2 (for 10,11) or 4 bytes are an integer in native byte order
119			holding high order bytes of the max record id.
120
121
122			* query data and access
123
124			The malete query data .mqd and query access .mqx files hold
125			the leaves and forks (inner nodes) of a B-Link-Tree, resp.,
126			which is usually used as index associating keys generated from masterfile
127			records with pointers to such records.
128
129
130			Both files are organized in fixed size blocks containing binary integer numbers
131			and arbitrary key strings (encoded according to the collation configuration).
132			Blocks contain
133			- 16 bytes of "header" data
134			- a "dictionary",
135			which is an array of 4 byte units describing entries in the block
136			with position, length of key and number of values.
137			- a "stack" holding the entries,
138			growing downwards from the end of the block.
139			For dictionary slots d[0]..d[n], describing entries of size s0..sn,
140			entry i occupies si consecutive bytes starting at (block size)-(s0+...+si).
141
142			The header contains 8 unsigned binary numbers:
143			- num 4 bytes block number
144			- typ 1 byte block type (bitmask; see below)
145			- ksz 1 byte maximum key length
146			(default 0 treated like the maximum 255; CDS/ISIS uses 30)
147			- ptr 1 byte pointer type (bitmask; see below)
148			- lev 1 byte level over bottom (0 for leaves)
149			- nxt 4 bytes number of right sibling block (0 if none)
150			- ent 2 bytes number of entries (length of dictionary)
151			- stk 2 bytes offset of 1st byte used by the stack
152			The first 5 numbers are actually redundant since the block number
153			must match the blocks file position, the three configurable bytes
154			must be the same among all leaves and forks, resp.,
155			and the level is not really needed.
156			However, wasting this 8 bytes serves as a minimal check
157			and makes handling much easier.
158
159			In leaf blocks, the byte order of all header numbers is little endian
160			(so big endian machines have to swap bytes when reading/writing leaf blocks).
161			In fork blocks, the header numbers are in native byte order
162			(since they are usually not accessed by explicit reading and writing,
163			but through memory mapping).
164
165
166			The block type is:
167			- upper 2 bits 0xC0 give the basic block type:
168			0x00 for a standard leaf block,
169			0x40 and 0x80 for little and big endian forks,
170			0xC0 for leaves with unstructured values.
171			- next two bits are clear and reserved for future extensions;
172			should software see such a bit set,
173			it must not assume anything about the index structure.
174			- the bit 0x08, if set, will indicate that the index is compressed.
175			Each key is stored as one byte giving the length
176			of the common prefix to its predecessor followed by changed bytes.
177			This is not yet supported and current software will refuse to
178			access such an index.
179			Also do not confuse this with compressing each key individually based on
180			> CharSet collation recoding
181			- lowest 3 bits 0x07 give the block size.
182			For leaves, this is freely configurable as bitcount-9,
183			from 0 for 512 (2^9) bits to 4 for 8KB (2^13).
184			For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16),
185			and must match the system's pagesize (should there be a system with
186			a pagesize of less than 4 KB, 4KB is used).
187
188			The pointer type describes the structure of leaf values:
189			- upper two bits 0xC0 give the number of bytes to hold the tag
190			from 0 (0x00) to 2 (0x80).
191			- next two bits 0x30 give the number of bytes to hold the record id - 3,
192			from 0x00 for 3 bytes to 0x30 for 6 bytes
193			- the bit 0x08, if set, indicates that the tag is stored after
194			the record id (as in CDS/ISIS), else the tag bytes precede the record id
195			- lowest 3 bits 0x07 give the number of bytes to hold the position
196			(occurence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes.
197			For a value of 5 or 6, we should assume 4 byte position and one or two
198			additional bytes for the record id (currently not supported).
199			Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes.
200			Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos)
201			is the same as used by CDS/ISIS.
202			All numbers in the pointer are stored in big endian byte order
203			(most significant byte first) for lexical sorting.
204
205
206			The dictionary contains 4 byte units, describing the position pos of an entry,
207			its number of values vln and the length of its key kln,
208			which is stored in the 4th byte (0 here is actually the empty key,
209			which is always the first in the leftmost block of every fork level).
210			In a fork block, the first 2 bytes store the pos in native byte order,
211			and the 3rd byte is vln.
212			In a leaf block (max size 8KB), the pos has 13bit and the vln 11bit.
213			The first 3 bytes in a dictionary unit are, independent of platform,
214			0: pos mod 256 (lower 8 bits),
215			1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits)
216			and 2: vln mod 256 (lower 8 bits).
217
218
219			Entries in the stack always start at pos with kln bytes holding the key
220			(as the actual key bytes, or, in a future version, compressed).
221
222			In a leaf block, after the key there are vln values sorted in increasing order
223			as of memcmp. Values have fixed size and structure as described by the
224			pointer type (typically 8 byte each).
225			Should the leaf block type be unstructured values, ptr is the actual
226			value size and no assumptions are made about the structure of values
227			(to be used independent of a masterfile as general purpose B-Tree).
228
229			A special value of all 0 bytes is used for stopwords;
230			no other value will be associated with the key.
231			For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0
232			will be the first and is reserved to store unique keys:
233			at most one pointer to tag 0 will be associated with a key.
234
235			A leaf key currently always has at least one value;
236			should this key-value association be deleted, the entry is deleted completely.
237			(With index compression vln 0 will denote a stopword).
238
239			Should a key be associated with more values than fit within a block,
240			the following block starts with an entry with the same key
241			and next higher value.
242			(With index compression, we might consider using the empty key there;
243			to be defined).
244
245
246			In a fork block, currently (no index compression), only vln 0 and 1
247			are used. With vln 0, the key is directly followed by a 4 byte native
248			child block number (which is a leaf block number, if the block's level is 1,
249			else a fork block number). With vln not 0, they key is followed by
250			vln pairs of size (value size + 4), containing value bytes as in the leaves
251			followed by the 4 byte child block number. This is used where a key
252			spans leave blocks: we have a fork entry with vln 0 pointing to the key's
253			first block, followed by an entry with the same key plus the starting value
254			of the next block and so on.
255			(With index compression, we will use one entry with multiple values).
256
257
258			Fork block number 0 is always the root, and leaf block number 0
259			is always the leftmost leaf. All keys and values can be looped in order
260			by starting from leaf 0 and following the nxt pointers.
261
262			Note that the layout of leaf blocks is fully platform independent:
263			record pointers are organized big endian as in CDS/ISIS,
264			header numbers are little endian, and the dictionary is defined per byte.
265
266
267			* shared B-L-Tree access
268
269			A process accessing the index in exclusive mode may actually shift
270			keys and values around according to the specified layout
271			as it seems fit to minimize unused space in blocks.
272
273			With shared access, changes are very limited:
274			- Deleting a key-value association is done in the leaf block only,
275			there are no updates to any fork or other key block.
276			- The same holds where an inserted key-value association fits within
277			the target leaf block.
278			- The only case where data is moved between blocks is when a new pair
279			does not fit in its target block: a new block is allocated to become
280			the new right sibling of the target block and as many entries are moved
281			from the block's end to its new sibling so that the new pair can be
282			inserted in one of the blocks. On such a block split, a new entry
283			is made in the block's parent fork (which might trigger a split there).
284			- Should a process find that the key expected in a block is greater
285			than the greatest key there, it must assume that a block split
286			occurred and follow the nxt link to inspect the block's successor.
287			(That's why it's called B-Link-Tree; invented by Lehmann and Yao).
288			Actually, as we use full fork file locking, this can only happen
289			in leave blocks.
290
291
292			* limits according to file formats
293
294			The masterfile obviously has no limits but filesize.
295			To break the customary 2GB (32bit signed) barrier there are two approaches,
296			compiling with large file support and splitting files,
297			discussed further below.
298
299
300			In general, most 4 and 8 byte numbers are assumed signed entities,
301			since several system calls handle them that way.
302			Any 2^x here is to be understood as 2^x-1.
303
304
305			The xref in the small file implementation (8 byte pointers) can handle
306			- any legal (i.e. up to 2GB) masterfile size.
307			This limit can be broken by masterfile splitting.
308			- any record size fitting in there using 4,4,0 pointers
309			or records up to 16MB size using 4,3,1 pointers
310			(which have a slight performance advantage)
311			- any number of fields
312			- record ids up to 2^31
313
314			The full xref spec can handle
315			- large file sizes up to 2^63 (about 9.200.000.000.000.000.000)
316			- record sizes up to 2^48 (256 TB)
317			- record ids up to 2^63
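
The xref figures above can be checked with a little arithmetic.
The helper signed_max is a hypothetical name introduced here.
Reading the "3" in "4,3,1 pointers" as a three byte length field
(hence 2^24 bytes = 16MB) is an assumption of this sketch.

```python
def signed_max(bits: int) -> int:
    """Largest value of a signed integer of the given width, i.e. 2^(bits-1) - 1."""
    return 2 ** (bits - 1) - 1


MB = 2 ** 20
TB = 2 ** 40

small_file_max = signed_max(32)    # the "2GB" limit for file size and record ids
large_file_max = signed_max(64)    # the "2^63" limit of the full xref spec
three_byte_len = 2 ** 24           # what a 3 byte length can express (assumption)
full_rec_size = 2 ** 48            # record size limit of the full xref spec

print(small_file_max)              # 2147483647
print(large_file_max)              # 9223372036854775807
print(three_byte_len // MB, "MB")  # 16 MB
print(full_rec_size // TB, "TB")   # 256 TB
```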
318
319			The index can deal with
320			- general purpose key-value pairs of a combined length of up to 255 bytes
321			(might be limited to 127 with extensions like index compression)
322			- standard values (hit pointers) of 4 to 14 bytes (default: 8)
323			allow for keys of up to 251 down to 241 bytes (default: 247)
324			- pointers to record ids up to 2^63 (currently impl. up to 2^48)
325			- pointers to non-negative tags up to 2^16
326			- pointers to position information up to 2^31
327			(by convention used as 1 or 2 bytes for field occurrence,
328			last 2 bytes for word position).
329			- up to 2^32 leaf blocks of up to 8KB, adding up to a leaf file
330			size of up to 32 TB (with large file support).
331			Assuming an average key length of 16 and on average 10 values of 8 bytes
332			per key, each block can hold 81 keys, totalling about 320 billion keys.
333			The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB.
334			This limit can be broken by index splitting.
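
The index figures above follow from simple arithmetic, sketched here.
All names are hypothetical. The 81 keys per block are taken from the text
as given, and the fork file size is assumed to be 2^32 blocks of one
64K page each.

```python
KB = 2 ** 10
TB = 2 ** 40

MAX_PAIR = 255                       # combined key-value length limit in bytes


def max_key_len(value_len: int) -> int:
    """Longest key that still fits next to a value of `value_len` bytes."""
    return MAX_PAIR - value_len


leaf_blocks = 2 ** 32
leaf_file = leaf_blocks * 8 * KB     # leaf blocks of up to 8KB
fork_file = 2 ** 32 * 64 * KB        # assumption: 2^32 fork blocks of 64K
keys_total = 81 * leaf_blocks        # 81 keys per block, as stated in the text

print(max_key_len(4), max_key_len(8), max_key_len(14))   # 251 247 241
print(leaf_file // TB, "TB")                             # 32 TB
print(fork_file // TB, "TB")                             # 256 TB
print(keys_total)                                        # 347892350976
```

The key total comes out at the same order of magnitude as the figure
given in the text, which reads as a rounded estimate based on about
4 billion blocks.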
335
336
337			* limits of the current implementation
338
339			Records have to fit into available memory, which, even when using ridiculous
340			amounts of RAM and/or large swapfiles, is bound by addressable memory.
341			On 32 bit architectures, this is usually 1 or 2 GB (on the heap).
342
343			Also bound by addressable memory is the possibility to memory map
344			the access files. While they work without memory mapping,
345			performance degrades substantially.
346
347
348			So, to use very large databases, the system should be compiled
349			for a 64 bit machine. Luckily, such machines are becoming affordable these
350			days, and we hope to get such a box in the near future.
351
352
353			* extending file size limits
354
355			Neither large file support nor file splitting is currently implemented.
356			However, large file support is pretty straightforward.
357
358
359			File splitting for masterfiles works by configuring a number n
360			and using a series of masterfiles f0, f1 ...
361			so that records with ids in the range i*n ... (i+1)*n-1
362			are stored in masterfile i. Likewise a series of xref files
363			is used based on a number m, which should be some multiple of n.
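
Since splitting is not implemented, the following is only a sketch of how
a record id could be mapped to its masterfile and xref file. The function
names and the values chosen for n and m are hypothetical.

```python
from typing import Tuple


def masterfile_index(rid: int, n: int) -> int:
    """Index i of the masterfile holding record `rid`: i*n <= rid <= (i+1)*n - 1."""
    return rid // n


def xref_index(rid: int, m: int) -> int:
    """Index of the xref file covering record `rid`, split on the number m."""
    return rid // m


def rid_range(i: int, n: int) -> Tuple[int, int]:
    """First and last record id stored in masterfile i."""
    return (i * n, (i + 1) * n - 1)


n = 1_000_000        # hypothetical records per masterfile
m = 4 * n            # hypothetical: one xref file per four masterfiles

print(masterfile_index(999_999, n))     # 0
print(masterfile_index(1_000_000, n))   # 1
print(xref_index(3_999_999, m))         # 0
print(rid_range(2, n))                  # (2000000, 2999999)
```

Choosing m as a multiple of n means that one masterfile never spans two
xref files.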
364
365
366			File splitting for the B-Tree is based on configuring a sequence of keys
367			k0=empty, k1 ... and using a series of leaf files l0, l1 ...
368			so that keys not less than ki and less than k(i+1) are stored in li.
369			Fork files could be split on the same keys or some subsequence of those.
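
A sketch of selecting the leaf file for a key, again for a feature that
is not implemented. It assumes that leaf file li holds the keys from ki
up to, but not including, k(i+1), so that l0 starts at the empty key k0,
and that keys compare bytewise (a collation would need its own ordering).
The function name and the split keys are hypothetical.

```python
import bisect
from typing import List


def leaf_file_index(split_keys: List[bytes], key: bytes) -> int:
    """Index i of the leaf file for `key`, with split_keys[0] the empty key."""
    # bisect_right counts the split keys that are <= key; the last of
    # those is the ki that starts the range containing the key.
    return bisect.bisect_right(split_keys, key) - 1


split_keys = [b"", b"h", b"q"]    # k0 = empty, k1, k2 (hypothetical values)

print(leaf_file_index(split_keys, b"apple"))   # 0
print(leaf_file_index(split_keys, b"h"))       # 1
print(leaf_file_index(split_keys, b"zebra"))   # 2
```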
370
371
372			The advantage of file splitting over large file support is that
373			it is more portable, saves some bytes on file positions and could aid backup,
374			especially where records are mostly appended to the last masterfile.
375			Moreover it can extend indexes even beyond 32 TB.
376			The total database size need not even be addressable with 64 bits.
377
378			The disadvantage is that it is a little bit more complicated
379			and when used extensively could make the system's open files limit
380			become a problem. Also tracking and snapshotting masterfiles
381			is a little bit less trivial.
382
383
384			---
385			$Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $