0.9.9e/doc/FileFormats.txt

;man    malete  5       $Date: 2004/11/12 11:18:23 $
Malete file formats


*       meta data

The Malete options (record 0) file .m0d (with a zero; .mod used otherwise),
if present, contains a single serialized record containing metadata,
including
>       RecStruct       field definitions

Named
>       CharSet collation info
is compiled on demand to byte-order dependent .mcx files with
magic 'mcx' or 'MCX' for little-/big-endian, resp.


*       record data

The Malete record data file .mrd, a.k.a. masterfile, is a plain text file.
Lines are terminated by the ASCII newline character with byte value 10.
Any superset of ASCII - including UTF-8 - may be used as character encoding.
In other words, byte values 0-127 are always interpreted according to ASCII,
values 128-255 have no special meaning.
More precisely, the byte values 10 (newline), 9 (tabulator),
45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W')
have structural significance.


Logically, the masterfile is a sequence of records,
where every record is terminated by an empty line.
A record is substructured as a sequence of non-empty lines,
each terminated by a newline.

Field lines are substructured as a number
(optional minus followed by digits) followed by a tabulator
followed by any characters but a newline as field values.
If the first line of a record does not start with a number,
it is used as header line. All other lines must be field lines.


As of now, there is only one valid header line format defined,
consisting of the letter 'W', followed by a tabulator and a
data record header. The data record header is of the form
'rid[@pos][*TAB*leader]'. Here
-       rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits)
-       if @pos is present, pos must be the position of the previous version
        of this record as byte offset from beginning of file in ASCII digits.
        Malete software supports retrieval of old versions based on this.
-       the optional leader is arbitrary leader data for the record.
        It can hold a MARC leader or a unique key ("name") for the record.

If a record has no header line, its record id is one plus the highest
record id used up to that point in the file.
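
The record structure described above can be sketched as a small parser.
This is only an illustration, not the Malete implementation: the names
parse_record and parse_masterfile are made up here, malformed input simply
raises an error, and an incomplete trailing record is ignored.

```python
def parse_record(lines, max_rid):
    """Turn the non-empty lines of one record into a dict."""
    # Without a header line, the id is one plus the highest id seen so far.
    rid, prev, leader = max_rid + 1, None, b""
    # A first line not starting with a number is the header line.
    if lines and not (lines[0][:1] == b"-" or lines[0][:1].isdigit()):
        kind, _, header = lines[0].partition(b"\t")
        if kind != b"W":
            raise ValueError("unknown header line format")
        ident, _, leader = header.partition(b"\t")
        rid_text, at, pos_text = ident.partition(b"@")
        rid = int(rid_text)
        prev = int(pos_text) if at else None
        lines = lines[1:]
    fields = []
    for line in lines:
        tag, tab, value = line.partition(b"\t")
        if not tab:
            raise ValueError("field line without tabulator")
        fields.append((int(tag), value))
    return {"rid": rid, "prev": prev, "leader": leader, "fields": fields}


def parse_masterfile(data):
    """Split masterfile bytes into records, tracking byte positions."""
    records, lines, start, offset, max_rid = [], [], 0, 0, 0
    for line in data.split(b"\n")[:-1]:  # drop the piece after the last newline
        offset += len(line) + 1
        if line:
            lines.append(line)
            continue
        # An empty line terminates the current record.
        record = parse_record(lines, max_rid)
        record["pos"] = start
        max_rid = max(max_rid, record["rid"])
        records.append(record)
        lines, start = [], offset
    return records
```

A later record with the same rid and an @pos link represents an updated
version; the "pos" entry of each dict is the byte offset that such a link
would refer to.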

Using header lines, records can represent updated versions of
previously stored records. There is no way to delete records;
the closest equivalent is to write an empty record.


During normal operation, data in the masterfile is never updated;
all writes are appends. Thus live backups and replications can be
based on 'tail -f' writing to a tape or piped through netcat.
A consistent snapshot can be made by recording the current file size
and only accessing record versions at lower positions.
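
A snapshot reader might locate the version of a record that belongs to
the snapshot roughly as follows. This is a sketch under assumptions:
the function names are hypothetical, the starting position is taken to be
that of the current version (e.g. as found via the xref), and the whole
masterfile is held in memory as bytes.

```python
def previous_position(data, pos):
    """Return the @pos value of the record header at pos, or None."""
    line = data[pos:data.index(b"\n", pos)]
    kind, tab, header = line.partition(b"\t")
    if kind != b"W" or not tab:
        return None  # no header line, hence no link to an older version
    ident = header.partition(b"\t")[0]
    _, at, prev = ident.partition(b"@")
    return int(prev) if at else None


def version_in_snapshot(data, pos, snapshot_size):
    """Follow @pos links from pos back to a version inside the snapshot."""
    while pos is not None and pos >= snapshot_size:
        pos = previous_position(data, pos)
    return pos  # None if the record did not exist at snapshot time
```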


Compacting a masterfile by purging old versions is straightforward,
but considered an offline operation
(basically printing current versions to a new masterfile).
Compacting a mounted database is feasible for the exclusive mode server,
but currently not supported.


The masterfile is a special case of the serialized
>       Protocol
, see there for details.
The suggested handling of malformed records should not be relied upon.


*       record access

The Malete record access file .mrx, a.k.a. cross reference (xref) file,
is a binary file dependent on the platform byte order and pagesize.
Since the xref file is accessed by memory mapping,
its size must be a multiple of the pagesize
(which is 4 KB for most Intel based systems).
Where the xref is found to be missing, have the wrong byte order or size
or be otherwise invalid, it is rebuilt from the masterfile.


The xref is organized in units of a fixed size ranging from 8 to 16 bytes.
The unit at offset rid*size is a pointer to the current version of record
with id rid in the masterfile (rid starting from 1).

This pointer consists of
-       4 to 8 bytes for the record's position
        (including any header line)
-       3 to 6 bytes for the record's length
        (including the terminating empty line)
-       0 to 2 bytes for the record's number of fields
        (including 1 header field, even if it's empty)
When not compiled with support for large files,
only the combinations 4,3,1 and 4,4,0 are implemented.

If a record's number of fields exceeds the representable range,
the corresponding pointer bytes (if any) are set to 0 and the number
of fields must be determined while reading the record.
This number is also 0 for "deleted" records (no fields, empty leader).


The first such unit (at mrx offset 0) consists of
-       3 bytes magic number, which is "MRX" for big endian, "mrx" else
-       1 byte type, computed as
        (bytes for pos-4)*16+(bytes for length-3)*4+(bytes for fields)
-       a 4 byte integer in native byte order holding the max used rid
-       if the unit size is 10 or greater (i.e. with largefile support),
        the following 2 (for 10,11) or 4 bytes are an integer in native byte order
        holding high order bytes of the max record id.
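
The type byte and the first unit can be illustrated like this. The sketch
assumes that the three parts of the first unit follow each other without
padding, reads the max used rid in the byte order indicated by the magic,
and ignores the high order bytes present with larger units. All function
names are made up for the example.

```python
import struct


def xref_type(pos_bytes, len_bytes, fld_bytes):
    """Compute the type byte from the three pointer part sizes."""
    return (pos_bytes - 4) * 16 + (len_bytes - 3) * 4 + fld_bytes


def xref_sizes(type_byte):
    """Invert xref_type: return (pos_bytes, len_bytes, fld_bytes)."""
    return (type_byte >> 4) + 4, ((type_byte >> 2) & 3) + 3, type_byte & 3


def parse_xref_header(unit):
    """Decode magic, type and the low 4 bytes of the max used rid."""
    magic = unit[:3]
    if magic == b"MRX":
        order = ">"  # written on a big endian machine
    elif magic == b"mrx":
        order = "<"
    else:
        raise ValueError("not an xref file")
    sizes = xref_sizes(unit[3])
    (max_rid,) = struct.unpack_from(order + "i", unit, 4)
    return order, sizes, max_rid


def unit_offset(rid, sizes):
    """Byte offset in the xref of the pointer for record id rid."""
    return rid * sum(sizes)
```

For the two small file combinations this gives type 1 for 4,3,1 and
type 4 for 4,4,0, both with a unit size of 8 bytes.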


*       query data and access

The Malete query data .mqd and query access .mqx files hold
the leaves and forks (inner nodes) of a B-Link-Tree, resp.,
which is usually used as index associating keys generated from masterfile
records with pointers to such records.


Both files are organized in fixed size blocks containing binary integer numbers
and arbitrary key strings (encoded according to the collation configuration).
Blocks contain
-       16 bytes of "header" data
-       a "dictionary",
        which is an array of 4 byte units describing entries in the block
        with position, length of key and number of values.
-       a "stack" holding the entries,
        growing downwards from the end of the block.
        For dictionary slots d[0]..d[n], describing entries of size s0..sn,
        entry i occupies si consecutive bytes starting at (block size)-(s0+...+si).
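
As a small worked example of the stack layout, the start offsets follow
from the running sum of the entry sizes (entry_offsets is a name chosen
for this illustration only):

```python
from itertools import accumulate


def entry_offsets(block_size, sizes):
    """Start offset of each entry; the stack grows down from the block end."""
    return [block_size - total for total in accumulate(sizes)]
```

In a 512 byte block with entries of 10, 20 and 5 bytes, the entries start
at offsets 502, 482 and 477.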

The header contains 8 unsigned binary numbers:
-       num 4 bytes block number
-       typ 1 byte block type (bitmask; see below)
-       ksz 1 byte maximum key length
        (default 0 treated like the maximum 255; CDS/ISIS uses 30)
-       ptr 1 byte pointer type (bitmask; see below)
-       lev 1 byte level over bottom (0 for leaves)
-       nxt 4 bytes number of right sibling block (0 if none)
-       ent 2 bytes number of entries (length of dictionary)
-       stk 2 bytes offset of 1st byte used by the stack
The first 5 numbers are actually redundant, since the block number
must match the block's file position, the three configurable bytes
must be the same among all leaves and forks, resp.,
and the level is not really needed.
However, wasting these 8 bytes serves as a minimal check
and makes handling much easier.

In leaf blocks, the byte order of all header numbers is little endian
(so big endian machines have to swap bytes when reading/writing leaf blocks).
In fork blocks, the header numbers are in native byte order
(since they are usually not accessed by explicit reading and writing,
but through memory mapping).
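
Reading the header could look as follows. The sketch assumes that the
8 numbers are stored in the order listed, without padding; the names
BlockHeader and parse_block_header are not taken from the Malete sources.

```python
import struct
from collections import namedtuple

BlockHeader = namedtuple("BlockHeader", "num typ ksz ptr lev nxt ent stk")


def parse_block_header(block, leaf=True):
    """Decode the 16 byte header at the start of a block."""
    # Leaves are little endian; forks use the native order of this machine.
    fmt = ("<" if leaf else "=") + "IBBBBIHH"
    header = BlockHeader._make(struct.unpack_from(fmt, block))
    # A maximum key length of 0 is treated like the maximum 255.
    return header._replace(ksz=header.ksz or 255)
```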


The block type is:
-       upper 2 bits 0xC0 give the basic block type:
        0x00 for a standard leaf block,
        0x40 and 0x80 for little and big endian forks,
        0xC0 for leaves with unstructured values.
-       next two bits are clear and reserved for future extensions;
        should software see such a bit set,
        it must not assume anything about the index structure.
-       the bit 0x08, if set, will indicate that the index is compressed.
        Each key is stored as one byte giving the length
        of the common prefix to its predecessor followed by changed bytes.
        This is not yet supported and current software will refuse to
        access such an index.
        Also do not confuse this with compressing each key individually based on
>       CharSet collation recoding
-       lowest 3 bits 0x07 give the block size.
        For leaves, this is freely configurable as bitcount-9,
        from 0 for 512 bytes (2^9) to 4 for 8KB (2^13).
        For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16),
        and must match the system's pagesize (should there be a system with
        a pagesize of less than 4 KB, 4KB is used).
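
Decoding the block type byte might be sketched like this. The kind labels
and the function name are invented for the example, and the sketch simply
refuses reserved bits and compressed indexes:

```python
def decode_block_type(typ):
    """Return (kind, block size in bytes) for a block type byte."""
    kind = {0x00: "leaf", 0x40: "fork-le",
            0x80: "fork-be", 0xC0: "leaf-unstructured"}[typ & 0xC0]
    if typ & 0x30:
        raise ValueError("reserved bits set, index structure unknown")
    if typ & 0x08:
        raise ValueError("compressed index not supported")
    size_code = typ & 0x07
    if size_code > 4:
        raise ValueError("block size out of range")
    # Leaves count from 2^9 bytes, forks from 2^12 bytes.
    base = 12 if kind.startswith("fork") else 9
    return kind, 1 << (base + size_code)
```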

The pointer type describes the structure of leaf values:
-       upper two bits 0xC0 give the number of bytes to hold the tag
        from 0 (0x00) to 2 (0x80).
-       next two bits 0x30 give the number of bytes to hold the record id,
        minus 3, from 0x00 for 3 bytes to 0x30 for 6 bytes
-       the bit 0x08, if set, indicates that the tag is stored after
        the record id (as in CDS/ISIS), else the tag bytes precede the record id
-       lowest 3 bits 0x07 give the number of bytes to hold the position
        (occurrence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes.
        For a value of 5 or 6, we should assume 4 byte position and one or two
        additional bytes for the record id (currently not supported).
Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes.
Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos)
is the same as used by CDS/ISIS.
All numbers in the pointer are stored in big endian byte order
(most significant byte first) for lexical sorting.
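
The pointer type can be decoded and a pointer assembled as in the
following sketch. It assumes that the position bytes always come last,
after tag and record id; the function names are hypothetical, and the
unsupported position codes 5 and 6 are rejected.

```python
def pointer_layout(ptr):
    """Return (tag_bytes, rid_bytes, tag_last, pos_bytes) for a pointer type."""
    tag_bytes = ptr >> 6
    rid_bytes = ((ptr >> 4) & 3) + 3
    tag_last = bool(ptr & 0x08)
    pos_bytes = ptr & 0x07
    if tag_bytes > 2 or pos_bytes > 4:
        raise ValueError("unsupported pointer type")
    return tag_bytes, rid_bytes, tag_last, pos_bytes


def pack_pointer(ptr, tag, rid, pos):
    """Build the big endian byte string for one leaf value."""
    tag_bytes, rid_bytes, tag_last, pos_bytes = pointer_layout(ptr)
    t = tag.to_bytes(tag_bytes, "big")
    r = rid.to_bytes(rid_bytes, "big")
    p = pos.to_bytes(pos_bytes, "big")
    return (r + t if tag_last else t + r) + p
```

With type 0x8B, record 1, tag 24, occurrence 2 and word position 5 are
packed into the 8 bytes 00 00 01 00 18 02 00 05. Since all parts are
big endian, comparing such byte strings orders them by record id first.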


The dictionary contains 4 byte units, describing the position pos of an entry,
its number of values vln and the length of its key kln,
which is stored in the 4th byte (0 here is actually the empty key,
which is always the first in the leftmost block of every fork level).
In a fork block, the first 2 bytes store the pos in native byte order,
and the 3rd byte is vln.
In a leaf block (max size 8KB), the pos has 13bit and the vln 11bit.
The first 3 bytes in a dictionary unit are, independent of platform,
0: pos mod 256 (lower 8 bits),
1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits)
and 2: vln mod 256 (lower 8 bits).
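
For leaf blocks, the byte-wise definition translates directly into code.
The function names are chosen for this example only:

```python
def pack_leaf_slot(pos, vln, kln):
    """Encode one 4 byte dictionary unit of a leaf block."""
    if not (0 <= pos < 1 << 13 and 0 <= vln < 1 << 11 and 0 <= kln < 256):
        raise ValueError("value out of range")
    return bytes([pos % 256, pos // 256 + 32 * (vln // 256), vln % 256, kln])


def unpack_leaf_slot(unit):
    """Decode a 4 byte dictionary unit into (pos, vln, kln)."""
    b0, b1, b2, kln = unit[:4]
    pos = b0 + 256 * (b1 % 32)   # lower 5 bits of byte 1 belong to pos
    vln = b2 + 256 * (b1 // 32)  # upper 3 bits of byte 1 belong to vln
    return pos, vln, kln
```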


Entries in the stack always start at pos with kln bytes holding the key
(as the actual key bytes, or, in a future version, compressed).

In a leaf block, after the key there are vln values sorted in increasing order
as of memcmp. Values have fixed size and structure as described by the
pointer type (typically 8 byte each).
Should the leaf block type be unstructured values, ptr is the actual
value size and no assumptions are made about the structure of values
(to be used independent of a masterfile as general purpose B-Tree).

A special value of all 0 bytes is used for stopwords;
no other value will be associated with the key.
For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0
will be the first and is reserved to store unique keys:
at most one pointer to tag 0 will be associated with a key.

A leaf key currently always has at least one value;
should this key-value association be deleted, the entry is deleted completely.
(With index compression vln 0 will denote a stopword).

Should a key be associated with more values than fit within a block,
the following block starts with an entry with the same key
and next higher value.
(With index compression, we might consider using the empty key there;
to be defined).


In a fork block, currently (no index compression), only vln 0 and 1
are used. With vln 0, the key is directly followed by a 4 byte native
child block number (which is a leaf block number, if the block's level is 1,
else a fork block number). With vln not 0, the key is followed by
vln pairs of size (value size + 4), containing value bytes as in the leaves
followed by the 4 byte child block number. This is used where a key
spans leaf blocks: we have a fork entry with vln 0 pointing to the key's
first block, followed by an entry with the same key plus the starting value
of the next block and so on.
(With index compression, we will use one entry with multiple values).


Fork block number 0 is always the root, and leaf block number 0
is always the leftmost leaf. All keys and values can be looped in order
by starting from leaf 0 and following the nxt pointers.

Note that the layout of leaf blocks is fully platform independent:
record pointers are organized big endian as in CDS/ISIS,
header numbers are little endian, and the dictionary is defined per byte.


*       shared B-L-Tree access

A process accessing the index in exclusive mode may actually shift
keys and values around according to the specified layout
as it sees fit to minimize unused space in blocks.

With shared access, changes are very limited:
-       Deleting a key-value association is done in the leaf block only,
        there are no updates to any fork or other key block.
-       The same holds where an inserted key-value association fits within
        the target leaf block.
-       The only case where data is moved between blocks is when a new pair
        does not fit in its target block: a new block is allocated to become
        the new right sibling of the target block and as many entries are moved
        from the block's end to its new sibling so that the new pair can be
        inserted in one of the blocks. On such a block split, a new entry
        is made in the block's parent fork (which might trigger a split there).
-       Should a process find that the key expected in a block is greater
        than the greatest key there, it must assume that a block split
        occurred and follow the nxt link to inspect the block's successor.
        (That's why it's called B-Link-Tree; invented by Lehman and Yao).
        Actually, as we use full fork file locking, this can only happen
        in leaf blocks.
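
The rule of following the nxt link can be illustrated with an in-memory
model, where each block is reduced to its sorted keys and its nxt number.
This is a simplified sketch, not the actual locking or search code; in
particular, peeking at the first key of the successor to decide when to
stop is an assumption of this example.

```python
def locate_leaf(blocks, num, key):
    """Find the leaf responsible for key, starting at block num.

    blocks maps a block number to (sorted list of keys, nxt),
    where nxt 0 means there is no right sibling.
    """
    while True:
        keys, nxt = blocks[num]
        if nxt == 0 or (keys and key <= keys[-1]):
            return num
        # The key is greater than the greatest key here: a split may have
        # moved it to the right sibling, so inspect the successor.
        next_keys, _ = blocks[nxt]
        if not next_keys or key < next_keys[0]:
            return num
        num = nxt
```

A reader that was directed to block 0 by a stale fork entry still arrives
at the right leaf by walking to the right.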


*       limits according to file formats

The masterfile obviously has no limits but filesize.
To break the customary 2GB (32bit signed) barrier there are two approaches,
compiling with large file support and splitting files,
discussed further below.


In general, most 4 and 8 byte numbers are assumed signed entities,
since several system calls handle them that way.
Any 2^x here is to be understood as 2^x-1.


The xref in the small file implementation (8 byte pointers) can handle
-       any legal (i.e. up to 2GB) masterfile size.
        This limit can be broken by masterfile splitting.
-       any record size fitting in there using 4,4,0 pointers
        or records up to 16MB size using 4,3,1 pointers
        (which have a slight performance advantage)
-       any number of fields
-       record ids up to 2^31

The full xref spec can handle
-       large file sizes up to 2^63 (about 9.200.000.000.000.000.000)
-       record sizes up to 2^48 (256 TB)
-       record ids up to 2^63

The index can deal with
-       general purpose key-value pairs of a combined length of up to 255 bytes
        (might be limited to 127 with extensions like index compression)
-       standard values (hit pointers) of 4 to 14 bytes (default: 8)
        allow for keys of up to 251 down to 241 bytes (default: 247)
-       pointers to record ids up to 2^63 (currently impl. up to 2^48)
-       pointers to non-negative tags up to 2^16
-       pointers to position information up to 2^31
        (by convention used as 1 or 2 bytes for field occurrence,
        last 2 bytes for word position).
-       up to 2^32 leaf blocks of up to 8KB, totalling to a leaf file
        size of up to 32 TB (with large file support).
        Assuming an average key length of 16 and on average 10 values of 8 bytes
        per key, each block can hold 81 keys, totalling about 350 billion keys.
        The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB.
        This limit can be broken by index splitting.
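
The arithmetic behind the index size figures can be retraced as follows.
The sketch counts, per entry, the key, its values and one 4 byte
dictionary slot, and subtracts the 16 byte header from the block size;
all names are chosen for this illustration.

```python
def keys_per_block(block_size, key_len, values, value_size):
    """Entries fitting in a block with 16 byte header and 4 byte slots."""
    entry = key_len + values * value_size + 4
    return (block_size - 16) // entry


MAX_BLOCKS = 2 ** 32                      # 4 byte block numbers
leaf_file_size = MAX_BLOCKS * 8 * 1024    # 2^45 bytes = 32 TB
fork_file_size = MAX_BLOCKS * 64 * 1024   # 2^48 bytes = 256 TB
total_keys = MAX_BLOCKS * keys_per_block(8192, 16, 10, 8)
```

Under these assumptions an 8KB leaf holds 81 entries of 100 bytes each,
and total_keys comes out at roughly 3.5 * 10^11.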


*       limits of the current implementation

Records have to fit into available memory, which, even when using ridiculous
amounts of RAM and/or large swapfiles, is bound by addressable memory.
On 32 bit architectures, this is usually 1 or 2 GB (on the heap).

Also bound by addressable memory is the possibility to memory map
the access files. While they work without memory mapping,
performance degrades substantially.


So, to use very large databases, the system should be compiled
for a 64 bit machine. Such machines are luckily becoming affordable these
days, and we hope to get such a box in the near future.


*       extending file size limits

Neither large file support nor file splitting is currently implemented.
However, large file support is pretty straightforward.


File splitting for masterfiles works by configuring a number n
and using a series of masterfiles f0, f1 ...
so that records with ids in the range i*n... (i+1)*n-1
are stored in masterfile i. Likewise a series of xref files
is used based on a number m, which should be some multiple of n.
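
Since splitting is not implemented, the following is only a sketch of the
mapping described here, with hypothetical function names:

```python
def masterfile_index(rid, n):
    """Index i of the masterfile holding record id rid."""
    return rid // n


def xref_index(rid, m):
    """Index of the xref file holding the pointer for record id rid."""
    return rid // m
```

With n = 1000 and m = 4000, records 1000 to 1999 live in masterfile 1,
and the pointers for masterfiles 0 to 3 all live in xref file 0.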


File splitting for the B-Tree is based on configuring a sequence of keys
k0=empty, k1 ... and using a series of leaf files l0, l1 ...
so that keys not less than ki and less than k(i+1) are stored in li.
Fork files could be split on the same keys or some subsequence of those.
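
Selecting the leaf file for a key could be done by binary search over the
split keys. The sketch assumes that leaf file i holds the keys from split
key i up to, but excluding, split key i+1, so that l0 starts at the empty
key k0; keys are compared as plain byte strings, and the function name is
made up for this example.

```python
from bisect import bisect_right


def leaf_file_index(split_keys, key):
    """Index of the leaf file for key; split_keys[0] must be the empty key."""
    return bisect_right(split_keys, key) - 1
```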


The advantage of file splitting over large file support is that
it is more portable, saves some bytes on file positions and could aid backup,
especially where records are mostly appended to the last masterfile.
Moreover it can extend indexes even beyond 32 TB.
The total database size need not even be addressable with 64 bits.

The disadvantage is that it is a little bit more complicated
and when used exhaustively could make the system's open files limit
become a problem. Also tracking and snapshotting masterfiles
is a little bit less trivial.


---
        $Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $