openisis/doc/Serialized.txt

ISIS records serialized

Serialization means to convert the internal representation 
of an ISIS record to a sequence of bytes ("octets") suitable
to be stored in a file or transferred via a network.

The serialization format described here is used by OpenIsis
for both database master files and network communications.


*       design goals

The serialization format should be
-       easy to use
        for programmers and tool writers
-       efficient
        in execution time and space used
-       robust
        a broken masterfile should be fixable using a text editor
-       versatile
        can be used for a variety of applications using a variety of tools
-       without limits
        in number and size of records and fields


*       basic format

In general, a record is serialized by
-       serializing meta information
-       serializing the fields in order
-       appending a blank line

Fields are serialized as
-       the field tag printed using ASCII decimal digits
        (optionally preceded by a minus sign, if negative tags are allowed)
-       a (horizontal) TAB character (ASCII value 9, ^I)
-       the field value
-       a newline character (ASCII value 10, ^J)
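
The field layout above can be sketched in Python as follows.
This is only an illustration, not the OpenIsis implementation:
the name serialize_record and the representation of a record as a list
of (tag, value) pairs are assumptions, and values are assumed to be
free of newline characters.

```python
def serialize_record(fields):
    """Serialize (tag, value) pairs as tag, TAB, value, newline lines,
    followed by the blank line that ends the record.

    Hypothetical helper: tags are ints (negative allowed),
    values are bytes without newline characters.
    """
    out = bytearray()
    for tag, value in fields:
        if b"\n" in value:
            raise ValueError("value contains a newline")
        out += str(tag).encode("ascii")  # decimal digits, optional minus
        out += b"\t"                     # TAB, ASCII 9
        out += value
        out += b"\n"                     # newline, ASCII 10
    out += b"\n"                         # blank line terminates the record
    return bytes(out)
```

For instance, serialize_record([(24, b"foo")]) yields the five-plus-two
bytes "24", TAB, "foo", newline, newline.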

Metadata is serialized in the same way,
using special tags according to the needs of the environment.
Two situations are distinguished:
-       "soft" metadata, which may and should be accessible as part of the record.
        This is encoded by convention using negative tags.
        An example of this is HTTP and other MIME-style communication,
        where the MIME headers like "User-agent" or "Date" are encoded
        in such a way, while content data like GET or POST parameters
        should be mapped to positive IDs.
-       "hard" metadata, which must not interfere with the record contents
        in order for the environment to work properly.
        This is encoded using a single non-digit character instead of tag digits.
        An example of this is the MFN in a master file and information
        regarding record deletion or update.

The final blank line may be omitted
where only a single record is contained in an otherwise delimited byte sequence.

A reader should support a lazy mode,
allowing the TAB to be omitted, where unambiguous
(the field value does not start with a digit or a TAB).
Writers, however, are strongly urged to write the TAB.

Tags with leading zeros are allowed (typically written with %03d)
and must be read as decimal, as atoi does, never as octal.
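
A lazy reader for a single field line might look like the following
Python sketch. The name parse_field_line is an assumption made for
illustration; the line is expected without its trailing newline.

```python
import re

# Tag = optional minus sign plus decimal digits; the TAB is optional (lazy mode).
_FIELD = re.compile(rb"(-?[0-9]+)(\t?)(.*)", re.DOTALL)

def parse_field_line(line):
    """Split one field line into (tag, value).

    Hypothetical helper: accepts both "24<TAB>foo" and the lazy "24foo".
    """
    m = _FIELD.fullmatch(line)
    if m is None:
        raise ValueError("not a field line: %r" % line)
    # Base 10 is given explicitly, so "024" is 24 and never octal 20.
    tag = int(m.group(1).decode("ascii"), 10)
    return tag, m.group(3)
```

Because the digits are matched greedily, omitting the TAB is only safe
when the value does not start with a digit, as stated above.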


*       newline conventions

Two ways are supported to deal with newline characters in field values:
-       in "text mode", newlines are replaced with vertical tabs (ASCII 11, ^K)
        on serialization and vice versa on deserialization.
-       in "binary mode", newlines are replaced by newline-TAB sequences.
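
Both conventions amount to simple byte replacements.
The following Python sketch uses hypothetical function names
and is not taken from the OpenIsis sources.

```python
def encode_text(value):
    """Text mode: newline (ASCII 10) becomes vertical tab (ASCII 11)."""
    return value.replace(b"\n", b"\x0b")

def decode_text(data):
    """Text mode: every vertical tab becomes a newline again."""
    return data.replace(b"\x0b", b"\n")

def encode_binary(value):
    """Binary mode: every newline is followed by an inserted TAB."""
    return value.replace(b"\n", b"\n\t")

def decode_binary(data):
    """Binary mode: drop the TAB that follows each newline."""
    return data.replace(b"\n\t", b"\n")
```

Note that decode_text(encode_text(v)) differs from v whenever v itself
contains a vertical tab, whereas the binary pair round-trips any value.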

As an alternative to these protocol-level modes one may choose "field mode"
at the application level: simply declare that you are not interested in
newlines and replace any by tab or space (without ever converting back).


The advantages of text mode over binary mode are
-       it is slightly faster than the binary translation
-       the serialized records do not need more space
        than the internal representation
        (whereas the binary serialization might need nearly twice as much
        in the worst case)
-       it is easily used with line-oriented utilities like grep or sed,
        since each field is contained within one line

The binary mode (which resembles MIME continuation lines) has the
advantage of not losing vertical tab characters that might have
been contained in the original field values.
It is fully transparent and can be used to store any binary data like images
with an average overhead of 0.4%
(as compared to +33% needed by BASE64 encoding).
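
The overhead figures can be checked with a few lines of Python.
The sketch assumes (this is an assumption, not stated above) that byte
values in binary data are uniformly distributed, so that one byte in 256
is a newline; the function name is hypothetical.

```python
import base64

def binary_mode_overhead(data):
    """Relative size increase when every newline gains a following TAB."""
    return data.count(b"\n") / len(data)

# Every byte value occurs equally often in this sample.
sample = bytes(range(256)) * 4

uniform = binary_mode_overhead(sample)           # 1/256, about 0.4%
worst = binary_mode_overhead(b"\n" * 100)        # 1.0, i.e. twice the size
b64 = len(base64.b64encode(sample)) / len(sample) - 1   # about one third
```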


The OpenIsis server automatically detects binary mode,
if the client uses a continuation line.


*       masterfile format

A basic masterfile consists of a blank-line separated series of records.

The first record is the "controlling record",
containing descriptive information such as the newline convention,
the subfield separator and the character encoding.
All of this is optional; the masterfile might just start with a blank line.

The MFNs are then assigned implicitly in order, starting from 1.
There is no distinction between
empty (consecutive blanklines) and deleted records.
There is absolutely no redundant information contained.
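
Reading such a basic masterfile takes little code.
The following Python sketch is an illustration under assumptions:
the name read_basic_masterfile is hypothetical, and only plain field
lines are handled (no meta lines, no binary mode).

```python
def read_basic_masterfile(data):
    """Return (control_record, {mfn: [field lines]}).

    Hypothetical reader: the first record is the controlling record,
    the following records get MFNs 1, 2, 3, ... in order.
    Empty records count as deleted and are left out of the dict.
    """
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                     # piece after the final newline
    records, current = [], []
    for line in lines:
        if line == b"":
            records.append(current)     # a blank line ends a record
            current = []
        else:
            current.append(line)
    if current:
        records.append(current)         # tolerate a missing final blank line
    control = records[0] if records else []
    live = {mfn: rec for mfn, rec in enumerate(records[1:], start=1) if rec}
    return control, live
```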

Masterfile compression creates this state
(however, it may choose to use special meta lines for long ranges
of deleted records, see below -- but those are inefficient in
the Xref, anyway).
Such a masterfile can be very easily created by any tool (like Perl and such).
The Xref file can be recreated easily, quickly and reliably.


When writing to the database, information is ALWAYS appended to the end.
There is NO OVERWRITING of any data, ever, period.
That way data CANNOT BE DESTROYED by any operation
(one could advise the operating system to set a mandatory read lock on that),
and all changes are easily traced using tail -f.

A binary mode masterfile starts with a "broken" continuation line
containing a single TAB.


*       basic mode writing

In metalines, all numbers are given in decimal digits
and multiple items are separated by TABs.
The optional timestamps are an arbitrary prefix of the generalized time format
YYYYMMDDhhmmssttt..., as printed by 'date +%Y%m%d%H%M%S' plus milliseconds ttt,
optionally followed by up to 14 characters to create a unique request id.

Write operations use the following meta line (preceding record data):
-       W       mfn     [oldpos [timestamp]]
        followed by new record data denotes a write of record mfn.
        Depending on the needs of the environment,
        the byte offset of the last version might be added in order to
        support access to old versions (e.g. for delayed index update).
        oldpos may be given as position[.length[.fields]]

A W with no data following is mostly equivalent to a delete.
If an otherwise written mfn is higher than the highest known,
the highest known (and thus the implicit counter) is set to this.
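
A reader might take a W line apart as in the following Python sketch.
The name parse_w_line and the returned dictionary layout are assumptions
made for illustration only.

```python
def parse_w_line(line):
    """Parse 'W<TAB>mfn[<TAB>oldpos[<TAB>timestamp]]' (without newline).

    Hypothetical helper: oldpos is returned as a tuple of one to three
    ints for position[.length[.fields]], the timestamp as a string.
    """
    parts = line.decode("ascii").split("\t")
    if parts[0] != "W" or not 2 <= len(parts) <= 4:
        raise ValueError("not a W meta line: %r" % line)
    result = {"mfn": int(parts[1], 10), "oldpos": None, "timestamp": None}
    if len(parts) > 2:
        result["oldpos"] = tuple(int(n, 10) for n in parts[2].split("."))
    if len(parts) > 3:
        result["timestamp"] = parts[3]
    return result
```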

A lazy reader should not require meta lines to be followed by a blank line,
where unambiguous.
For writers, however, the blank line is strongly recommended.


A reasonable size limit on metalines is 127(+newline), since
-       22 = 1+1+20 for operator, tab and mfn (64 bits ~ 20 digits)
-       43 = 1+20+1+10+1+10 for tab position.length.fields
-       32 = 1+17+14 for tab, milliseconds time and id
totals to 97, so we have some room left.


*       advanced mode

For advanced space efficiency at the cost of increased read access time,
the following lines may be used:
-       D       mfn     [oldpos [timestamp]]
        a record entry consisting of only a D line denotes the deletion of
        record number mfn.
        Basically equivalent to a write with no data following,
        but a little bit more explicit.
-       I       mfn [timestamp]
        (set id) override the implicit MFN counter,
        e.g. after a series of deleted rows or just to be explicit.
        Basically equivalent to a write with no old pos,
        but a little bit more explicit.
        Recommended when appending records after some deletes.
-       C       mfn     oldpos [timestamp]
        introduces a series of patch commands specifying how record mfn
        was changed.

Software writing the masterfile will typically choose to
write full updates and MUST provide a switch to force this,
in order to be compatible with basic mode readers.

However, the patch command language is particularly useful
with server operations, to avoid the need for read-write
sequences with some sort of locking.


*       the patch language

The patch commands are lines starting with special
characters like +, -, ~ and so on,
followed by (an optional TAB and) field addresses, TAB and field data.


The simplest case is the '+' command,
meaning that its data is to be appended to the record.
The + and TAB may be omitted
(both, in order to not be confused with a continuation line).
In other words, the add command may look exactly like an ordinary
field line.

A series of '=' commands works exactly like the set operation in an
OpenIsis Tcl record. Especially, field indexes and subfields are supported.
The '-' command resembles the del operation.

A detailed description of the "patch language" is yet to be written.


example
$
C       1234
=       24      foo
=       24      bar
25      baz
$
changes record 1234 by setting the first two occurrences of
field 24 to foo and bar, respectively, deleting any other occurrences
of field 24, and appending a field 25 with value baz.
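
Since the patch language is not yet specified in detail, the following
Python sketch covers only what the example above describes: '=' and
'+' commands on whole fields. The name apply_patch, the tuple
representation of commands, and the choice to append '=' values for
which no existing occurrence is left are all assumptions.

```python
def apply_patch(record, commands):
    """Apply '=' (set) and '+' (append) commands to a list of (tag, value).

    Hypothetical helper: commands are (op, tag, value) tuples.
    Field indexes, subfields and the '-' command are not handled.
    """
    sets, appends = {}, []
    for op, tag, value in commands:
        if op == "=":
            sets.setdefault(tag, []).append(value)
        elif op == "+":
            appends.append((tag, value))
        else:
            raise ValueError("unsupported command: %r" % op)
    result = []
    used = {tag: 0 for tag in sets}
    for tag, value in record:
        if tag not in sets:
            result.append((tag, value))       # untouched field
        elif used[tag] < len(sets[tag]):
            result.append((tag, sets[tag][used[tag]]))
            used[tag] += 1
        # any further occurrence of a set tag is deleted
    for tag, values in sets.items():
        # assumption: surplus '=' values are appended
        result.extend((tag, v) for v in values[used[tag]:])
    result.extend(appends)
    return result
```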


*       the pointer file

The pointer file is an array of n-Byte (n >= 6) entries,
the ith entry referencing mfn i, similar to the traditional .XRF (crossref).

The n=k+l+m bytes specify two or three numbers (in native byte order)
of up to 8 bytes each:
-       the first k bytes (k >= 4) give the position of the record
        (or its last update or change entry)
-       the next l bytes (l >= 2) give the length of the record
        (excluding the last field's terminating newline and following blank line)
-       the final m bytes (m >= 0) give the number of fields.
        If m is 0, or all bits in a field number are set (for large records),
        the reader has to determine the number of fields by inspecting the record.

The first six bytes of the first entry describe the detailed layout.
Four bytes are the "magic number" containing the ASCII characters "ISIX".
Two bytes are the number (m*256 + l*16 + k) in native byte order.
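
The header and entry layout can be sketched with Python's struct module.
The function names are hypothetical. Decoding the layout number relies
on k and l each being below 16, which holds since each number takes
up to 8 bytes.

```python
import struct
import sys

MAGIC = b"ISIX"

def pack_header(k, l, m):
    """Build the six header bytes: magic number plus layout number."""
    # "=H" is an unsigned 16 bit number in native byte order.
    return MAGIC + struct.pack("=H", m * 256 + l * 16 + k)

def unpack_header(header):
    """Return (k, l, m) from the first six bytes of a pointer file."""
    if header[:4] != MAGIC:
        raise ValueError("missing ISIX magic number")
    (code,) = struct.unpack("=H", header[4:6])
    return code % 16, (code // 16) % 16, code // 256

def pack_entry(pos, length, nfields, k, l, m):
    """Build one entry of n = k+l+m bytes in native byte order."""
    order = sys.byteorder
    entry = pos.to_bytes(k, order) + length.to_bytes(l, order)
    if m:
        entry += nfields.to_bytes(m, order)
    return entry
```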

The minimum case k=4, l=2 imposes the limits of traditional ISIS.
Actually the lower limits are not enforced;
in a very specialised application one might want to use k/l/m = 3/1/0.
Recommended are at least eight bytes as k/l/m = 4/3/1 or 4/4/0
or, with large file support, 5/3/0.


However, when any of the limits is reached,
or an unsupported combination or byte order is found,
the Xref can easily be recreated with greater values.
The 12 byte pointer with k/l/m = 6/4/2 will be enough for
even gigantic databases of a quarter Petabyte (262,144 Gigabytes).
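
The size limit follows directly from the number of position bytes k.
A short Python check, with a hypothetical function name and binary
units (1 Gigabyte taken as 2^30 bytes):

```python
GIGABYTE = 2 ** 30
PETABYTE = 2 ** 50

def max_addressable_bytes(k):
    """Number of distinct byte positions expressible in k bytes."""
    return 2 ** (8 * k)

six = max_addressable_bytes(6)    # k = 6 as in the 6/4/2 layout
four = max_addressable_bytes(4)   # k = 4, the minimum case
```

Here six equals a quarter of PETABYTE, and four equals 4 * GIGABYTE.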

The number of fields is redundant, but as an optimization
may make life a little bit easier for a reader.
If the pointer structure has m>0 (typically m=1),
a value of 0 must be stored if the number of fields exceeds the representable
range and a reader should be prepared to figure it out itself in that case.


---
        $Id: Serialized.txt,v 1.8 2003/05/30 13:26:34 kripke Exp $