ISIS records serialized

Serialization means converting the internal representation
of an ISIS record into a sequence of bytes ("octets") suitable
for storage in a file or transfer over a network.

The serialization format described here is used by OpenIsis
for both database master files and network communications.


* design goals

The serialization format should be
- easy to use
  for programmers and tool writers
- efficient
  in execution time and space used
- robust
  a broken masterfile should be repairable with a text editor
- versatile
  usable for a variety of applications with a variety of tools
- without limits
  on the number and size of records and fields


* basic format

In general, a record is serialized by
- serializing meta information
- serializing the fields in order
- appending a blank line

Fields are serialized as
- the field tag printed using ASCII decimal digits
  (optionally preceded by a minus sign, if negative tags are allowed)
- a (horizontal) TAB character (ASCII value 9, ^I)
- the field value
- a newline character (ASCII value 10, ^J)

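The field and record rules above can be sketched in a few lines of Python.
This is only an illustration: the function names are made up here, UTF-8 is
an assumed character encoding, and values containing newlines are rejected
because they need one of the newline conventions.

```python
def serialize_field(tag: int, value: str) -> bytes:
    """One field: decimal tag, TAB, value, newline."""
    if "\n" in value:
        # Embedded newlines need a newline convention; out of scope here.
        raise ValueError("value contains a newline")
    # UTF-8 is an assumption; the format itself does not fix an encoding.
    return f"{tag}\t{value}\n".encode("utf-8")


def serialize_record(fields) -> bytes:
    """All fields in order, followed by the terminating blank line."""
    return b"".join(serialize_field(t, v) for t, v in fields) + b"\n"
```

Negative tags (soft metadata) pass through unchanged, since the minus sign
is simply part of the decimal tag.
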
Metadata is serialized in the same way,
using special tags according to the needs of the environment.
Two situations are distinguished:
- "soft" metadata, which may and should be accessible as part of the record.
  By convention, this is encoded using negative tags.
  An example is HTTP and other MIME-style communication,
  where MIME headers like "User-agent" or "Date" are encoded
  this way, while content data like GET or POST parameters
  should be mapped to positive IDs.
- "hard" metadata, which must not interfere with the record contents
  in order for the environment to work properly.
  This is encoded using a single non-digit character instead of tag digits.
  Examples are the MFN in a master file and information
  regarding record deletion or update.

The final blank line may be omitted
where only a single record is contained in an otherwise delimited byte sequence.

A reader should support a lazy mode,
allowing the TAB to be omitted where this is unambiguous
(the field value does not start with a digit or a TAB).
Writers, however, are strongly urged to write the TAB.

Tags with leading zeros are allowed (typically written with %03d)
and must be read as decimal, never as octal (for example by using atoi).


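A lazy reader for ordinary field lines might look like the following
sketch. The name parse_field_line is hypothetical, and hard metadata lines
(which start with a non-digit character) are deliberately not handled.

```python
import re

# Optional minus sign and decimal digits, an optional TAB, then the value.
_FIELD = re.compile(r"(-?[0-9]+)\t?(.*)", re.DOTALL)


def parse_field_line(line: str):
    """Return (tag, value) for one field line, accepting a missing TAB."""
    match = _FIELD.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a field line: %r" % line)
    tag_text, value = match.groups()
    # Base 10 is forced, so a tag like 024 is twenty-four, not octal.
    return int(tag_text, 10), value
```

Because the digits are matched greedily, the TAB can only be left out when
the value does not itself start with a digit or a TAB, which is exactly the
lazy mode restriction.
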
* newline conventions

Two ways are supported to deal with newline characters in field values:
- in "text mode", newlines are replaced with vertical tabs (ASCII 11, ^K)
  on serialization and vice versa on deserialization.
- in "binary mode", newlines are replaced by newline-TAB sequences.

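As a minimal sketch, both conventions are plain byte substitutions on the
field value. The function names are invented for this illustration.

```python
def encode_text(value: bytes) -> bytes:
    """Text mode: newline becomes vertical tab."""
    return value.replace(b"\n", b"\x0b")


def decode_text(data: bytes) -> bytes:
    """Text mode: vertical tab becomes newline (original VTs are lost)."""
    return data.replace(b"\x0b", b"\n")


def encode_binary(value: bytes) -> bytes:
    """Binary mode: newline becomes newline + TAB (a continuation line)."""
    return value.replace(b"\n", b"\n\t")


def decode_binary(data: bytes) -> bytes:
    """Binary mode: drop the TAB that follows each newline."""
    return data.replace(b"\n\t", b"\n")
```

Only the binary pair is a true round trip for arbitrary input; in text mode
a vertical tab in the original value comes back as a newline.
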
As an alternative to these protocol-level modes one may choose "field mode"
at the application level: simply declare that you are not interested in
newlines and replace each one with a tab or space (without ever converting back).


The advantages of text mode over binary mode are
- it is slightly faster than the binary translation
- the serialized records do not need more space
  than the internal representation
  (whereas the binary serialization might need nearly twice as much
  in the worst case)
- it is easily used with line-oriented utilities like grep or sed,
  since each field is contained within one line

The binary mode (which resembles MIME continuation lines) has the
advantage of not losing vertical tab characters that might have
been contained in the original field values.
It is fully transparent and can be used to store any binary data, such as images,
with an average overhead of 0.4%
(as compared to the +33% needed by BASE64 encoding).


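The overhead figures can be checked with a little arithmetic. The sketch
below assumes that binary mode adds exactly one TAB per newline byte and
that "average" refers to uniformly distributed byte values; both are
readings of the text above, not statements from the OpenIsis sources.

```python
def binary_overhead(value: bytes) -> float:
    """Extra bytes added by binary mode, relative to the value size."""
    return value.count(b"\n") / len(value)


# One byte value in 256 is a newline: 1/256 = 0.0039, i.e. about 0.4%.
average = binary_overhead(bytes(range(256)))

# A value made only of newlines doubles in size.
worst = binary_overhead(b"\n" * 10)

# BASE64 writes 4 output bytes for every 3 input bytes.
base64 = 4 / 3 - 1
```
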
The OpenIsis server automatically detects binary mode
if the client uses a continuation line.


* masterfile format

A basic masterfile consists of a series of records separated by blank lines.

The first record is the "controlling record",
containing descriptive information such as the newline convention,
the subfield separator and the character encoding.
All of this is optional; the masterfile might just start with a blank line.

The MFNs are then assigned implicitly in order, starting from 1.
There is no distinction between
empty (consecutive blank lines) and deleted records.
The file contains no redundant information at all.

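Reading such a file takes little more than counting blank lines. The
following sketch assumes a text mode masterfile without any meta lines;
read_basic_masterfile is a name chosen for this example.

```python
def read_basic_masterfile(text: str):
    """Return (controlling_record, {mfn: field_lines})."""
    records, current = [], []
    for line in text.splitlines():
        if line == "":
            records.append(current)  # a blank line ends a record
            current = []
        else:
            current.append(line)
    if current:
        records.append(current)  # final blank line was omitted
    controlling = records[0] if records else []
    # Record number i in the file is MFN i; empty records count as deleted.
    live = {mfn: rec for mfn, rec in enumerate(records) if mfn > 0 and rec}
    return controlling, live
```

For the input "\n24\tfoo\n\n\n24\tbar\n\n" the controlling record is
empty, MFN 1 and MFN 3 hold one field each, and MFN 2 is deleted.
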
Masterfile compression creates this state
(however, it may choose to use special meta lines for long ranges
of deleted records, see below; those are inefficient in
the Xref anyway).
Such a masterfile is very easy to create with any tool (Perl and the like).
The Xref file can be recreated easily, quickly and reliably.


When writing to the database, information is ALWAYS appended to the end.
There is NO OVERWRITING of any data, ever, period.
That way data CANNOT BE DESTROYED by any operation
(one could advise the operating system to set a mandatory read lock on that),
and all changes are easily traced using tail -f.

A binary mode masterfile starts with a "broken" continuation line
containing a single TAB.


* basic mode writing

In meta lines, all numbers are given in decimal digits
and multiple items are separated by TABs.
The optional timestamps are an arbitrary prefix of generalized time format
YYYYMMDDhhmmssttt..., as produced by 'date +%Y%m%d%H%M%S' plus milliseconds ttt,
optionally followed by up to 14 characters to create a unique request id.

Write operations use the following meta line (preceding the record data):
- W mfn [oldpos [timestamp]]
  followed by new record data denotes a write of record mfn.
  Depending on the needs of the environment,
  the byte offset of the last version may be added in order to
  support access to old versions (e.g. for delayed index update).
  oldpos may be given as position[.length[.fields]]

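A writer could build the W line roughly as follows. This is a sketch under
the rules stated above; w_line and timestamp are hypothetical helper names,
and the request id suffix is left out.

```python
from datetime import datetime


def timestamp(now: datetime) -> str:
    """YYYYMMDDhhmmss plus three digits of milliseconds (17 characters)."""
    return now.strftime("%Y%m%d%H%M%S") + "%03d" % (now.microsecond // 1000)


def w_line(mfn, position=None, length=None, fields=None, stamp=None) -> str:
    """Format 'W mfn [oldpos [timestamp]]' with TAB-separated items."""
    items = ["W", str(mfn)]
    if position is None:
        if stamp is not None:
            raise ValueError("a timestamp requires oldpos")
        return "\t".join(items) + "\n"
    oldpos = str(position)
    if length is not None:
        oldpos += ".%d" % length
        if fields is not None:
            oldpos += ".%d" % fields
    items.append(oldpos)
    if stamp is not None:
        items.append(stamp)
    return "\t".join(items) + "\n"
```

The record data follows the returned line; the blank line that terminates
the record is not part of the meta line.
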
A W with no data following is mostly equivalent to a delete.
Otherwise, if a written mfn is higher than the highest known,
the highest known (and thus the implicit counter) is set to it.

A lazy reader should not require meta lines to be followed by a blank line
where this is unambiguous.
For writers, however, the blank line is strongly recommended.


A reasonable size limit on meta lines is 127 (+newline), since
- 22 = 1+1+20 for operator, tab and mfn (64 bits ~ 20 digits)
- 43 = 1+20+1+10+1+10 for tab and position.length.fields
- 32 = 1+17+14 for tab, milliseconds time and id
total 97, so there is some room left.


* advanced mode

For better space efficiency at the cost of increased read access time,
the following lines may be used:
- D mfn [oldpos [timestamp]]
  a record entry consisting of only a D line denotes the deletion of
  record number mfn.
  Basically equivalent to a write with no data following,
  but a little more explicit.
- I mfn [timestamp]
  (set id) overrides the implicit MFN counter,
  e.g. after a series of deleted rows or just to be explicit.
  Basically equivalent to a write with no old pos,
  but a little more explicit.
  Recommended when appending records after some deletes.
- C mfn oldpos [timestamp]
  introduces a series of patch commands specifying how record mfn
  was changed.

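The W, D, I and C lines share one shape, so a reader can split them with a
single routine. The sketch below only takes the line apart; it does not
interpret it, and parse_meta_line is a name invented here.

```python
def parse_meta_line(line: str):
    """Return (operator, mfn, oldpos, timestamp) for a W, D, I or C line.

    oldpos is a tuple of one to three ints (position, length, fields)
    or None; timestamp is the raw string or None.
    """
    op, *items = line.rstrip("\n").split("\t")
    if op not in ("W", "D", "I", "C") or not items:
        raise ValueError("not a meta line: %r" % line)
    mfn = int(items[0], 10)
    rest = items[1:]
    oldpos = None
    if op != "I" and rest:  # an I line carries no oldpos
        oldpos = tuple(int(part, 10) for part in rest.pop(0).split("."))
    if op == "C" and oldpos is None:
        raise ValueError("C requires oldpos")
    stamp = rest[0] if rest else None
    return op, mfn, oldpos, stamp
```
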
Software writing the masterfile will typically choose to
write full updates and MUST provide a switch to force this,
in order to be compatible with basic mode readers.

However, the patch command language is particularly useful
for server operations, as it avoids the need for read-write
sequences with some sort of locking.


* the patch language

The patch commands are lines starting with special
characters like +, -, ~ and so on,
followed by (an optional TAB and) field addresses, TAB and field data.


The simplest case is the '+' command,
meaning that its data is to be appended to the record.
The + and TAB may be omitted
(both together, so that the line is not confused with a continuation line).
In other words, the add command may look exactly like an ordinary
field line.

A series of '=' commands works exactly like the set operation in an
OpenIsis Tcl record. In particular, field indexes and subfields are supported.
The '-' command resembles the del operation.

A detailed description of the "patch language" is still to be written.


example
$
C 1234
= 24 foo
= 24 bar
25 baz
$
changes record 1234 by setting the first two occurrences of
field 24 to foo and bar, respectively, deleting any other occurrences
of field 24, and appending a field 25 with value baz.


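Since the patch language is not yet specified in detail, the following
Python sketch only mirrors the behaviour of this one example. It is an
assumption, not the OpenIsis implementation: commands are given as already
parsed (op, tag, value) tuples, only '=' and '+' are handled, and all sets
are applied before any appends.

```python
def apply_patch(record, commands):
    """record: list of (tag, value); commands: list of (op, tag, value)."""
    # Collect the '=' values per tag, in command order.
    sets = {}
    for op, tag, value in commands:
        if op == "=":
            sets.setdefault(tag, []).append(value)
    used = {tag: 0 for tag in sets}
    out = []
    for tag, value in record:
        if tag not in sets:
            out.append((tag, value))  # untouched field
        elif used[tag] < len(sets[tag]):
            out.append((tag, sets[tag][used[tag]]))  # overwrite occurrence
            used[tag] += 1
        # else: surplus occurrence of a set tag is dropped
    # Set values beyond the existing occurrences are added at the end.
    for tag, values in sets.items():
        out.extend((tag, v) for v in values[used[tag]:])
    # '+' (or a plain field line) appends.
    out.extend((tag, v) for op, tag, v in commands if op == "+")
    return out
```

Applied to a record with three occurrences of field 24 and one field 30,
the example patch leaves 24=foo, 24=bar, the field 30, and a new 25=baz.
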
* the pointer file

The pointer file is an array of n-byte (n >= 6) entries,
the ith entry referencing mfn i, similar to the traditional .XRF (crossref).

The n=k+l+m bytes specify two or three numbers (in native byte order)
of up to 8 bytes each:
- the first k bytes (k >= 4) give the position of the record
  (or its last update or change entry)
- the next l bytes (l >= 2) give the length of the record
  (excluding the last field's terminating newline and following blank line)
- the final m bytes (m >= 0) give the number of fields.
  If m is 0, or all bits in a field number are set (for large records),
  the reader has to determine the number of fields by inspecting the record.

The first six bytes of the first entry describe the detailed layout.
Four bytes are the "magic number" containing the ASCII characters "ISIX".
Two bytes are the number (m*256 + l*16 + k) in native byte order.

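The layout can be illustrated with Python's integer/bytes conversions.
This is a sketch only: the function names are hypothetical, and padding the
header entry with zero bytes up to n is an assumption, since only its first
six bytes are defined above.

```python
import sys

MAGIC = b"ISIX"
ORDER = sys.byteorder  # "native byte order" of the running machine


def pack_header(k: int, l: int, m: int) -> bytes:
    """First entry: magic, 2-byte layout code, zero padding (assumed)."""
    code = (m * 256 + l * 16 + k).to_bytes(2, ORDER)
    return (MAGIC + code).ljust(k + l + m, b"\0")


def unpack_header(entry: bytes):
    """Return (k, l, m) from the first six bytes of the first entry."""
    if entry[:4] != MAGIC:
        raise ValueError("not a pointer file")
    code = int.from_bytes(entry[4:6], ORDER)
    return code % 16, (code // 16) % 16, code // 256


def pack_entry(pos, length, nfields, k, l, m) -> bytes:
    """Entry for one mfn: k bytes position, l bytes length, m bytes count."""
    count = nfields.to_bytes(m, ORDER) if m else b""
    return pos.to_bytes(k, ORDER) + length.to_bytes(l, ORDER) + count


def unpack_entry(entry: bytes, k, l, m):
    """Return (pos, length, nfields); nfields is None when m is 0."""
    pos = int.from_bytes(entry[:k], ORDER)
    length = int.from_bytes(entry[k:k + l], ORDER)
    nfields = int.from_bytes(entry[k + l:k + l + m], ORDER) if m else None
    return pos, length, nfields
```

With this layout the entry for mfn i starts at byte offset i*n of the
pointer file, the header occupying the slot of the nonexistent mfn 0.
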
The minimum case k=4, l=2 imposes the limits of traditional ISIS.
Actually the lower limits are not enforced;
in a very specialised application one might want to use k/l/m = 3/1/0.
At least eight bytes are recommended, as k/l/m = 4/3/1 or 4/4/0
or, with large file support, 5/3/0.


However, when any of the limits is reached,
or an unsupported combination or byte order is found,
the Xref can easily be recreated with greater values.
The 12 byte pointer with k/l/m = 6/4/2 will be enough for
even gigantic databases of a quarter petabyte (262,144 gigabytes).

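The limits follow directly from the byte counts. As a small worked check,
assuming unsigned numbers and binary units (1 gigabyte = 2**30 bytes),
with helper names chosen for this example:

```python
def max_position(k: int) -> int:
    """Largest byte offset that fits into k bytes."""
    return 2 ** (8 * k) - 1


def max_length(l: int) -> int:
    """Largest record length that fits into l bytes."""
    return 2 ** (8 * l) - 1


GIGABYTE = 2 ** 30

# k = 6 addresses 2**48 bytes, i.e. 262,144 gigabytes.
six_byte_file = (max_position(6) + 1) // GIGABYTE

# k = 4 addresses 4 gigabytes; l = 2 allows records of up to 65,535 bytes.
four_byte_file = (max_position(4) + 1) // GIGABYTE
two_byte_record = max_length(2)
```
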
The number of fields is redundant, but as an optimization
it may make life a little easier for a reader.
If the pointer structure has m>0 (typically m=1),
a value of 0 must be stored if the number of fields exceeds the representable
range, and a reader should be prepared to figure it out itself in that case.


---
$Id: Serialized.txt,v 1.8 2003/05/30 13:26:34 kripke Exp $