ISIS records serialized

Serialization means converting the internal representation
of an ISIS record to a sequence of bytes ("octets") suitable
for storage in a file or transfer over a network.

The serialization format described here is used by OpenIsis
for both database master files and network communications.


* design goals

The serialization format should be
- easy to use
  for programmers and tool writers
- efficient
  in execution time and space used
- robust
  a broken masterfile should be fixable with a text editor
- versatile
  can be used for a variety of applications using a variety of tools
- without limits
  in number and size of records and fields


* basic format

In general, a record is serialized by
- serializing meta information
- serializing the fields in order
- appending a blank line

Fields are serialized as
- the field tag printed using ASCII decimal digits
  (optionally preceded by a minus sign, if negative tags are allowed)
- a (horizontal) TAB character (ASCII value 9, ^I)
- the field value
- a newline character (ASCII value 10, ^J)

Metadata is serialized in the same way,
using special tags according to the needs of the environment.
Two situations are distinguished:
- "soft" metadata, which may and should be accessible as part of the record.
  This is encoded by convention using negative tags.
  An example of this is HTTP and other MIME-style communication,
  where the MIME headers like "User-agent" or "Date" are encoded
  in such a way, while content data like GET or POST parameters
  should be mapped to positive IDs.
- "hard" metadata, which must not interfere with the record contents
  in order for the environment to work properly.
  This is encoded using a single non-digit character instead of tag digits.
  An example of this is the MFN in a master file and information
  regarding record deletion or update.

The final blank line may be omitted
where only a single record is contained in an otherwise delimited byte sequence.

A reader should support a lazy mode,
allowing the TAB to be omitted where unambiguous
(the field value does not start with a digit or a TAB).
Writers, however, are strongly urged to write the TAB.

Tags with leading zeros are allowed (typically written with %03d)
and must be read as decimal, not as octal
(for instance with atoi rather than strtol with base 0).

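As a rough illustration of the field format above, the following Python
sketch writes and reads one record. The names serialize_record and
parse_record are made up for this example and are not part of OpenIsis.
It assumes that values contain no newline characters and handles numeric
tags only (no "hard" metadata lines). The reader accepts the lazy form
without a TAB and reads leading zeros as decimal.

```python
import re

# tag (optional minus sign, decimal digits), optional TAB, then the value
FIELD_LINE = re.compile(r"(-?\d+)\t?(.*)", re.S)


def serialize_record(fields):
    """Serialize a list of (tag, value) pairs, followed by a blank line."""
    lines = [f"{tag}\t{value}\n" for tag, value in fields]
    return "".join(lines) + "\n"


def parse_record(text):
    """Parse one serialized record back into a list of (tag, value) pairs."""
    fields = []
    for line in text.split("\n"):
        if line == "":
            break  # blank line (or end of input) terminates the record
        match = FIELD_LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"not a field line: {line!r}")
        # int() always reads base 10, so "024" is tag 24, never octal
        fields.append((int(match.group(1)), match.group(2)))
    return fields


record = [(24, "foo"), (-1, "Date: today")]
data = serialize_record(record)
assert parse_record(data) == record
```

A lazy line such as "25bar" parses as tag 25 with value "bar"; a writer
following the recommendation above would still emit the TAB.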

* newline conventions

Two ways are supported to deal with newline characters in field values:
- in "text mode", newlines are replaced with vertical tabs (ASCII 11, ^K)
  on serialization and vice versa on deserialization.
- in "binary mode", newlines are replaced by newline-TAB sequences.

As an alternative to these protocol-level modes one may choose "field mode"
at the application level: simply declare that you are not interested in
newlines and replace each one by a tab or space (without ever converting back).


The advantages of text mode over binary mode are
- it is slightly faster than the binary translation
- the serialized records do not need more space
  than the internal representation
  (whereas the binary serialization might need nearly twice as much
  in the worst case)
- it is easily used with line-oriented utilities like grep or sed,
  since each field is contained within one line

The binary mode (which resembles MIME continuation lines) has the
advantage of not losing vertical tab characters that might have
been contained in the original field values.
It is fully transparent and can be used to store any binary data like images
with an average overhead of 0.4%, one extra TAB for about every 256th byte
of random data (as compared to +33% needed by BASE64 encoding).


The OpenIsis server automatically detects binary mode
if the client uses a continuation line.

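The two protocol-level modes can be sketched in a few lines of Python.
This is only an illustration; the function names are invented here and
the functions operate on a single field value as bytes, not on a whole
record.

```python
NL, VT, TAB = b"\n", b"\x0b", b"\t"


def text_encode(value):
    """Text mode: newline becomes vertical tab."""
    return value.replace(NL, VT)


def text_decode(value):
    """Text mode: vertical tab becomes newline again."""
    return value.replace(VT, NL)


def binary_encode(value):
    """Binary mode: newline becomes newline + TAB (a continuation line)."""
    return value.replace(NL, NL + TAB)


def binary_decode(value):
    """Binary mode: newline + TAB becomes a plain newline again."""
    return value.replace(NL + TAB, NL)


# Text mode cannot tell an original vertical tab from an encoded newline.
assert text_decode(text_encode(b"a\x0bb")) == b"a\nb"

# Binary mode is transparent, here for every possible byte value.
blob = bytes(range(256)) * 4
assert binary_decode(binary_encode(blob)) == blob

# One extra byte per newline: 4 of 1024 bytes here, about 0.4%.
overhead = (len(binary_encode(blob)) - len(blob)) / len(blob)
```

For a value consisting only of newlines the binary encoding doubles the
size, which is the worst case mentioned above.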

* masterfile format

A basic masterfile consists of a blank-line separated series of records.

The first record is the "controlling record",
containing descriptive information such as the newline convention,
the subfield separator and the character encoding.
All of this is optional; the masterfile might just start with a blank line.

The MFNs are then assigned implicitly in order, starting from 1.
There is no distinction between
empty (consecutive blank lines) and deleted records.
No redundant information is contained.

Masterfile compression creates this state
(however, it may choose to use special meta lines for long ranges
of deleted records, see below -- but those are inefficient in
the Xref, anyway).
Such a masterfile can very easily be created by any tool (Perl and the like).
The Xref file can be recreated easily, quickly and reliably.


When writing to the database, information is ALWAYS appended to the end.
There is NO OVERWRITING of any data, ever, period.
That way data CANNOT BE DESTROYED by any operation
(one could advise the operating system to set a mandatory read lock on that),
and all changes are easily traced using tail -f.

A binary mode masterfile starts with a "broken" continuation line
containing a single TAB.

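A reader for the basic text mode masterfile described above might look
like the following Python sketch. It is an illustration under
assumptions, not the OpenIsis reader: the function name is invented, the
first record is treated as the controlling record, meta lines and binary
mode are not handled, and every field line is expected to carry its TAB.

```python
def read_basic_masterfile(text):
    """Return (controlling_record, {mfn: fields}) for a basic masterfile."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # artifact of the final newline, not a blank line

    records, current = [], []
    for line in lines:
        if line == "":
            records.append(current)  # a blank line closes a record
            current = []
        else:
            tag, _, value = line.partition("\t")
            current.append((int(tag), value))
    if current:
        records.append(current)  # final blank line was omitted

    control = records[0] if records else []
    # MFNs are implicit: the record after the controlling record is 1.
    return control, {mfn: rec for mfn, rec in enumerate(records[1:], 1)}


sample = "\n24\tfoo\n25\tbar\n\n\n24\tbaz\n\n"
control, by_mfn = read_basic_masterfile(sample)
assert control == []   # masterfile started with a blank line
assert by_mfn[2] == [] # empty and deleted records look the same
```

Note that str.split is used rather than str.splitlines, because the
latter also breaks lines at vertical tabs, which text mode uses inside
field values.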

* basic mode writing

In metalines, all numbers are given in decimal digits
and multiple items are separated by TABs.
The optional timestamps are an arbitrary prefix of the generalized time format
YYYYMMDDhhmmssttt... as of 'date +%Y%m%d%H%M%S' plus milliseconds ttt,
optionally followed by up to 14 characters to create a unique request id.

Write operations use the following meta line (preceding record data):
- W mfn [oldpos [timestamp]]
  followed by new record data denotes a write of record mfn.
  Depending on the needs of the environment,
  the byte offset of the last version might be added in order to
  support access to old versions (e.g. for delayed index update).
  oldpos may be given as position[.length[.fields]]

A W with no data following is mostly equivalent to a delete.
If an otherwise written mfn is higher than the highest known,
the highest known (and thus the implicit counter) is set to this.

A lazy reader should not require meta lines to be followed by a blank line
where unambiguous.
For writers, however, the blank line is strongly recommended.

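The W meta line can be built and taken apart as in the following Python
sketch. All function names are hypothetical. The sketch follows the
grammar W mfn [oldpos [timestamp]] with TAB-separated items and treats
oldpos as a tuple of one to three integers.

```python
from datetime import datetime


def make_timestamp(now):
    """YYYYMMDDhhmmss plus three digits of milliseconds (17 digits)."""
    return now.strftime("%Y%m%d%H%M%S") + f"{now.microsecond // 1000:03d}"


def write_metaline(mfn, oldpos=None, timestamp=None):
    """Build a W meta line; oldpos is (position[, length[, fields]])."""
    if timestamp is not None and oldpos is None:
        raise ValueError("the grammar allows a timestamp only after oldpos")
    items = ["W", str(mfn)]
    if oldpos is not None:
        items.append(".".join(str(n) for n in oldpos))
    if timestamp is not None:
        items.append(timestamp)
    return "\t".join(items)


def parse_metaline(line):
    """Split a meta line into (operator, mfn, oldpos, timestamp)."""
    items = line.split("\t")
    oldpos = None
    if len(items) > 2:
        oldpos = tuple(int(n) for n in items[2].split("."))
    timestamp = items[3] if len(items) > 3 else None
    return items[0], int(items[1]), oldpos, timestamp


stamp = make_timestamp(datetime(2003, 5, 30, 13, 26, 34, 123000))
line = write_metaline(1234, (4096, 120, 7), stamp)
assert parse_metaline(line) == ("W", 1234, (4096, 120, 7), stamp)
```

The record data would follow the meta line as ordinary field lines,
terminated by a blank line.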

A reasonable size limit on metalines is 127(+newline), since
- 22 = 1+1+20 for operator, tab and mfn (64 bits ~ 20 digits)
- 43 = 1+20+1+10+1+10 for tab position.length.fields
- 32 = 1+17+14 for tab, milliseconds time and id
totals to 97, so we have some room left.

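The length estimate can be checked by building the longest meta line the
breakdown allows. In this small Python sketch the 10-digit items are
assumed to be 32-bit values, which is an inference from the digit counts
and not stated in the text.

```python
MAX_64 = 2 ** 64 - 1   # 20 decimal digits
MAX_32 = 2 ** 32 - 1   # 10 decimal digits (assumed width)

mfn = str(MAX_64)
oldpos = ".".join([str(MAX_64), str(MAX_32), str(MAX_32)])
stamp_and_id = "9" * 17 + "x" * 14  # milliseconds time plus request id

longest = "\t".join(["W", mfn, oldpos, stamp_and_id])

assert len(longest) == 22 + 43 + 32 == 97
assert len(longest) <= 127
```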

* advanced mode

For better space efficiency at the cost of increased read access time,
the following lines may be used:
- D mfn [oldpos [timestamp]]
  a record entry consisting of only a D line denotes the deletion of
  record number mfn.
  Basically equivalent to a write with no data following,
  but a little bit more explicit.
- I mfn [timestamp]
  (set id) override the implicit MFN counter,
  e.g. after a series of deleted rows or just to be explicit.
  Basically equivalent to a write with no old pos,
  but a little bit more explicit.
  Recommended when appending records after some deletes.
- C mfn oldpos [timestamp]
  introduces a series of patch commands specifying how record mfn
  was changed.

Software writing the masterfile will typically choose to
write full updates and MUST provide a switch to force this,
in order to be compatible with basic mode readers.

However, the patch command language is particularly useful
with server operations, to avoid the need for read-write
sequences with some sort of locking.


* the patch language

The patch commands are lines starting with special
characters like +, -, ~ and so on,
followed by (an optional TAB and) field addresses, TAB and field data.


The simplest case is the '+' command,
meaning that its data is to be appended to the record.
The + and TAB may be omitted
(both, in order not to be confused with a continuation line).
In other words, the add command may look exactly like an ordinary
field line.

A series of '=' commands works exactly like the set operation in an
OpenIsis Tcl record. In particular, field indexes and subfields are supported.
The '-' command resembles the del operation.

A detailed description of the "patch language" is still to be done.


example
$
C 1234
= 24 foo
= 24 bar
25 baz
$
changes record 1234 by setting the first two occurrences of
field 24 to foo and bar, respectively, deleting any other occurrences
of field 24, and appending a field 25 with value baz.

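Since the patch language is not yet specified in detail, the following
Python sketch only mirrors the behaviour described for the example above:
'=' lines set the leading occurrences of a tag and drop the rest, '+' or
plain field lines append. The function name is invented, items are
assumed to be TAB-separated, and appending surplus '=' values when the
record has too few occurrences is a guess. Field indexes, subfields and
the '-' command are left out.

```python
def apply_patch(fields, patch_lines):
    """Apply '=' and '+' patch lines to a list of (tag, value) pairs."""
    pending = {}   # tag -> values still to be set, in order
    appends = []
    for line in patch_lines:
        if line.startswith("=\t"):
            _, tag, value = line.split("\t", 2)
            pending.setdefault(int(tag), []).append(value)
        elif line.startswith("+\t"):
            _, tag, value = line.split("\t", 2)
            appends.append((int(tag), value))
        else:  # plain field line, same as '+'
            tag, value = line.split("\t", 1)
            appends.append((int(tag), value))

    result = []
    for tag, value in fields:
        if tag not in pending:
            result.append((tag, value))
        elif pending[tag]:
            result.append((tag, pending[tag].pop(0)))
        # else: surplus occurrence of a set tag, dropped
    for tag, values in pending.items():
        result.extend((tag, v) for v in values)  # assumption, see above
    return result + appends


before = [(24, "a"), (10, "x"), (24, "b"), (24, "c")]
after = apply_patch(before, ["=\t24\tfoo", "=\t24\tbar", "25\tbaz"])
assert after == [(24, "foo"), (10, "x"), (24, "bar"), (25, "baz")]
```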

* the pointer file

The pointer file is an array of n-byte (n >= 6) entries,
the ith entry referencing mfn i, similar to the traditional .XRF (crossref).

The n=k+l+m bytes specify two or three numbers (in native byte order)
of up to 8 bytes each:
- the first k bytes (k >= 4) give the position of the record
  (or its last update or change entry)
- the next l bytes (l >= 2) give the length of the record
  (excluding the last field's terminating newline and following blank line)
- the final m bytes (m >= 0) give the number of fields.
  If m is 0, or all bits in a field number are set (for large records),
  the reader has to determine the number of fields by inspecting the record.

The first six bytes of the first entry describe the detailed layout.
Four bytes are the "magic number" containing the ASCII characters "ISIX".
Two bytes are the number (m*256 + l*16 + k) in native byte order.

The minimum case k=4, l=2 imposes the limits of traditional ISIS.
Actually the lower limits are not enforced;
in a very specialised application one might want to use k/l/m = 3/1/0.
Recommended are at least eight bytes as k/l/m = 4/3/1 or 4/4/0
or, with large file support, 5/3/0.

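The entry layout can be illustrated with Python's integer/byte
conversions. This is a sketch only: the function names are invented, and
padding the header entry with zero bytes up to n bytes is an assumption,
since the text only defines its first six bytes.

```python
import sys

ORDER = sys.byteorder  # "native byte order"


def pack_header(k, l, m):
    """First entry: magic "ISIX", then m*256 + l*16 + k in two bytes."""
    head = b"ISIX" + (m * 256 + l * 16 + k).to_bytes(2, ORDER)
    return head.ljust(k + l + m, b"\0")  # zero padding is an assumption


def unpack_header(entry):
    """Return (k, l, m) from the first entry of a pointer file."""
    if entry[:4] != b"ISIX":
        raise ValueError("bad magic number")
    layout = int.from_bytes(entry[4:6], ORDER)
    return layout & 0xF, (layout >> 4) & 0xF, layout >> 8


def pack_entry(k, l, m, position, length, nfields=0):
    """One n = k+l+m byte entry; raises OverflowError if a value is too big."""
    return (position.to_bytes(k, ORDER) + length.to_bytes(l, ORDER)
            + nfields.to_bytes(m, ORDER))


def unpack_entry(k, l, m, entry):
    """Return (position, length, nfields); nfields is None when m is 0."""
    position = int.from_bytes(entry[:k], ORDER)
    length = int.from_bytes(entry[k:k + l], ORDER)
    nfields = int.from_bytes(entry[k + l:k + l + m], ORDER) if m else None
    return position, length, nfields


header = pack_header(4, 3, 1)
entry = pack_entry(4, 3, 1, position=4096, length=120, nfields=7)
assert unpack_header(header) == (4, 3, 1)
assert unpack_entry(4, 3, 1, entry) == (4096, 120, 7)
```

With the recommended 4/3/1 layout both the header and every entry are
eight bytes long, so entry i sits at byte offset 8*i.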

However, when any of the limits is reached,
or an unsupported combination or byte order is found,
the Xref can easily be recreated with greater values.
The 12 byte pointer with k/l/m = 6/4/2 will be enough for
even gigantic databases of a quarter petabyte (262,144 gigabytes).

The number of fields is redundant, but as an optimization
may make life a little bit easier for a reader.
If the pointer structure has m>0 (typically m=1),
a value of 0 must be stored if the number of fields exceeds the representable
range and a reader should be prepared to figure it out itself in that case.

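The size figures quoted for the pointer layouts follow from the byte
widths alone, as this short Python calculation shows. The helper name is
made up for the example, and gigabyte and petabyte are taken as binary
units (2^30 and 2^50 bytes).

```python
def capacity(width_bytes):
    """Number of distinct values an unsigned field of this width holds."""
    return 2 ** (8 * width_bytes)


GIB = 2 ** 30
PIB = 2 ** 50

# k = 6 position bytes address 2^48 bytes of masterfile.
assert capacity(6) // GIB == 262144
assert capacity(6) / PIB == 0.25

# The minimum case k = 4, l = 2: 4 gigabytes of file, 64 KiB per record.
assert capacity(4) // GIB == 4
assert capacity(2) == 65536
```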

---
$Id: Serialized.txt,v 1.8 2003/05/30 13:26:34 kripke Exp $