Contents of /openisis/current/doc/Serialized.txt


Revision 237
Mon Mar 8 17:43:12 2004 UTC by dpavlin
File MIME type: text/plain
File size: 10001 byte(s)
initial import of openisis 0.9.0 vendor drop

1 ISIS records serialized
2
3 Serialization means to convert the internal representation
4 of an ISIS record to a sequence of bytes ("octets") suitable
5 to be stored in a file or transferred via a network.
6
7 The serialization format described here is used by OpenIsis
8 for both database master files and network communications.
9
10
11 * design goals
12
13 The serialization format should be
14 - easy to use
15 for programmers and tool writers
16 - efficient
17 in execution time and space used
18 - robust
19 a broken masterfile should be fixable using a text editor
20 - versatile
21 can be used for a variety of applications using a variety of tools
22 - without limits
23 in number and size of records and fields
24
25
26 * basic format
27
28 In general, a record is serialized by
29 - serializing meta information
30 - serializing the fields in order
31 - appending a blank line
32
33 Fields are serialized as
34 - the field tag printed using ASCII decimal digits
35 (optionally preceded by a minus sign, if negative tags are allowed)
36 - a (horizontal) TAB character (ASCII value 9, ^I)
37 - the field value
38 - a newline character (ASCII value 10, ^J)
39
40 Metadata is serialized in the same way,
41 using special tags according to the needs of the environment.
42 Two situations are distinguished:
43 - "soft" metadata, which may and should be accessible as part of the record.
44 This is encoded by convention using negative tags.
45 An example of this is HTTP and other MIME-style communication,
46 where the MIME headers like "User-agent" or "Date" are encoded
47 in such a way, while content data like GET or POST parameters
48 should be mapped to positive IDs.
49 - "hard" metadata, which must not interfere with the record contents
50 in order for the environment to work properly.
51 This is encoded using a single non-digit character instead of tag digits.
52 An example of this is the MFN in a master file and information
53 regarding record deletion or update.
54
55 The final blank line may be omitted,
56 where only a single record is contained in an otherwise delimited byte sequence.
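The record layout described above can be sketched in a few lines of Python.
This is only an illustration, not the OpenIsis implementation: the function
names are made up, the soft metadata tag -1 is a hypothetical example, and
values are assumed to contain no raw newlines (see the newline conventions).

```python
TAB, NL = "\t", "\n"

def serialize_field(tag, value):
    """One field line: tag, TAB, value, newline.

    tag is an int (printed as decimal digits, possibly negative) or, for
    "hard" metadata, a single non-digit character such as "W".
    """
    if NL in value:
        raise ValueError("newlines must be encoded before serializing")
    return f"{tag}{TAB}{value}{NL}"

def serialize_record(fields, meta=()):
    """Meta information first, then the fields in order, then a blank line."""
    lines = [serialize_field(tag, value) for tag, value in meta]
    lines += [serialize_field(tag, value) for tag, value in fields]
    return "".join(lines) + NL

# Hypothetical record: one soft metadata line (tag -1) and two fields.
example = serialize_record([(24, "foo"), (25, "baz")], meta=[(-1, "text/plain")])
```

An empty record serializes to a single newline, which is why consecutive
blank lines in a masterfile read as empty records.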
57
58 A reader should support a lazy mode,
59 allowing the TAB to be omitted, where unambiguous
60 (the field value does not start with a digit or a TAB).
61 Writers, however, are strongly urged to write the TAB.
62
63 Tags with leading zeros are allowed (typically written with %03d)
64 and must be read as decimal (as atoi does), never as octal.
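A lazy reader for a single field line might look like the following sketch.
The name parse_field_line is an assumption for illustration. The tag is
always parsed as decimal, so leading zeros are harmless, and the TAB is
treated as optional.

```python
import re

# Optional minus sign, decimal digits, optional TAB, then the value.
_FIELD = re.compile(r"(-?[0-9]+)(\t?)(.*)", re.DOTALL)

def parse_field_line(line):
    """Return (tag, value) for one serialized field line."""
    if line.endswith("\n"):
        line = line[:-1]
    match = _FIELD.fullmatch(line)
    if match is None:
        raise ValueError("not a field line: %r" % line)
    digits, _tab, value = match.groups()
    # Base 10 is given explicitly: "024" is tag 24, not octal 20.
    return int(digits, 10), value
```

Without the TAB, a value that starts with a digit would be swallowed by the
tag ("247 dwarfs" reads as tag 247), which is why writers should always
emit the TAB.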
65
66
67 * newline conventions
68
69 Two ways are supported to deal with newline characters in field values:
70 - in "text mode", newlines are replaced with vertical tabs (ASCII 11, ^K)
71 on serialization and vice versa on deserialization.
72 - in "binary mode", newlines are replaced by newline-TAB sequences.
73
74 As an alternative to these protocol-level modes one may choose "field mode"
75 at the application level: simply claim that you are not interested in
76 newlines and replace each with a TAB or space (without ever converting back).
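The three conventions can be compared side by side in a short sketch. The
function names are assumptions for illustration. Values are handled as
bytes, since binary mode is meant to carry arbitrary data.

```python
NL, VT, TAB = b"\n", b"\x0b", b"\t"

def encode_text(value):
    """Text mode: newline becomes vertical tab (^K)."""
    return value.replace(NL, VT)

def decode_text(data):
    """Text mode: vertical tab becomes newline again."""
    return data.replace(VT, NL)

def encode_binary(value):
    """Binary mode: newline becomes newline-TAB (a continuation line)."""
    return value.replace(NL, NL + TAB)

def decode_binary(data):
    """Binary mode: drop the TAB that follows each newline."""
    return data.replace(NL + TAB, NL)

def encode_field_mode(value):
    """Field mode: newlines are flattened to spaces, with no way back."""
    return value.replace(NL, b" ")
```

Binary mode round-trips any value, including one that already contains a
newline followed by a TAB. Text mode turns an original vertical tab into a
newline on the way back.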
77
78
79 The advantages of text mode over binary mode are
80 - it is slightly faster than the binary translation
81 - the serialized records do not need more space
82 than the internal representation
83 (whereas the binary serialization might need nearly twice as much
84 in worst case)
85 - it is easily used with line-oriented utilities like grep or sed,
86 since each field is contained within one line
87
88 The binary mode (which resembles MIME continuation lines) has the
89 advantage of not losing vertical tab characters that might have
90 been contained in the original field values.
91 It is fully transparent and can be used to store any binary data like images
92 with an average overhead of 0.4%
93 (as compared to +33% needed by BASE64 encoding).
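The overhead figures can be checked with a small calculation. The sketch
assumes that "average" means every byte value is equally likely, so one
byte in 256 is a newline and costs one extra TAB. The function names are
made up for this illustration.

```python
import base64

def binary_overhead(value):
    """Extra bytes added by binary mode, as a fraction of the input size."""
    return value.count(b"\n") / len(value)

def base64_overhead(value):
    """Extra bytes added by BASE64, as a fraction of the input size."""
    return (len(base64.b64encode(value)) - len(value)) / len(value)

uniform = bytes(range(256)) * 4      # every byte value equally often
average = binary_overhead(uniform)   # 1/256, about 0.4%
worst = binary_overhead(b"\n" * 10)  # every byte doubled
b64 = base64_overhead(bytes(3000))   # 4 output bytes per 3 input bytes
```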
94
95
96 The OpenIsis server automatically detects binary mode,
97 if the client uses a continuation line.
98
99
100 * masterfile format
101
102 A basic masterfile consists of a blank-line separated series of records.
103
104 The first record is the "controlling record",
105 containing descriptive information such as the newline convention,
106 the subfield separator and the character encoding.
107 All of this is optional; the masterfile might just start with a blank line.
108
109 The MFNs are then assigned implicitly in order, starting from 1.
110 There is no distinction between
111 empty (consecutive blank lines) and deleted records.
112 There is absolutely no redundant information contained.
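Reading such a basic masterfile takes little code. The sketch below assumes
a file without meta lines, numeric tags everywhere (including the
controlling record) and a TAB after every tag. The function name is made up
for illustration.

```python
def read_basic_masterfile(text):
    """Return (control_fields, {mfn: fields}) for a basic masterfile."""
    # split("\n") rather than splitlines(): the latter would also split
    # on the vertical tabs that text mode stores inside field values.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()                  # piece after the final newline
    groups, current = [], []
    for line in lines:
        if line == "":               # blank line closes the record
            groups.append(current)
            current = []
        else:
            tag, _, value = line.partition("\t")
            current.append((int(tag), value))
    if current:                      # tolerate a missing final blank line
        groups.append(current)
    control = groups[0] if groups else []
    # MFNs are implicit: the first record after the control record is 1.
    # Empty groups are empty or deleted records and are left out.
    records = {mfn: fields
               for mfn, fields in enumerate(groups[1:], start=1) if fields}
    return control, records
```

For a file that starts with a blank line, holds one record, then an empty
record, then another record, the result has an empty controlling record and
entries for MFN 1 and MFN 3 only.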
113
114 Masterfile compression creates this state
115 (however, it may choose to use special meta lines for long ranges
116 of deleted records, see below -- but those are inefficient in
117 the Xref, anyway).
118 Such a masterfile can be very easily created by any tool (like Perl and such).
119 The Xref file can be recreated easily, quickly and reliably.
120
121
122 When writing to the database, information is ALWAYS appended to the end.
123 There is NO OVERWRITING of any data, ever, period.
124 That way data CANNOT BE DESTROYED by any operation
125 (one could advise the operating system to set a mandatory read lock on that),
126 and all changes are easily traced using tail -f.
127
128 A binary mode masterfile starts with a "broken" continuation line
129 containing a single TAB.
130
131
132 * basic mode writing
133
134 In metalines, all numbers are given in decimal digits
135 and multiple items are separated by TABs.
136 The optional timestamps are an arbitrary prefix of generalized time format
137 YYYYMMDDhhmmssttt... as of 'date +%Y%m%d%H%M%S' plus milliseconds ttt,
138 optionally followed by up to 14 characters to create a unique request id.
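A full-length timestamp of this shape can be produced as in the following
sketch. The name make_timestamp and the request id "req1" are assumptions
for illustration. Since any prefix is allowed, a writer may also cut the
result short, for example to the first 8 digits for a plain date.

```python
from datetime import datetime

def make_timestamp(moment, request_id=""):
    """YYYYMMDDhhmmss plus milliseconds, plus an optional request id."""
    if len(request_id) > 14:
        raise ValueError("request id is limited to 14 characters")
    millis = moment.microsecond // 1000
    return moment.strftime("%Y%m%d%H%M%S") + "%03d" % millis + request_id

stamp = make_timestamp(datetime(2003, 5, 30, 13, 26, 34, 123000), "req1")
```

The time part is 17 digits long (14 for date and time, 3 for
milliseconds).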
139
140 Write operations use the following meta line (preceding record data):
141 - W mfn [oldpos [timestamp]]
142 followed by new record data denotes a write of record mfn.
143 Depending on the needs of the environment,
144 the byte offset of the last version might be added in order to
145 support access to old versions (e.g. for delayed index update).
146 oldpos may be given as position[.length[.fields]]
147
148 A W with no data following is mostly equivalent to a delete.
149 If an otherwise written mfn is higher than the highest known,
150 the highest known (and thus the implicit counter) is set to this.
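A reader might take a W line apart as sketched below. The class and
function names are assumptions for illustration, and so is the detail that
the operator character is followed by a TAB like any other item.

```python
from typing import NamedTuple, Optional

class OldPos(NamedTuple):
    position: int
    length: Optional[int] = None
    fields: Optional[int] = None

class WriteMeta(NamedTuple):
    mfn: int
    oldpos: Optional[OldPos] = None
    timestamp: Optional[str] = None

def parse_write_meta(line):
    """Parse 'W mfn [oldpos [timestamp]]' with TAB-separated items."""
    items = line.rstrip("\n").split("\t")
    if items[0] != "W" or not 2 <= len(items) <= 4:
        raise ValueError("not a W meta line: %r" % line)
    mfn = int(items[1])
    oldpos = None
    if len(items) > 2:
        # oldpos is position[.length[.fields]], all decimal
        oldpos = OldPos(*(int(part) for part in items[2].split(".")))
    timestamp = items[3] if len(items) > 3 else None
    return WriteMeta(mfn, oldpos, timestamp)
```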
151
152 A lazy reader should not require meta lines to be followed by a blank line,
153 where unambiguous.
154 For writers, however, the blank line is strongly recommended.
155
156
157 A reasonable size limit on metalines is 127(+newline), since
158 - 22 = 1+1+20 for operator, tab and mfn (64 bits ~ 20 digits)
159 - 43 = 1+20+1+10+1+10 for tab position.length.fields
160 - 32 = 1+17+14 for tab, milliseconds time and id
161 totals to 97, so we have some room left.
162
163
164 * advanced mode
165
166 For better space efficiency at the cost of increased read access time,
167 the following lines may be used:
168 - D mfn [oldpos [timestamp]]
169 a record entry consisting of only a D line denotes the deletion of
170 record number mfn.
171 Basically equivalent to a write with no data following,
172 but a little bit more explicit.
173 - I mfn [timestamp]
174 (set id) override the implicit MFN counter,
175 e.g. after a series of deleted rows or just to be explicit.
176 Basically equivalent to a write with no old pos,
177 but a little bit more explicit.
178 Recommended when appending records after some deletes.
179 - C mfn oldpos [timestamp]
180 introduces a series of patch commands specifying how record mfn
181 was changed.
182
183 Software writing the masterfile will typically choose to
184 write full updates and MUST provide a switch to force this,
185 in order to be compatible with basic mode readers.
186
187 However, the patch command language is particularly useful
188 with server operations, to avoid the need for read-write
189 sequences with some sort of locking.
190
191
192 * the patch language
193
194 The patch commands are lines starting with special
195 characters like +, -, ~ and so on,
196 followed by (an optional TAB and) field addresses, TAB and field data.
197
198
199 The simplest case is the '+' command,
200 meaning that its data is to be appended to the record.
201 The + and TAB may be omitted
202 (both together, so that the line is not confused with a continuation line).
203 In other words, the add command may look exactly like an ordinary
204 field line.
205
206 A series of '=' commands works exactly like the set operation in an
207 OpenIsis Tcl record. Especially, field indexes and subfields are supported.
208 The '-' command resembles the del operation.
209
210 A detailed description of the "patch language" is still to be written.
211
212
213 example
214 $
215 C 1234
216 = 24 foo
217 = 24 bar
218 25 baz
219 $
220 changes record 1234 by setting the first two occurrences of
221 field 24 to foo and bar, respectively, deleting any other occurrences
222 of field 24, and appending a field 25 with value baz.
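Since the patch language is not yet specified in detail, the following is
only one possible reading of the example, not the OpenIsis behaviour. The
function name and the (op, tag, value) command tuples are assumptions. So
is the choice to keep replaced occurrences at their old positions and to
append '=' values that have no existing occurrence.

```python
def apply_patch(record, commands):
    """Apply '=' (set) and '+' (append) commands to a list of (tag, value)."""
    set_values = {}                  # tag -> values from '=' lines, in order
    appended = []
    for op, tag, value in commands:
        if op == "=":
            set_values.setdefault(tag, []).append(value)
        elif op == "+":
            appended.append((tag, value))
        else:
            raise ValueError("unsupported command: %r" % op)
    used = {tag: 0 for tag in set_values}
    result = []
    for tag, value in record:
        if tag not in set_values:
            result.append((tag, value))      # untouched field
        elif used[tag] < len(set_values[tag]):
            result.append((tag, set_values[tag][used[tag]]))
            used[tag] += 1
        # any further occurrence of a set tag is dropped
    for tag, values in set_values.items():
        result.extend((tag, v) for v in values[used[tag]:])
    return result + appended

old = [(24, "a"), (10, "x"), (24, "b"), (24, "c")]
new = apply_patch(old, [("=", 24, "foo"), ("=", 24, "bar"), ("+", 25, "baz")])
```

With the commands from the example, the three old occurrences of field 24
become foo and bar, field 10 is left alone, and field 25 is added at the
end.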
223
224
225 * the pointer file
226
227 The pointer file is an array of n-byte (n >= 6) entries,
228 the ith entry referencing mfn i, similar to the traditional .XRF (crossref).
229
230 The n=k+l+m bytes specify two or three numbers (in native byte order)
231 of up to 8 bytes each:
232 - the first k bytes (k >= 4) give the position of the record
233 (or its last update or change entry)
234 - the next l bytes (l >= 2) give the length of the record
235 (excluding the last field's terminating newline and following blank line)
236 - the final m bytes (m >= 0) give the number of fields.
237 If m is 0, or all bits in a field number are set (for large records),
238 the reader has to determine the number of fields by inspecting the record.
239
240 The first six bytes of the first entry describe the detailed layout.
241 Four bytes are the "magic number" containing the ASCII characters "ISIX".
242 Two bytes are the number (m*256 + l*16 + k) in native byte order.
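Decoding the layout and a single entry could look like the sketch below.
The function names are assumptions. So is the reading that the header
occupies entry 0, which no MFN uses, so that the entry for mfn i starts at
byte i*n. The text names both 0 and an all-bits-set value as the marker for
an unknown field count. The sketch treats either as unknown.

```python
import sys

MAGIC = b"ISIX"

def parse_layout(header, byteorder=sys.byteorder):
    """Return (k, l, m) from the first six bytes of the pointer file."""
    if header[:4] != MAGIC:
        raise ValueError("not a pointer file: magic number missing")
    code = int.from_bytes(header[4:6], byteorder)    # m*256 + l*16 + k
    return code & 0xF, (code >> 4) & 0xF, code >> 8

def entry_offset(mfn, k, l, m):
    """Byte offset of the entry for mfn, assuming the header is entry 0."""
    return mfn * (k + l + m)

def decode_entry(entry, k, l, m, byteorder=sys.byteorder):
    """Return (position, length, fields); fields is None when unknown."""
    position = int.from_bytes(entry[:k], byteorder)
    length = int.from_bytes(entry[k:k + l], byteorder)
    fields = None
    if m:
        raw = int.from_bytes(entry[k + l:k + l + m], byteorder)
        if raw not in (0, (1 << (8 * m)) - 1):
            fields = raw
    return position, length, fields
```

Passing byteorder explicitly lets a tool read a pointer file written on a
machine with the other byte order. Otherwise the Xref can be recreated.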
243
244 The minimum case k=4, l=2 imposes the limits of traditional ISIS.
245 Actually the lower limits are not enforced;
246 in a very specialised application one might want to use k/l/m = 3/1/0.
247 Recommended are at least eight bytes as k/l/m = 4/3/1 or 4/4/0
248 or, with large file support, 5/3/0.
249
250
251 However, when any of the limits is reached,
252 or an unsupported combination or byte order is found,
253 the Xref can easily be recreated with greater values.
254 The 12 byte pointer with k/l/m = 6/4/2 will be enough for
255 even gigantic databases of a quarter Petabyte (262,144 Gigabytes).
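The capacity figures follow directly from the width of the position field.
The sketch assumes binary units (1 Gigabyte = 2**30 bytes), and the
function name is made up for illustration.

```python
GIGABYTE = 2 ** 30
PETABYTE = 2 ** 50

def addressable_bytes(k):
    """Number of distinct byte positions a k-byte position field can hold."""
    return 256 ** k

traditional = addressable_bytes(4) // GIGABYTE   # k=4: 4 Gigabytes
large = addressable_bytes(6) // GIGABYTE         # k=6: 262144 Gigabytes
```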
256
257 The number of fields is redundant, but as an optimization
258 may make life a little bit easier for a reader.
259 If the pointer structure has m>0 (typically m=1),
260 a value of 0 must be stored if the number of fields exceeds the representable
261 range and a reader should be prepared to figure it out itself in that case.
262
263
264 ---
265 $Id: Serialized.txt,v 1.8 2003/05/30 13:26:34 kripke Exp $
