Contents of /openisis/0.9.9e/doc/Protocol.txt


Revision 604 - (show annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 6 months ago) by dpavlin
File MIME type: text/plain
File size: 20039 byte(s)
import of new openisis release, 0.9.9e

1 The Malete server protocol.
2
3
4 * introduction
5
6 The Malete server is based on passing of messages, which are represented
7 as records. The only interface to the server can be regarded as a single
8 function "send", which takes a record as parameter and returns a record.
9 The result record itself is a valid message.
10 This "send" can actually be invoked in one of two ways:
11
12 - by having the server in process
13 i.e. by actually calling the C function "send",
14 possibly via some wrapper to interface another programming language.
15 This is the way the Malete Tcl extension works.
16 - via some bytestream
17 This can be regarded as just one of the wrappers, interfacing a
18 bytestream by deserializing message records from the bytestream
19 and serializing result records to the bytestream.
20 The standard server process uses stdin and stdout and thus can
21 be invoked by executing it from pipes or by contacting it via TCP,
22 when running from
23 > http://openisis.org/Doc/UcspiSsl tcpserver.
24 As a special case, the record data file itself is such a bytestream,
25 however only containing simple write messages.
26
27 The server maintains a session state bound to a bytestream,
28 e.g. one TCP connection.
29
30
31 * messages and data
32
33 In Malete, every record has a "header", which is the value of the first field.
34 The header specifies which message the record represents,
35 with the following fields ("body") containing parameter data for the message.
36
37 Recall that
38 - the first field's tag denotes the number of fields in the record
39 - a "data record" is a record that can be written to a database.
40 This requires a record id (MFN), which, however, can be 0
41 to denote an append with the next available id.
42 - for a data record read from or written to the database,
43 the header will/must be empty or start with a digit.
44 The general format is 'rid[@pos][*TAB*leader]'.
45 Rid is the record id (MFN), which on write may be 0 to append a new record.
46 Pos is the optional old position to guard an updating write
47 against concurrent changes.
48 Leader contains arbitrary data like e.g. a MARC leader,
49 a record key or a message header.
50
51 Proper message headers are not empty and do not start with a digit.
52 The first token of a message header (up to a *TAB* or end of value)
53 is the message name, optionally qualified by a message target,
54 i.e. an object to receive the message (usually a database).
55
56
57 However, messages and data are converted into each other canonically:
58 - If a data record header is encountered where a message is expected,
59 it is treated as a write message as if 'W*TAB*' were prepended
60 (which obviously will write just this record).
61 Even the empty message (a record with 0 fields) is a valid message
62 and will append an empty record when sent to a database.
63 - If a message is treated as data, its header is treated as leader
64 as if '0*TAB*' were prepended.
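
The header rules and the two canonical conversions above can be sketched
in a few lines of Python. This is only an illustration, not the server's
code: the function names are made up for this example, and keeping 'pos'
as an uninterpreted string is an assumption, since its exact syntax is
not specified here.

```python
TAB = "\t"
DIGITS = "0123456789"

def is_data_header(header):
    """A data record header is empty or starts with a digit."""
    return header == "" or header[0] in DIGITS

def as_message(header):
    """Treat a header as a message: data headers become short writes."""
    return "W" + TAB + header if is_data_header(header) else header

def as_data(header):
    """Treat a header as data: a message header becomes the leader of an append."""
    return header if is_data_header(header) else "0" + TAB + header

def parse_data_header(header):
    """Split 'rid[@pos][TAB leader]' into (rid, pos, leader).

    An empty header is read as rid 0 (append); pos is None when absent.
    """
    ident, _, leader = header.partition(TAB)
    rid, _, pos = ident.partition("@")
    return int(rid or 0), (pos or None), leader
```

For example, as_message("42@7") yields a short write header for record 42,
while as_data("Q" + TAB + "water") yields an append whose leader is the
query header.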
65
66
67 * message targets are objects
68
69 A server processes messages by first looking up a target object by
70 inspecting and stripping an initial addressing part of the message header
71 (or resorting to some default) and then passing the message to this object.
72 (Actually, even this dispatching is done by an object, the session).
73
74
75 In general, objects are free in how they process messages.
76 For example, an object might represent a (session on a) remote server,
77 and simply pass every message there. Objects using the same processing
78 function are said to be in the same "class". Commonly processing functions
79 handle only some known messages and pass anything else on to the function
80 of another class, which is called "inheriting from this class".
81
82
83 Objects to which messages can be sent are
84 - a structure
85 is a collection of other (child) objects like databases (tables).
86 It basically does nothing but pass messages to its children.
87 It may support a listing of the known children.
88 The structure interface may be implemented locally or as a remote server.
89 - a database (table)
90 supports reading and writing of record and query data.
91 A database is a structure; it may support children, e.g. to provide views.
92 - a session
93 is a structure representing the connection to a (local or remote) server.
94 It passes messages to the server's children (like databases) and maintains
95 some state, called the environment.
95 some state, called the environment.
96
97
98 Any object should recognize
99 - the comment '#'
100 a special message used to pass additional info (echo/error)
101 - rooting '.'
102 the message is passed to the session as is.
103 A session strips the '.' and processes the rest as usual.
104 - options '=' (optional extension)
105 to get or set values of object options (not implemented).
106 - messages starting with other special characters like '|' and ';'
107 are reserved for future special processing
108
109
110 A structure in addition recognizes
111 - child addressing '.'
112 if the message name starts with a letter and contains a dot '.',
113 everything up to the dot is taken as the name of a child.
114 After stripping the child qualifier, the message is sent to the child.
115 With no additional message, the child's existence is tested
116 and returned in a comment.
117 The qualification can contain several dots, which are processed from left.
118 Therefore, 'a.b.c' means to send message 'b.c' to target 'a',
119 which could be for example a remote server, which in turn is expected
120 to somehow dispatch message 'c' to its local child 'b'.
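
Child addressing as described above might be sketched as follows. The
function name split_child is hypothetical, and only the leftmost dot is
stripped, so that the child itself dispatches any further qualifiers.

```python
def split_child(header):
    """Return (child, rest) if the message name addresses a child,
    else (None, header).

    The message name is the first token of the header, up to a TAB.
    """
    name, sep, params = header.partition("\t")
    first = name[:1]
    if first.isascii() and first.isalpha() and "." in name:
        child, _, rest = name.partition(".")
        return child, rest + sep + params
    return None, header
```

With this sketch, split_child("a.b.c") gives ("a", "b.c"), matching the
example in the text. A header starting with '.' is left untouched, since
that is rooting rather than child addressing.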
121
122 A session also supports:
123 - default path (optional extension)
124 Similar to a current working directory, a default path can be set
125 as session option '@', which is then lexically prepended to any
126 unrooted request to the session. (not implemented).
127
128
129 The standard messages a database should recognize are
130 - the write message W
131 writing one or more records to a database
132 - the read message R
133 reading records by record id
134 - the query message Q
135 to search the query data (btree index)
136 - the index message X
137 to write index data
138
139
140
141 Standard message and object names always start with an ASCII letter.
142 As a convention, message names should start uppercase and
143 object names lowercase.
144
145 Every message returns an error comment message in case of error
146 or another message as specified (possibly the empty message).
147
148
149 The body of a message (i.e. the fields following the header)
150 may define a fixed or variable number of parameter fields
151 or one or more records, which are in turn, depending on the message,
152 used as message or data records (generally regardless of their contents):
153 - header only:
154 The message is not using any fields or records as parameter.
155 Such messages treat any body as embedded records (see below) specifying
156 one or more chained messages, which are then processed in turn.
157 A possible but currently unused generalization of this is
158 a fixed number of parameter fields.
159 - parameter list:
160 the contents of the following fields are interpreted by the message itself.
161 Many messages use only one type of parameter fields and ignore their tags.
162 - embedded records:
163 Each of the records begins with a proper header field,
164 with the tag being its negative length (including the header).
165 A tag of 0 is treated as using all available fields.
166 Should such a tag be positive or specify a length
167 exceeding the number of available fields, the result is undefined:
168 an implementation may either report an error or treat it as a record
168 using all available fields.
169 - immediate record:
170 Some messages also support a short form, where they do not themselves
171 take all of their header, but only chop off some initial part of it,
172 using the remaining message as record.
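
Splitting a body into embedded records can be sketched like this. It
assumes a body is a list of (tag, value) pairs, which is a representation
chosen for this example only. For the undefined case it raises an error,
which is one of the two behaviours the text allows.

```python
def split_embedded(fields):
    """Split a list of (tag, value) fields into embedded records.

    Each record starts with a header field whose tag is the negative
    record length (header included); a tag of 0 takes all remaining fields.
    """
    records = []
    i = 0
    while i < len(fields):
        tag = fields[i][0]
        remaining = len(fields) - i
        if tag == 0:
            length = remaining
        elif tag < 0 and -tag <= remaining:
            length = -tag
        else:
            # positive tag or length beyond the available fields
            raise ValueError("bad embedded record length tag: %d" % tag)
        records.append(fields[i:i + length])
        i += length
    return records
```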
173
174
175 * write
176
177 The write message takes one of two forms:
178 - short write (immediate record):
179 The header is of the form 'W*TAB*rid[@pos][*TAB*leader]',
180 and the following fields are the body of a record to write.
181 This message writes the record with header 'rid[@pos][*TAB*leader]'
182 and the body as given by the following fields.
183 It returns a short read message with the record id written.
184 - long write (embedded records):
185 The header is a single 'W'. The body contains any number of embedded records.
186 Long write returns a long read message with the record ids written.
187 With an empty body, long write can be used to test the existence
188 and writeability of a database.
189
190 Note that there is no special support for deleting records;
191 writing empty records has the same effect.
192
193
194 * read
195
196 Like write, the read message takes one of two forms,
197 both returning a long write for the retrieved records:
198 - short read (header only):
199 The header is of the form 'R*TAB*rid[*TAB*count]'.
200 It reads count (default 1) records starting at record rid.
201 A count of 0 reads any records as available and within the read limit.
202 Note that a read of record 0 retrieves the metadata.
203 - long read (parameter list):
204 The header is a single 'R'.
205 The following fields contain one record id each.
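
As an illustration, the two read forms could be built as (header, body)
pairs like this. The helper names are invented for this sketch, and using
tag 0 for the record id fields of a long read is an assumption; the text
does not say which tag these parameter fields carry.

```python
def short_read(rid, count=1):
    """Build 'R TAB rid[TAB count]' with an empty body."""
    header = "R\t%d" % rid
    if count != 1:
        header += "\t%d" % count
    return header, []

def long_read(rids):
    """Build a bare 'R' header with one record id per body field."""
    return "R", [(0, str(rid)) for rid in rids]
```

Here short_read(0) would ask for the metadata record, and short_read(1, 0)
for as many records from id 1 on as the read limit allows.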
206
207 Note that
208 - the number of records read at once is limited by the session option 'r'
209 - read might retrieve older versions of records,
210 if the database has a snapshot position set
211
212
213 * query
214
215 The query message is of the form 'Q[*TAB*query]',
216 where query is an expression in the
217 > Query Malete query language.
218 With parameters, the query message creates a new query as the current.
219 With or without parameters, the query message returns an echo
220 of the estimated remaining result set size, followed by a long write
221 containing the next 'r' records from the current query set
222 (subject to a snapshot like read).
223
224 The query can contain two parts, separated by a '?':
225 - an index based search defining a result set.
226 If it is empty, the search result set is the entire database.
227 - a filter to be applied on record retrieval.
228 If no filter is specified (i.e. no '?'), only record ids are returned.
229 An empty filter selects every record with all fields.
230 Other filters will select records and/or fields.
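
The distinction between "no filter" and "empty filter" can be made
explicit in a small sketch. It assumes that the first '?' in the
expression is the separator; whether the query language permits a '?'
inside the search part is not covered here.

```python
def split_query(query):
    """Split a query expression into (search, filter).

    filter is None when there is no '?', so that only record ids are
    returned, and '' for an empty filter selecting whole records.
    """
    search, sep, flt = query.partition("?")
    return search, (flt if sep else None)
```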
231
232 In future versions, one or both parts might be specified as embedded
233 records. By now, however, the query message is header only.
234
235
236 Note that
237 - the session keeps a total of 'q' queries with the query expression,
238 the cursor (offset of next record to retrieve) and search result set.
239 If a query expression is only a reference '#n' to an open query,
240 this query is used from its current position without establishing
241 a new query.
242 - the size of a search result set is limited by the session option 's'.
243 This limit applies also to any intermediate result, thus the
244 actual set might be much smaller or even empty due to the limit.
245 Some search expressions might allow larger set sizes,
246 especially the empty one does (since no record ids need to be stored).
247
248 The returned echo contains several numbers:
249 - estimated number of remaining records, including the ones just read.
250 This number may be wrong for a number of reasons; in particular, it does
251 not account for filtering. However, if it equals the number of returned
252 records, it is safe to assume that there are no more records.
253 This number is the primary echo code; if it is negative,
254 the rest of the echo is some error message.
255 - number of the query, by which it can be referenced.
256 These numbers are per database.
257 - truncation record id. If not 0, this is a record id where the search
258 was truncated due to the result set size limit.
259 Future versions might support transparent continuation after truncation.
260
261
262 * terms
263
264 The terms message has one of the forms
265 - 'T*TAB*from*TAB*to'
266 Selects terms greater than or equal to the first parameter and less than the second.
267 Where the second parameter is empty, no upper bound is used.
268 - 'T*TAB*prefix'
269 Selects terms with the parameter as prefix.
270 Using a prefix ABC is just a shorthand for from ABC to ABD.
271 - 'T*TAB*from*TAB*to*TAB*tag'
272 Like the first form, but restrict matches to the given tag (number).
273
274
275 Terms are returned as a list (record with 0-tagged fields),
276 where each field value is a count of hits of the term,
277 followed by a *TAB* and the term.
278 The list is limited to the result set size.
279 The full index can be looped by using the last returned term
280 as from parameter for the next invocation.
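
The prefix shorthand and the returned term list might be handled as in
the following sketch. It assumes terms compare by plain code point order,
which may differ from the index collation, and it does not handle a
prefix whose last character is already the highest code point. Both
function names are hypothetical.

```python
def prefix_range(prefix):
    """Turn a prefix into (from, to) bounds, e.g. 'ABC' -> ('ABC', 'ABD').

    An empty prefix gives an empty upper bound, i.e. no upper bound.
    """
    if not prefix:
        return "", ""
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

def parse_terms(fields):
    """Parse term list fields of the form 'count TAB term'
    into (term, count) pairs."""
    terms = []
    for _tag, value in fields:
        count, _, term = value.partition("\t")
        terms.append((term, int(count)))
    return terms
```

To loop over a full index, a client would pass the last term returned by
parse_terms as the next 'from' parameter, as described above.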
281
282
283 When not restricting to a tag, the hit count is just the number of all
284 index entries for the selected terms. This may be higher than the number
285 of matched records, where a term has multiple hits for the same records.
286
287 With a restriction to a tag, the count is the actual number of records
288 (even where a term has multiple entries for the same record and tag).
289 If the database uses the traditional fulltext index format (the default),
290 tag 0 selects any tag, else tag 0 selects actual tag 0 entries (unique keys).
291
292
293 * index
294
295 The index message 'X' takes a parameter list of data and control fields.
296 Control fields have tag 0 and change the way the data fields are processed.
297 All other fields contain index data. During processing of the message,
298 a position counter is maintained which is incremented by one for every word
299 (in word or split mode), to the next multiple of the field step (default 65536)
300 for every field (1 in word mode), and reset to 0 on tag change.
301
302
303 Every control field contains one or more instructions
304 (as always, separated by TABs):
305 - f[pos]
306 sets default (full field) indexing mode where every data field contains
307 one index entry. The position is set to the given or 0 and then
308 incremented to the field step.
309 - w[pos]
310 Like field mode, but incrementing the position by one.
311 - s[pos]
312 Split mode, where each data field is split into words according
313 to collation info.
314 If the index has no collation info, all characters but the well-known
315 ASCII non-letters are assumed to be word characters.
316 - a[pos]
317 set add mode (default)
318 - d[pos]
319 set delete mode: following index entries are deleted.
320 - m[mode]
321 mode 'H' selects traditional conversion of angle brackets:
322 <a[=b]> is replaced by b (or nothing).
323 mode 'P' or none turns this off.
324 - p*pfx*
325 prepend prefix pfx to index entries
326 - r*id*
327 set record id (defaults to the session's last written record)
328 - [+|-]*tag*
329 where tag is a number, stops processing of the field and treats
330 everything after the next *TAB* as data field with *tag*.
331 With a leading + or -, set mode to add or del, resp.
332
333 Control instructions may also be part of the message header.
334 The index message echoes a count of the index entries made.
335
336
337 * comment
338
339 The comment message '#' is used to augment other messages.
340 It is header only (executing any body) of the form '#*TAB*code[*TAB*message]',
341 where code is a number.
342 A nonnegative code indicates a success, typically some count.
343 A negative code indicates some sort of error (-1..-10) or notification.
344 Message is arbitrary.
345 This message copies itself to the result.
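
A client might decode a comment header along these lines. The names are
made up for this sketch, and treating only -1..-10 as errors follows the
range given above; other negative codes are taken to be notifications.

```python
def parse_comment(header):
    """Split '# TAB code[TAB message]' into (code, message)."""
    parts = header.split("\t", 2)
    if parts[0] != "#" or len(parts) < 2:
        raise ValueError("not a comment header: %r" % header)
    code = int(parts[1])
    message = parts[2] if len(parts) > 2 else ""
    return code, message

def is_error(code):
    """Codes -1..-10 denote errors; other negatives are notifications."""
    return -10 <= code <= -1
```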
346
347
348 * options
349
350 Some objects have options, which can be given as subfields
351 in some configuration header for the object and be set and retrieved
352 using the '=' message. The '=' message echoes a comment containing
353 some or all options as subfields.
354
355 - a single '=' echoes all options
356 - '=' immediately followed by option characters echoes these options
357 - additional subfields set options and, after a single '=', echo these.
358
359
360 * special message processing
361
362 optional extensions
363
364 There are more special messages envisioned which are used to control or
365 modify the processing of one or more other messages.
366 Given here is a rough sketch as a guide for future implementation,
367 however, it may not be implemented yet and is still subject to change.
368
369
370 The pipe '|' reuses the result created by one message as or for another message.
371 It scans its header for occurrences of '*TAB*|' (i.e. tab-separated subfields
372 with subfield code '|'), each of which starts a new submessage.
373 Iteratively, the part of the header up to the next submessage is processed
374 as a message, creating a result.
375
376 Then if the next submessage
377 (the part of the header starting with the next character after the pipe
378 and extending to the character before the next '*TAB*|' or end of header)
379 - is empty,
380 the result is processed as message.
381 This is convenient to immediately execute the read returned by a query.
382 - starts with a *TAB*,
383 the submessage (including the *TAB*) is appended to the result's header,
384 and the result is processed as message.
385 - else,
386 the result's header is echoed to the final (not the next intermediate)
387 result and then replaced by the submessage before processing.
388
389 As a special case, if the pipe message header did not contain any '*TAB*|',
390 it is treated as with '*TAB*|' at end, i.e. the only submessage's result
391 is executed (mimicking the effect of backticks).
392
393 In a long form, where the pipe message header is only the '|',
394 the submessages are embedded records in the body.
395 Here, in each step, any body fields of the following submessage
396 are prepended to the result before execution.
397
398
399 The composition ';' processes several messages, appending to the same result.
400 In the long form, submessages are embedded records.
401 In the compact form, the header is split into submessages as for the pipe.
402 (Details to specify).
403
404
405 * serialization
406
407 Messages can be represented in byte streams according to the following rules:
408 - Field values (including the header) MUST NOT contain a newline character,
409 else the results are undefined. Where an application must be prepared
410 to handle newlines, it must take care of encoding them (see below).
411 - If the message header is empty, no header is printed
412 - else if the message is a regular message (not starting with a digit),
413 the header is printed followed by a newline.
414 - else 'W*TAB*' is printed followed by the header and a newline.
415 - All body fields are printed as the tag followed by a *TAB*,
416 the value and a newline.
417 - A single newline is printed to terminate the message.
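
The serialization rules above translate fairly directly into code. The
sketch below assumes a message is given as a header string plus a list of
(tag, value) body fields, and it always writes the tag and the TAB. It
rejects newlines instead of encoding them.

```python
def serialize(header, body):
    """Serialize one message to text, terminated by an empty line."""
    for value in [header] + [v for _tag, v in body]:
        if "\n" in value:
            raise ValueError("field values must not contain newlines")
    lines = []
    if header == "":
        pass                              # empty header: nothing printed
    elif header[0] in "0123456789":
        lines.append("W\t" + header)      # data record: explicit write
    else:
        lines.append(header)              # regular message header
    for tag, value in body:
        lines.append("%d\t%s" % (tag, value))
    return "".join(line + "\n" for line in lines) + "\n"
```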
418
419
420 On deserialization, if a message starts with a number (digit or -sign),
421 this is the tag of the first body field, and an empty header is to
422 be assumed (equivalent to a 'W*TAB*0' append message).
423
424 For all body fields, the deserialization must be done in the following steps:
425 - take an initial '-' sign and any digits as tag, defaulting to 0
426 - skip one following *TAB* character
427 - use anything up to a newline as value
428 Consequently, on serialization:
429 - a tag of 0 may and commonly will be omitted
430 - where a value does not start with a TAB,
431 the TAB may be omitted
432 - where a value does not start with a '-', digit or TAB,
433 both a 0 tag and the TAB may be omitted
434 - where values containing newlines are used unencoded,
435 they will in most cases result in following 0 tagged fields
436 However, omitting the TAB is considered bad style.
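
The deserialization steps can be sketched as follows, again as an
illustration only. The input is assumed to be the text of a single
message, and everything after the first empty line is ignored. The
function names are invented for this example.

```python
import re

# optional '-' sign and digits, one optional TAB, then the value
_FIELD = re.compile(r"(-?[0-9]*)\t?(.*)", re.DOTALL)

def parse_field(line):
    """Parse one body line into (tag, value); the tag defaults to 0."""
    sign_digits, value = _FIELD.fullmatch(line).groups()
    tag = int(sign_digits) if sign_digits not in ("", "-") else 0
    return tag, value

def deserialize(text):
    """Parse one serialized message into (header, body)."""
    lines = text.split("\n")
    if "" in lines:
        lines = lines[:lines.index("")]   # stop at the terminating empty line
    header = ""
    if lines and lines[0][0] not in "-0123456789":
        header = lines.pop(0)             # a first line not starting a tag is the header
    return header, [parse_field(line) for line in lines]
```

Note that a serialized data record comes back with the 'W*TAB*' write
header in front, consistent with the canonical conversion between data
and messages.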
437
438
439 The record data ("master") file is simply a stream of data record messages,
440 using headerless mode where possible (i.e. appends of leaderless records).
441
442
443 Some easy common encodings are suggested to deal with newline characters:
444 - in "field mode",
445 discard newlines by replacing them with spaces or tabs.
446 - in "text mode",
447 newlines are replaced with vertical tabs VT (ASCII 11, ^K).
448 This may be reversed to restore newline-separated lines if needed,
449 but e.g. on printing the VT will have the desired effect.
450 - in "binary mode",
451 newlines are replaced as VT followed by a byte value 1,
452 if the newline is followed by a byte value 0 or 1, else by a single VT.
453 A VT is replaced by a VT and a 0 byte.
454 - as an "ultra robust binary mode", use BASE64.
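
The binary mode rule is compact but easy to get wrong, so here is a
sketch of one possible encoder and decoder working on Python bytes. It
follows the rule as stated above; the function names are hypothetical
and the code has not been checked against the actual server.

```python
VT, NL = 0x0B, 0x0A

def encode_binary(data):
    """Replace newlines and VTs so the result contains no newline."""
    out = bytearray()
    for i, b in enumerate(data):
        if b == NL:
            out.append(VT)
            # mark the newline if the next original byte is 0 or 1
            if i + 1 < len(data) and data[i + 1] in (0, 1):
                out.append(1)
        elif b == VT:
            out += bytes((VT, 0))
        else:
            out.append(b)
    return bytes(out)

def decode_binary(data):
    """Reverse encode_binary."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b != VT:
            out.append(b)
        elif i < len(data) and data[i] == 0:
            out.append(VT)                # VT 0 is a literal VT
            i += 1
        else:
            out.append(NL)                # VT alone or VT 1 is a newline
            if i < len(data) and data[i] == 1:
                i += 1
    return bytes(out)
```

The marker byte 1 is what keeps a newline followed by a literal 0 or 1
from being confused with an encoded VT or an already marked newline.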
455
456 The advantages of text mode over binary mode are
457 - it is slightly faster than the binary translation
458 - the serialized records do not need more space
459 (whereas the binary serialization might need twice the space)
460
461 The binary mode has the advantage of not losing vertical tab characters that
462 might have been contained in the original field values.
463 It is fully transparent and can be used to store any binary data like images
464 with an average overhead of 0.4% (as compared to +33% with BASE64 encoding).
465 Note that for a plain text not containing control characters 0, 1 or 11,
466 text and binary mode have the same results, thus it is reasonably safe
467 for client libraries to use binary mode by default on all communication.
468
469 However, BASE64 has the advantage of even surviving a character set recoding,
470 thus is more robust for databases which may be exchanged internationally.
471 Also the overhead of BASE64 is fixed to 33% (4 bytes for every 3),
472 while the binary mode has a worst case of +100% (on all VTs).
473
474 ---
475 $Id: Protocol.txt,v 1.12 2004/06/15 11:11:16 kripke Exp $
