Contents of /openisis/0.9.9e/doc/RecStruct.txt


Revision 604 - (show annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 3 months ago) by dpavlin
File MIME type: text/plain
File size: 13806 byte(s)
import of new openisis release, 0.9.9e

1 OpenIsis/Malete field definition and record structures
2
3
4 * overview
5
6 A Malete record is a sequence of one or more fields.
7 The first one is called the header, all others are identified by a numeric tag.
8
9 As far as the Malete database core is concerned,
10 a field may contain any arbitrary bytes except newline characters.
11 Assuming anything about the structure of field data,
12 including any encoding of binary data,
13 is solely at the application's discretion.
14
15 As Malete is designed to be a multi-purpose database engine,
16 there is no special schema enforced.
17 However, there is a schema suggested and used by the OpenIsis application.
18 In the database's
19 > MetaData metadata record,
20 fields with tag (00)6 are reserved for this purpose (abuse at your own risk).
21
22
23 The rationale of this field definition is to provide enough flexibility
24 to efficiently support representations of all structures found in Z39.2
25 based systems (including but transcending the traditional CDS/ISIS software),
26 especially the various MARC formats, as well as full representations of
27 data commonly stored and transmitted in a couple of other formats like
28 MIME and XML.
29
30 The term "representation" means that Malete will not bother to
31 directly support XML's angle brackets nor XML's/MIME's foo="bar" options
32 nor the subfield delimiter characters of MARC or CDS/ISIS.
33 Rather, for any such data there should be a lossless transformation to an
34 efficient representation in some format described by this field definition.
35
36
37 * structure of fields
38
39 While fields may be used to hold a single value,
40 it is a common technique to treat them as a sequence of subfields.
41 ("A data element considered as a component of a field.", Z39.2).
42
43 A field may contain, in that order:
44 - 0 or more positional subfields of fixed length
45 - 1 or more positional subfields of variable length
46 - 0 or more identified subfields of variable length
47
48 Fixed length subfields end after as many bytes (not characters!) as given by
49 their length. They are typically used for data coded in some ASCII values.
50 Neither multi-byte UTF-8 characters nor the delimiter character should be stored
51 in fixed length subfields (however, it's up to the application to exercise care).
52
53 Variable length subfields end at a delimiter character or end of field.
54 Malete by default uses a tabulator as delimiter,
55 and import of CDS/ISIS databases converts the caret (hat '^') to tabs;
56 however, applications are free to use any delimiter they want.
57
58
59 Positional subfields are identified by their position within the field,
60 i.e. by counting that many bytes and delimiters.
61 Of course, there is only one nth position within a field,
62 i.e. every positional subfield can occur at most once.
63 Since the first n bytes and first m delimited subfields are used as the
64 positional subfields, they may be omitted only if end of field is seen,
65 i.e. all other subfields are omitted.
66
67 Identified subfields, on the other hand, start with a single character
68 identifying the subfield, just like fields in a record are identified by a tag.
69 Applications unaware of UTF-8 may demand a single byte as identifier.
70 Where portability is an issue, only ASCII letters and digits should be used.
71 Since there is at least one positional variable subfield,
72 identified subfields always start after a delimiter (in accordance with Z39.2).
73 An identified subfield may occur zero, one or more times in a field.
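
The layout described above can be sketched in Python. This is only an
illustration, not code from Malete: the function name split_field and its
parameters (fixed_lengths, n_positional, delim) are assumptions, and the
caller is expected to know the field's definition in advance.

```python
from typing import List, Tuple


def split_field(value: str, fixed_lengths: List[int],
                n_positional: int = 1, delim: str = "\t"
                ) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Split a field into fixed, positional and identified subfields."""
    data = value.encode("utf-8")  # fixed lengths count bytes, not characters
    fixed = []
    pos = 0
    for length in fixed_lengths:
        fixed.append(data[pos:pos + length].decode("utf-8", errors="replace"))
        pos += length
    # Everything after the fixed part is delimiter-separated.
    parts = data[pos:].decode("utf-8", errors="replace").split(delim)
    positional = parts[:n_positional]
    # Each remaining part starts with a one-character identifier.
    identified = [(p[0], p[1:]) for p in parts[n_positional:] if p]
    return fixed, positional, identified


fixed, positional, identified = split_field(
    "10main\tafoo\tbbar\tabaz", fixed_lengths=[1, 1])
print(fixed, positional, identified)
```

Identified subfields are returned as a list of pairs rather than a dict,
because the same identifier may occur more than once in a field.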
74
75
76 The MAIN VALUE of a field contains the fixed length subfields together with
77 the first positional variable subfield. Sloppy applications may use anything
78 up to the first delimiter, assuming that fixed subfields do not contain it.
79 In the common situation of having no fixed length subfields,
80 the main value equals the first positional subfield.
81 The main value in a field is very similar to a record's header
82 and commonly used as a key to select a field in a record.
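
As a rough sketch of the two readings of "main value" given above (the
strict one and the "sloppy" one), assuming the total byte length of the
fixed subfields is known; the names main_value, sloppy_main_value and
fixed_total are made up for this illustration:

```python
def main_value(value: str, fixed_total: int = 0, delim: str = "\t") -> str:
    """Fixed length subfields plus the first positional variable subfield."""
    data = value.encode("utf-8")
    head, tail = data[:fixed_total], data[fixed_total:]
    first = tail.split(delim.encode("utf-8"), 1)[0]
    return (head + first).decode("utf-8", errors="replace")


def sloppy_main_value(value: str, delim: str = "\t") -> str:
    """Everything up to the first delimiter, ignoring fixed subfields."""
    return value.split(delim, 1)[0]
```

The two agree as long as no fixed subfield contains the delimiter. For a
field such as "\t1title\tafoo" with two fixed bytes, the strict version
yields "\t1title" while the sloppy one yields an empty string.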
83
84
85 The properties of subfields stated so far are consequences of their very
86 definition. Additional properties, e.g. the main value being empty
87 or an identified subfield having a fixed length and/or occurring exactly once,
88 may be demanded by field definition.
89 It is the application's responsibility to make sure records do not violate
90 the field definition; the Malete server will happily store whatever it receives.
91
92
93 * definition of fields
94
95 The field definition uses fields of the metadata record,
96 one per field and one per subfield.
97 These fields themselves do not use fixed length subfields.
98 The main value is a (non-unique) key:
99 - 'tag' for a field definition,
100 where tag is an integer. Negative numbers are reserved for counted structures.
101 By convention, general application data fields should
102 > TagUse use tags
103 100 - 999.
104 - 'tag#len' for a fixed subfield,
105 where len is a positive integer
106 - 'tag#' for an additional variable positional subfield.
107 The first variable positional subfield's type, values and xref
108 are defined with the main field definition.
109 - 'tag^i' for a subfield identified by character i
110 ('^' is the actual hat character, which is NOT the subfield delimiter;
111 the field definition uses tabulators)
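
The four key forms listed above could be told apart as follows. This is a
minimal sketch; DefKey and parse_def_key are hypothetical names, not part
of any Malete API.

```python
import re
from typing import NamedTuple, Optional


class DefKey(NamedTuple):
    tag: int
    kind: str                     # "field", "fixed", "positional", "identified"
    length: Optional[int] = None  # only for kind == "fixed"
    ident: Optional[str] = None   # only for kind == "identified"


_KEY = re.compile(r"(-?\d+)(?:#(\d*)|\^(.))?", re.DOTALL)


def parse_def_key(key: str) -> DefKey:
    """Classify the main value of a field definition entry."""
    m = _KEY.fullmatch(key)
    if m is None:
        raise ValueError("not a field definition key: %r" % key)
    tag = int(m.group(1))
    length, ident = m.group(2), m.group(3)
    if ident is not None:
        return DefKey(tag, "identified", ident=ident)
    if length is None:
        return DefKey(tag, "field")
    if length == "":
        return DefKey(tag, "positional")
    if int(length) < 1:
        raise ValueError("len must be a positive integer: %r" % key)
    return DefKey(tag, "fixed", length=int(length))
```

For example, parse_def_key("11^c") describes the subfield of field 11
identified by the character 'c', and parse_def_key("11#2") a fixed
subfield of two bytes.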
112
113 All other subfields in the field definition are identified and optional:
114 - n name
115 A name by which a field or subfield can be referred to.
116 Field names must be unique and subfield names must be unique in their field.
117 It is strongly recommended to only use C identifiers,
118 i.e. ASCII letters, digits and the underscore, not starting with a digit.
119 - d description
120 Some textual description suitable for the database users.
121 - m min/mandatory
122 The sub/field must occur at least as many times as given by this option's
123 value (empty=1, absent=0).
124 - r repeatable
125 The sub/field must occur at most as many times as given by this option's
126 value (empty=any, absent=1). A value preceded by '+' (including a single
127 '+' for any) implies the mandatory option (at least one occurrence).
128 - v value
129 Every occurrence of this repeatable option is of the form name=value,
130 associating the symbolic name with a legal value for the sub/field.
131 The first such value is used as a default where the sub/field is created
132 for some reason.
133 - t type
134 Type of this sub/field; see further below.
135 Defaults to any (non-control) characters.
136 Applications might support repeated alternative types.
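
The interplay of the m and r options can be summarised in a few lines.
The sketch below assumes that an absent option is passed as None and a
present but empty option as an empty string; the function name
occurrence_bounds is made up for this example.

```python
from typing import Optional, Tuple


def occurrence_bounds(m: Optional[str] = None,
                      r: Optional[str] = None) -> Tuple[int, Optional[int]]:
    """Return (min, max) occurrences; max is None for "any number"."""
    if m is None:          # absent: not mandatory
        lo = 0
    elif m == "":          # empty: at least once
        lo = 1
    else:
        lo = int(m)
    if r is None:          # absent: at most once
        hi: Optional[int] = 1
    else:
        if r.startswith("+"):   # '+' implies the mandatory option
            lo = max(lo, 1)
            r = r[1:]
        hi = None if r == "" else int(r)
    return lo, hi
```

With neither option the result is (0, 1), i.e. an optional, non-repeatable
sub/field; r="+" gives (1, None), i.e. one or more occurrences.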
137
138 * types of subfields
139
140 Note that a field's type actually defines the type of its first
141 positional variable subfield (which is usually the main value).
142 If there are no subfields defined for a field,
143 the field's value equals its main value.
144
145
146 A simple type definition consists of a single letter indicating
147 a character type, optionally followed by some digits giving a repeat count.
148 Unlike the byte-based length restrictions of fixed length subfields,
149 the repeat count should be assumed in terms of characters.
150
151 For the terms "alphabetic" and "digit", it's up to the application's
152 UNICODE support to properly check these attributes for non-ASCII characters.
153 Simple environments may assume any code greater than 127 alphabetic.
154
155 Basic character types are:
156 - c character
157 Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
158 - a alpha
159 Any alphabetic character.
160 - d digit
161 ASCII digits '0'-'9'.
162 - n numeric
163 Digits and optional leading minus sign.
164 - w word
165 Alpha, digits and underscore.
166
167 Extended character/byte types, possibly not supported by all environments, are:
168 - b bit/boolean
169 ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should
170 be assumed, but a '1' if it's present and empty.
171 - r raw
172 Raw bytes using newline/vertical tab encoding as suggested by the
173 > Protocol
174 - i integer
175 Binary coded fixed-point decimal numbers using two decimal digits per byte
176 (128-99 .. 128+99) and starting with a byte 144 plus the bytes before
177 the decimal point (minus for negative numbers).
178 Such integers sort properly, avoid newlines and tabs, and the first byte
179 (for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
180 - t time
181 Date and time as GTF integer. Up to 8 digits before the decimal point
182 for date YYYYMMDD, after the decimal point hhmmss...
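
The description of the integer type is terse, so the following is only
one possible reading, restricted to whole numbers: the first byte is 144
plus (or, for negative numbers, minus) the count of digit-pair bytes, and
each following byte is 128 plus (or minus) a pair of decimal digits. The
function name encode_int, the encoding of zero as the single byte 144,
and the omission of digits after the decimal point are all assumptions of
this sketch.

```python
def encode_int(n: int) -> bytes:
    """Encode a whole number so that byte order equals numeric order."""
    digits = str(abs(n)) if n else ""
    if len(digits) % 2:
        digits = "0" + digits          # pad to whole digit pairs
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]
    if len(pairs) > 15:
        raise ValueError("more than 30 decimal digits")
    sign = -1 if n < 0 else 1
    # Length byte first, then one byte per digit pair (29..227).
    return bytes([144 + sign * len(pairs)] + [128 + sign * p for p in pairs])


values = [-1000, -100, -99, -5, 0, 5, 99, 100, 12345]
assert sorted(values, key=encode_int) == values
```

Under this reading no byte is ever a newline (10) or a tab (9), and plain
bytewise comparison sorts the encoded values numerically.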
183
184 For all simple type definitions, the same letter may be used uppercase.
185 With lowercase, the repeat count gives a maximum and defaults to any.
186 With an uppercase type letter, the repeat count is exact and defaults to 1.
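
A simple type definition could be checked roughly like this. The sketch
covers the basic types c, a, d and w only; n is left out because the text
does not say how the minus sign counts against the repeat count. The name
check_simple is hypothetical, and Python's str.isalpha stands in for "the
application's UNICODE support".

```python
import re

_CLASSES = {
    "c": lambda ch: ord(ch) >= 32,
    "a": lambda ch: ch.isalpha(),
    "d": lambda ch: ch in "0123456789",
    "w": lambda ch: ch.isalpha() or ch in "0123456789_",
}


def check_simple(typedef: str, value: str) -> bool:
    """Check value against a simple type such as 'd4', 'D4' or 'a'."""
    m = re.fullmatch(r"([A-Za-z])(\d*)", typedef)
    if m is None or m.group(1).lower() not in _CLASSES:
        raise ValueError("unsupported simple type: %r" % typedef)
    letter, count = m.group(1), m.group(2)
    ok = _CLASSES[letter.lower()]
    if not all(ok(ch) for ch in value):
        return False
    if letter.isupper():
        # Uppercase: exact count, defaulting to 1.
        return len(value) == (int(count) if count else 1)
    # Lowercase: count is a maximum, defaulting to any.
    return count == "" or len(value) <= int(count)
```

Since Python strings are sequences of characters, len(value) counts
characters rather than bytes, as the repeat count requires.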
187
188
189 Complex type definitions include the following:
190 - = pattern
191 Pattern is a sequence of simple type definitions of basic character types.
192 E.g. 'A3a6' denotes 3 to 9 alphabetic characters.
193 Any special character in pattern denotes itself (typically as separator).
194 - ~ regexp
195 Depending on the regexp package used.
196 - " literal
197 Must have one of the values listed with the field's v option.
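
One way to evaluate a pattern type is to translate it into a regular
expression. The sketch below uses Python's re module and supports the
basic types c, a, d and w (n is omitted, as its counting rule is not
spelled out). The function name pattern_to_regex is an assumption, and
Python's \w is only an approximation of the "word" type.

```python
import re

_CLASS = {
    "c": r"[^\x00-\x1f]",   # no C0 controls
    "a": r"[^\W\d_]",       # letters
    "d": r"[0-9]",          # ASCII digits
    "w": r"\w",             # letters, digits, underscore (approximation)
}


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern such as 'A3a6' into a compiled regex."""
    out = []
    for letter, count, literal in re.findall(
            r"([A-Za-z])(\d*)|(.)", pattern, re.DOTALL):
        if literal:
            out.append(re.escape(literal))   # separators denote themselves
            continue
        cls = _CLASS.get(letter.lower())
        if cls is None:
            raise ValueError("unsupported type letter: %r" % letter)
        if letter.isupper():
            quant = "{%s}" % (count or "1")            # exact count
        else:
            quant = "{0,%s}" % count if count else "*"  # maximum count
        out.append(cls + quant)
    return re.compile("".join(out))


assert pattern_to_regex("A3a6").fullmatch("abcdef")
```

With this translation 'A3a6' accepts 3 to 9 alphabetic characters, and a
pattern like 'D4-D2-D2' accepts a date written as 2004-07-26.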
198
199 The field definition of the basic field definition itself is:
200 $
201 6 6 nfdt dfield definition r t=Nc
202 6 6^n nname dsub/field name
203 6 6^d ndesc dsub/field description
204 6 6^m nmin dmin number of occurrences tn
205 6 6^r nrep dmax number of occurrences
206 6 6^v nval dnamed values r
207 6 6^t ntype dsub/field type
208 $
209
210
211 * advanced field definition
212
213 There are some advanced field definition options which are probably
214 not supported by all applications.
215 Where used, however, the following formats are recommended:
216 - b base
217 The key or name of another sub/field definition in this metadata record
218 from which options (and, for a field, subfield definitions) should be
219 used for this entity. Obviously just a convenience feature.
220 - x xref
221 Definition of some other entity referred to by the value of this sub/field.
222 Described elsewhere.
223 - s structure a.k.a. subrecord
224 The field introduces a structure in the record; see further below.
225 - c child
226 This repeatable option specifies a tag or name of a legal child field.
227 Applications might support this being followed by '[:min][-max]'
228 to specify a minimum and/or maximum count of occurrences of this child,
229 or one of the letters '+' (at least once), '?' (at most once),
230 '!' (exactly once) or '*' (any number of times, default).
231 In the definition of those children, r0 may be used to indicate that they
232 should not occur in the record except where explicitly listed as a legal child.
233
234 * structures
235
236 The structure option indicates that a field is the header of a structure,
237 indicating that some fields following it in the record somehow belong to it.
238 ("A group of fields within a record that may be treated as a logical entity.
239 (When a record describes more than one entity, the descriptions of individual
240 entities may be treated as subrecords.)", Z39.2).
241
242
243 While in general there are a couple of ways to mark a sequence of fields
244 as logically being one entity, there are three methods supported by
245 the field definition:
246 - counted structures
247 If the s option's value is empty,
248 the field's tag is the negative number of fields belonging to the
249 structure, including the header. This is the means used by the
250 > Protocol
251 to efficiently and transparently embed any records in messages.
252 Obviously counted structures cannot be accessed by their tag.
253 They are defined as some negative tags.
254 Some known format of their main value (especially a literal)
255 may be used to access them by key.
256 - delimited structures
257 If the s option's value is '+', the field has one additional initial
258 subfield of fixed length 1. For a given occurrence of this field,
259 this subfield must contain either '-', indicating that there are
260 no children, be absent (i.e. the field is completely empty),
261 or contain a '+', indicating that everything up to a matching
262 empty field of same tag are the structure's children.
263 - fixed structures
264 If the s option's value is a number, the structure has exactly as
265 many children as given by this number. Note that the number of fields
266 may be greater if the children are structures themselves. Rarely used.
267
268 Note that while the field definition in general does not specify
269 the ordering of fields, the children of a structure are always
270 a consecutive range according to the structure's definition.
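
To illustrate counted and delimited structures, here is a small sketch
that works on a record given as a list of (tag, value) pairs. The names
parse_delimited, take_counted and is_structure are hypothetical; by
default every tag except 0 is treated as a delimited structure, following
the recommendation to use them consistently.

```python
def parse_delimited(fields, is_structure=lambda tag: tag != 0):
    """Build a tree from a flat record using '+' ... empty-field pairs."""
    root = {"tag": None, "value": "", "children": []}
    stack = [root]
    for tag, value in fields:
        top = stack[-1]
        if is_structure(tag) and value == "" and top["tag"] == tag:
            stack.pop()            # matching empty field closes the structure
            continue
        node = {"tag": tag, "value": value, "children": []}
        top["children"].append(node)
        if is_structure(tag) and value.startswith("+"):
            stack.append(node)     # following fields are children
    if len(stack) != 1:
        raise ValueError("unclosed structure with tag %r" % stack[-1]["tag"])
    return root["children"]


def take_counted(fields, start=0):
    """Return the fields of a counted structure and the index after it."""
    tag, _ = fields[start]
    if tag >= 0:
        raise ValueError("not a counted structure header")
    end = start + (-tag)           # the count includes the header itself
    if end > len(fields):
        raise ValueError("record too short for counted structure")
    return fields[start:end], end
```

A record such as [(101, "+"), (0, "text"), (103, "-"), (101, "")] yields
one node with tag 101 holding two children, while a header with tag -3
claims itself and the two fields after it.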
271
272
273 Z39.2 reserves control field 002 for "subrecord purposes",
274 e.g. listing the offsets of such "groups of fields".
275
276
277 * recommendations
278
279 - fixed subfields should contain only bytes 32 to 126, inclusive
280 - if delimited structures are used, they should be used consistently,
281 i.e. all fields (but 0) should have that type
282 - fixed structures should only be used for internal purposes
283
284 * examples
285
286 The headers of email or other MIME messages like
287 $
288 Subject: hi there
289 Content-Type: text/plain; charset="iso8859-1"
290 $
291 using a field definition of
292 $
293 6 10 nsubject
294 6 11 ncontent-type
295 6 11^c ncharset
296 $
297 map to
298 $
299 10 hi there
300 11 text/plain ciso8859-1
301 $
302 Value options could be used to encode common values like text/plain.
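
The header mapping above can be reproduced with a few lines of Python.
The tables FIELD_TAGS and SUBFIELD_IDS simply restate the example field
definition; they and the function header_to_field are assumptions of
this sketch. Splitting on ';' is naive and ignores quoted semicolons.

```python
FIELD_TAGS = {"subject": 10, "content-type": 11}
SUBFIELD_IDS = {(11, "charset"): "c"}


def header_to_field(line: str, delim: str = "\t"):
    """Map one 'Name: value; key="val"' header line to (tag, field value)."""
    name, _, body = line.partition(":")
    tag = FIELD_TAGS[name.strip().lower()]
    if not any(t == tag for t, _ in SUBFIELD_IDS):
        return tag, body.strip()          # no subfields defined: keep as is
    main, *params = [p.strip() for p in body.split(";")]
    parts = [main]
    for param in params:
        key, _, val = param.partition("=")
        ident = SUBFIELD_IDS[(tag, key.strip().lower())]
        parts.append(ident + val.strip().strip('"'))
    return tag, delim.join(parts)


print(header_to_field('Content-Type: text/plain; charset="iso8859-1"'))
```

The option name (charset) is replaced by its one-character identifier and
the quotes are dropped, so nothing is lost as long as the field
definition is available for the reverse transformation.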
303
304
305 Using delimited structures, a typical HTML table definition starting with
306 $
307 <table width="100%" cellpadding="0" cellspacing="0"
308 marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
309 <tr>
310 <td valign="top" width="160">
311 this is the textbody <br/> of the td node
312 </td>
313 </tr>
314 ...
315 $
316 using
317 $
318 6 100 ntd s+
319 6 100^w nwidth
320 ...
321 6 101 ntr
322 ...
323 $
324 will be compacted to
325 $
326 100 + w100% p0 s0 m0 h0 t0 l0 b0
327 101 +
328 102 + vtop w160
329 0 this is the textbody
330 103 -
331 0 of the td node
332 102
333 101
334 ...
335 $
336 which could save half of the internet's bandwidth.
337
338 Some strict XML parsers limit a node to at most one textnode child,
339 which then should be stored in the node's main value.
340
341
342 * conformance
343
344 Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
345 Malete records. Subfield identifiers in Z39.2 can use more than one
346 character; MARC, however, always uses one.
347 Initial fixed subfields are dubbed "indicators" by Z39.2;
348 MARC uses two of length 1. They are not considered "data elements",
349 as other subfields are. Here, fixed subfields are considered less special.
350
351
352 MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
353 in a straightforward manner after a parser has resolved entities and the like.
354
355
356 ---
357 $Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $
