Annotation of /openisis/0.9.9e/doc/RecStruct.txt

Revision 604
Mon Dec 27 21:49:01 2004 UTC (19 years, 4 months ago) by dpavlin
File MIME type: text/plain
File size: 13806 byte(s)
import of new openisis release, 0.9.9e

1 dpavlin 604 OpenIsis/Malete field definition and record structures
2    
3    
4     * overview
5    
6     A Malete record is a sequence of one or more fields.
7     The first one is called the header, all others are identified by a numeric tag.
8    
9     As far as the Malete database core is concerned,
10     a field may contain arbitrary bytes except newline characters.
11     Assuming anything about the structure of field data,
12     including any encoding of binary data,
13     is solely at the application's discretion.
14    
15     As Malete is designed to be a multi-purpose database engine,
16     there is no special schema enforced.
17     However, there is a schema suggested and used by the OpenIsis application.
18     In the database's
19     > MetaData metadata record,
20     fields with tag (00)6 are reserved for this purpose (abuse at your own risk).
21    
22    
23     The rationale of this field definition is to provide enough flexibility
24     to efficiently support representations of all structures found in Z39.2
25     based systems (including but transcending the traditional CDS/ISIS software),
26     especially the various MARC formats, as well as full representations of
27     data commonly stored and transmitted in a couple of other formats like
28     MIME and XML.
29    
30     The term "representation" means that Malete will not bother to
31     directly support XML's angle brackets nor XML's/MIME's foo="bar" options
32     nor the subfield delimiter characters of MARC or CDS/ISIS.
33     Rather, for any such data there should be a lossless transformation to an
34     efficient representation in some format described by this field definition.
35    
36    
37     * structure of fields
38    
39     While fields may be used to hold a single value,
40     it is a common technique to treat them as a sequence of subfields.
41     ("A data element considered as a component of a field.", Z39.2).
42    
43     A field may contain, in that order:
44     - 0 or more positional subfields of fixed length
45     - 1 or more positional subfields of variable length
46     - 0 or more identified subfields of variable length
47    
48     Fixed length subfields end after as many bytes (not characters!) as given by
49     their length. They are typically used for data coded in some ASCII values.
50     Neither UTF-8 characters nor the delimiter character should be stored
51     in fixed length subfields (however, it's up to the application to exercise care).
52    
53     Variable length subfields end at a delimiter character or end of field.
54     Malete by default uses a tabulator as delimiter,
55     and import of CDS/ISIS databases converts the caret (hat '^') to tabs;
56     however, applications are free to use any delimiter they want.
57    
58    
59     Positional subfields are identified by their position within the field,
60     i.e. by counting that many bytes and delimiters.
61     Of course, there is only one nth position within a field,
62     i.e. every positional subfield can occur at most once.
63     Since the first n bytes and first m delimited subfields are used as the
64     positional subfields, they may be omitted only if end of field is seen,
65     i.e. all other subfields are omitted.
66    
67     Identified subfields, on the other hand, start with a single character
68     identifying the subfield, just like fields in a record are identified by a tag.
69     Applications unaware of UTF-8 may demand a single byte as identifier.
70     Where portability is an issue, only ASCII letters and digits should be used.
71     Since there is at least one positional variable subfield,
72     identified subfields always start after a delimiter (in accordance with Z39.2).
73     An identified subfield may occur zero, one or more times in a field.
74    
75    
76     The MAIN VALUE of a field contains the fixed length subfields together with
77     the first positional variable subfield. Sloppy applications may use anything
78     up to the first delimiter, assuming that fixed subfields do not contain it.
79     In the common situation of having no fixed length subfields,
80     the main value equals the first positional subfield.
81     The main value in a field is very similar to a record's header
82     and commonly used as a key to select a field in a record.
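
The subfield layout described above can be sketched in Python. This is only an
illustration, not code from Malete: the function name split_field and its
parameters are hypothetical, the field value is handled as bytes because fixed
lengths count bytes, and subfield identifiers are assumed to be a single byte.

```python
def split_field(value, fixed_lens=(), n_positional=1, delim=b"\t"):
    """Split a raw field value into its three kinds of subfields.

    fixed_lens   -- byte lengths of the fixed positional subfields (assumed known
                    from the field definition)
    n_positional -- number of variable positional subfields (at least 1)
    Returns (fixed, positional, identified, main_value).
    """
    fixed = []
    pos = 0
    for length in fixed_lens:
        # Fixed subfields end after exactly `length` bytes, not characters.
        fixed.append(value[pos:pos + length])
        pos += length
    # Everything after the fixed part is split at the delimiter.
    parts = value[pos:].split(delim)
    positional = parts[:n_positional]
    # Identified subfields start with a one-byte identifier (assumption).
    identified = [(p[:1], p[1:]) for p in parts[n_positional:] if p]
    # Main value: fixed subfields plus the first variable positional subfield.
    main_value = value[:pos] + positional[0]
    return fixed, positional, identified, main_value


fixed, positional, identified, main = split_field(b"+\tw100%\tp0", fixed_lens=(1,))
print(fixed, positional, identified, main)
```

With one fixed subfield of length 1, the value "+<tab>w100%<tab>p0" has the main
value "+", an empty first positional subfield, and the identified subfields w and p.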
83    
84    
85     The properties of subfields stated so far are consequences of their very
86     definition. Additional properties, e.g. the main value being empty
87     or an identified subfield having a fixed length and/or occurring exactly once,
88     may be demanded by field definition.
89     It is the application's responsibility to make sure records do not violate
90     the field definition; the Malete server will happily store whatever it receives.
91    
92    
93     * definition of fields
94    
95     The field definition uses fields of the metadata record,
96     one per field and one per subfield.
97     These fields themselves do not use fixed length subfields.
98     The main value is a (non-unique) key:
99     - 'tag' for a field definition,
100     where tag is an integer. Negative numbers are reserved for counted structures.
101     By convention, general application data fields should
102     > TagUse use tags
103     100 - 999.
104     - 'tag#len' for a fixed subfield,
105     where len is a positive integer
106     - 'tag#' for an additional variable positional subfield.
107     The first variable positional subfield's type, values and xref
108     are defined with the main field definition.
109     - 'tag^i' for a subfield identified by character i
110     ('^' is the actual hat character, which is NOT the subfield delimiter;
111     the field definition uses tabulators)
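
As a rough sketch, the four key forms listed above could be told apart with a
regular expression. The function name parse_def_key and the returned tuples are
hypothetical and only serve this illustration.

```python
import re

# tag, optionally followed by '#len', '#', or '^i'
_KEY = re.compile(r"(-?\d+)(?:(#)(\d*)|\^(.))?")


def parse_def_key(key):
    """Classify the main value of a field definition entry.

    Returns (kind, tag, detail) where kind is one of
    'field', 'fixed', 'positional' or 'identified'.
    """
    m = _KEY.fullmatch(key)
    if m is None:
        raise ValueError(f"not a field definition key: {key!r}")
    tag = int(m.group(1))
    if m.group(2):                      # 'tag#len' or 'tag#'
        if m.group(3):
            length = int(m.group(3))
            if length <= 0:
                raise ValueError("fixed subfield length must be positive")
            return ("fixed", tag, length)
        return ("positional", tag, None)
    if m.group(4) is not None:          # 'tag^i'
        return ("identified", tag, m.group(4))
    return ("field", tag, None)


for key in ("6", "6^n", "100#2", "100#"):
    print(key, parse_def_key(key))
```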
112    
113     All other subfields in the field definition are identified and optional:
114     - n name
115     A name by which a field or subfield can be referred to.
116     Field names must be unique and subfield names must be unique in their field.
117     It is strongly recommended to only use C identifiers,
118     i.e. ASCII letters, digits and the underscore, not starting with a digit.
119     - d description
120     Some textual description suitable for the database users.
121     - m min/mandatory
122     The sub/field must occur at least as many times as given by this option's
123     value (empty=1, absent=0).
124     - r repeatable
125     The sub/field must occur at most as many times as given by this option's
126     value (empty=any, absent=1). A value preceded by '+' (including a single
127     '+' for any) implies the mandatory option (at least one occurrence).
128     - v value
129     Every occurrence of this repeatable option is of the form name=value,
130     associating the symbolic name with a legal value for the sub/field.
131     The first such value is used as a default where the sub/field is created
132     for some reason.
133     - t type
134     Type of this sub/field; see further below.
135     Defaults to any (non-control) characters.
136     Applications might support repeated alternative types.
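
The interplay of the m and r options can be sketched as follows. The helper
occurrence_bounds is hypothetical; it assumes the options of one definition
entry are available as a dict in which an absent key means an absent option and
an empty string means an empty option.

```python
def occurrence_bounds(options):
    """Return (min, max) occurrences for a sub/field; max None means 'any'."""
    lo = 0                                  # m absent: not mandatory
    if "m" in options:
        lo = int(options["m"]) if options["m"] else 1   # m empty: at least once
    hi = 1                                  # r absent: at most once
    if "r" in options:
        r = options["r"]
        if r.startswith("+"):               # '+' implies the mandatory option
            lo = max(lo, 1)
            r = r[1:]
        hi = int(r) if r else None          # r empty: any number of times
    return lo, hi


print(occurrence_bounds({}))                # (0, 1)
print(occurrence_bounds({"r": "+"}))        # (1, None)
print(occurrence_bounds({"m": "2", "r": "5"}))  # (2, 5)
```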
137    
138     * types of subfields
139    
140     Note that a field's type actually defines the type of its first
141     positional variable subfield (which is usually the main value).
142     If there are no subfields defined for a field,
143     the field's value equals its main value.
144    
145    
146     A simple type definition consists of a single letter indicating
147     a character type, optionally followed by some digits giving a repeat count.
148     Unlike the byte-based length restrictions of fixed length subfields,
149     the repeat count should be assumed in terms of characters.
150    
151     For the terms "alphabetic" and "digit", it's up to the application's
152     UNICODE support to properly check these attributes for non-ASCII characters.
153     Simple environments may assume any code greater than 127 alphabetic.
154    
155     Basic character types are:
156     - c character
157     Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
158     - a alpha
159     Any alphabetic character.
160     - d digit
161     ASCII digits '0'-'9'.
162     - n numeric
163     Digits and optional leading minus sign.
164     - w word
165     Alpha, digits and underscore.
166    
167     Extended character/byte types, possibly not supported by all environments, are:
168     - b bit/boolean
169     ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should
170     be assumed, but a '1' if it's present and empty.
171     - r raw
172     Raw bytes using newline/vertical tab encoding as suggested by the
173     > Protocol
174     - i integer
175     Binary coded fixed point decimal numbers using two decimal digits per byte
176     (128-99 .. 128+99) and starting with a byte 144 plus the number of bytes before
177     the decimal point (minus for negative numbers).
178     Such integers sort properly, avoid newlines and tabs, and the first byte
179     (for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
180     - t time
181     Date and time as GTF integer. Up to 8 digits before the decimal point
182     for date YYYYMMDD, after the decimal point hhmmss...
183    
184     For all simple type definitions, the same letter may be used uppercase.
185     With lowercase, the repeat count gives a maximum and defaults to any.
186     With an uppercase type letter, the repeat count is exact and defaults to 1.
187    
188    
189     Complex type definitions include the following:
190     - = pattern
191     Pattern is a sequence of simple type definitions of basic character types.
192     E.g. 'A3a6' denotes 3 to 9 alphabetic characters.
193     Any special character in pattern denotes itself (typically as separator).
194     - ~ regexp
195     Depending on the regexp package used.
196     - " literal
197     Must have one of the values listed with the field's v option.
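
One possible way to check values against a '=' pattern is to translate it into a
regular expression. The sketch below is an assumption-laden illustration: the
name pattern_to_regex is hypothetical, the character classes are approximations
based on Python's Unicode-aware re module, and the n (numeric) type is left out
because its leading minus sign does not fit a simple per-character class.

```python
import re

# Approximate character classes for the basic types c, a, d and w.
_CLASSES = {
    "c": r"[^\x00-\x1f]",   # no C0 controls
    "a": r"[^\W\d_]",       # alphabetic
    "d": r"[0-9]",          # ASCII digits
    "w": r"\w",             # alpha, digits, underscore
}

_TOKEN = re.compile(r"([A-Za-z])(\d*)|(.)", re.DOTALL)


def pattern_to_regex(pattern):
    """Compile a pattern such as 'A3a6' into a regular expression."""
    out = []
    for letter, count, literal in _TOKEN.findall(pattern):
        if literal:
            out.append(re.escape(literal))      # special characters denote themselves
            continue
        cls = _CLASSES.get(letter.lower())
        if cls is None:
            raise ValueError(f"type {letter!r} not supported in this sketch")
        if letter.isupper():
            out.append(f"{cls}{{{count or 1}}}")    # exact count, default 1
        elif count:
            out.append(f"{cls}{{0,{count}}}")       # maximum count
        else:
            out.append(cls + "*")                   # any number
    return re.compile("".join(out))


rx = pattern_to_regex("A3a6")
print(bool(rx.fullmatch("abc")), bool(rx.fullmatch("ab")))
```

Values must be tested with fullmatch, since the pattern describes the whole subfield.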
198    
199     The field definition of the basic field definition itself is:
200     $
201     6 6 nfdt dfield definition r t=Nc
202     6 6^n nname dsub/field name
203     6 6^d ndesc dsub/field description
204     6 6^m nmin dmin number of occurrences tn
205     6 6^r nrep dmax number of occurrences
206     6 6^v nval dnamed values r
207     6 6^t ntype dsub/field type
208     $
209    
210    
211     * advanced field definition
212    
213     There are some advanced field definition options which are probably
214     not supported by all applications.
215     Where used, however, the following formats are recommended:
216     - b base
217     The key or name of another sub/field definition in this metadata record
218     from which options (and, for a field, subfield definitions) should be
219     used for this entity. Obviously just a convenience feature.
220     - x xref
221     Definition of some other entity referred to by the value of this sub/field.
222     Described elsewhere.
223     - s structure a.k.a. subrecord
224     The field introduces a structure in the record; see further below.
225     - c child
226     This repeatable option specifies a tag or name of a legal child field.
227     Applications might support this being followed by '[:min][-max]'
228     to specify a min and/or max count of occurrences of this child,
229     or one of the letters '+' (at least once), '?' (at most once),
230     '!' (exactly once) or '*' (any number of times, default).
231     In the definition of those children, r0 may be used to indicate that they
232     should not occur in the record except where explicitly listed as legal child.
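
A value of the c option might be parsed as sketched below. This reads the
brackets in '[:min][-max]' as marking optional parts rather than literal
characters, which is an interpretation, not something stated above; the name
parse_child is hypothetical.

```python
import re

# child reference, then either one flag letter or optional ':min' and '-max'
_CHILD = re.compile(r"(\w+)(?:([+?!*])|(?::(\d+))?(?:-(\d+))?)")

_FLAGS = {"+": (1, None), "?": (0, 1), "!": (1, 1), "*": (0, None)}


def parse_child(spec):
    """Return (child, min, max) for a child option value; max None means 'any'."""
    m = _CHILD.fullmatch(spec)
    if m is None:
        raise ValueError(f"bad child specification: {spec!r}")
    ref, flag, lo, hi = m.groups()
    if flag:
        return (ref, *_FLAGS[flag])
    # Without a flag the default is '*', narrowed by any explicit bounds.
    return (ref, int(lo) if lo else 0, int(hi) if hi else None)


for spec in ("td", "td+", "101:2-5"):
    print(spec, parse_child(spec))
```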
233    
234     * structures
235    
236     The structure option indicates that a field is the header of a structure,
237     indicating that some fields following it in the record somehow belong to it.
238     ("A group of fields within a record that may be treated as a logical entity.
239     (When a record describes more than one entity, the descriptions of individual
240     entities may be treated as subrecords.)", Z39.2).
241    
242    
243     While in general there are a couple of ways to mark a sequence of fields
244     as logically being one entity, there are three methods supported by
245     the field definition:
246     - counted structures
247     If the s option's value is empty,
248     the field's tag is the negative number of fields belonging to the
249     structure, including the header. This is the means used by the
250     > Protocol
251     to efficiently and transparently embed any records in messages.
252     Obviously counted structures cannot be accessed by their tag.
253     They are defined as some negative tags.
254     Some known format of their main value (especially a literal)
255     may be used to access them by key.
256     - delimited structures
257     If the s option's value is '+', the field has one additional initial
258     subfield of fixed length 1. For a given occurrence of this field,
259     this subfield must contain either '-', indicating that there are
260     no children, be absent (i.e. the field is completely empty),
261     or contain a '+', indicating that everything up to a matching
262     empty field of same tag are the structure's children.
263     - fixed structures
264     If the s option's value is a number, the structure has exactly as
265     many children as given by this number. Note that the number of fields
266     may be greater if the children are structures themselves. Rarely used.
267    
268     Note that while the field definition in general does not specify
269     the ordering of fields, the children of a structure are always
270     a consecutive range according to the structure's definition.
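
To illustrate delimited structures, the following sketch rebuilds the nesting
from a flat list of fields. It assumes a record is given as (tag, value) pairs
with the value as a string, and that the caller knows which tags carry the s+
option; build_tree and structured_tags are hypothetical names.

```python
def build_tree(fields, structured_tags):
    """Turn a flat field list into nested nodes using delimited structures."""
    root = {"tag": None, "value": "", "children": []}
    stack = [root]
    for tag, value in fields:
        if tag in structured_tags and value == "":
            # A completely empty field closes the open structure of the same tag.
            if stack[-1]["tag"] != tag:
                raise ValueError(f"unmatched closing field {tag}")
            stack.pop()
            continue
        node = {"tag": tag, "value": value, "children": []}
        stack[-1]["children"].append(node)
        if tag in structured_tags and value.startswith("+"):
            stack.append(node)      # '+' opens a structure with children
        # '-' (or a non-structured tag) means the field has no children
    if len(stack) != 1:
        raise ValueError("unclosed structure at end of record")
    return root["children"]


record = [
    (101, "+"),
    (102, "+\tvtop\tw160"),
    (0, "some text"),
    (103, "-"),
    (102, ""),
    (101, ""),
]
tree = build_tree(record, {101, 102, 103})
print([child["tag"] for child in tree[0]["children"][0]["children"]])
```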
271    
272    
273     Z39.2 reserves control field 002 for "subrecord purposes",
274     e.g. listing the offsets of such "groups of fields".
275    
276    
277     * recommendations
278    
279     - fixed subfields should contain only bytes 32 to 126, inclusive
280     - if delimited structures are used, they should be used consistently,
281     i.e. all fields (except 0) should have that type
282     - fixed structures should only be used for internal purposes
283    
284     * examples
285    
286     The headers of email or other MIME messages like
287     $
288     Subject: hi there
289     Content-Type: text/plain; charset="iso8859-1"
290     $
291     using a field definition of
292     $
293     6 10 nsubject
294     6 11 ncontent-type
295     6 11^c ncharset
296     $
297     map to
298     $
299     10 hi there
300     11 text/plain ciso8859-1
301     $
302     Value options could be used to encode common values like text/plain.
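
The mapping above could be produced by a small converter along these lines. It
is a naive sketch: header_to_field is a hypothetical name, the two lookup tables
stand in for the field definition, and splitting at ';' would mangle header
values that legitimately contain a semicolon.

```python
def header_to_field(line, tags, param_ids, delim="\t"):
    """Convert one MIME header line to a (tag, value) pair.

    tags      -- header name (lowercase) -> field tag
    param_ids -- parameter name (lowercase) -> subfield identifier
    """
    name, _, rest = line.partition(":")
    parts = [p.strip() for p in rest.split(";")]
    tag = tags[name.strip().lower()]
    subfields = [parts[0]]                  # main value
    for param in parts[1:]:
        key, _, val = param.partition("=")
        # identified subfield: identifier character followed by the value
        subfields.append(param_ids[key.strip().lower()] + val.strip().strip('"'))
    return tag, delim.join(subfields)


tags = {"subject": 10, "content-type": 11}
param_ids = {"charset": "c"}
print(header_to_field("Subject: hi there", tags, param_ids))
print(header_to_field('Content-Type: text/plain; charset="iso8859-1"', tags, param_ids))
```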
303    
304    
305     Using delimited structures, a typical HTML table definition starting with
306     $
307     <table width="100%" cellpadding="0" cellspacing="0"
308     marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
309     <tr>
310     <td valign="top" width="160">
311     this is the textbody <br/> of the td node
312     </td>
313     </tr>
314     ...
315     $
316     using
317     $
318     6 100 ntable s+
319     6 100^w nwidth
320     ...
321     6 101 ntr
322     ...
323     $
324     will be compacted to
325     $
326     100 + w100% p0 s0 m0 h0 t0 l0 b0
327     101 +
328     102 + vtop w160
329     0 this is the textbody
330     103 -
331     0 of the td node
332     102
333     101
334     ...
335     $
336     which could save half of the internet's bandwidth.
337    
338     Some strict XML parsers limit a node to at most one textnode child,
339     which then should be stored in the node's main value.
340    
341    
342     * conformance
343    
344     Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
345     Malete records. Subfield identifiers in Z39.2 can use more than one
346     character; MARC, however, always uses one.
347     Initial fixed subfields are dubbed "indicators" by Z39.2;
348     MARC uses two of length 1. They are not considered "data elements",
349     as other subfields are. Here, fixed subfields are considered less special.
350    
351    
352     MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
353     in a straightforward manner after a parser has resolved entities and the like.
354    
355    
356     ---
357     $Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $
