0.9.9e/doc/RecStruct.txt

OpenIsis/Malete field definition and record structures


*       overview

A Malete record is a sequence of one or more fields.
The first one is called the header, all others are identified by a numeric tag.

As far as the Malete database core is concerned,
a field may contain arbitrary bytes except newline characters.
Assuming anything about the structure of field data,
including any encoding of binary data,
is solely at the application's discretion.

As Malete is designed to be a multi-purpose database engine,
there is no special schema enforced.
However, there is a schema suggested and used by the OpenIsis application.
In the database's
>       MetaData        metadata record,
fields with tag (00)6 are reserved for this purpose (abuse at your own risk).


The rationale of this field definition is to provide enough flexibility
to efficiently support representations of all structures found in Z39.2
based systems (including but transcending the traditional CDS/ISIS software), 
especially the various MARC formats, as well as full representations of
data commonly stored and transmitted in a couple of other formats like
MIME and XML.

The term "representation" means that Malete will not bother to
directly support XML's angle brackets nor XML's/MIME's foo="bar" options
nor the subfield delimiter characters of MARC or CDS/ISIS.
Rather, for any such data there should be a lossless transformation to an
efficient representation in some format described by this field definition.


*       structure of fields

While fields may be used to hold a single value,
it is a common technique to treat them as a sequence of subfields.
("A data element considered as a component of a field.", Z39.2).

A field may contain, in that order:
-       0 or more positional subfields of fixed length
-       1 or more positional subfields of variable length
-       0 or more identified subfields of variable length

Fixed length subfields end after as many bytes (not characters!) as given by
their length. They are typically used for data coded as ASCII values.
Neither multi-byte UTF-8 characters nor the delimiter character should be stored
in fixed length subfields (however, it's up to the application to exercise care).

Variable length subfields end at a delimiter character or end of field.
Malete by default uses a tabulator as delimiter,
and import of CDS/ISIS databases converts the caret (hat '^') to tabs;
however, applications are free to use any delimiter they want.


Positional subfields are identified by their position within the field,
i.e. by counting that many bytes and delimiters.
Of course, there is only one nth position within a field,
i.e. every positional subfield can occur at most once.
Since the first n bytes and first m delimited subfields are used as the
positional subfields, they may be omitted only if end of field is seen,
i.e. all other subfields are omitted.

Identified subfields, on the other hand, start with a single character
identifying the subfield, just like fields in a record are identified by a tag.
Applications unaware of UTF-8 may demand a single byte as identifier.
Where portability is an issue, only ASCII letters and digits should be used.
Since there is at least one positional variable subfield,
identified subfields always start after a delimiter (in accordance with Z39.2).
An identified subfield may occur zero, one or more times in a field.
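
The layout described above can be sketched in Python as follows.
This is only an illustration, not Malete code: the function name split_field
and its parameters are made up here, and the caller is assumed to know the
fixed lengths and the number of positional variable subfields
from the field definition.

```python
from typing import List, Tuple


def split_field(data: bytes, fixed_lengths: List[int], n_positional: int = 1,
                delimiter: str = "\t"
                ) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Split one field into (fixed, positional, identified) subfields."""
    fixed = []
    pos = 0
    # Fixed subfields are cut by byte count, before any decoding.
    for length in fixed_lengths:
        fixed.append(data[pos:pos + length].decode("ascii", errors="replace"))
        pos += length
    # Everything after the fixed part is delimiter-separated.
    parts = data[pos:].decode("utf-8", errors="replace").split(delimiter)
    positional = parts[:n_positional]
    # An identified subfield is one identifying character plus its value.
    identified = [(p[0], p[1:]) for p in parts[n_positional:] if p]
    return fixed, positional, identified
```

For example, split_field(b"text/plain\tciso8859-1", []) yields no fixed
subfields, the positional subfield "text/plain" and the identified subfield
("c", "iso8859-1").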


The MAIN VALUE of a field contains the fixed length subfields together with
the first positional variable subfield. Sloppy applications may use anything
up to the first delimiter, assuming that fixed subfields do not contain it.
In the common situation of having no fixed length subfields,
the main value equals the first positional subfield.
The main value in a field is very similar to a record's header
and commonly used as a key to select a field in a record.
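
A minimal sketch of extracting the main value and using it as a key,
assuming fields are held as (tag, bytes) pairs. The names main_value and
select_field are hypothetical. Passing the total byte length of the fixed
subfields gives the strict reading; leaving it at 0 gives the "sloppy" one.

```python
from typing import Iterable, Optional, Tuple


def main_value(data: bytes, fixed_total: int = 0,
               delimiter: bytes = b"\t") -> bytes:
    """Fixed subfields plus the first positional variable subfield."""
    # Skip the fixed part, then stop at the next delimiter (or end of field).
    end = data.find(delimiter, fixed_total)
    return data if end < 0 else data[:end]


def select_field(fields: Iterable[Tuple[int, bytes]], tag: int, key: bytes,
                 fixed_total: int = 0) -> Optional[bytes]:
    """Return the first field with the given tag whose main value is key."""
    for field_tag, data in fields:
        if field_tag == tag and main_value(data, fixed_total) == key:
            return data
    return None
```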


The properties of subfields stated so far are consequences of their very
definition. Additional properties, e.g. the main value being empty
or an identified subfield having a fixed length and/or occurring exactly once,
may be demanded by the field definition.
It is the application's responsibility to make sure records do not violate
the field definition; the Malete server will happily store whatever it receives.


*       definition of fields

The field definition uses fields of the metadata record,
one per field and one per subfield.
These fields themselves do not use fixed length subfields.
The main value is a (non-unique) key:
-       'tag' for a field definition,
        where tag is an integer. Negative numbers are reserved for counted structures.
        By convention, general application data fields should
>       TagUse  use tags
        100 - 999.
-       'tag#len' for a fixed subfield,
        where len is a positive integer
-       'tag#' for an additional variable positional subfield.
        The first variable positional subfield's type, values and xref
        are defined with the main field definition.
-       'tag^i' for a subfield identified by character i
        ('^' is the actual hat character, which is NOT the subfield delimiter;
        the field definition uses tabulators)
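
The four key forms can be told apart mechanically. The following is a sketch
only; DefKey and parse_def_key are names invented for this illustration.

```python
import re
from typing import NamedTuple, Optional, Union


class DefKey(NamedTuple):
    tag: int
    kind: str  # "field", "fixed", "positional" or "identified"
    arg: Optional[Union[int, str]]  # length for "fixed", character for "identified"


def parse_def_key(key: str) -> DefKey:
    """Classify the main value of a field definition entry."""
    m = re.fullmatch(r"(-?[0-9]+)(?:#([0-9]*)|\^(.))?", key)
    if m is None:
        raise ValueError("not a field definition key: %r" % key)
    tag = int(m.group(1))
    length, ident = m.group(2), m.group(3)
    if ident is not None:
        return DefKey(tag, "identified", ident)   # 'tag^i'
    if length is None:
        return DefKey(tag, "field", None)         # 'tag'
    if length == "":
        return DefKey(tag, "positional", None)    # 'tag#'
    if int(length) == 0:
        raise ValueError("len must be a positive integer: %r" % key)
    return DefKey(tag, "fixed", int(length))      # 'tag#len'
```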

All other subfields in the field definition are identified and optional:
-       n       name
        A name by which a field or subfield can be referred to.
        Field names must be unique and subfield names must be unique in their field.
        It is strongly recommended to only use C identifiers,
        i.e. ASCII letters, digits and the underscore, not starting with a digit.
-       d description
        Some textual description suitable for the database users.
-       m min/mandatory
        The sub/field must occur at least as many times as given by this option's
        value (empty=1, absent=0).
-       r repeatable
        The sub/field must occur at most as many times as given by this option's
        value (empty=any, absent=1). A value preceded by '+' (including a single
        '+' for any) implies the mandatory option (at least one occurrence).
-       v value
        Every occurrence of this repeatable option is of the form name=value,
        associating the symbolic name with a legal value for the sub/field.
        The first such value is used as a default where the sub/field is created
        for some reason.
-       t type
        Type of this sub/field; see further below.
        Defaults to any (non-control) characters.
        Applications might support repeated alternative types.
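
How the m and r options combine into occurrence bounds can be sketched as
follows. The function name and the dict of options (identifier to value,
absent options simply missing) are assumptions of this example. Only m and r
are read, so the repeatable v option is not a problem here.

```python
from typing import Dict, Optional, Tuple


def occurrence_bounds(options: Dict[str, str]) -> Tuple[int, Optional[int]]:
    """Return (min, max) occurrences; max None means any number."""
    # m: empty=1, absent=0
    minimum = 0
    if "m" in options:
        minimum = int(options["m"]) if options["m"] else 1
    # r: empty=any, absent=1; a leading '+' implies at least one occurrence
    maximum: Optional[int] = 1
    if "r" in options:
        value = options["r"]
        if value.startswith("+"):
            minimum = max(minimum, 1)
            value = value[1:]
        maximum = int(value) if value else None
    return minimum, maximum
```

With this reading, r0 gives (0, 0), i.e. the sub/field must not occur.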

*       types of subfields

Note that a field's type actually defines the type of its first
positional variable subfield (which is usually the main value).
If there are no subfields defined for a field,
the field's value equals its main value.


A simple type definition consists of a single letter indicating
a character type, optionally followed by some digits giving a repeat count.
Unlike the byte-based length restrictions of fixed length subfields,
the repeat count should be assumed in terms of characters.

For the terms "alphabetic" and "digit", it's up to the application's
UNICODE support to properly check these attributes for non-ASCII characters.
Simple environments may assume any code greater than 127 alphabetic.

Basic character types are:
-       c character
        Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
-       a alpha
        Any alphabetic character.
-       d digit
        ASCII digits '0'-'9'.
-       n numeric
        Digits and optional leading minus sign.
-       w word
        Alpha, digits and underscore.

Extended character/byte types, possibly not supported by all environments, are:
-       b bit/boolean
        ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should
        be assumed, but a '1' if it's present and empty.
-       r       raw
        Raw bytes using newline/vertical tab encoding as suggested by the
>       Protocol
-       i integer
        Binary coded fixed point decimal numbers using two decimal digits per byte
        (128-99 .. 128+99) and starting with a byte 144 plus the bytes before
        the decimal point (minus for negative numbers).
        Such integers sort properly, avoid newlines and tabs, and the first byte
        (for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
-       t time
        Date and time as GTF integer. Up to 8 digits before the decimal point
        for date YYYYMMDD, after the decimal point hhmmss...

For all simple type definitions, the same letter may be used uppercase.
With lowercase, the repeat count gives a maximum and defaults to any.
With an uppercase type letter, the repeat count is exact and defaults to 1.
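
The case rule can be sketched as a small parser that turns a simple type
definition into a character type plus minimum and maximum repeat counts.
The name parse_simple_type is hypothetical, and the minimum of 0 for lowercase
letters is an interpretation of "gives a maximum".

```python
import re
from typing import Optional, Tuple


def parse_simple_type(spec: str) -> Tuple[str, int, Optional[int]]:
    """Return (type letter in lowercase, min count, max count or None for any)."""
    m = re.fullmatch(r"([A-Za-z])([0-9]*)", spec)
    if m is None:
        raise ValueError("not a simple type definition: %r" % spec)
    letter, digits = m.group(1), m.group(2)
    if letter.isupper():
        # Uppercase: the count is exact and defaults to 1.
        count = int(digits) if digits else 1
        return letter.lower(), count, count
    # Lowercase: the count is a maximum and defaults to any.
    return letter, 0, (int(digits) if digits else None)
```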


Complex type definitions include the following:
-       = pattern
        Pattern is a sequence of simple type definitions of basic character types.
        E.g. 'A3a6' denotes 3 to 9 alphabetic characters.
        Any special character in pattern denotes itself (typically as separator).
-       ~ regexp
        Depending on the regexp package used.
-       " literal
        Must have one of the values listed with the field's v option.
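
One possible way to check values against a '=' pattern is to translate the
pattern into a regular expression. The sketch below is an assumption-laden
illustration: pattern_to_regex is an invented name, the character classes
follow the "simple environment" rule (any code above 127 counts as
alphabetic), and the n type is left out because the text does not say
whether the minus sign counts towards the repeat count.

```python
import re

# Basic character types as regex classes (simple-environment reading).
_CLASSES = {
    "c": r"[^\x00-\x1f]",
    "a": r"[A-Za-z\u0080-\U0010FFFF]",
    "d": r"[0-9]",
    "w": r"[A-Za-z0-9_\u0080-\U0010FFFF]",
}

# Either a type letter with optional count, or any other single character.
_TOKEN = re.compile(r"([A-Za-z])([0-9]*)|(.)", re.DOTALL)


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a '=' pattern such as 'A3a6' into a regular expression."""
    parts = []
    for letter, digits, special in _TOKEN.findall(pattern):
        if special:
            # Special characters denote themselves.
            parts.append(re.escape(special))
            continue
        cls = _CLASSES.get(letter.lower())
        if cls is None:
            raise ValueError("unsupported type letter: %r" % letter)
        if letter.isupper():
            parts.append("%s{%d}" % (cls, int(digits) if digits else 1))
        elif digits:
            parts.append("%s{0,%d}" % (cls, int(digits)))
        else:
            parts.append(cls + "*")
    return re.compile("".join(parts))
```

A value is legal if the compiled expression matches it completely, e.g.
pattern_to_regex("A3a6").fullmatch(value) is not None.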

The field definition describing the basic field definition itself is:
$
6       6       nfdt    dfield definition       r       t=Nc
6       6^n     nname   dsub/field name
6       6^d     ndesc   dsub/field description
6       6^m     nmin    dmin number of occurrences      tn
6       6^r     nrep    dmax number of occurrences
6       6^v     nval    dnamed values   r
6       6^t     ntype   dsub/field type
$


*       advanced field definition

There are some advanced field definition options which are probably
not supported by all applications.
Where used, however, the following formats are recommended:
-       b base
        The key or name of another sub/field definition in this metadata record
        from which options (and, for a field, subfield definitions) should be
        used for this entity. Obviously just a convenience feature.
-       x xref
        Definition of some other entity referred to by the value of this sub/field.
        Described elsewhere.
-       s structure a.k.a. subrecord
        The field introduces a structure in the record; see further below.
-       c child
        This repeatable option specifies a tag or name of a legal child field.
        Applications might support this being followed by '[:min][-max]'
        to specify a min and/or max count of occurrences of this child,
        or one of the characters '+' (at least once), '?' (at most once),
        '!' (exactly once) or '*' (any number of times, default).
        In the definition of those children, r0 may be used to indicate that they
        should not occur in the record except where explicitly listed as legal child.

*       structures

The structure option indicates that a field is the header of a structure,
indicating that some fields following it in the record somehow belong to it.
("A group of fields within a record that may be treated as a logical entity.
(When a record describes more than one entity, the descriptions of individual
entities may be treated as subrecords.)", Z39.2).


While in general there are a couple of ways to mark a sequence of fields
as logically being one entity, there are three methods supported by
the field definition:
-       counted structures
        If the s option's value is empty,
        the field's tag is the negative number of fields belonging to the
        structure, including the header. This is the means used by the
>       Protocol
        to efficiently and transparently embed any records in messages.
        Obviously counted structures cannot be accessed by their tag.
        They are defined as some negative tags.
        Some known format of their main value (especially a literal)
        may be used to access them by key.
-       delimited structures
        If the s option's value is '+', the field has one additional initial
        subfield of fixed length 1. For a given occurrence of this field,
        this subfield must either contain '-', indicating that there are
        no children, or be absent (i.e. the field is completely empty),
        or contain '+', indicating that all fields up to a matching
        empty field of the same tag are the structure's children.
-       fixed structures
        If the s option's value is a number, the structure has exactly as
        many children as given by this number. Note that the number of fields
        may be greater if the children are structures themselves. Rarely used.
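
For counted structures, the negative tag is all that is needed to cut the
structure out of a record. A minimal sketch, assuming a record is a list of
(tag, value) pairs and that the count covers all fields of the structure in
one flat run; split_counted is a hypothetical name.

```python
from typing import List, Tuple

Field = Tuple[int, str]


def split_counted(fields: List[Field], start: int) -> Tuple[List[Field], int]:
    """Return the counted structure starting at index start and the next index."""
    tag, _ = fields[start]
    if tag >= 0:
        raise ValueError("field at %d is not a counted structure header" % start)
    # The tag is the negative number of fields, including the header itself.
    end = start - tag
    if end > len(fields):
        raise ValueError("counted structure runs past end of record")
    return fields[start:end], end
```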

Note that while the field definition in general does not specify
the ordering of fields, the children of a structure are always
a consecutive range according to the structure's definition.


Z39.2 reserves control field 002 for "subrecord purposes",
e.g. listing the offsets of such "groups of fields".


*       recommendations

-       fixed subfields should contain only bytes 32 to 126, inclusive
-       if delimited structures are used, they should be used consistently,
        i.e. all fields (but 0) should have that type
-       fixed structures should only be used for internal purposes

*       examples

The headers of email or other MIME messages like
$
Subject: hi there
Content-Type: text/plain; charset="iso8859-1"
$
using a field definition of
$
6       10      nsubject
6       11      ncontent-type
6       11^c    ncharset
$
map to
$
10      hi there
11      text/plain      ciso8859-1
$
Value options could be used to encode common values like text/plain.
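
The mapping above could be produced by something like the following sketch.
The tables TAGS and PARAM_IDS simply mirror the example field definition and,
like the function name, are assumptions of this illustration. The parameter
splitting is deliberately naive (it does not handle ';' inside quoted values),
so a real converter should use a proper MIME parser.

```python
from typing import Tuple

# Hypothetical lookup tables mirroring the example field definition.
TAGS = {"subject": 10, "content-type": 11}
PARAM_IDS = {11: {"charset": "c"}}


def header_to_field(line: str) -> Tuple[int, str]:
    """Convert one 'Name: value; param=...' header line to (tag, field value)."""
    name, _, value = line.partition(":")
    tag = TAGS[name.strip().lower()]
    if tag not in PARAM_IDS:
        # No identified subfields defined: the whole value is the main value.
        return tag, value.strip()
    main, *params = [p.strip() for p in value.split(";")]
    subfields = [main]
    for param in params:
        key, _, val = param.partition("=")
        # Unknown parameters raise KeyError rather than being dropped silently.
        ident = PARAM_IDS[tag][key.strip().lower()]
        subfields.append(ident + val.strip().strip('"'))
    return tag, "\t".join(subfields)
```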


Using delimited structures, a typical HTML table definition starting with
$
<table width="100%" cellpadding="0" cellspacing="0"
  marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
$
using
$
6       100     ntable  s+
6       100^w   nwidth
...
6       101     ntr
...
$
will be compacted to
$
100     +       w100%   p0      s0      m0      h0      t0      l0      b0
101     +
102     +       vtop    w160
0       this is the textbody
103     -
0       of the td node
102
101
...
$
which could save half of the internet's bandwidth.
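
Reading such a record back into a tree is a short recursive walk.
The sketch below assumes a record is a list of (tag, value) pairs, that all
tags except 0 use delimited structures (as recommended above), and that
values still carry their leading '+' or '-'. The name build_tree is made up.

```python
from typing import List, Optional, Tuple

Field = Tuple[int, str]


def build_tree(fields: List[Field], pos: int = 0,
               closing_tag: Optional[int] = None) -> Tuple[list, int]:
    """Return (nodes, next index); each node is (tag, value, children)."""
    nodes = []
    while pos < len(fields):
        tag, value = fields[pos]
        pos += 1
        if tag == closing_tag and value == "":
            # Matching empty field of the same tag: this structure is complete.
            return nodes, pos
        if tag != 0 and value.startswith("+"):
            children, pos = build_tree(fields, pos, tag)
            nodes.append((tag, value[1:], children))
        elif tag != 0 and value.startswith("-"):
            nodes.append((tag, value[1:], []))
        else:
            nodes.append((tag, value, []))
    if closing_tag is not None:
        raise ValueError("unterminated structure with tag %d" % closing_tag)
    return nodes, pos
```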

Some strict XML parsers limit a node to at most one textnode child,
which then should be stored in the node's main value.


*       conformance

Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
Malete records. Subfield identifiers in Z39.2 can use more than one
character; MARC, however, always uses one.
Initial fixed subfields are dubbed "indicators" by Z39.2;
MARC uses two of length 1. They are not considered "data elements",
as other subfields are. Here, fixed subfields are considered less special.


MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
in a straightforward manner after a parser has resolved entities and the like.


---
        $Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $