0.9.9e/doc/RecStruct.txt

OpenIsis/Malete field definition and record structures


*       overview

A Malete record is a sequence of one or more fields.
The first one is called the header, all others are identified by a numeric tag.

As far as the Malete database core is concerned,
a field may contain arbitrary bytes except newline characters.
Assuming anything about the structure of field data,
including any encoding of binary data,
is solely at the application's discretion.

As Malete is designed to be a multi-purpose database engine,
there is no special schema enforced.
However, there is a schema suggested and used by the OpenIsis application.
In the database's
>       MetaData        metadata record,
fields with tag (00)6 are reserved for this purpose (abuse at your own risk).


The rationale of this field definition is to provide enough flexibility
to efficiently support representations of all structures found in Z39.2
based systems (including but transcending the traditional CDS/ISIS software), 
especially the various MARC formats, as well as full representations of
data commonly stored and transmitted in a couple of other formats like
MIME and XML.

The term "representation" means that Malete will not bother to
directly support XML's angle brackets nor XML's/MIME's foo="bar" options
nor the subfield delimiter characters of MARC or CDS/ISIS.
Rather, for any such data there should be a lossless transformation to an
efficient representation in some format described by this field definition.


*       structure of fields

While fields may be used to hold a single value,
it is a common technique to treat them as a sequence of subfields.
("A data element considered as a component of a field.", Z39.2).

A field may contain, in that order:
-       0 or more positional subfields of fixed length
-       1 or more positional subfields of variable length
-       0 or more identified subfields of variable length

Fixed length subfields end after as many bytes (not characters!) as given by
their length. They are typically used for data coded as ASCII values.
Neither multi-byte UTF-8 characters nor the delimiter character should be stored
in fixed length subfields (however, it's up to the application to exercise care).

Variable length subfields end at a delimiter character or end of field.
Malete by default uses a tabulator as delimiter,
and import of CDS/ISIS databases converts the caret (hat '^') to tabs;
however, applications are free to use any delimiter they want.


Positional subfields are identified by their position within the field,
i.e. by counting that many bytes and delimiters.
Of course, there is only one nth position within a field,
i.e. every positional subfield can occur at most once.
Since the first n bytes and first m delimited subfields are used as the
positional subfields, they may be omitted only if end of field is seen,
i.e. all other subfields are omitted.

Identified subfields, on the other hand, start with a single character
identifying the subfield, just like fields in a record are identified by a tag.
Applications unaware of UTF-8 may demand a single byte as identifier.
Where portability is an issue, only ASCII letters and digits should be used.
Since there is at least one positional variable subfield,
identified subfields always start after a delimiter (in accordance with Z39.2).
An identified subfield may occur zero, one or more times in a field.


The MAIN VALUE of a field contains the fixed length subfields together with
the first positional variable subfield. Sloppy applications may use anything
up to the first delimiter, assuming that fixed subfields do not contain it.
In the common situation of having no fixed length subfields,
the main value equals the first positional subfield.
The main value in a field is very similar to a record's header
and commonly used as a key to select a field in a record.
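
The splitting rules above can be sketched in a few lines of Python.
This is only an illustration, not part of Malete: the names split_field
and main_value are made up here, the field is handled as bytes, the
delimiter is assumed to be the default tabulator, and identifiers are
assumed to be a single byte.

```python
def split_field(value, fixed_lengths=(), positional=1, delim=b"\t"):
    """Split one field value (bytes) into its three kinds of subfields.

    fixed_lengths: byte lengths of the fixed positional subfields
    positional:    number of variable positional subfields (at least 1)
    Returns (fixed, variable, identified).
    """
    fixed = []
    pos = 0
    for length in fixed_lengths:
        # fixed subfields are cut by byte count, not by delimiter
        fixed.append(value[pos:pos + length])
        pos += length
    parts = value[pos:].split(delim)
    # the first `positional` delimited parts are positional subfields
    variable = parts[:positional]
    # everything after that starts with a one-byte identifier;
    # empty parts are skipped in this sketch
    identified = [(p[:1], p[1:]) for p in parts[positional:] if p]
    return fixed, variable, identified


def main_value(value, fixed_lengths=(), delim=b"\t"):
    """Fixed subfields together with the first variable positional subfield."""
    fixed, variable, _ = split_field(value, fixed_lengths, 1, delim)
    return b"".join(fixed) + variable[0]
```

For a field without fixed subfields, main_value simply returns
everything up to the first delimiter.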


The properties of subfields stated so far are consequences of their very
definition. Additional properties, e.g. the main value being empty
or an identified subfield having a fixed length and/or occurring exactly once,
may be demanded by the field definition.
It is the application's responsibility to make sure records do not violate
the field definition; the Malete server will happily store whatever it receives.


*       definition of fields

The field definition uses fields of the metadata record,
one per field and one per subfield.
These fields themselves do not use fixed length subfields.
The main value is a (non-unique) key:
-       'tag' for a field definition,
        where tag is an integer. Negative numbers are reserved for counted structures.
        By convention, general application data fields should
>       TagUse  use tags
        100 - 999.
-       'tag#len' for a fixed subfield,
        where len is a positive integer
-       'tag#' for an additional variable positional subfield.
        The first variable positional subfield's type, values and xref
        are defined with the main field definition.
-       'tag^i' for a subfield identified by character i
        ('^' is the actual hat character, which is NOT the subfield delimiter;
        the field definition uses tabulators)
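
As an illustration of the four key forms, an application might classify
the main value of a definition field roughly as follows. This is a
sketch only; parse_key and the returned kind names are hypothetical
and not part of any Malete API.

```python
import re

# tag, optionally followed by '#len', '#' or '^i'
_KEY = re.compile(r"(-?\d+)(?:(#)(\d*)|\^(.))?")


def parse_key(key):
    """Classify a definition key; returns (kind, tag, detail)."""
    m = _KEY.fullmatch(key)
    if m is None:
        raise ValueError("not a definition key: %r" % key)
    tag = int(m.group(1))
    if m.group(2):                      # a '#' was present
        if m.group(3):                  # 'tag#len': fixed subfield
            length = int(m.group(3))
            if length < 1:
                raise ValueError("len must be a positive integer")
            return ("fixed", tag, length)
        return ("positional", tag, None)    # 'tag#'
    if m.group(4) is not None:          # 'tag^i': identified subfield
        return ("identified", tag, m.group(4))
    return ("field", tag, None)         # plain 'tag'
```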

All other subfields in the field definition are identified and optional:
-       n       name
        A name by which a field or subfield can be referred to.
        Field names must be unique and subfield names must be unique in their field.
        It is strongly recommended to only use C identifiers,
        i.e. ASCII letters, digits and the underscore, not starting with a digit.
-       d description
        Some textual description suitable for the database users.
-       m min/mandatory
        The sub/field must occur at least as many times as given by this option's
        value (empty=1, absent=0).
-       r repeatable
        The sub/field may occur at most as many times as given by this option's
        value (empty=any, absent=1). A value preceded by '+' (including a single
        '+' for any) implies the mandatory option (at least one occurrence).
-       v value
        Every occurrence of this repeatable option is of the form name=value,
        associating the symbolic name with a legal value for the sub/field.
        The first such value is used as a default where the sub/field is created
        for some reason.
-       t type
        Type of this sub/field; see further below.
        Defaults to any (non-control) characters.
        Applications might support repeated alternative types.
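
The interplay of the m and r options can be made concrete with a small
Python sketch. The function name occurrence_bounds is hypothetical, an
absent option is passed as None, and the way a '+' in r combines with
an explicit m (the larger minimum wins) is an assumption, since the text
above does not spell it out.

```python
def occurrence_bounds(m=None, r=None):
    """Return (minimum, maximum) occurrences; maximum None means unbounded.

    m, r: option values as strings, or None if the option is absent.
    """
    # m: empty=1, absent=0
    if m is None:
        minimum = 0
    elif m == "":
        minimum = 1
    else:
        minimum = int(m)
    # r: empty=any, absent=1
    if r is None:
        maximum = 1
    else:
        if r.startswith("+"):
            # a leading '+' implies at least one occurrence
            minimum = max(minimum, 1)
            r = r[1:]
        maximum = None if r == "" else int(r)
    return minimum, maximum
```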

*       types of subfields

Note that a field's type actually defines the type of its first
positional variable subfield (which is usually the main value).
If there are no subfields defined for a field,
the field's value equals its main value.


A simple type definition consists of a single letter indicating
a character type, optionally followed by some digits giving a repeat count.
Unlike the byte-based length restrictions of fixed length subfields,
the repeat count is measured in characters.

For the terms "alphabetic" and "digit", it's up to the application's
Unicode support to properly check these attributes for non-ASCII characters.
Simple environments may assume any code greater than 127 to be alphabetic.

Basic character types are:
-       c character
        Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
-       a alpha
        Any alphabetic character.
-       d digit
        ASCII digits '0'-'9'.
-       n numeric
        Digits and optional leading minus sign.
-       w word
        Alpha, digits and underscore.

Extended character/byte types, possibly not supported by all environments, are:
-       b bit/boolean
        ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should
        be assumed, but a '1' if it's present and empty.
-       r       raw
        Raw bytes using newline/vertical tab encoding as suggested by the
>       Protocol
-       i integer
        Binary coded fixed point decimal numbers using two decimal digits per byte
        (128-99 .. 128+99) and starting with a byte 144 plus the bytes before
        the decimal point (minus for negative numbers).
        Such integers sort properly, avoid newlines and tabs, and the first byte
        (for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
-       t time
        Date and time as GTF integer. Up to 8 digits before the decimal point
        for date YYYYMMDD, after the decimal point hhmmss...

For all simple type definitions, the same letter may be used uppercase.
With lowercase, the repeat count gives a maximum and defaults to any.
With an uppercase type letter, the repeat count is exact and defaults to 1.


Complex type definitions include the following:
-       = pattern
        Pattern is a sequence of simple type definitions of basic character types.
        E.g. 'A3a6' denotes 3 to 9 alphabetic characters.
        Any special character in pattern denotes itself (typically as separator).
-       ~ regexp
        Depending on the regexp package used.
-       " literal
        Must have one of the values listed with the field's v option.
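
One possible way to check a value against a pattern is to translate
the pattern into a regular expression. The following Python sketch does
so for the basic types c, a, d and w only; the type n is left out because
the text above does not say how its leading minus sign counts towards the
repeat count. The name pattern_to_regex and the chosen character classes
are assumptions of this sketch.

```python
import re

# character classes for the basic types handled by this sketch
CLASSES = {
    "c": r"[^\x00-\x1f]",   # code value >= 32
    "a": r"[^\W\d_]",       # alphabetic, as far as Python's re knows
    "d": "[0-9]",           # ASCII digits
    "w": r"\w",             # alpha, digits and underscore
}


def pattern_to_regex(pattern):
    """Compile a pattern such as 'A3a6' into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        i += 1
        if ch.isascii() and ch.isalpha():
            cls = CLASSES.get(ch.lower())
            if cls is None:
                raise ValueError("unsupported type letter %r" % ch)
            # collect the optional repeat count
            j = i
            while j < len(pattern) and pattern[j] in "0123456789":
                j += 1
            count = pattern[i:j]
            i = j
            if ch.isupper():
                # uppercase: exact count, defaults to 1
                out.append("%s{%s}" % (cls, count or "1"))
            elif count:
                # lowercase: count is a maximum
                out.append("%s{0,%s}" % (cls, count))
            else:
                # lowercase without count: any number
                out.append(cls + "*")
        else:
            # any other character denotes itself
            out.append(re.escape(ch))
    return re.compile("".join(out))
```

A value is legal if the compiled expression matches it completely,
e.g. pattern_to_regex('D4-D2-D2').fullmatch(value) for a date written
with hyphens.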

The field definition of the basic field definition itself is:
$
6       6       nfdt    dfield definition       r       t=Nc
6       6^n     nname   dsub/field name
6       6^d     ndesc   dsub/field description
6       6^m     nmin    dmin number of occurrences      tn
6       6^r     nrep    dmax number of occurrences
6       6^v     nval    dnamed values   r
6       6^t     ntype   dsub/field type
$


*       advanced field definition

There are some advanced field definition options which are probably
not supported by all applications.
Where used, however, the following formats are recommended:
-       b base
        The key or name of another sub/field definition in this metadata record
        from which options (and, for a field, subfield definitions) should be
        used for this entity. Obviously just a convenience feature.
-       x xref
        Definition of some other entity referred to by the value of this sub/field.
        Described elsewhere.
-       s structure a.k.a. subrecord
        The field introduces a structure in the record; see further below.
-       c child
        This repeatable option specifies a tag or name of a legal child field.
        Applications might support this being followed by '[:min][-max]'
        to specify a min and/or max count of occurrences of this child,
        or one of the characters '+' (at least once), '?' (at most once),
        '!' (exactly once) or '*' (any number of times, default).
        In the definition of those children, r0 may be used to indicate that they
        should not occur in the record except where explicitly listed as a legal child.

*       structures

The structure option indicates that a field is the header of a structure,
indicating that some fields following it in the record somehow belong to it.
("A group of fields within a record that may be treated as a logical entity.
(When a record describes more than one entity, the descriptions of individual
entities may be treated as subrecords.)", Z39.2).


While in general there are a couple of ways to mark a sequence of fields
as logically being one entity, there are three methods supported by
the field definition:
-       counted structures
        If the s option's value is empty,
        the field's tag is the negative number of fields belonging to the
        structure, including the header. This is the means used by the
>       Protocol
        to efficiently and transparently embed any records in messages.
        Obviously counted structures cannot be accessed by their tag.
        They are defined as some negative tags.
        Some known format of their main value (especially a literal)
        may be used to access them by key.
-       delimited structures
        If the s option's value is '+', the field has one additional initial
        subfield of fixed length 1. For a given occurrence of this field,
        this subfield must either contain '-', indicating that there are
        no children, be absent (i.e. the field is completely empty),
        or contain '+', indicating that everything up to a matching
        empty field of the same tag makes up the structure's children.
-       fixed structures
        If the s option's value is a number, the structure has exactly as
        many children as given by this number. Note that the number of fields
        may be greater if the children are structures themselves. Rarely used.

Note that while the field definition in general does not specify
the ordering of fields, the children of a structure are always
a consecutive range according to the structure's definition.
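
For counted structures, grouping a flat list of fields might look like
the following Python sketch. The function name group_counted and the
(tag, value, children) result shape are made up for illustration, and
it is assumed that the count of an outer structure includes all fields
of any structures nested inside it.

```python
def group_counted(fields):
    """Group a flat list of (tag, value) pairs by counted structures.

    Returns a list of (tag, value, children) triples; children is
    empty for ordinary fields.
    """
    result = []
    i = 0
    while i < len(fields):
        tag, value = fields[i]
        if tag < 0:
            # the negative tag counts the header plus its members
            n = -tag
            members = fields[i + 1:i + n]
            if len(members) != n - 1:
                raise ValueError("counted structure is truncated")
            result.append((tag, value, group_counted(members)))
            i += n
        else:
            result.append((tag, value, []))
            i += 1
    return result
```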


Z39.2 reserves control field 002 for "subrecord purposes",
e.g. listing the offsets of such "groups of fields".


*       recommendations

-       fixed subfields should contain only bytes 32 to 126, inclusive
-       if delimited structures are used, they should be used consistently,
        i.e. all fields (but 0) should have that type
-       fixed structures should only be used for internal purposes

*       examples

The headers of email or other MIME messages like
$
Subject: hi there
Content-Type: text/plain; charset="iso8859-1"
$
using a field definition of
$
6       10      nsubject
6       11      ncontent-type
6       11^c    ncharset
$
map to
$
10      hi there
11      text/plain      ciso8859-1
$
Value options could be used to encode common values like text/plain.
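
The mapping above could be produced by a small converter along the
following lines. This Python sketch is not a real MIME parser: it splits
parameters naively at ';' (so quoted semicolons are not handled), and the
names header_to_field, TAGS and SUBFIELD_IDS are hypothetical stand-ins
for what an application would read from the field definition.

```python
# hypothetical lookup tables mirroring the field definition above
TAGS = {"subject": 10, "content-type": 11}
SUBFIELD_IDS = {11: {"charset": "c"}}


def header_to_field(line, tags=TAGS, subfield_ids=SUBFIELD_IDS):
    """Convert one header line into 'tag<TAB>main value<TAB>subfields...'."""
    name, _, body = line.partition(":")
    tag = tags[name.strip().lower()]
    ids = subfield_ids.get(tag, {})
    if not ids:
        # no identified subfields defined: keep the whole body as main value
        return "%d\t%s" % (tag, body.strip())
    main, *params = [p.strip() for p in body.split(";")]
    out = [str(tag), main]
    for param in params:
        key, _, val = param.partition("=")
        # identifier character followed by the unquoted parameter value
        out.append(ids[key.strip().lower()] + val.strip().strip('"'))
    return "\t".join(out)
```

A parameter without a defined subfield raises a KeyError in this sketch;
a real converter would have to decide what to do with such data.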


Using delimited structures, a typical HTML table definition starting with
$
<table width="100%" cellpadding="0" cellspacing="0"
  marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
$
using
$
6       100     ntable  s+
6       100^w   nwidth
...
6       101     ntr
...
$
will be compacted to
$
100     +       w100%   p0      s0      m0      h0      t0      l0      b0
101     +
102     +       vtop    w160
0       this is the textbody
103     -
0       of the td node
102
101
...
$
which could save half of the internet's bandwidth.
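
Reading such a record back into a tree is mostly a matter of keeping a
stack of open structures. The Python sketch below assumes that all
fields except tag 0 are delimited structures (as recommended above),
that fields are given as (tag, value) pairs, and that an empty field
closes the innermost open structure if it has the same tag. The name
build_tree and the dict layout are made up for this illustration.

```python
def build_tree(fields):
    """Turn a flat list of (tag, value) pairs into a nested tree."""
    root = {"tag": None, "value": "", "children": []}
    stack = [root]
    for tag, value in fields:
        if value == "" and len(stack) > 1 and stack[-1]["tag"] == tag:
            # empty field of the same tag: end of the open structure
            stack.pop()
            continue
        node = {"tag": tag, "value": value, "children": []}
        stack[-1]["children"].append(node)
        if tag != 0 and value[:1] == "+":
            # '+' in the initial fixed subfield: children follow
            stack.append(node)
    return root
```

Structures still open at the end of the list (like field 100 in the
elided example above) simply keep the children collected so far.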

Some strict XML parsers limit a node to at most one textnode child,
which then should be stored in the node's main value.


*       conformance

Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
Malete records. Subfield identifiers in Z39.2 can use more than one
character; MARC, however, always uses one.
Initial fixed subfields are dubbed "indicators" by Z39.2;
MARC uses two of length 1. They are not considered "data elements",
as other subfields are. Here, fixed subfields are considered less special.


MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
in a straightforward manner after a parser has resolved entities and the like.


---
        $Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $