OpenIsis/Malete field definition and record structures


* overview

A Malete record is a sequence of one or more fields.
The first one is called the header; all others are identified by a numeric tag.

As far as the Malete database core is concerned,
a field may contain arbitrary bytes except newline characters.
Assuming anything about the structure of field data,
including any encoding of binary data,
is solely at the application's discretion.

As Malete is designed to be a multi-purpose database engine,
no particular schema is enforced.
However, there is a schema suggested and used by the OpenIsis application.
In the database's
> MetaData metadata record,
fields with tag (00)6 are reserved for this purpose (abuse at your own risk).


The rationale of this field definition is to provide enough flexibility
to efficiently support representations of all structures found in Z39.2
based systems (including but transcending the traditional CDS/ISIS software),
especially the various MARC formats, as well as full representations of
data commonly stored and transmitted in other formats such as
MIME and XML.

The term "representation" means that Malete will not bother to
directly support XML's angle brackets, XML's/MIME's foo="bar" options,
or the subfield delimiter characters of MARC or CDS/ISIS.
Rather, for any such data there should be a lossless transformation to an
efficient representation in some format described by this field definition.


* structure of fields

While fields may be used to hold a single value,
it is a common technique to treat them as a sequence of subfields
("A data element considered as a component of a field.", Z39.2).

A field may contain, in this order:
- 0 or more positional subfields of fixed length
- 1 or more positional subfields of variable length
- 0 or more identified subfields of variable length

Fixed length subfields end after as many bytes (not characters!) as given by
their length. They are typically used for data coded as ASCII values.
Neither multi-byte UTF-8 characters nor the delimiter character should be stored
in fixed length subfields (however, it's up to the application to exercise care).

Variable length subfields end at a delimiter character or at the end of the field.
Malete by default uses a tabulator as delimiter,
and import of CDS/ISIS databases converts the caret (hat '^') to tabs;
however, applications are free to use any delimiter they want.


Positional subfields are identified by their position within the field,
i.e. by counting that many bytes and delimiters.
Of course, there is only one nth position within a field,
i.e. every positional subfield can occur at most once.
Since the first n bytes and first m delimited subfields are used as the
positional subfields, they may be omitted only if end of field is seen,
i.e. if all following subfields are omitted as well.

Identified subfields, on the other hand, start with a single character
identifying the subfield, just like fields in a record are identified by a tag.
Applications unaware of UTF-8 may demand a single byte as identifier.
Where portability is an issue, only ASCII letters and digits should be used.
Since there is at least one positional variable subfield,
identified subfields always start after a delimiter (in accordance with Z39.2).
An identified subfield may occur zero, one or more times in a field.
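
The splitting rules above can be sketched in a few lines of Python.
This is only an illustration, not Malete code: the function name split_field
and its parameters are made up here, the delimiter is assumed to be the default
tabulator, and identifiers are assumed to be single bytes.

```python
def split_field(value, fixed_lens=(), n_positional=1, delim=b"\t"):
    """Split raw field bytes into (fixed, positional, identified) subfields."""
    fixed, pos = [], 0
    for length in fixed_lens:
        # fixed subfields are cut by byte count, not by character count
        fixed.append(value[pos:pos + length])
        pos += length
    # everything after the fixed part is delimiter-separated
    parts = value[pos:].split(delim)
    # the first n_positional parts are positional (at least one by definition)
    positional = parts[:n_positional]
    # the remaining parts start with a one-byte identifier and may repeat
    identified = [(p[:1], p[1:]) for p in parts[n_positional:] if p]
    return fixed, positional, identified


# MARC-like field: two indicators of length 1, a main part, three subfields
fixed, positional, identified = split_field(
    b"10main\tafoo\tbbar\tabaz", fixed_lens=(1, 1))
```

Keeping the identified subfields as a list of pairs (rather than a dictionary)
preserves both their order and repeated identifiers such as 'a' above.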


The MAIN VALUE of a field contains the fixed length subfields together with
the first positional variable subfield. Sloppy applications may use anything
up to the first delimiter, assuming that fixed subfields do not contain it.
In the common situation of having no fixed length subfields,
the main value equals the first positional subfield.
The main value of a field is very similar to a record's header
and is commonly used as a key to select a field in a record.
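
As a rough illustration of the difference between the strict and the "sloppy"
reading of the main value, consider the following Python sketch. All names
(main_value, sloppy_main_value, select_fields) are hypothetical, a record is
assumed to be a list of (tag, value) pairs, and the tabulator is assumed as delimiter.

```python
def main_value(value, fixed_lens=(), delim=b"\t"):
    """Fixed subfields plus the first positional variable subfield."""
    start = sum(fixed_lens)            # skip the fixed part, counted in bytes
    end = value.find(delim, start)     # first delimiter after the fixed part
    return value if end < 0 else value[:end]


def sloppy_main_value(value, delim=b"\t"):
    """Everything up to the first delimiter; wrong if a fixed subfield holds one."""
    return value.split(delim, 1)[0]


def select_fields(record, tag, key):
    """Select the values of all fields with the given tag and main value."""
    return [v for t, v in record if t == tag and sloppy_main_value(v) == key]


record = [(11, b"text/plain\tciso8859-1"), (11, b"text/html")]
matches = select_fields(record, 11, b"text/html")
```

The two functions only disagree when a fixed subfield contains the delimiter,
which is exactly the case the recommendations advise against.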


The properties of subfields stated so far are consequences of their very
definition. Additional properties, e.g. the main value being empty
or an identified subfield having a fixed length and/or occurring exactly once,
may be demanded by the field definition.
It is the application's responsibility to make sure records do not violate
the field definition; the Malete server will happily store whatever it receives.


* definition of fields

The field definition uses fields of the metadata record,
one per field and one per subfield.
These fields themselves do not use fixed length subfields.
The main value is a (non-unique) key:
- 'tag' for a field definition,
where tag is an integer. Negative numbers are reserved for counted structures.
By convention, general application data fields should
> TagUse use tags
100 - 999.
- 'tag#len' for a fixed subfield,
where len is a positive integer
- 'tag#' for an additional variable positional subfield.
The first variable positional subfield's type, values and xref
are defined with the main field definition.
- 'tag^i' for a subfield identified by character i
('^' is the actual hat character, which is NOT the subfield delimiter;
the field definition uses tabulators)
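
The four key forms can be told apart with a single regular expression.
The following Python sketch is an assumption about how an application might
do this; the function parse_def_key and the kind names it returns are
invented for illustration.

```python
import re

# tag, optionally followed by '#', '#len' or '^i'
_KEY = re.compile(r"(-?\d+)(?:(#)(\d*)|\^(.))?", re.S)


def parse_def_key(key):
    """Classify a definition key as (kind, tag, detail)."""
    m = _KEY.fullmatch(key)
    if m is None:
        raise ValueError("not a definition key: %r" % key)
    tag = int(m.group(1))
    if m.group(2):                       # 'tag#' or 'tag#len'
        if m.group(3) == "":
            return ("positional", tag, None)
        length = int(m.group(3))
        if length <= 0:                  # len must be a positive integer
            raise ValueError("fixed length must be positive: %r" % key)
        return ("fixed", tag, length)
    if m.group(4) is not None:           # 'tag^i'
        return ("identified", tag, m.group(4))
    return ("field", tag, None)


kinds = [parse_def_key(k) for k in ("6", "100#2", "100#", "6^n")]
```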

All other subfields in the field definition are identified and optional:
- n name
A name by which a field or subfield can be referred to.
Field names must be unique and subfield names must be unique within their field.
It is strongly recommended to use only C identifiers,
i.e. ASCII letters, digits and the underscore, not starting with a digit.
- d description
Some textual description suitable for the database users.
- m min/mandatory
The sub/field must occur at least as many times as given by this option's
value (empty=1, absent=0).
- r repeatable
The sub/field may occur at most as many times as given by this option's
value (empty=any, absent=1). A value preceded by '+' (including a single
'+' for any) implies the mandatory option (at least one occurrence).
- v value
Every occurrence of this repeatable option is of the form name=value,
associating the symbolic name with a legal value for the sub/field.
The first such value is used as a default where the sub/field is created
for some reason.
- t type
Type of this sub/field; see further below.
Defaults to any (non-control) characters.
Applications might support repeated alternative types.
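
The interplay of the m and r options is easy to get wrong, so here is a small
Python sketch of one possible reading. The function occurrence_bounds is
hypothetical; it assumes the options of one definition have been collected
into a dictionary mapping the option letter to its (possibly empty) value,
with absent options simply missing from the dictionary.

```python
def occurrence_bounds(options):
    """Return (min, max) occurrences; max None means any number."""
    lo = 0                                   # m absent: not mandatory
    if "m" in options:
        lo = int(options["m"]) if options["m"] else 1
    hi = 1                                   # r absent: not repeatable
    if "r" in options:
        r = options["r"]
        if r.startswith("+"):                # '+' implies at least one occurrence
            lo = max(lo, 1)
            r = r[1:]
        hi = int(r) if r else None           # empty value: any number
    return lo, hi


bounds = occurrence_bounds({"r": "+3"})
```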

* types of subfields

Note that a field's type actually defines the type of its first
positional variable subfield (which is usually the main value).
If there are no subfields defined for a field,
the field's value equals its main value.


A simple type definition consists of a single letter indicating
a character type, optionally followed by some digits giving a repeat count.
Unlike the byte-based length restrictions of fixed length subfields,
the repeat count should be assumed to be in terms of characters.

For the terms "alphabetic" and "digit", it's up to the application's
Unicode support to properly check these attributes for non-ASCII characters.
Simple environments may assume any code greater than 127 to be alphabetic.

Basic character types are:
- c character
Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
- a alpha
Any alphabetic character.
- d digit
ASCII digits '0'-'9'.
- n numeric
Digits and an optional leading minus sign.
- w word
Alpha, digits and underscore.

Extended character/byte types, possibly not supported by all environments, are:
- b bit/boolean
ASCII digit '0' or '1'. If a subfield of this type is absent, '0' should
be assumed; if it is present but empty, '1' should be assumed.
- r raw
Raw bytes using newline/vertical tab encoding as suggested by the
> Protocol
- i integer
Binary coded fixed point decimal numbers using two decimal digits per byte
(128-99 .. 128+99) and starting with a byte 144 plus the number of bytes before
the decimal point (minus for negative numbers).
Such integers sort properly, avoid newlines and tabs, and the first byte
(for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
- t time
Date and time as GTF integer. Up to 8 digits before the decimal point
for date YYYYMMDD, after the decimal point hhmmss...

For all simple type definitions, the same letter may be used uppercase.
With a lowercase type letter, the repeat count gives a maximum and defaults to any.
With an uppercase type letter, the repeat count is exact and defaults to 1.


Complex type definitions include the following:
- = pattern
Pattern is a sequence of simple type definitions of basic character types.
E.g. 'A3a6' denotes 3 to 9 alphabetic characters.
Any special character in pattern denotes itself (typically as separator).
- ~ regexp
Depending on the regexp package used.
- " literal
Must have one of the values listed with the field's v option.
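
One way to check values against a pattern is to translate it into a regular
expression. The Python sketch below does this for the basic character types
only. It is not taken from any Malete implementation: pattern_to_regex and
the chosen character classes are assumptions, and for the n type the optional
minus sign is assumed not to count towards the repeat count.

```python
import re

# character classes assumed for the basic types (Unicode-aware in Python 3)
_CLASSES = {
    "c": r"[^\x00-\x1f]",   # anything but C0 controls
    "a": r"[^\W\d_]",       # alphabetic
    "d": r"[0-9]",          # ASCII digit
    "n": r"[0-9]",          # digits, minus sign handled below
    "w": r"\w",             # alpha, digits, underscore
}


def pattern_to_regex(pattern):
    """Compile the part after '=' of a pattern type into a regular expression."""
    out = []
    for letter, count, special in re.findall(r"([A-Za-z])(\d*)|(.)", pattern, re.S):
        if special:                       # special characters denote themselves
            out.append(re.escape(special))
            continue
        cls = _CLASSES.get(letter.lower())
        if cls is None:
            raise ValueError("unknown basic type %r" % letter)
        prefix = "-?" if letter.lower() == "n" else ""
        if letter.isupper():              # exact count, default 1
            quant = "{%d}" % int(count or 1)
        else:                             # maximum count, default any
            quant = "{0,%s}" % count if count else "*"
        out.append(prefix + cls + quant)
    return re.compile("".join(out))


ok = bool(pattern_to_regex("A3a6").fullmatch("abcdef"))
```

With this reading, a pattern like 'D4-D2-D2' would accept a date written as 2004-07-26.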

The field definition of the basic field definition itself is:
$
6	6	nfdt	dfield definition	r	t=Nc
6	6^n	nname	dsub/field name
6	6^d	ndesc	dsub/field description
6	6^m	nmin	dmin number of occurrences	tn
6	6^r	nrep	dmax number of occurrences
6	6^v	nval	dnamed values	r
6	6^t	ntype	dsub/field type
$
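
To show how such definition lines could be read back, here is a minimal Python
sketch. It assumes that each line consists of the tag 6, the key and the
identified options, all separated by tabulators; the function name
parse_definition and the dictionary layout are made up for this example.

```python
def parse_definition(line, delim="\t"):
    """Turn one field definition line into (key, options)."""
    tag, key, *subfields = line.split(delim)
    if int(tag) != 6:                    # accepts both '6' and '006'
        raise ValueError("not a field definition: %r" % line)
    options = {}
    for sf in subfields:
        if sf:
            # first character identifies the option, options may repeat
            options.setdefault(sf[0], []).append(sf[1:])
    return key, options


key, options = parse_definition("6\t6\tnfdt\tdfield definition\tr\tt=Nc")
```

An empty list entry, as for the r option here, represents an option that is
present with an empty value, which differs from the option being absent.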


* advanced field definition

There are some advanced field definition options which are probably
not supported by all applications.
Where used, however, the following formats are recommended:
- b base
The key or name of another sub/field definition in this metadata record
from which options (and, for a field, subfield definitions) should be
used for this entity. Obviously just a convenience feature.
- x xref
Definition of some other entity referred to by the value of this sub/field.
Described elsewhere.
- s structure a.k.a. subrecord
The field introduces a structure in the record; see further below.
- c child
This repeatable option specifies a tag or name of a legal child field.
Applications might support this being followed by '[:min][-max]'
to specify a minimum and/or maximum count of occurrences of this child,
or by one of the characters '+' (at least once), '?' (at most once),
'!' (exactly once) or '*' (any number of times, default).
In the definition of those children, r0 may be used to indicate that they
should not occur in the record except where explicitly listed as a legal child.
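
The optional count suffix of the c option leaves some room for interpretation.
The Python sketch below shows one possible reading; the function parse_child
is hypothetical, and it assumes that child names are C identifiers (so that a
'-' always starts the maximum count).

```python
import re

_CHILD = re.compile(
    r"(-?\d+|[A-Za-z_]\w*)(?:([+?!*])|(?::(\d+))?(?:-(\d+))?)", re.ASCII)
_SHORT = {"+": (1, None), "?": (0, 1), "!": (1, 1), "*": (0, None)}


def parse_child(spec):
    """Return (child, min, max) for a c option value; max None means any."""
    m = _CHILD.fullmatch(spec)
    if m is None:
        raise ValueError("bad child specification: %r" % spec)
    child, short, lo, hi = m.groups()
    if short:                             # one of '+', '?', '!', '*'
        return (child,) + _SHORT[short]
    # '[:min][-max]', both parts optional; nothing at all means '*'
    return (child, int(lo) if lo else 0, int(hi) if hi else None)


spec = parse_child("td:1-3")
```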

* structures

The structure option indicates that a field is the header of a structure,
i.e. that some fields following it in the record somehow belong to it
("A group of fields within a record that may be treated as a logical entity.
(When a record describes more than one entity, the descriptions of individual
entities may be treated as subrecords.)", Z39.2).


While in general there are a couple of ways to mark a sequence of fields
as logically being one entity, there are three methods supported by
the field definition:
- counted structures
If the s option's value is empty,
the field's tag is the negative number of fields belonging to the
structure, including the header. This is the means used by the
> Protocol
to efficiently and transparently embed any records in messages.
Obviously counted structures cannot be accessed by their tag.
In the field definition, they are defined using some negative tag.
Some known format of their main value (especially a literal)
may be used to access them by key.
- delimited structures
If the s option's value is '+', the field has one additional initial
subfield of fixed length 1. For a given occurrence of this field,
this subfield must either contain '-', indicating that there are
no children, be absent (i.e. the field is completely empty),
or contain '+', indicating that everything up to a matching
empty field of the same tag is a child of the structure.
- fixed structures
If the s option's value is a number, the structure has exactly as
many children as given by this number. Note that the number of fields
may be greater if the children are structures themselves. Rarely used.

Note that while the field definition in general does not specify
the ordering of fields, the children of a structure are always
a consecutive range according to the structure's definition.
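
Counted structures are the simplest of the three to process, since the header
tag directly tells how many fields to take. The following Python sketch groups
them; the function group_counted and the dictionary it produces are
illustrative assumptions, and a record is again taken to be a list of
(tag, value) pairs in which every negative tag starts a counted structure.

```python
def group_counted(fields):
    """Group the members of counted structures under their header."""
    grouped, i = [], 0
    while i < len(fields):
        tag, value = fields[i]
        if tag < 0:
            size = -tag                          # fields in the structure, header included
            members = fields[i + 1:i + size]     # consecutive range after the header
            grouped.append({"header": value,
                            "fields": group_counted(members)})
            i += size
        else:
            grouped.append((tag, value))
            i += 1
    return grouped


record = [(1, "a"), (-3, "rec"), (10, "x"), (11, "y"), (2, "b")]
grouped = group_counted(record)
```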


Z39.2 reserves control field 002 for "subrecord purposes",
e.g. listing the offsets of such "groups of fields".


* recommendations

- fixed subfields should contain only bytes 32 to 126, inclusive
- if delimited structures are used, they should be used consistently,
i.e. all fields (but 0) should have that type
- fixed structures should only be used for internal purposes

* examples

The headers of email or other MIME messages like
$
Subject: hi there
Content-Type: text/plain; charset="iso8859-1"
$
using a field definition of
$
6	10	nsubject
6	11	ncontent-type
6	11^c	ncharset
$
map to
$
10	hi there
11	text/plain	ciso8859-1
$
Value options could be used to encode common values like text/plain.
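
The mapping of this MIME example can be sketched in Python as follows.
The lookup tables simply restate the field definition above; the function
header_to_field and the set STRUCTURED are hypothetical, and the parameter
parsing is deliberately naive (it does not handle ';' inside quoted values).

```python
NAME_TO_TAG = {"subject": 10, "content-type": 11}   # from the n options above
PARAM_TO_ID = {"charset": "c"}                      # from the 11^c definition
STRUCTURED = {11}                                   # tags whose value has parameters


def header_to_field(line):
    """Map one MIME header line to a 'tag<TAB>value' field line."""
    name, _, body = line.partition(":")
    tag = NAME_TO_TAG[name.strip().lower()]
    body = body.strip()
    if tag not in STRUCTURED:
        return "%d\t%s" % (tag, body)
    main, *params = [p.strip() for p in body.split(";")]
    subfields = [main]                              # main value comes first
    for param in params:
        key, _, val = param.partition("=")
        ident = PARAM_TO_ID[key.strip().lower()]
        subfields.append(ident + val.strip().strip('"'))
    return "%d\t%s" % (tag, "\t".join(subfields))


fields = [header_to_field(h) for h in (
    "Subject: hi there",
    'Content-Type: text/plain; charset="iso8859-1"')]
```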


Using delimited structures, a typical HTML table definition starting with
$
<table width="100%" cellpadding="0" cellspacing="0"
marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
$
using
$
6	100	ntable	s+
6	100^w	nwidth
...
6	101	ntr
...
$
will be compacted to
$
100	+	w100%	p0	s0	m0	h0	t0	l0	b0
101	+
102	+	vtop	w160
0	this is the textbody
103	-
0	of the td node
102
101
...
$
which could save half of the internet's bandwidth.
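
Reading such a record back into a tree only needs a stack. The Python sketch
below is one way to do it, under several assumptions: the function build_tree
and its node dictionaries are invented here, fields are given as (tag, value)
pairs with the tag already stripped, and tag 0 is treated as a plain text
field without the leading '+'/'-' subfield.

```python
def build_tree(fields, plain_tags=(0,)):
    """Rebuild the nesting of delimited structures as a tree of dicts."""
    root = {"tag": None, "value": "", "children": []}
    stack = [root]
    for tag, value in fields:
        if tag in plain_tags:                 # ordinary field, no marker byte
            stack[-1]["children"].append(
                {"tag": tag, "value": value, "children": []})
        elif value == "":                     # empty field closes the open structure
            if stack[-1]["tag"] != tag:
                raise ValueError("unmatched end of structure %r" % tag)
            stack.pop()
        else:
            node = {"tag": tag, "value": value[1:], "children": []}
            stack[-1]["children"].append(node)
            if value[0] == "+":               # children follow
                stack.append(node)
            elif value[0] != "-":             # '-' means no children
                raise ValueError("bad structure marker in field %r" % tag)
    return root                               # still open structures are tolerated


tree = build_tree([
    (100, "+\tw100%\tp0"), (101, "+"), (102, "+\tvtop\tw160"),
    (0, "this is the textbody"), (103, "-"), (0, "of the td node"),
    (102, ""), (101, ""), (100, ""),
])
```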

Some strict XML parsers limit a node to at most one textnode child,
which then should be stored in the node's main value.


* conformance

Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
Malete records. Subfield identifiers in Z39.2 can use more than one
character; MARC, however, always uses one.
Initial fixed subfields are dubbed "indicators" by Z39.2;
MARC uses two of length 1. They are not considered "data elements",
as other subfields are. Here, fixed subfields are considered less special.


MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
in a straightforward manner after a parser has resolved entities and the like.


---
$Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $