structuring ISIS records using subfields or subrecords


* structures

The means by which an Isis record can be structured into "data elements"
("A defined unit of information", Z39.2 a.k.a. ISO 2709)
fall into one of two broad categories (citing Z39.2):
- subfields
"A data element considered as a component of a field."
In *ML (SGML, HTML, XML...), subfields correspond to a node's attributes.
In MIME, subfields correspond to attributes of a MIME header value.
- subrecords
"A group of fields within a record that may be treated as a logical entity.
(When a record describes more than one entity,
the descriptions of individual entities may be treated as subrecords.)"
In *ML, subrecords correspond to a node's children.
In MIME, subrecords correspond to multipart body parts.


* subfields

Since a field value can actually be anything,
including XML text or a serialized (textual or binary) Isis record,
it can be arbitrarily structured according to a regular expression
or some other grammar (machine parseable or not).


The term subfield, however, is used for a range of characters in the value
which is identified by rather simple means:
- fixed
if all (or all but the last) subfields have a fixed length
and are neither optional nor repeatable,
then each subfield can be found at a fixed position.
- delimited with optional identifier
this is the proper Z39.2 notion of a subfield.

If a special delimiter character is found in the field,
it breaks the field into subfields.
Z39.2, and thus MARC, use character 31 as delimiter
(hex 1F, CTRL-_, ASCII "unit separator" US).
Traditional Isis uses the caret '^'.

OpenIsis permits any character, including the horizontal TAB and semicolon.
More precisely, OpenIsis replaces Z39.2's notion that
"every subfield is INTRODUCED by a delimiter, unless it isn't"
with the principle that for every data element, it is specified
how its end is detected, including by fixed length or varying delimiters.


The initial n characters of a subfield are used to identify the subfield.
Z39.2 permits any (small) fixed value for n, including 0, i.e. not identified.
The MARC family of standards uses n=1.
OpenIsis allows for any value, including variable length identifiers,
which are themselves delimited by some character like a '=' (see below).


Z39.2 states that if identifiers are used, each must be preceded
by a delimiter, and every data element, including the first,
must be identified that way. However, an initial range of m characters
(i.e. preceding the first delimiter) in every field may serve as "indicator",
which is not regarded as a "data element". Again, m is a small fixed number;
MARC uses m=2. Traditional Isis has no special support for indicators.
OpenIsis allows access to whatever is before the first delimiter.
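
The delimiter, identifier and indicator rules described above can be sketched in a few lines of Python. This is only an illustration, not OpenIsis code: the function name split_subfields and its parameters delim and id_len are made up for this example.

```python
def split_subfields(value, delim="^", id_len=1):
    """Split a field value into (indicator, [(identifier, data), ...]).

    Everything before the first delimiter is returned as the indicator.
    Each following chunk starts with an identifier of id_len characters
    (id_len=0 means the subfields are not identified).
    """
    indicator, *chunks = value.split(delim)
    subfields = [(chunk[:id_len], chunk[id_len:]) for chunk in chunks]
    return indicator, subfields


# Traditional Isis style: caret delimiter, one-character identifiers.
print(split_subfields("^aParis^bUnesco^c1965"))
# MARC style: two indicator characters, then US (hex 1F) delimited subfields.
print(split_subfields("10\x1faTitle\x1fbSubtitle", delim="\x1f"))
```

With id_len=0 the same function yields unidentified subfields, which covers plain semicolon or TAB separated values.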


Different subfielding methods can be mixed or nested.
Typical cases are:
- mixed fixed/delimited
After some initial fixed subfields, the following subfields are delimited.
This can be used to describe MARC's fixed indicators.
- nested delimited/fixed
A delimited subfield itself has a fixed substructure.
Actually the leading identifier in a subfield can be regarded
as the fixed part in a mixed substructure.
- nested unidentified delimited
A delimited subfield itself has a delimited structure.
This can be used to model variable length identifiers.
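
As a rough sketch of two of these cases, the following Python cuts off leading fixed-width subfields and then treats each delimited chunk as a nested "key=data" value. The helper names split_fixed and split_keyed, and the sample field, are assumptions made for this illustration only.

```python
def split_fixed(value, widths):
    """Cut leading fixed-width subfields; return them plus the remainder."""
    parts, pos = [], 0
    for width in widths:
        parts.append(value[pos:pos + width])
        pos += width
    return parts, value[pos:]


def split_keyed(value, delim="^", key_delim="="):
    """Nested unidentified delimited: each chunk is itself 'key=data'."""
    pairs = []
    for chunk in value.split(delim):
        if not chunk:
            continue  # skip the empty chunk before a leading delimiter
        key, _, data = chunk.partition(key_delim)
        pairs.append((key, data))
    return pairs


# mixed fixed/delimited: two one-character indicators, then keyed subfields
indicators, rest = split_fixed("10^title=Hamlet^lang=en", [1, 1])
print(indicators, split_keyed(rest))
```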

In other words, identifiers are themselves nothing but subfields
used as keys on some level of nesting.
On the other hand, any subfield could serve as a key for its parent.
This is used e.g. to select a field by a subfield indicating a language
(see below for keyed subrecords).

If you look at the
> Serialized plaintext representation of an Isis record,
actually the whole record is a newline delimited value,
the whole database is a blankline (double newline) delimited value
and each field has its tag as initial tab-delimited subfield.
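
Read that way, a plaintext database can be taken apart with nothing but nested splitting. The sketch below assumes the simplest possible case (no escaping of newlines or tabs inside values, which the real Serialized format may well need); parse_database and the sample text are hypothetical.

```python
def parse_database(text):
    """Split plaintext into records, each a list of (tag, value) fields."""
    records = []
    for block in text.split("\n\n"):              # a blank line ends a record
        fields = []
        for line in block.split("\n"):            # a newline ends a field
            if not line:
                continue
            tag, _, value = line.partition("\t")  # tag is the first tab-delimited subfield
            fields.append((int(tag), value))
        if fields:
            records.append(fields)
    return records


sample = "24\tHydrological achievements\n70\tSmith, J.\n\n24\tSecond title\n"
print(parse_database(sample))
```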


In the future, OpenIsis will add support for a wide variety of
subfielding techniques, such as those defined by regular expressions
or MIME headers, or those produced in typical
"character/comma separated values" files (optionally using quotes).
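
Until then, an application can do such splitting itself. As a sketch, Python's standard csv module handles one field value with semicolon separated, optionally quoted subfields; the function name split_csv_field and the sample value are invented for this example and say nothing about a future OpenIsis interface.

```python
import csv


def split_csv_field(value, delim=";", quote='"'):
    """Split one field value into unidentified subfields, honoring quotes."""
    return next(csv.reader([value], delimiter=delim, quotechar=quote))


print(split_csv_field('Paris;"Unesco; Division of Water Sciences";1965'))
```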

Since splitting subfields is mostly, and can always be, done on the
application level (i.e. a database server rarely needs to care),
"support" essentially boils down to the definition of appropriate metadata.


* subrecords

A subrecord consists of a typically contiguous range of fields within a record,
started by some field to introduce the subrecord.
Some variants, however, like keyed subrecords,
can be freely scattered and don't need a "header" field.


There are basically four ways to denote the boundaries of structures:
- embraced
where a special field is used to denote the structure's end.
This resembles SGML-style notations,
where each opening tag is matched by a closing tag.
This is relatively easy and recommended for everyday use.
- marked
where the fields of the child structure are marked as such.
This is more or less the opposite of embracing.
Marking comes in several powerful flavours,
see below for a more detailed discussion.
- counted
where the number of fields (not children) belonging to the
structure is given in (any leading digits of) the initial field.
This allows for safe embedding regardless of the
structure's contents and is thus used in contexts where
full generality is needed, like when embedding result records
within a server's response.
- implicit
where the number of children is fixed.
An example of this is the parse tree of a query,
where the structure "AND" has exactly two children
(which in turn might be structures).
This is used mostly for internal structures like parsed
queries or formats, which are not meant to be exchanged.

The field introducing a subrecord might have any subfields
just like other fields, similar to the attributes that might
be assigned to a tag in SGML applications like HTML.

However, the first subfield (unidentified initial characters)
of a field opening an embraced or counted subrecord is reserved as indicator:
- a plus sign '+' as first character
indicates explicitly opening a subrecord
- a minus sign '-' as first character
indicates an empty subrecord (containing no children)
- an empty value
indicates explicitly closing a subrecord
(similar to the closing blank line used in several protocols)
- an initial numeric value
(of decimal digits) gives the number of fields to follow.
- an initial character @A-Z
gives the number of children to follow (@=0,A=1,B=2...) (rarely used)
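
The indicator rules listed above translate into a small classifier. This is a sketch only: the function name classify_indicator and its return values are invented here, and it assumes that a non-empty value with an empty indicator is not a subrecord header.

```python
DIGITS = "0123456789"


def classify_indicator(value, delim="^"):
    """Return (kind, count) for the value of a subrecord header field.

    kind is one of "open", "empty", "close", "fields", "children";
    count is None unless the indicator carries a number.
    """
    if value == "":
        return ("close", None)              # empty value closes a subrecord
    indicator = value.split(delim, 1)[0]    # text before the first delimiter
    first = indicator[:1]
    if first == "+":
        return ("open", None)
    if first == "-":
        return ("empty", 0)
    if first and first in DIGITS:           # count of fields: any leading digits
        digits = ""
        for ch in indicator:
            if ch not in DIGITS:
                break
            digits += ch
        return ("fields", int(digits))
    if first and "@" <= first <= "Z":       # count of children: @=0, A=1, B=2...
        return ("children", ord(first) - ord("@"))
    raise ValueError("not a subrecord indicator: %r" % (value,))


for sample in ["+^vtop^w160", "-", "", "6\tcds.47", "B"]:
    print(repr(sample), classify_indicator(sample))
```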

Auxiliary information about the child,
like an embedded record's row number and type,
is stored in subfields of the parent.


* conventions

While the intended usage of subrecords might be specified in
more detail in the
> Meta table metadata
, the schema can also be used standalone (without referring to metadata),
if some conventions on tag ranges are followed.

The extent of subrecords delimited by length or braces can be safely
determined if you just know that you want the given field
to be regarded as a subrecord.

For subrecords with a fixed number of children (meant for internal use),
it is necessary to recognize whether a following field is itself a structure.
If they are used at all, the tag range -1..-99 should be reserved for this
purpose.

In this context, typically one of two modes is used:
- the MIME processing mode, for processing list-style content,
assumes that negative tags denote structures,
while positive tags contain plain data.
- in XML processing mode, everything but the 0 tag (text node) is a structure.
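
Expressed as code, the two modes are a one-line decision on the tag. The function name is_structure and the mode strings are assumptions for this sketch; the text above does not say how tag 0 is treated in MIME mode, so it is taken as plain data here.

```python
def is_structure(tag, mode):
    """Decide from the tag alone whether a field opens a structure."""
    if mode == "mime":
        return tag < 0      # negative tags are structures (0 assumed to be data)
    if mode == "xml":
        return tag != 0     # only the text node (tag 0) is plain data
    raise ValueError("unknown mode: %r" % (mode,))


print(is_structure(-20, "mime"), is_structure(24, "mime"), is_structure(24, "xml"))
```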

If a parent has a subfield ^0,
that should contain the child's identity as dbname or mfn or dbname.mfn.
If the parent's indicator is delimited by a tab instead of a ^,
the next tab-delimited subfield is interpreted that way (where applicable).
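
A minimal sketch of how an application might pick up that identity, covering both the ^0 subfield and the tab-delimited form. The names parse_identity and child_identity are hypothetical, and the sketch assumes that an all-digit identity is an mfn.

```python
def parse_identity(text):
    """Parse 'dbname', 'mfn' or 'dbname.mfn' into (dbname, mfn)."""
    dbname, dot, mfn = text.rpartition(".")
    if dot and mfn.isdecimal():
        return (dbname, int(mfn))
    if text.isdecimal():
        return (None, int(text))
    return (text, None)


def child_identity(value):
    """Find the child's identity in a parent field value, or return None."""
    parts = value.split("\t")
    if len(parts) > 1:                    # indicator delimited by a tab
        return parse_identity(parts[1])
    for chunk in value.split("^")[1:]:    # caret-delimited subfields
        if chunk.startswith("0"):         # subfield ^0
            return parse_identity(chunk[1:])
    return None


print(child_identity("6\tcds.47"))
print(child_identity("+^0cds.47^w100"))
```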


* marked structures

There is a wide variety of techniques for marking fields as "children"
of other fields. Marking techniques work especially well for a single
level of substructuring; for nested structures, some restrictions apply.

We give some commonly used examples:

- quoting
is done by prefixing every child field value with a special string,
which is not used as prefix outside the child fields.
However, at least for a single level of quoting, it does not pose
a problem if the child fields themselves start with the same prefix:
the original value is still retrieved by stripping the (first) prefix.
This even works for multiple levels, as long as the record was properly
constructed, i.e. the quoting prefix is not used outside children.
Examples are the output of the diff command (which is driving the
RCS/CVS revision control systems very reliably) and the '>' quoting
used in e-mail replies.
- tagging
Instead of the field value, the field tag can of course also be used
as child mark. In some situations it might be possible to choose
appropriate reserved tags for the children.
In other situations, where some given child tag must be kept,
it can be stored as prefix in the field value according to the canonical
> Serialized
plain text format.
- keying
If the mark used depends on an attribute of the parent field,
the children can be determined even if non-contiguous.
With some more cooperation of the children, the mark might be an
attribute (subfield) instead of a prefix (indicator).
That way, children and parents are linked together logically rather
than "physically", by a common key just like in relational databases.
This easily extends to multiple levels using segmented keys
(consisting of several attributes/subfields).
While this scheme only works with well-behaved children and may waste
some space by replicating keys, it is simple and robust and gives
convenient access to the children without inspecting the structure.
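
Quoting and keying can both be sketched in a few lines. All names here (quote_children, split_quoted, children_by_key), the '>' prefix and the key subfield identifier 'k' are choices made for this illustration, not OpenIsis conventions.

```python
def quote_children(values, prefix=">"):
    """Mark child values by prepending the quoting prefix."""
    return [prefix + value for value in values]


def split_quoted(values, prefix=">"):
    """Separate children (first prefix stripped) from other field values."""
    children, others = [], []
    for value in values:
        if value.startswith(prefix):
            children.append(value[len(prefix):])
        else:
            others.append(value)
    return children, others


def children_by_key(fields, parent_key, key_id="k", delim="^"):
    """Keying: select the fields whose key subfield equals the parent's key."""
    marker = key_id + parent_key
    return [(tag, value) for tag, value in fields
            if marker in value.split(delim)[1:]]


# A child that itself starts with the prefix survives one level of quoting.
record = ["plain"] + quote_children(["a", ">b"])
print(split_quoted(record))

# Keyed children may be scattered among other fields.
fields = [(30, "^ken^aWater"), (70, "^aSmith"), (30, "^kfr^aEau"), (31, "^ken^aSummary")]
print(children_by_key(fields, "en"))
```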


*design children vs. attributes

Any information that can be represented using an attribute
can also be represented using a child.
From that point of view, attributes are a redundant "language" construct
and one might deem a model using only children as the simpler one.
We call such an attributeless model the "canonical verbose" representation.
It's a little bit similar to the "everything is an object"
approach of pure OO languages like Smalltalk.
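
To make the equivalence concrete, the following sketch rewrites a field with subfields as an embraced subrecord whose children carry the former attribute values. The function to_verbose and the child tags 201 and 202 are purely hypothetical.

```python
def to_verbose(tag, value, child_tags, delim="^"):
    """Rewrite one field with subfields as an embraced subrecord.

    child_tags maps a one-character subfield identifier to the tag of the
    child field that replaces it (a hypothetical mapping).
    """
    _indicator, *chunks = value.split(delim)
    fields = [(tag, "+")]                  # '+' opens the subrecord
    for chunk in chunks:
        if chunk:
            fields.append((child_tags[chunk[0]], chunk[1:]))
    fields.append((tag, ""))               # an empty value closes it
    return fields


print(to_verbose(102, "^vtop^w160", {"v": 201, "w": 202}))
```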


But then, having a richer language isn't always such a bad thing,
if you know how to use it appropriately.
(This "if" is the core of almost any serious criticism of rich languages,
but for now, let's assume we know what we're doing).
Appropriate use basically boils down to choosing the language construct
that was made just for your situation, i.e. not the most general one,
but quite to the contrary the most specific (restricted) one.
That way you will not only have the most efficient representation,
but also express additional information about what's going on.


In short, a "canonical compact" modelling can be based upon the principle
"Use attributes wherever possible".

Some logical property of a logical structure can be represented by
means of attributes, if
- it is simple,
i.e. one single string value.
- or at least flat,
i.e. itself a structure that can be represented based on attributes
that do not interfere with the parent's attributes.
In the latter case, the property will show up as several
logically interrelated attributes of the parent.
However, such a flat group of attributes might be a candidate
for a child under some circumstances.
- it is not repeatable.
Although OpenIsis supports repeated subfields as used by some MARCs,
XML/SGML attributes cannot be repeated.
(An XML document that repeats an attribute on one element is not well-formed,
so no semantics are defined for repeated attributes,
nor is access supported by parsers or the DOM).
Moreover, traditional CDS/ISIS implementations do not support
repeated subfields, so it's probably a good idea not to use them
without a pretty good reason.

Basically, when you think C, one field's attributes take everything
that goes into a simple struct, without using arrays or pointers.


The detailed modelling should also take into account the intended usage.

For example, one might turn some attribute candidates into children, if
- they are likely to be accessed or modified together
but independently of other properties
- they are candidates to be inherited or overridden as a group in a
> PatchWork
- the parent would otherwise become very large


*variants variant structures

The C language construct of a "union" is frequently used in bibliographic
databases. The typical form resembles the PASCAL "variant record",
using an initial field as indicator for the usage of the given field.
Sometimes, however, the more liberal C practice is used,
where the intended interpretation is specified somewhere in the record, somehow.

A similar construct is used in ALGOL-derived OO languages like C++ or Java,
where the indicator (of what object is this?) is out-of-band data
(i.e. cannot be modified or inspected like any other data).


In Isis records, fields always have a tag
(and subfields commonly have an identifier) indicating the kind of data.
Therefore, there is little need to introduce another level of switches.
A canonically decomposed model
- would not reuse fields or subfields with different structure
- would not contain rules like
"if subfield a has value b then subfield c must be present"

On the other hand, full decomposition might be tedious and
even hide relationships. Moreover, from a given point of view,
tags and identifiers are just ordinary subfields on some level.


In general, if the same tag is used for variants of a field,
the risk of misinterpretation of data should be minimized by
not reusing the same subfields with different structure.
After all, defining another indicator and ignoring an unexpected subfield
or complaining about the lack of an expected one is cheaper, more robust
and clearer than verifying an expected structure based on other subfield values.
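
The cheap kind of check described here looks only at which identifiers occur, never at the values of other subfields. A sketch, with the function name check_subfields and its return shape invented for this example:

```python
def check_subfields(subfields, required, optional=()):
    """Check a field's subfields by identifier only.

    subfields is a list of (identifier, data) pairs of one field.
    Returns (known, ignored, missing): the pairs to use, the unexpected
    identifiers to ignore, and the required identifiers that are absent.
    """
    allowed = set(required) | set(optional)
    known = [(ident, data) for ident, data in subfields if ident in allowed]
    ignored = [ident for ident, _ in subfields if ident not in allowed]
    present = {ident for ident, _ in subfields}
    missing = [ident for ident in required if ident not in present]
    return known, ignored, missing


print(check_subfields([("a", "Water"), ("x", "?")], required="ab"))
```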


* examples

A typical HTML table definition starting with
$
<table width="100%" cellpadding="0" cellspacing="0"
marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
$
will be compacted to, say,
$
100 +^w100%^p0^s0^m0^h0^t0^l0^b0
101 +
102 +^vtop^w160
0 this is the textbody
103 -
0 of the td node
102
101
...
$
For a detailed description of the transformation, see
> xmlisis the XML-ISIS documentation

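Reading such an embraced record back into a tree takes little more than a stack. The sketch below is an assumption-laden illustration, not the actual XML-ISIS code: parse_embraced is a made-up name, fields are given as (tag, value) pairs, tag 0 is always treated as a text node, and the final closing field for tag 100 (elided by the "..." above) is added by hand.

```python
def parse_embraced(fields):
    """Build a nested tree from (tag, value) fields.

    Returns a list of nodes; a node is (tag, value, children), where
    children is None for plain data fields.
    """
    root = []
    stack = [(None, root)]           # (tag of the open subrecord, its child list)
    for tag, value in fields:
        children = stack[-1][1]
        if tag == 0:                 # text node, never a structure
            children.append((tag, value, None))
        elif value == "":            # an empty value closes the open subrecord
            open_tag, _ = stack.pop()
            if open_tag != tag:
                raise ValueError("closing %r but %r is open" % (tag, open_tag))
        elif value.startswith("+"):  # '+' opens a subrecord
            node = (tag, value[1:], [])
            children.append(node)
            stack.append((tag, node[2]))
        elif value.startswith("-"):  # '-' is an empty subrecord
            children.append((tag, value[1:], []))
        else:                        # anything else is plain data
            children.append((tag, value, None))
    if len(stack) != 1:
        raise ValueError("unclosed subrecord %r" % (stack[-1][0],))
    return root


fields = [
    (100, "+^w100%^p0^s0^m0^h0^t0^l0^b0"), (101, "+"), (102, "+^vtop^w160"),
    (0, "this is the textbody"), (103, "-"), (0, "of the td node"),
    (102, ""), (101, ""), (100, ""),
]
tree = parse_embraced(fields)
print(tree)
```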
A six-field result record might be embedded within a response like
$
908 6 cds.47
24 Hydrological achievements and social problems
...
$
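
Extracting a counted subrecord needs no knowledge of its contents, which is what makes the embedding safe. A sketch, assuming fields are (tag, value) pairs; the function name take_counted and the sample response are invented for this example.

```python
def take_counted(fields, start=0):
    """Return (header, embedded fields, index after the subrecord)."""
    tag, value = fields[start]
    digits = ""
    for ch in value:                  # the count is given by any leading digits
        if ch not in "0123456789":
            break
        digits += ch
    if not digits:
        raise ValueError("field %r has no leading count" % (tag,))
    end = start + 1 + int(digits)
    if end > len(fields):
        raise ValueError("subrecord is truncated")
    return (tag, value), fields[start + 1:end], end


response = [
    (908, "2\tcds.47"),
    (24, "Hydrological achievements and social problems"),
    (70, "+ this value is never interpreted"),
    (1, "next field of the response"),
]
header, embedded, nxt = take_counted(response)
print(header, embedded, nxt)
```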

Assuming we gave tag -20 to "OR" (and 0 to a literal),
the query "plant OR water" might be parsed to
$
-20 B
0 plant
0 water
$

If -21 is assigned to "AND", "frog AND (plant OR water)" might look like
$
-21 B
0 frog
-20 B
0 plant
0 water
$

For implicit tags, the number of children is redundant
(fixed per tag in a given use) and will typically be omitted.
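
With the count omitted, the tree is rebuilt from a per-tag table of child counts. In this sketch the tables ARITY and NAMES and the function parse_implicit are assumptions; only the tag assignments -20 and -21 come from the examples above.

```python
ARITY = {-20: 2, -21: 2}          # assumed: OR and AND each take two children
NAMES = {-20: "OR", -21: "AND"}


def parse_implicit(fields, pos=0):
    """Rebuild the tree from a flat field list; return (node, next position)."""
    tag, value = fields[pos]
    pos += 1
    if tag not in ARITY:          # a literal (tag 0) is a leaf
        return value, pos
    children = []
    for _ in range(ARITY[tag]):   # the child count is implied by the tag
        child, pos = parse_implicit(fields, pos)
        children.append(child)
    return (NAMES[tag], children), pos


# "frog AND (plant OR water)" with the redundant counts omitted
query = [(-21, ""), (0, "frog"), (-20, ""), (0, "plant"), (0, "water")]
tree, _ = parse_implicit(query)
print(tree)
```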

---
$Id: Struct.txt,v 1.8 2003/06/23 14:44:29 kripke Exp $