Revision 237
Mon Mar 8 17:43:12 2004 UTC (20 years, 1 month ago) by dpavlin
File MIME type: text/plain
File size: 14927 byte(s)
initial import of openisis 0.9.0 vendor drop

1     structuring ISIS records using subfields or subrecords
2    
3    
4     * structures
5    
6     The means by which an Isis record can be structured into "data elements"
7     ("A defined unit of information", Z39.2 a.k.a. ISO2709)
8     fall in one of two broad categories (citing Z39.2):
9     - subfields
10     "A data element considered as a component of a field."
11     In *ML (SGML,HTML,XML...), subfields correspond to a node's attributes.
12     In MIME, subfields correspond to attributes of a MIME header value.
13     - subrecords
14     "A group of fields within a record that may be treated as a logical entity.
15     (When a record describes more than one entity,
16     the descriptions of individual entities may be treated as subrecords.)"
17     In *ML, subrecords correspond to a node's children.
18     In MIME, subrecords correspond to multipart body parts.
19    
20    
21     * subfields
22    
23     Since a field value can actually be anything,
24     including XML text or a serialized (textual or binary) Isis record,
25     it can be arbitrarily structured according to a regular expression
26     or some other grammar (machine parseable or not).
27    
28    
29     The term subfield, however, is used for a range of characters in the value
30     which is identified by rather simple means:
31     - fixed
32     if all (or all but the last) subfields have a fixed length
33     and are neither optional nor repeatable,
34     then each subfield can be found at a fixed position.
35     - delimited with optional identifier
36     this is the proper Z39.2 notion of a subfield.
37    
38     If a special delimiter character is found in the field,
39     it breaks the field into subfields.
40     Z39.2, and thus MARC, use the character 31 as delimiter
41     (hex 1F, CTRL-_, ASCII "unit separator" US).
42     Traditional Isis uses the caret '^'.
43    
44     OpenIsis permits any character, including the horizontal TAB and semicolon.
45     More precisely, OpenIsis reverses Z39.2's notion that
46     "every subfield is INTRODUCED by a delimiter, unless it isn't"
47     into the principle that for every data element, it is specified
48     how its end is detected, including by fixed length or varying delimiters.
49    
50    
51     The initial n characters of a subfield are used to identify the subfield.
52     Z39.2 permits any (small) fixed value for n, including 0, i.e. not identified.
53     The MARC family of standards uses n=1.
54     OpenIsis allows for any value, including variable length identifiers,
55     which are themselves delimited by some character like a '=' (see below).
56    
57    
58     Z39.2 states that if identifiers are used, each must be preceded
59     by a delimiter, and every data element, including the first,
60     must be identified that way. However, an initial range of m characters
61     (i.e. preceding the first delimiter) in every field may serve as "indicator",
62     which is not regarded as a "data element". Again, m is a small fixed number;
63     MARC uses m=2. Traditional Isis has no special support for indicators.
64     OpenIsis allows access to whatever is before the first delimiter.
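
The delimiter, identifier and indicator rules above can be sketched in a few lines of Python. This is only an illustration, not OpenIsis code: the function name `split_subfields` and its parameters are made up here, and it assumes fixed-length identifiers (n given by `id_len`) and a single delimiter character.

```python
def split_subfields(value, delimiter="^", id_len=1):
    """Split a field value into (indicator, [(identifier, data), ...]).

    Hypothetical helper: the indicator is whatever precedes the first
    delimiter; each following part starts with an id_len-character identifier.
    """
    parts = value.split(delimiter)
    indicator = parts[0]
    subfields = [(part[:id_len], part[id_len:]) for part in parts[1:]]
    return indicator, subfields


# Traditional Isis style: caret delimiter, one-character identifiers.
print(split_subfields("+^vtop^w160"))
# MARC style: two indicator characters, delimiter is character 31 (hex 1F).
print(split_subfields("10\x1faTitle\x1fbSubtitle", delimiter="\x1f"))
```

With `id_len=0` every part is returned unidentified, which corresponds to the n=0 case mentioned above.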
65    
66    
67     Different subfielding methods can be mixed or nested.
68     Typical cases are:
69     - mixed fixed/delimited
70     After some initial fixed subfields, following subfields are delimited.
71     This can be used to describe MARC's fixed indicators.
72     - nested delimited/fixed
73     A delimited subfield has itself a fixed substructure.
74     Actually the leading identifier in a subfield can be regarded
75     as fixed part in a mixed substructure.
76     - nested unidentified delimited
77     A delimited subfield has itself a delimited structure.
78     This can be used to model variable length identifiers.
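
Two of these cases can be sketched as follows. The helper names, the choice of ';' and '=' as delimiters, and the sample values are assumptions for illustration only; OpenIsis itself may handle these cases differently.

```python
def split_mixed(value, widths, delimiter="^"):
    """Mixed fixed/delimited: cut leading fixed-width subfields,
    then split the remainder at the delimiter."""
    fixed, pos = [], 0
    for width in widths:
        fixed.append(value[pos:pos + width])
        pos += width
    parts = value[pos:].split(delimiter)
    if parts and parts[0] == "":
        parts = parts[1:]          # nothing between fixed part and first delimiter
    return fixed, parts


def split_keyed(value, outer=";", inner="="):
    """Nested unidentified delimited: each outer subfield is itself
    delimited, and its first part serves as a variable-length identifier."""
    result = []
    for part in value.split(outer):
        key, sep, data = part.partition(inner)
        result.append((key, data) if sep else ("", part))
    return result


print(split_mixed("10^aTitle^bSub", (1, 1)))
print(split_keyed("lang=en;title=Water"))
```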
79    
80     In other words, identifiers are themselves nothing but subfields
81     used as keys on some level of nesting.
82     On the other hand, any subfield could serve as a key for its parent.
83     This is used e.g. to select a field by a subfield indicating a language
84     (see below for keyed subrecords).
85    
86     If you look at the
87     > Serialized plaintext representation of an Isis record,
88     actually the whole record is a newline delimited value,
89     the whole database is a blankline (double newline) delimited value
90     and each field has its tag as initial tab-delimited subfield.
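
Read this way, a serialized database can be taken apart with nothing but nested splitting. The sketch below assumes the simplest possible form of the plaintext format (no escaping of newlines inside values, numeric tags); the function name and the second and third sample fields are made up.

```python
def parse_database(text):
    """Split a serialized database into records (blank-line delimited),
    each record into fields (newline delimited), and each field into
    its tag and value (tab delimited)."""
    records = []
    for block in text.split("\n\n"):
        fields = []
        for line in block.split("\n"):
            if not line:
                continue
            tag, _, value = line.partition("\t")
            fields.append((int(tag), value))
        if fields:
            records.append(fields)
    return records


sample = (
    "24\tHydrological achievements and social problems\n"
    "70\tsample author\n"
    "\n"
    "24\tsample second title\n"
)
for record in parse_database(sample):
    print(record)
```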
91    
92    
93     In the future, OpenIsis will add support for a wide variety of
94     subfielding techniques such as defined by regular expressions,
95     MIME headers or produced in typical "character/comma separated values" files
96     (optionally using quotes).
97    
98     Since splitting subfields is mostly and can always be done on the
99     application level (i.e. a database server rarely needs to care),
100     "support" essentially boils down to the definition of appropriate meta data.
101    
102    
103     * subrecords
104    
105     A subrecord consists of a typically continuous range of fields within a record,
106     started by some field to introduce the subrecord.
107     Some variants, however, like keyed subfields,
108     can be freely scattered and don't need a "header" field.
109    
110    
111     There are basically four ways to denote the boundaries of structures:
112     - embraced
113     where a special field is used to denote the structure's end.
114     This resembles SGML-style notations,
115     where each opening tag is matched by a closing tag.
116     This is relatively easy and recommended for every day use.
117     - marked
118     where the fields of the child structure are marked as such.
119     This is sort of the opposite approach of embracing.
120     Marking comes in several powerful flavours,
121     see below for a more detailed discussion.
122     - counted
123     where the number of fields (not children) belonging to the
124     structure is given in (any leading digits of) the initial field.
125     This allows for safe embedding regardless of the
126     structure's contents and is thus used in contexts where
127     full generality is needed like when embedding result records
128     within a server's response.
129     - implicit
130     where the number of children is fixed.
131     An example of this is the parse tree of a query,
132     where the structure "AND" has exactly two children
133     (which in turn might be structures).
134     This is used mostly for internal structures like parsed
135     queries or formats, which are not meant to be exchanged.
136    
137     The field introducing a subrecord might have any subfields
138     just like other fields, similar to the attributes that might
139     be assigned to a tag in SGML applications like HTML.
140    
141     However, the first subfield (unidentified initial characters)
142     of a field opening an embraced or counted subrecord is reserved as indicator:
143     - a plus sign '+' as first character
144     indicates explicitly opening a subrecord
145     - a minus sign '-' as first character
146     indicates an empty subrecord (containing no children)
147     - an empty value
148     indicates explicitly closing a subrecord
149     (similar to the closing blank line used in several protocols)
150     - an initial numeric value
151     (of decimal digits) gives the number of fields to follow.
152     - an initial character @A-Z
153     gives the number of children to follow (@=0,A=1,B=2...) (rarely used)
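
The indicator rules listed above amount to a small classification of the field value. The following sketch is an assumption about how one might code it, not the OpenIsis implementation; the function name and the returned labels are invented. It is only meaningful for a field already known to introduce a subrecord, since a plain data value may well start with a digit or letter.

```python
def classify_opener(value):
    """Classify the value of a field that is known to introduce
    (or close) a subrecord. Returns (kind, count_or_None)."""
    if value == "":
        return ("close", None)
    first = value[0]
    if first == "+":
        return ("open", None)
    if first == "-":
        return ("empty", None)
    if first in "0123456789":
        digits = ""
        for ch in value:
            if ch not in "0123456789":
                break
            digits += ch
        return ("counted_fields", int(digits))
    if "@" <= first <= "Z":
        # '@' directly precedes 'A' in ASCII, hence @=0, A=1, B=2 ...
        return ("counted_children", ord(first) - ord("@"))
    return ("unknown", None)


for sample in ["+^vtop^w160", "-", "", "6\tcds.47", "B"]:
    print(repr(sample), classify_opener(sample))
```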
154    
155     Auxiliary information about the child,
156     like an embedded record's row number and type,
157     is stored in subfields of the parent.
158    
159    
160     * conventions
161    
162     While the intended usage of subrecords might be specified in
163     more detail in the
164     > Meta table metadata
165     , the schema can also be used standalone (without referring to metadata),
166     if some conventions on tag ranges are followed.
167    
168     The extent of subrecords by length or braces can be safely
169     determined if you just know that you want the given field
170     to be regarded as a subrecord.
171    
172     For subrecords with a fixed number of children (meant for internal use),
173     it is necessary to recognize whether a following field is itself a structure.
174     If they are used at all, the tag range -1..-99 should be reserved for this
175     purpose.
176    
177     In this context, typically one of two modes is used:
178     - the MIME processing mode for processing list-style content,
179     assumes that negative tags denote structures,
180     while positive contain plain data.
181     - in XML processing mode, everything but the 0 tag (text node) is a structure.
182    
183     If a parent has a subfield ^0,
184     that should contain the child's identity as dbname or mfn or dbname.mfn.
185     If the parent's indicator is delimited by a tab instead of a ^,
186     the next tab-delimited subfield is interpreted that way (where applicable).
187    
188    
189     * marked structures
190    
191     There is a wide variety of techniques for marking fields as "children"
192     of other fields. Marking techniques work especially well for a single
193     level of substructuring; for nested structures, some restrictions apply.
194    
195     We give some commonly used examples:
196    
197     - quoting
198     is done by prefixing every child field value with a special string,
199     which is not used as prefix outside the child fields.
200     However, at least for a single level of quoting, it does not impose
201     a problem if the child fields themselves started with the same prefix:
202     Still, the original value is retrieved by stripping the (first) prefix.
203     This even works for multiple levels, as long as the record was properly
204     constructed, i.e. the quoting prefix is not used outside children.
205     Examples are the output of the diff command (which is driving the
206     RCS/CVS revision control system very reliably) and the '>' quoting
207     used in e-mail replies.
208     - tagging
209     Instead of the field value, of course also the field tag can be used
210     as child mark. In some situations it might be possible to choose
211     appropriate reserved tags for the children.
212     In other situations, where some given child tag must be kept,
213     it can be stored as prefix in the field value according to the canonical
214     > Serialized
215     plain text format.
216     - keying
217     If the mark used is dependent on an attribute of the parent field,
218     the childs can be determined even if non-continuous.
219     With some more cooperation of the childs, the mark might be an
220     attribute (subfield) instead of a prefix (indicator).
221     That way, childs and parents are linked together rather logically
222     than "physically" by a common key just like in relational databases.
223     This easily extends to multiple levels using segmented keys
224     (consisting of several attributes/subfields).
225     While this scheme only works with well behaved childs and may waste
226     some space by replicating keys, it is simple and robust and gives
227     convenient access to the childs without inspecting the structure.
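
As an illustration of the quoting technique, the sketch below uses the e-mail style prefix "> " (an arbitrary choice) and two made-up helper names. It shows the property claimed above: a child value that itself starts with the prefix survives one round of quoting and unquoting, because only the first prefix is stripped.

```python
PREFIX = "> "  # assumed quoting prefix, as in e-mail replies


def quote(children, prefix=PREFIX):
    """Mark every child field value by prepending the prefix."""
    return [prefix + value for value in children]


def split_quoted(values, prefix=PREFIX):
    """Separate quoted child values (prefix stripped once) from the rest."""
    children = [v[len(prefix):] for v in values if v.startswith(prefix)]
    others = [v for v in values if not v.startswith(prefix)]
    return children, others


original_children = ["> already quoted once", "plain child"]
record = ["parent header"] + quote(original_children)
children, others = split_quoted(record)
print(children)   # the original child values, inner prefix intact
print(others)     # fields that are not children
```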
228    
229    
230     *design children vs. attributes
231    
232     Any information that can be represented using an attribute
233     can also be represented using a child.
234     From that point of view, attributes are a redundant "language" construct
235     and one might deem a model using only children as the simpler one.
236     We call such an attributeless model "canonical verbose" representation.
237     It's a little bit similar to the "everything is an object"
238     approach of pure OO languages like Smalltalk.
239    
240    
241     But then, having a richer language isn't always such a bad thing,
242     if you know how to use it appropriately.
243     (This "if" is the core of almost any serious criticism of rich languages,
244     but for now, let's assume we know what we're doing).
245     Appropriate use basically boils down to choosing the language construct
246     that was just made for your situation, i.e. not the most general one,
247     but, quite the opposite, the most specific (restricted) one.
248     That way you will not only have the most efficient representation,
249     but also express additional information about what's going on.
250    
251    
252     In short, a "canonical compact" modelling can be based upon the principle
253     "Use attributes wherever possible".
254    
255     Some logical property of a logical structure can be represented by
256     means of attributes, if
257     - it is simple,
258     i.e. one single string value.
259     - or at least flat,
260     i.e. itself a structure that can be represented based on attributes
261     that do not interfere with the parent's attributes.
262     In the latter case, the property will show up as several
263     logically interrelated attributes of the parent.
264     However, such a flat group of attributes might be a candidate
265     for a child under some circumstances.
266     - it is not repeatable.
267     Although OpenIsis supports repeated subfields as used by some MARCs,
268     XML/SGML attributes can not be repeated.
269     (Technically, they can, but there are neither defined semantics for
270     repeated attributes nor is access supported by parsers or the DOM).
271     Moreover, traditional CDS/ISIS implementations do not support
272     repeated subfields, so it's probably a good idea to not use them
273     without a pretty good reason.
274    
275     Basically, when you think C, one field's attributes take everything
276     that goes into a simple struct, without using arrays or pointers.
277    
278    
279     The detailed modelling should also take into account the intended usage.
280    
281     For example, one might turn some attribute candidates into children, if
282     - they are likely to be accessed or modified together
283     but independent of other properties
284     - they are candidates to be inherited or overridden as a group in a
285     > PatchWork
286     - the parent would otherwise become very large
287    
288    
289     *variants variant structures
290    
291     The C language construct of a "union" is frequently used in bibliographic
292     databases. The typical form resembles the PASCAL "variant record",
293     using an initial field as indicator for the usage of the given field.
294     Sometimes, however, the more liberal C practice is used,
295     where the intended interpretation is specified somewhere in the record, somehow.
296    
297     A similar construct is used in ALGOL-derived OO languages like C++ or Java,
298     where the indicator (of what object is this ?) is out-of-band data
299     (i.e. cannot be modified or inspected like any other data).
300    
301    
302     In Isis records, fields always have a tag
303     (and subfields commonly have an identifier) indicating the kind of data.
304     Therefore, there is little need to introduce another level of switches.
305     A canonically decomposed model
306     - would not reuse fields or subfields with different structure
307     - would not contain rules like
308     "if subfield a has value b then subfield c must be present"
309    
310     However, on the other hand, full decomposition might be tedious and
311     even hide relationships. Moreover, from a given point of view,
312     tags and identifiers are just ordinary subfields on some level.
313    
314    
315     In general, if the same tag is used for variants of a field,
316     the risk of misinterpretation of data should be minimized by
317     not reusing the same subfields with different structure.
318     After all, defining another indicator and ignoring an unexpected subfield
319     or complaining about the lack of an expected one is cheaper and more robust and clear
320     than verifying an expected structure based on other subfield values.
321    
322    
323     * examples
324    
325     A typical HTML table definition starting with
326     $
327     <table width="100%" cellpadding="0" cellspacing="0"
328     marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
329     <tr>
330     <td valign="top" width="160">
331     this is the textbody <br/> of the td node
332     </td>
333     </tr>
334     ...
335     $
336     will be compacted to, say,
337     $
338     100 +^w100%^p0^s0^m0^h0^t0^l0^b0
339     101 +
340     102 +^vtop^w160
341     0 this is the textbody
342     103 -
343     0 of the td node
344     102
345     101
346     ...
347     $
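
To make the embraced notation concrete, here is a rough sketch that turns the compacted fields above back into a nested tree. It is not the OpenIsis parser: the function name, the dictionary layout and the closing field for tag 100 (elided by "..." in the listing) are assumptions, and it follows the XML processing mode where only tag 0 carries text.

```python
def parse_embraced(fields):
    """Build a tree from (tag, value) pairs using '+' (open),
    '-' (empty subrecord) and an empty value (close)."""
    root = {"tag": None, "attrs": "", "children": []}
    stack = [root]
    for tag, value in fields:
        if tag == 0:                                  # text node
            stack[-1]["children"].append(value)
        elif value.startswith("+"):                   # open a subrecord
            node = {"tag": tag, "attrs": value[1:], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif value.startswith("-"):                   # empty subrecord
            stack[-1]["children"].append(
                {"tag": tag, "attrs": value[1:], "children": []})
        elif value == "":                             # close a subrecord
            if len(stack) == 1 or stack[-1]["tag"] != tag:
                raise ValueError("unexpected close of tag %d" % tag)
            stack.pop()
        else:
            raise ValueError("bad indicator in tag %d" % tag)
    if len(stack) != 1:
        raise ValueError("unclosed subrecord")
    return root["children"]


fields = [
    (100, "+^w100%^p0^s0^m0^h0^t0^l0^b0"),
    (101, "+"),
    (102, "+^vtop^w160"),
    (0, "this is the textbody"),
    (103, "-"),
    (0, "of the td node"),
    (102, ""),
    (101, ""),
    (100, ""),   # assumed: the listing above is truncated before this close
]
tree = parse_embraced(fields)
td = tree[0]["children"][0]["children"][0]
print(td["tag"], td["attrs"], td["children"])
```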
348     For a detailed description of the transformation, see
349     > xmlisis the XML-ISIS documentation
350    
351     A six-field result record might be embedded within a response like
352     $
353     908 6 cds.47
354     24 Hydrological achievements and social problems
355     ...
356     $
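
Extracting such a counted subrecord needs no knowledge of its contents. In the sketch below, the helper names are invented, the tab after the count is assumed to introduce the child's identity (dbname.mfn), and a two-field sample with made-up placeholder fields stands in for the six-field record, whose remaining fields are not shown above.

```python
def leading_count(value):
    """Return (count, rest) from the leading decimal digits of a value."""
    digits = ""
    for ch in value:
        if ch not in "0123456789":
            break
        digits += ch
    if not digits:
        raise ValueError("no leading count in %r" % value)
    return int(digits), value[len(digits):]


def take_counted(fields, start):
    """Read the counted subrecord introduced by fields[start].
    Returns (subrecord, index of the first field after it)."""
    tag, value = fields[start]
    count, rest = leading_count(value)
    identity = rest[1:] if rest.startswith("\t") else ""
    body = fields[start + 1:start + 1 + count]
    if len(body) < count:
        raise ValueError("subrecord truncated")
    return {"tag": tag, "identity": identity, "fields": body}, start + 1 + count


response = [
    (908, "2\tcds.47"),
    (24, "Hydrological achievements and social problems"),
    (26, "placeholder field"),          # made-up sample data
    (999, "field after the subrecord"), # made-up sample data
]
subrecord, next_index = take_counted(response, 0)
print(subrecord["identity"], len(subrecord["fields"]), next_index)
```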
357    
358     Assuming we gave tag -20 to "OR" (and 0 to a literal),
359     the query "plant OR water" might be parsed to
360     $
361     -20 B
362     0 plant
363     0 water
364     $
365    
366     If -21 is assigned to "AND", "frog AND (plant OR water)" might look like
367     $
368     -21 B
369     0 frog
370     -20 B
371     0 plant
372     0 water
373     $
374    
375     For implicit tags, the number of children is redundant
376     (fixed per tag in a given use) and will typically be omitted.
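
Because the number of children is fixed per tag, a flat field list like the ones above can be rebuilt into a tree by a short recursive reader. The sketch below assumes the tag assignments used in the examples (-20 for "OR", -21 for "AND", 0 for a literal); the `ARITY` table and function name are illustrative only.

```python
ARITY = {-20: 2, -21: 2}   # assumed: "OR" and "AND" each take two children


def parse_implicit(fields, pos=0):
    """Read one node starting at fields[pos].
    Returns (node, index of the next unread field)."""
    tag, value = fields[pos]
    pos += 1
    if tag not in ARITY:
        return value, pos                  # literal: plain data
    children = []
    for _ in range(ARITY[tag]):
        child, pos = parse_implicit(fields, pos)
        children.append(child)
    return (tag, children), pos


query = [(-21, "B"), (0, "frog"), (-20, "B"), (0, "plant"), (0, "water")]
tree, used = parse_implicit(query)
print(tree)
```

The "B" values are ignored here, which is exactly why they can be omitted for implicit structures.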
377    
378     ---
379     $Id: Struct.txt,v 1.8 2003/06/23 14:44:29 kripke Exp $
