0.9.0/doc/Views.txt

Using views in OpenIsis.

A "view", like a VIEW in SQL, creates new, typically temporary records based on
existing ones by means of some transformation like selecting a subset of the
available fields (a projection), retagging fields or manipulating field values.


As a general concept, a view can be implemented using any algorithm
in any of the available programming languages to create new records
(and need not refer only to record contents, but may also access other
resources like files).

In a narrower sense, however, a view is a special kind of transformation
defined by a "view record". The fields of a view record have tags
as they should appear in the target, typically some valid tags of the source
plus, for example, index control tags, if the view describes indexing.


In the following, the term "alphanumeric" denotes any ASCII letter or digit,
or any non-ASCII character.
"Word character" denotes any alphanumeric, hyphen '-' or underscore '_'.


The value of a field in a view record can have one of several forms:
-       if it is empty,
        the tag is passed to the source record's v command (see below).
-       if it starts with a %,
        the rest of the value (w/o the %) is passed to the source record's v command.
        If the tag is not 0, '=tag;' is prepended.
-       if the value starts with any word character,
        it is used literally.
-       if it starts with a quote,
        the rest of the value is used literally (w/o the quote).
        If the value's last character is a quote, it is discarded.
-       if it starts with an @,
        the rest of the value names a view to be included
-       if it starts with an &,
        the rest of the value is the name of an extension exit to call
-       if it starts with an {,
        the rest of the value is a script to be executed in the host language
        (after stripping an optional } as last character)
-       any other form
        (i.e. starting with other ASCII punctuation) is reserved for future use
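
The dispatch on the first character can be sketched in Python as follows.
This is only an illustration, not the OpenIsis implementation: the function
name and the returned labels are made up here, and it assumes that both
ASCII quote characters count as "a quote".

```python
def classify_view_value(value):
    """Return a label (hypothetical names) for the form of a view field value."""
    if value == "":
        return "format-from-tag"      # the tag itself is passed to the v command
    first = value[0]
    if first == "%":
        return "format"               # rest of the value is a v command format
    # word character: ASCII letter or digit, any non-ASCII character, '-' or '_'
    if (first.isascii() and first.isalnum()) or not first.isascii() or first in "-_":
        return "literal"
    if first in "\"'":                # assumption: either ASCII quote qualifies
        return "quoted-literal"
    if first == "@":
        return "include"
    if first == "&":
        return "exit"
    if first == "{":
        return "script"
    return "reserved"                 # other ASCII punctuation
```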

Example: the view
$
24
70
$
is a simple projection selecting fields 24 and 70 from the source.
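
As a rough illustration, such a projection might behave like the following
Python sketch. The record model (a list of (tag, value) pairs) and the
function name are assumptions made for this example; that the output
follows the order of the view is also an assumption.

```python
def project(record, tags):
    """Keep only the fields whose tag is listed in the view, in view order."""
    out = []
    for tag in tags:
        out.extend((t, v) for (t, v) in record if t == tag)
    return out

# hypothetical source record and the view from the example above
source = [(24, "Title"), (26, "Paris"), (70, "Author A"), (70, "Author B")]
view = [24, 70]
target = project(source, view)
```

Here target holds the field 24 and both occurrences of field 70, while
field 26 is dropped.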


*       the v command

The v command is described here as an abstract command.
It is available in the C-API as well as from the language bindings,
possibly with language specific variations.

It resembles the core concepts of traditional formatting,
including access to and looping over fields and subfields, 
selecting substrings and attaching optional literals.
It is sort of the record's printf.
Like printf, and unlike traditional formatting,
it neither supports flow control nor screen rendering.


It takes a source and target record plus a string specifying a format.
Depending on the language environment, the source and/or target may be implicit.

If the format starts with '=tag;', where tag is a tag,
this gives the tag used in the target and as default.
Otherwise, tags from the source are used in the target and default is *.
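
A minimal Python sketch of separating this optional prefix from the rest
of the format; the function name is hypothetical and the tag is kept as
a plain string.

```python
def split_target_tag(fmt):
    """Split an optional leading '=tag;' off a format string."""
    if fmt.startswith("="):
        tag, sep, rest = fmt[1:].partition(";")
        if sep:                        # only accept a complete '=tag;' prefix
            return tag, rest
    return None, fmt                   # no prefix: source tags are used, default is *
```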

The first (next) character is then checked for an encoding mode, see below.


The format is a series of output specifications,
consisting of a field tag (word characters, either numerical or by field name),
selectors and modifiers. The special tag * selects all fields.
Each spec may contain several subspecs, separated by commas,
using the same child context (otherwise, specs and subspecs are the same).
So the format is spec[;spec...], and a spec is spec[,subspec...].  


The general operation of the v command is to loop over the record
until the last occurrence has been seen for all tags.
In the nth repetition, for each tag in any spec,
the (n+i)th occurrence of a field with this tag is used,
where i is an offset given by an occurrence selector,
and it is determined whether this is the last occurrence.
For every iteration, a new output field is started,
and the format is processed as follows:
-       loop over the (main) specifications
-       loop over children (or use the given field)
-       loop over subspecs
-       loop over subfields (or use the whole field)
-       apply decoding
-       apply substring
-       apply encoding
-       attach literals
-       append the result to the target record
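
The outer loop can be pictured with the following heavily simplified
Python sketch. It assumes a record is a list of (tag, value) pairs,
ignores offsets, children, subfields, decoding, encoding and literals,
and simply concatenates the nth occurrence of each tag into the nth
output field. All names are hypothetical.

```python
def v_loop(record, tags):
    """Build one output field per repetition from the nth occurrence of each tag."""
    occ = {t: [v for (ft, v) in record if ft == t] for t in tags}
    rounds = max((len(vals) for vals in occ.values()), default=0)
    out = []
    for n in range(rounds):            # stop once every tag is exhausted
        parts = [occ[t][n] for t in tags if n < len(occ[t])]
        out.append("".join(parts))     # a new output field for every iteration
    return out
```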


Each spec starts with an optional decoding mode,
optionally followed by a tag,
optionally followed by a child selector,
optionally followed by a subfield selector,
optionally followed by string modifiers,
optionally intermingled with occurrence selectors and literals:
-       , starts a new subspec
-       ; starts a new spec with default context reset to the last tag seen
-       . starts a child selector
-       ^% start a subfield selector
-       ([ start an occurrence selector
-       /~"'`|+ start a literal
-       : starts a substring selector
-       & calls an extension
-       { evaluates a script


*       encoding mode

One of the following operators as first character of the format
can select an output "encoding":
-       ? outputs a 1 if the selected entity exists, 0 otherwise
-       !       the opposite of ?
-       &       applies HTML encoding
-       %       applies URL encoding

The test encodings ?! inhibit normal processing;
they return immediately after checking the first occurrence of the first tag.
For example, using a default of all tags (*), the format consisting
solely of a '?' checks whether a record is empty.

More special characters (but not the '*') may be designated in the future,
so a format should always start with a tag (possibly explicit *).
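
The effect of the four operators might be approximated in Python as
below. This is a sketch under assumptions: the exact escaping rules used
by OpenIsis are not specified here, so the standard library functions
html.escape and urllib.parse.quote stand in for them, and the function
name is made up.

```python
import html
from urllib.parse import quote

def apply_encoding(mode, values):
    """Apply an encoding mode to the list of selected values."""
    if mode == "?":
        return "1" if values else "0"      # existence test
    if mode == "!":
        return "0" if values else "1"      # the opposite of ?
    if mode == "&":
        return "".join(html.escape(v) for v in values)
    if mode == "%":
        return "".join(quote(v) for v in values)
    return "".join(values)                 # no encoding mode given
```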


*       decoding mode

An uppercase character before the tag may denote a decoding mode:
$
-       H heading mode:
^x is replaced by ';' for x=a, ',' for x=b..i, '.' for others
angle brackets are removed (>< replaced by '; '), <a> or <a=b> evaluates to a

-       D data mode:
in addition to heading mode, if there is no explicit literal after this field,
append '  ', if it ends in "punctuation", or '.  ' else.

-       X index mode
like heading, but <a> evaluates to nothing and <a=b> to b

-       M traditional
For compatibility, specs reading MHx or MDx (x = L or U) set heading
or data mode, resp., as default processing (before substringing).
The case directive is ignored.
$
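
Heading mode, read literally from the description above, could be
sketched in Python like this. The sketch assumes case-sensitive subfield
codes and does not treat a delimiter at the very start of a field
specially, since neither point is specified here. The function names
are hypothetical.

```python
import re

def _subfield_punct(match):
    code = match.group(1)
    if code == "a":
        return ";"
    if "b" <= code <= "i":
        return ","
    return "."

def heading_decode(value):
    """Apply heading mode (H) to a field value."""
    # <a=b> evaluates to a: drop the '=b' part, keep the brackets for now
    value = re.sub(r"<([^<>=]*)=[^<>]*>", r"<\1>", value)
    # adjacent bracketed terms: '><' becomes '; '
    value = value.replace("><", "; ")
    # the remaining angle brackets are removed
    value = value.replace("<", "").replace(">", "")
    # subfield delimiters ^x become punctuation
    return re.sub(r"\^(.)", _subfield_punct, value)
```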


*       child selector

If a tag is immediately followed by a dot '.' and an optional tag,
the field context is switched, for this spec and the following subspecs separated by ',',
to loop over the children with the given tag.
The tag defaults to 0, selecting text nodes in the canonical XML representation.
A * selects all children, a second . recursively selects all children.


*       subfield selectors

The primary subfield selector is the hat '^', followed by one character.
It can produce multiple items, like repetitions of a subfield or keywords.

If the selector character is
-       alphanumeric
        select the (repetitions of the) subfield tagged with this character.
-       an opening pairing brace
        i.e. one of '(','{','[' or the angle bracket '<',
        words between pairs of this brace are selected (commonly keywords).
-       a *
        selects the part up to the first subfield delimiter
-       a space
        selects naive words as sequences of alphanumerics
-       a )
        selects parts between TABs (array mode)
-       other punctuation
        like / or | selects parts between pairs of this character
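
Three of these cases (alphanumeric code, opening brace, and *) can be
illustrated with a small Python sketch. It assumes '^' is the subfield
delimiter in the stored value and that subfield codes are matched case
sensitively. The function name is hypothetical.

```python
import re

PAIRS = {"(": ")", "{": "}", "[": "]", "<": ">"}

def hat_select(value, code):
    """Return the items selected by '^' followed by the character code."""
    if code in PAIRS:
        # words between pairs of this brace
        pattern = re.escape(code) + r"(.*?)" + re.escape(PAIRS[code])
        return re.findall(pattern, value)
    head, *subfields = value.split("^")
    if code == "*":
        return [head]                  # part up to the first subfield delimiter
    # all repetitions of the subfield tagged with this character
    return [sf[1:] for sf in subfields if sf[:1] == code]
```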


The percent sign '%' (think printf) works basically like the hat, but
-       removes quotes surrounding values
-       by default treats the TAB as subfield delimiter
-       if followed by a punctuation character or space,
        treats this plus surrounding whitespace as delimiter,
        not separating within quotes.
-       if followed by a ),
        (optionally after another punctuation) goes to array mode,
        that is there is no subfield indicator stripped from the values
-       if followed by multiple word characters,
        (including '-' and '_', optionally after an initial punctuation)
        searches for subfields starting with that sequence followed by '=' or ':'

Examples:
-       '^)' splits at TABs
-       '%)' splits at TABs with quote removal
-       '%a' selects a sequence following a TAB and 'a'
-       '%,)' splits a line of comma separated values
-       '%;*' selects the primary value of a MIME property
-       '%;charset' selects the charset attribute of a MIME property
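
The last three examples could be approximated in Python as follows.
This is a sketch only: it assumes double quotes are the quoting
character, and the function names are made up for illustration.

```python
def unquote(text):
    """Strip whitespace and one pair of surrounding double quotes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text

def split_quoted(value, delim):
    """Split at delim, but not within quotes; trim and unquote each part."""
    parts, buf, in_quotes = [], [], False
    for ch in value:
        if ch == '"':
            in_quotes = not in_quotes
            buf.append(ch)
        elif ch == delim and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return [unquote(p) for p in parts]

def named_subfield(value, delim, name):
    """Find a subfield starting with name followed by '=' or ':'."""
    for part in split_quoted(value, delim)[1:]:
        for sep in "=:":
            if part.startswith(name + sep):
                return unquote(part[len(name) + 1:])
    return None
```

With these helpers, split_quoted(line, ",") corresponds roughly to '%,)',
split_quoted(value, ";")[0] to '%;*', and
named_subfield(value, ";", "charset") to '%;charset'.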


*       occurence selector

By default, all occurences of fields, childs and subfields are used.
One or multiple occurences can be selected explicitly following a tag,
child selector or subfield selector using brackets [] (counting from 1)
or parentheses (counting from 0) like (i) or (i..j).

-       If i is ommited, it defaults to the first (1 or 0, resp.).
-       If j is ommited, it defaults to last.
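
The two counting conventions can be sketched in Python as below. The
sketch assumes that j is inclusive and that an empty selector body means
the first item; neither is stated explicitly here. The function name is
hypothetical.

```python
def select_occurrences(items, selector):
    """Select items by a numeric selector like '[2]', '(0..1)' or '[2..]'."""
    base = 1 if selector[0] == "[" else 0      # brackets count from 1
    body = selector[1:-1]
    if ".." in body:
        lo, hi = body.split("..", 1)
        first = int(lo) - base if lo else 0
        last = int(hi) - base if hi else len(items) - 1
    else:
        first = last = int(body) - base if body else 0
    return items[max(first, 0):last + 1]       # assumption: j is inclusive
```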

Alternatively, occurrences may be selected by contents.
The general format is an optional subfield selector,
followed by a comparison operator, followed by a literal.
Only occurrences where the field or specified subfield matches
the literal according to the comparison are selected.
Parentheses select all such occurrences,
while brackets select the first match
and default to the first occurrence if none matches.

Operators are
-       = for equality
-       ~ for contains
-       * for starts with
-       + for ends with
The equality operator may be omitted where unambiguous.
If some key subfield is known to occur at the start or end of the field,
it is probably more efficient to test for +^zen than for ^z=en.
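
Selection by contents might look like this in Python. The sketch
compares whole values only (no subfield selector) and models the
difference between parentheses and brackets with a hypothetical
first_only flag.

```python
OPS = {
    "=": lambda value, lit: value == lit,           # equality
    "~": lambda value, lit: lit in value,           # contains
    "*": lambda value, lit: value.startswith(lit),  # starts with
    "+": lambda value, lit: value.endswith(lit),    # ends with
}

def select_by_content(items, op, literal, first_only=False):
    """Select occurrences matching the literal under the given operator."""
    matches = [v for v in items if OPS[op](v, literal)]
    if not first_only:
        return matches                  # parentheses: all matches
    if matches:
        return matches[:1]              # brackets: the first match
    return items[:1]                    # or the first occurrence if none matches
```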


*       literals

Each tag, child or subfield selector may be followed by one or more literals.
Every literal but the / extends to the next occurrence of the same
special character by which it is introduced.
This special character may be escaped using a backslash.
A literal backslash may be escaped as two (but need not, except at the end).

The special character governs when and where the literal is output:
-       " before the first occurrence
        (of the entity in question; i.e. field, child or subfield)
-       ' before each
-       ` after each
-       | in between (after each but the last)
-       + after the last
-       / this single-character literal starts a new output field after each occurrence
-       ~ this literal is used if the given entity does NOT occur

Literals are not subject to the string modifiers.
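
The placement rules (except for /) can be summarized by the following
Python sketch. The keyword names stand for the literals introduced by
", ', `, |, + and ~ and are chosen for this example only; the relative
order of the "after each" and "in between" literals is an assumption.

```python
def attach_literals(values, before_first="", before_each="", after_each="",
                    between="", after_last="", if_absent=""):
    """Join the selected values with their literals attached."""
    if not values:
        return if_absent                       # ~ literal: entity does not occur
    out = [before_first]                       # " literal
    for n, value in enumerate(values):
        out.append(before_each)                # ' literal
        out.append(value)
        out.append(after_each)                 # ` literal
        if n < len(values) - 1:
            out.append(between)                # | literal
    out.append(after_last)                     # + literal
    return "".join(out)
```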


*       substring selector

Introduced by a colon ':', it has the form :l or :o.l, where o and
l are integers denoting an offset and length to cut from the currently
selected value.
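
For illustration, a Python sketch of applying such a selector. It
assumes the offset counts from 0, which is not stated here, and the
function name is hypothetical.

```python
def apply_substring(value, selector):
    """Apply a substring selector of the form ':l' or ':o.l'."""
    body = selector[1:]                        # drop the leading colon
    if "." in body:
        o, l = body.split(".", 1)
        offset, length = int(o), int(l)
    else:
        offset, length = 0, int(body)
    return value[offset:offset + length]       # assumption: zero-based offset
```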


*       extension exits

An exit is a C-function (i.e., using C calling convention) in a dynamic library.
TODO: describe interface.


*       script evaluation

If a scripting environment like Tcl is available,
a {} block may contain a script to be evaluated.
TODO: describe interface.


---
        $Id: Views.txt,v 1.3 2003/06/02 07:49:08 kripke Exp $