1 
Malete query expressions: 
2 

3 
Like CDS/ISIS, Malete supports both index based searching and 
4 
record based filtering. Unlike in CDS/ISIS, the basic filtering syntax is not 
5 
based on the formatting language, but a simple extension of the search syntax. 
6 

7 
However, a more general record manipulation language, the "patchwork", 
8 
is planned, which will also act as a filter based on complex conditions. 
9 

10 

11 
In a query expression, everything up to a '?' is a search expression, 
12 
the remaining part defines a filter. 
13 

14 

15 
In general, an expression defines a result set, consisting of pointers 
16 
to occurrences of keys in records. For searching, these pointers are 
17 
fetched from the "inverted file", the .mqd BTree. On filtering, 
18 
the expression creates pointers as needed from the current record 
19 
and matches the record, if the result set is not empty. 
20 

21 

22 
An expression consists of 
23 
 simple expressions creating result sets 
24 
 operators modifying result sets 
25 
 parentheses to control evaluation order 
26 

27 

28 
* pointers 
29 

30 
A pointer (to an occurrence of a key in a record) 
31 
is a 4tuple (a point in a 4 dimensional space) (r,t,o,p) of 
32 
 1 the record's id 
33 
 2 the tag of the field where the key was found, 
34 
so Smith in an author field can be distinguished from Smith in title 
35 
 3 the "occurrence" (repeat count) of field, 
36 
so one can demand Mark and Twain in the same author field, 
37 
instead of Mark and Twain in two author fields 
38 
 4 the position (nth word) within the field 
39 

40 

41 
* simple expressions 
42 

43 
A set of results can be created by 
44 
 [rel][whitespace]term 
45 
a relation to a term, producing all pointers to keys that have the given 
46 
relation (as listed below) to term. Relation defaults to equality =. 
47 
term$ is a legacy form of %term. 
48 
Term is a series of "word characters", which are ASCII letters and digits, 
49 
the underscore and all nonASCII characters (code greater than 127), 
50 
or any characters in double quotes ". 
51 
Between the quotes, two quotes evaluate to a single quote. 
52 
 #n 
53 
a reference to the nth query. 
54 
Inserts the search or filter expression, as applicable, of the nth query. 
55 
(Might not be fully implemented). 
56 
In a search expression, an existing search result set might be used. 
57 
As a special case, if the whole query expression is only #n, 
58 
the nth query is used as is, including search, filter and cursor. 
59 

60 
Relations are: 
61 
$ 
62 
= equality (not needed) 
63 
% prefix match 
64 
> greater than 
65 
>= greater than or equal 
66 
< less than (not yet implemented, behaves like '>'. For range queries 
67 
use '' rather than '<' '*' '>' ) 
68 
<= less than or equal (not yet implemented, behaves like '>=') 
69 
: contains (filtering only) 
70 
~ regular expression match (optional, filtering only) 
71 
$ 
72 
The regexp match will not be available in all environments 
73 
(since it requires a regexp implementation, probably the one from Tcl). 
74 
Yet, it's just too tempting ... 
75 

76 

77 
The less/greater than (or equal) operators are not used alone, 
78 
but in "key identity" expressions to define ranges. 
79 

80 

81 
* key identity 
82 

83 
The semantics of pointers is to denote a word position in an occurrence 
84 
of a tag in a record. We assume that a pointer is only found at the one 
85 
key being (some transformation of) the word at that position, 
86 
i.e. pointers are unique. 
87 

88 

89 
Therefore, matching arbitrary result sets on pointer identity is of little 
90 
use. However, when applied to terms, we can use a pointer identity condition 
91 
as a key identity condition, effectively limiting the range of keys searched: 
92 
  key identity 
93 
is a pseudo operator, syntactically operating on result sets, 
94 
but actually implemented as joining and modifying term relations. 
95 

96 
Both operands of  must be terms, it refuses to operate on other expressions. 
97 
If the left hand term has the = relation, it is modified to >=. 
98 
If the right hand has the = relation, it is modified to <. 
99 
The expression 'A  B' will search all keys greater or equal "A" 
100 
and less than "B". To include "B", use 'A  <=B'. 
101 
The prefix '%ABC' is actually just a shorthand for 'ABC  ABD'. 
102 
If multiple upper or lower bounds, including those from a prefix, 
103 
are found in the operands, the lowest lower and highest upper ones apply. 
104 

105 

106 
Currently, only the default case greater or equal to A and less than B 
107 
is implemented. 
108 

109 

110 
* tag filter 
111 

112 
The tag filter selects pointers with certain tags by restricting 
113 
the second dimension to one or more tag values: 
114 
 expr/tag or expr/(tag[,tag...]) 
115 
where tags are literal numbers. In other words, it intersects the left hand 
116 
result set with the set {(r,t,o,p)t element of taglist} 
117 

118 
As a special case, a tag filter with an empty left hand expression can be used 
119 
at the very start of (the outermost level of) an expression: 
120 
 in filtering to select the fields for final output. 
121 
This is a proper part of filtering, 
122 
since a record with no matching fields is skipped. 
123 
 in searching to force a given ordering. 
124 
This is not a proper part of searching, but rather defines a cursor 
125 
for record retrieval, which then uses the search result set as filter. 
126 
We propably should define an extended syntax for ordering, 
127 
like we have for filtering, to allow key range conditions here. 
128 
Since this does a full index scan while retrieving records, 
129 
results are not made unique, and ordering will be most useful only 
130 
with a single tag (or some alternative tags) of a nonrepeatable field. 
131 
Currently not implemented. 
132 

133 
The tag filter, like the pointer identity, is another pseudo operator, 
134 
actually modifying its subexpressions rather than their result sets. 
135 
See below for details. 
136 

137 

138 
* binary operators 
139 

140 
Additive operator: 
141 
 + is the logical OR 
142 
calculating the union of result sets 
143 

144 
All other operators reduce their left hand result set by selecting elements 
145 
based on certain relations they have to those of the right hand set. 
146 
In other words they are intersections of the left hand with a certain 
147 
set defined (but not actually created) based on the right hand set. 
148 

149 

150 
Multiplicative operators select pointers from their left hand set 
151 
based upon their equality to pointers in the right hand set 
152 
in the first n dimensions: 
153 
 * ("AND") is 1dimensional equality 
154 
selects pointers with the same record id, i.e. those pointer from the 
155 
left hand set for which there exists a pointer with the same record id 
156 
in the right hand set 
157 
 ; or (G) is 2dimensional equality 
158 
selecting on same (record and) tag 
159 
 , or (F) is 3dimensional equality 
160 
selecting on same (record and tag and) occurrence 
161 
Mathematically, these operators create an intersection of the left hand 
162 
with a projection of the right hand to the first n dimensions 
163 
cross product with the full space of the remaining dimensions. 
164 
E.g. A;B := A intersect {(r,t,x,y)exist o,p with (r,t,o,p) in B}, 
165 
which we can denote as the (G) set of B. 
166 

167 
The full 4dimensional equality operator, pointer identity, 
168 
is a pseudo operator as described above. 
169 

170 

171 
The difference operator 
172 
 ^ ("NOT" or "WITHOUT") 
173 
selects based on 1dimensional inequality, i.e. those pointer from the 
174 
left hand set for which there exists NO pointer with the same record id 
175 
in the right hand set. 
176 
A^B := A intersect {(r,x,y,z)not exist t,o,p with (r,t,o,p) in B} 
177 
(the NOT set of B). 
178 
Note that the literal words OR, AND and NOT are not operators, but terms. 
179 

180 

181 
The distance operators are based on a proximity relation in the 4th dimension, 
182 
requiring 3dimensional equality (same record, tag and occurrence): 
183 
 .[...] or (n) max distance 
184 
A sequence of n dots or a number in parentheses selects pointers from the 
185 
left hand, for which there is a pointer in the right hand where the word 
186 
position p differs by at most n. (0) is pointer identity. 
187 
A(n)B := A intersect 
188 
{(r,t,o,x)exists p with abs(px) <=n and (r,t,o,p) in B} 
189 
(the dist set of B). 
190 
 $$[$$] exact distance: 
191 
likewise with exact position difference n. 
192 
(A single $ could be confused with prefix match, but is the same as a .). 
193 

194 
Due to the problems of word counting during filtering, 
195 
the distance operators may silently behave like 3dimensional equality. 
196 

197 

198 
* operator precedence 
199 

200 
Adjacent terms with no operator in between are combined by *. 
201 
The operator precedence can by overriden by enclosing any expression 
202 
in parentheses (). 
203 

204 

205 
The operators in decreasing precedence: 
206 
 pointer/key identity  
207 
 positional distance . and $$ 
208 
 field level match , and ; ((F) and (G)) 
209 
 filter / 
210 
 record level match * and ^ 
211 
 additive + 
212 
On the same precedence level, 
213 
the distance operators .. and $$ associate right ('A.B.C' is 'A.(B.C)'), 
214 
all others associate left ('A B C' is '(A*B)*C'). 
215 

216 
The tag filter / has highest precedence to its right hand, 
217 
since the tag list is not subject to any operator evaluation. 
218 

219 

220 
* detailed semantics of searching 
221 

222 
First of all, tag filters are *defined* to be applied fully distributive. 
223 
For operators of higher precedence, they are logically distributive: 
224 
'(A (G) B)/101' yields the same as '(A/101) (G) (B/101)', 
225 
hence their lower precedence. For '(A ^ B)/100', 
226 
it looks like we should first select all results for A, 
227 
strip those where B is (anywhere) in the same record, 
228 
and then strip all in fields other than 100. 
229 
This, however, would be better written as 'A/100 ^ B': 
230 
there is no point in delaying a tag filter. 
231 
The only reasonable use of applying a tag filter to a lower precedence 
232 
operation (which requires use of parentheses anyway) is to distribute it. 
233 
So, '(A ^ B)/100' is *defined* as 'A/100 ^ B/100'. 
234 

235 
Second, we define tag filters to not nest, but override, 
236 
so that only the innermost applies. E.g. in '(A/101 B C)/102', 
237 
only B and C have to be found in field 102, while A is checked in 101. 
238 
Again, this is because there would be no point in intersecting nested 
239 
tag lists: it's shorter to write what should be in effect in the first place. 
240 

241 

242 
To summarize, 
243 
 tag filters are always distributed down to the term level, 
244 
where they are applied most efficiently, and 
245 
 there is always at most one tag filter in effect. 
246 
In the further reasoning, tag filters are not considered. 
247 

248 

249 
The final result set of a search expression is 1dimensional 
250 
(record id only), in other words Malete does not maintain 
251 
> http://lcweb.loc.gov/z3950/agency/markup/23.html Extended Result Sets 
252 
, but discards the other dimensions after evaluating a search. 
253 
If higher dimensional operators (including tag filter) are applied to #n 
254 
references, the original search has to be recomputed. 
255 
(Which might not be supported). 
256 

257 

258 
The reason for this is: 
259 
 it reduces space requirements between queries 
260 
 delivering records from search results is easier, 
261 
especially support for free cursor movement 
262 
 commutative properties allow for a couple of optimizations 
263 

264 
Intermediate results during query evaluation may have to contain higher 
265 
dimensional data, depending on the operations which will be applied to them: 
266 
 1 record id only for OR, AND, NOT 
267 
 2 plus tag for ; (G) 
268 
 3 plus occurrence for , (F) 
269 
 4 plus position in field for .. and $$ 
270 
We say that an expression is evaluated in a ndimensional context, 
271 
if n is the highest dimension of an operator to be applied to its result set. 
272 
(The OR here is considered neutral in that it keeps data in all dimensions, 
273 
but does not require it). This is an important property, 
274 
because it allows to ignore what happens to other dimensions. 
275 

276 

277 
Not all operators are associative or commutative, 
278 
in general, the order of evaluation and operands matter. 
279 
For example, should 'A.B.C' select a C adjacent to A or to B 
280 
or to any of these? By the right associative property, 'A.B.C' is 'A.(B.C)', 
281 
which by definition first selects those pointers to (occurrences of) B, 
282 
which are adjacent to C, then pointers to A adjacent to one of those Bs, 
283 
which is probably what you expected. 
284 
It is important for 'B.C' to create pointers to B, not to C, 
285 
so we can not use 'A.(C.B)' nor '(A.B).C', 
286 
both of which would select As adjacent to C. 
287 

288 

289 
* properties of operators 
290 

291 
Yet, there are some useful properties of operators. 
292 

293 

294 
The multiplicative operators and the difference are based on equality relations, 
295 
which are reflexive (p=p), symmetric (if p=q then q=p) and transitive 
296 
(if p=q and q=r, then p=r). 
297 

298 
The proximity relation is symmetric, but not transitive: 
299 
If A is adjacent to B, and B adjacent to C, A is usually not adjacent to C 
300 
(actually it can only be adjacent to another C like in a field "C A B C"). 
301 
Exact distance is not even reflexive. 
302 

303 

304 
Therefore: 
305 
 intersection operators are (left and right) distributive over union: 
306 
'(A+B) op C' = '(A op C)+(B op C)' and 'A op (B+C)' = '(A op B)+(A op C)' 
307 
for all but the NOT, which reads 'A^(B+C)' = 'A^B^C'. 
308 
 multiplicative operators and NOT are (self) associative: 
309 
'(A op B) op C' = 'A op (B op C)', where op is one of * , ; or ^. 
310 
This also holds where the operators are not the same, 
311 
but the first is not NOT and the second has no higher dimension. 
312 
 multiplicative operators are relative commutative in their dimension: 
313 
'A op B' = 'B op A' if evaluated in no higher dimensional context 
314 
(since the results are based on a ndimensional equality, 
315 
they are equal in the first n dimensions). 
316 
 for all intersection operators o,p we also have a property similar to 
317 
associativity: '(A o B) p C' = '(A p C) o B', since both expressions 
318 
intersect A with both the oset of B and the pset of C. 
319 
In a commutative context, this also equals 'B o (A p C)', 
320 
which can be used to get an existing left hand result set B 
321 
inside a right hand subexpression. 
322 
 distance is commutative in lower dimensional contexts: 
323 
e.g. '(A.B),C' = '(B.A),C', since it requires 3dimensional equality, 
324 
and, with the above, also '(A.B).C' = 'B.(A.C)'. 
325 

326 
Note that the context dimension depends not only on the direct parent operator, 
327 
but the highest dimension of any ancestor in the parse tree: 
328 
While '(A.B),C' = '(B.A),C' as final expressions have the same 
329 
(1dimensional) results, '((A.B),C)..D' != '((B.A),C)..D', 
330 
since D's distance will be tested against A or B, resp. 
331 

332 

333 
* processing search expressions 
334 

335 
Some notes on how a search expression is evaluated might be helpful 
336 
in order to write efficient queries. 
337 

338 

339 
Basic naive query processing proceeds by 
340 
 recursively calculating the result sets of subexpressions 
341 
 combining these result sets according to the operator 
342 

343 
Techniques used in combining result sets are 
344 
 merging 
345 
This is used by the OR to create the union of sets. 
346 
Duplicates with regard to the required dimensions are eliminated. 
347 
 marking 
348 
First calculate the left hand result set. Iterate over the right hand set, 
349 
look up and mark elements having the relation in the left hand set. 
350 
Finally, eliminate all unmarked (or marked, for NOT) elements. 
351 
This can be used to implement all relation based operators. 
352 
However, it is most useful where the right hand expression can loop 
353 
its elements without actually creating the set (direct marking). 
354 
 filtering 
355 
First calculate the right hand result set. Then, while generating 
356 
elements for the left hand, look up the relation in the right hand 
357 
set and store left hand elements only as appropriate. 
358 
Note that a filter applies only while creating sets, 
359 
not during marking or merging. 
360 
All techniques require the result sets to be ordered according to the 
361 
dimensions, i.e. by record id then tag then occ and pos, to run efficiently. 
362 

363 

364 
The actual implementation tries to avoid the naive approach. 
365 
Instead of calculating subexpression results independently, 
366 
the results of previous calculations and the mode in which they 
367 
finally will be combined are used where possible. 
368 
 tag filters are always applied at the lowest (term) level, 
369 
thus do not require an additional filtering step 
370 
 the dimensional context is used to purge unnecessary duplicates and, 
371 
optionally, can be used to reorder expressions according to their properties. 
372 

373 

374 
Creating result sets: 
375 
 for a ref, in an onedimensional search expression, we have the results set. 
376 
Else, it can be created by processing the query independently or inlining 
377 
its text. Anyway, in most cases, especially where we create under a filter, 
378 
we need to create a copy. 
379 
 for an exact match term, the result set can be directly fetched 
380 
from the index. Other term relations are an OR over all keys they match. 
381 
 the result set of an OR is created by creating both parts recursively 
382 
and merging them. Any filter is applied to all subexpressions. 
383 
 for all other expressions, if not filtering and the right hand does 
384 
not support direct marking, create it and use as left hand filter. 
385 
Else create the left hand (using any filter), evaluate the right hand 
386 
using direct marking, if possible, else create it and mark. 
387 

388 
Direct marking in subexpressions: 
389 
 all simple expressions can efficiently loop their elements 
390 
and thus support direct marking. 
391 
 OR and NOT support direct marking, if the left hand supports it 
392 
and the right hand is a simple expression 
393 

394 
Notes 
395 
 OR supports marking also if the right hand is a plain OR. 
396 
However, this case (A+(B+C)) should be expressed as ((A+B)+C). 
397 
 marking can be extended to AND and OR, NOT with complex subexpressions 
398 
using multiple mark values. (G) and (F) can support marking like AND when 
399 
evaluated in their dimension; the only nontrivial case of this is when 
400 
embedded in an OR. This extended marking is an optional feature. 
401 
 in terms of buffers, filtering is most efficient in an intersection, 
402 
because the filter set can be released early. 
403 
It is most efficient, where the right hand support direct marking. 
404 
In associative situations, it may be possible to rewrite '(A op B) op C' 
405 
as 'A op (B op C)', so C's result can be released early during the 
406 
evaluation of B and likewise B's result in A. 
407 

408 

409 
Reordering expressions is by far the most important technique, 
410 
and the best thing about it is that you can do it. 
411 
To be most efficient, the parse tree must be as unbalanced as possible: 
412 
In the best case, it is fully left associative like 
413 
'((((A op B) op C) op D) op E) op F' 
414 
(with AF simple expressions supporting direct marking). 
415 
That way we can always directly operate at the outermost level on 
416 
the result set of the previous expression. 
417 

418 

419 
On the default optimization level 0, no automatic expressions rewriting is done. 
420 
Higher optimization levels will be specified, e.g. we would expect level 1 
421 
to recognize a common case 'A.B.C' in a lower dimensional context, 
422 
and rewrite the proper expression 'A.(B.C)' to '(B.C).A', 
423 
so we can reuse B's details once using marking. 
424 

425 
Full optimization should also use all commutattive properties 
426 
to choose subexpressions for marking and filtering. 
427 

428 

429 
* processing filter expressions 
430 

431 
For filter expressions, the result sets are created from the current record 
432 
and thus are assumed to be rather small, so that buffer space is not an issue. 
433 

434 
Instead, short cut evaluation can be assumed to be much more important, 
435 
since for an interesting filter expression we expect it to not match 
436 
in most cases. All multiplicative operators should evaluate a term before 
437 
a complex subexpression and stop processing if it evaluates to the empty set. 
438 

439 

440 
Probably the most important optimization here would be to detect terms with 
441 
exact, prefix or contains relation which are required by multiplicative 
442 
operators and to do a quick check for their occurence as soon as possible. 
443 

444 
Especially where such a filter is applied to all records of a db, 
445 
it should be applied while streaming the record file, 
446 
hopefully approaching the performance of grep. 
447 

448 

449 
* limits 
450 

451 
Several limits apply to query and especially search evaluation. 
452 
Some are currently hardcoded, but may become configuration options 
453 
in future releases. 
454 

455 
 the number of subexpressions (terminals and operators) in any 
456 
expression is limited to 500. 
457 
 the nesting depth during parsing is limited to 50. 
458 
(This is usually lower then the depth of the final parse tree). 
459 
 the number of open queries is limited by session option q (default 20). 
460 
 the size (number of records) of a search result set is limited 
461 
by session option s (default 10.000). 
462 

463 

464 
 
465 
$Id: Query.txt,v 1.19 2004/07/16 12:42:39 istr Exp $ 
466 

467 
For all those mathematical terms, 
468 
> http://mathforum.org/dr.math/faq/faq.property.glossary.html ask Dr. Math 