This is repository of my old source code which isn't updated any more. Go to git.rot13.org for current projects!
ViewVC logotype

Contents of /openisis/0.9.9e/doc/Query.txt

Parent Directory Parent Directory | Revision Log Revision Log

Revision 604 - (show annotations)
Mon Dec 27 21:49:01 2004 UTC (19 years, 6 months ago) by dpavlin
File MIME type: text/plain
File size: 19950 byte(s)
import of new openisis release, 0.9.9e

1 Malete query expressions:
3 Like CDS/ISIS, Malete supports both index based searching and
4 record based filtering. Unlike in CDS/ISIS, the basic filtering syntax is not
5 based on the formatting language, but a simple extension of the search syntax.
7 However, a more general record manipulation language, the "patchwork",
8 is planned, which will also act as a filter based on complex conditions.
11 In a query expression, everything up to a '?' is a search expression,
12 the remaining part defines a filter.
15 In general, an expression defines a result set, consisting of pointers
16 to occurrences of keys in records. For searching, these pointers are
17 fetched from the "inverted file", the .mqd B-Tree. On filtering,
18 the expression creates pointers as needed from the current record
19 and matches the record, if the result set is not empty.
22 An expression consists of
23 - simple expressions creating result sets
24 - operators modifying result sets
25 - parentheses to control evaluation order
28 * pointers
30 A pointer (to an occurrence of a key in a record)
31 is a 4-tuple (a point in a 4 dimensional space) (r,t,o,p) of
32 - 1 the record's id
33 - 2 the tag of the field where the key was found,
34 so Smith in an author field can be distinguished from Smith in title
35 - 3 the "occurrence" (repeat count) of field,
36 so one can demand Mark and Twain in the same author field,
37 instead of Mark and Twain in two author fields
38 - 4 the position (nth word) within the field
41 * simple expressions
43 A set of results can be created by
44 - [rel][whitespace]term
45 a relation to a term, producing all pointers to keys that have the given
46 relation (as listed below) to term. Relation defaults to equality =.
47 term$ is a legacy form of %term.
48 Term is a series of "word characters", which are ASCII letters and digits,
49 the underscore and all non-ASCII characters (code greater than 127),
50 or any characters in double quotes ".
51 Between the quotes, two quotes evaluate to a single quote.
52 - #n
53 a reference to the nth query.
54 Inserts the search or filter expression, as applicable, of the nth query.
55 (Might not be fully implemented).
56 In a search expression, an existing search result set might be used.
57 As a special case, if the whole query expression is only #n,
58 the nth query is used as is, including search, filter and cursor.
60 Relations are:
61 $
62 = equality (not needed)
63 % prefix match
64 > greater than
65 >= greater than or equal
66 < less than (not yet implemented, behaves like '>'. For range queries
67 use '-' rather than '<' '*' '>' )
68 <= less than or equal (not yet implemented, behaves like '>=')
69 : contains (filtering only)
70 ~ regular expression match (optional, filtering only)
71 $
72 The regexp match will not be available in all environments
73 (since it requires a regexp implementation, probably the one from Tcl).
74 Yet, it's just too tempting ...
77 The less/greater than (or equal) operators are not used alone,
78 but in "key identity" expressions to define ranges.
81 * key identity
83 The semantics of pointers is to denote a word position in an occurrence
84 of a tag in a record. We assume that a pointer is only found at the one
85 key being (some transformation of) the word at that position,
86 i.e. pointers are unique.
89 Therefore, matching arbitrary result sets on pointer identity is of little
90 use. However, when applied to terms, we can use a pointer identity condition
91 as a key identity condition, effectively limiting the range of keys searched:
92 - - key identity
93 is a pseudo operator, syntactically operating on result sets,
94 but actually implemented as joining and modifying term relations.
96 Both operands of - must be terms, it refuses to operate on other expressions.
97 If the left hand term has the = relation, it is modified to &gt;=.
98 If the right hand has the = relation, it is modified to &lt;.
99 The expression 'A - B' will search all keys greater or equal "A"
100 and less than "B". To include "B", use 'A - &lt;=B'.
101 The prefix '%ABC' is actually just a shorthand for 'ABC - ABD'.
102 If multiple upper or lower bounds, including those from a prefix,
103 are found in the operands, the lowest lower and highest upper ones apply.
106 Currently, only the default case greater or equal to A and less than B
107 is implemented.
110 * tag filter
112 The tag filter selects pointers with certain tags by restricting
113 the second dimension to one or more tag values:
114 - expr/tag or expr/(tag[,tag...])
115 where tags are literal numbers. In other words, it intersects the left hand
116 result set with the set {(r,t,o,p)|t element of taglist}
118 As a special case, a tag filter with an empty left hand expression can be used
119 at the very start of (the outermost level of) an expression:
120 - in filtering to select the fields for final output.
121 This is a proper part of filtering,
122 since a record with no matching fields is skipped.
123 - in searching to force a given ordering.
124 This is not a proper part of searching, but rather defines a cursor
125 for record retrieval, which then uses the search result set as filter.
126 We propably should define an extended syntax for ordering,
127 like we have for filtering, to allow key range conditions here.
128 Since this does a full index scan while retrieving records,
129 results are not made unique, and ordering will be most useful only
130 with a single tag (or some alternative tags) of a non-repeatable field.
131 Currently not implemented.
133 The tag filter, like the pointer identity, is another pseudo operator,
134 actually modifying its subexpressions rather than their result sets.
135 See below for details.
138 * binary operators
140 Additive operator:
141 - + is the logical OR
142 calculating the union of result sets
144 All other operators reduce their left hand result set by selecting elements
145 based on certain relations they have to those of the right hand set.
146 In other words they are intersections of the left hand with a certain
147 set defined (but not actually created) based on the right hand set.
150 Multiplicative operators select pointers from their left hand set
151 based upon their equality to pointers in the right hand set
152 in the first n dimensions:
153 - * ("AND") is 1-dimensional equality
154 selects pointers with the same record id, i.e. those pointer from the
155 left hand set for which there exists a pointer with the same record id
156 in the right hand set
157 - ; or (G) is 2-dimensional equality
158 selecting on same (record and) tag
159 - , or (F) is 3-dimensional equality
160 selecting on same (record and tag and) occurrence
161 Mathematically, these operators create an intersection of the left hand
162 with a projection of the right hand to the first n dimensions
163 cross product with the full space of the remaining dimensions.
164 E.g. A;B := A intersect {(r,t,x,y)|exist o,p with (r,t,o,p) in B},
165 which we can denote as the (G) set of B.
167 The full 4-dimensional equality operator, pointer identity,
168 is a pseudo operator as described above.
171 The difference operator
172 - ^ ("NOT" or "WITHOUT")
173 selects based on 1-dimensional inequality, i.e. those pointer from the
174 left hand set for which there exists NO pointer with the same record id
175 in the right hand set.
176 A^B := A intersect {(r,x,y,z)|not exist t,o,p with (r,t,o,p) in B}
177 (the NOT set of B).
178 Note that the literal words OR, AND and NOT are not operators, but terms.
181 The distance operators are based on a proximity relation in the 4th dimension,
182 requiring 3-dimensional equality (same record, tag and occurrence):
183 - .[...] or (n) max distance
184 A sequence of n dots or a number in parentheses selects pointers from the
185 left hand, for which there is a pointer in the right hand where the word
186 position p differs by at most n. (0) is pointer identity.
187 A(n)B := A intersect
188 {(r,t,o,x)|exists p with abs(p-x) &lt;=n and (r,t,o,p) in B}
189 (the dist set of B).
190 - $$[$$] exact distance:
191 likewise with exact position difference n.
192 (A single $ could be confused with prefix match, but is the same as a .).
194 Due to the problems of word counting during filtering,
195 the distance operators may silently behave like 3-dimensional equality.
198 * operator precedence
200 Adjacent terms with no operator in between are combined by *.
201 The operator precedence can by overriden by enclosing any expression
202 in parentheses ().
205 The operators in decreasing precedence:
206 - pointer/key identity -
207 - positional distance . and $$
208 - field level match , and ; ((F) and (G))
209 - filter /
210 - record level match * and ^
211 - additive +
212 On the same precedence level,
213 the distance operators .. and $$ associate right ('A.B.C' is 'A.(B.C)'),
214 all others associate left ('A B C' is '(A*B)*C').
216 The tag filter / has highest precedence to its right hand,
217 since the tag list is not subject to any operator evaluation.
220 * detailed semantics of searching
222 First of all, tag filters are *defined* to be applied fully distributive.
223 For operators of higher precedence, they are logically distributive:
224 '(A (G) B)/101' yields the same as '(A/101) (G) (B/101)',
225 hence their lower precedence. For '(A ^ B)/100',
226 it looks like we should first select all results for A,
227 strip those where B is (anywhere) in the same record,
228 and then strip all in fields other than 100.
229 This, however, would be better written as 'A/100 ^ B':
230 there is no point in delaying a tag filter.
231 The only reasonable use of applying a tag filter to a lower precedence
232 operation (which requires use of parentheses anyway) is to distribute it.
233 So, '(A ^ B)/100' is *defined* as 'A/100 ^ B/100'.
235 Second, we define tag filters to not nest, but override,
236 so that only the innermost applies. E.g. in '(A/101 B C)/102',
237 only B and C have to be found in field 102, while A is checked in 101.
238 Again, this is because there would be no point in intersecting nested
239 tag lists: it's shorter to write what should be in effect in the first place.
242 To summarize,
243 - tag filters are always distributed down to the term level,
244 where they are applied most efficiently, and
245 - there is always at most one tag filter in effect.
246 In the further reasoning, tag filters are not considered.
249 The final result set of a search expression is 1-dimensional
250 (record id only), in other words Malete does not maintain
251 > http://lcweb.loc.gov/z3950/agency/markup/23.html Extended Result Sets
252 , but discards the other dimensions after evaluating a search.
253 If higher dimensional operators (including tag filter) are applied to #n
254 references, the original search has to be recomputed.
255 (Which might not be supported).
258 The reason for this is:
259 - it reduces space requirements between queries
260 - delivering records from search results is easier,
261 especially support for free cursor movement
262 - commutative properties allow for a couple of optimizations
264 Intermediate results during query evaluation may have to contain higher
265 dimensional data, depending on the operations which will be applied to them:
266 - 1 record id only for OR, AND, NOT
267 - 2 plus tag for ; (G)
268 - 3 plus occurrence for , (F)
269 - 4 plus position in field for .. and $$
270 We say that an expression is evaluated in a n-dimensional context,
271 if n is the highest dimension of an operator to be applied to its result set.
272 (The OR here is considered neutral in that it keeps data in all dimensions,
273 but does not require it). This is an important property,
274 because it allows to ignore what happens to other dimensions.
277 Not all operators are associative or commutative,
278 in general, the order of evaluation and operands matter.
279 For example, should 'A.B.C' select a C adjacent to A or to B
280 or to any of these? By the right associative property, 'A.B.C' is 'A.(B.C)',
281 which by definition first selects those pointers to (occurrences of) B,
282 which are adjacent to C, then pointers to A adjacent to one of those Bs,
283 which is probably what you expected.
284 It is important for 'B.C' to create pointers to B, not to C,
285 so we can not use 'A.(C.B)' nor '(A.B).C',
286 both of which would select As adjacent to C.
289 * properties of operators
291 Yet, there are some useful properties of operators.
294 The multiplicative operators and the difference are based on equality relations,
295 which are reflexive (p=p), symmetric (if p=q then q=p) and transitive
296 (if p=q and q=r, then p=r).
298 The proximity relation is symmetric, but not transitive:
299 If A is adjacent to B, and B adjacent to C, A is usually not adjacent to C
300 (actually it can only be adjacent to another C like in a field "C A B C").
301 Exact distance is not even reflexive.
304 Therefore:
305 - intersection operators are (left and right) distributive over union:
306 '(A+B) op C' = '(A op C)+(B op C)' and 'A op (B+C)' = '(A op B)+(A op C)'
307 for all but the NOT, which reads 'A^(B+C)' = 'A^B^C'.
308 - multiplicative operators and NOT are (self) associative:
309 '(A op B) op C' = 'A op (B op C)', where op is one of * , ; or ^.
310 This also holds where the operators are not the same,
311 but the first is not NOT and the second has no higher dimension.
312 - multiplicative operators are relative commutative in their dimension:
313 'A op B' = 'B op A' if evaluated in no higher dimensional context
314 (since the results are based on a n-dimensional equality,
315 they are equal in the first n dimensions).
316 - for all intersection operators o,p we also have a property similar to
317 associativity: '(A o B) p C' = '(A p C) o B', since both expressions
318 intersect A with both the o-set of B and the p-set of C.
319 In a commutative context, this also equals 'B o (A p C)',
320 which can be used to get an existing left hand result set B
321 inside a right hand subexpression.
322 - distance is commutative in lower dimensional contexts:
323 e.g. '(A.B),C' = '(B.A),C', since it requires 3-dimensional equality,
324 and, with the above, also '(A.B).C' = 'B.(A.C)'.
326 Note that the context dimension depends not only on the direct parent operator,
327 but the highest dimension of any ancestor in the parse tree:
328 While '(A.B),C' = '(B.A),C' as final expressions have the same
329 (1-dimensional) results, '((A.B),C)..D' != '((B.A),C)..D',
330 since D's distance will be tested against A or B, resp.
333 * processing search expressions
335 Some notes on how a search expression is evaluated might be helpful
336 in order to write efficient queries.
339 Basic naive query processing proceeds by
340 - recursively calculating the result sets of subexpressions
341 - combining these result sets according to the operator
343 Techniques used in combining result sets are
344 - merging
345 This is used by the OR to create the union of sets.
346 Duplicates with regard to the required dimensions are eliminated.
347 - marking
348 First calculate the left hand result set. Iterate over the right hand set,
349 look up and mark elements having the relation in the left hand set.
350 Finally, eliminate all unmarked (or marked, for NOT) elements.
351 This can be used to implement all relation based operators.
352 However, it is most useful where the right hand expression can loop
353 its elements without actually creating the set (direct marking).
354 - filtering
355 First calculate the right hand result set. Then, while generating
356 elements for the left hand, look up the relation in the right hand
357 set and store left hand elements only as appropriate.
358 Note that a filter applies only while creating sets,
359 not during marking or merging.
360 All techniques require the result sets to be ordered according to the
361 dimensions, i.e. by record id then tag then occ and pos, to run efficiently.
364 The actual implementation tries to avoid the naive approach.
365 Instead of calculating subexpression results independently,
366 the results of previous calculations and the mode in which they
367 finally will be combined are used where possible.
368 - tag filters are always applied at the lowest (term) level,
369 thus do not require an additional filtering step
370 - the dimensional context is used to purge unnecessary duplicates and,
371 optionally, can be used to reorder expressions according to their properties.
374 Creating result sets:
375 - for a ref, in an one-dimensional search expression, we have the results set.
376 Else, it can be created by processing the query independently or inlining
377 its text. Anyway, in most cases, especially where we create under a filter,
378 we need to create a copy.
379 - for an exact match term, the result set can be directly fetched
380 from the index. Other term relations are an OR over all keys they match.
381 - the result set of an OR is created by creating both parts recursively
382 and merging them. Any filter is applied to all subexpressions.
383 - for all other expressions, if not filtering and the right hand does
384 not support direct marking, create it and use as left hand filter.
385 Else create the left hand (using any filter), evaluate the right hand
386 using direct marking, if possible, else create it and mark.
388 Direct marking in subexpressions:
389 - all simple expressions can efficiently loop their elements
390 and thus support direct marking.
391 - OR and NOT support direct marking, if the left hand supports it
392 and the right hand is a simple expression
394 Notes
395 - OR supports marking also if the right hand is a plain OR.
396 However, this case (A+(B+C)) should be expressed as ((A+B)+C).
397 - marking can be extended to AND and OR, NOT with complex subexpressions
398 using multiple mark values. (G) and (F) can support marking like AND when
399 evaluated in their dimension; the only non-trivial case of this is when
400 embedded in an OR. This extended marking is an optional feature.
401 - in terms of buffers, filtering is most efficient in an intersection,
402 because the filter set can be released early.
403 It is most efficient, where the right hand support direct marking.
404 In associative situations, it may be possible to rewrite '(A op B) op C'
405 as 'A op (B op C)', so C's result can be released early during the
406 evaluation of B and likewise B's result in A.
409 Reordering expressions is by far the most important technique,
410 and the best thing about it is that you can do it.
411 To be most efficient, the parse tree must be as unbalanced as possible:
412 In the best case, it is fully left associative like
413 '((((A op B) op C) op D) op E) op F'
414 (with A-F simple expressions supporting direct marking).
415 That way we can always directly operate at the outermost level on
416 the result set of the previous expression.
419 On the default optimization level 0, no automatic expressions rewriting is done.
420 Higher optimization levels will be specified, e.g. we would expect level 1
421 to recognize a common case 'A.B.C' in a lower dimensional context,
422 and rewrite the proper expression 'A.(B.C)' to '(B.C).A',
423 so we can reuse B's details once using marking.
425 Full optimization should also use all commutattive properties
426 to choose subexpressions for marking and filtering.
429 * processing filter expressions
431 For filter expressions, the result sets are created from the current record
432 and thus are assumed to be rather small, so that buffer space is not an issue.
434 Instead, short cut evaluation can be assumed to be much more important,
435 since for an interesting filter expression we expect it to not match
436 in most cases. All multiplicative operators should evaluate a term before
437 a complex subexpression and stop processing if it evaluates to the empty set.
440 Probably the most important optimization here would be to detect terms with
441 exact, prefix or contains relation which are required by multiplicative
442 operators and to do a quick check for their occurence as soon as possible.
444 Especially where such a filter is applied to all records of a db,
445 it should be applied while streaming the record file,
446 hopefully approaching the performance of grep.
449 * limits
451 Several limits apply to query and especially search evaluation.
452 Some are currently hardcoded, but may become configuration options
453 in future releases.
455 - the number of subexpressions (terminals and operators) in any
456 expression is limited to 500.
457 - the nesting depth during parsing is limited to 50.
458 (This is usually lower then the depth of the final parse tree).
459 - the number of open queries is limited by session option q (default 20).
460 - the size (number of records) of a search result set is limited
461 by session option s (default 10.000).
464 ---
465 $Id: Query.txt,v 1.19 2004/07/16 12:42:39 istr Exp $
467 For all those mathematical terms,
468 > http://mathforum.org/dr.math/faq/faq.property.glossary.html ask Dr. Math

  ViewVC Help
Powered by ViewVC 1.1.26