openisis/doc/PatchWork.txt

principles of a patchworked database:
how OO-style inheritance works for data


*       references

It is quite common for records in a database to refer in some way to other
records. In fact, it seems to be such an interesting property,
that the main classification of databases is along the lines of how
they handle references:
-       in a hierarchical DB, every record has 1 or 0 parents,
        and any number of childs
-       in a networked DB, every record has a variable number of references
        (typically fix for a given type of record)
-       in a relational DB, every record has any number of references,
        which are determined by queries. References are constructed on the fly;
        if they are backed by some index and the query optimizer is able to
        figure that out, this works even for larger DBs.

While basically every model can be emulated on top of every other,
it is a matter of performance and convenience (i.e. performance counted
by hours of programming) how well a given model of data access supports 
a given pattern of data usage (that's why you should stop thinking that
RDBMS solve all problems).


The magic working behind references is always the same:
either a (more or less) physical key of the referred record
or some logical key is stored in the referring record,
the latter requiring translation by an index (typically B-Tree).
In general, the key might as well be a RDF URI,
resolved by a http or other server.


The "physical" key is known as "master file number" (MFN) in ISIS databases,
as "ROWID" or similar in most RDBMS. Using a ROWID is a gross violation
of the principles of relational databases (Codd's axioms),
and most RDBMS cannot retain a ROWID reference via export/import
(i.e. DBs using such references can neither be reliably exported nor backed up,
short to a DB shutdown and file backup, as stated by a major vendor).


Hierarchical and networked DBs, on the other hand, have much stronger support
for the fast physical key, but typically still allow for logical keys to
reach via an index on any record. They just don't sport an elaborate
and "standardized" query language like SQL giving convenient access
to what does and doesn't work. It's up to the application programmer,
rather than the query optimizer designer, to follow the paths of well
supported references.


ISIS databases are clearly in the latter (older) branch of the family of
databases, believing more in the application programmer's knowledge
than in artificial intelligence. Besides the conceptually simple and ultra-fast
MFN reference, it also has a very flexible indexing mechanism,
allowing for references to even be based on fulltext.


We want to present a view on data, that has some interesting properties
and (while it can be emulated on any DBMS) is particularily well supported
on ISIS DBs:


*       the patch relation

We propose a relation between records that transcedes a mere reference
(independent of whether the underlying reference is based on a physical
or logical key):
-       there is a patch operator
        that constructs a new record from referring and referred-to record
-       there is a diff operator
        that constructs a new record from referring and referred-to record,
        in such a way that the patch operator, applied to the latter and the result,
        will construct the referring record

Basically, a very small record can tell a story like
-       I am related to that record
-       I am very similar to that record, but
-       I have these and those fields added/changed/deleted


Everybody who is accustomed to tools like RCS or CVS (based on RCS)
will get the idea of such a relation:
depending on what is most commonly needed, either one or the other version
can be stored much more compactly as just the diff to the other one.
Or, if both are referred to oftenly, you can store them both at full
life size, but put only the differences in the index.


If the idea of applying diff and patch to database records seems strange,
have a look at
>       Serialized      a plain text representation for ISIS records.
This paper does include a means to store patch "scripts" even at database
master file level, thereby providing very efficient support for
cross-referring patch records.


However, the patch relation can as well very efficiently
be implemented on top of a standard ISIS database.
Some meta data entries (per field in the FDT) might be used to set
convenient defaults for interpretation of references and patches:
-       to which database an entry in field x referres to
-       index tag and/or prefix to be used on lookup
-       whether field occurences in the patch record are additive or overriding

In the absence of more specific instructions,
the default patch operation is used by reading the referring record
as a series of set statements against it's base:
every field (tag) which is present in the referring record
overrides all fields of the same tag in the base.
Should there be multiple bases defined, they are by default added
(i.e. concatenated). A multiplikative operation is modeled
by chaining inheritance.


In a way, every reference can be regarded as defining a patch relation:
If you resolve the reference in a table join, you are actually creating
a new virtual record, comprised of some fields of the referred and referring
records. However, in the much more flexible ISIS data model,
the creation of a new record is much more flexibel.


*       inheritance by patching

One way to use patching is to think of a patch relation as being an
inheritance relation: not by class, but by object!
A record can inherit data fields from one or several other records,
which still exists as independent entities.


When the inheritance is resolved, some data from the parents is copied
to the new virtual record, some is dropped or overwritten.
An inheritance relation does not only apply to logically subordinate child
records, it can also be used for versioning, with a successor inheriting
from it's predecessor, while both are still available as independent entities.


*       example: the serial killer

Management of serials is known as a notoriuosly hard problem.
So let's have a look at how we could make a data model using
record inheritance.


We start out simple, with
-       one record for the serial,
-       one for each issue, inheriting from the serial
-       one for each copy, inheriting from the issue

The issue adds a number, date, table of contents and so on to the serial.
The copy adds some id to the issue, and maybe a note that it's in a poor
state, since somebody poured coffee over it.


But then things can get a little bit more complicated:
-       if the serial changes it's name or other attributes,
        a successor may inherit from the original one,
        and following issues inherit from the successor
        (while earlier issues still belong to the original)
-       if two serials are joined, a successor may inherit from both
-       sometimes two otherwise independent serials are printed
        together, so the actual issues inherit from both
-       an issue might have a reprint,
        which inherits from the original issue,
        changing the date and adding a new preface
-       if somebody ripped an article out of a given copy,
        that copy might adjust it's issues table of contents
-       if copies of several issues are bound together,
        you actually have only one copy, inheriting from several issues


To summarize, a lot of complicated situations can be modelled in a quite
natural fashion when thinking in terms of inheritance through patch relations,
and these, in turn can be easily and efficiently be implemented based
on ISIS record and database structures.

If you think of modelling that in a relational database,
you will probably find that you end up with
>       RdbConv mimicking
the ISIS record.

---
        $Id: PatchWork.txt,v 1.2 2003/06/23 14:43:24 kripke Exp $
1	dpavlin	237	principles of a patchworked database:
2			how OO-style inheritance works for data
3
4
5			* references
6
7			It is quite common for records in a database to refer in some way to other
8			records. In fact, it seems to be such an interesting property,
9			that the main classification of databases is along the lines of how
10			they handle references:
11			- in a hierarchical DB, every record has 1 or 0 parents,
12			and any number of childs
13			- in a networked DB, every record has a variable number of references
14			(typically fix for a given type of record)
15			- in a relational DB, every record has any number of references,
16			which are determined by queries. References are constructed on the fly;
17			if they are backed by some index and the query optimizer is able to
18			figure that out, this works even for larger DBs.
19
20			While basically every model can be emulated on top of every other,
21			it is a matter of performance and convenience (i.e. performance counted
22			by hours of programming) how well a given model of data access supports
23			a given pattern of data usage (that's why you should stop thinking that
24			RDBMS solve all problems).
25
26
27			The magic working behind references is always the same:
28			either a (more or less) physical key of the referred record
29			or some logical key is stored in the referring record,
30			the latter requiring translation by an index (typically B-Tree).
31			In general, the key might as well be a RDF URI,
32			resolved by a http or other server.
33
34
35			The "physical" key is known as "master file number" (MFN) in ISIS databases,
36			as "ROWID" or similar in most RDBMS. Using a ROWID is a gross violation
37			of the principles of relational databases (Codd's axioms),
38			and most RDBMS cannot retain a ROWID reference via export/import
39			(i.e. DBs using such references can neither be reliably exported nor backed up,
40			short to a DB shutdown and file backup, as stated by a major vendor).
41
42
43			Hierarchical and networked DBs, on the other hand, have much stronger support
44			for the fast physical key, but typically still allow for logical keys to
45			reach via an index on any record. They just don't sport an elaborate
46			and "standardized" query language like SQL giving convenient access
47			to what does and doesn't work. It's up to the application programmer,
48			rather than the query optimizer designer, to follow the paths of well
49			supported references.
50
51
52			ISIS databases are clearly in the latter (older) branch of the family of
53			databases, believing more in the application programmer's knowledge
54			than in artificial intelligence. Besides the conceptually simple and ultra-fast
55			MFN reference, it also has a very flexible indexing mechanism,
56			allowing for references to even be based on fulltext.
57
58
59			We want to present a view on data, that has some interesting properties
60			and (while it can be emulated on any DBMS) is particularily well supported
61			on ISIS DBs:
62
63
64			* the patch relation
65
66			We propose a relation between records that transcedes a mere reference
67			(independent of whether the underlying reference is based on a physical
68			or logical key):
69			- there is a patch operator
70			that constructs a new record from referring and referred-to record
71			- there is a diff operator
72			that constructs a new record from referring and referred-to record,
73			in such a way that the patch operator, applied to the latter and the result,
74			will construct the referring record
75
76			Basically, a very small record can tell a story like
77			- I am related to that record
78			- I am very similar to that record, but
79			- I have these and those fields added/changed/deleted
80
81
82			Everybody who is accustomed to tools like RCS or CVS (based on RCS)
83			will get the idea of such a relation:
84			depending on what is most commonly needed, either one or the other version
85			can be stored much more compactly as just the diff to the other one.
86			Or, if both are referred to oftenly, you can store them both at full
87			life size, but put only the differences in the index.
88
89
90			If the idea of applying diff and patch to database records seems strange,
91			have a look at
92			> Serialized a plain text representation for ISIS records.
93			This paper does include a means to store patch "scripts" even at database
94			master file level, thereby providing very efficient support for
95			cross-referring patch records.
96
97
98			However, the patch relation can as well very efficiently
99			be implemented on top of a standard ISIS database.
100			Some meta data entries (per field in the FDT) might be used to set
101			convenient defaults for interpretation of references and patches:
102			- to which database an entry in field x referres to
103			- index tag and/or prefix to be used on lookup
104			- whether field occurences in the patch record are additive or overriding
105
106			In the absence of more specific instructions,
107			the default patch operation is used by reading the referring record
108			as a series of set statements against it's base:
109			every field (tag) which is present in the referring record
110			overrides all fields of the same tag in the base.
111			Should there be multiple bases defined, they are by default added
112			(i.e. concatenated). A multiplikative operation is modeled
113			by chaining inheritance.
114
115
116			In a way, every reference can be regarded as defining a patch relation:
117			If you resolve the reference in a table join, you are actually creating
118			a new virtual record, comprised of some fields of the referred and referring
119			records. However, in the much more flexible ISIS data model,
120			the creation of a new record is much more flexibel.
121
122
123			* inheritance by patching
124
125			One way to use patching is to think of a patch relation as being an
126			inheritance relation: not by class, but by object!
127			A record can inherit data fields from one or several other records,
128			which still exists as independent entities.
129
130
131			When the inheritance is resolved, some data from the parents is copied
132			to the new virtual record, some is dropped or overwritten.
133			An inheritance relation does not only apply to logically subordinate child
134			records, it can also be used for versioning, with a successor inheriting
135			from it's predecessor, while both are still available as independent entities.
136
137
138			* example: the serial killer
139
140			Management of serials is known as a notoriuosly hard problem.
141			So let's have a look at how we could make a data model using
142			record inheritance.
143
144
145			We start out simple, with
146			- one record for the serial,
147			- one for each issue, inheriting from the serial
148			- one for each copy, inheriting from the issue
149
150			The issue adds a number, date, table of contents and so on to the serial.
151			The copy adds some id to the issue, and maybe a note that it's in a poor
152			state, since somebody poured coffee over it.
153
154
155			But then things can get a little bit more complicated:
156			- if the serial changes it's name or other attributes,
157			a successor may inherit from the original one,
158			and following issues inherit from the successor
159			(while earlier issues still belong to the original)
160			- if two serials are joined, a successor may inherit from both
161			- sometimes two otherwise independent serials are printed
162			together, so the actual issues inherit from both
163			- an issue might have a reprint,
164			which inherits from the original issue,
165			changing the date and adding a new preface
166			- if somebody ripped an article out of a given copy,
167			that copy might adjust it's issues table of contents
168			- if copies of several issues are bound together,
169			you actually have only one copy, inheriting from several issues
170
171
172			To summarize, a lot of complicated situations can be modelled in a quite
173			natural fashion when thinking in terms of inheritance through patch relations,
174			and these, in turn can be easily and efficiently be implemented based
175			on ISIS record and database structures.
176
177			If you think of modelling that in a relational database,
178			you will probably find that you end up with
179			> RdbConv mimicking
180			the ISIS record.
181
182			---
183			$Id: PatchWork.txt,v 1.2 2003/06/23 14:43:24 kripke Exp $