1 |
Representing ISIS data in a relational database |
2 |
|
3 |
If you consider setting up a table with columns "author", "title" |
4 |
and so on, don't do it. The following structure gives you at least |
5 |
some of the benefits of ISIS, because it's shaped more closely |
6 |
to the ISIS data model. |
7 |
|
8 |
* table structure |
9 |
|
10 |
It is not very difficult to think of a table structure |
11 |
for lossless representation of the contents of an ISIS masterfile: |
12 |
$ |
13 |
CREATE TABLE Mst ( |
14 |
mfn number, -- master file number |
15 |
fld number, -- counting fields in this masterfile record |
16 |
tag number, -- the field tag |
17 |
val varchar(2048) -- field's content |
18 |
) |
19 |
$ |
20 |
Notes: |
21 |
- a row in this table does NOT hold an ISIS record, |
22 |
but one field of an ISIS record |
23 |
- each ISIS record is represented by all table rows |
24 |
with the same mfn |
25 |
- you may omit the column "fld" if you don't care for the ordering |
26 |
- you may choose other datatypes for the field tag, |
27 |
for example number(3) or char(3), |
28 |
depending on the compatibility required (e.g. MARC) |
29 |
- you may choose other dimensioning for the val column. |
30 |
ISIS can hold up to 32KB (signed short) in one field, |
31 |
Oracle has a maximum of 4KB on the varchar type. |
32 |
|
33 |
* accessing the data |
34 |
|
35 |
Retrieving row 123 is easily done with |
36 |
$ |
37 |
SELECT tag, val WHERE mfn = 123 ORDER BY fld |
38 |
$ |
39 |
It is suggested that you fill the val column with native |
40 |
ISIS data as of "MPL" proof mode. Then you have to extract |
41 |
subfields and perform other substitutions for output. |
42 |
This can be achieved for example using the v-method of the |
43 |
> /jdoc/org/openisis/Field.html Field |
44 |
Java class (it's a static method of about 100 lines, |
45 |
which you may grab and port to your system). |
46 |
|
47 |
|
48 |
An alternative would be to pre-split the subfields |
49 |
into a structure like |
50 |
$ |
51 |
CREATE TABLE Mst ( |
52 |
mfn number, |
53 |
fld number, |
54 |
tag number, |
55 |
sub char(1), -- subfield code |
56 |
val varchar(2048) -- subfield's content |
57 |
) |
58 |
$ |
59 |
A typical keywords field like |
60 |
$ |
61 |
<deltas><humid zones><tropical zones> |
62 |
$ |
63 |
should be split into three subfields using any subfield code, say '>'. |
64 |
|
65 |
The advantage of the pre-split approach is that it makes both |
66 |
display and indexing easier. |
67 |
The disadvantage is, that you'll loose some information, |
68 |
albeit not too much. |
69 |
|
70 |
* searching |
71 |
|
72 |
Searching is also relatively easy, as long as you ask simple questions ;). |
73 |
To get a list of the mfns with a field 24 containing Twain, use |
74 |
$ |
75 |
SELECT DISTINCT(mfn) FROM Mst WHERE val = 'Twain' AND tag = 24 |
76 |
$ |
77 |
For this to work efficiently, you have to put an index on val. |
78 |
You may than search for 'Twain' with or without the restriction |
79 |
on a certain tag, just like in ISIS. |
80 |
You can also use right truncation and restrict on a couple of tags: |
81 |
$ |
82 |
SELECT DISTINCT(mfn) FROM Mst WHERE val LIKE 'Twain%' AND tag IN (201,401) |
83 |
$ |
84 |
Right truncation should still use the index on val, |
85 |
thus be reasonable efficient (depending on how smart your RDB is). |
86 |
With the pre-split structure as above, this applies to any subfield. |
87 |
If you want to find Twain anywhere within the field's values, you need |
88 |
$ |
89 |
SELECT DISTINCT(mfn) FROM Mst WHERE val LIKE '%Twain%' |
90 |
$ |
91 |
This will perform a pretty costly full table scan, |
92 |
so you may go fetch yourself a coffee meanwhile. |
93 |
|
94 |
|
95 |
While you may use SQLs OR predicate to combine several terms as |
96 |
alternatives, the ISIS AND does NOT map to SQLs AND, |
97 |
but to an INTERSECTion on the mfn lists. |
98 |
It's actually pretty much the same as ISIS does internally (but more costly). |
99 |
|
100 |
* building an index |
101 |
|
102 |
Still, this is lightyears away from the possibilities of an ISIS db. |
103 |
The index on val does not do word splits, does not use stopwords, |
104 |
is not case insensitive, does not apply character or word conversion |
105 |
and so on. |
106 |
|
107 |
To come anywhere near the features of ISIS, you have to do just |
108 |
what ISIS does: set up a separate "Inverted File". |
109 |
$ |
110 |
CREATE TABLE Index ( |
111 |
val varchar(30), -- index entry |
112 |
mfn number, -- record containing entry |
113 |
tag number, -- tag of field containing entry |
114 |
occ number, -- field is occ'th occurence of field tag in record |
115 |
pos number -- entry is pos'th word in field |
116 |
) |
117 |
$ |
118 |
To fill this table, you could of course go and reimplement |
119 |
all the features of an ISIS .FST in, say, Oracle's PL/SQL |
120 |
and recreate one record's entries from within an INSERT/UPDATE trigger. |
121 |
|
122 |
Much easier, given you had a true ISIS database in the first place, |
123 |
is to load the .LK1 and .LK2 text files created during full index update. |
124 |
Some ISIS tools also allow you to create these files separately. |
125 |
|
126 |
Searching is performed against the Index table |
127 |
(which of course needs an RDB index at least on val), |
128 |
not the Mst table. |
129 |
The latter is only used to retrieve the full content for a given mfn, |
130 |
or for full table scans. |
131 |
|
132 |
* so why then use ISIS in the first place? |
133 |
|
134 |
There is a long list of reasons, here go a few of them: |
135 |
- You don't want to reimplement the .FST, the full formatting language, |
136 |
all the nifty details of the various indexing techniques and so on. |
137 |
Actually, if you DO want, you're very welcome in the OpenIsis team ;) |
138 |
- Not only the index building, but also data entry is done much more |
139 |
comfortable using ISIS tools. |
140 |
- Any queries but the most simple examples given above will not only |
141 |
be somewhat complex to write in SQL, but will also execute much slower |
142 |
than in a native ISIS implementation (consider publishing with OpenIsis). |
143 |
- Especially the NEAR operators can't easily be expressed in SQL. |
144 |
Although the index structure given above contains the necessary data |
145 |
in occ and pos, advanced SQL features like nested queries are needed |
146 |
in the appropriate SQL query. You need to have an advanced (read: |
147 |
expensive) RDBMS and a deep and thorough understanding of it's |
148 |
query optimizer and hinting system for this to run in any |
149 |
reasonable time. |
150 |
- So why then use a RDBMS in the first place? |
151 |
|
152 |
See also |
153 |
> whatabout a comparision of ISIS and other database systems |
154 |
and |
155 |
> unirec notes on the flexibility of the ISIS data model. |
156 |
--- |
157 |
$Id: RdbConv.txt,v 1.4 2002/11/26 16:57:19 kripke Exp $ |