Contents of /trunk/lib/Biblio/Isis/Manual.pod

Revision 37 - (show annotations)
Fri Jan 7 20:57:56 2005 UTC (19 years, 2 months ago) by dpavlin
File size: 28177 byte(s)
re-organize directories, add CDS/ISIS manual -- part about file structure

1 =pod
2
3 =head1 NAME
4
5 CDS/ISIS manual appendix F, G and H
6
7 =head1 DESCRIPTION
8
9 This is a partial scan of the CDS/ISIS manual (appendices F, G and H, pages
10 257-272) which was then converted to text using OCR and proofread.
11 However, there might be mistakes, and any corrections sent to
12 C<dpavlin@rot13.org> will be greatly appreciated.
13
14 This digital version was made because the version currently available in digital
15 form doesn't contain details about the CDS/ISIS file format, which were essential
16 in making the L<Biblio::Isis> module.
17
18 This extract of the manual has been produced in compliance with section (d) of
19 the WinIsis LICENCE for the receiving institution/person, which says:
20
21 The receiving institution/person may:
22
23 (d) Print/reproduce the CDS/ISIS manuals or portions thereof,
24 provided that such copies reproduce the copyright notice;
25
26 =head1 CDS/ISIS Files
27
28 This section describes the various files of the CDS/ISIS system, the
29 file naming conventions and the file extensions used for each type of
30 file. All CDS/ISIS files have standard names as follows:
31
32 nnnnnn.eee
33
34 where:
35
36 =over 10
37
38 =item C<nnnnnn>
39
40 is the file name (all file names, except program names, are limited to
41 a maximum of 6 characters)
42
43 =item C<.eee>
44
45 is the file extension identifying a particular type of file.
46
47 =back
48
49 Files marked with C<*> are ASCII files which you may display or print. The
50 other files are binary files.
51
52 =head2 A. System files
53
54 System files are common to all CDS/ISIS users and include the various
55 executable programs as well as system menus, worksheets and message
56 files provided by Unesco as well as additional ones which you may
57 create.
58
59 =head3 CDS/ISIS Program
60
61 The name of the program file, as supplied by Unesco is
62
63 ISIS.EXE
64
65 Depending on the release and/or target computer, there may also be one
66 or more overlay files. These, if present, have the extension C<OVL>.
67 Check the contents of your system diskettes or tape to see whether
68 overlay files are present.
69
70 =head3 System menus and worksheets
71
72 All system menus and worksheets have the file extension FMT and the
73 names are built as follows:
74
75 pctnnn.FMT
76
77 where:
78
79 =over 10
80
81 =item C<p>
82
83 is the page number (A for the first page, B for the second, etc.)
84
85 =item C<c>
86
87 is the language code (e.g. E for English), which must be one of those
88 provided for in the language selection menu xXLNG.
89
90 =item C<t>
91
92 is X for menus and Y for system worksheets
93
94 =item C<nnn>
95
96 is a unique identifier
97
98 =back
99
100 For example the full name of the English version of the menu xXGEN is
101 C<AEXGEN.FMT>.
102
103 The page number is transparent to the CDS/ISIS user. Like the file
104 extension the page number is automatically provided by the system.
105 Therefore when a CDS/ISIS program prompts you to enter a menu or
106 worksheet name you must not include the page number. Furthermore, as
107 file names are restricted to 6 characters, menu and worksheet names
108 may not be longer than 5 characters.
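
The naming rules above can be illustrated with a short Python sketch. The
function names are hypothetical helpers invented for this example; they are
not part of CDS/ISIS or of L<Biblio::Isis>. The sketch assumes that page 1 maps
to the letter A, page 2 to B, and so on, as described above.

```python
import string


def fmt_file_name(name: str, page: int = 1) -> str:
    """Build the on-disk file name for one page of a menu or worksheet.

    `name` is the name as typed by the user (at most 5 characters,
    without page letter or extension), e.g. "EXGEN".
    """
    if not 1 <= len(name) <= 5:
        raise ValueError("menu/worksheet names are 1 to 5 characters long")
    if not 1 <= page <= 26:
        raise ValueError("page must be between 1 (A) and 26 (Z)")
    page_letter = string.ascii_uppercase[page - 1]  # 1 -> A, 2 -> B, ...
    return f"{page_letter}{name.upper()}.FMT"


def split_system_name(name: str) -> tuple:
    """Split a system menu/worksheet name into (language, kind, identifier)."""
    kinds = {"X": "menu", "Y": "system worksheet"}
    return name[0], kinds.get(name[1], "other"), name[2:]


print(fmt_file_name("EXGEN"))        # first page of the English menu xXGEN
print(split_system_name("EXGEN"))
```

The page letter is prepended by the helper, mirroring the way the system adds
it transparently when a user enters only the 5-character name.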
109
110 System menus and worksheets may only have one page.
111
112 The language code is mandatory for system menus and standard system
113 worksheets. For example if you want to link a HELP menu to the system
114 menu EXGEN, its name must begin with the letter E.
115
116 The B<X> convention is only enforced for standard system menus. It is a
117 good practice, however, to use the same convention for menus that you
118 create, and to avoid creating worksheets (including data entry
119 worksheets) with X in this position, that is with names like xB<X>xxx.
120
121 Furthermore, if a data base name contains B<X> or B<Y> in the second
122 position, then the corresponding data entry worksheets will be created
123 in the system worksheet directory (parameter 2 of C<SYSPAR.PAR>) rather
124 than the data base directory. Although this will not prevent normal
125 operation of the data base, it is not recommended.
126
127 =head3 System messages files
128
129 System messages and prompts are stored in standard CDS/ISIS data bases.
130 All corresponding data base files (see below) are required when
131 updating a message file, but only the Master file is used to display
132 messages.
133
134 There must be a message data base for each language supported through
135 the language selection menu xXLNG.
136
137 The data base name assigned to message data bases is xMSG (where x is
138 the language code).
139
140 =head3 System tables
141
142 System tables are used by CDS/ISIS to define character sets. Two are
143 required at present:
144
145 =over
146
147 =item C<ISISUC.TAB>*
148
149 defines lower to upper-case translation
150
151 =item C<ISISAC.TAB>*
152
153 defines the alphabetic characters.
154
155 =back
156
157 =head3 System print and work files
158
159 Certain CDS/ISIS print functions do not send the output directly to the
160 printer but store it on a disk file from which you may then print it at
161 a convenient time. These files all have the file extension C<LST> and
162 are reused each time the corresponding function is executed.
163
164 In addition CDS/ISIS creates temporary work files which are normally
165 automatically discarded at the end of the session. If the session
166 terminates abnormally, however, they will not be deleted. A case of
167 abnormal termination would be a power failure while you are using a
168 CDS/ISIS program. These files too, however, are reused each time,
169 so that you do not normally need to delete them manually. Work files
170 all have the extension C<TMP>.
171
172 The print and work files created by CDS/ISIS are given below:
173
174 =over
175
176 =item C<IFLIST.LST>*
177
178 Inverted file listing file (produced by ISISINV)
179
180 =item C<WSLIST.LST>*
181
182 Worksheet/menu listing file (produced by ISISUTL)
183
184 =item C<xMSG.LST>*
185
186 System messages listing file (produced by ISISUTL)
187
188 =item C<x.LST>*
189
190 Printed output (produced by ISISPRT when no print file name is
191 supplied)
192
193 =item C<SORT10.TMP>
194
195 Sort work file 1
196
197 =item C<SORT11.TMP>
198
199 Sort work file 2
200
201 =item C<SORT12.TMP>
202
203 Sort work file 3
204
205 =item C<SORT13.TMP>
206
207 Sort work file 4
208
209 =item C<SORT20.TMP>
210
211 Sort work file 5
212
213 =item C<SORT21.TMP>
214
215 Sort work file 6
216
217 =item C<SORT22.TMP>
218
219 Sort work file 7
220
221 =item C<SORT23.TMP>
222
223 Sort work file 8
224
225 =item C<TRACE.TMP>*
226
227 Trace file created by certain programs
228
229 =item C<ATSF.TMP>
230
231 Temporary storage for hit lists created during retrieval
232
233 =item C<ATSQ.TMP>
234
235 Temporary storage for search expressions
236
237 =back
238
239 =head2 B. Data Base files
240
241 Each data base consists of a number of physically distinct files as
242 indicated below. There are three categories of data base files:
243
244 =over
245
246 =item 1
247
248 mandatory files, which must always be present.
249 These are normally established when the data base is defined by means of the
250 ISISDEF services and should never be deleted;
251
252 =item 2
253
254 auxiliary files created by the system whenever certain functions are
255 performed.
256 These can periodically be deleted when they are no longer needed;
257
258 =item 3
259
260 user files created by the data base user (such as display formats),
261 which are fully under the user's responsibility.
262
263 =back
264
265 In the following description C<xxxxxx> is the 1-6 character data base
266 name.
267
268 =head3 Mandatory data base files
269
270 =over
271
272 =item C<xxxxxx.FDT>*
273
274 Field Definition Table
275
276 =item C<xxxxxx.FST>*
277
278 Field Select Table for Inverted file
279
280 =item C<pxxxxx.FMT>*
281
282 Default data entry worksheet (where C<p> is the page number).
283
284 Note that the data base name is truncated to 5 characters if necessary.
285
286 =item C<xxxxxx.PFT>*
287
288 Default display format
289
290 =item C<xxxxxx.MST>
291
292 Master file
293
294 =item C<xxxxxx.XRF>
295
296 Crossreference file (Master file index)
297
298 =item C<xxxxxx.CNT>
299
300 B*tree (search term dictionary) control file
301
302 =item C<xxxxxx.N01>
303
304 B*tree Nodes (for terms up to 10 characters long)
305
306 =item C<xxxxxx.L01>
307
308 B*tree Leafs (for terms up to 10 characters long)
309
310 =item C<xxxxxx.N02>
311
312 B*tree Nodes (for terms longer than 10 characters)
313
314 =item C<xxxxxx.L02>
315
316 B*tree Leafs (for terms longer than 10 characters)
317
318 =item C<xxxxxx.IFP>
319
320 Inverted file postings
321
322 =item C<xxxxxx.ANY>*
323
324 ANY file
325
326 =back
327
328 =head3 Auxiliary files
329
330 =over
331
332 =item C<xxxxxx.STW>*
333
334 Stopword file used during inverted file generation
335
336 =item C<xxxxxx.LN1>*
337
338 Unsorted Link file (short terms)
339
340 =item C<xxxxxx.LN2>*
341
342 Unsorted Link file (long terms)
343
344 =item C<xxxxxx.LK1>*
345
346 Sorted Link file (short terms)
347
348 =item C<xxxxxx.LK2>*
349
350 Sorted Link file (long terms)
351
352 =item C<xxxxxx.BKP>
353
354 Master file backup
355
356 =item C<xxxxxx.XHF>
357
358 Hit file index
359
360 =item C<xxxxxx.HIT>
361
362 Hit file
363
364 =item C<xxxxxx.SRT>*
365
366 Sort conversion table (see "Uppercase conversion table (ISISUC.TAB)" on
367 page 227)
368
369 =back
370
371 =head3 User files
372
373 =over
374
375 =item C<yyyyyy.FST>*
376
377 Field Select tables used for sorting
378
379 =item C<yyyyyy.PFT>*
380
381 Additional display formats
382
383 =item C<yyyyyy.FMT>*
384
385 Additional data entry worksheets
386
387 =item C<yyyyyy.STW>*
388
389 Additional stopword files
390
391 =item C<yyyyyy.SAV>
392
393 Save files created during retrieval
394
395 =back
396
397 The name of user files is fully under user control. However, in order
398 to avoid possible name conflicts it is advisable to establish some
399 standard conventions to be followed by all CDS/ISIS users at a given
400 site, such as for example to define C<yyyyyy> as follows:
401
402 xxxyyy
403
404 where:
405
406 =over
407
408 =item C<xxx>
409
410 is a data base identifier (which could be the first three letters of
411 the data base name if no two data bases names are allowed to begin with
412 the same three letters)
413
414 =item C<yyy>
415
416 a user chosen name.
417
418 =back
419
420 =head1 Master file structure and record format
421
422 =head2 A. Master file record format
423
424 The Master record is a variable length record consisting of three
425 sections: a fixed length leader; a directory; and the variable length
426 data fields.
427
428 =head3 Leader format
429
430 The leader consists of the following 7 integers (fields marked with *
431 are 31-bit signed integers):
432
433 =over
434
435 =item C<MFN>*
436
437 Master file number
438
439 =item C<MFRL>
440
441 Record length (always an even number)
442
443 =item C<MFBWB>*
444
445 Backward pointer - Block number
446
447 =item C<MFBWP>
448
449 Backward pointer - Offset
450
451 =item C<BASE>
452
453 Offset to variable fields (this is the combined length of the Leader
454 and Directory part of the record, in bytes)
455
456 =item C<NVF>
457
458 Number of fields in the record (i.e. number of directory entries)
459
460 =item C<STATUS>
461
462 Logical deletion indicator (0=record active; 1=record marked for
463 deletion)
464
465 =back
466
467 C<MFBWB> and C<MFBWP> are initially set to 0 when the record is
468 created. They are subsequently updated each time the record itself is
469 updated (see below).
470
471 =head3 Directory format
472
473 The directory is a table indicating the record contents. There is one
474 directory entry for each field present in the record (i.e. the
475 directory has exactly NVF entries). Each directory entry consists of 3
476 integers:
477
478 =over
479
480 =item C<TAG>
481
482 Field Tag
483
484 =item C<POS>
485
486 Offset to first character position of field in the variable field
487 section (the first field has C<POS=0>)
488
489 =item C<LEN>
490
491 Field length in bytes
492
493 =back
494
495 The total directory length in bytes is therefore C<6*NVF>; the C<BASE> field
496 in the leader is always: C<18+6*NVF>.
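
The leader and directory layout described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: the byte order
is assumed to be little-endian, the 16-bit integers are assumed to be unsigned,
the leader is assumed to be packed without alignment padding (which gives the
18 bytes implied by C<BASE = 18+6*NVF>), and a zero byte is assumed as padding
to reach an even C<MFRL>. The names C<parse_record> and C<build_record> are
hypothetical and exist only for this example.

```python
import struct

# MFN, MFRL, MFBWB, MFBWP, BASE, NVF, STATUS (MFN and MFBWB are 4 bytes)
LEADER = struct.Struct("<iHiHHHH")   # assumed little-endian, 18 bytes
DIR_ENTRY = struct.Struct("<HHH")    # TAG, POS, LEN: 6 bytes per field


def parse_record(buf: bytes) -> dict:
    """Decode one Master record held in `buf` (starting at its leader)."""
    mfn, mfrl, mfbwb, mfbwp, base, nvf, status = LEADER.unpack_from(buf, 0)
    if base != LEADER.size + DIR_ENTRY.size * nvf:
        raise ValueError("BASE does not equal 18 + 6*NVF")
    fields = []
    for i in range(nvf):
        entry_at = LEADER.size + i * DIR_ENTRY.size
        tag, pos, length = DIR_ENTRY.unpack_from(buf, entry_at)
        start = base + pos               # POS is relative to the data section
        fields.append((tag, buf[start:start + length]))
    return {"mfn": mfn, "mfrl": mfrl, "deleted": status == 1, "fields": fields}


def build_record(mfn: int, fields: list) -> bytes:
    """Assemble a record in the same layout (used here for a round trip)."""
    nvf = len(fields)
    base = LEADER.size + DIR_ENTRY.size * nvf
    directory, data = b"", b""
    for tag, value in fields:
        directory += DIR_ENTRY.pack(tag, len(data), len(value))
        data += value
    mfrl = base + len(data)
    mfrl += mfrl % 2                     # MFRL is always an even number
    record = LEADER.pack(mfn, mfrl, 0, 0, base, nvf, 0) + directory + data
    return record.ljust(mfrl, b"\x00")   # padding byte is an assumption


record = build_record(1, [(200, b"Title"), (300, b"Author!")])
print(parse_record(record))
```

Building a record and parsing it back is a convenient way to check that the
directory offsets and the C<BASE> value agree with each other.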
497
498 =head3 Variable fields
499
500 This section contains the data fields (in the order indicated by the
501 directory). Data fields are placed one after the other, with no
502 separating characters.
503
504 =head2 B. Control record
505
506 The first record in the Master file is a control record which the
507 system maintains automatically. This is never accessible to the ISIS
508 user. Its contents are as follows (fields marked with C<*> are 31-bit
509 signed integers):
510
511 =over
512
513 =item C<CTLMFN>*
514
515 always 0
516
517 =item C<NXTMFN>*
518
519 MFN to be assigned to the next record created in the data base
520
521 =item C<NXTMFB>*
522
523 Last block number allocated to the Master file (first block is 1)
524
525 =item C<NXTMFP>
526
527 Offset to next available position in last block
528
529 =item C<MFTYPE>
530
531 always 0 for user data base file (1 for system message files)
532
533 =item C<RECCNT>*
534
535 =item C<MFCXX1>*
536
537 =item C<MFCXX2>*
538
539 =item C<MFCXX3>*
540
541 =back
542
543 (the last four fields are used for statistics during backup/restore).
544
545 =head2 C. Master file block format
546
547 The Master file records are stored consecutively, one after the other,
548 each record occupying exactly C<MFRL> bytes. The file is stored as
549 physical blocks of 512 bytes. A record may begin at any word boundary
550 between 0-498 (no record begins between 500-510) and may span over two
551 or more blocks.
552
553 As the Master file is created and/or updated, the system maintains an
554 index indicating the position of each record. The index is stored in
555 the Crossreference file (C<.XRF>).
556
557 =head2 D. Crossreference file
558
559 The C<XRF> file is organized as a table of pointers to the Master file.
560 The first pointer corresponds to MFN 1, the second to MFN 2, etc.
561
562 Each pointer consists of two fields:
563
564 =over
565
566 =item C<XRFMFB>
567
568 (21 bits) Block number of Master file block containing the record
569
570 =item C<XRFMFP>
571
572 (11 bits) Offset in block of first character position of Master record
573 (first block position is 0)
574
575 =back
576
577 which are stored in a 31-bit signed integer (4 bytes) as follows:
578
579 pointer = XRFMFB * 2048 + XRFMFP
580
581 (giving therefore a maximum Master file size of 500 Megabytes).
582
583 Each block of the C<XRF> file is 512 bytes and contains 127 pointers. The
584 first field in each block (C<XRFPOS>) is a 31-bit signed integer whose
585 absolute value is the C<XRF> block number. A negative C<XRFPOS> indicates
586 the last block.
587
588 I<Deleted> records are indicated as follows:
589
590 =over
591
592 =item C<XRFMFB E<lt> 0> and C<XRFMFP E<gt> 0>
593
594 logically deleted record (in this case C<ABS(XRFMFB)> is the correct block
595 pointer and C<XRFMFP> is the offset of the record, which can therefore
596 still be retrieved)
597
598 =item C<XRFMFB = -1> and C<XRFMFP = 0>
599
600 physically deleted record
601
602 =item C<XRFMFB = 0> and C<XRFMFP = 0>
603
604 nonexistent record (all records beyond the highest C<MFN> assigned in the
605 data base)
606
607 =back
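
The Crossreference lookup described in this section can be sketched in Python
as below. It is a rough illustration, not the actual CDS/ISIS code. The
function names are hypothetical. The decoding takes the formula
C<pointer = XRFMFB * 2048 + XRFMFP> literally, also for a negative C<XRFMFB>;
real files may encode the sign differently. The C<% 512> in
C<master_byte_offset> drops the 512 and 1024 markers that section E adds to
C<XRFMFP>.

```python
XRF_BLOCK_SIZE = 512       # bytes per .XRF block
POINTERS_PER_BLOCK = 127   # pointers following the 4-byte XRFPOS field
MST_BLOCK_SIZE = 512       # bytes per Master file block


def xrf_pointer_offset(mfn: int) -> int:
    """Byte offset inside the .XRF file of the pointer for `mfn` (MFN >= 1)."""
    block, slot = divmod(mfn - 1, POINTERS_PER_BLOCK)
    return block * XRF_BLOCK_SIZE + 4 + slot * 4   # skip XRFPOS, 4 bytes each


def decode_pointer(pointer: int) -> tuple:
    """Split a pointer into (state, XRFMFB, XRFMFP)."""
    # Floor division inverts pointer = XRFMFB * 2048 + XRFMFP with
    # 0 <= XRFMFP < 2048, even when XRFMFB is negative.
    xrfmfb, xrfmfp = divmod(pointer, 2048)
    if xrfmfb == 0 and xrfmfp == 0:
        state = "nonexistent"
    elif xrfmfb == -1 and xrfmfp == 0:
        state = "physically deleted"
    elif xrfmfb < 0:
        state = "logically deleted"
    else:
        state = "active"
    return state, xrfmfb, xrfmfp


def master_byte_offset(xrfmfb: int, xrfmfp: int) -> int:
    """Byte offset of the record in the .MST file (blocks start at 1)."""
    return (abs(xrfmfb) - 1) * MST_BLOCK_SIZE + xrfmfp % 512


print(xrf_pointer_offset(128))           # first pointer of the second block
print(decode_pointer(3 * 2048 + 100))
print(master_byte_offset(3, 100))
```

A logically deleted record keeps a usable position, so
C<master_byte_offset> uses the absolute value of the block number.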
608
609 =head2 E. Master file updating technique
610
611 =head3 Creation of new records
612
613 New records are always added at the end of the Master file, at the
614 position indicated by the fields C<NXTMFB>/C<NXTMFP> in the Master file
615 control record. The C<MFN> to be assigned is also obtained from the field
616 C<NXTMFN> in the control record.
617
618 After adding the record, C<NXTMFN> is increased by 1 and C<NXTMFB>/C<NXTMFP>
619 are updated to point to the next available position. In addition a new
620 pointer is created in the C<XRF> file and the C<XRFMFP> field corresponding
621 to the record is increased by 1024 to indicate that this is a new
622 record to be inverted (after the inversion of the record 1024 is
623 subtracted from C<XRFMFP>).
624
625 =head3 Update of existing records
626
627 Whenever you update a record (i.e., you call it in data entry and exit
628 with option X from the editor) the system writes the record back to the
629 Master file. Where it is written depends on the status of the record
630 when it was initially read.
631
632 =head4 There was no inverted file update pending for the record
633
634 This condition is indicated by the following:
635
636 On C<XRF> C<XRFMFP E<lt> 512> and
637
638 On C<MST> C<MFBWB = 0> and C<MFBWP = 0>
639
640 In this case, the record is always rewritten at the end of the Master
641 file (as if it were a new record) as indicated by C<NXTMFB>/C<NXTMFP> in the
642 control record. In the new version of the record C<MFBWB>/C<MFBWP> are set to
643 point to the old version of the record, while in the C<XRF> file the
644 pointer points to the new version. In addition 512 is added to C<XRFMFP>
645 to indicate that an inverted file update is pending. When the inverted
646 file is updated, the old version of the record is used to determine the
647 postings to be deleted and the new version is used to add the new
648 postings. After the update of the Inverted file, 512 is subtracted from
649 C<XRFMFP>, and C<MFBWB>/C<MFBWP> are reset to 0.
650
651 =head4 An inverted file update was pending
652
653 This condition is indicated by the following:
654
655 On C<XRF> C<XRFMFP E<gt> 512> and
656
657 On C<MST> C<MFBWB E<gt> 0>
658
659 In this case C<MFBWB>/C<MFBWP> point to the version of the record which is
660 currently reflected in the Inverted file. If possible, i.e. if the
661 record length was not increased, the record is written back at its
662 original location, otherwise it is written at the end of the file. In
663 both cases, C<MFBWB>/C<MFBWP> are not changed.
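
The markers added to C<XRFMFP> (1024 for a new record, 512 for a pending
inverted file update) can be separated from the real block offset as in the
following Python sketch. It assumes that the real offset is always below 512,
which follows from the 512-byte block size, and that the two markers can be
treated as independent flags; the second point is an assumption, and the
function name is hypothetical.

```python
def split_xrfmfp(xrfmfp: int) -> dict:
    """Separate the status markers from the offset stored in XRFMFP."""
    flags, offset = divmod(xrfmfp, 512)
    return {
        "offset": offset,                    # real position inside the block
        "new_record": bool(flags & 2),       # 1024 was added on creation
        "update_pending": bool(flags & 1),   # 512 was added on update
    }


print(split_xrfmfp(100))    # no inverted file work outstanding
print(split_xrfmfp(612))    # updated record, inversion pending
print(split_xrfmfp(1124))   # new record, not yet inverted
```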
664
665 =head3 Deletion of records
666
667 Record deletion is treated as an update, with the following additional
668 markings:
669
670 On C<XRF> C<XRFMFB> is negative
671
672 On C<MST> C<STATUS> is set to 1
673
674 =head2 F. Master file reorganization
675
676 As indicated above, as Master file records are updated the C<MST> file
677 grows in size and there will be lost space in the file which cannot be
678 used. The reorganization facilities allow this space to be reclaimed by
679 recompacting the file.
680
681 During the backup phase a Master file backup file is created (C<.BKP>).
682 The structure and format of this file is the same as the Master file
683 (C<.MST>), except that a Crossreference file is not required as all the
684 records are adjacent. Records marked for deletion are not backed up.
685 Because only the latest copy of each record is backed up, the system
686 does not allow you to perform a backup whenever an Inverted file update
687 is pending for one or more records.
688
689 During the restore phase the backup file is read sequentially and the
690 program recreates the C<MST> and C<XRF> file. At this point all records which
691 were marked for logical deletion (before the backup) are now marked as
692 physically deleted (by setting C<XRFMFB = -1> and C<XRFMFP = 0>).
693 Deleted records are detected by checking holes in the C<MFN> numbering.
694
695 =head1 Inverted file structure and record formats
696
697 =head2 A. Introduction
698
699 The CDS/ISIS Inverted file consists of six physical files, five of
700 which contain the dictionary of searchable terms (organized as a
701 B*tree) and the sixth contains the list of postings associated with
702 each term. In order to optimize disk storage, two separate B*trees are
703 maintained, one for terms of up to 10 characters (stored in files
704 C<.N01>/C<.L01>) and one for terms longer than 10 characters, up to a maximum
705 of 30 characters (stored in files C<.N02>/C<.L02>). The file C<.CNT> contains
706 control fields for both B*trees. In each B*tree the file C<.N0x> contains
707 the nodes of the tree and the C<.L0x> file contains the leafs. The leaf
708 records point to the postings file C<.IFP>.
709
710 The relationship between the various files is schematically represented
711 in Figure 67.
712
713 The physical relationship between these six files is a
714 pointer, which represents the relative address of the record being
715 pointed to. A relative address is the ordinal record number of a record
716 in a given file (i.e. the first record is record number 1, the second
717 is record number 2, etc.). The file C<.CNT> points to the file C<.N0x>,
718 C<.N0x> points to C<.L0x>, and C<.L0x> points to C<.IFP>. Because the
719 C<.IFP> is a packed file, the pointer from C<.L0x> to C<.IFP> has two
720 components: the block number and the offset within the block, each expressed
721 as an integer.
722
723 =head2 B. Format of C<.CNT> file
724
725 This file contains two 26-byte fixed length records (one for each
726 B*tree) each containing 10 integers as follows (fields marked with *
727 are 31-bit signed integers):
728
729 =over
730
731 =item C<IDTYPE>
732
733 B*tree type (1 for C<.N01>/C<.L01>, 2 for C<.N02>/C<.L02>)
734
735 =item C<ORDN>
736
737 Nodes order (each C<.N0x> record contains at most C<2*ORDN> keys)
738
739 =item C<ORDF>
740
741 Leafs order (each C<.L0x> record contains at most C<2*ORDF> keys)
742
743 =item C<N>
744
745 Number of memory buffers allocated for nodes
746
747 =item C<K>
748
749 Number of buffers allocated to 1st level index (C<K E<lt> N>)
750
751 =item C<LIV>
752
753 Current number of index levels
754
755 =item C<POSRX>*
756
757 Pointer to Root record in C<.N0x>
758
759 =item C<NMAXPOS>*
760
761 Next available position in C<.N0x> file
762
763 =item C<FMAXPOS>*
764
765 Next available position in C<.L0x> file
766
767 =item C<ABNORMAL>
768
769 Formal B*tree normality indicator (0 if B*tree is abnormal, 1 if B*tree
770 is normal). A B*tree is abnormal if the nodes file C<.N0x> contains only
771 the Root.
772
773 =back
774
775 C<ORDN>, C<ORDF>, C<N> and C<K> are fixed for a given generated system.
776 Currently these values are set as follows:
777
778 C<ORDN = 5>; C<ORDF = 5>; C<N = 15>; C<K = 5> for both B*trees
779
780 The other values are set as required when the B*trees are generated.
781
782                    +--------------+
783                    | Root address |
784                    +-------|------+
785                            |                            .CNT file
786                            |                            -------------
787                            |                            .N0x file
788                +-----------V--------+
789                | Key1 Key2 ... Keyn |                   Root
790                +---|-------------|--+
791                    |             |
792              +-----+             +------+
793              |                          |
794   +----------V----------+     +---------V----------+    1st level
795   | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |    index
796   +--|------------------+     +-----------------|--+
797      |                                          :
798      :                                  +-------+
799      |                                  |
800   +--V------------------+     +---------V----------+    last level
801   | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |    index
802   +---------|-----------+     +---------|----------+
803             |                           |
804             |                           |               -------------
805             |                           |               .L0x file
806   +---------V-----------+     +---------V----------+
807   | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |
808   +--|------------------+     +--------------------+
809      |
810      |                                                  -------------
811      |                                                  .IFP file
812   +--V----------------------------------+
813   | P1 P2 P3 ..................... Pn   |
814   +-------------------------------------+
815
816 I<Figure 67: Inverted file structure>
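
The two 26-byte C<.CNT> records described in section B can be read with a
Python sketch like the one below. It assumes little-endian byte order and a
packed layout with no alignment padding, which is what yields exactly 26 bytes
per record; the names C<CntRecord> and C<parse_cnt> are hypothetical.

```python
import struct
from collections import namedtuple

# Six 2-byte integers, three 4-byte integers, one 2-byte integer = 26 bytes
CNT_RECORD = struct.Struct("<hhhhhhiiih")   # byte order is an assumption
CntRecord = namedtuple(
    "CntRecord",
    "idtype ordn ordf n k liv posrx nmaxpos fmaxpos abnormal",
)


def parse_cnt(data: bytes) -> list:
    """Return the two control records (one per B*tree) from .CNT contents."""
    if len(data) < 2 * CNT_RECORD.size:
        raise ValueError(".CNT data is shorter than two 26-byte records")
    return [
        CntRecord._make(CNT_RECORD.unpack_from(data, i * CNT_RECORD.size))
        for i in range(2)
    ]


# Made-up sample values, using the fixed settings ORDN=5, ORDF=5, N=15, K=5
sample = (CNT_RECORD.pack(1, 5, 5, 15, 5, 1, 1, 2, 3, 0)
          + CNT_RECORD.pack(2, 5, 5, 15, 5, 2, 4, 9, 20, 0))
for record in parse_cnt(sample):
    print(record)
```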
817
818 =head2 C. Format of C<.N0x> files
819
820 These files contain the indexes of the dictionary of searchable terms
821 (C<.N01> for terms shorter than 11 characters and C<.N02> for terms longer
822 than 10 characters). The C<.N0x> file records have the following format
823 (fields marked with * are 31-bit signed integers):
824
825 =over
826
827 =item C<POS>*
828
829 an integer indicating the relative record number (1 for the first
830 record, 2 for the second record, etc.)
831
832 =item C<OCK>
833
834 an integer indicating the number of active keys in the record
835 ( C<1 E<lt>= OCK E<lt>= 2*ORDN> )
836
837 =item C<IT>
838
839 an integer indicating the type of B*tree (1 for C<.N01>, 2 for C<.N02>)
840
841 =item C<IDX>
842
843 an array of C<2*ORDN> entries (C<OCK> of which are active), each having the
844 following format:
845
846 =over 4
847
848 =item C<KEY>
849
850 a fixed length character string of length C<LEx> (C<LE1 = 10>, C<LE2 = 30>)
851
852 =item C<PUNT>
853
854 a pointer to the C<.N0x> record (if C<PUNT E<gt> 0>) or C<.L0x> record
855 (if C<PUNT E<lt> 0>) whose C<IDX(1).KEY = KEY>. C<PUNT = 0> indicates
856 an inactive entry. A positive C<PUNT> indicates a branch to a hierarchically
857 lower level index. The lowest level index (C<PUNT E<lt> 0>) points to the leafs in
858 the C<.L0x> file.
859
860 =back
861
862 =back
863
864 =head2 D. Format of C<.L0x> files
865
866 These files contain the full dictionary of searchable terms (C<.L01> for
867 terms shorter than 11 characters and C<.L02> for terms longer than 10
868 characters). The C<.L0x> file records have the following format (fields
869 marked with C<*> are 31-bit signed integers):
870
871 =over
872
873 =item C<POS>*
874
875 an integer indicating the relative record number (1 for the first
876 record, 2 for the second record, etc.)
877
878 =item C<OCK>
879
880 an integer indicating the number of active keys in the record
881 (C<1 E<lt>= OCK E<lt>= 2*ORDF>)
882
883 =item C<IT>
884
885 an integer indicating the type of B*tree (1 for C<.L01>, 2 for C<.L02>)
886
887 =item C<PS>*
888
889 is the immediate successor of C<IDX[OCK].KEY> in this record (this is used
890 to speed up sequential access to the file)
891
892 =item C<IDX>
893
894 an array of C<2*ORDF> entries (C<OCK> of which are active), each having the
895 following format:
896
897 =over 4
898
899 =item C<KEY>
900
901 a fixed length character string of length C<LEx> (C<LE1=10>, C<LE2=30>)
902
903 =item C<INFO>
904
905 a pointer to the C<.IFP> record where the list of postings associated with
906 C<KEY> begins. This pointer consists of two 31-bit signed integers as
907 follows:
908
909 =over 8
910
911 =item C<INFO[1]>*
912
913 relative block number in C<.IFP>
914
915 =item C<INFO[2]>*
916
917 offset (word number relative to 0) to postings list
918
919 =back
920
921 =back
922
923 =back
924
925 =head2 E. Format of C<.IFP> file
926
927 This file contains the list of postings for each dictionary term. Each
928 list of postings has the format indicated below. The file is structured
929 in blocks of 512 characters, where (for an initially loaded and
930 compacted file) the lists of postings for each term are adjacent,
931 except as noted below.
932
933 The general format of each block is:
934
935 =over
936
937 =item C<IFPBLK>
938
939 a 31-bit signed integer indicating the Block number of this block
940 (blocks are numbered from 1)
941
942 =item C<IFPREC>
943
944 An array of 127 31-bit signed integers
945
946 =back
947
948 C<IFPREC[1]> and C<IFPREC[2]> of the first block are a pointer to the
949 next available position in the C<.IFP> file.
950
951 Pointers from C<.L0x> to C<.IFP> and pointers within C<.IFP> consist of two
952 31-bit signed integers: the first integer is a block number, and the
953 second integer is a word offset in C<IFPREC> (e.g. the offset to the
954 first word in C<IFPREC> is 0). The list of postings associated with the
955 first search term will therefore start at 1/0.
956
957 Each list of postings consists of a header (5 double-words) followed by
958 the actual list of postings (8 bytes for each posting). The header has
959 the following format (each field is a 31-bit signed integer):
960
961 =over
962
963 =item C<IFPNXTB>*
964
965 Pointer to next segment (Block number)
966
967 =item C<IFPNXTP>*
968
969 Pointer to next segment (offset)
970
971 =item C<IFPTOTP>*
972
973 Total number of postings (accurate only in first segment)
974
975 =item C<IFPSEGP>*
976
977 Number of postings in this segment (C<IFPSEGP E<lt>= IFPTOTP>)
978
979 =item C<IFPSEGC>*
980
981 Segment capacity (i.e. number of postings which can be stored in this
982 segment)
983
984 =back
985
986 Each posting is a 64-bit string partitioned as follows:
987
988 =over
989
990 =item C<PMFN>
991
992 (24 bits) Master file number
993
994 =item C<PTAG>
995
996 (16 bits) Field identifier (assigned from the C<FST>)
997
998 =item C<POCC>
999
1000 (8 bits) Occurrence number
1001
1002 =item C<PCNT>
1003
1004 (16 bits) Term sequence number in field
1005
1006 =back
1007
1008 Each field is stored in a strict left-to-right sequence with leading
1009 zeros added if necessary to adjust the corresponding bit string to the
1010 right (this allows comparisons of two postings as character strings).
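
The 64-bit posting layout can be illustrated with the Python sketch below. It
assumes that "strict left-to-right sequence" means the most significant bits
come first (big-endian) within the 8 bytes, which is what makes byte-wise
comparison follow the C<PMFN>/C<PTAG>/C<POCC>/C<PCNT> order. The function names
are hypothetical.

```python
def pack_posting(pmfn: int, ptag: int, pocc: int, pcnt: int) -> bytes:
    """Pack a posting into 8 bytes: PMFN(24) PTAG(16) POCC(8) PCNT(16)."""
    if not (0 <= pmfn < 1 << 24 and 0 <= ptag < 1 << 16
            and 0 <= pocc < 1 << 8 and 0 <= pcnt < 1 << 16):
        raise ValueError("value does not fit its bit field")
    value = (pmfn << 40) | (ptag << 24) | (pocc << 16) | pcnt
    return value.to_bytes(8, "big")      # assumed left-to-right byte order


def unpack_posting(raw: bytes) -> tuple:
    """Return (PMFN, PTAG, POCC, PCNT) from an 8-byte posting."""
    if len(raw) != 8:
        raise ValueError("a posting is exactly 8 bytes long")
    value = int.from_bytes(raw, "big")
    return (value >> 40,
            (value >> 24) & 0xFFFF,
            (value >> 16) & 0xFF,
            value & 0xFFFF)


first = pack_posting(1, 200, 1, 5)
second = pack_posting(1, 200, 2, 1)
print(unpack_posting(first))
print(first < second)    # byte-wise comparison follows the field order
```

Because the fields are laid out from most to least significant, sorting the
raw 8-byte strings gives the same order as sorting the decoded tuples.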
1011
1012 The list of postings is stored in ascending C<PMFN>/C<PTAG>/C<POCC>/C<PCNT>
1013 sequence. When the inverted file is loaded sequentially (e.g. after a
1014 full inverted file generation with ISISINV), each list consists of one
1015 or more adjacent segments. If C<IFPTOTP E<lt>= 32768> then:
1016 C<IFPNXTB/IFPNXTP = 0/0> and C<IFPTOTP = IFPSEGP = IFPSEGC>.
1017
1018 As updates are performed, additional segments may be created whenever
1019 new postings must be added. In this case a new segment with capacity
1020 C<IFPTOTP> is created and linked to other segments (through the pointer
1021 C<IFPNXTB>/C<IFPNXTP>) in such a way that the sequence
1022 C<PMFN>/C<PTAG>/C<POCC>/C<PCNT> is maintained. Whenever such a split occurs
1023 the postings of the segment where the new posting should have been inserted
1024 are equally distributed between this segment and the newly created segment.
1025 New segments are always written at the end of the file (which is maintained
1026 in C<IFPREC[1]>/C<IFPREC[2]> of the first C<.IFP> block).
1027
1028 For example, assume that a new posting C<Px> has to be inserted between C<P2>
1029 and C<P3> in the following list:
1030
1031 +----------------------------+
1032 | 0 0 5 5 5 | P1 P2 P3 P4 P5 |
1033 +----------------------------+
1034
1035 after the split (and assuming that the next available position in C<.IFP>
1036 is 3/4) the list of postings will consist of the following two segments:
1037
1038 +----------------------------+
1039 | 3 4 5 3 5 | P1 P2 Px -- -- |
1040 +--|-------------------------+
1041 |
1042 +--V-------------------------+
1043 | 0 0 5 3 5 | P3 P4 P5 -- -- |
1044 +----------------------------+
1045
1046 In this situation, no new segment will be created until either segment
1047 becomes full again.
1048
1049 As mentioned above, the posting lists are normally stored one after the
1050 other. However, in order to facilitate access to the C<.IFP> file the
1051 segments are stored in such a way that:
1052
1053 =over
1054
1055 =item 1
1056
1057 the header and the first posting in each list (28 bytes) are never
1058 split between two blocks.
1059
1060 =item 2
1061
1062 a posting is never split between two blocks; if there is not enough
1063 room in the current block the whole posting is stored in the next
1064 block.
1065
1066 =back
1067
1068 =head1 LICENCE
1069
1070 UNESCO has developed and owns the intellectual property of the CDS/ISIS
1071 software (in whole or in part, including all files and documentation, from
1072 here on referred to as CDS/ISIS) for the storage and retrieval of
1073 information.
1074
1075 For complete text of licence visit
1076 L<http://www.unesco.org/isis/files/winisislicense.html>.
1077
1078 =cut
1079
