
Annotation of /trunk/lib/Biblio/Isis/Manual.pod



Revision 37
Fri Jan 7 20:57:56 2005 UTC (19 years, 3 months ago) by dpavlin
File size: 28177 byte(s)
re-organize directories, add CDS/ISIS manual -- part about file structure

1 dpavlin 37 =pod
2    
3     =head1 NAME
4    
5     CDS/ISIS manual appendix F, G and H
6    
7     =head1 DESCRIPTION
8    
9     This is a partial scan of the CDS/ISIS manual (appendices F, G and H, pages
10     257-272) which was then converted to text using OCR and proofread.
11     However, there might be mistakes, and any corrections sent to
12     C<dpavlin@rot13.org> will be greatly appreciated.
13    
14     This digital version was made because the current version available in digital
15     form doesn't contain details about the CDS/ISIS file format, and it was essential
16     in making the L<Biblio::Isis> module.
17    
18     This extract of the manual has been produced in compliance with section (d) of
19     the WinIsis LICENCE for the receiving institution/person, which says:
20    
21     The receiving institution/person may:
22    
23     (d) Print/reproduce the CDS/ISIS manuals or portions thereof,
24     provided that such copies reproduce the copyright notice;
25    
26     =head1 CDS/ISIS Files
27    
28     This section describes the various files of the CDS/ISIS system, the
29     file naming conventions and the file extensions used for each type of
30     file. All CDS/ISIS files have standard names as follows:
31    
32     nnnnnn.eee
33    
34     where:
35    
36     =over 10
37    
38     =item C<nnnnnn>
39    
40     is the file name (all file names, except program names, are limited to
41     a maximum of 6 characters)
42    
43     =item C<.eee>
44    
45     is the file extension identifying a particular type of file.
46    
47     =back
48    
49     Files marked with C<*> are ASCII files which you may display or print. The
50     other files are binary files.
51    
52     =head2 A. System files
53    
54     System files are common to all CDS/ISIS users and include the various
55     executable programs as well as system menus, worksheets and message
56     files provided by Unesco as well as additional ones which you may
57     create.
58    
59     =head3 CDS/ISIS Program
60    
61     The name of the program file, as supplied by Unesco is
62    
63     ISIS.EXE
64    
65     Depending on the release and/or target computer, there may also be one
66     or more overlay files. These, if present, have the extension C<OVL>.
67     Check the contents of your system diskettes or tape to see whether
68     overlay files are present.
69    
70     =head3 System menus and worksheets
71    
72     All system menus and worksheets have the file extension FMT and the
73     names are built as follows:
74    
75     pctnnn.FMT
76    
77     where:
78    
79     =over 10
80    
81     =item C<p>
82    
83     is the page number (A for the first page, B for the second, etc.)
84    
85     =item C<c>
86    
87     is the language code (e.g. E for English), which must be one of those
88     provided for in the language selection menu xXLNG.
89    
90     =item C<t>
91    
92     is X for menus and Y for system worksheets
93    
94     =item C<nnn>
95    
96     is a unique identifier
97    
98     =back
99    
100     For example the full name of the English version of the menu xXGEN is
101     C<AEXGEN.FMT>.
102    
103     The page number is transparent to the CDS/ISIS user. Like the file
104     extension the page number is automatically provided by the system.
105     Therefore when a CDS/ISIS program prompts you to enter a menu or
106     worksheet name you must not include the page number. Furthermore as
107     file names are restricted to 6 characters, menus and worksheets names
108     may not be longer than 5 characters.
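
The naming rule above can be sketched in a few lines of Python. This is
only an illustration: the function name C<fmt_file_name> and the default
page letter are assumptions, not part of CDS/ISIS.

```python
def fmt_file_name(name: str, page: str = "A") -> str:
    """Build the on-disk name of a menu or worksheet.

    `name` is what the user types (language code, X/Y type letter and
    identifier, e.g. "EXGEN"); the page letter is prepended by the system.
    """
    if not 1 <= len(name) <= 5:
        raise ValueError("menu/worksheet names are limited to 5 characters")
    if len(page) != 1 or not page.isalpha():
        raise ValueError("page must be a single letter (A, B, ...)")
    return f"{page}{name}.FMT".upper()
```

For the English menu xXGEN this gives C<AEXGEN.FMT>, matching the example
in the text.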
109    
110     System menus and worksheets may only have one page.
111    
112     The language code is mandatory for system menus and standard system
113     worksheets. For example if you want to link a HELP menu to the system
114     menu EXGEN, its name must begin with the letter E.
115    
116     The B<X> convention is only enforced for standard system menus. It is a
117     good practice, however, to use the same convention for menus that you
118     create, and to avoid creating worksheets (including data entry
119     worksheets) with X in this position, that is with names like xB<X>xxx.
120    
121     Furthermore, if a data base name contains B<X> or B<Y> in the second
122     position, then the corresponding data entry worksheets will be created
123     in the system worksheet directory (parameter 2 of C<SYSPAR.PAR>) rather
124     than the data base directory. Although this will not prevent normal
125     operation of the data base, it is not recommended.
126    
127     =head3 System messages files
128    
129     System messages and prompts are stored in standard CDS/ISIS data bases.
130     All corresponding data base files (see below) are required when
131     updating a message file, but only the Master file is used to display
132     messages.
133    
134     There must be a message data base for each language supported through
135     the language selection menu xXLNG.
136    
137     The data base name assigned to message data bases is xMSG (where x is
138     the language code).
139    
140     =head3 System tables
141    
142     System tables are used by CDS/ISIS to define character sets. Two are
143     required at present:
144    
145     =over
146    
147     =item C<ISISUC.TAB>*
148    
149     defines lower to upper-case translation
150    
151     =item C<ISISAC.TAB>*
152    
153     defines the alphabetic characters.
154    
155     =back
156    
157     =head3 System print and work files
158    
159     Certain CDS/ISIS print functions do not send the output directly to the
160     printer but store it on a disk file from which you may then print it at
161     a convenient time. These files all have the file extension C<LST> and
162     are reused each time the corresponding function is executed.
163    
164     In addition CDS/ISIS creates temporary work files which are normally
165     automatically discarded at the end of the session. If the session
166     terminates abnormally, however, they will not be deleted. A case of
167     abnormal termination would be a power failure while you are using a
168     CDS/ISIS program. These files too, however, are reused each time,
169     so that you do not normally need to delete them manually. Work files
170     all have the extension C<TMP>.
171    
172     The print and work files created by CDS/ISIS are given below:
173    
174     =over
175    
176     =item C<IFLIST.LST>*
177    
178     Inverted file listing file (produced by ISISINV)
179    
180     =item C<WSLIST.LST>*
181    
182     Worksheet/menu listing file (produced by ISISUTL)
183    
184     =item C<xMSG.LST>*
185    
186     System messages listing file (produced by ISISUTL)
187    
188     =item C<x.LST>*
189    
190     Printed output (produced by ISISPRT when no print file name is
191     supplied)
192    
193     =item C<SORT10.TMP>
194    
195     Sort work file 1
196    
197     =item C<SORT11.TMP>
198    
199     Sort work file 2
200    
201     =item C<SORT12.TMP>
202    
203     Sort work file 3
204    
205     =item C<SORT13.TMP>
206    
207     Sort work file 4
208    
209     =item C<SORT20.TMP>
210    
211     Sort work file 5
212    
213     =item C<SORT21.TMP>
214    
215     Sort work file 6
216    
217     =item C<SORT22.TMP>
218    
219     Sort work file 7
220    
221     =item C<SORT23.TMP>
222    
223     Sort work file 8
224    
225     =item C<TRACE.TMP>*
226    
227     Trace file created by certain programs
228    
229     =item C<ATSF.TMP>
230    
231     Temporary storage for hit lists created during retrieval
232    
233     =item C<ATSQ.TMP>
234    
235     Temporary storage for search expressions
236    
237     =back
238    
239     =head2 B. Data Base files
240    
241     Each data base consists of a number of physically distinct files as
242     indicated below. There are three categories of data base files:
243    
244     =over
245    
246     =item 1
247    
248     mandatory files, which must always be present.
249     These are normally established when the data base is defined by means of the
250     ISISDEF services and should never be deleted;
251    
252     =item 2
253    
254     auxiliary files created by the system whenever certain functions are
255     performed.
256     These can periodically be deleted when they are no longer needed.
257    
258     =item 3
259    
260     user files created by the data base user (such as display formats),
261     which are fully under the user's responsibility.
262    
263     =back
264    
265     In the following description C<xxxxxx> is the 1-6 character data base
266     name.
267    
268     =head3 Mandatory data base files
269    
270     =over
271    
272     =item C<xxxxxx.FDT>*
273    
274     Field Definition Table
275    
276     =item C<xxxxxx.FST>*
277    
278     Field Select Table for Inverted file
279    
280     =item C<pxxxxx.FMT>*
281    
282     Default data entry worksheet (where p is the page number).
283    
284     Note that the data base name is truncated to 5 characters if necessary
285    
286     =item C<xxxxxx.PFT>*
287    
288     Default display format
289    
290     =item C<xxxxxx.MST>
291    
292     Master file
293    
294     =item C<xxxxxx.XRF>
295    
296     Crossreference file (Master file index)
297    
298     =item C<xxxxxx.CNT>
299    
300     B*tree (search term dictionary) control file
301    
302     =item C<xxxxxx.N01>
303    
304     B*tree Nodes (for terms up to 10 characters long)
305    
306     =item C<xxxxxx.L01>
307    
308     B*tree Leafs (for terms up to 10 characters long)
309    
310     =item C<xxxxxx.N02>
311    
312     B*tree Nodes (for terms longer than 10 characters)
313    
314     =item C<xxxxxx.L02>
315    
316     B*tree Leafs (for terms longer than 10 characters)
317    
318     =item C<xxxxxx.IFP>
319    
320     Inverted file postings
321    
322     =item C<xxxxxx.ANY>*
323    
324     ANY file
325    
326     =back
327    
328     =head3 Auxiliary files
329    
330     =over
331    
332     =item C<xxxxxx.STW>*
333    
334     Stopword file used during inverted file generation
335    
336     =item C<xxxxxx.LN1>*
337    
338     Unsorted Link file (short terms)
339    
340     =item C<xxxxxx.LN2>*
341    
342     Unsorted Link file (long terms)
343    
344     =item C<xxxxxx.LK1>*
345    
346     Sorted Link file (short terms)
347    
348     =item C<xxxxxx.LK2>*
349    
350     Sorted Link file (long terms)
351    
352     =item C<xxxxxx.BKP>
353    
354     Master file backup
355    
356     =item C<xxxxxx.XHF>
357    
358     Hit file index
359    
360     =item C<xxxxxx.HIT>
361    
362     Hit file
363    
364     =item C<xxxxxx.SRT>*
365    
366     Sort conversion table (see "Uppercase conversion table (ISISUC.TAB)" on
367     page 227)
368    
369     =back
370    
371     =head3 User files
372    
373     =over
374    
375     =item C<yyyyyy.FST>*
376    
377     Field Select tables used for sorting
378    
379     =item C<yyyyyy.PFT>*
380    
381     Additional display formats
382    
383     =item C<yyyyyy.FMT>*
384    
385     Additional data entry worksheets
386    
387     =item C<yyyyyy.STW>*
388    
389     Additional stopword files
390    
391     =item C<yyyyyy.SAV>
392    
393     Save files created during retrieval
394    
395     =back
396    
397     The name of user files is fully under user control. However, in order
398     to avoid possible name conflicts it is advisable to establish some
399     standard conventions to be followed by all CDS/ISIS users at a given
400     site, such as for example to define C<yyyyyy> as follows:
401    
402     xxxyyy
403    
404     where:
405    
406     =over
407    
408     =item C<xxx>
409    
410     is a data base identifier (which could be the first three letters of
411     the data base name if no two data bases names are allowed to begin with
412     the same three letters)
413    
414     =item C<yyy>
415    
416     a user chosen name.
417    
418     =back
419    
420     =head1 Master file structure and record format
421    
422     =head2 A. Master file record format
423    
424     The Master record is a variable length record consisting of three
425     sections: a fixed length leader; a directory; and the variable length
426     data fields.
427    
428     =head3 Leader format
429    
430     The leader consists of the following 7 integers (fields marked with *
431     are 31-bit signed integers):
432    
433     =over
434    
435     =item C<MFN>*
436    
437     Master file number
438    
439     =item C<MFRL>
440    
441     Record length (always an even number)
442    
443     =item C<MFBWB>*
444    
445     Backward pointer - Block number
446    
447     =item C<MFBWP>
448    
449     Backward pointer - Offset
450    
451     =item C<BASE>
452    
453     Offset to variable fields (this is the combined length of the Leader
454     and Directory part of the record, in bytes)
455    
456     =item C<NVF>
457    
458     Number of fields in the record (i.e. number of directory entries)
459    
460     =item C<STATUS>
461    
462     Logical deletion indicator (0=record active; 1=record marked for
463     deletion)
464    
465     =back
466    
467     C<MFBWB> and C<MFBWP> are initially set to 0 when the record is
468     created. They are subsequently updated each time the record itself is
469     updated (see below).
470    
471     =head3 Directory format
472    
473     The directory is a table indicating the record contents. There is one
474     directory entry for each field present in the record (i.e. the
475     directory has exactly NVF entries). Each directory entry consists of 3
476     integers:
477    
478     =over
479    
480     =item C<TAG>
481    
482     Field Tag
483    
484     =item C<POS>
485    
486     Offset to first character position of field in the variable field
487     section (the first field has C<POS=0>)
488    
489     =item C<LEN>
490    
491     Field length in bytes
492    
493     =back
494    
495     The total directory length in bytes is therefore C<6*NVF>; the C<BASE> field
496     in the leader is always: C<18+6*NVF>.
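
As a minimal sketch, the leader and directory described above could be
decoded in Python as follows. The little-endian byte order and the absence
of padding between fields are assumptions (the 18-byte leader implied by
C<BASE = 18+6*NVF> suggests a packed layout); the function name is
hypothetical.

```python
import struct

# Assumed packed little-endian layout:
# MFN (4 bytes), MFRL (2), MFBWB (4), MFBWP (2), BASE (2), NVF (2), STATUS (2)
LEADER = struct.Struct("<ihihhhh")   # 18 bytes
DIR_ENTRY = struct.Struct("<hhh")    # TAG, POS, LEN = 6 bytes

def parse_leader_and_directory(record: bytes):
    """Return (leader dict, list of (TAG, POS, LEN)) for one Master record."""
    mfn, mfrl, mfbwb, mfbwp, base, nvf, status = LEADER.unpack_from(record, 0)
    # The text states that BASE is always 18 + 6*NVF; use it as a sanity check.
    if base != LEADER.size + DIR_ENTRY.size * nvf:
        raise ValueError("BASE does not equal 18 + 6*NVF")
    directory = [
        DIR_ENTRY.unpack_from(record, LEADER.size + i * DIR_ENTRY.size)
        for i in range(nvf)
    ]
    leader = {"MFN": mfn, "MFRL": mfrl, "MFBWB": mfbwb, "MFBWP": mfbwp,
              "BASE": base, "NVF": nvf, "STATUS": status}
    return leader, directory
```

The check on C<BASE> is a cheap way to notice a wrong byte order or a
misaligned read before trusting the directory.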
497    
498     =head3 Variable fields
499    
500     This section contains the data fields (in the order indicated by the
501     directory). Data fields are placed one after the other, with no
502     separating characters.
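
Because fields carry no separators, each one has to be cut out using its
directory entry. A small sketch, assuming the directory has already been
decoded into C<(TAG, POS, LEN)> tuples (the helper name is hypothetical):

```python
def extract_fields(record: bytes, base: int, directory):
    """Slice the variable field section using (TAG, POS, LEN) entries.

    POS is relative to the start of the variable field section, which
    itself begins BASE bytes into the record.
    """
    fields = {}
    for tag, pos, length in directory:
        start = base + pos
        # A tag may occur more than once, so collect values in a list.
        fields.setdefault(tag, []).append(record[start:start + length])
    return fields
```

The values are returned as raw bytes; decoding them depends on the
character set used by the data base.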
503    
504     =head2 B. Control record
505    
506     The first record in the Master file is a control record which the
507     system maintains automatically. This is never accessible to the ISIS
508     user. Its contents are as follows (fields marked with C<*> are 31-bit
509     signed integers):
510    
511     =over
512    
513     =item C<CTLMFN>*
514    
515     always 0
516    
517     =item C<NXTMFN>*
518    
519     MFN to be assigned to the next record created in the data base
520    
521     =item C<NXTMFB>*
522    
523     Last block number allocated to the Master file (first block is 1)
524    
525     =item C<NXTMFP>
526    
527     Offset to next available position in last block
528    
529     =item C<MFTYPE>
530    
531     always 0 for user data base file (1 for system message files)
532    
533     =item C<RECCNT>*
534    
535     =item C<MFCXX1>*
536    
537     =item C<MFCXX2>*
538    
539     =item C<MFCXX3>*
540    
541     =back
542    
543     (the last four fields are used for statistics during backup/restore).
544    
545     =head2 C. Master file block format
546    
547     The Master file records are stored consecutively, one after the other,
548     each record occupying exactly C<MFRL> bytes. The file is stored as
549     physical blocks of 512 bytes. A record may begin at any word boundary
550     between 0-498 (no record begins between 500-510) and may span over two
551     or more blocks.
552    
553     As the Master file is created and/or updated, the system maintains an
554     index indicating the position of each record. The index is stored in
555     the Crossreference file (C<.XRF>).
556    
557     =head2 D. Crossreference file
558    
559     The C<XRF> file is organized as a table of pointers to the Master file.
560     The first pointer corresponds to MFN 1, the second to MFN 2, etc.
561    
562     Each pointer consists of two fields:
563    
564     =over
565    
566     =item C<XRFMFB>
567    
568     (21 bits) Block number of Master file block containing the record
569    
570     =item C<XRFMFP>
571    
572     (11 bits) Offset in block of first character position of Master record
573     (first block position is 0)
574    
575     =back
576    
577     which are stored in a 31-bit signed integer (4 bytes) as follows:
578    
579     pointer = XRFMFB * 2048 + XRFMFP
580    
581     (giving therefore a maximum Master file size of 500 Megabytes).
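
The packing formula can be inverted with a floor division. The sketch
below follows the formula literally; the byte position calculation
additionally assumes that Master file blocks are numbered from 1 and that
C<XRFMFP> carries no extra markers. Function names are hypothetical.

```python
BLOCK_SIZE = 512  # Master file block size in bytes

def decode_xrf_pointer(pointer: int):
    """Split a pointer into (XRFMFB, XRFMFP).

    Floor division keeps 0 <= XRFMFP < 2048, so that
    pointer == XRFMFB * 2048 + XRFMFP also holds for negative pointers.
    """
    return divmod(pointer, 2048)

def master_file_position(xrfmfb: int, xrfmfp: int) -> int:
    """Byte position in the .MST file, assuming the first block is block 1."""
    return (xrfmfb - 1) * BLOCK_SIZE + xrfmfp
```

For example, a pointer of C<3 * 2048 + 100> decodes to block 3, offset
100, which under these assumptions is byte 1124 of the Master file.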
582    
583     Each block of the C<XRF> file is 512 bytes and contains 127 pointers. The
584     first field in each block (C<XRFPOS>) is a 31-bit signed integer whose
585     absolute value is the C<XRF> block number. A negative C<XRFPOS> indicates
586     the last block.
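
Given this block layout, the position of the pointer for a given C<MFN>
can be computed directly. A sketch, assuming blocks are stored back to
back from the start of the file and each integer occupies 4 bytes (the
function name is hypothetical):

```python
XRF_BLOCK_SIZE = 512      # bytes per XRF block
POINTERS_PER_BLOCK = 127  # pointers following the 4-byte XRFPOS field

def xrf_pointer_position(mfn: int) -> int:
    """Byte position in the .XRF file of the pointer for record `mfn`."""
    if mfn < 1:
        raise ValueError("MFN starts at 1")
    block_index, slot = divmod(mfn - 1, POINTERS_PER_BLOCK)
    # Skip whole blocks, then the XRFPOS field, then earlier pointers.
    return block_index * XRF_BLOCK_SIZE + 4 + slot * 4
```

MFN 127 is thus the last pointer of the first block and MFN 128 the first
pointer of the second.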
587    
588     I<Deleted> records are indicated as follows:
589    
590     =over
591    
592     =item C<XRFMFB E<lt> 0> and C<XRFMFP E<gt> 0>
593    
594     logically deleted record (in this case C<ABS(XRFMFB)> is the correct block
595     pointer and C<XRFMFP> is the offset of the record, which can therefore
596     still be retrieved)
597    
598     =item C<XRFMFB = -1> and C<XRFMFP = 0>
599    
600     physically deleted record
601    
602     =item C<XRFMFB = 0> and C<XRFMFP = 0>
603    
604     inexistent record (all records beyond the highest C<MFN> assigned in the
605     data base)
606    
607     =back
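
The three cases above can be expressed as a small classification function.
Treating a positive C<XRFMFB> as an active record is an assumption, as is
the C<"unknown"> result for combinations the text does not list.

```python
def xrf_status(xrfmfb: int, xrfmfp: int) -> str:
    """Classify an XRF entry from its decoded block number and offset."""
    if xrfmfb == 0 and xrfmfp == 0:
        return "nonexistent"
    if xrfmfb == -1 and xrfmfp == 0:
        return "physically deleted"
    if xrfmfb < 0 and xrfmfp > 0:
        # abs(xrfmfb) is still a valid block number for retrieval.
        return "logically deleted"
    if xrfmfb > 0:
        return "active"  # assumption: not stated explicitly in the text
    return "unknown"
```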
608    
609     =head2 E. Master file updating technique
610    
611     =head3 Creation of new records
612    
613     New records are always added at the end of the Master file, at the
614     position indicated by the fields C<NXTMFB>/C<NXTMFP> in the Master file
615     control record. The C<MFN> to be assigned is also obtained from the field
616     C<NXTMFN> in the control record.
617    
618     After adding the record, C<NXTMFN> is increased by 1 and C<NXTMFB>/C<NXTMFP>
619     are updated to point to the next available position. In addition a new
620     pointer is created in the C<XRF> file and the C<XRFMFP> field corresponding
621     to the record is increased by 1024 to indicate that this is a new
622     record to be inverted (after the inversion of the record 1024 is
623     subtracted from C<XRFMFP>).
624    
625     =head3 Update of existing records
626    
627     Whenever you update a record (i.e., you call it in data entry and exit
628     with option X from the editor) the system writes the record back to the
629     Master file. Where it is written depends on the status of the record
630     when it was initially read.
631    
632     =head4 There was no inverted file update pending for the record
633    
634     This condition is indicated by the following:
635    
636     On C<XRF> C<XRFMFP E<lt> 512> and
637    
638     On C<MST> C<MFBWB = 0> and C<MFBWP = 0>
639    
640     In this case, the record is always rewritten at the end of the Master
641     file (as if it were a new record) as indicated by C<NXTMFB>/C<NXTMFP> in the
642     control record. In the new version of the record C<MFBWB>/C<MFBWP> are set to
643     point to the old version of the record, while in the C<XRF> file the
644     pointer points to the new version. In addition 512 is added to C<XRFMFP>
645     to indicate that an inverted file update is pending. When the inverted
646     file is updated, the old version of the record is used to determine the
647     postings to be deleted and the new version is used to add the new
648     postings. After the update of the Inverted file, 512 is subtracted from
649     C<XRFMFP>, and C<MFBWB>/C<MFBWP> are reset to 0.
650    
651     =head4 An inverted file update was pending
652    
653     This condition is indicated by the following:
654    
655     On C<XRF> C<XRFMFP E<gt> 512> and
656    
657     On C<MST> C<MFBWB E<gt> 0>
658    
659     In this case C<MFBWB>/C<MFBWP> point to the version of the record which is
660     currently reflected in the Inverted file. If possible, i.e. if the
661     record length was not increased, the record is written back at its
662     original location, otherwise it is written at the end of the file. In
663     both cases, C<MFBWB>/C<MFBWP> are not changed.
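
Since record offsets stay below 512, the values 512 and 1024 added to
C<XRFMFP> behave like two flag bits on top of the real offset. The sketch
below separates them; reading the additions as bit flags is an
interpretation of the text, and the function name is hypothetical.

```python
NEW_RECORD = 1024      # added when a record is created, until it is inverted
UPDATE_PENDING = 512   # added when an inverted file update is pending

def split_xrfmfp(xrfmfp: int):
    """Separate the block offset from the two update markers."""
    return {
        "offset": xrfmfp & 511,
        "new": bool(xrfmfp & NEW_RECORD),
        "update_pending": bool(xrfmfp & UPDATE_PENDING),
    }
```

A reader that only wants the record position would use the C<offset>
value and ignore both markers.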
664    
665     =head3 Deletion of records
666    
667     Record deletion is treated as an update, with the following additional
668     markings:
669    
670     On C<XRF> C<XRFMFB> is negative
671    
672     On C<MST> C<STATUS> is set to 1
673    
674     =head2 F. Master file reorganization
675    
676     As indicated above, as Master file records are updated the C<MST> file
677     grows in size and there will be lost space in the file which cannot be
678     used. The reorganization facilities allow this space to be reclaimed by
679     recompacting the file.
680    
681     During the backup phase a Master file backup file is created (C<.BKP>).
682     The structure and format of this file is the same as the Master file
683     (C<.MST>), except that a Crossreference file is not required as all the
684     records are adjacent. Records marked for deletion are not backed up.
685     Because only the latest copy of each record is backed up, the system
686     does not allow you to perform a backup whenever an Inverted file update
687     is pending for one or more records.
688    
689     During the restore phase the backup file is read sequentially and the
690     program recreates the C<MST> and C<XRF> files. At this point all records which
691     were marked for logical deletion (before the backup) are now marked as
692     physically deleted (by setting C<XRFMFB = -1> and C<XRFMFP = 0>).
693     Deleted records are detected by checking holes in the C<MFN> numbering.
694    
695     =head1 Inverted file structure and record formats
696    
697     =head2 A. Introduction
698    
699     The CDS/ISIS Inverted file consists of six physical files, five of
700     which contain the dictionary of searchable terms (organized as a
701     B*tree) and the sixth contains the list of postings associated with
702     each term. In order to optimize disk storage, two separate B*trees are
703     maintained, one for terms of up to 10 characters (stored in files
704     C<.N01>/C<.L01>) and one for terms longer than 10 characters, up to a maximum
705     of 30 characters (stored in files C<.N02>/C<.L02>). The file C<.CNT> contains
706     control fields for both B*trees. In each B*tree the file C<.N0x> contains
707     the nodes of the tree and the C<.L0x> file contains the leafs. The leaf
708     records point to the postings file C<.IFP>.
709    
710     The relationship between the various files is schematically represented
711     in Figure 67.
712    
713     The physical relationship between these six files is a
714     pointer, which represents the relative address of the record being
715     pointed to. A relative address is the ordinal record number of a record
716     in a given file (i.e. the first record is record number 1, the second
717     is record number 2, etc.). The file C<.CNT> points to the file C<.N0x>,
718     C<.N0x> points to C<.L0x>, and C<.L0x> points to C<.IFP>. Because the
719     C<.IFP> is a packed file, the pointer from C<.L0x> to C<.IFP> has two
720     components: the block number and the offset within the block, each expressed
721     as an integer.
722    
723     =head2 B. Format of C<.CNT> file
724    
725     This file contains two 26-byte fixed length records (one for each
726     B*tree) each containing 10 integers as follows (fields marked with *
727     are 31-bit signed integers):
728    
729     =over
730    
731     =item C<IDTYPE>
732    
733     B*tree type (1 for C<.N01>/C<.L01>, 2 for C<.N02>/C<.L02>)
734    
735     =item C<ORDN>
736    
737     Nodes order (each C<.N0x> record contains at most C<2*ORDN> keys)
738    
739     =item C<ORDF>
740    
741     Leafs order (each C<.L0x> record contains at most C<2*ORDF> keys)
742    
743     =item C<N>
744    
745     Number of memory buffers allocated for nodes
746    
747     =item C<K>
748    
749     Number of buffers allocated to 1st level index (C<K E<lt> N>)
750    
751     =item C<LIV>
752    
753     Current number of index levels
754    
755     =item C<POSRX>*
756    
757     Pointer to Root record in C<.N0x>
758    
759     =item C<NMAXPOS>*
760    
761     Next available position in C<.N0x> file
762    
763     =item C<FMAXPOS>*
764    
765     Next available position in C<.L0x> file
766    
767     =item C<ABNORMAL>
768    
769     Formal B*tree normality indicator (0 if B*tree is abnormal, 1 if B*tree
770     is normal). A B*tree is abnormal if the nodes file C<.N0x> contains only
771     the Root.
772    
773     =back
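
The ten fields listed above add up to 26 bytes if the unmarked integers
are 2 bytes and the marked ones 4 bytes. A sketch of reading both control
records in Python, assuming little-endian byte order and no padding (the
function name is hypothetical):

```python
import struct

# Six 2-byte integers, three 4-byte integers, one 2-byte integer = 26 bytes
CNT_RECORD = struct.Struct("<hhhhhhiiih")
CNT_FIELDS = ("IDTYPE", "ORDN", "ORDF", "N", "K", "LIV",
              "POSRX", "NMAXPOS", "FMAXPOS", "ABNORMAL")

def parse_cnt(data: bytes):
    """Return two dicts, one per B*tree, from the contents of a .CNT file."""
    records = []
    for i in range(2):
        values = CNT_RECORD.unpack_from(data, i * CNT_RECORD.size)
        records.append(dict(zip(CNT_FIELDS, values)))
    return records
```

C<POSRX> of each record is the starting point for a dictionary lookup in
the corresponding C<.N0x> file.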
774    
775     C<ORDN>, C<ORDF>, C<N> and C<K> are fixed for a given generated system.
776     Currently these values are set as follows:
777    
778     C<ORDN = 5>; C<ORDF = 5>; C<N = 15>; C<K = 5> for both B*trees
779    
780     The other values are set as required when the B*trees are generated.
781    
782                        +--------------+
783                        | Root address |
784                        +-------|------+
785                                |                            .CNT file
786                                |                            -------------
787                                |                            .N0x file
788                    +-----------V--------+
789                    | Key1 Key2 ... Keyn |                   Root
790                    +---|-------------|--+
791                        |             |
792                  +-----+             +------+
793                  |                          |
794       +----------V----------+     +---------V----------+    1st level
795       | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |    index
796       +--|------------------+     +-----------------|--+
797          |                                          :
798          :                                  +-------+
799          |                                  |
800       +--V------------------+     +---------V----------+    last level
801       | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |    index
802       +---------|-----------+     +---------|----------+
803                 |                           |
804                 |                           |               -------------
805                 |                           |               .L0x file
806       +---------V-----------+     +---------V----------+
807       | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |
808       +--|------------------+     +--------------------+
809          |
810          |                                                  -------------
811          |                                                  .IFP file
812       +--V----------------------------------+
813       | P1 P2 P3 ..................... Pn   |
814       +-------------------------------------+
815    
816     I<Figure 67: Inverted file structure>
817    
818     =head2 C. Format of C<.N0x> files
819    
820     These files contain the indexes of the dictionary of searchable terms
821     (C<.N01> for terms shorter than 11 characters and C<.N02> for terms longer
822     than 10 characters). The C<.N0x> file records have the following format
823     (fields marked with * are 31-bit signed integers):
824    
825     =over
826    
827     =item C<POS>*
828    
829     an integer indicating the relative record number (1 for the first
830     record, 2 for the second record, etc.)
831    
832     =item C<OCK>
833    
834     an integer indicating the number of active keys in the record
835     ( C<1 E<lt>= OCK E<lt>= 2*ORDN> )
836    
837     =item C<IT>
838    
839     an integer indicating the type of B*tree (1 for C<.N01>, 2 for C<.N02>)
840    
841     =item C<IDX>
842    
843     an array of C<2*ORDN> entries (C<OCK> of which are active), each having the
844     following format:
845    
846     =over 4
847    
848     =item C<KEY>
849    
850     a fixed length character string of length C<LEx> (C<LE1 = 10>, C<LE2 = 30>)
851    
852     =item C<PUNT>
853    
854     a pointer to the C<.N0x> record (if C<PUNT E<gt> 0>) or C<.L0x> record
855     (if C<PUNT E<lt> 0>) whose C<IDX(1).KEY = KEY>. C<PUNT = 0> indicates
856     an inactive entry. A positive C<PUNT> indicates a branch to a hierarchically
857     lower level index. The lowest level index (C<PUNT E<lt> 0>) points to the leafs in
858     the C<.L0x> file.
859    
860     =back
861    
862     =back
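
The sign convention of C<PUNT> can be sketched as follows. Taking the
absolute value of a negative C<PUNT> as the C<.L0x> record number is an
assumption; the text only states that a negative value points into the
leaf file.

```python
def follow_punt(punt: int):
    """Return (file, record number) for a PUNT value, or None if inactive."""
    if punt > 0:
        return (".N0x", punt)   # branch to a lower level index record
    if punt < 0:
        return (".L0x", -punt)  # assumed: record number is abs(PUNT)
    return None                 # PUNT = 0 marks an inactive entry
```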
863    
864     =head2 D. Format of C<.L0x> files
865    
866     These files contain the full dictionary of searchable terms (C<.L01> for
867     terms shorter than 11 characters and C<.L02> for terms longer than 10
868     characters). The C<.L0x> file records have the following format (fields
869     marked with C<*> are 31-bit signed integers):
870    
871     =over
872    
873     =item C<POS>*
874    
875     an integer indicating the relative record number (1 for the first
876     record, 2 for the second record, etc.)
877    
878     =item C<OCK>
879    
880     an integer indicating the number of active keys in the record
881     (C<1 E<lt>= OCK E<lt>= 2*ORDF>)
882    
883     =item C<IT>
884    
885     an integer indicating the type of B*tree (1 for C<.L01>, 2 for C<.L02>)
886    
887     =item C<PS>*
888    
889     is the immediate successor of C<IDX[OCK].KEY> in this record (this is used
890     to speed up sequential access to the file)
891    
892     =item C<IDX>
893    
894     an array of C<2*ORDF> entries (C<OCK> of which are active), each having the
895     following format:
896    
897     =over 4
898    
899     =item C<KEY>
900    
901     a fixed length character string of length C<LEx> (C<LE1=10>, C<LE2=30>)
902    
903     =item C<INFO>
904    
905     a pointer to the C<.IFP> record where the list of postings associated with
906     C<KEY> begins. This pointer consists of two 31-bit signed integers as
907     follows:
908    
909     =over 8
910    
911     =item C<INFO[1]>*
912    
913     relative block number in C<.IFP>
914    
915     =item C<INFO[2]>*
916    
917     offset (word number relative to 0) to postings list
918    
919     =back
920    
921     =back
922    
923     =back
924    
925     =head2 E. Format of C<.IFP> file
926    
927     This file contains the list of postings for each dictionary term. Each
928     list of postings has the format indicated below. The file is structured
929     in blocks of 512 characters, where (for an initially loaded and
930     compacted file) the lists of postings for each term are adjacent,
931     except as noted below.
932    
933     The general format of each block is:
934    
935     =over
936    
937     =item C<IFPBLK>
938    
939     a 31-bit signed integer indicating the Block number of this block
940     (blocks are numbered from 1)
941    
942     =item C<IFPREC>
943    
944     An array of 127 31-bit signed integers
945    
946     =back
947    
948     C<IFPREC[1]> and C<IFPREC[2]> of the first block are a pointer to the
949     next available position in the C<.IFP> file.
950    
951     Pointers from C<.L0x> to C<.IFP> and pointers within C<.IFP> consist of two
952     31-bit signed integers: the first integer is a block number, and the
953     second integer is a word offset in C<IFPREC> (e.g. the offset to the
954     first word in C<IFPREC> is 0). The list of postings associated with the
955     first search term will therefore start at 1/0.
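
A block/offset pointer can be turned into a byte position as sketched
below, assuming 4-byte words, blocks stored back to back from the start of
the file, and the C<IFPBLK> field occupying the first word of each block.
The function name is hypothetical.

```python
IFP_BLOCK_SIZE = 512  # bytes per .IFP block
WORD_SIZE = 4         # each IFPREC entry is stored in 4 bytes

def ifp_byte_position(block: int, offset: int) -> int:
    """Byte position in the .IFP file of word `offset` of block `block`."""
    if block < 1:
        raise ValueError("blocks are numbered from 1")
    if not 0 <= offset < 127:
        raise ValueError("IFPREC holds 127 words per block")
    # Skip earlier blocks, then the IFPBLK word, then earlier IFPREC words.
    return (block - 1) * IFP_BLOCK_SIZE + WORD_SIZE + offset * WORD_SIZE
```

Under these assumptions the pointer 1/0 corresponds to byte 4 of the file.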
956    
957     Each list of postings consists of a header (5 double-words) followed by
958     the actual list of postings (8 bytes for each posting). The header has
959     the following format (each field is a 31-bit signed integer):
960    
961     =over
962    
963     =item C<IFPNXTB>*
964    
965     Pointer to next segment (Block number)
966    
967     =item C<IFPNXTP>*
968    
969     Pointer to next segment (offset)
970    
971     =item C<IFPTOTP>*
972    
973     Total number of postings (accurate only in first segment)
974    
975     =item C<IFPSEGP>*
976    
977     Number of postings in this segment (C<IFPSEGP E<lt>= IFPTOTP>)
978    
979     =item C<IFPSEGC>*
980    
981     Segment capacity (i.e. number of postings which can be stored in this
982     segment)
983    
984     =back
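
The header fields listed above can be read with a few lines of Python.
This is a sketch under assumptions: little-endian 32-bit storage is not
stated in the manual, and the names C<SegmentHeader>,
C<parse_segment_header> and C<is_last_segment> are hypothetical. Treating
a 0/0 pointer as "no further segment" follows the examples in this
section rather than an explicit statement.

```python
import struct
from collections import namedtuple

SegmentHeader = namedtuple(
    "SegmentHeader", "IFPNXTB IFPNXTP IFPTOTP IFPSEGP IFPSEGC"
)
HEADER_SIZE = 20        # 5 integers of 4 bytes each


def parse_segment_header(raw, byteorder="<"):
    """Decode the 5-integer header found at the start of a segment."""
    fields = struct.unpack(f"{byteorder}5i", raw[:HEADER_SIZE])
    return SegmentHeader(*fields)


def is_last_segment(header):
    """Assumption: a next-segment pointer of 0/0 ends the chain."""
    return header.IFPNXTB == 0 and header.IFPNXTP == 0


# Header of a single, full segment holding 5 postings.
raw_header = struct.pack("<5i", 0, 0, 5, 5, 5)
header = parse_segment_header(raw_header)
```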
985    
986     Each posting is a 64-bit string partitioned as follows:
987    
988     =over
989    
990     =item C<PMFN>
991    
992     (24 bits) Master file number
993    
994     =item C<PTAG>
995    
996     (16 bits) Field identifier (assigned from the C<FST>)
997    
998     =item C<POCC>
999    
1000     (8 bits) Occurrence number
1001    
1002     =item C<PCNT>
1003    
1004     (16 bits) Term sequence number in field
1005    
1006     =back
1007    
1008     Each field is stored in a strict left-to-right sequence with leading
1009     zeros added if necessary to adjust the corresponding bit string to the
1010     right (this allows comparisons of two postings as character strings).
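
A possible reading of this layout is sketched below in Python. It
interprets "strict left-to-right" as big-endian packing of the 64-bit
string, which is an assumption; the names C<Posting>, C<pack_posting> and
C<unpack_posting> are hypothetical.

```python
from collections import namedtuple

Posting = namedtuple("Posting", "PMFN PTAG POCC PCNT")


def pack_posting(p):
    """Pack the fields as 24 + 16 + 8 + 16 bits, most significant first."""
    value = (p.PMFN << 40) | (p.PTAG << 24) | (p.POCC << 16) | p.PCNT
    return value.to_bytes(8, "big")


def unpack_posting(raw):
    """Inverse of pack_posting for an 8-byte posting."""
    if len(raw) != 8:
        raise ValueError("a posting is exactly 8 bytes")
    value = int.from_bytes(raw, "big")
    return Posting(
        PMFN=value >> 40,
        PTAG=(value >> 24) & 0xFFFF,
        POCC=(value >> 16) & 0xFF,
        PCNT=value & 0xFFFF,
    )


a = pack_posting(Posting(PMFN=1, PTAG=245, POCC=1, PCNT=3))
b = pack_posting(Posting(PMFN=2, PTAG=10, POCC=1, PCNT=1))
```

With this packing, comparing the raw byte strings gives the same order as
comparing the C<PMFN>/C<PTAG>/C<POCC>/C<PCNT> tuples, which is the
property the paragraph above describes.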
1011    
1012     The list of postings is stored in ascending C<PMFN>/C<PTAG>/C<POCC>/C<PCNT>
1013     sequence. When the inverted file is loaded sequentially (e.g. after a
1014     full inverted file generation with ISISINV), each list consists of one
1015     or more adjacent segments. If C<IFPTOTP E<lt>= 32768> then:
1016     C<IFPNXTB/IFPNXTP = 0/0> and C<IFPTOTP = IFPSEGP = IFPSEGC>.
1017    
1018     As updates are performed, additional segments may be created whenever
1019     new postings must be added. In this case a new segment with capacity
1020     C<IFPTOTP> is created and linked to other segments (through the pointer
1021     C<IFPNXTB>/C<IFPNXTP>) in such a way that the sequence
1022     C<PMFN>/C<PTAG>/C<POCC>/C<PCNT> is maintained. Whenever such a split occurs
1023     the postings of the segment where the new posting should have been inserted
1024     are equally distributed between this segment and the newly created segment.
1025     New segments are always written at the end of the file (whose position is
1026     maintained in C<IFPREC[1]>/C<IFPREC[2]> of the first C<.IFP> block).
1027    
1028     For example, assume that a new posting C<Px> has to be inserted between C<P2>
1029     and C<P3> in the following list:
1030    
1031     +----------------------------+
1032     | 0 0 5 5 5 | P1 P2 P3 P4 P5 |
1033     +----------------------------+
1034    
1035     after the split (and assuming that the next available position in C<.IFP>
1036     is 3/4) the list of postings will consist of the following two segments:
1037    
1038     +----------------------------+
1039     | 3 4 5 3 5 | P1 P2 Px -- -- |
1040     +--|-------------------------+
1041        |
1042     +--V-------------------------+
1043     | 0 0 5 3 5 | P3 P4 P5 -- -- |
1044     +----------------------------+
1045    
1046     In this situation, no new segment will be created until either segment
1047     becomes full again.
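
The split described above can be simulated in memory with a short Python
sketch. It is a simplification, not the CDS/ISIS implementation: postings
are modelled as plain integers, segments as dictionaries, and the rule
"insert first, then divide in half" is an assumption about how the equal
distribution is carried out. The function name C<split_full_segment> is
hypothetical.

```python
import bisect


def split_full_segment(postings, capacity, new_posting, old_next, next_free):
    """Insert new_posting into a full segment by splitting it in two.

    old_next  -- next-segment pointer of the segment before the split
    next_free -- next available position in .IFP (block, offset)
    """
    if len(postings) != capacity:
        raise ValueError("only a full segment needs to be split")
    merged = list(postings)
    bisect.insort(merged, new_posting)      # keep ascending order
    half = (len(merged) + 1) // 2
    # The old segment now points at the new one, which inherits the old link.
    first = {"next": next_free, "capacity": capacity,
             "postings": merged[:half]}
    second = {"next": old_next, "capacity": capacity,
              "postings": merged[half:]}
    return first, second


# Stand-ins for P1..P5 and for Px, which sorts between P2 and P3.
P1, P2, P3, P4, P5 = 10, 20, 30, 40, 50
Px = 25
first, second = split_full_segment([P1, P2, P3, P4, P5], 5, Px, (0, 0), (3, 4))
```

Both resulting segments hold three postings and have two free slots, so
further insertions fit without another split until one of them fills up.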
1048    
1049     As mentioned above, the posting lists are normally stored one after the
1050     other. However, in order to facilitate access to the C<.IFP> file the
1051     segments are stored in such a way that:
1052    
1053     =over
1054    
1055     =item 1
1056    
1057     the header and the first posting in each list (28 bytes) are never
1058     split between two blocks.
1059    
1060     =item 2
1061    
1062     a posting is never split between two blocks; if there is not enough
1063     room in the current block the whole posting is stored in the next
1064     block.
1065    
1066     =back
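
The two placement rules can be expressed as simple checks on the word
offset inside a block. The Python below is a sketch only: it assumes that
a header takes 5 words and a posting 2 words of the 127-word C<IFPREC>
array, and the function names are hypothetical.

```python
IFPREC_WORDS = 127      # words available in one block
HEADER_WORDS = 5        # 5 integers (20 bytes)
POSTING_WORDS = 2       # 64 bits (8 bytes)


def place_list_start(block, word):
    """Rule 1: header plus first posting (28 bytes) must share a block."""
    if word + HEADER_WORDS + POSTING_WORDS > IFPREC_WORDS:
        return block + 1, 0
    return block, word


def place_posting(block, word):
    """Rule 2: a posting that does not fit moves whole to the next block."""
    if word + POSTING_WORDS > IFPREC_WORDS:
        return block + 1, 0
    return block, word


# With 7 words left the header and first posting still fit in block 1.
fits = place_list_start(1, 120)
# With only 6 words left the list has to start in block 2.
moved = place_list_start(1, 121)
```

Because 127 is odd, a block filled only with postings always ends with
one unused word under rule 2.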
1067    
1068     =head1 LICENCE
1069    
1070     UNESCO has developed and owns the intellectual property of the CDS/ISIS
1071     software (in whole or in part, including all files and documentation, from
1072     here on referred to as CDS/ISIS) for the storage and retrieval of
1073     information.
1074    
1075     For complete text of licence visit
1076     L<http://www.unesco.org/isis/files/winisislicense.html>.
1077    
1078     =cut
1079    
