=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's look at that in detail.

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then, however, it has adopted various other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we decided on the following work-flow for your
data:

 step

        source file     CDS/ISIS, MARC, Excel, robots, ...
            |
  1         | apply input normalisation rules (xml or yaml)
            V
        intermediate    this data is re-formatted source data converted
        data            to chunks based on tag names from config/input/
            |
  2         | optionally apply output filter (TT2)
            V
        data            search engine, HTML, OAI, RDBMS
            |
  3         | filter using query in REST format
  4         | apply output filter (TT2)
            V
        client          Web browser, SOAP

=head2 Normalisation and Intermediate data

This is the first step in working with your data.

You are creating mappings, one-to-one, from source data records to documents
in WebPAC. You can split or merge data from input records, apply filters
(perl subroutines), use lookups within the same source file or do simple
evaluations while producing output.

All of that is controlled by the C<config/input/> configuration file. You
will want to create fine-grained chunks of data (like separate first and
last names), which will later be used to produce output. You can think of
the conversion process as the application of the C<config/input/> recipe to
every input record.

Each tag within the recipe keeps creating new records as long as there are
fields in the input format (which can be repeatable) that satisfy at least
one field within the tag.

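The idea of applying a recipe to every input record can be sketched roughly
as follows. This is a Python illustration only, not WebPAC's own Perl code
or its real configuration syntax: the field numbers, the tag names, the
C<split_name> filter and the C<normalise> function are all assumptions made
up for this example.

```python
# Hypothetical input record: each field number maps to a list of values,
# because fields in the input format can be repeatable.
record = {
    "700": ["Doe, Joe", "Roe, Jane"],  # assumed repeatable author field
    "200": ["An Example Title"],       # assumed title field
}


def split_name(value):
    """Filter: split 'Last, First' into fine-grained chunks."""
    last, _, first = value.partition(",")
    return {"last_name": last.strip(), "first_name": first.strip()}


# Hypothetical recipe: tag name -> (input field, filter subroutine).
recipe = {
    "personal_name": ("700", split_name),
    "title": ("200", lambda value: {"title": value}),
}


def normalise(record, recipe):
    """Apply every tag of the recipe to one input record."""
    intermediate = {}
    for tag, (field, filt) in recipe.items():
        # One entry is produced per occurrence of the (repeatable) field;
        # a field missing from the record simply yields no entries.
        intermediate[tag] = [filt(value) for value in record.get(field, [])]
    return intermediate


intermediate = normalise(record, recipe)
```

In this sketch the repeatable field C<700> yields two C<personal_name>
entries, each already split into first and last name, which is the kind of
fine-grained chunk a later output step could use.
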
Users of older versions of WebPAC should note that this file no longer
contains any formatting or specification of output type, and that the
granularity of each tag has increased.

B<This document should really be updated to reflect the Webpacus front-end
from this point on...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can
create HTML from it, data files for a search engine, or insert it into an
RDBMS.

The twist is that the application of output filters can be recursive,
allowing you to query data generated in a previous step. This enables you to
represent lists or trees from source data that has structure. It also
requires you to produce structured data in step 2, which can then be
filtered and queried in steps 3 and 4 to produce the final output.

Note that you can also query intermediate data in step 4, not just data
produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a
simple procedural language (loops, conditions) and handy built-in functions
to produce output.

=head2 REST Query Format

The design decision is to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within
WebPAC. A simple query looks like this:

    http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. Not required when doing lookups,
only for browser usage.

=item search

Name of the output filtering method. This specifies the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

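The breakdown above can be sketched as a small parser. This is a Python
illustration under stated assumptions, not WebPAC's actual dispatcher: the
function name C<parse_rest_query> and the keys of the returned dictionary
are invented for this example, and only the standard C<urllib.parse> module
is used.

```python
from urllib.parse import urlsplit, unquote


def parse_rest_query(url):
    """Split a REST query URL into host, method, template and query parts."""
    parts = urlsplit(url)
    # The path is still percent-encoded here, so splitting on "/" is safe
    # even if a query value contains an encoded slash.
    segments = parts.path.strip("/").split("/")
    # Raises ValueError if the path has fewer than two segments.
    method, template, *query = segments
    return {
        "host": parts.netloc,        # e.g. "webpac"
        "method": method,            # output filtering method, e.g. "search"
        "template": template,        # output template, e.g. "html"
        "query": [unquote(segment) for segment in query],
    }


parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

How the decoded query segments are interpreted is left open here, since the
query string is specific to the filtering method used.
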
You can easily produce an RSS feed for the same query using the following
REST URL:

    http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.

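Going the other way, a client could build both URLs from the same query and
change only the template segment. The following Python sketch assumes a
hypothetical helper called C<build_rest_url>; it is not part of WebPAC.

```python
from urllib.parse import quote


def build_rest_url(base, method, template, query):
    """Join base URL, method, template and percent-encoded query segments."""
    # safe="" also encodes "/" inside a value, so a value can never
    # introduce an extra path segment.
    encoded = "/".join(quote(segment, safe="") for segment in query)
    return f"{base}/{method}/{template}/{encoded}"


query = ["personal_name", "Joe Doe", "AND", "year", "LT 1995"]

# Same method and query, different output template.
html_url = build_rest_url("http://webpac", "search", "html", query)
rss_url = build_rest_url("http://webpac", "search", "rss", query)
```

Because the template is just one path segment, every query has a unique URL
for each output format.
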
=head1 Technical stuff

The following text covers more hard-core technical details about how WebPAC
is implemented and why.

=head2 Search Engine

We are using the Hyper Estraier search engine, through the pgestraier
PostgreSQL bindings for it.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't
directly support roll-up and roll-down operations, but they can be emulated
using the intermediate data step or the output step.

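As a rough illustration of what emulating such operations could mean, the
Python sketch below aggregates a few made-up intermediate records at two
levels of detail. The record layout and the C<year> key are assumptions for
this example; WebPAC itself may structure this quite differently.

```python
from collections import Counter

# Hypothetical intermediate records, already normalised into chunks.
records = [
    {"title": "A", "year": 1991},
    {"title": "B", "year": 1994},
    {"title": "C", "year": 1994},
    {"title": "D", "year": 2001},
]

# Fine-grained view: number of records per year.
by_year = Counter(r["year"] for r in records)

# Roll-up: aggregate the same records to the coarser decade level.
by_decade = Counter(r["year"] // 10 * 10 for r in records)

# Going back down: list the individual records behind one decade.
nineties = [r["title"] for r in records if 1990 <= r["year"] < 2000]
```

The same grouping could be done while producing intermediate data or in an
output filter; which step is more convenient depends on the data.
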