=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's look at that in detail.

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then, however, it has adopted various other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we decided on the following work-flow
for your data:

  step

    source file     CDS/ISIS, MARC, Excel, robots, ...
        |
      1 | apply input normalisation rules (xml or yaml)
        V
    intermediate    this is re-formatted source data, converted
    data            to chunks based on tag names from config/input/
        |
      2 | optionally apply output filter (TT2)
        V
    data            search engine, HTML, OAI, RDBMS
        |
      3 | filter using query in REST format
      4 | apply output filter (TT2)
        V
    client          Web browser, SOAP

=head2 Normalisation and Intermediate data

This is the first step in working with your data.

You create one-to-one mappings from source data records to documents
in WebPAC. You can split or merge data from input records, apply filters
(perl subroutines), use lookups within the same source file or do simple
evaluations while producing output.

All of that is controlled by the C<config/input/> configuration. You
will want to create fine-grained chunks of data (like separate first and
last name), which will later be used to produce output. You can think of
the conversion process as applying the C<config/input/> recipe to
every input record.

Each tag within the recipe creates one new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least one
field within the tag.

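To make the idea of a recipe more concrete, here is a minimal sketch in Python.
It is not WebPAC's real C<config/input/> syntax: the field numbers, the
C<RECIPE> structure and the helper names are all assumptions made up for
illustration. It only shows the principle of applying one recipe to every
input record and emitting a tag when at least one of its fields is present.

```python
# Hypothetical recipe: tag name -> list of (input field, filter) pairs.
# This is NOT WebPAC's real config/input/ format, only an illustration.

def first_name(value):
    """Filter: take the part after the comma in 'Last, First'."""
    return value.split(",", 1)[1].strip() if "," in value else ""


def last_name(value):
    """Filter: take the part before the comma in 'Last, First'."""
    return value.split(",", 1)[0].strip()


RECIPE = {
    "first_name": [("700", first_name)],  # field numbers are made up
    "last_name": [("700", last_name)],
    "year": [("210", str.strip)],
}


def normalise(record, recipe):
    """Apply the recipe to one input record.

    `record` maps field names to lists of values, because input fields
    can be repeatable. A tag is emitted only if at least one of its
    fields is present in the record and yields a non-empty value.
    """
    doc = {}
    for tag, rules in recipe.items():
        values = []
        for field, filt in rules:
            for raw in record.get(field, []):
                out = filt(raw)
                if out:
                    values.append(out)
        if values:
            doc[tag] = values
    return doc


record = {"700": ["Doe, Joe", "Roe, Jane"], "210": [" 1994 "]}
doc = normalise(record, RECIPE)
```

Here one repeatable input field is split into two fine-grained tags
(C<first_name> and C<last_name>), which is the kind of granularity the
output step can build on later.
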
Users of older WebPAC versions should note that this configuration no longer
contains any formatting or specification of output type, and that the
granularity of each tag has increased.

B<This document should really be updated to reflect the Webpacus front-end from
this point on...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can create
HTML from it, data files for a search engine, or insert it into an RDBMS.

The twist is that output filters can be applied recursively, allowing
you to query data generated in a previous step. This enables you to represent
lists or trees from source data that has structure. It also requires you to
produce structured data in step 2, which can then be filtered and queried in
steps 3 and 4 to produce the final output.

Note that in step 4 you can also query the intermediate data, not
just the data produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

=head2 REST Query Format

We made the design decision to use a REST query format. It has the benefit of
simplicity and makes it possible to create unique URLs for all content within
WebPAC. A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. Not required when doing lookups, only
for browser usage.

=item search

Name of the output filtering method. This specifies the search engine.

=item html

Template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

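The breakdown above can be sketched as a small parser. This is an
illustration only, not WebPAC's actual code: the function name
C<parse_rest_query> and the returned dictionary keys are assumptions. Since
the query part is specific to the filtering method, the sketch only decodes
its segments and does not try to interpret them.

```python
from urllib.parse import urlsplit, unquote


def parse_rest_query(url):
    """Split a REST query URL into host, method, template and query.

    The path is split on '/' first and decoded afterwards, so an
    encoded slash (%2F) inside a value does not create a new segment.
    """
    parts = urlsplit(url)
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("expected /<method>/<template>/<query...>")
    method, template, *query = segments
    return {
        "host": parts.netloc,   # empty when no hostname is given
        "method": method,       # output filtering method, e.g. "search"
        "template": template,   # template used for output, e.g. "html"
        "query": query,         # decoded, method-specific query segments
    }


q = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

The same function accepts a URL without the hostname (for example
C</search/html/year/1995>), which matches the note that the hostname is not
required for lookups.
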
You can easily produce an RSS feed for the same query using the following REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.

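Going the other way, a client could assemble such URLs from the same query
and only swap the template name. The helper C<build_rest_url> below is a
hypothetical sketch, not part of WebPAC.

```python
from urllib.parse import quote


def build_rest_url(host, method, template, query):
    """Join method, template and query segments into a REST URL.

    Every segment is percent-encoded on its own (safe="" also encodes
    '/'), so values containing spaces or slashes stay in one segment.
    """
    segments = [method, template, *query]
    return host + "/" + "/".join(quote(s, safe="") for s in segments)


query = ["personal_name", "Joe Doe", "AND", "year", "LT 1995"]
html_url = build_rest_url("http://webpac", "search", "html", query)
rss_url = build_rest_url("http://webpac", "search", "rss", query)
```

Only the template segment differs between the two results, which is why the
same query can be served as HTML or as an RSS feed.
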
=head1 Technical stuff

The following text covers more hard-core technical details about how WebPAC is
implemented and why.

=head2 Search Engine

We use the Hyper Estraier search engine through the pgestraier PostgreSQL
bindings.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't directly
support roll-up and roll-down operations, but they can be emulated using
the intermediate data step or the output step.

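As a rough illustration of what emulating a roll-up could look like, the
sketch below aggregates already normalised documents by one tag. The document
layout (a dictionary of tag names to lists of values), the C<year> tag and the
function C<roll_up> are assumptions for this example, not WebPAC's data model.

```python
from collections import Counter


def roll_up(docs, tag, key=lambda value: value):
    """Count documents' values for `tag`, grouped by `key(value)`.

    A coarser `key` gives a higher aggregation level (roll-up);
    the identity key keeps the finest level.
    """
    counts = Counter()
    for doc in docs:
        for value in doc.get(tag, []):
            counts[key(value)] += 1
    return dict(counts)


docs = [
    {"year": ["1994"]},
    {"year": ["1995"]},
    {"year": ["1987"]},
]

by_year = roll_up(docs, "year")
by_decade = roll_up(docs, "year", key=lambda y: y[:3] + "0s")
```

Counting by year is the detailed view and counting by decade is the rolled-up
view of the same data.
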