lib/WebPAC/Manual.pod

=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's look at that in detail.

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then, however, it has adopted various other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we settled on the following workflow for your
data:

  step

   source file          CDS/ISIS, MARC, Excel, robots, ...
      |
  1   | apply input normalisation rules (xml or yaml)
      V
   intermediate         this data is re-formatted source data converted
     data               to chunks based on tag names from config/input/
      |
  2   | optionally apply output filter (TT2)
      V
     data               search engine, HTML, OAI, RDBMS
      |
  3   | filter using query in REST format
  4   | apply output filter (TT2)
      V
    client              Web browser, SOAP

=head2 Normalisation and Intermediate data

This is the first step in working with your data.

You create one-to-one mappings from source data records to documents in
WebPAC. You can split or merge data from input records, apply filters
(Perl subroutines), use lookups within the same source file, or do simple
evaluations while producing output.

All of this is controlled by the C<config/input/> configuration file. You
will want to create fine-grained chunks of data (such as separate first and
last names), which will later be used to produce output. You can think of
the conversion process as applying the C<config/input/> recipe to
every input record.

Each tag within the recipe creates a new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least
one field within the tag.
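
The recipe idea can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: WebPAC itself is written in Perl, and
the record layout, field numbers, tag names and the C<normalise> and
C<split_name> helpers below are all hypothetical, not WebPAC's configuration
syntax or API.

```python
# Hypothetical sketch: none of these names come from WebPAC itself.

def split_name(value):
    """Filter: split 'Last, First' into separate first and last name chunks."""
    last, _, first = value.partition(",")
    return {"first_name": first.strip(), "last_name": last.strip()}

# A "recipe" maps each output tag to the input fields it draws from
# and an optional filter applied to every value.
RECIPE = {
    "author": {"fields": ["700", "701"], "filter": split_name},
    "title": {"fields": ["200"], "filter": None},
}

def normalise(record, recipe):
    """Apply the recipe to one input record, returning chunks keyed by tag."""
    chunks = {}
    for tag, rule in recipe.items():
        values = []
        for field in rule["fields"]:
            # Input fields can be repeatable, so each one holds a list.
            for value in record.get(field, []):
                values.append(rule["filter"](value) if rule["filter"] else value)
        if values:  # a tag is produced only if at least one field matched
            chunks[tag] = values
    return chunks

# Made-up input record with one title and two (repeatable) name fields.
record = {"200": ["Example Title"], "700": ["Doe, Joe"], "701": ["Roe, Jane"]}
result = normalise(record, RECIPE)
```

In this sketch a tag appears in the result only when at least one of its
input fields is present, and a filter is what turns one coarse value into
finer-grained chunks.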

Users of older WebPAC versions should note that this file no longer contains
formatting or output type specifications, and that the granularity of each
tag has increased.

B<this document should really be updated to reflect Webpacus front-end from
this point...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can
create HTML from it, generate data files for a search engine, or insert it
into an RDBMS.

The twist is that output filters can be applied recursively, allowing you
to query data generated in a previous step. This enables you to represent
lists or trees from structured source data. It also requires you to
produce structured data in step 2, which can then be filtered and queried
in steps 3 and 4 to produce the final output.

Note that you can also query intermediate data in step 4, not just data
produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

=head2 REST Query Format

We made a design decision to use a REST query format. It has the benefit of
simplicity and makes it possible to create unique URLs for all content
within WebPAC. A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. It is not required when doing
lookups, only for browser usage.

=item search

Name of the output filtering method. This specifies the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. Its format is specific to the filtering method used.

=back
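
The breakdown above can be sketched as a small parser. This is an
illustrative Python example, not WebPAC's own code: the C<parse_rest_query>
function and the dictionary keys are hypothetical, and the path layout
(method, then template, then query segments) is assumed from the example URL.

```python
from urllib.parse import urlsplit, unquote

def parse_rest_query(url):
    """Split a WebPAC-style REST URL into the parts described in the text."""
    parts = urlsplit(url)
    # Path layout assumed from the example: /<method>/<template>/<query...>
    segments = parts.path.strip("/").split("/")
    if len(segments) < 2:
        raise ValueError("expected at least /<method>/<template>")
    method, template, *query = segments
    return {
        "host": parts.netloc,      # empty when only a path is given
        "method": method,          # output filtering method, e.g. "search"
        "template": template,      # e.g. "html"
        # Decoded segments; their meaning is still specific to the method.
        "query": [unquote(s) for s in query],
    }

parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

The query segments are decoded but deliberately left uninterpreted here,
since their meaning depends on the filtering method in use.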

You can easily produce an RSS feed for the same query using the following REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.
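
Going the other way, here is a minimal Python sketch of how a client might
build both URLs from one list of query terms. The C<build_rest_url> helper
and its parameters are assumptions made for this example; WebPAC does not
necessarily provide anything like it.

```python
from urllib.parse import quote

def build_rest_url(method, template, terms, host=None):
    """Assemble a REST URL; only the template segment differs between formats."""
    # quote() with safe="" also escapes "/" so a term cannot add path segments.
    path = "/".join([method, template] + [quote(t, safe="") for t in terms])
    prefix = f"http://{host}" if host else ""
    return f"{prefix}/{path}"

terms = ["personal_name", "Joe Doe", "AND", "year", "LT 1995"]
html_url = build_rest_url("search", "html", terms, host="webpac")
rss_url = build_rest_url("search", "rss", terms, host="webpac")
```

Leaving out C<host> yields a path-only URL, which matches the note that the
hostname is not required for lookups.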

=head1 Technical stuff

The following text covers more hard-core technical details about how WebPAC
is implemented and why.

=head2 Search Engine

We use the Hyper Estraier search engine through the pgestraier PostgreSQL
bindings for it.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't
directly support roll-up and roll-down operations, but they can be emulated
using the intermediate data step or the output step.
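
As a rough illustration of what emulating a roll-up could look like, the
Python sketch below counts records per year and then aggregates the same
records one level up, per decade. The records, field names and the
C<roll_up> helper are made up for this example and do not reflect how WebPAC
stores or processes intermediate data.

```python
from collections import Counter

# Hypothetical intermediate records; the field names are made up.
records = [
    {"title": "A", "year": 1987},
    {"title": "B", "year": 1989},
    {"title": "C", "year": 1994},
    {"title": "D", "year": 1994},
]

def roll_up(records, key):
    """Count records per group, where key maps a record to its group."""
    return dict(Counter(key(r) for r in records))

# Fine-grained view: one group per year.
by_year = roll_up(records, lambda r: r["year"])
# Rolled up one level: years aggregated into decades.
by_decade = roll_up(records, lambda r: r["year"] // 10 * 10)
```

Moving in the opposite direction is then a matter of re-grouping the same
records with a finer key.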
