This is a repository of my old source code, which is no longer updated. Go to git.rot13.org for current projects!
Log of /trunk/spider
Directory Listing
Revision
103 -
Directory Listing
Modified
Sat Apr 30 23:29:27 2005 UTC
(18 years, 11 months ago)
by
dpavlin
fixed warning
Revision
100 -
Directory Listing
Modified
Sat Apr 30 20:21:02 2005 UTC
(18 years, 11 months ago)
by
dpavlin
extract title from beginning of document if no other data is found
Revision
99 -
Directory Listing
Modified
Sat Apr 30 20:20:42 2005 UTC
(18 years, 11 months ago)
by
dpavlin
support for multiple directories
Revision
98 -
Directory Listing
Modified
Sun Apr 24 18:09:01 2005 UTC
(18 years, 11 months ago)
by
dpavlin
added --skipoutput (for testing)
Revision
95 -
Directory Listing
Modified
Sun Apr 24 16:33:53 2005 UTC
(18 years, 11 months ago)
by
dpavlin
added --exclude path
Revision
92 -
Directory Listing
Modified
Mon Nov 22 17:09:23 2004 UTC
(19 years, 4 months ago)
by
dpavlin
skip symlinks
Revision
85 -
Directory Listing
Modified
Mon Aug 30 11:14:24 2004 UTC
(19 years, 7 months ago)
by
dpavlin
extract metadata for LJ
Revision
84 -
Directory Listing
Modified
Sun Aug 29 21:19:13 2004 UTC
(19 years, 7 months ago)
by
dpavlin
if pdf file doesn't have a title, display filename and page number
Revision
81 -
Directory Listing
Modified
Sat Aug 28 22:15:59 2004 UTC
(19 years, 7 months ago)
by
dpavlin
implement snippets of content and highlighting of words
Revision
74 -
Directory Listing
Modified
Wed Apr 7 12:54:21 2004 UTC
(19 years, 11 months ago)
by
dpavlin
fix title extraction (again)
Revision
72 -
Directory Listing
Modified
Tue Apr 6 15:06:58 2004 UTC
(19 years, 11 months ago)
by
dpavlin
pdf pagination now works correctly
Revision
71 -
Directory Listing
Modified
Sat Apr 3 15:15:36 2004 UTC
(19 years, 11 months ago)
by
dpavlin
remove empty lines before <html> so that swish parser will catch <title>
correctly
Revision
69 -
Directory Listing
Modified
Thu Mar 18 23:07:21 2004 UTC
(20 years ago)
by
dpavlin
more verbose adding of titles
Revision
68 -
Directory Listing
Modified
Thu Mar 18 11:14:49 2004 UTC
(20 years ago)
by
dpavlin
don't save empty pages in index
Revision
66 -
Directory Listing
Modified
Wed Mar 17 12:19:42 2004 UTC
(20 years ago)
by
dpavlin
index pdf files page-by-page
Revision
65 -
Directory Listing
Modified
Wed Mar 17 12:19:14 2004 UTC
(20 years ago)
by
dpavlin
fixed back-references in regexps
Revision
63 -
Directory Listing
Modified
Fri Feb 6 13:29:39 2004 UTC
(20 years, 1 month ago)
by
dpavlin
convert pdf files when indexing with progspider
Revision
61 -
Directory Listing
Modified
Thu Jan 29 18:26:19 2004 UTC
(20 years, 2 months ago)
by
dpavlin
better extracting of titles
Revision
57 -
Directory Listing
Modified
Sun Jan 25 16:49:50 2004 UTC
(20 years, 2 months ago)
by
dpavlin
various fixes
Revision
56 -
Directory Listing
Modified
Fri Jan 23 13:10:40 2004 UTC
(20 years, 2 months ago)
by
dpavlin
better support for DocBook generated files
Revision
51 -
Directory Listing
Modified
Tue Jan 20 18:40:06 2004 UTC
(20 years, 2 months ago)
by
dpavlin
better removal of JavaScript
Revision
50 -
Directory Listing
Modified
Tue Jan 20 18:13:32 2004 UTC
(20 years, 2 months ago)
by
dpavlin
support for 0-size files
Revision
48 -
Directory Listing
Modified
Tue Jan 20 16:01:13 2004 UTC
(20 years, 2 months ago)
by
dpavlin
removed debugging output
Revision
46 -
Directory Listing
Modified
Sat Jan 17 23:57:55 2004 UTC
(20 years, 2 months ago)
by
dpavlin
- moved text/html content filtering to filter.pm to facilitate code re-use
- added progspider which can be used with -S prog to crawl files and
use filtering subroutines
Revision
45 -
Directory Listing
Modified
Wed Nov 19 12:07:07 2003 UTC
(20 years, 4 months ago)
by
dpavlin
fixes and improvements
Revision
42 -
Directory Listing
Modified
Tue Jul 29 10:40:58 2003 UTC
(20 years, 8 months ago)
by
dpavlin
better handling of chars in URL, support for
<!-- noindex -->, <!-- index --> which is supported natively in swish 2.4
Revision
40 -
Directory Listing
Modified
Sun Jun 1 11:45:19 2003 UTC
(20 years, 10 months ago)
by
dpavlin
- support for listing of files in .tar.gz; decompressing of .gz and .bz2
content
- changed order of arguments for swishspider: now baseurl,url (but it's
backwards compatible, so your old configurations will work)
- do html fixup just on html files (to prevent binary archive corruption)
- crawl sites that have frames
Revision
32 -
Directory Listing
Modified
Wed Apr 30 12:40:09 2003 UTC
(20 years, 11 months ago)
by
dpavlin
added make_config.pl which creates swish config file
added checkbox to hide document properties (like content, size etc)
remove comments between <html> and <head> which confuse swish
Revision
30 -
Directory Listing
Modified
Mon Mar 24 09:57:44 2003 UTC
(21 years ago)
by
dpavlin
added instructions about formatting of html before indexing it (and added
ability to unroll wrongly split tags into a form which is acceptable to swish)
Revision
15 -
Directory Listing
Modified
Sun Mar 16 21:31:55 2003 UTC
(21 years ago)
by
dpavlin
support for image map and skip pictures (speedup)
Revision
1 -
Directory Listing
Added
Tue Jun 4 06:39:53 2002 UTC
(21 years, 9 months ago)
by
dpavlin
Initial revision