This is a repository of my old source code, which is no longer updated. Go to git.rot13.org for current projects!

Log of /trunk/spider

Revision 103 - Directory Listing
Modified Sat Apr 30 23:29:27 2005 UTC (18 years, 11 months ago) by dpavlin
fixed warning


Revision 100 - Directory Listing
Modified Sat Apr 30 20:21:02 2005 UTC (18 years, 11 months ago) by dpavlin
extract title from beginning of document if no other data is found


Revision 99 - Directory Listing
Modified Sat Apr 30 20:20:42 2005 UTC (18 years, 11 months ago) by dpavlin
support for multiple directories


Revision 98 - Directory Listing
Modified Sun Apr 24 18:09:01 2005 UTC (18 years, 11 months ago) by dpavlin
added --skipoutput (for testing)


Revision 95 - Directory Listing
Modified Sun Apr 24 16:33:53 2005 UTC (18 years, 11 months ago) by dpavlin
added --exclude path


Revision 92 - Directory Listing
Modified Mon Nov 22 17:09:23 2004 UTC (19 years, 5 months ago) by dpavlin
skip symlinks


Revision 85 - Directory Listing
Modified Mon Aug 30 11:14:24 2004 UTC (19 years, 7 months ago) by dpavlin
extract metadata for LJ


Revision 84 - Directory Listing
Modified Sun Aug 29 21:19:13 2004 UTC (19 years, 7 months ago) by dpavlin
if pdf file doesn't have a title, display filename and page number


Revision 81 - Directory Listing
Modified Sat Aug 28 22:15:59 2004 UTC (19 years, 7 months ago) by dpavlin
implement snippets of content and highlighting of words


Revision 74 - Directory Listing
Modified Wed Apr 7 12:54:21 2004 UTC (20 years ago) by dpavlin
fix title extraction (again)


Revision 72 - Directory Listing
Modified Tue Apr 6 15:06:58 2004 UTC (20 years ago) by dpavlin
pdf pagination now works correctly


Revision 71 - Directory Listing
Modified Sat Apr 3 15:15:36 2004 UTC (20 years ago) by dpavlin
remove empty lines before <html> so that swish parser will catch <title>
correctly
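
A minimal sketch of the kind of fixup this revision describes, written in Python for illustration only. The function name is hypothetical and this is not the code from the repository; it assumes the only problem is whitespace-only lines at the very start of the document.

```python
import re

def strip_leading_blank_lines(html: str) -> str:
    """Drop whitespace-only lines at the very start of a document.

    Hypothetical helper: it illustrates the idea of removing empty
    lines before <html>, not the original implementation.
    """
    # \A anchors at the start of the string; the group matches one or
    # more lines that contain only spaces or tabs before the newline.
    return re.sub(r"\A(?:[ \t]*\r?\n)+", "", html)
```

Input that already starts with markup is returned unchanged, so the helper can be applied to every HTML file before it is handed to the indexer.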


Revision 69 - Directory Listing
Modified Thu Mar 18 23:07:21 2004 UTC (20 years, 1 month ago) by dpavlin
more verbose adding of titles


Revision 68 - Directory Listing
Modified Thu Mar 18 11:14:49 2004 UTC (20 years, 1 month ago) by dpavlin
don't save empty pages in index


Revision 66 - Directory Listing
Modified Wed Mar 17 12:19:42 2004 UTC (20 years, 1 month ago) by dpavlin
index pdf files page-by-page


Revision 65 - Directory Listing
Modified Wed Mar 17 12:19:14 2004 UTC (20 years, 1 month ago) by dpavlin
fixed back-references in regexps


Revision 63 - Directory Listing
Modified Fri Feb 6 13:29:39 2004 UTC (20 years, 2 months ago) by dpavlin
convert pdf files when indexing with progspider


Revision 61 - Directory Listing
Modified Thu Jan 29 18:26:19 2004 UTC (20 years, 2 months ago) by dpavlin
better extracting of titles


Revision 57 - Directory Listing
Modified Sun Jan 25 16:49:50 2004 UTC (20 years, 2 months ago) by dpavlin
various fixes


Revision 56 - Directory Listing
Modified Fri Jan 23 13:10:40 2004 UTC (20 years, 3 months ago) by dpavlin
better support for DocBook generated files


Revision 51 - Directory Listing
Modified Tue Jan 20 18:40:06 2004 UTC (20 years, 3 months ago) by dpavlin
better removal of JavaScript


Revision 50 - Directory Listing
Modified Tue Jan 20 18:13:32 2004 UTC (20 years, 3 months ago) by dpavlin
support for 0-size files


Revision 48 - Directory Listing
Modified Tue Jan 20 16:01:13 2004 UTC (20 years, 3 months ago) by dpavlin
removed debugging output


Revision 46 - Directory Listing
Modified Sat Jan 17 23:57:55 2004 UTC (20 years, 3 months ago) by dpavlin
- moved text/html content filtering to filter.pm to facilitate code re-use
- added progspider which can be used with -S prog to crawl files and
  use filtering subroutines


Revision 45 - Directory Listing
Modified Wed Nov 19 12:07:07 2003 UTC (20 years, 5 months ago) by dpavlin
fixes and improvements


Revision 42 - Directory Listing
Modified Tue Jul 29 10:40:58 2003 UTC (20 years, 8 months ago) by dpavlin
better handling of chars in URL, support for
<!-- noindex -->, <!-- index --> which is supported natively in swish 2.4
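
One way a pre-indexing filter could honor these markers is sketched below in Python. This is an assumption about the general approach, not the repository's code: the names are hypothetical, and an unterminated <!-- noindex --> is assumed to hide everything up to the end of the document.

```python
import re

# Matches from a "noindex" comment up to the next "index" comment,
# or to the end of the document if the block is never closed.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?(?:<!--\s*index\s*-->|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def drop_noindex_blocks(html: str) -> str:
    """Remove content the author marked as not to be indexed (hypothetical helper)."""
    return NOINDEX_BLOCK.sub("", html)
```

The lazy `.*?` keeps each match as short as possible, so several marked blocks in one document are removed independently and the text between them is kept.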


Revision 40 - Directory Listing
Modified Sun Jun 1 11:45:19 2003 UTC (20 years, 10 months ago) by dpavlin
- support for listing of files in .tar.gz; decompressing of .gz and .bz2
  content
- changed order of arguments for swishspider: now baseurl,url (but it's
  backwards compatible, so your old configurations will work)
- do html fixup just on html files (to prevent binary archive corruption)
- crawl sites that have frames


Revision 32 - Directory Listing
Modified Wed Apr 30 12:40:09 2003 UTC (20 years, 11 months ago) by dpavlin
added make_config.pl which creates swish config file
added checkbox to hide document properties (like content, size etc)
remove comments between <html> and <head> which confuse swish


Revision 30 - Directory Listing
Modified Mon Mar 24 09:57:44 2003 UTC (21 years, 1 month ago) by dpavlin
added instructions about formatting of html before indexing it (and added
ability to unroll wrongly split tags into a form which is acceptable to swish)


Revision 15 - Directory Listing
Modified Sun Mar 16 21:31:55 2003 UTC (21 years, 1 month ago) by dpavlin
support for image map and skip pictures (speedup)


Revision 1 - Directory Listing
Added Tue Jun 4 06:39:53 2002 UTC (21 years, 10 months ago) by dpavlin
Initial revision

