Listserv to Mailman Part 1.2: Installing Swish for Archive Searching

Background

Mailman, unlike Listserv, doesn’t come with built-in list archive searching. (Which is funny since I’d always thought Listserv’s archive search was clunky and dated—but at least it had archive search! :))

I think Mailman may not have archive searching because the developers wanted to keep their focus on the mailing list software itself rather than trying to write and maintain search software too. They probably figured it would be better to leave search to search software developers, and also give people the freedom to install whatever search package they want.

Nonetheless, it would have been a great relief if Mailman had just come with a default search package for list archives that could be uninstalled/overridden if necessary, rather than making everyone who wants to provide list archive searching (which must be a pretty common requirement) re-invent the search wheel by hunting down a package and figuring out how to integrate it into Mailman.

The Swish-E Package

After a good deal of searching (ha) I eventually found that many others who also wanted searchable list archives seemed to lean toward using Swish-E (swish-e.org, Wikipedia).

(Note: Since writing this guide, I’ve found that others have used htdig successfully for archive searching too; see the _README file at msapiro.net/mm/ and this documentation.)

Fortunately, I already had experience with Swish because our organization used it for the search feature on its main website.

I wish I could provide step-by-step instructions for setting up Swish, but I no longer remember the steps I used to install it from source on our development box (web host) back in 2003, and the server admins on our production list host installed it for me, probably using a binary package manager like RPM or Apt.

Even though I don’t have step-by-step instructions for installing Swish (though if you are root, downloading/unpacking the source and running “./configure”, “make”, and “make install” will probably do the trick), I wanted to include a section about it for a few reasons.

Aside from pointing you to the main swish-e.org site, in particular the Download and Documentation sections (the latter including a nice INSTALL page), I also wanted to point you to the extremely helpful Integrating Mailman with a Swish-e Search Engine page.

I ended up only following some of that page’s advice, but it was invaluable for getting started on the Listserv to Mailman archive conversion as far as what issues to consider. So that page is definitely worth checking out just to see what’s involved. (Though note that I ended up writing code to automate some of the things on that page, which I’ll cover later.)

First though, just focus on installing Swish-e itself, which indexes and searches 1) the Mailman HTML archive pages (one page per message) and, optionally, 2) any message attachment files such as PDFs or Word docs by using extra Swish add-on/filter programs.

Using Swish-E for Searching Message Attachments

While the basic setup of Swish for HTML archive messages is fairly easy, setting it up to search non-HTML/text files such as Word, Excel, and PDF attachments is a little trickier. I struggled with this when I set up Swish to search binary files on our website, so I wanted to give some info on it in case you want to support searching message attachments too.

(Mailman does allow attachments to be included in list archives; it just separates each from its associated message and puts a link to the attachment file at the bottom of the archived message.)

Note though: This entire section (the rest of this page) is optional. To be blunt, if you don’t have to support searching archive message attachments, make your life easier and don’t do it; you can always go back and add it later if you need to. We didn’t even do it for our list archives; I’m just including this info about using Swish to index binary files based on my experience doing that on our main website.

For a general article about indexing HTML pages and other file types with Swish-E, see How to Index Anything from the Linux Journal in 2003. At first glance that article looks very good, but I haven’t examined it in detail and it’s possible that some details have changed since 2003.

Basically, for non-text/HTML files Swish relies on external helper programs to extract text from each file. For example, it uses the pdftotext program in the xpdf package for extracting text from PDF files, the catdoc program to get text from Word .doc files, etc. See the “Optional But Recommended Packages” section of the Swish-E INSTALL doc for more info on what’s available.
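To make the "one helper per file type" idea concrete, here is a minimal Python sketch of the dispatch. It is not how Swish itself is implemented; the HELPERS table, the function names, and the bare catdoc invocation are assumptions for illustration only. The one detail taken from the examples on this page is that pdftotext accepts "-" as the output file to write to standard output.

```python
import os
import shutil

# Hypothetical mapping from file extension to a helper command template.
# These templates are illustrative assumptions, not Swish-E's own settings.
HELPERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" sends the extracted text to stdout
    ".doc": ["catdoc", "{path}"],
}

def build_command(path):
    """Return the argument list that would extract text from path, or None."""
    ext = os.path.splitext(path)[1].lower()
    template = HELPERS.get(ext)
    if template is None:
        return None  # plain HTML/text needs no helper
    return [arg.format(path=path) for arg in template]

def missing_helpers():
    """List helper programs from HELPERS that are not found on the PATH."""
    return sorted({t[0] for t in HELPERS.values() if shutil.which(t[0]) is None})

print(build_command("minutes.PDF"))
print(missing_helpers())
```

Running missing_helpers() before a first indexing run is a cheap way to find out which helper packages still need installing.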

(I haven’t yet tackled the issue of extracting text from Word’s newer .docx file format for Swish; if I do, I’ll update this page, but if you’ve already done so please let me know; some possibilities are 1) docx2txt, 2) unoconv, though that might be unusable by Swish because it requires a running OpenOffice instance, or 3) maybe even a quick-and-dirty unzip/sed/grep combination.)
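As a rough illustration of the quick-and-dirty unzip approach, the sketch below does the same thing in Python rather than with unzip/sed/grep. A .docx file is a zip archive whose main body lives in word/document.xml, so the sketch reads that member and strips the XML tags. The function name docx_to_text is made up for this example, and this has not been tried as an actual Swish filter; it also ignores headers, footers, and footnotes, which live in other members of the archive.

```python
import html
import re
import zipfile

def docx_to_text(path):
    """Crude text extraction from a .docx file (main document body only)."""
    with zipfile.ZipFile(path) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    # Treat the end of each Word paragraph as a line break.
    xml = re.sub(r"</w:p>", "\n", xml)
    # Drop every remaining tag, leaving only the text between them.
    text = re.sub(r"<[^>]+>", "", xml)
    # Turn XML entities such as &amp; back into plain characters.
    return html.unescape(text)
```

A small wrapper script that prints docx_to_text(path) for the path it is given could then, in principle, be registered as a filter program in the same way as pdftotext or catdoc.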

For more information about supporting searching of PDF, Word, etc. files, see “How do I index my PDF, Word, and compressed documents?” and the sections after it on the Swish-E FAQ page as well as the example filters in the “Document Filter Directives” section of the SWISH-CONFIG man page.

Note: I had to tweak the example given on that last page for PDF and DOC files, which were the only two binary file types I included in our website search. Specifically, the SWISH-CONFIG page gave the example of:

FileFilter .pdf       pdftotext   "%p -"

and that produced an error during indexing; every time the Swish indexer encountered a PDF file and tried to run pdftotext, it printed pdftotext’s usage info:

pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  ... etc.

To fix this, I had to tweak it to:

FileFilter .pdf pdftotext "'%p' -"

Actually I chose to spell out the full /path/to/pdftotext instead of just pdftotext there, but you get the idea—the main difference is in the quoting at the end, to put %p within its own single quotes.

I had to do the same thing with the catdoc example; the SWISH-CONFIG page suggested:

FileFilter .doc     /usr/local/bin/catdoc "-s8859-1 -d8859-1 %p"

… but that failed too so I enclosed the %p in single quotes there as well:

FileFilter .doc /path/to/our/catdoc "-s8859-1 -d8859-1 '%p'"
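Why the extra single quotes help is not stated above, so treat the following as one plausible explanation rather than a diagnosis: if the filter command line is handed to a shell, a file name containing a space gets split into several arguments, and pdftotext then sees too many arguments and prints its usage text. The Python sketch below uses shlex.split to mimic POSIX shell word splitting; the file name and the expand helper are hypothetical.

```python
import shlex

def expand(template, path):
    """Substitute a file path for %p in a filter options string."""
    return template.replace("%p", path)

path = "/archives/Annual Report.pdf"  # hypothetical file name containing a space

# Tokenize each command line the way a POSIX shell would.
unquoted = shlex.split("pdftotext " + expand("%p -", path))
quoted = shlex.split("pdftotext " + expand("'%p' -", path))

print(unquoted)  # ['pdftotext', '/archives/Annual', 'Report.pdf', '-']
print(quoted)    # ['pdftotext', '/archives/Annual Report.pdf', '-']
```

Note that single quotes would still break on a file name that itself contains a single quote, so this is a mitigation rather than a complete fix.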

Character Set Problems While Indexing

Another issue was that at some point I started getting character set error messages when running the Swish indexer. I wish I’d documented the problem and my solution when it happened, but I’ve done my best to reconstruct the issue in case it’s useful to someone.

I believe the error may have been something about catdoc not being able to find the ascii.replchars and/or ascii.specchars character set files. I think I hunted around to find these files, but since catdoc seems to be a fairly old (and seemingly unmaintained) package, my search was fruitless.

Ultimately I think my solution was that I noticed I did have ascii.rpl and ascii.spc files, so I copied those to ascii.replchars and ascii.specchars and that fixed the problem.
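For anyone who wants to script that workaround, here is a minimal Python sketch. The directory holding catdoc's character set files varies by installation, so charset_dir is a placeholder you would have to supply, and the function name is made up for this example. It only copies a file when the source exists and the target is missing, so it is safe to run more than once.

```python
import os
import shutil

def copy_charset_aliases(charset_dir):
    """Copy ascii.rpl/ascii.spc to the names catdoc asked for, if they are missing."""
    pairs = [
        ("ascii.rpl", "ascii.replchars"),
        ("ascii.spc", "ascii.specchars"),
    ]
    copied = []
    for src_name, dst_name in pairs:
        src = os.path.join(charset_dir, src_name)
        dst = os.path.join(charset_dir, dst_name)
        # Never overwrite an existing target; skip if the source is absent.
        if os.path.exists(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)
            copied.append(dst_name)
    return copied
```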

I’m also still getting the following error (from pdftotext I believe) when running the Swish indexer on our web site:

Error: Unknown character collection 'Adobe-Korea1'

I searched online and couldn’t find anything for that specific character set, but when I searched for “swish unknown character collection” I found this post which recommended upgrading the xpdf package as a possible solution. I haven’t tried it yet because it’s only a few errors and I hate to upgrade things unless I absolutely have to, but I wanted to mention it here in case someone else gets similar errors using Swish to index PDFs.
