Anthony R. Thompson's Blog

Helpful Things

9 Jul 10

Listserv to Mailman Part 3.2: Setting Up Archive Search

Introduction

Mailman doesn’t come with built-in archive searching. Instead, maybe in the Unix tradition of “do one thing and do it well, leave the rest to other tools”, you must find, install, and integrate your own search package. In some places it’s mentioned that people have used the Swish search package successfully, but usually no further explanation of how to do this is given.

(Note that there are actually two “swish” index/search packages out there, Swish-E and Swish++. For this guide, and on our server, I used Swish-E because we already used it for search on our main website.)

Mailman seems to add each message to the list archive (mbox file and HTML archive pages) as it’s processed for delivery/handling. We needed to somehow periodically have Swish generate a search index from all the HTML archive pages.

We also had to tweak the Mailman installation to add a search box to the list template pages, and add other tweaks to limit searching and search results to subscribers only (since all our lists and their archives were private).

Adding a Search Box to List Templates

To enable list archive searching, we needed a place where people could enter their search parameters, which required editing a few Mailman list template files.

As written at http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine, we didn’t want to edit the installed Mailman templates. Instead, we copied the default templates to a special area and edited the copies, which override the default ones. This made sure that our customized templates wouldn’t be overwritten by a Mailman upgrade.

On our server, the default templates were in /etc/mailman/en/ so we created /etc/mailman/site/en/ and copied the archidxhead.html, archtoc.html, and archtocnombox.html files from the former directory to the latter.
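The copy step above can be sketched in Python. This is only an illustration, not part of the original setup: the function name copy_templates is made up here, and the two directories are passed in as arguments because they differ between installations.

```python
import shutil
from pathlib import Path

# The three template names come from the text above.
TEMPLATES = ["archidxhead.html", "archtoc.html", "archtocnombox.html"]


def copy_templates(default_dir, site_dir, names=TEMPLATES):
    """Copy default templates into the site override directory.

    Files that already exist in site_dir are left alone so that earlier
    customizations are not clobbered. Returns the paths that were copied.
    """
    default_dir, site_dir = Path(default_dir), Path(site_dir)
    site_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        target = site_dir / name
        if not target.exists():
            shutil.copy2(default_dir / name, target)
            copied.append(target)
    return copied
```

On a server laid out like ours, the call would be copy_templates("/etc/mailman/en", "/etc/mailman/site/en").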

Then I edited archidxhead.html to add the following line:

<li><b><a href="/cgi-bin/mailman/search/%(listname)s"
>Search the archives of this list</a></b></li>

archtoc.html was edited to have:

<p>You can get <a href="%(listinfo)s">more information about this list</a>,
<a href="%(fullarch)s">download the full raw archive</a> (%(size)s),
or <a href="/cgi-bin/mailman/search/%(listname)s"
>search all the archives of the list</a>.</p>

Finally, archtocnombox.html was edited to have:

<p>You can get <a href="%(listinfo)s">more information about this list</a>
or <a href="/cgi-bin/mailman/search/%(listname)s"
>search the list archives</a>.</p>

Mailman seems to load templates into memory when starting up, so to get it to recognize the custom template overrides, we had to restart Mailman (which I think was done with /etc/init.d/mailman restart but I’m not 100% sure because our server admins actually did the restart).

Note that if your Mailman CGI scripts end with .cgi (e.g., if you built from source and passed the --with-cgi-ext=.cgi flag to the configure script), the link target would be search.cgi instead of search in the above snippets.

After putting links to the search script on the Mailman templates, we had to put the search script in place to receive search queries. This was a bit complicated due to our requirement to keep search (and search results) limited to subscribers only, and it’s where we had to deviate from the Integrating Mailman with a Swish-e search engine page because that didn’t cover keeping searches private.

The script I wrote to create the Swish search index for each list (arch_index.py) also does as much setup for Swish searching as possible but there are some one-time setup things we had to do manually first.

Before I describe those one-time setup steps, I want to describe the general process of using arch_index.py to create the Swish index files for Mailman lists.

Generating a Search Index with Swish

Swish doesn’t search the HTML pages in the archive directly, probably because it would be too slow. Instead, it searches against its own optimized index file. That’s faster, but periodically the index file itself needs to be updated (regenerated completely, actually, as Swish doesn’t seem to support incrementally adding to index files).

Since I’d have to do the same steps to generate the Swish index file for each list archive, I automated the process with a script called arch_index.py, named somewhat in honor of the Mailman bin program “arch”.

In addition to generating the Swish index file for one or more lists, I also had arch_index.py set up the CGI script which uses Swish to search a given list (based on the swish.cgi provided with the Swish package), and set up a configuration file for the CGI script to work correctly.

If arch_index.py is just given the Mailman archive directory as a parameter, and no other options, it creates Swish search indexes for all lists (except the built-in “mailman” list, though there is a flag to force indexing of that too):

arch_index.py /var/lib/mailman/archives/

To create the index (and search config files, etc.) for just a particular list you can do:

arch_index.py -l some-listname /var/lib/mailman/archives/

arch_index.py assumes the Swish indexer program is located at /usr/bin/swish-e and the default Swish CGI search script was installed at /usr/lib/swish-e/swish.cgi—though you can override both on the command line or by editing arch_index.py itself. To see all options, just type arch_index.py with no arguments.

arch_index.py also enables searching list archives by date. Swish’s search by date feature looks at the modification times of indexed files, but for a list converted from Listserv all the HTML archive pages would be generated at once and have the same modification time. So to support date searching, arch_index.py looks at each posting’s date and then sets the file modification time to that date.
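Here is a minimal sketch of the modification-time trick, not the actual arch_index.py code. It assumes the posting date has already been extracted as an RFC 2822 string (like a Date: header in the mbox file); how the date is pulled out of each HTML page is left out. The function name is hypothetical.

```python
import os
from email.utils import parsedate_to_datetime


def set_mtime_from_date(path, date_header):
    """Set a file's access and modification times from an RFC 2822 date.

    Returns the timestamp (seconds since the epoch) that was applied.
    """
    timestamp = parsedate_to_datetime(date_header).timestamp()
    os.utime(path, (timestamp, timestamp))
    return timestamp
```

After this runs on every message page, a date-limited Swish search sees each file as last modified when the message was posted rather than when the archive was regenerated.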

We should probably describe exactly what arch_index.py does though…

What arch_index.py Does

You can look at the source code yourself of course, but here’s a quick summary of what arch_index.py does:

  1. Checks command line options and figures out what, if any, lists to index, and whether the swish-e and swish.cgi files are available
  2. Copies the default swish.cgi file (if it hasn’t been copied already) and customizes it
  3. Creates a config file for the customized swish.cgi script (if it hasn’t been created before)
  4. Updates the modification times of any new HTML archive message files for each list, to match their message posting dates
  5. Creates a swish indexing config file for each list, if necessary
  6. Actually runs swish for each list to create the swish index files
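Steps 1, 5, and 6 can be sketched roughly as follows. This is an illustration under assumptions, not the real script: the function names and the index file path passed in are hypothetical, and the actual config that arch_index.py writes may contain different directives. IndexDir, IndexFile, and IndexOnly are Swish-E configuration directives, and -c is the swish-e option that names a config file.

```python
import os


def find_lists(private_dir, include_mailman=False):
    """Return the names of lists that have an HTML archive directory."""
    names = []
    for entry in sorted(os.listdir(private_dir)):
        full = os.path.join(private_dir, entry)
        if not os.path.isdir(full) or entry.endswith(".mbox"):
            continue  # skip plain files and the raw listname.mbox directories
        if entry == "mailman" and not include_mailman:
            continue  # the built-in list is skipped unless forced
        names.append(entry)
    return names


def swish_config_text(list_dir, index_file):
    """Build a small per-list indexing config (contents are an assumption)."""
    lines = [
        "IndexDir %s" % list_dir,      # where the HTML message pages live
        "IndexFile %s" % index_file,   # where the index should be written
        "IndexOnly .html",             # ignore everything but HTML pages
    ]
    return "\n".join(lines) + "\n"


def index_command(config_path, swish_bin="/usr/bin/swish-e"):
    """Return the argument list that would run the indexer for one list."""
    return [swish_bin, "-c", config_path]
```

The argument list from index_command() could then be handed to subprocess.run() once per list.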

Creating a Custom Swish Search Template for All Lists

Part of the customization in item #2 above is changing the default swish.cgi file to refer to a custom template file, similar to the process described in Integrating Mailman with a Swish-e search engine.

So we set up the custom template file by copying TemplateDefault.pm in /usr/lib/swish-e/perl/SWISH/ to TemplateDefault_MM.pm in the same directory and making four changes:

  1. package SWISH::TemplateDefault was changed to package SWISH::TemplateDefault_MM
  2. my $advanced_link = qq[<small><a href="$form">advanced form</a></small>] was changed to my $advanced_link = qq[<small><a href="$form$ENV{'PATH_INFO'}">advanced form</a></small>]
  3. <form method="get" action="$form" enctype="application/x-www-form-urlencoded" class="form"> was changed to <form method="get" action="$form$ENV{'PATH_INFO'}" enctype="application/x-www-form-urlencoded" class="form">
  4. The following was added after $query_href and $pages:
    $query_href =~ s#search\?#search$ENV{'PATH_INFO'}\?#g;
    $pages =~ s#search\?#search$ENV{'PATH_INFO'}\?#g;

(If you prefer patch files you can download TemplateDefault_MM.patch, change into /usr/lib/swish-e/perl/SWISH/ and then run “patch < TemplateDefault_MM.patch”; see The Ten Minute Guide to diff and patch for more info.)

$ENV{'PATH_INFO'} had to be added in all those places to let one master Swish search template work for all lists, because the listname is passed to the search script as an extra path info parameter (i.e., /cgi-bin/mailman/search/listname).
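To show what the two Perl substitutions in change #4 do, here is the same rewrite expressed in Python. It is only an illustration of the effect on a link; the function name and the sample URL in the usage note are made up.

```python
import re


def add_path_info(html, path_info):
    """Insert the extra path info after "search" in every "search?" link."""
    # A function is used as the replacement so that any backslashes in
    # path_info are inserted literally rather than treated as escapes.
    return re.sub(r"search\?", lambda m: "search" + path_info + "?", html)
```

For example, with a path info of /abc-listname, a link to /cgi-bin/mailman/search?query=foo becomes /cgi-bin/mailman/search/abc-listname?query=foo, so paging and "advanced form" links stay on the same list.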

arch_index.py copies the swish.cgi file in /usr/lib/swish-e/ into /var/lib/mailman/archives/private/ and then 1) changes SWISH::TemplateDefault to SWISH::TemplateDefault_MM and 2) changes $DEFAULT_CONFIG_FILE to point to /var/lib/mailman/archives/private/swish.cgi.conf which arch_index.py also creates.
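The two edits to the copied swish.cgi could be done along these lines. This is a sketch, not the code in arch_index.py, and it assumes that swish.cgi assigns $DEFAULT_CONFIG_FILE a quoted string literal; check your copy of swish.cgi before relying on the pattern. The function name is hypothetical.

```python
import re


def customize_swish_cgi(source, config_path):
    """Return swish.cgi source text adjusted for Mailman searching."""
    # \b stops the match from firing on a name that already ends in _MM,
    # so running the function twice does not produce ..._MM_MM.
    source = re.sub(r"SWISH::TemplateDefault\b", "SWISH::TemplateDefault_MM", source)
    # Assumption: the variable is assigned a single- or double-quoted string.
    source = re.sub(
        r"(\$DEFAULT_CONFIG_FILE\s*=\s*)(['\"]).*?\2",
        lambda m: m.group(1) + "'" + config_path + "'",
        source,
        count=1,
    )
    return source
```

Making the rewrite safe to repeat fits the "if it hasn’t been copied already" behaviour described earlier, since re-running the script should not damage a file it already customized.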

The /var/lib/mailman/archives/private/swish.cgi.conf file created by arch_index.py uses an $ENV{'LISTNAME'} environment variable that /cgi-bin/mailman/search (which we haven’t covered yet) sets from its extra path info parameter. In other words, using an environment variable in the swish.cgi.conf file allows there to be one master configuration file which can dynamically refer to a different search index file for each list.
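The real swish.cgi.conf is Perl, but the idea of one config serving every list can be sketched in Python. Everything named here is hypothetical, in particular the ".swish-index" file naming, which is an assumption for illustration and not necessarily what arch_index.py uses. Lowercasing the name reflects my understanding that Mailman handles list names in lowercase.

```python
import os


def listname_from_path_info(path_info):
    """Pull the list name out of extra path info such as "/abc-listname"."""
    parts = [piece for piece in path_info.split("/") if piece]
    return parts[0].lower() if parts else None


def index_file_for(listname, base="/var/lib/mailman/archives/private"):
    """Map a list name to its index file (the naming scheme is made up)."""
    return os.path.join(base, listname + ".swish-index")
```

A single config file can then compute its index path from whatever LISTNAME holds at request time, instead of needing one config file per list.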

Setting Up the Search CGI Script – Background/Explanations

At this point we’d edited the search templates to provide links to the search pages, created Swish search indexes for all the HTML message files, and created a customized swish.cgi script and config file (in /var/lib/mailman/archives/private/) to do the actual searching with Swish and return results.

We could have then tried to hook up the web pages’ search links with our version of swish.cgi to do the search and return the results, as described on the Integrating Mailman with a Swish-e search engine page (though it uses a somewhat different integration method with Server Side Includes).

The problem with that approach is that while the full list messages themselves are protected by the Mailman “private” access control mechanism (for private lists), the search results themselves contain message excerpts so if a mailing list had confidential information it would be exposed to non-subscribers. For our purposes, that wasn’t acceptable.

So we also had to restrict running the search and displaying the results to list subscribers, which proved to be fairly involved.

My initial thought was to try to figure out Mailman’s authentication mechanism and then wrap it around the Swish search CGI script.

I looked at the files in the /cgi-bin/mailman directory (actually /usr/lib/cgi-bin/mailman/ in our installation) and the “file” command said that they were all “setgid ELF 32-bit LSB executable” files, i.e., compiled executables.

This confused me since most of Mailman seems written in Python, but I noticed that the Python files in /usr/lib/mailman/Mailman/Cgi/ had the same names as the compiled programs in /usr/lib/cgi-bin/mailman/. Further, changing one of the interpreted .py files in /usr/lib/mailman/Mailman/Cgi/ confirmed that the compiled files with the same name in /usr/lib/cgi-bin/mailman/ were calling them.

I tried copying the interpreted Python file /usr/lib/mailman/Mailman/Cgi/private.py to /usr/lib/cgi-bin/mailman/search and then editing it to call Swish’s CGI search program (/var/lib/mailman/archives/private/swish.cgi) instead of displaying private archive pages as private.py normally does.

This worked on our development box. On our production host, though, it resulted in permission errors reading list configuration files: the web server wouldn’t apply the setgid mechanism to an interpreted file, so the script didn’t run as the same “mailman” group as the other Mailman programs. (There are good security reasons for that.)

I’d thought the CGI files in /usr/lib/cgi-bin/mailman/ were compiled for performance reasons, but it turns out they were compiled to allow the web server to run the CGI scripts with the correct permissions via the setgid mechanism.
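If you want to check for the setgid bit from Python rather than with ls -l or the “file” command, a small sketch looks like this. The helper name is made up; stat.S_ISGID and stat.filemode are standard library.

```python
import stat


def is_setgid(mode):
    """Return True if a file mode (st_mode) has the setgid bit set."""
    return bool(mode & stat.S_ISGID)


# stat.filemode renders a mode the way "ls -l" does; a setgid executable
# shows an "s" in the group execute position.
EXAMPLE = stat.filemode(stat.S_IFREG | 0o2755)
```

On a real system you would pass in os.stat("/usr/lib/cgi-bin/mailman/private").st_mode (adjusting the path for your installation) and expect True for each Mailman CGI wrapper.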

At this point I downloaded the source files for our version of Mailman because I wanted to confirm this and because I suspected I’d need to compile my own version of a “search” script (a modified “private” script). (I needed to download the source files because our production server admins had installed Mailman with a precompiled binary package.)

After downloading the mailman-2.x.yy.tgz file for our version of Mailman and unpacking it, I went into the src/ subdirectory and found a cgi-wrapper.c program that, along with common.c, confirmed my theory about the compiled binary wrapper programs existing just for security reasons. In particular the following comments in common.c were helpful:

/* We want to tightly control how the CGI scripts get executed.
 * For portability and security, the path to the Python executable
 * is hard-coded into this C wrapper, rather than encoded in the #!
 * line of the script that gets executed.  So we invoke those
 * scripts by passing the script name on the command line to the
 * Python executable.
 *
 * We also need to hack on the PYTHONPATH environment variable so
 * that the path to the installed Mailman modules will show up
 * first on sys.path.
 */

While this whole compilation thing was a pain, at least the C wrappers appeared to be as thin as possible and seemed to exist merely to call the corresponding Python scripts in /usr/lib/mailman/Mailman/Cgi/.

(Incidentally, the part above about “hard-coding the path to the Python executable” finally explained why the Python scripts in /usr/lib/mailman/Mailman/Cgi/ didn’t have #!/usr/bin/python at the top!)

What I ultimately had to do was: 1) Compile my own “search” binary CGI wrapper which had the setgid bit like the other Mailman CGI programs in /usr/lib/cgi-bin/mailman/, and 2) Create a corresponding search.py file in /usr/lib/mailman/Mailman/Cgi/ as a version of the private.py script which authenticated the user and then called swish.cgi to do the actual search.

Compiling a CGI Wrapper for the Search Script

Even though our list server admins had installed Mailman from a binary package, to compile a custom binary CGI wrapper for the search script I needed to download the source for the same version of Mailman as installed on the server (which I determined by running the Mailman bin program “version”).

I created a directory at ~/src and then ran wget http://ftp.gnu.org/gnu/mailman/mailman-2.x.yy.tgz to download the tarfile for our Mailman version (e.g., 2.1.12) and tar xvfz mailman-2.x.yy.tgz to unpack it, which created a mailman-2.x.yy/ subdirectory.

I also realized I might need a place to “install” the compiled binary wrapper file, so I did mkdir ~/src/mailman and then chmod g+s ~/src/mailman because that’s required by the configure script below.

Then I changed into the newly-created mailman-2.x.yy directory and ran ./configure --prefix=$HOME/src/mailman to generate Makefiles from the corresponding Makefile.in files (if you’re curious how the Makefile system works, see make: Automating Your Recipes).

(Actually, I had to run ./configure --prefix=$HOME/src/mailman --with-username=list --with-groupname=list because our list server is set up to use “list” instead of “mailman” for the Mailman user and group, but chances are your server will use the default mailman account/group which configure looks for, so that won’t be necessary.)

Then I changed into the src/ subdirectory (making the current directory ~/src/mailman-2.x.yy/src/) which had the following files:

  • cgi-wrapper.c
  • common.c
  • common.h
  • mail-wrapper.c
  • Makefile
  • Makefile.in
  • vsnprintf.c

All of those files had come with the source archive except for Makefile, which was created by running the “configure” command above. Then I needed to edit that file (~/src/mailman-2.x.yy/src/Makefile) to make the following changes:

  • Changed prefix= /home/ouruser/src/mailman to prefix= /usr/lib/mailman (because our /home/ouruser/src/mailman was just an empty place for our newly-compiled program to be saved into)
  • Copied and pasted the two $(CGI_PROGS) target lines and then in the duplicated lines changed $(CGI_PROGS) to search; this resulted in adding the following two lines:
    search: $(srcdir)/cgi-wrapper.c $(COMMONOBJS)
    $(CC) -DSCRIPT="\"$@\"" -I. $(CGI_FLAGS) $(CFLAGS) $(COMMONOBJS) -o $@ $(srcdir)/cgi-wrapper.c

(You can use Makefile.patch to make the changes, with “patch < Makefile.patch”, but you’ll need to edit the patch file first to replace “ouruser” with your own username. If you make the changes by hand instead, keep in mind that the leading space before $(CC) is a TAB and not just spaces, which has been called one of the worst design botches in the history of Unix.)
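Since a space-indented recipe line is easy to introduce when pasting into a Makefile, here is a rough checker you could run before make search. It is a heuristic sketch with a made-up function name, not a Makefile parser: it only flags space-indented lines that come directly after a "target: prerequisites" line or another recipe line.

```python
import re

# A rule line starts in column one and has a ":" that is not part of ":=".
RULE_LINE = re.compile(r"^[^\s#][^=]*:(?!=)")


def space_indented_recipes(makefile_text):
    """Return 1-based numbers of likely recipe lines indented with spaces."""
    bad = []
    in_rule = False
    for number, line in enumerate(makefile_text.splitlines(), start=1):
        if line.startswith("\t"):
            continue  # a proper TAB-indented recipe line; still inside the rule
        if in_rule and line.startswith(" ") and line.strip():
            bad.append(number)
            continue
        in_rule = bool(RULE_LINE.match(line))
    return bad
```

An empty result doesn’t prove the Makefile is correct, but a non-empty one points straight at the lines that would trigger make’s “missing separator” error.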

Having added a new target for our search wrapper program in ~/src/mailman-2.x.yy/src/Makefile, I then ran make search to do the actual compilation.

I’d expected it to get compiled into the ~/src/mailman directory, but in fact it got compiled into ~/src/mailman-2.x.yy/src/ (the same directory as the just-edited Makefile). I typed ./search to run the program and it produced the following output:

Content-type: text/html

<head>
<title>Mailman CGI error!!!</title>
</head><body>
<h1>Mailman CGI error!!!</h1>
The Mailman CGI wrapper encountered a fatal error. This entry
is being stored in your syslog:
<pre>
Group mismatch error... (etc. etc. etc.)

This was a hell of a lot better than a segfault, and confirmed that the custom CGI wrapper for search had compiled successfully.

I copied the new “search” program to the same place as the other Mailman CGI scripts (with cp ./search /usr/lib/cgi-bin/mailman/) and had our server admins change the user and group to match the other CGI programs in that directory.

I also had them do chmod g+s /usr/lib/cgi-bin/mailman/search to make it a setgid program. This part is important, because the whole point of this process was to have a compiled setgid program so that the web server would run the search program as the same group as the other Mailman programs and be able to read all the Mailman files without permission problems.

The last step was then creating a Python search script that would get called by the compiled search CGI wrapper program we just compiled and installed.

Creating search.py in Mailman/Cgi to Run the Swish Search

Compared to compiling a “search” binary CGI wrapper program, creating the corresponding Python search.py script was relatively simple.

As written above, the CGI wrapper programs in /usr/lib/cgi-bin/mailman/ only seem to exist to let Apache run the real Python scripts with the necessary security permissions as the main Mailman user (via the setgid mechanism).

These Python scripts that actually do the work are in the Mailman/Cgi/ directory, which on our server is /usr/lib/mailman/Mailman/Cgi/, and have the same names as the CGI wrapper programs in /usr/lib/cgi-bin/mailman/, so it’s pretty easy to see the one-to-one correspondence between them.

Basically, I just needed to do cp private.py search.py from within the /usr/lib/mailman/Mailman/Cgi/ directory and then edit search.py to do the following:

  1. Removed everything between
    doc.set_language(lang)
    and
    if __name__ == '__main__':
  2. Added the following two lines in place of what was just removed:
    os.environ['LISTNAME'] = Utils.websafe(listname)
    os.execv('/var/lib/mailman/archives/private/swish.cgi', [])

(or you can copy private.py to search.py and then use search.patch to make the changes with “patch < search.patch”)

This ended up replacing the part of private.py that displays HTML archive files after authentication, with a call to the swish.cgi search program to do the search and display the results.
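The hand-off at the end of search.py can be sketched in standalone form as below. This is not the Mailman code itself. It is written for Python 3, html.escape stands in for Mailman’s Utils.websafe, and the function names are hypothetical. It also passes the script path as the first argument to the exec call, which is the usual convention, whereas the lines above pass an empty list; Python 3 rejects an empty argument list.

```python
import html
import os

SWISH_CGI = "/var/lib/mailman/archives/private/swish.cgi"


def search_environment(listname, environ=None):
    """Return a copy of the environment with LISTNAME set for swish.cgi."""
    env = dict(os.environ if environ is None else environ)
    # Stand-in for Utils.websafe: escape characters that are unsafe in HTML.
    env["LISTNAME"] = html.escape(listname, quote=True)
    return env


def run_search(listname, cgi_path=SWISH_CGI):
    """Replace the current process with the Swish CGI script.

    Nothing after this call runs. The CGI variables already in the
    environment (QUERY_STRING, PATH_INFO, and so on) are passed along.
    """
    os.execve(cgi_path, [cgi_path], search_environment(listname))
```

Because exec replaces the running process, swish.cgi inherits the setgid group from the compiled wrapper and writes its output straight back to the web server.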

To test this, I went to http://lists.ourhost.org/cgi-bin/mailman/private/abc-listname/ and clicked on the “search the list archives” link. This ran http://lists.ourhost.org/cgi-bin/mailman/search/abc-listname which ended up calling swish.cgi to display the default search form for that list. Putting in a test search term and clicking the Search button searched that list’s index file (which swish.cgi figured out from the listname in the URL) and displayed the results—after doing the same authentication that private.py uses before displaying private list messages.


Filed in Tech Tips
  Posted by admin
Tagged access, archive, archives, cgi, index, indexing, login, mailman, private, searching, swish, swish-e
