<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Anthony R. Thompson&#039;s Blog &#187; swish-e</title>
	<atom:link href="http://blog.anthonyrthompson.com/tag/swish-e/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.anthonyrthompson.com</link>
	<description>Helpful Things</description>
	<lastBuildDate>Tue, 26 Oct 2010 17:47:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Listserv to Mailman Part 3.2: Setting Up Archive Search</title>
		<link>http://blog.anthonyrthompson.com/2010/07/listserv-to-mailman-setting-up-archive-search/</link>
		<comments>http://blog.anthonyrthompson.com/2010/07/listserv-to-mailman-setting-up-archive-search/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 09:05:36 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech Tips]]></category>
		<category><![CDATA[access]]></category>
		<category><![CDATA[archive]]></category>
		<category><![CDATA[archives]]></category>
		<category><![CDATA[cgi]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[login]]></category>
		<category><![CDATA[mailman]]></category>
		<category><![CDATA[private]]></category>
		<category><![CDATA[searching]]></category>
		<category><![CDATA[swish]]></category>
		<category><![CDATA[swish-e]]></category>

		<guid isPermaLink="false">http://blog.anthonyrthompson.com/?p=24</guid>
		<description><![CDATA[Modifying Mailman web templates to add search links, using Swish-E to generate archive search indexes, setting up swish's CGI search and template files, and compiling the search CGI wrapper program (required due to web server security/permissions issues).]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>Mailman doesn&#8217;t come with built-in archive searching. Instead, maybe in the Unix tradition of &#8220;do one thing and do it well, leave the rest to other tools&#8221;, you must find, install, and integrate your own search package. In some places it&#8217;s mentioned that people have used the <a href="http://www.swish-e.org/">Swish</a> search package successfully, but usually no further explanation of how to do this is given.</p>
<p>(Note that there are actually <em>two</em> &#8220;swish&#8221; index/search packages out there, <a href="http://www.swish-e.org/">Swish-E</a> and <a href="http://swishplusplus.sourceforge.net/">Swish++</a>. For this guide, and on our server, I used Swish-E because we already used it for search on our main website.)</p>
<p>Mailman seems to add each message to the list archive (mbox file and HTML archive pages) as it&#8217;s processed for delivery/handling. We needed to somehow periodically have Swish generate a search index from all the HTML archive pages.</p>
<p>We also had to tweak the Mailman installation to add a search box to the list template pages, and add other tweaks to limit searching and search results to subscribers only (since all our lists and their archives were private).</p>
<h1>Adding a Search Box to List Templates</h1>
<p>To enable list archive searching, we needed a place where people could enter their search parameters, which required editing a few Mailman list template files.</p>
<p>As written at <a href="http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine#Integrating_the_search_with_Mailman.27s_pages">http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine</a>, we didn&#8217;t want to edit the installed Mailman templates; instead, we wanted to <em>copy</em> the default templates to a special area and then edit the copies to override the default ones. This made sure that our customized templates wouldn&#8217;t be overwritten by a Mailman upgrade.</p>
<p>On our server, the default templates were in /etc/mailman/en/ so we created /etc/mailman/site/en/ and copied the archidxhead.html, archtoc.html, and archtocnombox.html files from the former directory to the latter.</p>
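<p>The copy step can also be scripted. What follows is only a minimal Python sketch, not something from the original setup: the function name copy_templates is made up for illustration, and the directories are passed in because they vary between installations. It skips any file that already exists in the override directory so that earlier customizations are not clobbered.</p>
<pre>
```python
import shutil
from pathlib import Path

# Template files named in the text above.
TEMPLATE_NAMES = ["archidxhead.html", "archtoc.html", "archtocnombox.html"]


def copy_templates(default_dir, site_dir, names=TEMPLATE_NAMES):
    """Copy default templates into the site override directory.

    Files that already exist in site_dir are left alone so that
    earlier edits are not overwritten. Returns the copied paths.
    """
    source = Path(default_dir)
    target = Path(site_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        destination = target / name
        if destination.exists():
            continue  # keep the customized copy
        shutil.copy2(source / name, destination)
        copied.append(destination)
    return copied
```
</pre>
<p>On a server laid out like ours, the call would be copy_templates("/etc/mailman/en", "/etc/mailman/site/en"), run as a user allowed to write to /etc/mailman/.</p>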
<p>Then I edited archidxhead.html to add the following line:</p>
<pre>&lt;li&gt;&lt;b&gt;&lt;a href="/cgi-bin/mailman/search/%(listname)s"
&gt;Search the archives of this list&lt;/a&gt;&lt;/b&gt;&lt;/li&gt;</pre>
<p>archtoc.html was edited to have:</p>
<pre>&lt;p&gt;You can get &lt;a href="%(listinfo)s"&gt;more information about this list&lt;/a&gt;,
&lt;a href="%(fullarch)s"&gt;download the full raw archive&lt;/a&gt; (%(size)s),
or &lt;a href="/cgi-bin/mailman/search/%(listname)s"
&gt;search all the archives of the list&lt;/a&gt;.&lt;/p&gt;</pre>
<p>Finally, archtocnombox.html was edited to have:</p>
<pre>&lt;p&gt;You can get &lt;a href="%(listinfo)s"&gt;more information about this list&lt;/a&gt;
or &lt;a href="/cgi-bin/mailman/search/%(listname)s"
&gt;search the list archives&lt;/a&gt;.&lt;/p&gt;</pre>
<p>Mailman seems to load templates into memory when starting up, so to get it to recognize the custom template overrides, we had to restart Mailman (which I think was done with <strong>/etc/init.d/mailman restart</strong> but I&#8217;m not 100% sure because our server admins actually did the restart).</p>
<p>Note that if your Mailman CGI scripts end with .cgi (e.g., if you built from source and passed the --with-cgi-ext=.cgi flag to the configure script), it would be search.cgi in the above snippets.</p>
<p>After putting links to the search script on the Mailman templates, we had to put the search script in place to receive search queries. This was a bit complicated due to our requirement to keep search (and search results) limited to subscribers only, and it&#8217;s where we had to deviate from the <a href="http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine#Integrating_the_search_with_Mailman.27s_pages">Integrating Mailman with a Swish-e search engine</a> page because that page doesn&#8217;t cover keeping searches private.</p>
<p>The script I wrote to create the Swish search index for each list (<a href="/listserv-to-mailman/code/arch_index.py">arch_index.py</a>) also does as much setup for Swish searching as possible but there are some one-time setup things we had to do manually first.</p>
<p>Before I describe those one-time setup steps, I want to describe the general process of using arch_index.py to create the Swish index files for Mailman lists.</p>
<h1>Generating a Search Index with Swish</h1>
<p>Swish doesn&#8217;t search the HTML pages in the archive directly, probably because it would be too slow. Instead, it searches against its own optimized index file. That&#8217;s faster, but periodically the index file itself needs to be updated (regenerated completely, actually, as Swish doesn&#8217;t seem to support incrementally adding to index files).</p>
<p>Since I&#8217;d have to do the same steps to generate the Swish index file for each list archive, I automated the process with a script called <a href="/listserv-to-mailman/code/arch_index.py">arch_index.py</a>, named somewhat in honor of the Mailman bin program &#8220;arch&#8221;.</p>
<p>In addition to generating the Swish index file for one or more lists, I also had arch_index.py set up the CGI script which uses Swish to search a given list (based on the swish.cgi provided with the Swish package), and set up a configuration file for the CGI script to work correctly.</p>
<p>If arch_index.py is just given the Mailman archive directory as a parameter, and no other options, it creates Swish search indexes for all lists (except the built-in &#8220;mailman&#8221; list, though there is a flag to force indexing of that too):</p>
<pre>arch_index.py /var/lib/mailman/archives/</pre>
<p>To create the index (and search config files, etc.) for just a particular list you can do:</p>
<pre>arch_index.py -l some-listname /var/lib/mailman/archives/</pre>
<p>arch_index.py assumes the Swish indexer program is located at /usr/bin/swish-e and the default Swish CGI search script was installed at /usr/lib/swish-e/swish.cgi—though you can override both on the command line or by editing arch_index.py itself. To see all options, just type arch_index.py with no arguments.</p>
<p>arch_index.py also enables searching list archives by date. Swish&#8217;s search by date feature looks at the modification times of indexed files, but for a list converted from Listserv all the HTML archive pages would be generated at once and have the same modification time. So to support date searching, arch_index.py looks at each posting&#8217;s date and then sets the file modification time to that date.</p>
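<p>To illustrate the modification-time trick, here is a small Python sketch. It is not taken from arch_index.py, which may do this differently; the function name is hypothetical, and it assumes the posting date has already been pulled out of the archive page as an RFC 2822 style string (extracting the date from the HTML is not shown).</p>
<pre>
```python
import os
from email.utils import mktime_tz, parsedate_tz


def set_mtime_from_date(path, date_string):
    """Set a file's modification time to the given message date.

    date_string is assumed to look like
    "Fri, 09 Jul 2010 09:05:36 +0000". Returns the timestamp used,
    or None when the date cannot be parsed (the file is left alone).
    """
    parsed = parsedate_tz(date_string)
    if parsed is None:
        return None
    timestamp = mktime_tz(parsed)
    # Keep the access time as it is; only the modification time changes.
    atime = os.stat(path).st_atime
    os.utime(path, (atime, timestamp))
    return timestamp
```
</pre>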
<p>We should probably describe exactly what arch_index.py does though&#8230;</p>
<h1>What arch_index.py Does</h1>
<p>You can look at the <a href="/listserv-to-mailman/code/arch_index.py">source code</a> yourself of course, but here&#8217;s a quick summary of what arch_index.py does:</p>
<ol>
<li>Checks command line options and figures out what, if any, lists to index, and whether the swish-e and swish.cgi files are available</li>
<li>Copies the default swish.cgi file (if it hasn&#8217;t been copied already) and customizes it</li>
<li>Creates a config file for the customized swish.cgi script (if it hasn&#8217;t been created before)</li>
<li>Updates the modification times of any new HTML archive message files for each list, to match their message posting dates</li>
<li>Creates a swish indexing config file for each list, if necessary</li>
<li>Actually runs swish for each list to create the swish index files</li>
</ol>
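<p>The last two steps in the list above can be sketched roughly as follows. This is a simplified Python illustration rather than the actual arch_index.py code: the function names and file names are hypothetical, and the config uses only the basic Swish-E directives IndexDir, IndexFile, and IndexOnly. The real script writes whatever settings its author chose.</p>
<pre>
```python
def swish_config_text(html_dir, index_file):
    """Return the text of a small Swish-E indexing config file."""
    lines = [
        "IndexDir " + html_dir,       # directory of HTML archive pages
        "IndexFile " + index_file,    # where the index gets written
        "IndexOnly .html",            # skip everything except HTML pages
    ]
    return "\n".join(lines) + "\n"


def swish_command(config_file, swish_path="/usr/bin/swish-e"):
    """Return the argument list that runs the indexer with a config file."""
    return [swish_path, "-c", config_file]
```
</pre>
<p>After writing the config text to a file, something like subprocess.run(swish_command("some-listname.conf"), check=True) would run the indexer, assuming swish-e is installed at the default path mentioned earlier.</p>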
<h1>Creating a Custom Swish Search Template for All Lists</h1>
<p>Part of the customization in item #2 above is changing the default swish.cgi file to refer to a custom template file, similar to the process described in <a href="http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine#Integrating_the_search_with_Mailman.27s_pages">Integrating Mailman with a Swish-e search engine</a>.</p>
<p>So we set up the custom template file by copying TemplateDefault.pm in /usr/lib/swish-e/perl/SWISH/ to TemplateDefault_MM.pm in the same directory and making four changes:</p>
<ol>
<li><strong>package SWISH::TemplateDefault</strong> was changed to <strong>package SWISH::TemplateDefault<span style="color: #ff0000;">_MM</span></strong></li>
<li><strong>my $advanced_link = qq[&lt;small&gt;&lt;a href="$form"&gt;advanced form&lt;/a&gt;&lt;/small&gt;]</strong> was changed to <strong>my $advanced_link = qq[&lt;small&gt;&lt;a href="$form<span style="color: #ff0000;">$ENV{'PATH_INFO'}</span>"&gt;advanced form&lt;/a&gt;&lt;/small&gt;]</strong></li>
<li><strong>&lt;form method="get" action="$form" enctype="application/x-www-form-urlencoded" class="form"&gt;</strong> was changed to <strong>&lt;form method="get" action="$form<span style="color: #ff0000;">$ENV{'PATH_INFO'}</span>" enctype="application/x-www-form-urlencoded" class="form"&gt;</strong></li>
<li>The following was added after $query_href and $pages:<br />
<span style="color: #ff0000;"><strong>$query_href =~ s#search\?#search$ENV{'PATH_INFO'}\?#g;</strong><br />
<strong>$pages =~ s#search\?#search$ENV{'PATH_INFO'}\?#g;</strong></span></li>
</ol>
<p>(If you prefer patch files you can download <a href="/listserv-to-mailman/code/TemplateDefault_MM.patch">TemplateDefault_MM.patch</a>, change into /usr/lib/swish-e/perl/SWISH/ and then run &#8220;patch &lt; TemplateDefault_MM.patch&#8221;; see <a href="http://stephenjungels.com/jungels.net/articles/diff-patch-ten-minutes.html">The Ten Minute Guide to diff and patch</a> for more info.)</p>
<p>$ENV{'PATH_INFO'} had to be added in all those places to let one master Swish search template work for all lists, because the listname is passed to the search script as an extra path info parameter (i.e., /cgi-bin/mailman/search/<em>listname</em>).</p>
<p>arch_index.py copies the swish.cgi file in /usr/lib/swish-e/ into /var/lib/mailman/archives/private/ and then 1) changes SWISH::TemplateDefault to SWISH::TemplateDefault_MM and 2) changes $DEFAULT_CONFIG_FILE to point to /var/lib/mailman/archives/private/swish.cgi.conf which arch_index.py also creates.</p>
<p>The /var/lib/mailman/archives/private/swish.cgi.conf file created by arch_index.py uses an $ENV{'LISTNAME'} environment variable that /cgi-bin/mailman/search (which we haven&#8217;t covered yet) sets from its extra path info parameter. In other words, using an environment variable in the swish.cgi.conf file allows there to be one master configuration file which can dynamically refer to a different search index file for each list.</p>
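<p>As a rough illustration of how a list name can be derived from the extra path info, here is a Python sketch. It is not Mailman&#8217;s own parsing code; the function name, the lowercasing, and the set of characters treated as safe are all assumptions made for this example.</p>
<pre>
```python
import re

# Assumed pattern for an acceptable list name in this sketch.
SAFE_LISTNAME = re.compile(r"^[a-z0-9][a-z0-9._+-]*$")


def listname_from_path_info(path_info):
    """Return the list name from a PATH_INFO value such as "/abc-listname".

    Returns None when there is no list name, or when the first path
    component contains characters outside the assumed safe set.
    """
    parts = [piece for piece in (path_info or "").split("/") if piece]
    if not parts:
        return None
    name = parts[0].lower()  # assumption: list names are stored lowercase
    if SAFE_LISTNAME.match(name) is None:
        return None
    return name
```
</pre>
<p>Validating the name before using it to pick an index file matters because the value comes straight from the URL.</p>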
<h1>Setting Up the Search CGI Script &#8211; Background/Explanations</h1>
<p>At this point we&#8217;d edited the search templates to provide links to the search pages, created Swish search indexes for all the HTML message files, and created a customized swish.cgi script and config file (in /var/lib/mailman/archives/private/) to do the actual searching with Swish and return results.</p>
<p>We could have then tried to hook up the web pages&#8217; search links with our version of swish.cgi to do the search and return the results, as described on the <a href="http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine#Integrating_the_search_with_Mailman.27s_pages">Integrating Mailman with a Swish-e search engine</a> page (though it uses a somewhat different integration method with Server Side Includes).</p>
<p>The problem with that approach is that while the full list messages themselves are protected by the Mailman &#8220;private&#8221; access control mechanism (for private lists), the search results themselves contain message excerpts so if a mailing list had confidential information it would be exposed to non-subscribers. For our purposes, that wasn&#8217;t acceptable.</p>
<p>So we also had to restrict running the search and displaying the results to list subscribers, which proved to be fairly involved.</p>
<p>My initial thought was to try to figure out Mailman&#8217;s authentication mechanism and then wrap it around the Swish search CGI script.</p>
<p>I looked at the files in the /cgi-bin/mailman directory (actually /usr/lib/cgi-bin/mailman/ in our installation) and the &#8220;file&#8221; command said that they were all &#8220;setgid ELF 32-bit LSB executable&#8221; files, i.e., compiled executables.</p>
<p>This confused me since most of Mailman seems to be written in Python, but I noticed that the Python files in /usr/lib/mailman/Mailman/Cgi/ had the same names as the compiled programs in /usr/lib/cgi-bin/mailman/. Further, changing one of the interpreted .py files in /usr/lib/mailman/Mailman/Cgi/ confirmed that the compiled files with the same name in /usr/lib/cgi-bin/mailman/ were calling them.</p>
<p>I tried copying the interpreted Python file /usr/lib/mailman/Mailman/Cgi/private.py to /usr/lib/cgi-bin/mailman/search and then editing it to call Swish&#8217;s CGI search program (/var/lib/mailman/archives/private/swish.cgi) instead of displaying private archive pages as private.py normally does.</p>
<p>While this worked on our development box, unfortunately on our production host it resulted in permission errors reading list configuration files because the web server wouldn&#8217;t use the <a href="http://en.wikipedia.org/wiki/Setuid">setgid</a> mechanism to run an interpreted file as the same &#8220;mailman&#8221; group as the other Mailman programs (which it wouldn&#8217;t do for <a href="http://www.faqs.org/faqs/unix-faq/faq/part4/section-7.html">good security reasons</a>).</p>
<p>I&#8217;d thought the CGI files in /usr/lib/cgi-bin/mailman/ were compiled for performance reasons, but it turns out they were compiled to allow the web server to run the CGI scripts with the correct permissions via the setgid mechanism.</p>
<p>At this point I downloaded the <a href="http://www.list.org/download.html">source files</a> for our version of Mailman because I wanted to confirm this and because I suspected I&#8217;d need to compile my own version of a &#8220;search&#8221; script (a modified &#8220;private&#8221; script). (I needed to download the source files because our production server admins had installed Mailman with a precompiled binary package.)</p>
<p>After downloading the mailman-2.x.yy.tar.gz file for our version of Mailman and unpacking it, I went into the src/ subdirectory and found a cgi-wrapper.c program that, along with common.c, confirmed my theory about the compiled binary wrapper programs existing just for security reasons. In particular the following comments in common.c were helpful:</p>
<pre>/* We want to tightly control how the CGI scripts get executed.
 * For portability and security, the path to the Python executable
 * is hard-coded into this C wrapper, rather than encoded in the #!
 * line of the script that gets executed.  So we invoke those
 * scripts by passing the script name on the command line to the
 * Python executable.
 *
 * We also need to hack on the PYTHONPATH environment variable so
 * that the path to the installed Mailman modules will show up
 * first on sys.path.
 */</pre>
<p>While this whole compilation thing was a pain, at least the C wrappers appeared to be as thin as possible and seemed to exist merely to call the corresponding Python scripts in /usr/lib/mailman/Mailman/Cgi/.</p>
<p>(Incidentally, the part above about &#8220;hard-coding the path to the Python executable&#8221; finally explained why the Python scripts in /usr/lib/mailman/Mailman/Cgi/ didn&#8217;t have #!/usr/bin/python at the top!)</p>
<p>What I ultimately had to do was: 1) Compile my own &#8220;search&#8221; binary CGI wrapper which had the setgid bit like the other Mailman CGI programs in /usr/lib/cgi-bin/mailman/, and 2) Create a corresponding search.py file in /usr/lib/mailman/Mailman/Cgi/ as a version of the private.py script which authenticated the user and then called swish.cgi to do the actual search.</p>
<h1>Compiling a CGI Wrapper for the Search Script</h1>
<p>Even though our list server admins had installed Mailman from a binary package, to compile a custom binary CGI wrapper for the search script I needed to download the source for the same version of Mailman as installed on the server (which I determined by running the Mailman bin program &#8220;version&#8221;).</p>
<p>I created a directory at ~/src and then ran <strong>wget http://ftp.gnu.org/gnu/mailman/mailman-2.x.yy.tgz</strong> to download the tarfile for our Mailman version (e.g., 2.1.12) and <strong>tar xvfz mailman-2.x.yy.tgz</strong> to unpack it, which created a mailman-2.x.yy/ subdirectory.</p>
<p>I also realized I might need a place to &#8220;install&#8221; the compiled binary wrapper file, so I did <strong>mkdir ~/src/mailman</strong> and then <strong>chmod g+s ~/src/mailman</strong> because that&#8217;s required by the configure script below.</p>
<p>Then I changed into the newly-created mailman-2.x.yy directory and ran <strong>./configure --prefix=$HOME/src/mailman</strong> to generate Makefiles from the corresponding Makefile.in files (if you&#8217;re curious how the Makefile system works, see <a href="http://www.faqs.org/docs/artu/ch15s04.html">make: Automating Your Recipes</a>).</p>
<p>(<em>Actually</em>, I had to run <strong>./configure --prefix=$HOME/src/mailman --with-username=list --with-groupname=list</strong> because our list server is set up to use &#8220;list&#8221; instead of &#8220;mailman&#8221; for the Mailman user and group, but chances are your server will use the default mailman account/group which configure looks for, so that won&#8217;t be necessary.)</p>
<p>Then I changed into the src/ subdirectory (making the current directory ~/src/mailman-2.x.yy/src/) which had the following files:</p>
<ul>
<li>cgi-wrapper.c</li>
<li>common.c</li>
<li>common.h</li>
<li>mail-wrapper.c</li>
<li>Makefile</li>
<li>Makefile.in</li>
<li>vsnprintf.c</li>
</ul>
<p>All of those files had come with the source archive except for Makefile, which was created by running the &#8220;configure&#8221; command above. Then I needed to edit that file (~/src/mailman-2.x.yy/src/Makefile) to make the following changes:</p>
<ul>
<li>Changed <strong>prefix= /home/ouruser/src/mailman</strong> to <strong>prefix= /usr/lib/mailman</strong> (because our /home/ouruser/src/mailman was just an empty place for our newly-compiled program to be saved into)</li>
<li>Copied and pasted the two $(CGI_PROGS) target lines and then in the duplicated lines changed $(CGI_PROGS) to search; this resulted in adding the following two lines:<br />
<strong>search: $(srcdir)/cgi-wrapper.c $(COMMONOBJS)</strong><br />
<strong>$(CC) -DSCRIPT="\"$@\"" -I. $(CGI_FLAGS) $(CFLAGS) $(COMMONOBJS) -o $@ $(srcdir)/cgi-wrapper.c</strong></li>
</ul>
<p>(You can use <a href="/listserv-to-mailman/code/Makefile.patch">Makefile.patch</a> to make the changes, with &#8220;patch &lt; Makefile.patch&#8221;, but you&#8217;ll need to edit the patch file first to replace &#8220;ouruser&#8221; with your own username. If you make the changes by hand instead, keep in mind that the leading space before $(CC) is a TAB and not just spaces, which has been called <a href="http://www.faqs.org/docs/artu/ch15s04.html">one of the worst design botches in the history of Unix</a>.)</p>
<p>Having added a new target for our search wrapper program in ~/src/mailman-2.x.yy/src/Makefile, I then ran <strong>make search</strong> to do the actual compilation.</p>
<p>I&#8217;d expected it to get compiled into the ~/src/mailman directory, but in fact it got compiled into ~/src/mailman-2.x.yy/src/ (the same directory as the just-edited Makefile). I typed <strong>./search</strong> to run the program and it produced the following output:</p>
<pre>Content-type: text/html

&lt;head&gt;
&lt;title&gt;Mailman CGI error!!!&lt;/title&gt;
&lt;/head&gt;&lt;body&gt;
&lt;h1&gt;Mailman CGI error!!!&lt;/h1&gt;
The Mailman CGI wrapper encountered a fatal error. This entry
is being stored in your syslog:
&lt;pre&gt;
Group mismatch error... (etc. etc. etc.)</pre>
<p>This was a hell of a lot better than a <a href="http://en.wikipedia.org/wiki/Segmentation_fault">segfault</a>, and confirmed that the custom CGI wrapper for search had compiled successfully.</p>
<p>I copied the new &#8220;search&#8221; program to the same place as the other Mailman CGI scripts (with <strong>cp ./search /usr/lib/cgi-bin/mailman/</strong>) and had our server admins change the user and group to match the other CGI programs in that directory.</p>
<p>I also, importantly, had them do <strong>chmod g+s /usr/lib/cgi-bin/mailman/search</strong> to make it a setgid program, because the whole point of this process was to have a compiled setgid wrapper so that the web server would run the search program as the same group as the other Mailman programs and be able to read all the Mailman files without permission problems.</p>
<p>The last step was then creating a Python search script that would get called by the search CGI wrapper program we had just compiled and installed.</p>
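<p>If you want to double-check that the installed wrapper really ended up with the setgid bit and the expected group, a few lines of Python can report both. This is just a convenience sketch with made-up function names; running ls -l on the file shows the same information.</p>
<pre>
```python
import os
import stat


def has_setgid(mode):
    """Return True when the setgid bit is set in a file mode."""
    # OR-ing the bit in changes nothing only if it was already set.
    return (mode | stat.S_ISGID) == mode


def describe_wrapper(path):
    """Return (group id, setgid flag) for a CGI wrapper file."""
    info = os.stat(path)
    return info.st_gid, has_setgid(info.st_mode)
```
</pre>
<p>For example, describe_wrapper("/usr/lib/cgi-bin/mailman/search") should report the same group id as the other wrappers in that directory, with the flag set to True.</p>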
<h1>Creating search.py in Mailman/Cgi to Run the Swish Search</h1>
<p>Compared to compiling a &#8220;search&#8221; binary CGI wrapper program, creating the corresponding Python search.py script was relatively simple.</p>
<p>As written above, the CGI wrapper programs in /usr/lib/cgi-bin/mailman/ only seem to exist to let Apache run the real Python scripts with the necessary security permissions as the main Mailman user (via the setgid mechanism).</p>
<p>These Python scripts that actually do the work are in the Mailman/Cgi/ directory, which on our server is /usr/lib/mailman/Mailman/Cgi/, and have the same names as the CGI wrapper programs in /usr/lib/cgi-bin/mailman/, so it&#8217;s pretty easy to see the one-to-one correspondence between them.</p>
<p>Basically, I just needed to do <strong>cp private.py search.py</strong> from within the /usr/lib/mailman/Mailman/Cgi/ directory and then edit search.py to do the following:</p>
<ol>
<li>Removed everything between<br />
<strong>doc.set_language(lang)</strong><br />
and<br />
<strong>if __name__ == '__main__':</strong></li>
<li>Added the following two lines in place of what was just removed:<br />
<strong>os.environ['LISTNAME'] = Utils.websafe(listname)<br />
os.execv('/var/lib/mailman/archives/private/swish.cgi', [])</strong></li>
</ol>
<p>(or you can copy private.py to search.py and then use <a href="/listserv-to-mailman/code/search.patch">search.patch</a> to make the changes with &#8220;patch &lt; search.patch&#8221;)</p>
<p>This ended up replacing the part of private.py that displays HTML archive files after authentication, with a call to the swish.cgi search program to do the search and display the results.</p>
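<p>The hand-off to swish.cgi boils down to &#8220;set an environment variable, then replace the current process with the CGI program&#8221;. The following standalone Python 3 sketch shows that pattern outside of Mailman. It is not the search.py described above (which runs inside Mailman and uses Utils.websafe and os.execv); here html.escape stands in for Utils.websafe, and both function names are hypothetical.</p>
<pre>
```python
import html
import os


def build_search_env(listname, base_env=None):
    """Return a copy of the environment with LISTNAME set for swish.cgi."""
    env = dict(os.environ if base_env is None else base_env)
    # Escape the name before passing it on, as a stand-in for Utils.websafe.
    env["LISTNAME"] = html.escape(listname, quote=True)
    return env


def run_search(listname, cgi_path="/var/lib/mailman/archives/private/swish.cgi"):
    """Replace the current process with the search CGI (does not return)."""
    os.execve(cgi_path, [cgi_path], build_search_env(listname))
```
</pre>
<p>Because the process is replaced rather than spawned, the CGI program inherits the group permissions that the setgid wrapper established, along with the rest of the CGI environment such as PATH_INFO and QUERY_STRING.</p>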
<p>To test this, I went to http://lists.ourhost.org/cgi-bin/mailman/private/abc-listname/ and clicked on the &#8220;search the list archives&#8221; link. This ran http://lists.ourhost.org/cgi-bin/mailman/search/abc-listname which ended up calling swish.cgi to display the default search form for that list. Putting in a test search term and clicking the Search button searched that list&#8217;s index file (which swish.cgi figured out from the listname in the URL) and displayed the results—<em>after</em> doing the same authentication that private.py uses before displaying private list messages.</p>
<p><strong>Next</strong>: <a href="/2010/07/listserv-to-mailman-tips-tricks-note/">Appendix &#8211; Tips, Tricks, and Notes</a> or <strong>Up</strong>: <a href="/listserv-to-mailman/">Table of Contents</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.anthonyrthompson.com/2010/07/listserv-to-mailman-setting-up-archive-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Listserv to Mailman Part 1.2: Installing Swish for Archive Searching</title>
		<link>http://blog.anthonyrthompson.com/2010/02/listserv-to-mailman-installing-swish-for-archive-searching/</link>
		<comments>http://blog.anthonyrthompson.com/2010/02/listserv-to-mailman-installing-swish-for-archive-searching/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 21:39:16 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech Tips]]></category>
		<category><![CDATA[archives]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[listserv]]></category>
		<category><![CDATA[mailman]]></category>
		<category><![CDATA[searching]]></category>
		<category><![CDATA[swish]]></category>
		<category><![CDATA[swish-e]]></category>

		<guid isPermaLink="false">http://blog.anthonyrthompson.com/?p=11</guid>
		<description><![CDATA[Choosing an archive search package for Mailman, about the Swish-E package for archive indexing/searching, and Swish-E filter/helper add-ons for indexing attachment files like PDFs and Word docs.]]></description>
			<content:encoded><![CDATA[<h1>Background</h1>
<p>Mailman, unlike Listserv, doesn&#8217;t come with built-in list archive searching. (Which is funny since I&#8217;d always thought Listserv&#8217;s archive search was clunky and dated—but at least it <em>had</em> archive search! <img src='http://blog.anthonyrthompson.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> )</p>
<p>I think Mailman may not have archive searching because the developers wanted to keep their focus on the mailing list software itself rather than trying to write and maintain search software too. They probably figured it would be better to leave search to search software developers, and also give people the freedom to install whatever search package they want.</p>
<p>Nonetheless, it would have been a great relief if Mailman had just come with a default search package for list archives that could be uninstalled/overridden if necessary, rather than making everyone who wants to provide list archive searching (which <em>must</em> be a pretty common requirement) re-invent the search wheel by hunting down a package and figuring out how to integrate it into Mailman.</p>
<h1>The Swish-E Package</h1>
<p>After a good deal of searching (ha) I eventually found that many others who also wanted searchable list archives seemed to lean toward using Swish-E (<a href="http://www.swish-e.org/">swish-e.org</a>, <a href="http://en.wikipedia.org/wiki/SWISH-E">Wikipedia</a>).</p>
<p>(Note: Since writing this guide, I&#8217;ve found that others have used <a href="http://en.wikipedia.org/wiki/Htdig">htdig</a> successfully for archive searching too; see the _README file at <a href="http://www.msapiro.net/mm/">msapiro.net/mm/</a> and <a href="http://www.msapiro.net/mm/INSTALL.htdig-mm.html">this documentation</a>.)</p>
<p>Fortunately, I had experience with Swish because our organization already used it for the search feature on its main website.</p>
<p>I wish I could provide step-by-step instructions for setting up Swish, but I forget the steps I used to install it from source on our development box (web host) back in 2003, and the server admins on our production list host installed it for me, probably using a binary package manager like RPM or Apt.</p>
<p>Even though I don&#8217;t have step-by-step instructions for installing Swish (though if you are root, downloading/unpacking the source, running &#8220;configure&#8221; and &#8220;make install&#8221; will probably do the trick), I wanted to include a section about it for a few reasons.</p>
<p>Aside from pointing you to the main swish-e.org site, in particular the <a href="http://www.swish-e.org/download/">Download</a> and <a href="http://www.swish-e.org/docs/">Documentation</a> sections (the latter including a nice INSTALL page), I also wanted to point you to the <em>extremely</em> helpful <a href="http://wpkg.org/Integrating_Mailman_with_a_Swish-e_search_engine">Integrating Mailman with a Swish-e Search Engine</a> page.</p>
<p>I ended up only following some of that page&#8217;s advice, but it was invaluable for getting started on the Listserv to Mailman archive conversion as far as what issues to consider. So that page is definitely worth checking out just to see what&#8217;s involved. (Though note that I ended up writing <a href="/listserv-to-mailman/code/arch_index.py">code</a> to automate some of the things on that page, which I&#8217;ll cover <a href="/2010/07/listserv-to-mailman-setting-up-archive-search/">later</a>.)</p>
<p>First though, just focus on installing Swish-e itself, which indexes and searches 1) the Mailman HTML archive pages (one page per message) and, optionally, 2) any message attachment files such as PDFs or Word docs by using extra Swish add-on/filter programs.</p>
<h1>Using Swish-E for Searching Message Attachments</h1>
<p>While the basic setup of Swish for HTML archive messages is fairly easy, setting it up to search non-HTML/text files such as Word, Excel, and PDF attachments is a little more tricky. I struggled with this when I set up Swish to search binary files on our website, so I wanted to give some info on it in case you want to support searching message attachments too.</p>
<p>(Mailman <em>does</em> allow attachments to be included in list archives, it just separates each from its associated message and puts a link to the attachment file at the bottom of the archived message.)</p>
<p><strong>Note</strong> though: This entire section (the rest of this page) is optional. To be blunt, if you don&#8217;t <em>have</em> to support searching archive message attachments, make your life easier and don&#8217;t do it; you can always go back and add it later if you need to. We didn&#8217;t even do it for our list archives, I&#8217;m just including this info about using Swish to index binary files based on my experience doing that on our main website.</p>
<p>For a general article about indexing HTML pages and other file types with Swish-E, see <a href="http://www.linuxjournal.com/article/6652">How to Index Anything</a> from the Linux Journal in 2003. At first glance that page is very good, but I haven&#8217;t examined it in detail and it&#8217;s possible that some details have changed since 2003.</p>
<p>Basically, for non-text/HTML files Swish relies on external helper programs to extract text from each file. For example, it uses the pdftotext program in the <a href="http://www.foolabs.com/xpdf/">xpdf package</a> for extracting text from PDF files, the <a href="http://www.wagner.pp.ru/~vitus/software/catdoc/">catdoc program</a> to get text from Word .doc files, etc. See the &#8220;Optional But Recommended Packages&#8221; section of the <a href="http://www.swish-e.org/docs/install.html">Swish-E INSTALL doc</a> for more info on what&#8217;s available.</p>
<p>(I haven&#8217;t yet tackled the issue of extracting text from Word&#8217;s newer .docx file format for Swish. If I do, I&#8217;ll update this page, but if you&#8217;ve already done so please <a href="/contact/">let me know</a>. Some possibilities are 1) <a href="http://sourceforge.net/projects/docx2txt/">docx2txt</a>, 2) <a href="http://linux.die.net/man/1/unoconv">unoconv</a>, though that might be unusable by Swish because it requires a running OpenOffice instance, or 3) maybe even a quick-and-dirty <a href="http://stackoverflow.com/questions/1184747/rtf-doc-docx-text-extraction-in-program-written-in-c-qt">unzip/sed/grep combination</a>.)</p>
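<p>As a rough illustration of the third, quick-and-dirty option, here is a minimal Python sketch. It relies only on the fact that a .docx file is a zip archive whose main text lives in <code>word/document.xml</code>. The function name <code>docx_to_text</code> is made up for this example, and the sketch has not been tried as a Swish filter, so treat it as a starting point rather than a working setup.</p>

```python
import zipfile
from xml.etree import ElementTree

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    source may be a file path or a binary file-like object.
    docx_to_text is a hypothetical helper name, not part of Swish or Mailman.
    """
    with zipfile.ZipFile(source) as archive:
        root = ElementTree.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:p paragraph keeps its text in one or more w:t elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

<p>To use something like this with Swish, the function would need to be wrapped in a small script that prints the extracted text for the file path it is given. Headers, footers, and footnotes live in other parts of the archive and are ignored here.</p>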
<p>For more information about supporting searching of PDF, Word, etc. files, see &#8220;How do I index my PDF, Word, and compressed documents?&#8221; and the sections after it on the <a href="http://www.swish-e.org/docs/swish-faq.html#how_do_i_index_my_pdf_word_and_compressed_documents_">Swish-E FAQ page</a> as well as the example filters in the &#8220;Document Filter Directives&#8221; section of the <a href="http://www.swish-e.org/docs/swish-config.html#document_filter_directives">SWISH-CONFIG man page</a>.</p>
<p>Note: I had to tweak the example given on that last page for PDF and DOC files, which were the only two binary file types I included in our website search. Specifically, the SWISH-CONFIG page gave the example of:</p>
<pre>FileFilter .pdf       pdftotext   "%p -"</pre>
<p>and that produced an error during indexing; every time the Swish indexer encountered a PDF file and tried to run pdftotext, it printed pdftotext&#8217;s usage info:</p>
<pre>pdftotext version 3.02
Copyright 1996-2007 Glyph &amp; Cog, LLC
Usage: pdftotext [options] &lt;PDF-file&gt; [&lt;text-file&gt;]
  -f &lt;int&gt;          : first page to convert
  -l &lt;int&gt;          : last page to convert
  ... etc.</pre>
<p>To fix this, I had to tweak it to:</p>
<pre>FileFilter .pdf pdftotext "'%p' -"</pre>
<p>Actually I chose to spell out the full /path/to/pdftotext instead of just pdftotext there, but you get the idea: the main difference is the quoting at the end, which puts %p within its own single quotes.</p>
<p>I had to do the same thing with the catdoc example; the SWISH-CONFIG page suggested:</p>
<pre>FileFilter .doc     /usr/local/bin/catdoc "-s8859-1 -d8859-1 %p"</pre>
<p>&#8230; but that failed too so I enclosed the %p in single quotes there as well:</p>
<pre>FileFilter .doc /path/to/our/catdoc "-s8859-1 -d8859-1 '%p'"</pre>
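<p>Why the single quotes help is not something this page pins down. One plausible explanation, and it is only an assumption, is that the filter command is handed to a shell and some file names contain spaces, so the unquoted path gets split into several arguments. The Python sketch below imitates that with <code>shlex.split</code>, which tokenizes a string using POSIX shell rules. The helper name <code>filter_args</code> and the example path are made up for illustration.</p>

```python
import shlex


def filter_args(template, path):
    """Substitute path for %p, then split the result as a POSIX shell would.

    filter_args is a hypothetical helper that only mimics the assumed
    behaviour; it is not how Swish itself is implemented.
    """
    return shlex.split(template.replace("%p", path))


# Hypothetical archive file whose name contains a space
path = "/archives/annual report.pdf"

unquoted = filter_args("%p -", path)
quoted = filter_args("'%p' -", path)

print(unquoted)  # the path is split in two, so there are three arguments
print(quoted)    # the path stays in one piece, followed by "-"
```

<p>With three arguments instead of the expected two, a program like pdftotext would fall back to printing its usage text, which matches the symptom above. Note that a file name containing a single quote would still break the quoted form.</p>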
<h1>Character Set Problems While Indexing</h1>
<p>Another issue was that at some point I started getting character set error messages when running the Swish indexer. I wish I&#8217;d documented the problem and my solution when it happened, but I&#8217;ve done my best to reconstruct the issue in case it&#8217;s useful to someone.</p>
<p>I believe the error may have been something about catdoc not being able to find the ascii.replchars and/or ascii.specchars character set files. I think I hunted around to find these files, but since catdoc appears to be a fairly old (and seemingly unmaintained) package, my search was fruitless.</p>
<p>Ultimately I think my solution was that I noticed I <em>did</em> have ascii.rpl and ascii.spc files, so I copied those to ascii.replchars and ascii.specchars and that fixed the problem.</p>
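<p>That workaround can be sketched in a few lines of Python. The directory holding catdoc&#8217;s character set files depends on how catdoc was installed, so it is passed in as a parameter; the function name <code>copy_charset_files</code> is made up for this example. Existing files are left alone so the sketch is safe to run more than once.</p>

```python
import shutil
from pathlib import Path

# File names that existed, mapped to the names catdoc apparently looked for
RENAMES = {"ascii.rpl": "ascii.replchars", "ascii.spc": "ascii.specchars"}


def copy_charset_files(charset_dir):
    """Copy ascii.rpl and ascii.spc to the longer names inside charset_dir.

    charset_dir is wherever catdoc keeps its character set files (an
    install-specific location). Returns the names of the files created.
    """
    charset_dir = Path(charset_dir)
    created = []
    for old_name, new_name in RENAMES.items():
        src = charset_dir / old_name
        dst = charset_dir / new_name
        # Only copy when the source exists and the target is not already there
        if src.is_file() and not dst.exists():
            shutil.copyfile(src, dst)
            created.append(new_name)
    return created
```

<p>Calling <code>copy_charset_files("/path/to/catdoc/charsets")</code>, with the placeholder replaced by the real directory, returns the list of files it created, or an empty list if there was nothing to do.</p>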
<p>I&#8217;m also still getting the following error (from pdftotext I believe) when running the Swish indexer on our web site:</p>
<pre>Error: Unknown character collection 'Adobe-Korea1'</pre>
<p>I searched online and couldn&#8217;t find anything for that specific character set, but when I searched for &#8220;swish unknown character collection&#8221; I found <a href="http://swish-e.org/archive/2004-12/8679.html">this post</a> which recommended upgrading the xpdf package as a possible solution. I haven&#8217;t tried it yet because it&#8217;s only a few errors and I hate to upgrade things unless I absolutely have to, but I wanted to mention it here in case someone else gets similar errors using Swish to index PDFs.</p>
<p><strong>Next</strong>: <a href="/2010/02/listserv-to-mailman-installing-an-administrative-command-handler/">Installing an Administrative Command Handler</a> or <strong>Up</strong>: <a href="/listserv-to-mailman/">Table of Contents</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.anthonyrthompson.com/2010/02/listserv-to-mailman-installing-swish-for-archive-searching/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

