Listserv to Mailman Part 3.1: Converting Listserv Archives to Mailman

Note: You could end up using ten times as much disk space as your uncompressed Listserv archives, after converting them to Mailman’s mbox format and using Swish to index them for searching. If that amount of disk space could be an issue for you, see “A Word About Disk Space” at the end of this page.

Introduction

If you’re converting Listserv lists to Mailman, you probably want to keep your list archives too. The good news is that the Listserv archive format is plain text and is actually pretty well documented. The bad news is that it’s a custom proprietary format, not something common like the Unix “mbox” format, so conversion will be necessary.

Mailman expects archives to be in mbox format, so that’s what we converted the Listserv archives to. In fact, one of our lists had a lot of spam in its archives, so after we did the conversion, but before we imported into Mailman, we opened the mbox file in Thunderbird, deleted the spam messages, and then saved the file back to mbox for importing into Mailman.

After converting the Listserv archives to mbox format, we had to move the mbox files to the right locations for Mailman to generate HTML archive pages, and then use Swish to index the HTML pages and do a little Mailman fiddling to keep private archives private (including search results). But first, more about the archive conversion process.

Converting the Archive Files

My archive conversion journey began with a Perl script called ls2mail.pl that was posted in 1999 and claimed to convert from Listserv’s archive format to mbox for Mailman.

The admin of our old Listserv modified it a bit to fix an unspecified “potentially nasty bug”, and I modified it further to 1) skip messages with dates earlier than the earliest legitimate post or later than the current year (to weed out spam messages with invalid dates), and 2) better match the mboxrd format (by quoting body lines beginning with “From”).
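To illustrate the second change, here is a minimal sketch of mboxrd-style quoting. It is not the actual ls2mail.pl code. The function name quote_from_lines is made up for this example, and it assumes its input is the body of a single message (running it over a whole mbox file would also quote the real “From ” separator lines).

```shell
# Sketch of mboxrd-style quoting for the lines of ONE message body.
# Any line that starts with "From " (optionally preceded by one or
# more ">" characters) gets one extra ">" put in front of it.
# quote_from_lines is a hypothetical name, not part of ls2mail.pl.
quote_from_lines() {
  sed -E 's/^(>*From )/>\1/'
}
```

For example, a body line “From here on it gets tricky” comes out as “>From here on it gets tricky”, while a line such as “From: someone” is left alone because there is no space right after “From”.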

First, our Listserv archives were broken up into weekly archive files named listname.logYYMMw where YY is the year of the archive file, MM is the month, and w is the week (a = first week, b = second, up to e for a month with five weeks). For example, listname.log0901a was the Listserv archive file for the first week of January, 2009.

I decided it would be better to rename the files to use four digit years instead of two (i.e., listname.logYYYYMMw), so I used a small Perl utility called perlren to rename the files according to a Perl regular expression. So for each list I did:

perlren 's#log1#log201#' *.log*
perlren 's#log9#log199#' *.log*
perlren 's#log0#log200#' *.log*
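One thing to watch with the three commands above is that they are only safe to run once, in that order: a second run of the first command would also match the already-renamed log199... files. As a hedged alternative, here is a sketch in plain shell that does the rename in one pass and leaves already-renamed files alone. The function names are invented for this example, and it assumes two-digit years 90-99 mean 19xx and everything else means 20xx.

```shell
# Hypothetical helper: turn listname.logYYMMw into listname.logYYYYMMw.
# Assumption: years 90-99 are 19xx, years 00-89 are 20xx.
# Names that already have a four-digit year match neither pattern
# (both require exactly four digits plus a week letter at the end),
# so they pass through unchanged.
four_digit_name() {
  printf '%s\n' "$1" | sed -E \
    -e 's/\.log(9[0-9])([0-9]{2}[a-e])$/.log19\1\2/' \
    -e 's/\.log([0-8][0-9])([0-9]{2}[a-e])$/.log20\1\2/'
}

# Rename every log file in the current directory that still has a
# two-digit year.
rename_logs() {
  for f in *.log*; do
    [ -e "$f" ] || continue            # no matching files at all
    new=$(four_digit_name "$f")
    [ "$f" = "$new" ] || mv -- "$f" "$new"
  done
}
```

Running rename_logs inside a list’s directory should give the same result as the three perlren commands, and running it a second time changes nothing.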

Then I wanted a master file for each list which would have a sorted listing of all archive files (and because of the rename above, an alpha sorting was also a date sorting):

ls -1 *.log* > archive-files.txt

(I used a general name like archive-files.txt since all the log files for each list were grouped into a separate directory for each list.)

Then I used that file as the basis for combining all the Listserv log files into one big file, in order:

perl -ne 'print "Processing $_"; chomp; print `cat $_ >> abc-listname.ls`;' \
  archive-files.txt
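Note that the one-liner above appends with >>, so running it twice puts every message into the combined file twice. A minimal shell sketch of the same step that starts from an empty output file each time might look like this (concat_listed is a made-up name, and it assumes one file name per line in the listing):

```shell
# Concatenate every file named in a listing (one name per line) into a
# single output file, in listing order. The output file is emptied
# first, so re-running the function does not duplicate messages.
# concat_listed is a hypothetical helper name.
concat_listed() {
  listing=$1
  output=$2
  : > "$output"
  while IFS= read -r f; do
    echo "Processing $f" >&2
    cat -- "$f" >> "$output"
  done < "$listing"
}
```

It would be called as: concat_listed archive-files.txt abc-listname.ls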

Finally, I ran the ls2mail.pl conversion script on the master Listserv log file to convert it to mbox format:

perl ~/mailman/ls2mail.pl < abc-listname.ls > abc-listname.mbox
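Before handing the result to Mailman, a quick sanity check is to count the messages in the new mbox file. The sketch below is not part of the original procedure. It assumes the file is properly quoted, so that the only lines starting with “From ” are the message separators; count_messages is a name invented for this example.

```shell
# Count messages in an mbox file by counting its "From " separator
# lines. Assumes body lines starting with "From " have been quoted
# (as ">From "), so only real separators match.
# count_messages is a hypothetical helper name.
count_messages() {
  grep -c '^From ' "$1"
}
```

For example, count_messages abc-listname.mbox prints a single number, which can be compared against a rough idea of how many posts the list should have.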

Generating Mailman’s HTML Archive Pages

To generate the viewable HTML pages for Mailman’s web archive, I just had to move this new mbox file to the appropriate Mailman list archive directory (e.g., /var/lib/mailman/archives/private/abc-listname.mbox/) and run the Mailman bin “arch” command to generate the HTML archive pages from the mbox file:

mv abc-listname.mbox /var/lib/mailman/archives/private/abc-listname.mbox/
/usr/lib/mailman/bin/arch --wipe abc-listname \
  /var/lib/mailman/archives/private/abc-listname.mbox/abc-listname.mbox

(The --wipe option tells arch to overwrite any existing HTML pages with newly generated ones from the mbox file, but since this was a new list, that wasn’t a problem.)

Once I’d tested this with one list, I repeated the process automatically with the other lists by doing something like the following (using command looping in the bash shell again):

cd /var/lib/mailman/archives/private
for list in abc-list1 abc-list2 abc-list3; do   # add the rest of your list names here
  cd "$list" || continue   # skip a missing directory rather than renaming files in the wrong place
  perlren 's#log1#log201#' *.log*
  perlren 's#log9#log199#' *.log*
  perlren 's#log0#log200#' *.log*
  ls -1 *.log* > archive-files.txt
  perl -ne 'print "Processing $_"; chomp; print `cat $_ >> archive.ls`;' \
    archive-files.txt
  perl ~/mailman/ls2mail.pl < archive.ls > "$list.mbox"
  mv "$list.mbox" "../$list.mbox/"
  /usr/lib/mailman/bin/arch --wipe "$list" "../$list.mbox/$list.mbox"
  cd ..
done

By doing this, all the existing list archives were converted from Listserv archives (combined into .ls files) to Mailman .mbox archive files, and web-viewable HTML archive pages were generated by Mailman’s arch command.

A Word About Disk Space

I ran into disk space limits several times during this conversion process.

First, our old list host gave us the Listserv archives as one large compressed .tar.gz file, which expanded to triple its size, so we needed four times the compressed size to hold both the .tar.gz file and its uncompressed contents.

Furthermore, I found that after concatenating all the individual Listserv .log files into one giant Listserv notebook archive .ls file (so, disk space times two for that step), the conversion to mbox format caused the resulting mbox file to take up about 60% more space than the .ls files.

Then running the Mailman arch command on the mbox files generated HTML pages that took up about twice as much space as the corresponding mbox files.

And then the Swish search index files took up about 55% of the size of the HTML archive pages, so that was another bunch of disk space.

The following table shows, step by step, how a 350MB compressed Listserv archive file ballooned up to more than 9GB:

| Item/Action | Item Size | Cumulative Total |
|---|---|---|
| Original Gzipped (compressed) Listserv archive files, as one big .tar.gz file | 0.35GB | 0.35GB |
| Uncompressed Listserv archives, indiv .log files (gzip file x 3 due to 1/3 comp ratio) | 1.05GB | 1.40GB |
| Concatenate .log files for each list into .ls Listserv archives (same size as .log files) | 1.05GB | 2.45GB |
| Convert .ls Listserv archive files to mbox files (about 60% bigger than .ls files) | 1.68GB | 4.13GB |
| Run arch on mbox files to generate HTML pages (about 2x size of mbox files) | 3.36GB | 7.49GB |
| Create Swish indexes from HTML pages (about 55% of HTML pages) | 1.85GB | 9.34GB |
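If you want a rough estimate for your own archives, the same arithmetic can be sketched in a few lines of shell and awk. The ratios below (3x to uncompress, 60% growth for mbox, 2x for HTML, 55% for the Swish index) are simply the ones we observed, so treat them as assumptions that may not hold for your lists. The function name estimate_disk_usage is made up for this example.

```shell
# Estimate total disk usage if every interim file is kept, given the
# size of the compressed archive (in any unit, e.g. GB).
# The ratios are the ones observed in this conversion and are
# assumptions for any other set of archives.
estimate_disk_usage() {
  awk -v gz="$1" 'BEGIN {
    logs = gz * 3        # uncompressed .log files
    ls   = logs          # concatenated .ls files, same size as .log
    mbox = ls * 1.6      # mbox is about 60% bigger than .ls
    html = mbox * 2      # HTML pages are about 2x the mbox size
    idx  = html * 0.55   # Swish index is about 55% of the HTML
    printf "%.2f\n", gz + logs + ls + mbox + html + idx
  }'
}
```

Running estimate_disk_usage 0.35 prints 9.34, matching the last row of the table. Put another way, with these ratios the total works out to a bit under 27 times the size of the compressed archive.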

You can mitigate some of those increases by deleting interim files (e.g., deleting the individual .log files after creating the concatenated .ls Listserv archive files), but the overall disk usage still ends up a lot more than you might think because the mbox files take more space than the Listserv notebooks, the HTML archive files are twice the size of the mbox files, and the Swish search index files take up more space on top of that.

We filled our disk up several times during the conversion. One of those times, all mail delivery for all our just-converted lists stopped completely until some disk space was freed up and the server was restarted (restarting just Mailman with “/etc/init.d/mailman restart” was tried first, but wasn’t enough).

By the way, after filling the disk a few times I created a script called disk_space_check and installed a daily cron job to send an email if disk usage was too high:

/bin/df -h | $HOME/bin/disk_space_check
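The script itself isn’t shown here, but a minimal sketch of the idea might look like the following. This is not the actual disk_space_check script. The function name over_threshold, the 90% limit, and the email address are all placeholders, and the parsing assumes df output where the use percentage is the second-to-last column and the mount point (with no spaces in it) is the last.

```shell
# Read df output on stdin and print "mountpoint use%" for every
# filesystem at or above the given percentage (default 90).
# Assumptions: one filesystem per line, "Use%" is the second-to-last
# column, and mount points contain no spaces.
# over_threshold is a hypothetical name, not the original script.
over_threshold() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $(NF - 1)
    sub(/%$/, "", pct)
    if (pct + 0 >= limit) print $NF, pct "%"
  }'
}

# Possible use from cron (address and limit are placeholders):
#   report=$(/bin/df -h | over_threshold 90)
#   [ -n "$report" ] && printf '%s\n' "$report" | \
#     mail -s "Disk space warning" admin@example.com
```

Because nothing is printed when every filesystem is under the limit, the mail is only sent when there is something to report.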

In the next section we’ll add searching to Mailman’s archives, and make sure that searching private lists (and viewing the search results) is limited to subscribers.
