[Teammetrics-discuss] Converter for mboxes (Was: Debian mailing lists archives as mbox)

Andreas Tille andreas at an3as.eu
Thu Aug 18 14:31:01 UTC 2011


Hi,

sorry for the full quote, which I am leaving in as a reminder of the mbox
conversion issue.  I would like to add that we recently found some
mboxes from alioth lists (which are public) containing messages that
lack any Message-ID.  All of these messages were spam.  To account for
this, do you want the conversion script to drop messages without a
Message-ID whenever it finds them?
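
A minimal sketch of what such a rule could look like, assuming the
messages are available as email.message.Message objects (for instance
by iterating over mailbox.mbox(path)).  The function name and the
sample messages are made up for illustration and are not taken from
mboxfilter.py:

```python
import email
from email.message import Message
from typing import Iterable, Iterator


def drop_without_message_id(messages: Iterable[Message]) -> Iterator[Message]:
    """Yield only messages that carry a non-empty Message-ID header."""
    for msg in messages:
        # Header lookup is case-insensitive; a missing header gives None.
        message_id = msg.get("Message-ID")
        if message_id is None or not str(message_id).strip():
            continue  # assumption: no Message-ID means spam, so skip it
        yield msg


# Two hypothetical messages: one with a Message-ID, one without.
with_id = email.message_from_string(
    "Message-ID: <1@example.org>\nSubject: ok\n\nbody\n")
without_id = email.message_from_string("Subject: spam\n\nbody\n")

kept = list(drop_without_message_id([with_id, without_id]))
print([m["Subject"] for m in kept])
```

Whether dropping is the right policy is of course your decision; the
sketch only shows that the check itself is cheap.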

Kind regards

       Andreas.

PS: The GSoC project ends next Monday.  It would be great if you could
    at least let us know briefly what you intend to do with the filter,
    so that my student can give a reasonable outlook on the remaining
    work.

On Tue, Aug 16, 2011 at 04:34:19PM +0200, Andreas Tille wrote:
> Hi,
> 
> as requested by listmaster in this longish thread, we wrote a
> converter which strips certain tags from mboxes of lists.debian.org.
> The code can be found in the attached tgz.  You can also find it in
> 
>   git://git.debian.org/git/teammetrics/teammetrics.git
> 
> in directory mbox-tools.  The actual filter is mboxfilter.py.  It takes
> an mbox (uncompressed for the moment - feel free to ask for gzip
> support) and writes an mbox with the extension '.converted' - not a
> very cool name, but you did not specify one.  It is easy to adapt to
> your needs (better name / stdout / whatever).
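
Should gzip support be wanted, a sketch along the following lines
might be enough.  The helper names are hypothetical, and stripping a
trailing '.gz' before appending '.converted' is only my assumption
about a sensible naming scheme:

```python
import gzip
import os
import tempfile


def read_mbox_bytes(path: str) -> bytes:
    """Return the raw mbox content, transparently handling '.gz' files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return handle.read()


def converted_name(path: str) -> str:
    """Derive the output name: drop a trailing '.gz', append '.converted'."""
    base = path[:-3] if path.endswith(".gz") else path
    return base + ".converted"


# Round-trip check with throwaway files (the mbox content is made up).
sample = b"From sender@example.org Thu Aug 18 14:31:01 2011\nSubject: hi\n\nbody\n"
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "debian-med.200609")
    zipped = plain + ".gz"
    with open(plain, "wb") as fh:
        fh.write(sample)
    with gzip.open(zipped, "wb") as fh:
        fh.write(sample)
    # Both variants should yield exactly the same bytes.
    same = read_mbox_bytes(plain) == read_mbox_bytes(zipped) == sample
    out_name = os.path.basename(converted_name(zipped))
```

Writing to stdout instead would only change where the result goes,
not how the input is read.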
> 
> For the moment it takes a single file specifying the Message-IDs
> which should be deleted.  This file is called messageid and contains
> *only* the Message-IDs (not the prefix Skip-Spam-Message-Id: as written
> below).  It is not clear to us whether this prefix is always the same -
> that seems unlikely because it would then be redundant.  If the
> exclusion files contain such prefixes, can we safely assume that
> we get the Message-ID with the following regexp:
> 
>     ^Skip-.*-Message-Id: (.*)$
> 
> ?  If not, please give more detail or tell me where I can find those
> exclusion files on master.
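
To make the question concrete, this is roughly how the regexp above
would be applied to the lines of an exclusion file.  The function name
and the sample lines are invented for illustration; only the pattern
itself comes from the mail above:

```python
import re

# Pattern as proposed above: any "Skip-<something>-Message-Id:" prefix.
SKIP_RE = re.compile(r"^Skip-.*-Message-Id: (.*)$")


def parse_skip_lines(lines):
    """Collect the Message-IDs from lines that match the skip pattern."""
    ids = set()
    for line in lines:
        match = SKIP_RE.match(line.rstrip("\r\n"))
        if match:
            # Strip stray whitespace around the captured Message-ID.
            ids.add(match.group(1).strip())
    return ids


# Hypothetical input; the second prefix is a guess at a possible variant.
sample = [
    "Skip-Spam-Message-Id: <one@example.org>\n",
    "Skip-Other-Message-Id: <two@example.org>\n",
    "not a skip line\n",
]
ids = parse_skip_lines(sample)
print(sorted(ids))
```

Lines that do not match are silently ignored here; whether that is
acceptable depends on what else the exclusion files may contain.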
> 
> Moreover, you mentioned more than one exclusion file.  Do you
> mean *several* exclusion files per mbox, or just one per mbox which
> follows a defined naming scheme?
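
If the answer is "several per mbox", the command line could look
roughly like the sketch below.  The argument names are assumptions and
this is not the current interface of mboxfilter.py; the IDs from all
files are simply merged into one set:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command line taking one mbox followed by any number of skip files."""
    parser = argparse.ArgumentParser(description="Filter an mbox (sketch).")
    parser.add_argument("mbox", help="path to the input mbox")
    parser.add_argument("skipfiles", nargs="*",
                        help="zero or more exclusion files")
    return parser


def merge_ids(id_sets):
    """Union the Message-IDs gathered from all exclusion files."""
    merged = set()
    for ids in id_sets:
        merged |= set(ids)
    return merged


# Hypothetical invocation with two skip files.
args = build_parser().parse_args(["debian-med.200609", "skip-a", "skip-b"])
merged = merge_ids([{"<1@example.org>"},
                    {"<1@example.org>", "<2@example.org>"}])
print(args.mbox, args.skipfiles, sorted(merged))
```

An ID listed in more than one file is harmless, since the set keeps it
only once.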
> 
> Regarding the fields which are copied into the converted mbox: At
> the beginning of mboxfilter.py you will find a list HEADERS which
> specifies the headers to keep.  I also added a list
> possible_HEADERS which contains fields that might make sense to keep
> for certain reasons.  This is currently just for documentation.
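
The principle is a plain whitelist.  In the sketch below the content
of HEADERS is invented (the real list is the one in mboxfilter.py) and
keep_only is a hypothetical helper name:

```python
import email
from email.message import Message

# Hypothetical whitelist; the real HEADERS list lives in mboxfilter.py.
HEADERS = ["From", "To", "Subject", "Date", "Message-ID"]


def keep_only(msg: Message, headers=HEADERS) -> Message:
    """Delete every header whose name is not in the whitelist."""
    wanted = {name.lower() for name in headers}
    for name in set(msg.keys()):
        if name.lower() not in wanted:
            # del removes all occurrences of the header, case-insensitively.
            del msg[name]
    return msg


# Made-up message with two headers that are not on the whitelist.
raw = (
    "From: a@example.org\n"
    "Received: from mail.example.org\n"
    "X-Spam-Status: No\n"
    "Subject: test\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "body\n"
)
filtered = keep_only(email.message_from_string(raw))
print(filtered.keys())
```

Moving a field from possible_HEADERS into HEADERS would then be the
only change needed to keep it.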
> 
> I tested the filter with random mboxes (from different lists, different
> times, different sizes):
> 
> 	debian-accessibility.200406
> 	debian-announce.200902
> 	debian-devel.199808
> 	debian-devel.200704
> 	debian-devel.201106
> 	debian-jr.200609
> 	debian-med.200609
> 	debian-ocaml-maint.200408
> 
> using the messageid file in the attached tarball and found it working
> for these.  This messageid file was created using the script
> mbox-potential-spam-ids (just to have some input), and I checked the
> result with mbox-diff-check to detect potential problems.
> My tests did not reveal anything unexpected.
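
For readers without the tarball: one kind of sanity check that can be
run on such a conversion is a comparison of Message-ID sets.  The
sketch below is an illustration under that assumption and not a
description of what mbox-diff-check actually does; all names and IDs
are made up:

```python
def check_conversion(original_ids, converted_ids, skipped_ids):
    """Compare Message-ID sets before and after filtering."""
    original = set(original_ids)
    converted = set(converted_ids)
    skipped = set(skipped_ids)
    expected = original - skipped
    return {
        # Messages that should have survived but are missing.
        "wrongly_removed": expected - converted,
        # Messages that should have been dropped but are still present.
        "wrongly_kept": converted & skipped,
        # Messages that were never in the original mbox.
        "unexpected": converted - original,
    }


# Hypothetical IDs: three messages, one of them on the skip list.
report = check_conversion(
    ["<1@example.org>", "<2@example.org>", "<3@example.org>"],
    ["<1@example.org>", "<3@example.org>"],
    ["<2@example.org>"],
)
print(report)
```

A clean conversion leaves all three sets in the report empty.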
> 
> Please tell us how to proceed from here.
> 
> Kind regards
> 
>          Andreas.
> 
> On Thu, Aug 04, 2011 at 11:32:42AM +0200, Alexander Wirt wrote:
> > Sukhbir Singh wrote on Thursday, 04 August 2011:
> > 
> > > Hi Alex,
> > > 
> > > Can we have some prototype/ format of the Message-IDs that you want us
> > > to strip? It would be beneficial for both sides because then we can
> > > show you what we will be handling and you can tell if something else
> > > needs to be taken care of.
> > Sure. We have several files with entries like:
> > Skip-Spam-Message-Id: <4610e762.1f8f12a6.0218.7af1 at mx.google.com>
> > Skip-Spam-Message-Id: <8600e4c3dd4c62fb51f343ac020608e3 at gmail.com>
> > Skip-Spam-Message-Id: <CA287EE3.7684.AC15C2D5 at localhost>
> > 
> > It would be best if the converter accepts a message box and several skip
> > files. I'll write a wrapper that does the dirty details on the filesystem.
> > (Explaining everything in detail would take more time than writing a script).
> > 
> > Alex
> > 
> > 
> > > 
> > > Thanks for the help,
> > > 
> > > -- 
> > > Sukhbir
> > > 
> > 
> > 
> > -- 
> > To UNSUBSCRIBE, email to debian-devel-REQUEST at lists.debian.org
> > with a subject of "unsubscribe". Trouble? Contact listmaster at lists.debian.org
> > Archive: http://lists.debian.org/20110804093242.GM3348@smithers.snow-crash.org
> > 
> > 
> 
> -- 
> http://fam-tille.de



-- 
http://fam-tille.de


