[Teammetrics-discuss] Converter for mboxes (Was: Debian mailing lists archives as mbox)
Andreas Tille
andreas at an3as.eu
Thu Aug 18 14:31:01 UTC 2011
Hi,
sorry for the full quote, which I am leaving as a reminder of the mbox
conversion issue. I would like to add that we recently found some mboxes
from alioth lists (which are public) containing messages that lack any
Message-ID. All of these messages were spam. To account for this, do you
want the conversion script to drop messages without a Message-ID
whenever it finds them?
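
To make the question concrete, here is a minimal sketch of such a drop
rule. It uses only Python's standard mailbox module and is not taken from
mboxfilter.py; the function name drop_without_message_id is made up for
this illustration.

```python
import mailbox


def drop_without_message_id(src_path, dst_path):
    """Copy an mbox, skipping every message that has no Message-ID.

    Hypothetical helper for illustration; returns (kept, dropped).
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    kept = dropped = 0
    try:
        for msg in src:
            msgid = msg.get("Message-ID")
            # Treat a missing header and an empty header the same way.
            if msgid is None or not str(msgid).strip():
                dropped += 1
                continue
            dst.add(msg)
            kept += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return kept, dropped
```

Returning both counts makes it easy to log how many messages were
discarded per mbox.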
Kind regards
Andreas.
PS: The GSoC project ends next Monday. It would be great if you could
at least give us a short note on what you intend to do with the filter,
so that my student can give a reasonable outlook on what remains to be
done.
On Tue, Aug 16, 2011 at 04:34:19PM +0200, Andreas Tille wrote:
> Hi,
>
> as requested by listmaster in this longish thread, we wrote a
> converter which strips certain tags from mboxes of lists.debian.org.
> The code can be found in the attached tgz. You can also find it in
>
> git://git.debian.org/git/teammetrics/teammetrics.git
>
> in directory mbox-tools. The actual filter is mboxfilter.py. It takes
> an mbox (uncompressed for the moment; feel free to ask for gzip support)
> and writes an mbox with the extension '.converted'. That is not a very
> cool name, but you gave no specification. It is easy to adapt to your
> needs (better name / stdout / whatever).
>
> For the moment it takes a single file specifying the Message-IDs that
> should be deleted. This file is called messageid and contains *only*
> the Message-IDs (not the prefix Skip-Spam-Message-Id: as written below).
> It is not clear to us whether this prefix is always the same; that
> seems unlikely, because it would then be redundant. If the exclusion
> files do contain those prefixes, can we safely assume that we can
> extract the Message-ID with the following regexp:
>
> ^Skip-.*-Message-Id: (.*)$
>
> ? If not, please be more specific or tell me where I can find those
> exclusion files on master.
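
A minimal sketch of how skip-file lines could be parsed with the regexp
above. It is an illustration, not code from mboxfilter.py: the function
name read_skip_ids is made up, and accepting bare "<...>" lines (as in
the current messageid file) alongside prefixed lines is an assumption.

```python
import re

# The regexp proposed in the thread for prefixed skip-file entries.
SKIP_LINE = re.compile(r"^Skip-.*-Message-Id: (.*)$")


def read_skip_ids(lines):
    """Collect Message-IDs from an iterable of skip-file lines.

    Hypothetical helper: accepts prefixed entries and, as an assumption,
    bare '<...>' entries; everything else is ignored.
    """
    ids = set()
    for raw in lines:
        line = raw.strip()
        match = SKIP_LINE.match(line)
        if match:
            ids.add(match.group(1).strip())
        elif line.startswith("<") and line.endswith(">"):
            ids.add(line)
    return ids
```

Because the result is a set, the same function can be called once per
skip file and the results merged with a set union.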
>
> Moreover, you spoke of more than one exclusion file. Do you mean
> *several* exclusion files per mbox, or just one per mbox with a
> defined naming scheme?
>
> Regarding the fields which are carried over into the converted mbox: at
> the beginning of mboxfilter.py you will find a list HEADERS which
> specifies the headers that are kept. I also added a list
> possible_HEADERS which contains fields that might make sense to keep
> for certain reasons. This is currently just for documentation.
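
A minimal sketch of how such a header whitelist could be applied to a
single message. The contents of HEADERS below are placeholders chosen for
illustration (the real list lives in mboxfilter.py), and keep_headers is
a made-up name.

```python
from email import message_from_string

# Placeholder whitelist; the actual HEADERS list in mboxfilter.py may differ.
HEADERS = ["From", "To", "Date", "Subject", "Message-ID"]


def keep_headers(msg, wanted=HEADERS):
    """Delete every header of msg whose name is not in wanted.

    Header names are compared case-insensitively; the message is
    modified in place and also returned for convenience.
    """
    wanted_lower = {name.lower() for name in wanted}
    # Iterate over a copy of the names, since we delete while looping.
    for name in set(msg.keys()):
        if name.lower() not in wanted_lower:
            del msg[name]  # removes all occurrences of this header
    return msg
```

The body of the message is left untouched; only the header block shrinks.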
>
> I tested the filter with random mboxes (from different lists, different
> times, different sizes):
>
> debian-accessibility.200406
> debian-announce.200902
> debian-devel.199808
> debian-devel.200704
> debian-devel.201106
> debian-jr.200609
> debian-med.200609
> debian-ocaml-maint.200408
>
> using the messageid file in the attached tarball and found it working
> for these. This messageid file was created using the script
> mbox-potential-spam-ids (just to have some input), and I checked the
> result with mbox-diff-check to detect potential problems.
> My tests did not reveal anything unexpected.
>
> Please tell us how to proceed from here.
>
> Kind regards
>
> Andreas.
>
> On Thu, Aug 04, 2011 at 11:32:42AM +0200, Alexander Wirt wrote:
> > Sukhbir Singh wrote on Thursday, 04 August 2011:
> >
> > > Hi Alex,
> > >
> > > Can we have some prototype/format of the Message-IDs that you want us
> > > to strip? It would be beneficial for both sides because then we can
> > > show you what we will be handling and you can tell if something else
> > > needs to be taken care of.
> > Sure. We have several files with entries like:
> > Skip-Spam-Message-Id: <4610e762.1f8f12a6.0218.7af1 at mx.google.com>
> > Skip-Spam-Message-Id: <8600e4c3dd4c62fb51f343ac020608e3 at gmail.com>
> > Skip-Spam-Message-Id: <CA287EE3.7684.AC15C2D5 at localhost>
> >
> > It would be best if the converter accepts a message box and several skip
> > files. I'll write a wrapper that does the dirty details on the filesystem.
> > (Explaining everything in detail would take more time than writing a script).
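
A minimal sketch of a command line that takes one mbox plus any number of
skip files, as requested above. The argument names, the -o option and
both function names are hypothetical and are not the actual interface of
mboxfilter.py; only the '.converted' default comes from this thread.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI: one mbox followed by zero or more skip files."""
    parser = argparse.ArgumentParser(
        description="Filter an mbox using Message-ID skip files (sketch)."
    )
    parser.add_argument("mbox", help="path to the input mbox")
    parser.add_argument(
        "skipfiles", nargs="*", help="files listing Message-IDs to drop"
    )
    parser.add_argument(
        "-o", "--output", default=None,
        help="output path (default: input name plus '.converted')",
    )
    return parser


def output_path(args):
    """Return the explicit output path, or fall back to '<mbox>.converted'."""
    return args.output or args.mbox + ".converted"
```

A wrapper script could then call the converter once per mbox and pass
whichever skip files apply to it.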
> >
> > Alex
> >
> >
> > >
> > > Thanks for the help,
> > >
> > > --
> > > Sukhbir
> > >
> >
> >
> > --
> > To UNSUBSCRIBE, email to debian-devel-REQUEST at lists.debian.org
> > with a subject of "unsubscribe". Trouble? Contact listmaster at lists.debian.org
> > Archive: http://lists.debian.org/20110804093242.GM3348@smithers.snow-crash.org
> >
> >
>
> --
> http://fam-tille.de
--
http://fam-tille.de