[Soc-coordination] Debian Teams Activity Metrics - Report II
sukhbir.in at gmail.com
Sun Jun 19 10:56:33 UTC 2011
This is the second report for the Google Summer of Code project
'Debian Team Activity Metrics'.
Continuing from the previous report, here is the status of the work we
planned for June 6th - June 17th, marked 'to be done' in that report:
- pushing this information into the database -- done.
- preventing redundancy by not allowing lists already downloaded to be
downloaded/parsed again -- done.
- replace the print statements with the logging module -- done.
- implementing a mechanism to remove spam from the lists -- in
progress. We are working on the design and will implement what we have
discussed soon. The implementation is easy given the way our code is
arranged; what is difficult is identifying patterns for detecting spam.
We have narrowed these down to a few and are currently improving them.
Note that we are not aiming to implement a full spam 'filter'; that
would be overkill for this project, given that we just need the top ten
active contributors from a list. Please feel free to suggest other
metrics for detecting spam.
+ we designed and implemented new metrics for measuring contribution.
You can have a look at the source code to get an idea.
+ the code has been improved -- it's modular and readable with better
identifiers and comments.
+ creating the database is handled by a script now.
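As an illustration of the redundancy check and the switch to the logging
module mentioned above, here is a minimal sketch. It is not the code from
our repository: the function name pending_archives, the logger name
"liststat" and the idea of identifying an archive by its URL are all
assumptions made for this example.

```python
import logging

# Hypothetical logger name; the real script may configure logging differently.
logger = logging.getLogger("liststat")


def pending_archives(available, already_parsed):
    """Return the archives that still need to be downloaded and parsed.

    `available` is an iterable of archive identifiers (assumed here to be
    URLs) and `already_parsed` holds the identifiers recorded earlier,
    for example in the database.
    """
    seen = set(already_parsed)
    pending = []
    for url in available:
        if url in seen:
            # Replaces what would previously have been a print statement.
            logger.info("skipping %s: already parsed", url)
            continue
        seen.add(url)
        pending.append(url)
    return pending
```

Keeping the check in one small function means the download and parsing
steps never have to know which archives were handled in an earlier run.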
In the coming weeks, we will be:
- completing the implementation of the spam 'filter',
- parsing the lists on lists.debian.org.
As of now, our script only parses the lists on Alioth. Once we get the
mbox archives from lists.debian.org, the rest should be trivial: most of
the work has been done, and the function which handles the parsing is
separate from the rest of the code, so it is just a matter of calling
that function with the archives. We did get in touch with the
maintainers at lists.debian.org, but they seemed hesitant to provide
access to the archives of all the lists. We will be getting in touch
with them again to resolve this.
- parsing commit data from repositories, which is phase III of the
project as mentioned in the GSoC report. We will see whether the
wheel needs to be reinvented again or not :-)
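To show why a separate parsing function makes new archive sources cheap
to add, here is a rough sketch using Python's standard mailbox module.
The names body_text and parse_mbox are hypothetical and this is not the
function from our repository; it also assumes a standard 'From:' header
and decodes every text part as UTF-8, which real archives may violate.

```python
import mailbox
from email.utils import parseaddr


def body_text(message):
    """Concatenate the text/plain parts of a message into one string."""
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                # Simplification: assume UTF-8 and replace undecodable bytes.
                parts.append(payload.decode("utf-8", errors="replace"))
    return "".join(parts)


def parse_mbox(path):
    """Yield (name, body_length) for every message in the mbox at `path`."""
    # create=False so a wrong path raises instead of creating an empty file.
    for message in mailbox.mbox(path, create=False):
        name, _address = parseaddr(message.get("From", ""))
        yield name, len(body_text(message))
```

With a function of this shape, an archive from Alioth and one from
lists.debian.org are handled the same way: only the path changes.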
The output from the script is *exactly* what we aimed for. We have
also added some new metrics which we think will be very helpful.
Just to give an idea about our work, here is the statistical data from
the mailing list for this project, teammetrics-discuss.
(The most active contributor using 'Name' as the frequency)
liststat=# SELECT name, COUNT(name) FROM listarchives GROUP BY name;
     name      | count
---------------+-------
 Andreas Tille |    30
 Scott Howard  |     1
 Sukhbir Singh |    54
But we wondered: is the person with the highest number of postings
always the most active contributor? Not so. So here is another metric,
which measures the number of characters in the message body:
(The most active contributor using the length of the message body as
the metric)
liststat=# SELECT name, SUM(msg_raw_len) FROM listarchives GROUP BY name;
     name      |  sum
---------------+-------
 Andreas Tille | 47815
 Sukhbir Singh | 39968
 Scott Howard  |   455
This should give you an idea of what we are aiming to do.
And given how active and helpful Andreas has been, here is a
'statistical' thank you to him :-)
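The two queries above can be turned into a 'top ten' ranking by adding
ORDER BY and LIMIT. The sketch below does that with Python's sqlite3
module purely so that it is self-contained; our database is PostgreSQL.
The function name rank_contributors and the sample rows are made up for
illustration; only the table name listarchives and the columns name and
msg_raw_len come from the queries shown above.

```python
import sqlite3


def rank_contributors(conn, limit=10):
    """Return (by_posts, by_length): two rankings of contributors."""
    by_posts = conn.execute(
        "SELECT name, COUNT(*) AS posts FROM listarchives "
        "GROUP BY name ORDER BY posts DESC, name LIMIT ?",
        (limit,),
    ).fetchall()
    by_length = conn.execute(
        "SELECT name, SUM(msg_raw_len) AS total FROM listarchives "
        "GROUP BY name ORDER BY total DESC, name LIMIT ?",
        (limit,),
    ).fetchall()
    return by_posts, by_length


# Hypothetical sample data, not taken from any real list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listarchives (name TEXT, msg_raw_len INTEGER)")
conn.executemany(
    "INSERT INTO listarchives VALUES (?, ?)",
    [("Alice", 10), ("Alice", 10), ("Alice", 10), ("Bob", 500)],
)
by_posts, by_length = rank_contributors(conn)
print(by_posts)   # Alice leads by number of postings
print(by_length)  # Bob leads by total message length
```

As with the real data, the two metrics can disagree about who comes
first, which is why we report both.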
Our Git repository is on Alioth.
Thank you for reading this and please get in touch with us here on our
mailing list if you want to discuss anything, as always.
 - http://lists.alioth.debian.org/pipermail/soc-coordination/2011-June/001004.html
 - http://lists.alioth.debian.org/pipermail/teammetrics-discuss/2011-June/000043.html
 - http://lists.alioth.debian.org/pipermail/teammetrics-discuss/2011-June/000084.html
 - http://wiki.debian.org/SummerOfCode2011/TeamFeatures/SukhbirS
 - http://lists.alioth.debian.org/pipermail/teammetrics-discuss/
 - https://alioth.debian.org/scm/browser.php?group_id=100628