[Teammetrics-discuss] Git repositories parsing is ready.

Andreas Tille andreas at an3as.eu
Tue Jul 12 19:52:49 UTC 2011


On Tue, Jul 12, 2011 at 10:37:54PM +0530, Sukhbir Singh wrote:
> 
> 1. Notice lines 24 and 25. We have to give a username / password as by
> default the SSH private key is encrypted. To decrypt that, the
> passphrase is required. We have two approaches here, you can decide
> which one to follow:
> 
> - Either we save the password and let it decrypt the key (current approach)
> 
> OR
> 
> - We save the key in plain text after decrypting it and let the script
> use the decrypted key.
> 
> Both approaches would involve setting the appropriate permissions so
> that the password/ private key is not readable. But this is for you to
> decide :-).

I'm fine with the current approach because it seems equivalent to the
other suggestion (if I understood you correctly).  However, I'm not
sure that the Alioth admins understood us correctly.  This script is
intended to run on a server on the Internet which is accessible to a
set of people (who belong to the blends/debian-med team and have some
reason to use this server).  On this host an unencrypted password / key
would be lying around which enables login to Alioth.  It could even
happen that you just play around with the script and commit the result,
forgetting to delete the password again.  Well, it's not an easy attack
path against Alioth, but I'm not perfectly happy about this principle.
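
One way to reduce the risk of committing a password by accident is to
keep it out of the script entirely and read it from a separate file that
is not tracked by Git.  The following is only a minimal sketch under
that assumption: the function name read_secret and the idea of a
dedicated secret file are hypothetical and not part of the current
script, and it is written for Python 3 on a POSIX system.  It refuses
to use a file that the group or others can access.

```python
import os
import stat


def read_secret(path):
    """Return the secret stored in `path`.

    Hypothetical helper: raises PermissionError if the file is accessible
    to the group or to others (anything beyond mode 600).
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            "%s has mode %o; expected 600 or stricter" % (path, mode))
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

This does not remove the underlying concern (the secret is still stored
unencrypted on the host); it only keeps it out of the repository and
away from other local users.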
 
> 2. We need an account also for Alioth that will connect to it via SSH.
> For now, you can test it with your Alioth account but this has to be
> discussed -- which account do we use?

Please discuss this again with the Alioth admins.  I do not care about
the account name (but definitely not my account, because my password
will not hang around inside those scripts) ...

As I suggested - a web service providing those data to your script
might do the trick without such password stuff.
 
> 3. Please note the schema is now:
> 
>      Column     |  Type   | Modifiers
> ----------------+---------+-----------
>  project        | text    |
>  package        | text    |
>  name           | text    |
>  changes        | integer |
>  lines_inserted | integer |
>  lines_deleted  | integer |
> 
> project - name of the project, like debian-med
> package - name of the package within a project, like 'ball' in 'debian-med'.
> name - name of the author
> changes - the frequency of commits
> lines_inserted/ deleted - lines inserted/ deleted

For debugging it might be interesting to know which kind of VCS the
information was obtained from.  This is not relevant for our research,
but if you need to check the results it is probably easier if you can
tell things apart.
 
> I hope this is acceptable. Please let me know any changes if they are
> to be made.

I just committed my config file as an example.
 
> So for testing, fill in lines 24, 25 and update the schema for gitstat
> (The archives.sql will do it for you).

Sure.
 
> 4. Also note that updatenames.py is called so the multiple-names
> problem is taken care of (if any).

It seems to need some updating, but these are details.
 
> 5. Statistics:
> 
> teammetrics=# SELECT name, SUM(changes) FROM gitstat WHERE
> project='teammetrics' GROUP BY name ORDER BY SUM DESC;
>      name      | sum
> ---------------+-----
>  Sukhbir Singh |  46
>  Andreas Tille |   2

Good. :-)
 
> teammetrics=# SELECT name, SUM(changes) FROM gitstat WHERE
> project='debian-med' GROUP BY name ORDER BY SUM DESC;
> 
>           name           | sum
> -------------------------+------
>  oliver                  | 3771
>  amoll                   | 3724
>  Andreas Hildebrandt     | 1103
>  anker                   |  859
>  Charles Plessy          |  689

I've got the same numbers, but this stat also raises questions.  First,
I guess it is Git only (I did not find any svn string in the code, and
I'm not *that* far behind those leading committers ;-)).

Then I wonder about names I have never heard of, such as 'anker'.  I
have also never heard the names 'Daniel Stoeckel' and 'Anna Dehof' and
cannot see them in the list of Debian Med members on Alioth.  Ahh, I'm
afraid I have an explanation:  These are non-Debian related upstream
changes which are poisoning our statistics.  That might be a problem.
We want to track only changes in debian/.  Is there any way to separate
those changes?  BTW, I was never happy about the policy that the Git
repositories carry the whole upstream archive.  I thought this was due
to my Git beginner status, but it now seems to cause us trouble.
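
One possible way to separate those changes is to ask Git for the log of
the debian/ directory only, using "git log --numstat -- debian/", and to
count only paths below that directory.  The sketch below assumes this
approach; the function names, the "author:" marker in the pretty format
and the dictionary layout are illustrative choices and are not taken
from the existing gitstat code.  It is written for Python 3.

```python
from collections import defaultdict


def git_log_command(subdir="debian/"):
    """Build a git log call limited to commits touching `subdir`."""
    return ["git", "log", "--numstat",
            "--pretty=format:author:%an", "--", subdir]


def parse_numstat(log_text, prefix="debian/"):
    """Sum commits and inserted/deleted lines per author for paths under `prefix`."""
    stats = defaultdict(
        lambda: {"changes": 0, "lines_inserted": 0, "lines_deleted": 0})
    author = None
    counted = False  # has the current commit been counted already?
    for line in log_text.splitlines():
        if line.startswith("author:"):
            author = line[len("author:"):]
            counted = False
            continue
        parts = line.split("\t")
        if author is None or len(parts) != 3:
            continue  # blank separator or unexpected line
        added, deleted, path = parts
        if not path.startswith(prefix):
            continue  # upstream file, ignore
        if not counted:
            stats[author]["changes"] += 1
            counted = True
        if added != "-":  # numstat prints "-" for binary files
            stats[author]["lines_inserted"] += int(added)
            stats[author]["lines_deleted"] += int(deleted)
    return dict(stats)
```

The output of the command from git_log_command(), run inside a
repository (for example via subprocess), can be passed to
parse_numstat().  The prefix check in the parser is redundant when the
pathspec is used, but it keeps the parser safe if it is fed an
unrestricted log.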

> It would be great if you could test some more repositories just to
> confirm all is well.

Apart from this influence of upstream changes, it seemed to work quite
well for a while:

teammetrics=# SELECT project, count(*) from gitstat group by project order by project;
     project     | count 
-----------------+-------
 debian-med      |   127
 debian-science  |     5
 pkg-common-lisp |   363
 pkg-java        |   345
 pkg-multimedia  |   232
 pkg-scicomp     |    44
(6 rows)

However, I ran into an encoding problem:

$ python zzz_run_gitstat.py 
Traceback (most recent call last):
  File "zzz_run_gitstat.py", line 169, in <module>
    ssh_initialize()
  File "zzz_run_gitstat.py", line 162, in ssh_initialize
    fetch_logs(ssh)
  File "zzz_run_gitstat.py", line 124, in fetch_logs
    (team, each_dir, author, changes, insert, delete)
psycopg2.DataError: ERROR:  invalid byte sequence for encoding "UTF8": 0xf66c6e69


You remember my promise that encoding problems will always beat you?
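
The bytes 0xf6 0x6c 0x6e 0x69 from the error read as "ölni" when
interpreted as Latin-1, which suggests (but does not prove) that some
author name in the logs is not UTF-8 encoded.  A minimal sketch of a
defensive decoding step that could be applied before the INSERT,
assuming Latin-1 as the fallback; the function name to_text is
hypothetical and the code is written for Python 3:

```python
def to_text(raw, fallbacks=("latin-1",)):
    """Decode bytes as UTF-8, trying the fallback encodings if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what is decodable and mark the rest.
    return raw.decode("utf-8", errors="replace")
```

Since Latin-1 accepts every byte, this never fails, but it can guess
wrong for names stored in some other legacy encoding.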
 
> 6. How do we find out which VCS the team is using? I have a solution
> but I am hoping we can do better. We try the SVN URL first and if
> that fails, we try with Git/ SSH. And if that fails too, we throw up
> an error. Good/ bad?

This was exactly my first idea and it probably works that way.  Please
keep in mind that some teams (debian-med, debian-science) use *both*
SVN *and* Git.

> Or we can perhaps get this from the UDD?

Perhaps indirectly.  You might look at a random package of the team and
parse the Vcs columns.  However, this is probably not very safe because
there is no reliable way to tell that *all* packages of a team are kept
in a VCS.  Maybe you pick a random package which is not in any VCS (for
whatever reason).

So your first idea is probably the best:

   - Check SVN
   - *and* Check Git

for *all* teams.
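
A sketch of that logic, under the assumption that each check can be
wrapped in a function that takes the team name and returns True or
False.  The name detect_vcs and the probe functions are hypothetical;
the real checks would query the SVN URL and Git over SSH.  Unlike a
"try SVN, else Git" fallback, every probe is run, so teams using both
systems are reported with both.

```python
def detect_vcs(team, probes):
    """Return the names of all VCS for which the probe succeeds.

    `probes` is a sequence of (name, function) pairs; each function
    takes the team name and returns True if the team uses that VCS.
    A probe that raises OSError (e.g. connection failure) counts as
    "not found" instead of aborting the whole run.
    """
    found = []
    for name, probe in probes:
        try:
            if probe(team):
                found.append(name)
        except OSError:
            continue
    return found


# Usage with stand-in probes in place of the real SVN/Git checks:
fake_probes = [
    ("svn", lambda team: team in ("debian-med", "debian-science")),
    ("git", lambda team: team in ("debian-med", "pkg-java")),
]
print(detect_vcs("debian-med", fake_probes))
```

An empty result would then be the case where an error is reported for
the team.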

> PS: If you commit anything, please make sure you do remove your
> password from the script before committing. One tends to forget
> sometimes.

:-)
As I feared above.  That's why I use a copy of the original file
which is not tracked by Git ...
 
> Thanks for reading this long post! When you give the go ahead, I will
> replace the print statements and implement logging support and
> finalize this code.

Apart from the details I mentioned, which might backfire with different
degrees of trickiness, I'm really happy about this, because it finally
adds a genuinely new measure to our research.

Thanks for your effort

        Andreas. 

-- 
http://fam-tille.de


