[Reproducible-builds] Better popcon stats for source packages
infinity0 at debian.org
Mon Jun 13 13:17:18 UTC 2016
At Reproducible Builds we just added popcon stats to our issues page, to help us better understand which issues to prioritise: https://tests.reproducible-builds.org/debian/index_issues.html
However, we work on source packages, whereas popcon data is based on binary packages. As a result, that page is currently very inaccurate for some packages - for example, it thinks "linux" has a popcon score of 6.
Popcon does provide stats for source packages at http://popcon.debian.org/source/by_inst
However, these stats are basically useless - the "score" for each source package is simply the sum of the scores of the binary packages produced by that source package. This is *not* the correct way to calculate "popularity" for a source package, since it is heavily biased in favour of source packages with many binary packages that must be co-installed: a single machine is counted once for every such binary package it has installed.
What we really want is the statistic "number of people that have installed binary-package-1 OR binary-package-2 OR .. OR binary-package-n". It is mathematically impossible to calculate this from the data that popcon currently provides at http://popcon.debian.org/, because per-package totals do not say which machines overlap. However, fixing this is easy - we would simply need to change the backend to keep a separate "by-source-package" dump of data that is based on set union (logical disjunction) rather than arithmetic sum.
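To illustrate the difference between the two ways of counting, here is a minimal sketch in Python. It does not reflect how the popcon backend actually stores or processes submissions, which I have not seen; the function name, the input shapes and the package names are all made up for illustration. It assumes each submission can be represented as the set of binary packages one machine reports, plus a binary-to-source mapping.

```python
from collections import defaultdict


def source_scores(submissions, binary_to_source):
    """Return (by_sum, by_union) scores per source package.

    submissions: iterable of sets of binary package names, one set per
                 reporting machine (hypothetical input shape).
    binary_to_source: dict mapping binary package name -> source package name.
    """
    by_sum = defaultdict(int)    # what the by_inst source page effectively does
    by_union = defaultdict(int)  # machines with at least one binary installed

    for installed in submissions:
        sources_seen = set()
        for binary in installed:
            source = binary_to_source.get(binary)
            if source is None:
                continue  # binary not known to the mapping; skip it
            by_sum[source] += 1        # counted once per binary package
            sources_seen.add(source)
        for source in sources_seen:
            by_union[source] += 1      # counted once per machine

    return dict(by_sum), dict(by_union)


# Made-up example: "foo" ships three co-installed binaries, "bar" ships one.
binary_to_source = {
    "libfoo1": "foo",
    "libfoo-dev": "foo",
    "foo-bin": "foo",
    "bar": "bar",
}
submissions = [
    {"libfoo1", "libfoo-dev", "foo-bin"},
    {"libfoo1", "foo-bin", "bar"},
    {"bar"},
]
by_sum, by_union = source_scores(submissions, binary_to_source)
print(by_sum)    # foo: 5, bar: 2
print(by_union)  # foo: 2, bar: 2
```

In this toy data "foo" and "bar" are each installed on two machines, yet the sum-based score ranks "foo" well ahead purely because it is split into more binary packages.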
I had thought about coding up heuristics to estimate this, but it would be better to just have the popcon backend calculate this exactly, for others to consume.
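For what it's worth, the most that can be said from the published per-binary-package numbers alone is a range: the union is at least the largest single binary package count, and at most the sum of the counts (capped at the total number of submissions). A rough sketch, with a hypothetical function name and made-up numbers:

```python
def union_bounds(binary_counts, total_hosts):
    """Bounds on the number of machines with at least one of the binaries.

    binary_counts: list of install counts, one per binary package of a
                   single source package.
    total_hosts: total number of reporting machines.
    """
    # Every machine with the most popular binary is in the union.
    lower = max(binary_counts, default=0)
    # At best no machine has two of the binaries, and the union can
    # never exceed the number of reporting machines.
    upper = min(sum(binary_counts), total_hosts)
    return lower, upper


# Made-up numbers: three binaries from one source, 150 reporting machines.
print(union_bounds([100, 90, 80], 150))  # (100, 150)
```

The gap between the two bounds can be large, which is why any heuristic based on these totals would only be a guess.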
I'd be happy to submit a patch for the popcon backend, but I could only find the client source code here: https://anonscm.debian.org/cgit/popcon/ Could you let me know how I could submit a patch for the backend?