[Popcon-developers] Accessing popcon data

Wed Mar 23 20:06:51 UTC 2011

Hi Bill,

Thanks for your input on this; it is very much appreciated!

2011/3/23 Bill Allombert <Bill.Allombert at math.u-bordeaux1.fr>:
>
> I will come to that later. But for now, I would like to define a leak as the
> possibility to guess information about the data that is not provided by
> the current popcon.debian.org.
>
In fact, I was only concerned with avoiding leaks that would compromise
users' privacy. I don't see the problem with providing more information
than popcon reports currently provide (except that it should be agreed
on here first), as long as we safeguard users' individual records.

Don't you think it could help make Debian better?

If we manage to develop a sufficiently accurate recommender, such a
system would help users discover packages they might want. Assuming that
it would increase package popularity in general, we would be improving
Debian (more users, more bug reports, more contributors, etc.).

> Let us consider D to be the data set you are using, R the
> recommender, q a query and C the recommendation.
>
> We have the equation R(D,q)=C.
>
> Now assume that someone has access to the source code of R and is able to make
> an unlimited number of queries to R. They can ask themselves: what are all the
> possible datasets D such that R(D,q)=C for all queries?
> The fewer datasets are possible, the more information is leaked.
>
> For example, they might know that user X has packages foo1, foo2, foo3 installed,
> and they want to know whether it is likely that X has installed bar1, bar2 or bar3.
> So they might submit queries to the recommender and see whether it suggests
> bar1, bar2 or bar3.
>
I agree with you, but again, I don't think this kind of association is
a problem. I believe quite the opposite, actually. I think it might be
useful to know that some packages are usually installed together even
though they do not depend on each other. This kind of information might
be useful for QA, for example.
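
As a minimal sketch of this kind of association, the following counts how
often two packages appear together and estimates how often one accompanies
the other. The report format, the toy data and the function names are
assumptions made for illustration; they are not popcon's actual data or code.

```python
from collections import Counter
from itertools import combinations

# Toy data (hypothetical): each set stands for the packages in one report.
reports = [
    {"foo1", "foo2", "bar1"},
    {"foo1", "foo2", "bar1"},
    {"foo1", "bar2"},
    {"foo2", "bar1"},
]

def cooccurrence(reports):
    """Count how many reports contain each unordered pair of packages."""
    pairs = Counter()
    for pkgs in reports:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(pkgs), 2):
            pairs[(a, b)] += 1
    return pairs

def confidence(reports, antecedent, consequent):
    """Fraction of reports listing `antecedent` that also list `consequent`."""
    with_antecedent = [r for r in reports if antecedent in r]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent in r for r in with_antecedent)
    return hits / len(with_antecedent)
```

Here confidence(reports, "foo2", "bar1") is the share of reports with foo2
that also list bar1, which is the kind of figure both a QA tool and the
attacker described above would look at.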

> Or they have found a security issue in package bar, and are looking for servers running
> bar. They can search for web servers that run such and such web service packages (for example they
> can check whether the server runs the Debian version of apache, etc.) and they might use
> the recommender to find a set of web services that is correlated with the installation of the
> package bar.
>
> These are passive attacks. But of course, they could submit hand-crafted popcon reports
> to affect the dataset and change the behaviour of the recommender to increase their chance
> of success.
>
This is a central issue in data mining. When we help people to
better understand a collection of data, we leave them free to use this
knowledge as they want. And in fact this could be extrapolated to any
field of science. You never know if what you are
researching/developing will be used for evil things. It's a pity.

> Now, to the subject of tracking individual records: it depends how the recommender works,
> but some packages are only installed by a single user. Sometimes the package is even private
> to that user (e.g. home-built kernels with custom version). This information could be used
> to guess other packages installed by the same user.
>
Regarding this point, I'd say that private packages and those with very
few users, or a single one, would have very low priority for
recommendations, and could even be ignored.
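
One way this could look, as a rough sketch only: drop any package reported
by fewer than a minimum number of users before it can be recommended. The
threshold value, the toy data and the function names are hypothetical and
do not come from any existing recommender.

```python
from collections import Counter

MIN_USERS = 2  # hypothetical threshold; a real value would need discussion

# Toy data (hypothetical): each set stands for the packages in one report.
reports = [
    {"vim", "git", "linux-image-custom"},
    {"vim", "git"},
    {"vim", "mutt"},
]

def eligible_packages(reports, min_users=MIN_USERS):
    """Return the packages that appear in at least `min_users` reports."""
    counts = Counter(pkg for pkgs in reports for pkg in set(pkgs))
    return {pkg for pkg, n in counts.items() if n >= min_users}

def filter_recommendations(candidates, reports, min_users=MIN_USERS):
    """Keep only candidates with enough users, preserving their order."""
    allowed = eligible_packages(reports, min_users)
    return [pkg for pkg in candidates if pkg in allowed]
```

With the toy data above, a home-built kernel reported by a single user
never reaches the list of suggestions, whatever the recommender computes.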

[ ]'s

Tássia.