[Popcon-developers] Accessing popcon data

Bill Allombert Bill.Allombert at math.u-bordeaux1.fr
Wed Mar 23 18:17:18 UTC 2011


On Wed, Mar 23, 2011 at 01:50:44PM -0300, Tássia Camões wrote:
> Hello Bill,
> 
> 2011/3/23 Bill Allombert <Bill.Allombert at math.u-bordeaux1.fr>:
> >
> > Now, it would be great if there was some research on how much information a recommendation
> > system leaks, and how that can be mitigated. A whole new master project I am afraid.
> 
> I plan to bring up these questions in my work, but I'll probably not go
> too deep since my main goal is to develop the recommender.

I understand that! 

> Indeed, there is some research in this field. As an example, there
> is this article from researchers at the University of Texas at Austin
> whose case study discusses how to break the anonymity of the Netflix Prize
> dataset [1]. And in fact the contest was canceled after a privacy
> lawsuit.
> 
> However, I don't see these privacy issues as a problem of the
> recommendation systems themselves; these leaks are all a consequence of
> the disclosure of the transactions database.
> Netflix had to do it for the contest, since the participants could not
> develop a solution without having access to the database. And so they
> got sued.
> 
> In this project, what will be available is the recommender, not the
> database. The recommendation is a result of processing the whole bunch
> of data to give suggestions to a specific user based on commonalities
> of behavior. I can't see how individual records could be tracked if
> you only have access to the recommendation. Do you?

I will come to that later. But for now, I would like to define a leak as the
possibility of guessing information about the data that is not provided by
the current popcon.debian.org.

Let us consider D to be the data set you are using, R the
recommender, q a query and C the recommendation.

We have the equation R(D,q)=C.

Now assume that someone has access to the source code of R and is able to make
an unlimited number of queries to R. They can ask themselves: what are all the
possible datasets D such that R(D,q)=C for every query q they have made?
The fewer datasets are possible, the more information is leaked.
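
As a rough illustration of this counting argument, here is a minimal Python
sketch. Everything in it is an assumption made for the example: the toy
recommender `recommend`, the three package names, and the two-user dataset are
hypothetical and are not the recommender discussed in this thread. The sketch
enumerates every possible dataset and keeps those that give the same answers
as the hidden one.

```python
from itertools import combinations, combinations_with_replacement

# Hypothetical package universe, kept tiny so brute force is feasible.
PACKAGES = ("bar", "baz", "foo")

def recommend(dataset, query):
    """Toy R(D, q): the package most often installed alongside everything in q.

    Ties are broken alphabetically; None means no profile contains q.
    """
    counts = {}
    for profile in dataset:
        if query <= profile:
            for pkg in profile - query:
                counts[pkg] = counts.get(pkg, 0) + 1
    if not counts:
        return None
    return min(counts, key=lambda p: (-counts[p], p))

def all_profiles():
    """Every possible set of installed packages (including the empty one)."""
    return [frozenset(c)
            for r in range(len(PACKAGES) + 1)
            for c in combinations(PACKAGES, r)]

def all_datasets(n_users=2):
    """Every dataset of n_users profiles, ignoring the order of the users."""
    return list(combinations_with_replacement(all_profiles(), n_users))

def consistent_datasets(answers, n_users=2):
    """All D such that R(D, q) equals the observed answer for every q."""
    return [d for d in all_datasets(n_users)
            if all(recommend(d, q) == c for q, c in answers.items())]

# The attacker never sees `secret`, only the answers to their queries.
secret = (frozenset({"foo", "bar"}), frozenset({"foo", "bar", "baz"}))
queries = [frozenset({p}) for p in PACKAGES]
answers = {q: recommend(secret, q) for q in queries}

candidates = consistent_datasets(answers)
print(len(candidates), "of", len(all_datasets()), "datasets remain possible")
```

The smaller the surviving set of candidates, the more the answers reveal about
D. A real dataset is far too large to enumerate this way, which is why the
hypothesis-testing approach is the more realistic one.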

More realistically, they can make a hypothesis H and check its validity by making
queries. Since they have access to the algorithm used by the recommender, they can
infer some information from the answers.

For example, they might know that user X has packages foo1, foo2, foo3 installed,
and they want to know whether it is likely that X has installed bar1, bar2 or bar3.
So they might submit queries to the recommender and see whether it suggests
bar1, bar2 or bar3.
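
The probing described above could be sketched as follows. This is only an
illustration under stated assumptions: `recommend` is a hypothetical
co-installation scorer, not the actual recommender, and the dataset is made up.

```python
from collections import Counter

def recommend(dataset, installed, k=2):
    """Hypothetical recommender: top-k packages co-installed with the query.

    Each profile votes for its other packages, weighted by how many of the
    queried packages it shares. Ties are broken alphabetically.
    """
    scores = Counter()
    for profile in dataset:
        overlap = len(installed & profile)
        if overlap:
            for pkg in profile - installed:
                scores[pkg] += overlap
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:k]

def probe(dataset, known, candidates, k=2):
    """Submit what is known about X and report which candidates come back."""
    suggested = set(recommend(dataset, known, k))
    return {pkg: pkg in suggested for pkg in candidates}

# Made-up data; the attacker only sees the recommender's answers.
dataset = [
    {"foo1", "foo2", "foo3", "bar1"},
    {"foo1", "foo2", "bar1", "bar2"},
    {"foo3", "bar3"},
    {"other", "bar3"},
]
known_about_x = {"foo1", "foo2", "foo3"}
result = probe(dataset, known_about_x, ["bar1", "bar2", "bar3"])
print(result)
```

A suggested package is not proof that X has it installed; it only shifts the
odds, which is exactly the kind of information the definition of a leak above
is meant to capture.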

Or they have found a security issue in package bar, and are looking for servers running
bar. They can search for web servers that run such and such webservice packages (for example they
can check whether the server runs the Debian version of apache, etc.) and they might use
the recommender to find a set of webservices that is correlated with the installation of the
package bar.

These are passive attacks. But of course, they could also submit hand-crafted popcon reports
to affect the dataset and change the behaviour of the recommender to increase their chances
of success.
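
A minimal sketch of such an active attack, again with a hypothetical
recommender (`top_suggestion`) and made-up reports rather than real popcon
submissions: a couple of crafted reports are enough to change what the toy
recommender suggests.

```python
from collections import Counter

def top_suggestion(dataset, installed):
    """Hypothetical recommender: most frequent package co-installed with the query."""
    scores = Counter()
    for profile in dataset:
        if installed <= profile:
            scores.update(profile - installed)
    if not scores:
        return None
    # Highest count first, alphabetical order on ties.
    return min(scores, key=lambda p: (-scores[p], p))

# Reports from honest users (made up for the example).
honest = [
    frozenset({"apache2", "php"}),
    frozenset({"apache2", "php"}),
    frozenset({"apache2", "bar"}),
]
# Hand-crafted reports submitted by the attacker.
crafted = [frozenset({"apache2", "bar"})] * 2

query = frozenset({"apache2"})
before = top_suggestion(honest, query)
after = top_suggestion(honest + crafted, query)
print(before, "->", after)
```

How many crafted reports are needed in practice depends on the size of the
dataset and on how the recommender weights each report.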

Now, to the subject of tracking individual records: it depends on how the recommender works,
but some packages are only installed by a single user. Sometimes the package is even private
to that user (e.g. home-built kernels with a custom version). This information could be used
to guess other packages installed by the same user.
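
To make the single-user case concrete, here is a small sketch. The helper
names and the package names (including the custom kernel) are hypothetical.
When a package appears in only one report, anything derived from
"packages installed together with it" describes that one user.

```python
def installation_count(dataset, package):
    """Number of reports that list `package`."""
    return sum(package in profile for profile in dataset)

def co_installed(dataset, package):
    """Hypothetical recommender core: packages seen alongside `package`."""
    result = set()
    for profile in dataset:
        if package in profile:
            result |= profile - {package}
    return result

# Made-up reports; the first user runs a home-built kernel.
dataset = [
    {"linux-image-2.6.38-custom-x", "foo", "bar"},
    {"foo", "baz"},
    {"bar", "baz"},
]

unique = "linux-image-2.6.38-custom-x"
count = installation_count(dataset, unique)
leaked = co_installed(dataset, unique)
print(count, sorted(leaked))
```

With a count of one, the set of co-installed packages is exactly the rest of
that user's report.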

Cheers,
-- 
Bill. <ballombe at debian.org>

Imagine a large red swirl here. 


