Bug#876055: Environment variable handling for reproducible builds

Mon Sep 18 22:45:33 UTC 2017

Daniel Kahn Gillmor <dkg at fifthhorseman.net> writes:
> On Sun 2017-09-17 16:26:25 -0700, Russ Allbery wrote:

>> I personally lean towards 2, which is consistent with what's in Policy
>> right now, but I can see definite merits in 3.  I believe the
>> reproducible builds project is currently sort of doing 1, but I have a
>> hard time seeing how to make that viable on the testing side.

> Thanks for raising this question, Russ!

> I'm not sure that we should let lack of exhaustive testing push us away
> from (1).  (1) is in principle the right thing -- it's easy to make a
> build reproducible if we tell people that they have to do exactly one
> specific thing.  But we generally want people to be able to run
> heterogeneous systems, and not to force them into one particular
> environment.

Well... I would argue that the amount of time and effort that's gone into
this project shows that it's not that easy to make a build reproducible
even when telling people to do exactly one thing.  :)  But I get your
point.

> Consider someone who wants to see more logging from a build, for
> example.  There could be an environment variable that encourages the
> toolchain to log more, but doesn't affect the binary objects created by
> the build.  By going with choices (2) or (3) we effectively dismiss even
> considering the reproducibility of those builds, which seems like a
> shame.

This is the case for (2), but not for (3).  Indeed, this is exactly the
distinction between (2) and (3).  It does mean that discovery of any new
such environment variable would require a change to our whitelist in
approach (3), so there would be some lag and the whitelist would become
long over time (with a corresponding testing load).  But (3) does try to
support that use case without trying to anticipate every possible
environment variable setting.  It lets us be reactive to newly-discovered
environment variables across which we want to stay reproducible.
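
As a rough illustration only: reading approach (3) as "the build must stay
reproducible when two build environments differ only in whitelisted
variables", the check a test harness would need can be sketched as below.
This is not taken from any existing tool, and the variable names
(EXAMPLE_VERBOSE_LOG, EXAMPLE_UNKNOWN_KNOB) are hypothetical placeholders.

```python
def varying_names(env_a, env_b):
    """Return the names whose values differ, or that exist in only one env."""
    names = set(env_a) | set(env_b)
    return {n for n in names if env_a.get(n) != env_b.get(n)}


def covered_by_whitelist(env_a, env_b, allowed):
    """True if the two environments differ only in whitelisted variables.

    Under this reading of approach (3), a reproducibility claim would only
    apply to pairs of builds for which this returns True.
    """
    return varying_names(env_a, env_b) <= set(allowed)


# Hypothetical whitelist entry: a variable that only increases logging.
ALLOWED_TO_VARY = {"EXAMPLE_VERBOSE_LOG"}

base = {"PATH": "/usr/bin:/bin", "LC_ALL": "C.UTF-8"}
verbose = dict(base, EXAMPLE_VERBOSE_LOG="1")      # differs only in a listed name
unknown = dict(base, EXAMPLE_UNKNOWN_KNOB="1")     # differs in an unlisted name

print(covered_by_whitelist(base, verbose, ALLOWED_TO_VARY))
print(covered_by_whitelist(base, unknown, ALLOWED_TO_VARY))
```

The point of the sketch is that the set of cases to test is finite and
enumerable: one only has to vary the names on the list.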

> Does everything in Policy need to be rigorously testable?  Or is it OK
> to have Policy state the desired outcome even if we don't know how (or
> don't have the resources) to test it fully today?

I don't think everything has to be rigorously testable, but I do think
it's a useful canary.  If I can't test something, I start wondering
whether that means I have problems with my underlying assumptions.

In particular, for (1), we have no comprehensive list of environment
variables that affect the behavior of tools, and that list would be
difficult to create.  Many pieces of software add their own environment
variables with little coordination, and many of those variables could
possibly affect tool output.

I feel like the work for (1) and for (3) ends up being comparable; for (1)
we have to maintain a blacklist, and for (3) we have to maintain a
whitelist.  But (3) is testable, whereas (1) is inherently aspirational
and will always have to be aspirational.  We're endlessly going to be
discovering some other environment variable that changes tool output.

I'm also unsure that (1) is even what we want to claim.  Do we really want
to say that builds are always reproducible if you don't change this short
list of environment variables, no matter what other environment
variables you set?  There's some appeal in this for the end user, but it
feels very frustrating for the package maintainer.  At first glance, as a
package maintainer, I'd think I'd have to maintain a huge blacklist of
environment variables that I've discovered affect my toolchain somewhere,
and explicitly unset them all in debian/rules.  This doesn't feel like a
good use of anyone's time (and may actually *break* other,
non-reproducibility-related things that people want to do with my
package).
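
A minimal sketch of the maintainer-side concern, assuming the blacklist is
applied by removing listed variables before the build runs.  It is written
in Python rather than as a debian/rules fragment, and every variable name
in it is a hypothetical placeholder, not a real toolchain variable.

```python
def scrub_with_blacklist(env, blocked):
    """Return a copy of env with every blacklisted variable removed."""
    return {name: value for name, value in env.items() if name not in blocked}


# Hypothetical list of variables a maintainer has found to affect output.
KNOWN_TO_AFFECT_OUTPUT = {"EXAMPLE_TOOL_FLAGS", "EXAMPLE_TOOL_DEBUG"}

user_env = {
    "PATH": "/usr/bin:/bin",
    "EXAMPLE_TOOL_FLAGS": "-x",        # listed: removed before the build
    "EXAMPLE_TOOL_DEBUG": "1",         # listed, but the user set it on purpose
    "EXAMPLE_UNDISCOVERED_KNOB": "1",  # not yet listed: passes straight through
}

build_env = scrub_with_blacklist(user_env, KNOWN_TO_AFFECT_OUTPUT)
print(sorted(build_env))
```

Two things show up even in this toy case: a variable nobody has discovered
yet still reaches the build, and a listed variable is dropped even when the
person building the package set it deliberately.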

-- 
Russ Allbery (rra at debian.org)               <http://www.eyrie.org/~eagle/>