Bug#877418: dh-strip-nondeterminism: kills clojure performance

Daniel Kahn Gillmor dkg at fifthhorseman.net
Thu Oct 5 19:45:06 UTC 2017


On Wed 2017-10-04 19:45:49 +0100, Chris Lamb wrote:
> *Very* quick thoughts here: could some variant of a) be merged
> upstream…? Perhaps upstream could move to a hash-based system instead
> of using timestamps? eg. encoding the SHA1 of the file in the filename.

I'm thinking about this problem more generally than clojure
specifically -- other folks have raised python's .py → .pyc mappings and
i'm sure there are other similar frameworks.  I want to make sure we're
thinking about the various places that these checks happen.

It may also matter whether we're talking about a file stored in an archive
vs. one stored in the filesystem.  Different archive formats and
different filesystems have different timestamp granularity (iirc, FAT
has 2s granularity, for example).

And there are more questions too: what if multiple source files
contributed to the creation of the compiled artifact (e.g. "include"
directives)?
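
To make the multiple-source case concrete, here is a minimal sketch in
Python.  The helper name is_fresh and the idea of passing an explicit
list of contributing sources are assumptions for illustration, not any
platform's actual API.  A checker that only knows about the main source
file will miss a change to an included one:

```python
import os
from typing import Sequence

def is_fresh(artifact: str, sources: Sequence[str]) -> bool:
    """Return True if the artifact is at least as new as every listed source.

    Hypothetical helper: the caller must already know the full set of
    files that contributed to the artifact, which is the hard part.
    """
    artifact_mtime = os.stat(artifact).st_mtime_ns
    # One stat() per source; a single newer source makes the artifact stale.
    return all(os.stat(src).st_mtime_ns <= artifact_mtime for src in sources)
```

If the list passed in omits an included file, the check reports "fresh"
even though the artifact no longer reflects all of its inputs.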

You can also imagine a compilation regime that detects changes to a file
(e.g. via inotify) and immediately triggers recompilation -- with a fast
compiler and a coarse filesystem/archive timestamp, such a regime would
end up in the same situation (serious performance impact).
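
A small Python sketch of that situation, assuming the store simply
truncates timestamps down to a multiple of its granularity (the exact
rounding behaviour of any given filesystem or archive format is an
assumption here, and stored_mtime is a made-up helper):

```python
import math

def stored_mtime(mtime: float, granularity: float) -> float:
    """Timestamp as kept by a store with the given granularity (truncating)."""
    return math.floor(mtime / granularity) * granularity

# Illustrative values: the artifact is built 0.9 s after the source is saved.
source_written = 100.0
artifact_built = 100.9

coarse_source = stored_mtime(source_written, 2.0)
coarse_artifact = stored_mtime(artifact_built, 2.0)

# The real ordering is lost: both collapse to the same stored value, so a
# strict "artifact newer than source" test fails and forces a recompile.
timestamps_collide = coarse_source == coarse_artifact
strict_check_passes = coarse_artifact > coarse_source
```

With these numbers timestamps_collide is true and strict_check_passes
is false, even though the artifact really was built after the source.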

And of course, it's always possible to (accidentally or intentionally)
just "touch" the timestamps on a totally different bytecode file of the
appropriate name to trick or confuse this optimization step.

There are also problems with the digest-based approach that lamby
suggests: it's significantly more expensive to do a full source
extraction and digest than it is to compare timestamp metadata.
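
The cost difference can be sketched in Python as follows.  The function
names and the idea of a separately recorded digest or mtime are
assumptions for illustration; SHA1 is used only because it is the
digest named in the quoted suggestion:

```python
import hashlib
import os

def sha1_of(path: str, chunk_size: int = 65536) -> str:
    """Digest the whole file: every byte has to be read."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def fresh_by_digest(source: str, recorded_digest: str) -> bool:
    """Content check: cost grows with the size of the source."""
    return sha1_of(source) == recorded_digest

def fresh_by_mtime(source: str, recorded_mtime_ns: int) -> bool:
    """Metadata check: a single stat() call, independent of file size."""
    return os.stat(source).st_mtime_ns == recorded_mtime_ns
```

The digest check is unaffected by a stray "touch", but it has to read
(and, for an archive member, first extract) the full source each time.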

--

So i think we have to ask what the goal of this check is from the upstream
platform's point of view:

 * is it strong assurance that the file was built from the
   exposed source?

 * is it a speedy (if fallible) sanity check?

i think that it can't really be the former (because of all the corner
cases outlined above), so the question is what kind of failure modes and
risks they're willing to tolerate.  Those that want absolute assurance
will be obliged to recompile each time unless they have some sort of
externally-audited mapping/manifest.

It sounds to me like python has made a sensible tradeoff (accepting that
equal timestamps mean OK) and clojure has made a decision that tries to
get more of a guarantee than it can actually get, and has sacrificed
performance for it.
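
The two policies, as characterized in this thread, boil down to a
non-strict vs. a strict comparison.  This Python sketch paraphrases
that description; it is not the actual code of either upstream, and the
function names and the timestamp value are made up:

```python
def reuse_if_not_older(source_mtime: int, artifact_mtime: int) -> bool:
    """Lenient policy: equal timestamps count as up to date."""
    return artifact_mtime >= source_mtime

def reuse_only_if_newer(source_mtime: int, artifact_mtime: int) -> bool:
    """Strict policy: the artifact must be strictly newer than the source."""
    return artifact_mtime > source_mtime

# Hypothetical case: a build step has set both files to the same timestamp.
normalized = 1507146349

lenient_reuses = reuse_if_not_older(normalized, normalized)
strict_reuses = reuse_only_if_newer(normalized, normalized)
```

Here lenient_reuses is true and strict_reuses is false, so only the
strict policy falls back to recompiling when the timestamps are equal.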

            --dkg


