Bug#807270: mk-origtargz: create reproducible tarballs and --mtime option

Diederik de Haas didi.debian at cknow.org
Sun Mar 23 15:35:07 GMT 2025


On Fri Mar 21, 2025 at 10:06 AM CET, Simon Josefsson wrote:
> Holger Levsen <holger at layer-acht.org> writes:
>> On Thu, Mar 20, 2025 at 10:37:15PM +0100, Simon Josefsson wrote:
>>> +1 on reproducible tarballs. 
>>
>> sure, +1, patches welcome! :) \o/
>
> Attached starting point, thoughts?
>
> https://salsa.debian.org/debian/devscripts/-/merge_requests/490
>
> The patch needs review/improvement from those more familiar with
> mk-origtargz and the debian/tests/ framework.

I had made some comments on the MR, but I think it's useful to keep
everything together, so I'll repeat them here, at the end of this message.

> My main argument is that solving this is harder than it looks, and I
> fear that solving the general problem here may actually be infeasible.

... having looked into this a bit more, I agree. (more later)

> It can help to realize this, otherwise one may think that solving this
> is just a matter of adding the right parameters (which is what the patch
> attempt to do).
>
> While we could attempt to continue patch things, how about a bigger
> question: why do we re-create tarballs?

I consider that out of scope for this bug, so I won't comment on that.

> For those wanting to understand why solving the --mtime concern is a
> hard problem, here is a partial helper tool to aid with this:
>
> https://lists.gnu.org/archive/html/bug-gnulib/2025-02/msg00166.html
>
> I dislike all that complexity though, so for some upstream projects

I actually like that, not for its 'complexity' but for having clear
rules and (especially if everyone uses that) consistency.
There is one 'problem' though: it only supports git (for now?).

> (libtasn1, libidn2, inetutils, ...) I am using a heavy hammer like this:
>
> TAR_OPTIONS += --mode=go+u,go-w --mtime=$(abs_top_srcdir)/NEWS
> mtime-NEWS-to-git-HEAD:
> 	$(AM_V_GEN)if test -e $(srcdir)/.git \
> 			&& command -v git > /dev/null; then \
> 		touch -m -t "$$(git log -1 --format=%cd --date=format-local:%Y%m%d%H%M.%S)" $(srcdir)/NEWS; \
> 	fi

The ``genorig.py`` script stored the orig_date like this:

  orig_date = time.strftime("%a, %d %b %Y %H:%M:%S +0000",
      time.gmtime(
          os.stat(os.path.join(self.dir, self.orig, 'Makefile'))
          .st_mtime))

And then orig_date is used to set the --mtime parameter to tar.
That ``genorig.py`` script also had a useful comment:

    # exclude_files() will change dir mtimes.  Capture the original
    # release time so we can apply it to the final tarball.

I don't really care which date format is used, but I do care that it's
used consistently. Whether or not the archive is repackaged, the mtime
should be the same (which was the whole idea behind storing orig_date).
Similarly, it shouldn't matter whether the archive is created via
``uscan`` or via a direct call to ``mk-origtargz``.
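
To illustrate the idea of capturing the release time up front, here is a
minimal Python sketch. It reuses the date format from the excerpt above;
the helper names are mine (hypothetical) and this is not the actual
``genorig.py`` code.

```python
import os
import time


def format_orig_date(mtime_seconds):
    """Format a Unix timestamp as a UTC date string, as in the excerpt."""
    return time.strftime("%a, %d %b %Y %H:%M:%S +0000",
                         time.gmtime(mtime_seconds))


def capture_orig_date(reference_file):
    """Read the reference file's mtime *before* anything touches the tree."""
    return format_orig_date(os.stat(reference_file).st_mtime)


def tar_mtime_option(orig_date):
    """Build the --mtime argument that would later be passed to tar."""
    return "--mtime=" + orig_date
```

The important part is the ordering: ``capture_orig_date()`` has to run
before any files are excluded, and the same value is then used whether
or not the tree ends up being repackaged.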

> We could do the same in Debian, replacing NEWS with last timestamp of
> debian/changelog, but it is important to remember that this is an ugly
> workaround rather than a solution. 

It's indeed a(n ugly) workaround but I do think it's useful; having each
package declare which upstream file to use sounds like a very bad idea.

> Solving it like this will lead to other problems.  
> Solving it properly requires going to the root cause of
> the problem, which is what Bruno is chasing in that e-mail thread.

I do like solving things properly, but having stopped myself from going
down too many rabbit holes, I'll settle for a decent solution ;-)

And now for a review of the patch/MR itself:

First of all: thanks for a proper commit message :-)

> From a811a58bb007f7f0fe474e0ff1a105c48fedc238 Mon Sep 17 00:00:00 2001
> From: Simon Josefsson <simon at josefsson.org>
> Date: Fri, 21 Mar 2025 09:40:48 +0100
> Subject: [PATCH] MkOrigtargz: Improve tarball reproducibility.
> 
> The --format=ustar is better than the V7 format and is
> a conservative choice if we don't want to switch to PAX
> just yet, see discussion here:
> https://serverfault.com/questions/250511/which-tar-file-format-should-i-use

I do like references/links but what I'm missing is *why* *you* propose
to go with ustar and not go with any of the other archive formats.
I usually put links at the bottom of the commit message which can be
used for background/further reading, but the commit message itself
should contain all the information needed.

https://www.gnu.org/software/tar/manual/html_section/Formats.html

says about ustar: "Archive format defined by POSIX.1-1988 and later."
which I think is a really good argument (I like standards).

I also see 'posix' as archive format:
"The format defined by POSIX.1-2001 and later."
"This archive format will be the default format for future versions of
GNU tar."

POSIX.1-2001 doesn't sound too recent, so why not go with that?
There may be very valid reasons, but please describe why you chose NOT
to go with that.
That can then be used in the future to re-evaluate that choice.

btw: Is this what you mean by 'pax'?
The serverfault page describes it as POSIX.1-2001, but the Formats page
doesn't have the word 'pax'.
The upstream tar git repo does have a 'paxutils' submodule, not to be
confused with the 'pax-utils' Debian package.
Then there's also a 'pax' Debian package "Portable Archive Interchange
(cpio, pax, tar)" which sounds useful (?), but its package description
has this: "This is the MirBSD paxtar implementation supporting the
formats ... old tar, and ustar, but not the format known as pax yet" :-O
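
For what it's worth, Python's ``tarfile`` module documents its three
formats as ``USTAR_FORMAT`` (POSIX.1-1988), ``GNU_FORMAT`` and
``PAX_FORMAT`` (POSIX.1-2001), which suggests that 'posix' and 'pax' are
the same thing. The sketch below shows one practical difference between
them, the maximum path length; it is only an illustration, not an
argument for either format, and the long path is made up.

```python
import io
import tarfile

FORMATS = {
    "ustar (POSIX.1-1988)": tarfile.USTAR_FORMAT,
    "gnu": tarfile.GNU_FORMAT,
    "posix/pax (POSIX.1-2001)": tarfile.PAX_FORMAT,
}


def can_store(name, fmt):
    """Return True if a member with this path fits in the given format."""
    buf = io.BytesIO()
    info = tarfile.TarInfo(name)  # empty regular file, size 0
    try:
        with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
            tar.addfile(info)
    except ValueError:
        # tarfile raises ValueError when a header field does not fit
        return False
    return True


long_name = "/".join(["x" * 50] * 6)  # 305 characters, hypothetical path
for label, fmt in FORMATS.items():
    print(label, can_store(long_name, fmt))
```

With this path, ustar refuses the member while the other two formats
accept it, so a switch to ustar could fail on upstream trees with very
long paths.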

[ continuing with 'Formats' ]
"The default format for GNU tar is defined at compilation time. You may
check it by running tar --help, and examining the last lines of its
output. Usually, GNU tar is configured to create archives in ‘gnu’
format, however, a future version will switch to ‘posix’."

```sh
diederik@bagend:~$ tar --version
tar (GNU tar) 1.35
diederik@bagend:~$ tar --help | tail -n3
*This* tar defaults to:
--format=gnu -f- -b20 --quoting-style=escape --rmt-command=/usr/sbin/rmt
--rsh-command=/usr/bin/rsh
```

The ``gnu`` archive format description has:
"Format used by GNU tar versions up to 1.13.25."
I didn't see a format specification in ``debian/rules``, so it seems the
default is still ``gnu``?
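
To check which flavour an existing (uncompressed) tarball uses, one can
look at the magic bytes at offset 257 of the first header block. A
minimal sketch, assuming the well-known magic values and using Python's
``tarfile`` only to generate sample archives; the function names are
hypothetical.

```python
import io
import tarfile


def guess_tar_format(first_block):
    """Guess the header flavour from the first 512-byte block of a tarball."""
    magic = first_block[257:265]
    if magic == b"ustar  \x00":
        return "gnu"
    if magic == b"ustar\x0000":
        return "ustar-or-pax"  # pax archives reuse the ustar magic
    return "v7-or-unknown"


def first_block_of(fmt):
    """Create a one-member archive in memory and return its first block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo("hello.txt"))
    return buf.getvalue()[:512]
```

For a real orig tarball, the first 512 bytes of the decompressed stream
would be passed to ``guess_tar_format()``. Telling ustar and pax apart
needs more than the magic, which is why the sketch lumps them together.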

It sounds to me that ``ustar`` or ``posix`` are better than ``gnu``, but
is it wise if the Debian tar package uses a different archive format
(by default) than what mk-origtargz does/will do?
The Debian tar maintainer made their last upload 4 years ago and I
haven't found any upload by its official 'uploader' ... ever, so CC-ing
them didn't sound too useful.

And then I stopped myself from going into more rabbit holes ...

> Using --numeric-owner --owner=0 --group=0 avoids relying on the target
> system having a /etc/passwd and /etc/group user/group called 'root'
> and that they both map to uid/gid 0 which is the intent.

Excellent.

> Sorting filenames with --sort=name improve tarball reproducability.

Idem.
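
As an illustration of what these two changes buy us, here is a minimal
sketch using Python's ``tarfile`` instead of GNU tar. The helper names
are hypothetical, and the member order GNU tar produces with
``--sort=name`` may differ in detail; the point is only that the order
no longer depends on the filesystem and the owner no longer depends on
the build machine.

```python
import io
import os
import tarfile


def normalize_owner(info):
    """Roughly mimic --owner=0 --group=0 --numeric-owner for one member."""
    info.uid = 0
    info.gid = 0
    info.uname = ""  # no symbolic names, so no /etc/passwd lookup matters
    info.gname = ""
    return info


def sorted_members(root):
    """Yield file paths below root in a stable order, independent of readdir()."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort also fixes the traversal order
        for name in sorted(filenames):
            yield os.path.relpath(os.path.join(dirpath, name), root)


def build_archive(root):
    """Pack the regular files below root into an in-memory ustar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for rel in sorted_members(root):
            tar.add(os.path.join(root, rel), arcname=rel,
                    filter=normalize_owner)
    return buf.getvalue()
```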
 
> Hard code permissions with --mode=go=rX,u+rw,a-s inspired by Guix.

Why? Does this change the permissions of the files in the archive?
If so, then that sounds like a bad idea.
If it is useful, then the reasoning for doing that should be documented,
optionally with a link to the Guix page (?) that you used as
justification.
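
To make the question concrete: as far as I understand the tar manual,
``--mode`` overrides the permissions recorded for the archive members
(the files on disk are untouched). The sketch below is my reading of
what ``go=rX,u+rw,a-s`` does to a mode, assuming chmod-style semantics;
``normalize_mode`` is a hypothetical helper, not something tar or
mk-origtargz provides.

```python
import stat


def normalize_mode(mode, is_dir):
    """Apply the symbolic mode go=rX,u+rw,a-s to a permission value."""
    perms = stat.S_IMODE(mode)
    # X means: execute only for directories or files that already
    # have an execute bit set for someone.
    any_exec = is_dir or bool(perms & 0o111)
    group_other = 0o044 | (0o011 if any_exec else 0)
    perms = (perms & ~0o077) | group_other  # go=rX
    perms |= 0o600                          # u+rw
    perms &= ~0o6000                        # a-s (drop setuid/setgid)
    return perms


for original in (0o600, 0o777, 0o4755, 0o400):
    print(oct(original), "->", oct(normalize_mode(original, is_dir=False)))
```

If this reading is right, a private ``0600`` file would be stored as
``0644`` and a world-writable ``0777`` file as ``0755``, so the answer
to my question would be "yes".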

> Using --mtime and --clamp-mtime remains and is the complex part.

Without it, this patch won't close bug 807270, but referencing that bug
in this patch seems *very* useful.
And I want to reiterate that "exclude_files() will change dir mtimes",
which IIUC makes things NOT reproducible.
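
A small sketch of how I understand ``--clamp-mtime`` to interact with
this: only entries newer than the ``--mtime`` reference are changed. The
timestamps and paths below are made up for illustration.

```python
def clamp_mtime(file_mtime, reference_mtime):
    """Timestamp stored for an entry under --mtime=REF --clamp-mtime."""
    return min(file_mtime, reference_mtime)


release_time = 1_700_000_000  # hypothetical original release time
tree = {
    "src/main.c": 1_699_000_000,  # untouched upstream file, older
    "src": 1_742_000_000,         # directory bumped by a file removal
}
stored = {path: clamp_mtime(mtime, release_time)
          for path, mtime in tree.items()}
print(stored)
```

Under that assumption, a directory whose mtime was bumped by
``exclude_files()`` would be clamped back to the reference time, while
untouched files keep their upstream timestamps. That only helps if the
reference time itself is captured before the tree is modified.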

> ---
>  lib/Devscripts/MkOrigtargz.pm | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/Devscripts/MkOrigtargz.pm b/lib/Devscripts/MkOrigtargz.pm
> index b1a691dc..86993afc 100644
> --- a/lib/Devscripts/MkOrigtargz.pm
> +++ b/lib/Devscripts/MkOrigtargz.pm
> @@ -110,11 +110,19 @@ sub make_orig_targz {
>          # tar it all up
>          spawn(
>              exec => [
> -                'tar',          '--owner=root',
> -                '--group=root', '--mode=a+rX',
> -                '--create',     '--file',
> -                "$destfiletar", '--directory',
> -                $tempdir,       @files
> +                'tar',
> +		'--format=ustar',
> +		'--owner=0',
> +                '--group=0',
> +		'--numeric-owner',
> +		'--sort=name',
> +		'--mode=go=rX,u+rw,a-s',
> +                '--create',
> +		'--file',
> +                "$destfiletar",
> +		'--directory',
> +                $tempdir,
> +		@files

You're using a mix of tabs and spaces above; please use only spaces to
match the rest of the file.

>              ],
>              wait_child => 1
>          );
> --

Cheers,
  Diederik