[Reproducible-builds] generating reproducible ISOs with xorriso

Fri Jun 5 16:36:40 UTC 2015

Hi Thomas--

Thanks for all your work looking into this!

On Fri 2015-06-05 10:57:38 -0400, Thomas Schmitt wrote:
> About the --sort-weight-list approach which is possible with
> already released xorriso versions:
>
>> (find . -type f -print0 | xargs -0 md5sum | sort | cut -f2- -d/ ;
>>  find . -mindepth 1 \! -type f | sort | cut -f2- -d/ ) |
>>   awk '{ N=N+1; print N " " $0 }'
>
> I misunderstood the role of md5sum here. Actually it seems
> surplus. Why not just sort the paths ? That would be enough to
> give awk a reproducible input sequence.

Right, but sorting by path alone would seem to fail for hardlinked or
deduped files, because the copies would get weights at unrelated
positions in the list instead of adjacent ones.
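
To make that concrete, here is a minimal sketch that compares the two
orderings. It assumes GNU find, xargs and md5sum; the function names
are made up for illustration, and -r for xargs is a GNU extension i
added so that an empty tree produces no output.

```shell
# List regular files under $1, ordered by md5 digest, then by path.
# Files with identical content end up on adjacent lines.
order_by_digest() {
  ( cd "$1" &&
    find . -type f -print0 | xargs -0 -r md5sum | LC_ALL=C sort |
      cut -f2- -d/ )
}

# List the same files ordered by path only.
# Identical content can land anywhere in the list.
order_by_path() {
  ( cd "$1" &&
    find . -type f -print | LC_ALL=C sort | cut -f2- -d/ )
}
```

With a tree where a/one and z/three have the same content and m/two
differs, order_by_path puts m/two between the two copies, while
order_by_digest keeps the copies next to each other.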

> Ok. The risk of a random collision is avoided and 2 billion
> files is not a severe limitation. (But the hardlinks ...)
>
> xorriso will not understand the "\n" which md5sum substitutes
> for newline characters in filenames. So trying to process such
> filenames will not be reliably reproducible and throw errors:
>   xorriso : FAILURE : Cannot find path 'a\nb' in loaded ISO image
> One would have to set before -as mkisofs:
>   -abort_on fatal
> in order to avoid a premature end of the program run.
> The attribution of weights would stop in any case.

Yep, i understand this limitation.  For this first-pass hackery, I think
i'm ok with the idea that reproducibility fails if you put a newline in
a filename.  Presumably the same goes for files that have a literal
backslash (\) in their name as well, since md5sum has to escape those
too.  (sane people shouldn't be putting newlines and backslashes in
filenames anyway!)
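
A guard for both cases could look roughly like the sketch below. It
assumes a find whose -name patterns follow fnmatch rules (so '\\'
matches one literal backslash); has_unsafe_names is a made-up name.

```shell
# Succeed (exit 0) if any name under $1 contains a newline or a
# backslash, i.e. a character that md5sum would escape in its output.
has_unsafe_names() {
  nl='
'
  [ -n "$(find "$1" \( -name "*${nl}*" -o -name '*\\*' \) -print)" ]
}
```

Something like "has_unsafe_names . && exit 1" before building the
weight list would then refuse such trees up front.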

> There is no need to attribute weight to directories.
> It applies only to the content source objects of regular files.
> ("Regular file" in the ISO, not necessarily on hard disk).

Thanks, that's useful to know, and it makes the command cleaner.
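
Dropping the second find, the command would look roughly like this
sketch. weight_list is a made-up name, GNU find/xargs/md5sum are
assumed, and the -r for xargs is my addition so that an empty tree
yields an empty list.

```shell
# Print "WEIGHT PATH" lines for every regular file under $1.
# Files are numbered in order of md5 digest, then path, so files with
# identical content get consecutive weights.
weight_list() {
  ( cd "$1" &&
    find . -type f -print0 | xargs -0 -r md5sum | LC_ALL=C sort |
      cut -f2- -d/ | awk '{ N = N + 1; print N " " $0 }' )
}
```

The output of "weight_list ." is what i would save to a file and hand
to --sort-weight-list.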

> So how about this:
>
>    if test $(find . -name '*'$'\n''*' | wc -w) -gt 0
>    then
>      echo "FOUND FILENAMES WITH NEWLINES UNDERNEATH $(pwd)" >&2
>      exit 1
>    fi
>
>    find . -type f -print | \
>      sort | cut -f2- -d/ | awk '{ N=N+1; print N " " $0 }'

As i mentioned above, i don't like that sorting just by name seems to
miss out on the dedup/hardlink compression.  I want reproducible images
*and* compact images (and a pony! :) )

Also, the above doesn't do anything for non-directory, non-regular files
(sockets, fifos, device nodes, etc) -- do those even make sense in
ISO-9660?  Do we need to worry about how/where they sort?

> Extent location of regular files:
>
> The question was:
> If i sort the hardlink-merged IsoFileSrc according to
> an ISO 9660 directory tree traversal, will the sequence be
> reproducible for trees with identical file names and
> attributes ?
>
> I now verified that the directories get sorted according
> to their ISO 9660 names. The process of name collision
> resolution (mangling) is complicated but depends only on
> the user defined input names and their sequence. Name sorting
> happens before mangling and afterwards.
> (libisofs/ecma119_tree.c functions ecma119_tree_create(),
>  sort_tree(), mangle_tree(), qsort(3) in mangle_single_dir())
> So there should be no permutations of identical name lists
> possible.

 This is a triply-nice result, esp. because

 * it includes the hardlink-merged files, and

 * it puts the extents in an order that seems intelligible from a scan
   of the dirtree, and

 * it piggybacks off of sorting work that's already being done, so
   doesn't seem to introduce much extra overhead.

> Extent location of directories:
>
> Looks already reproducible.
> They get stored after volume descriptors but not before block 32.
> (The extent address of the root directory can be read as little
>  endian 32 bit number from byte 32924 to 32927 of the ISO.
>  ECMA-119 8.4.18 and 9.1.3)
> The production of extents traverses the sorted ISO tree.
> (libisofs/ecma119.c function write_dirs())
> The size of a directory extent depends on name lengths and
> attributes of the files inside the directory.
>
> Then there are the Path Tables (nobody reads them):
>
> Looks already reproducible.
> The sequence of entries is determined by an array pathlist[]
> which gets filled by traversal of the sorted ISO tree.
> (libisofs/ecma119.c function write_path_tables())

Nice, both of these are good news.

> So i will go for the reproducible array of IsoFileSrc in
> libisofs/filesrc.c function filesrc_writer_pre_compute().
> The red-black tree shall merge hardlinks but not define
> the sequence of data file extents.

One other possible approach occurred to me yesterday:

What if you kept the red-black tree implementation, but keyed it by file
content digest (md5, sha1, sha256, whatever) instead of by dev/inode
tuple?  Using such a red-black tree for extent placement would give you
not only hardlink discovery and reproducibility, but also automatic
deduplication for even more compact images in some situations.
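
As a rough illustration of the idea at the shell level (this is not
libisofs code, just a sketch assuming GNU find, xargs and sha256sum,
with a made-up function name and filenames free of newlines and
backslashes): grouping by digest yields one entry per distinct content,
in an order that does not depend on names or inodes.

```shell
# Print one line per distinct file content under $1:
#   DIGEST FIRST_PATH COUNT
# ordered by digest. FIRST_PATH is the first path (in C sort order)
# with that content, COUNT is the number of paths sharing it.
extents_by_digest() {
  ( cd "$1" &&
    find . -type f -print0 | xargs -0 -r sha256sum | LC_ALL=C sort |
      awk '{
        digest = $1
        # Skip the digest plus the two separator characters.
        path = substr($0, length($1) + 3)
        sub(/^\.\//, "", path)
        if (digest != prev) {
          if (NR > 1) print prev, first, n
          prev = digest; first = path; n = 0
        }
        n++
      }
      END { if (NR > 0) print prev, first, n }' )
}
```

Each output line would correspond to one extent in the image, and every
path counted in COUNT would point at that same extent.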

Advantages:

 * even more compact images in some cases

 * images with renamed files or added or removed hardlinks would only
   vary by directory entry, not by file content placement

Downsides:

 * It would certainly cost more compute cycles than the existing approach
   (or the dirtree traversal ordering you describe above)

 * placement of the extents in a single image is less
   comprehensible/obvious than the dirtree traversal ordering

 * Maybe there is some context where deduplicated files could be
   dangerous?

So maybe it's not worth doing, i just wanted to describe the possibility
(you probably saw it already).  I leave it in your hands :)

> This can last a few days. I will give a note to the lists
> when the GNU xorriso-1.4.1 development tarball is worth a
> test.

That's great to hear.  Thank you, Thomas!

Regards,

        --dkg