Bug#1069329: diffoscope: XZ: compare metadata before comparing compressed data

Paul Wise pabs at debian.org
Sat Apr 20 03:04:26 BST 2024


Package: diffoscope
Severity: wishlist
X-Debbugs-Cc: Sam James <sam at gentoo.org>

When comparing two XZ compressed files that decompress to identical
data, please compare the metadata before comparing compressed data.

The xz --list option can be used for this. From the manual page:

   Print information about compressed files. No uncompressed output is
   produced, and no files are created or removed. In list mode, the 
   program cannot read the compressed data from standard input or from
   other unseekable sources.
   
   The default listing shows basic information about files, one file
   per line. To get more detailed information, use also the --verbose
   option. For even more information, use --verbose twice, but note
   that this may be slow, because getting all the extra information
   requires many seeks. The width of verbose output exceeds 80
   characters, so piping the output to, for example, less -S may be
   convenient if the terminal isn't wide enough.
   
   The exact output may vary between xz versions and different locales.
   For machine-readable output, --robot --list should be used.
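
As a rough sketch, the listing could be captured from Python along these
lines. The helper names xz_list_command and run_xz_list are hypothetical
and not part of diffoscope; xz is assumed to be on PATH, and LC_ALL=C is
set only because the manual says the output varies between locales.

```python
import os
import subprocess


def xz_list_command(path, robot=False, verbosity=0):
    """Build the argv for `xz --list` with optional --robot/--verbose."""
    cmd = ["xz", "--list"]
    if robot:
        cmd.append("--robot")
    # --verbose may be given once or twice for more detail.
    cmd.extend(["--verbose"] * verbosity)
    # "--" ends option parsing so a path starting with "-" is not misread.
    cmd.extend(["--", path])
    return cmd


def run_xz_list(path, robot=False, verbosity=0):
    """Run xz and return its stdout as text (requires xz to be installed)."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        xz_list_command(path, robot, verbosity),
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, run_xz_list("foo.0.xz", robot=True, verbosity=2) would run
the most detailed machine-readable listing.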

In addition, the filename printed is the one from the command line;
there does not appear to be filename metadata embedded in XZ files.
The filename could therefore be stripped before comparison where
possible, since diffoscope already compares filenames at a different
layer.
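
For the robot output, stripping looks simple, because the filename
appears on its own tab-separated "name" record (see the --robot example
at the end of this report). A minimal sketch, with a hypothetical helper
name and assuming that record layout stays as it is today:

```python
def strip_robot_names(robot_output):
    """Drop `name` records from `xz --robot --list` output."""
    kept = [
        line for line in robot_output.splitlines()
        # The record type is the first tab-separated field of each line.
        if line.split("\t", 1)[0] != "name"
    ]
    return "\n".join(kept)
```

With this, the robot listings of two files that differ only in name
compare equal.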

The human-readable output may change between xz versions, so filename
stripping would be brittle and could break on xz updates. Comparing
human-readable output is more user-friendly than comparing the robot
output, though, so any stripping would have to be best-effort. Since
any tests for it could break on xz upgrades, and many comparisons will
probably be between the same filenames in different directories, it
might not be worth adding filename stripping until xz itself offers an
option for hiding the filenames.
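
If best-effort stripping were attempted anyway, one possible approach is
to remove only the exact path that was passed to xz, and only where the
current output is seen to print it: at the end of a line in the default
listing, or at the start of a line before " (" in the verbose listing.
The helper name is hypothetical and the two positions are assumptions
taken from the output shown in this report, not a documented format.

```python
def strip_known_filename(listing, path):
    """Best-effort removal of `path` from human-readable `xz --list` output."""
    if not path:
        return listing
    cleaned = []
    for line in listing.splitlines():
        if line.endswith(path):
            # Default listing: the filename is the last column.
            line = line[: -len(path)].rstrip()
        elif line.startswith(path + " ("):
            # Verbose listing: "<filename> (1/1)" starts each file section.
            line = line[len(path):].lstrip()
        cleaned.append(line)
    return "\n".join(cleaned)
```

Lines that do not match either position are left untouched, so a change
in the xz output format would make the filename reappear in the diff
rather than corrupt other fields.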

In addition, currently the xz --list option only supports the xz file
format and does not support the lzma and lzip file formats.
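
Because of that limitation, the metadata comparison would need to be
limited to real .xz containers. diffoscope presumably already identifies
the format by other means; purely as an illustration, a check against
the six-byte XZ header magic (the same bytes visible at offset 0 of the
hex dump below) could look like this, with a hypothetical helper name:

```python
# First six bytes of every .xz stream: FD '7' 'z' 'X' 'Z' 00.
XZ_MAGIC = bytes.fromhex("fd377a585a00")


def looks_like_xz(header):
    """Return True if `header` (the first bytes of a file) starts an XZ stream."""
    return header[: len(XZ_MAGIC)] == XZ_MAGIC
```

An lzip file (which starts with "LZIP") or a legacy .lzma file would fail
this check and keep the current behaviour.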

   $ echo foo > foo
   $ xz -0 < foo > foo.0.xz
   $ xz -9 < foo > foo.9.xz
   
   $ diffoscope foo.0.xz foo.9.xz 
   --- foo.0.xz
   +++ foo.9.xz
   │┄ Format-specific differences are supported for XZ compressed files but no file-specific differences were detected; falling back to a binary diff. file(1) reports: XZ compressed data, checksum CRC64
   @@ -1,4 +1,4 @@
    00000000: fd37 7a58 5a00 0004 e6d6 b446 0200 2101  .7zXZ......F..!.
   -00000010: 0c00 0000 8f98 419c 0100 0366 6f6f 0a00  ......A....foo..
   +00000010: 1c00 0000 10cf 58cc 0100 0366 6f6f 0a00  ......X....foo..
    00000020: ffd7 ac5a 3031 9cf2 0001 1c04 6f2c 9cc1  ...Z01......o,..
    00000030: 1fb6 f37d 0100 0000 0004 595a            ...}......YZ
   
   $ diff -u <(xz --list foo.0.xz) <(xz --list foo.9.xz)
   --- /dev/fd/63	2024-04-20 08:26:27.769377608 +0800
   +++ /dev/fd/62	2024-04-20 08:26:27.769377608 +0800
   @@ -1,2 +1,2 @@
    Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
   -    1       1         60 B          4 B    ---  CRC64   foo.0.xz
   +    1       1         60 B          4 B    ---  CRC64   foo.9.xz
   
   $ diff -u <(xz --list --verbose foo.0.xz) <(xz --list --verbose foo.9.xz)
   --- /dev/fd/63	2024-04-20 08:26:43.701196927 +0800
   +++ /dev/fd/62	2024-04-20 08:26:43.705196881 +0800
   @@ -1,4 +1,4 @@
   -foo.0.xz (1/1)
   +foo.9.xz (1/1)
      Streams:           1
      Blocks:            1
      Compressed size:   60 B
   
   $ diff -u <(xz --list --verbose --verbose foo.0.xz) <(xz --list --verbose --verbose foo.9.xz)
   --- /dev/fd/63	2024-04-20 08:26:56.029056126 +0800
   +++ /dev/fd/62	2024-04-20 08:26:56.029056126 +0800
   @@ -1,4 +1,4 @@
   -foo.0.xz (1/1)
   +foo.9.xz (1/1)
      Streams:           1
      Blocks:            1
      Compressed size:   60 B
   @@ -11,7 +11,7 @@
             1         1               0               0              60               4    ---  CRC64            0
      Blocks:
        Stream     Block      CompOffset    UncompOffset       TotalSize      UncompSize  Ratio  Check      CheckVal          Header  Flags        CompSize    MemUsage  Filters
   -         1         1              12               0              28               4  7.000  CRC64      f29c31305aacd7ff      12  --                  8       1 MiB  --lzma2=dict=256KiB
   -  Memory needed:     1 MiB
   +         1         1              12               0              28               4  7.000  CRC64      f29c31305aacd7ff      12  --                  8      65 MiB  --lzma2=dict=64MiB
   +  Memory needed:     65 MiB
      Sizes in headers:  No
      Minimum XZ Utils version: 5.0.0
   
   $ diff -u <(xz --robot --list --verbose --verbose foo.0.xz) <(xz --robot --list --verbose --verbose foo.9.xz)
   --- /dev/fd/63	2024-04-20 08:31:42.445584805 +0800
   +++ /dev/fd/62	2024-04-20 08:31:42.449584755 +0800
   @@ -1,6 +1,6 @@
   -name	foo.0.xz
   +name	foo.9.xz
    file	1	1	60	4	---	CRC64	0
    stream	1	1	0	0	60	4	---	CRC64	0
   -block	1	1	1	12	0	28	4	7.000	CRC64	f29c31305aacd7ff	12	--	8	327736	--lzma2=dict=256KiB
   -summary	327736	no	50000002
   -totals	1	1	60	4	---	CRC64	0	1	327736	no	50000002
   +block	1	1	1	12	0	28	4	7.000	CRC64	f29c31305aacd7ff	12	--	8	67174456	--lzma2=dict=64MiB
   +summary	67174456	no	50000002
   +totals	1	1	60	4	---	CRC64	0	1	67174456	no	50000002
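
The robot output above is tab-separated, so the differing fields can also
be located programmatically. The following is only a sketch with a
hypothetical function name; it reports zero-based column positions rather
than column names, and it assumes both listings contain the same records
in the same order.

```python
def diff_robot_records(left, right):
    """Return (record, column, left_value, right_value) per differing field."""
    differences = []
    # zip() stops at the shorter listing; extra trailing lines are ignored.
    for l_line, r_line in zip(left.splitlines(), right.splitlines()):
        l_fields = l_line.split("\t")
        r_fields = r_line.split("\t")
        if l_fields[0] != r_fields[0] or len(l_fields) != len(r_fields):
            # Record layouts diverge; report the whole line rather than guess.
            differences.append((l_fields[0], None, l_line, r_line))
            continue
        for column, (l_val, r_val) in enumerate(zip(l_fields, r_fields)):
            if l_val != r_val:
                differences.append((l_fields[0], column, l_val, r_val))
    return differences
```

Applied to the two listings above, this would flag the name record, the
memory usage and filter chain fields of the block record, and the memory
usage fields of the summary and totals records.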

-- 
bye,
pabs

https://wiki.debian.org/PaulWise

