Bug#1069329: diffoscope: XZ: compare metadata before comparing compressed data
Paul Wise
pabs at debian.org
Sat Apr 20 03:04:26 BST 2024
Package: diffoscope
Severity: wishlist
X-Debbugs-Cc: Sam James <sam at gentoo.org>

When comparing two XZ compressed files that decompress to identical
data, please compare the metadata before falling back to comparing the
compressed data. The xz --list option can be used for this. From the
manual page:

   Print information about compressed files. No uncompressed output is
   produced, and no files are created or removed. In list mode, the
   program cannot read the compressed data from standard input or from
   other unseekable sources.

   The default listing shows basic information about files, one file
   per line. To get more detailed information, use also the --verbose
   option. For even more information, use --verbose twice, but note
   that this may be slow, because getting all the extra information
   requires many seeks. The width of verbose output exceeds 80
   characters, so piping the output to, for example, less -S may be
   convenient if the terminal isn't wide enough.

   The exact output may vary between xz versions and different locales.
   For machine-readable output, --robot --list should be used.
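
As a rough sketch of how a caller might collect this listing, the
following Python builds the command line and pins the locale so that the
human-readable output does not vary with the user's language settings.
The helper names xz_list_argv and xz_list are hypothetical, and running
under LC_ALL=C is an assumption, not something diffoscope is known to do.

```python
import os
import subprocess


def xz_list_argv(path, verbosity=2, robot=False):
    """Build the argv for `xz --list` (hypothetical helper)."""
    argv = ["xz", "--list"]
    if robot:
        argv.append("--robot")
    # --verbose may be given twice for the most detailed listing.
    argv += ["--verbose"] * verbosity
    # "--" ends option parsing so odd filenames are not taken as options.
    argv += ["--", path]
    return argv


def xz_list(path, verbosity=2, robot=False):
    """Run xz and return its listing; needs a seekable file, not stdin."""
    env = dict(os.environ, LC_ALL="C")  # assumption: pin the locale
    result = subprocess.run(
        xz_list_argv(path, verbosity, robot),
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, xz_list("foo.0.xz") would return the text that
"xz --list --verbose --verbose foo.0.xz" prints.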

In addition, the filename printed is the one given on the command line;
there does not appear to be any filename metadata embedded in XZ files.
The filename could therefore be stripped before comparison where
possible, since diffoscope already compares filenames at a different
layer.

Comparing the human-readable output is more user-friendly than comparing
the robot output, but the human-readable output may change between xz
versions, so any filename stripping would be brittle and best-effort,
and both it and its tests could break on xz upgrades. Since many
comparisons will probably be between identically named files in
different directories, it might not be worth adding filename stripping
until xz itself offers an option for hiding the filenames.
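
Should best-effort stripping be wanted anyway, it might look like the
sketch below, which only handles the header line of the verbose listing
("foo.0.xz (1/1)" in the examples in this report). The function name
strip_listing_filename and the "<file>" placeholder are hypothetical,
and the header format is assumed from the xz version used here.

```python
def strip_listing_filename(listing, filename):
    """Best-effort: hide the filename in `xz --list --verbose` output.

    Only the first line is touched, and only if it starts with the
    filename followed by " (", as in "foo.0.xz (1/1)". Anything else is
    returned unchanged, so a format change in xz degrades gracefully.
    """
    lines = listing.splitlines()
    if lines and lines[0].startswith(filename + " ("):
        lines[0] = "<file>" + lines[0][len(filename):]
    return "\n".join(lines)
```

With this, the verbose listings of foo.0.xz and foo.9.xz would no longer
differ on their first line.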

Note also that the xz --list option currently supports only the xz file
format; it does not support the lzma and lzip file formats.
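
Given that limitation, a caller would presumably need to check the
container format before invoking xz --list. A minimal sketch, assuming
the check is done on the six magic bytes at the start of an .xz file
(fd 37 7a 58 5a 00, visible at offset 0 of the hexdump in this report);
the name looks_like_xz is hypothetical and diffoscope may well identify
file types differently:

```python
# First six bytes of every .xz stream: 0xFD, "7zXZ", 0x00.
XZ_MAGIC = bytes.fromhex("fd377a585a00")


def looks_like_xz(header):
    """Return True if the given leading bytes carry the .xz magic."""
    return header.startswith(XZ_MAGIC)
```

Files that fail this check (for example lzma or lzip files) would skip
the metadata comparison.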

$ echo foo > foo
$ xz -0 < foo > foo.0.xz
$ xz -9 < foo > foo.9.xz
$ diffoscope foo.0.xz foo.9.xz
--- foo.0.xz
+++ foo.9.xz
│┄ Format-specific differences are supported for XZ compressed files but no file-specific differences were detected; falling back to a binary diff. file(1) reports: XZ compressed data, checksum CRC64
@@ -1,4 +1,4 @@
 00000000: fd37 7a58 5a00 0004 e6d6 b446 0200 2101  .7zXZ......F..!.
-00000010: 0c00 0000 8f98 419c 0100 0366 6f6f 0a00  ......A....foo..
+00000010: 1c00 0000 10cf 58cc 0100 0366 6f6f 0a00  ......X....foo..
 00000020: ffd7 ac5a 3031 9cf2 0001 1c04 6f2c 9cc1  ...Z01......o,..
 00000030: 1fb6 f37d 0100 0000 0004 595a            ...}......YZ

$ diff -u <(xz --list foo.0.xz) <(xz --list foo.9.xz)
--- /dev/fd/63 2024-04-20 08:26:27.769377608 +0800
+++ /dev/fd/62 2024-04-20 08:26:27.769377608 +0800
@@ -1,2 +1,2 @@
 Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
-    1       1         60 B          4 B    ---  CRC64   foo.0.xz
+    1       1         60 B          4 B    ---  CRC64   foo.9.xz

$ diff -u <(xz --list --verbose foo.0.xz) <(xz --list --verbose foo.9.xz)
--- /dev/fd/63 2024-04-20 08:26:43.701196927 +0800
+++ /dev/fd/62 2024-04-20 08:26:43.705196881 +0800
@@ -1,4 +1,4 @@
-foo.0.xz (1/1)
+foo.9.xz (1/1)
 Streams: 1
 Blocks: 1
 Compressed size: 60 B

$ diff -u <(xz --list --verbose --verbose foo.0.xz) <(xz --list --verbose --verbose foo.9.xz)
--- /dev/fd/63 2024-04-20 08:26:56.029056126 +0800
+++ /dev/fd/62 2024-04-20 08:26:56.029056126 +0800
@@ -1,4 +1,4 @@
-foo.0.xz (1/1)
+foo.9.xz (1/1)
 Streams: 1
 Blocks: 1
 Compressed size: 60 B
@@ -11,7 +11,7 @@
 1 1 0 0 60 4 --- CRC64 0
 Blocks:
 Stream Block CompOffset UncompOffset TotalSize UncompSize Ratio Check CheckVal Header Flags CompSize MemUsage Filters
- 1 1 12 0 28 4 7.000 CRC64 f29c31305aacd7ff 12 -- 8 1 MiB --lzma2=dict=256KiB
- Memory needed: 1 MiB
+ 1 1 12 0 28 4 7.000 CRC64 f29c31305aacd7ff 12 -- 8 65 MiB --lzma2=dict=64MiB
+ Memory needed: 65 MiB
 Sizes in headers: No
 Minimum XZ Utils version: 5.0.0

$ diff -u <(xz --robot --list --verbose --verbose foo.0.xz) <(xz --robot --list --verbose --verbose foo.9.xz)
--- /dev/fd/63 2024-04-20 08:31:42.445584805 +0800
+++ /dev/fd/62 2024-04-20 08:31:42.449584755 +0800
@@ -1,6 +1,6 @@
-name foo.0.xz
+name foo.9.xz
 file 1 1 60 4 --- CRC64 0
 stream 1 1 0 0 60 4 --- CRC64 0
-block 1 1 1 12 0 28 4 7.000 CRC64 f29c31305aacd7ff 12 -- 8 327736 --lzma2=dict=256KiB
-summary 327736 no 50000002
-totals 1 1 60 4 --- CRC64 0 1 327736 no 50000002
+block 1 1 1 12 0 28 4 7.000 CRC64 f29c31305aacd7ff 12 -- 8 67174456 --lzma2=dict=64MiB
+summary 67174456 no 50000002
+totals 1 1 60 4 --- CRC64 0 1 67174456 no 50000002
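
For comparison, the robot output is easier to post-process. The sketch
below drops the name rows and diffs what remains, assuming the
tab-separated layout documented for --robot --list (the tabs are not
visible in the paste above); robot_rows and metadata_diff are
hypothetical names.

```python
import difflib


def robot_rows(robot_output):
    """Return the rows of `xz --robot --list` output minus `name` rows."""
    rows = []
    for line in robot_output.splitlines():
        # The first tab-separated column names the row type.
        if line.split("\t")[0] == "name":
            continue  # the filename is compared elsewhere
        rows.append(line)
    return rows


def metadata_diff(old_output, new_output):
    """Unified diff of two robot listings, ignoring the filenames."""
    return list(
        difflib.unified_diff(
            robot_rows(old_output),
            robot_rows(new_output),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

An empty result would mean the two files have identical metadata apart
from their names.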
--
bye,
pabs
https://wiki.debian.org/PaulWise