Bug#1022209: diffoscope: highlight text-only differences in HTML files

Paul Wise pabs at debian.org
Sat Oct 22 03:36:55 BST 2022


Package: diffoscope
Version: 224
Severity: wishlist

It would be nice if diffoscope could highlight whether HTML files
differ in their text output or only in the non-text HTML bytes, such
as the page title, the stylesheet etc.

The proposal is that by default diffoscope would convert HTML files
to text and diff the result. If there are text differences, it would
display them, with a comment saying these are differences in the text.
If the text does not differ, diffoscope would do a line diff of the
HTML file itself, with a comment saying the text of the two files
does not differ.
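
A rough sketch of that flow in Python, as an illustration only and not
diffoscope's actual code or API. The names html_to_text and
compare_html are made up for this example, and the standard library's
html.parser stands in for a real HTML-to-text converter:

```python
import difflib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring elements that are not body text."""

    # Assumption: title, style and script count as "non-text" content.
    SKIPPED = {"title", "style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical stand-in for w3m -dump or html2text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks) + "\n"


def _udiff(a, b, name_a, name_b):
    return list(difflib.unified_diff(
        a.splitlines(keepends=True), b.splitlines(keepends=True),
        name_a, name_b))


def compare_html(a, b, name_a="a.html", name_b="b.html"):
    """Return (comment, diff lines): text diff first, HTML diff as fallback."""
    text_diff = _udiff(html_to_text(a), html_to_text(b), name_a, name_b)
    if text_diff:
        return "Differences in the text:", text_diff
    # The text is identical, so fall back to a line diff of the HTML itself.
    return ("The text of the two files does not differ; HTML differences:",
            _udiff(a, b, name_a, name_b))
```

Calling compare_html() on the contents of two HTML files returns the
comment to display together with the unified diff lines to show under
it.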

This is useful in situations such as comparing older versions of a
document with newer ones. In particular it would have been useful
when preparing this mail to debian-mentors:

https://lists.debian.org/msgid-search/197a4671e7694c24424b91b4d7288867c0c85d9b.camel@debian.org

Since there are many different tools for converting HTML to text, each
with different bugs and features, this feature should probably let the
user choose which tool to use.
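
One way the tool choice could look, again only as a sketch: the
CONVERTERS table and both function names are hypothetical, and the two
command lines are simply the w3m and html2text invocations used in
this report.

```python
import shutil
import subprocess

# Hypothetical table mapping a user-facing name to a command prefix.
# The file to convert is appended as the last argument.
CONVERTERS = {
    "w3m": ["w3m", "-dump"],
    "html2text": ["html2text"],
}


def converter_command(tool, path):
    """Build the argv for the chosen converter, rejecting unknown names."""
    try:
        return CONVERTERS[tool] + [path]
    except KeyError:
        raise ValueError(f"unknown HTML-to-text converter: {tool!r}") from None


def convert(tool, path):
    """Run the chosen converter on path and return its text output."""
    argv = converter_command(tool, path)
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]} is not installed")
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout
```

Keeping the command construction separate from the execution means an
unsupported or missing tool is reported before anything is run.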

   $ head -vn-0 *.html
   ==> bar.html <==
   <html>
   <head>
   <title>bar</title>
   <style>
   <!--
   BODY {
   BACKGROUND: #FFFFFF;
   COLOR: #000000;
   -->
   </style>
   </head>
   <body>
   <p>
   bar
   </p>
   </body>
   </html>
   
   ==> foo.html <==
   <html>
   <head>
   <title>foo</title>
   <style>
   <!--
   BODY {
   BACKGROUND: #000000;
   COLOR: #FFFFFF;
   -->
   </style>
   </head>
   <body>
   <p>
   foo
   </p>
   </body>
   </html>
   
   $ diffoscope foo.html bar.html 
   --- foo.html
   +++ bar.html
   @@ -1,17 +1,17 @@
    <html>
    <head>
   -<title>foo</title>
   +<title>bar</title>
    <style>
    <!--
    BODY {
   -BACKGROUND: #000000;
   -COLOR: #FFFFFF;
   +BACKGROUND: #FFFFFF;
   +COLOR: #000000;
    -->
    </style>
    </head>
    <body>
    <p>
   -foo
   +bar
    </p>
    </body>
    </html>
   
   $ diff -u <(w3m -dump foo.html) <(w3m -dump bar.html)
   --- /dev/fd/63	2022-10-22 08:52:33.581676470 +0800
   +++ /dev/fd/62	2022-10-22 08:52:33.585676477 +0800
   @@ -1,2 +1,2 @@
   -foo
   +bar
   
   $ diff -u <(html2text foo.html) <(html2text bar.html)
   --- /dev/fd/63	2022-10-22 08:54:43.793859066 +0800
   +++ /dev/fd/62	2022-10-22 08:54:43.781859049 +0800
   @@ -1 +1 @@
   -foo
   +bar

-- System Information:
Debian Release: bookworm/sid
  APT prefers testing-debug
  APT policy: (900, 'testing-debug'), (900, 'testing'), (800, 'unstable-debug'), (800, 'unstable'), (790, 'buildd-unstable'), (700, 'experimental-debug'), (700, 'experimental'), (690, 'buildd-experimental')
merged-usr: no
Architecture: amd64 (x86_64)

Kernel: Linux 6.0.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_AU.utf8, LC_CTYPE=en_AU.utf8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages diffoscope depends on:
ii  diffoscope-minimal  224

Versions of packages diffoscope recommends:
ii  abootimg                         0.6-1+b2
ii  acl                              2.3.1-1
ii  androguard                       3.4.0~a1-5
ii  apksigner                        31.0.2-1
ii  apktool                          2.6.1+dfsg.1-2
ii  binutils-multiarch               2.39-8
ii  bzip2                            1.0.8-5+b1
ii  caca-utils                       0.99.beta20-3
ii  colord                           1.4.6-1
ii  coreboot-utils                   4.15~dfsg-2
ii  db-util                          5.3.1+nmu1
ii  default-jdk [java-sdk]           2:1.11-72
ii  default-jdk-headless             2:1.11-72
pn  device-tree-compiler             <none>
pn  docx2txt                         <none>
ii  e2fsprogs                        1.46.6~rc1-1+b1
ii  enjarify                         1:1.0.3-5
ii  ffmpeg                           7:5.1.2-1
ii  fontforge-extras                 1:20220308~dfsg-1
pn  fp-utils                         <none>
ii  genisoimage                      9:1.1.11-3.4
ii  gettext                          0.21-9
ii  ghc                              9.0.2-4
ii  ghostscript                      9.56.1~dfsg-1
ii  giflib-tools                     5.2.1-2.5
ii  gnumeric                         1.12.52-1
ii  gnupg                            2.2.39-1
ii  gnupg-utils                      2.2.39-1+b1
pn  hdf5-tools                       <none>
ii  imagemagick                      8:6.9.11.60+dfsg-1.3+b3
ii  imagemagick-6.q16 [imagemagick]  8:6.9.11.60+dfsg-1.3+b3
ii  jsbeautifier                     1.14.4-1
ii  libarchive-tools                 3.6.0-1
pn  libxmlb-dev                      <none>
ii  llvm                             1:14.0-55.2+b1
ii  lz4 [liblz4-tool]                1.9.4-1
pn  mono-utils                       <none>
ii  ocaml-nox                        4.13.1-3
pn  odt2txt                          <none>
pn  oggvideotools                    <none>
ii  openjdk-11-jdk [java-sdk]        11.0.17+8-2
ii  openssh-client                   1:9.0p1-1+b2
ii  openssl                          3.0.5-4
ii  pgpdump                          0.34-1
ii  poppler-utils                    22.08.0-2.1
pn  procyon-decompiler               <none>
ii  python3-argcomplete              2.0.0-1
ii  python3-binwalk                  2.3.3+dfsg1-2
ii  python3-debian                   0.1.48
ii  python3-defusedxml               0.7.1-2
ii  python3-guestfs                  1:1.48.4-2+b1
hi  python3-jsondiff                 1.1.1-4
ii  python3-pdfminer                 20220319+dfsg-1
ii  python3-progressbar              2.5-3
ii  python3-pypdf2                   2.11.0-1
ii  python3-pyxattr                  0.7.2-2+b1
ii  python3-rpm                      4.17.1.1+dfsg-1
ii  python3-tlsh                     3.4.4+20151206-1.4+b2
pn  r-base-core                      <none>
pn  radare2                          <none>
ii  rpm2cpio                         4.17.1.1+dfsg-1
ii  sng                              1.1.0-4
ii  sqlite3                          3.39.4-1
ii  squashfs-tools                   1:4.5.1-1
ii  tcpdump                          4.99.1-4+b1
ii  u-boot-tools                     2022.10+dfsg-1
ii  unzip                            6.0-27
pn  wabt                             <none>
pn  xmlbeans                         <none>
ii  xxd                              2:9.0.0626-1
ii  xz-utils                         5.2.5-2.1
ii  zip                              3.0-12
ii  zstd                             1.5.2+dfsg-1

Versions of packages diffoscope suggests:
ii  libjs-jquery  3.6.1+dfsg+~3.5.14-1

-- no debconf information

-- 
bye,
pabs

https://wiki.debian.org/PaulWise