Bug#909122: diffoscope: MemoryError when comparing big ISO images

Marek Marczykowski-Górecki marmarek at invisiblethingslab.com
Tue Sep 18 19:17:03 BST 2018


Package: diffoscope
Version: 101
Severity: normal

Dear Maintainer,

When comparing two 4.5GB ISO images, diffoscope tries to load each of them
entirely into memory, which fails with a MemoryError in the JSON comparator:

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/diffoscope/main.py", line 470, in main
        sys.exit(run_diffoscope(parsed_args))
      File "/usr/lib/python3/dist-packages/diffoscope/main.py", line 442, in run_diffoscope
        difference = compare_root_paths(path1, path2)
      File "/usr/lib/python3/dist-packages/diffoscope/comparators/utils/compare.py", line 65, in compare_root_paths
        file1 = specialize(FilesystemFile(path1, container=container1))
      File "/usr/lib/python3/dist-packages/diffoscope/comparators/utils/specialize.py", line 49, in specialize
        if try_recognize(file, cls, cls.recognizes):
      File "/usr/lib/python3/dist-packages/diffoscope/comparators/utils/specialize.py", line 36, in try_recognize
        if not recognizes(file):
      File "/usr/lib/python3/dist-packages/diffoscope/comparators/json.py", line 52, in recognizes
        f.read().decode('utf-8', errors='ignore'),
    MemoryError

Obviously, an ISO file is not JSON.
The whole thing could be avoided if the earlier check (whether the first
10 bytes contain '[' or '{') were run on all files, not only on "text" files.
Is there a reason for the "is_text" condition there? Alternatively, if
is_text is False, maybe the function should return False early?

I can provide a patch for either option, but I'd like to know which one
of them you prefer.

The JSONFile.recognizes function, for context:

    @classmethod
    def recognizes(cls, file):
        with open(file.path, 'rb') as f:
            # Try fuzzy matching for JSON files
            is_text = any(
                file.magic_file_type.startswith(x)
                for x in ('ASCII text', 'UTF-8 Unicode text'),
            )
            if is_text and not file.name.endswith('.json'):
                buf = f.read(10)
                if not any(x in buf for x in b'{['):
                    return False
                f.seek(0)

            try:
                file.parsed = json.loads(
                    f.read().decode('utf-8', errors='ignore'),
                    object_pairs_hook=collections.OrderedDict,
                )
            except ValueError:
                return False

        return True
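
To make the two options concrete, here is a minimal standalone sketch. It is
not a patch against diffoscope: recognizes_json, its parameters and the
require_text flag are hypothetical names used only for illustration, and the
function returns a plain bool instead of storing file.parsed.

```python
import collections
import json

# Same libmagic descriptions the existing check treats as "text".
TEXT_TYPES = ('ASCII text', 'UTF-8 Unicode text')


def recognizes_json(path, name, magic_file_type, require_text=False):
    """Hypothetical stand-in for JSONFile.recognizes; returns True or False."""
    is_text = magic_file_type.startswith(TEXT_TYPES)

    # Option 2: return False early for anything not detected as text.
    if require_text and not is_text:
        return False

    with open(path, 'rb') as f:
        # Option 1: peek at the first 10 bytes of every file that lacks a
        # .json suffix, whether or not it was detected as text.
        if not name.endswith('.json'):
            buf = f.read(10)
            if not any(x in buf for x in b'{['):
                return False
            f.seek(0)

        # Only files that passed the cheap checks are read in full.
        try:
            json.loads(
                f.read().decode('utf-8', errors='ignore'),
                object_pairs_hook=collections.OrderedDict,
            )
        except ValueError:
            return False

    return True
```

Note that with option 1 alone, a large binary file whose first 10 bytes happen
to contain '{' or '[' would still be read completely, so the two options are
not equivalent.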


-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 4.14.67-1.pvops.qubes.x86_64 (SMP w/8 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968), LANGUAGE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /usr/bin/dash
Init: unable to detect

Versions of packages diffoscope depends on:
ii  libpython3.6-stdlib    3.6.6-1
ii  python3                3.6.5-3
ii  python3-distro         1.3.0-1
ii  python3-distutils      3.6.6-1
ii  python3-libarchive-c   2.1-3.1
ii  python3-magic          2:0.4.15-2
ii  python3-pkg-resources  40.2.0-1

Versions of packages diffoscope recommends:
ii  abootimg                         0.6-1+b2
ii  acl                              2.2.52-3+b1
pn  apktool                          <none>
ii  binutils-multiarch               2.31.1-5
ii  bzip2                            1.0.6-9
ii  caca-utils                       0.99.beta19-2+b3
ii  colord                           1.3.3-2
ii  db-util                          5.3.1
ii  default-jdk-headless             2:1.10-68
ii  device-tree-compiler             1.4.7-3
ii  docx2txt                         1.4-1
ii  e2fsprogs                        1.44.4-2
ii  enjarify                         1:1.0.3-4
ii  fontforge-extras                 0.3-4
ii  fp-utils                         3.0.4+dfsg-20
ii  fp-utils-3.0.4 [fp-utils]        3.0.4+dfsg-20
ii  genisoimage                      9:1.1.11-3+b2
ii  gettext                          0.19.8.1-7
ii  ghc                              8.2.2-4
ii  ghostscript                      9.25~dfsg-2
ii  giflib-tools                     5.1.4-3
ii  gnumeric                         1.12.41-1
ii  gnupg                            2.2.10-1
ii  imagemagick                      8:6.9.10.8+dfsg-1
ii  imagemagick-6.q16 [imagemagick]  8:6.9.10.8+dfsg-1
ii  jsbeautifier                     1.6.4-7
ii  libarchive-tools                 3.2.2-5
ii  llvm                             1:6.0-43
ii  lz4                              1.8.2-1
ii  mono-utils                       4.6.2.7+dfsg-1
ii  odt2txt                          0.5-1+b2
pn  oggvideotools                    <none>
ii  openssh-client                   1:7.8p1-1
ii  pgpdump                          0.33-1
ii  poppler-utils                    0.63.0-2
ii  procyon-decompiler               0.5.32-4
ii  python3-argcomplete              1.8.1-1
ii  python3-binwalk                  2.1.2~git20180830+dfsg1-1
ii  python3-debian                   0.1.33
ii  python3-defusedxml               0.5.0-1
ii  python3-guestfs                  1:1.38.4-1
ii  python3-jsondiff                 1.1.1-2
ii  python3-progressbar              2.3-4
ii  python3-pyxattr                  0.6.0-2+b2
ii  python3-tlsh                     3.4.4+20151206-1+b4
ii  r-base-core                      3.5.1-1+b1
ii  rpm2cpio                         4.14.1+dfsg1-4
ii  sng                              1.1.0-1+b1
ii  sqlite3                          3.24.0-1
ii  squashfs-tools                   1:4.3-6
ii  tcpdump                          4.9.2-3
ii  unzip                            6.0-21
ii  vim-common                       2:8.1.0320-1
ii  xmlbeans                         2.6.0+dfsg-4
ii  xxd                              2:8.1.0320-1
ii  xz-utils                         5.2.2-1.3

Versions of packages diffoscope suggests:
ii  libjs-jquery  3.2.1-1

-- no debconf information

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?