Bug#909122: diffoscope: MemoryError when comparing big ISO images

Tue Sep 18 19:39:28 BST 2018

Marek Marczykowski-Górecki wrote on Tue, Sep 18, 2018 at 20:17:03 +0200:
>       File "/usr/lib/python3/dist-packages/diffoscope/comparators/json.py", line 52, in recognizes
> 	f.read().decode('utf-8', errors='ignore'),
>     MemoryError
> 
> The JSONFile.recognizes function, for context:
> 
>     @classmethod
>     def recognizes(cls, file):
>         with open(file.path, 'rb') as f:
>             # Try fuzzy matching for JSON files
>             is_text = any(
>                 file.magic_file_type.startswith(x)
>                 for x in ('ASCII text', 'UTF-8 Unicode text'),
>             )
>             if is_text and not file.name.endswith('.json'):
>                 buf = f.read(10)
>                 if not any(x in buf for x in b'{['):
>                     return False
>                 f.seek(0)
> 
>             try:
>                 file.parsed = json.loads(
>                     f.read().decode('utf-8', errors='ignore'),
>                     object_pairs_hook=collections.OrderedDict,
>                 )

Slurping the file to a string object is an antipattern.  Instead of
using f.read() to create a 4.5GB string, it would be better to use
json.load(f) to read the file incrementally.  That should raise an
exception rather quickly.

>             except ValueError:
>                 return False
> 
>         return True

> Obviously ISO file is not JSON.
> The whole thing could be avoided if earlier check (if initial 10 chars
> contains '[' or '{') would be executed not only on "text" files.
> Any reasons for that "is_text" there? Alternatively, if is_text=False,
> maybe the function should return False early?
> 
> I can provide a patch for either option, but I'd like to know which one
> of them you prefer.

No opinion on these.

Thanks for including the function in the report.

Cheers,

Daniel