Bug#909122: diffoscope: MemoryError when comparing big ISO images
Daniel Shahaf
danielsh at apache.org
Tue Sep 18 19:39:28 BST 2018
Marek Marczykowski-Górecki wrote on Tue, Sep 18, 2018 at 20:17:03 +0200:
> File "/usr/lib/python3/dist-packages/diffoscope/comparators/json.py", line 52, in recognizes
> f.read().decode('utf-8', errors='ignore'),
> MemoryError
>
> The JSONFile.recognizes function, for context:
>
> @classmethod
> def recognizes(cls, file):
> with open(file.path, 'rb') as f:
> # Try fuzzy matching for JSON files
> is_text = any(
> file.magic_file_type.startswith(x)
> for x in ('ASCII text', 'UTF-8 Unicode text'),
> )
> if is_text and not file.name.endswith('.json'):
> buf = f.read(10)
> if not any(x in buf for x in b'{['):
> return False
> f.seek(0)
>
> try:
> file.parsed = json.loads(
> f.read().decode('utf-8', errors='ignore'),
> object_pairs_hook=collections.OrderedDict,
> )
Slurping the file to a string object is an antipattern. Instead of
using f.read() to create a 4.5GB string, it would be better to use
json.load(f) to read the file incrementally. That should raise an
exception rather quickly.
> except ValueError:
> return False
>
> return True
> Obviously ISO file is not JSON.
> The whole thing could be avoided if earlier check (if initial 10 chars
> contains '[' or '{') would be executed not only on "text" files.
> Any reasons for that "is_text" there? Alternatively, if is_text=False,
> maybe the function should return False early?
>
> I can provide a patch for either option, but I'd like to know which one
> of them you prefer.
No opinion on these.
Thanks for including the function in the report.
Cheers,
Daniel
More information about the Reproducible-builds
mailing list