[Python-apps-team] Bug#918756: Broken by .txt.gz

Wed Jan 9 03:18:32 GMT 2019

Package: dodgy
Version: 0.1.9-3
Severity: important
File: /usr/lib/python3/dist-packages/dodgy/run.py

dodgy basically does this:

  • recursively find all regular files under ./
  • for each file,
      • if its MIME type appears to be text/*,
          • assume it is UTF-8
          • assume it is uncompressed
          • search it for "bad" regular expressions, and report any matches

This misdetects compressed text files:

    bash4$ with-temp-dir
    with-temp-dir: entering directory `/tmp/with-temp-dir.stBmJY'
    This directory will be deleted when you exit.
    bash4$ wget -nv https://secure.eicar.org/eicar.com.txt
    2019-01-09 13:54:31 URL:https://secure.eicar.org/eicar.com.txt [68/68] -> "eicar.com.txt" [1]
    bash4$ gzip eicar.com.txt
    bash4$ file --mime eicar.com.txt.gz
    eicar.com.txt.gz: application/gzip; charset=binary
    bash4$ python3 -m mimetypes eicar.com.txt.gz
    type: text/plain encoding: gzip
    bash4$ dodgy
    Traceback (most recent call last):
      File "/usr/bin/dodgy", line 4, in <module>
        dodgy.run.run()
      File "/usr/lib/python3/dist-packages/dodgy/run.py", line 56, in run
        warnings = run_checks(os.getcwd())
      File "/usr/lib/python3/dist-packages/dodgy/run.py", line 44, in run_checks
        for msg_parts in check_file(filepath):
      File "/usr/lib/python3/dist-packages/dodgy/checks.py", line 72, in check_file
        return check_file_contents(to_check.read())
      File "/usr/lib/python3.7/codecs.py", line 701, in read
        return self.reader.read(size)
      File "/usr/lib/python3.7/codecs.py", line 504, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

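This happens because dodgy only looks at the MIME type. It expects compressed files to be application/*, but mimetypes reports the type of the uncompressed content (text/plain) and reports the compression separately as the encoding (gzip).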
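
For reference, a minimal sketch of the same check done from Python rather than from the command line. It uses only the standard mimetypes module and the file name from the transcript above; it is not taken from dodgy's source.

```python
import mimetypes

# guess_type() looks only at the file NAME and returns a (type, encoding) pair.
# The compression shows up in the second element, not in the MIME type.
mime_type, encoding = mimetypes.guess_type("eicar.com.txt.gz")
print("type:", mime_type, "encoding:", encoding)
```

Any caller that tests only the first element for text/* will treat the gzip stream as plain text.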

To fix this, dodgy could be extended to check the encoding that
mimetypes reports alongside the MIME type, and then call open,
gzip.open, bz2.open or lzma.open depending on what kind of
compression is used.
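
A rough sketch of what I mean, assuming the file is selected via mimetypes.guess_type() as described above. The helper name open_text and the _OPENERS table are hypothetical, not part of dodgy, and errors="replace" is my own choice to avoid the crash on non-UTF-8 input.

```python
import bz2
import gzip
import lzma
import mimetypes

# Map the "encoding" strings returned by mimetypes.guess_type() to openers.
# None means the file name suggests no compression at all.
_OPENERS = {
    None: open,
    "gzip": gzip.open,
    "bzip2": bz2.open,
    "xz": lzma.open,
}


def open_text(path):
    """Hypothetical helper: return a text-mode file object for text/* files,
    transparently decompressing where the name suggests it, else None."""
    mime_type, encoding = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("text/"):
        return None
    opener = _OPENERS.get(encoding)
    if opener is None:
        # e.g. "compress" (.Z), which the standard library cannot read
        return None
    # All four openers accept mode="rt" together with encoding and errors.
    return opener(path, mode="rt", encoding="utf-8", errors="replace")
```

This still trusts the file name, so a file that is misnamed would be opened with the wrong opener.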

I think dodgy's assumption of UTF-8 will also produce crashes and false negatives if UTF-16 is used for Windows/Java compatibility reasons.
However, I was not able to produce a trivial example with dodgy itself.
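
The decoding step on its own can be illustrated outside dodgy. This is only a sketch of the two failure modes, not a reproduction against dodgy; the pattern and the sample text are made up for the illustration.

```python
import re

# Stand-in for one of dodgy's "bad" regular expressions (hypothetical).
PATTERN = re.compile(r"password")
text = "password = 'example'\n"

# UTF-16 with a byte order mark: 0xff and 0xfe are never valid in UTF-8,
# so decoding as UTF-8 raises UnicodeDecodeError (the crash case).
try:
    text.encode("utf-16").decode("utf-8")
    crashed = False
except UnicodeDecodeError:
    crashed = True

# UTF-16 without a BOM: ASCII-range text becomes ASCII bytes interleaved
# with NUL bytes, which is valid UTF-8, so decoding succeeds but the
# pattern no longer matches (the false negative case).
decoded = text.encode("utf-16-le").decode("utf-8")
missed = PATTERN.search(decoded) is None

print("crashed:", crashed, "missed:", missed)
```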

You may also want to switch from mimetypes to python3-magic, which uses libmagic1 (i.e. file(1)).
This would make dodgy guess the MIME type based on the file's *CONTENTS*, rather than just the file's *NAME*.
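
I have not written the python3-magic version. As a standard-library stand-in for the same idea (this is not libmagic, and it covers only three formats), the leading bytes of the file can be checked directly. Note that the 0x8b at position 1 in the traceback above is the second byte of the gzip signature. The name sniff_compression is hypothetical.

```python
# Leading "magic" bytes of a few compressed formats.
_MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
)


def sniff_compression(path):
    """Hypothetical helper: guess the compression from the file's contents,
    ignoring its name. Returns "gzip", "bzip2", "xz" or None."""
    with open(path, "rb") as f:
        head = f.read(6)  # the longest signature above is 6 bytes
    for magic, name in _MAGIC:
        if head.startswith(magic):
            return name
    return None
```

Unlike the name-based guess, this also catches a gzip file that has been renamed to end in .txt.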

-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_AU.utf8, LC_CTYPE=en_AU.utf8 (charmap=UTF-8), LANGUAGE=en_AU:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages dodgy depends on:
ii  python3  3.7.1-3

dodgy recommends no packages.

dodgy suggests no packages.

-- no debconf information