[Python-apps-team] Bug#918756: Broken by .txt.gz
Trent W. Buck
trentbuck at gmail.com
Wed Jan 9 03:18:32 GMT 2019
Package: dodgy
Version: 0.1.9-3
Severity: important
File: /usr/lib/python3/dist-packages/dodgy/run.py
dodgy basically does this:
• recursively find all regular files under ./
• for each file,
  • if its MIME type appears to be text/*,
    • assume it is UTF-8
    • assume it is uncompressed
    • search it for "bad" regular expressions, and report any matches
This misdetects compressed text files:
bash4$ with-temp-dir
with-temp-dir: entering directory `/tmp/with-temp-dir.stBmJY'
This directory will be deleted when you exit.
bash4$ wget -nv https://secure.eicar.org/eicar.com.txt
2019-01-09 13:54:31 URL:https://secure.eicar.org/eicar.com.txt [68/68] -> "eicar.com.txt" [1]
bash4$ gzip eicar.com.txt
bash4$ file --mime eicar.com.txt.gz
eicar.com.txt.gz: application/gzip; charset=binary
bash4$ python3 -m mimetypes eicar.com.txt.gz
type: text/plain encoding: gzip
bash4$ dodgy
Traceback (most recent call last):
  File "/usr/bin/dodgy", line 4, in <module>
    dodgy.run.run()
  File "/usr/lib/python3/dist-packages/dodgy/run.py", line 56, in run
    warnings = run_checks(os.getcwd())
  File "/usr/lib/python3/dist-packages/dodgy/run.py", line 44, in run_checks
    for msg_parts in check_file(filepath):
  File "/usr/lib/python3/dist-packages/dodgy/checks.py", line 72, in check_file
    return check_file_contents(to_check.read())
  File "/usr/lib/python3.7/codecs.py", line 701, in read
    return self.reader.read(size)
  File "/usr/lib/python3.7/codecs.py", line 504, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
This happens because dodgy expects compressed files to be application/*, but mimetypes reports them as text/*, with the compression recorded separately in the encoding field (here, gzip).
To fix this, dodgy can be extended to check the encoding property and
then use a different open call (e.g. open, gzip.open, lzma.open)
depending on what kind of compression is used.
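A minimal sketch of that idea, using only the standard library. This is not dodgy's code: the helper name open_text and the OPENERS table are hypothetical, and the encoding names are the ones mimetypes reports for .gz, .bz2 and .xz.

```python
import bz2
import gzip
import lzma
import mimetypes

# Map the "encoding" half of mimetypes.guess_type() to a matching opener.
# None means the name suggests no compression at all.
OPENERS = {
    None: open,
    "gzip": gzip.open,
    "bzip2": bz2.open,
    "xz": lzma.open,
}


def open_text(path):
    """Hypothetical helper: open a text/* file, decompressing if needed.

    Returns None when the name does not look like text, or when the
    compression is one this table has no opener for (e.g. "compress").
    """
    mime_type, encoding = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("text/"):
        return None
    opener = OPENERS.get(encoding)
    if opener is None:
        return None
    # All four openers accept text mode "rt" plus an encoding argument.
    return opener(path, "rt", encoding="utf-8")
```

With this, eicar.com.txt.gz would be read through gzip.open instead of being decoded directly as UTF-8. The UTF-8 assumption itself is left unchanged here.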
I think dodgy's assumption of UTF-8 will also produce crashes and false negatives if UTF-16 is used for Windows/Java compatibility reasons.
However, I was not able to produce a trivial example.
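The two failure modes can at least be illustrated at the codec level. The sketch below does not reproduce anything against dodgy itself (whether dodgy would even select such a file depends on its MIME check), and the sample text and pattern are made up for illustration.

```python
import re

text = 'password = "hunter2"\n'  # hypothetical file content

# With a byte order mark (Python's "utf-16" codec writes one), the first
# byte is 0xff or 0xfe, neither of which can start a UTF-8 sequence.
with_bom = text.encode("utf-16")
try:
    with_bom.decode("utf-8")
    crashed = False
except UnicodeDecodeError:
    crashed = True

# Without a BOM, ASCII-range UTF-16-LE is valid UTF-8, but every character
# is followed by a NUL, so a plain-text pattern no longer matches.
no_bom = text.encode("utf-16-le")
decoded = no_bom.decode("utf-8")
missed = re.search(r"password", decoded) is None
```

Here crashed corresponds to the crash case and missed to the false-negative case.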
You may also want to switch from mimetypes to python3-magic, which uses libmagic1 (i.e. file(1)).
This will make dodgy guess the MIME type based on the file's *CONTENTS*, rather than just the file's *NAME*.
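To show the difference between name-based and content-based detection without depending on python3-magic, here is a stdlib-only sketch that sniffs the gzip magic number. It is a stand-in for illustration, not the libmagic approach suggested above, and the helper name looks_gzipped is hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(path):
    """Hypothetical helper: inspect the file's first bytes, ignoring its name."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

This matches the traceback above, where the byte at position 1 is 0x8b. A gzip stream saved under a plain .txt name would pass the name-based check as uncompressed text but still be caught by looks_gzipped.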
-- System Information:
Debian Release: buster/sid
APT prefers testing
APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Kernel: Linux 4.19.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_AU.utf8, LC_CTYPE=en_AU.utf8 (charmap=UTF-8), LANGUAGE=en_AU:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages dodgy depends on:
ii python3 3.7.1-3
dodgy recommends no packages.
dodgy suggests no packages.
-- no debconf information