[Pkg-privacy-commits] [Git][pkg-privacy-team/mat2][upstream] New upstream version 0.9.0
Georg Faerber
georg at debian.org
Thu Jul 11 16:52:37 BST 2019
Georg Faerber pushed to branch upstream at Privacy Maintainers / mat2
Commits:
2b198534 by Georg Faerber at 2019-07-10T15:32:16Z
New upstream version 0.9.0
- - - - -
23 changed files:
- .gitlab-ci.yml
- CHANGELOG.md
- CONTRIBUTING.md
- README.md
- doc/mat2.1
- libmat2/__init__.py
- libmat2/abstract.py
- libmat2/archive.py
- libmat2/audio.py
- libmat2/epub.py
- libmat2/exiftool.py
- libmat2/images.py
- libmat2/office.py
- libmat2/parser_factory.py
- libmat2/torrent.py
- libmat2/video.py
- libmat2/web.py
- mat2
- nautilus/mat2.py
- setup.py
- tests/test_climat2.py
- tests/test_corrupted_files.py
- tests/test_libmat2.py
Changes:
=====================================
.gitlab-ci.yml
=====================================
@@ -1,76 +1,69 @@
-image: debian
+variables:
+ CONTAINER_REGISTRY: $CI_REGISTRY/georg/mat2-ci-images
stages:
- linting
- test
-bandit:
+linting:bandit:
+ image: $CONTAINER_REGISTRY:linting
stage: linting
script: # TODO: remove B405 and B314
- - apt-get -qqy update
- - apt-get -qqy install --no-install-recommends python3-bandit
- - bandit ./mat2 --format txt --skip B101
- - bandit -r ./nautilus/ --format txt --skip B101
- - bandit -r ./libmat2 --format txt --skip B101,B404,B603,B405,B314
+ - bandit ./mat2 --format txt --skip B101
+ - bandit -r ./nautilus/ --format txt --skip B101
+ - bandit -r ./libmat2 --format txt --skip B101,B404,B603,B405,B314
-pylint:
+linting:pylint:
+ image: $CONTAINER_REGISTRY:linting
stage: linting
script:
- - apt-get -qqy update
- - apt-get -qqy install --no-install-recommends pylint3 python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0
- - pylint3 --extension-pkg-whitelist=cairo,gi ./libmat2 ./mat2
- # Once nautilus-python is in Debian, decomment it form the line below
- - pylint3 --extension-pkg-whitelist=Nautilus,GObject,Gtk,Gio,GLib,gi ./nautilus/mat2.py
+ - pylint3 --disable=no-else-return --extension-pkg-whitelist=cairo,gi ./libmat2 ./mat2
+ # Once nautilus-python is in Debian, uncomment the line below
+ - pylint3 --disable=no-else-return --extension-pkg-whitelist=Nautilus,GObject,Gtk,Gio,GLib,gi ./nautilus/mat2.py
-pyflakes:
+linting:pyflakes:
+ image: $CONTAINER_REGISTRY:linting
stage: linting
script:
- - apt-get -qqy update
- - apt-get -qqy install --no-install-recommends pyflakes3
- - pyflakes3 ./libmat2 ./mat2 ./tests/ ./nautilus
+ - pyflakes3 ./libmat2 ./mat2 ./tests/ ./nautilus
-mypy:
+linting:mypy:
+ image: $CONTAINER_REGISTRY:linting
stage: linting
script:
- - apt-get -qqy update
- - apt-get -qqy install --no-install-recommends python3-pip
- - pip3 install mypy
- - mypy --ignore-missing-imports mat2 libmat2/*.py ./nautilus/mat2.py
+ - mypy --ignore-missing-imports mat2 libmat2/*.py ./nautilus/mat2.py
+tests:archlinux:
+ image: $CONTAINER_REGISTRY:archlinux
+ stage: test
+ script:
+ - python3 setup.py test
+
tests:debian:
+ image: $CONTAINER_REGISTRY:debian
stage: test
script:
- - apt-get -qqy update
- - apt-get -qqy install --no-install-recommends python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl python3-coverage ffmpeg
- - apt-get -qqy purge bubblewrap
- - python3-coverage run --branch -m unittest discover -s tests/
- - python3-coverage report --fail-under=90 -m --include 'libmat2/*'
+ - apt-get -qqy purge bubblewrap
+ - python3-coverage run --branch -m unittest discover -s tests/
+ - python3-coverage report --fail-under=90 -m --include 'libmat2/*'
tests:debian_with_bubblewrap:
+ image: $CONTAINER_REGISTRY:debian
stage: test
- tags:
- - whitewhale
+ allow_failure: true
script:
- - apt-get -qqy update
- - apt-get -qqy install --no-install-recommends python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl python3-coverage ffmpeg bubblewrap
- - python3-coverage run --branch -m unittest discover -s tests/
- - python3-coverage report --fail-under=100 -m --include 'libmat2/*'
+ - python3-coverage run --branch -m unittest discover -s tests/
+ - python3-coverage report --fail-under=100 -m --include 'libmat2/*'
tests:fedora:
- image: fedora
+ image: $CONTAINER_REGISTRY:fedora
stage: test
- tags:
- - whitewhale
script:
- - dnf install -y python3 python3-mutagen python3-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 gdk-pixbuf2-modules cairo-gobject cairo python3-cairo perl-Image-ExifTool mailcap
- - gdk-pixbuf-query-loaders-64 > /usr/lib64/gdk-pixbuf-2.0/2.10.0/loaders.cache
- - python3 setup.py test
+ - python3 setup.py test
-tests:archlinux:
- image: archlinux/base
+tests:gentoo:
+ image: $CONTAINER_REGISTRY:gentoo
stage: test
- tags:
- - whitewhale
+ allow_failure: true
script:
- - pacman -Sy --noconfirm python-mutagen python-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 python-cairo perl-image-exiftool python-setuptools mailcap ffmpeg
- - python3 setup.py test
+ - python3 -m unittest discover -v
=====================================
CHANGELOG.md
=====================================
@@ -1,3 +1,13 @@
+# 0.9.0 - 2019-05-10
+
+- Add tar/tar.gz/tar.bz2/tar.xz archives support
+- Add support for xhtml files
+- Improve handling of read-only files
+- Slightly improve the command line documentation
+- Fix a confusing error message
+- Add even more tests
+- Usual internal cleanups/refactorings
+
# 0.8.0 - 2019-02-28
- Add support for epub files
=====================================
CONTRIBUTING.md
=====================================
@@ -1,7 +1,7 @@
# Contributing to MAT2
The main repository for MAT2 is on [0xacab]( https://0xacab.org/jvoisin/mat2 ),
-with a mirror on [gitlab.com]( https://gitlab.com/jvoisin/mat2 ).
+but you can send patches to jvoisin by [email](https://dustri.org/) if you prefer.
Do feel free to pick up [an issue]( https://0xacab.org/jvoisin/mat2/issues )
and to send a pull-request. Please do check that everything is fine by running the
@@ -29,9 +29,10 @@ Since MAT2 is written in Python3, please conform as much as possible to the
6. Create a tag with `git tag -s $VERSION`
7. Push the commit with `git push origin master`
8. Push the tag with `git push --tags`
-9. Create the signed tarball with `git archive --format=tar.xz --prefix=mat-$VERSION/ $VERSION > mat-$VERSION.tar.xz`
-10. Sign the tarball with `gpg --armor --detach-sign mat-$VERSION.tar.xz`
-11. Upload the result on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
-12. Announce the release on the [mailing list](https://mailman.boum.org/listinfo/mat-dev)
-13. Upload the new version on pypi with `python3 setup.py sdist bdist_wheel` then `twine upload -s dist/*`
-14. Do the secret release dance
+9. Download the gitlab archive of the release
+10. Diff it against the local copy
+11. If there is no difference, sign the archive with `gpg --armor --detach-sign mat2-$VERSION.tar.xz`
+12. Upload the signature on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
+13. Announce the release on the [mailing list](https://mailman.boum.org/listinfo/mat-dev)
+14. Sign'n'upload the new version on pypi with `python3 setup.py sdist bdist_wheel` then `twine upload -s dist/*`
+15. Do the secret release dance
=====================================
README.md
=====================================
@@ -30,7 +30,7 @@ metadata.
- `python3-mutagen` for audio support
- `python3-gi-cairo` and `gir1.2-poppler-0.18` for PDF support
- `gir1.2-gdkpixbuf-2.0` for images support
-- `FFmpeg`, optionally, for video support
+- `FFmpeg`, optionally, for video support
- `libimage-exiftool-perl` for everything else
- `bubblewrap`, optionally, for sandboxing
@@ -70,7 +70,8 @@ optional arguments:
-V, --verbose show more verbose status information
--unknown-members policy
how to handle unknown members of archive-style files
- (policy should be one of: abort, omit, keep)
+ (policy should be one of: abort, omit, keep) [Default:
+ abort]
-s, --show list harmful metadata detectable by MAT2 without
removing them
-L, --lightweight remove SOME metadata
=====================================
doc/mat2.1
=====================================
@@ -1,4 +1,4 @@
-.TH MAT2 "1" "February 2019" "MAT2 0.8.0" "User Commands"
+.TH MAT2 "1" "May 2019" "MAT2 0.9.0" "User Commands"
.SH NAME
mat2 \- the metadata anonymisation toolkit 2
@@ -8,7 +8,7 @@ mat2 \- the metadata anonymisation toolkit 2
.SH DESCRIPTION
.B mat2
-removes metadata from various fileformats. It supports a wide variety of file
+removes metadata from various file formats. It supports a wide variety of file
formats, audio, office, images, …
Careful, mat2 does not clean files in-place, instead, it will produce a file with the word
=====================================
libmat2/__init__.py
=====================================
@@ -30,27 +30,35 @@ UNSUPPORTED_EXTENSIONS = {
}
DEPENDENCIES = {
- 'cairo': 'Cairo',
- 'gi': 'PyGobject',
- 'gi.repository.GdkPixbuf': 'GdkPixbuf from PyGobject',
- 'gi.repository.Poppler': 'Poppler from PyGobject',
- 'gi.repository.GLib': 'GLib from PyGobject',
- 'mutagen': 'Mutagen',
+ 'Cairo': 'cairo',
+ 'PyGobject': 'gi',
+ 'GdkPixbuf from PyGobject': 'gi.repository.GdkPixbuf',
+ 'Poppler from PyGobject': 'gi.repository.Poppler',
+ 'GLib from PyGobject': 'gi.repository.GLib',
+ 'Mutagen': 'mutagen',
}
+CMD_DEPENDENCIES = {
+ 'Exiftool': exiftool._get_exiftool_path,
+ 'Ffmpeg': video._get_ffmpeg_path,
+ }
def check_dependencies() -> Dict[str, bool]:
ret = collections.defaultdict(bool) # type: Dict[str, bool]
- ret['Exiftool'] = bool(exiftool._get_exiftool_path())
- ret['Ffmpeg'] = bool(video._get_ffmpeg_path())
-
for key, value in DEPENDENCIES.items():
- ret[value] = True
+ ret[key] = True
try:
- importlib.import_module(key)
+ importlib.import_module(value)
except ImportError: # pragma: no cover
- ret[value] = False # pragma: no cover
+ ret[key] = False # pragma: no cover
+
+ for k, v in CMD_DEPENDENCIES.items():
+ ret[k] = True
+ try:
+ v()
+ except RuntimeError: # pragma: no cover
+ ret[k] = False
return ret
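The refactored check drives both Python-module imports and external-tool probes from declarative tables. A minimal standalone sketch of that pattern (the table contents and the `probe_exiftool` stand-in below are illustrative, not mat2's actual values):

```python
import collections
import importlib
from typing import Callable, Dict

# Hypothetical tables mirroring the DEPENDENCIES / CMD_DEPENDENCIES split:
# human-readable name -> import path, and name -> probe callable.
MODULE_DEPS = {'Mutagen': 'mutagen', 'PyGobject': 'gi'}

def probe_exiftool() -> str:
    # Stand-in for _get_exiftool_path(); raises when the binary is missing.
    raise RuntimeError('exiftool not found')

CMD_DEPS: Dict[str, Callable[[], str]] = {'Exiftool': probe_exiftool}

def check_dependencies() -> Dict[str, bool]:
    ret = collections.defaultdict(bool)  # type: Dict[str, bool]
    for name, module in MODULE_DEPS.items():
        ret[name] = True
        try:
            importlib.import_module(module)
        except ImportError:
            ret[name] = False
    for name, probe in CMD_DEPS.items():
        ret[name] = True
        try:
            probe()
        except RuntimeError:
            ret[name] = False
    return dict(ret)
```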
=====================================
libmat2/abstract.py
=====================================
@@ -25,17 +25,22 @@ class AbstractParser(abc.ABC):
self.filename = filename
fname, extension = os.path.splitext(filename)
+
+ # Special case for tar.gz, tar.bz2, … files
+ if fname.endswith('.tar') and len(fname) > 4:
+ fname, extension = fname[:-4], '.tar' + extension
+
self.output_filename = fname + '.cleaned' + extension
self.lightweight_cleaning = False
@abc.abstractmethod
def get_meta(self) -> Dict[str, Union[str, dict]]:
- pass # pragma: no cover
+ """Return all the metadata of the current file"""
@abc.abstractmethod
def remove_all(self) -> bool:
"""
+ Remove all the metadata of the current file
+
:raises RuntimeError: Raised if the cleaning process went wrong.
"""
- # pylint: disable=unnecessary-pass
- pass # pragma: no cover
=====================================
libmat2/archive.py
=====================================
@@ -1,5 +1,8 @@
+import abc
+import stat
import zipfile
import datetime
+import tarfile
import tempfile
import os
import logging
@@ -11,14 +14,41 @@ from . import abstract, UnknownMemberPolicy, parser_factory
# Make pyflakes happy
assert Set
assert Pattern
-assert List
-assert Union
+
+# pylint: disable=not-callable,assignment-from-no-return,too-many-branches
+
+# An ArchiveClass is a class representing an archive,
+# while an ArchiveMember is a class representing an element
+# (usually a file) of an archive.
+ArchiveClass = Union[zipfile.ZipFile, tarfile.TarFile]
+ArchiveMember = Union[zipfile.ZipInfo, tarfile.TarInfo]
class ArchiveBasedAbstractParser(abstract.AbstractParser):
- """ Office files (.docx, .odt, …) are zipped files. """
+ """Base class for all archive-based formats.
+
+ Welcome to a world of frustrating complexity and tediousness:
+ - A lot of file formats (docx, odt, epubs, …) are archive-based,
+ so we need to add callbacks everywhere to allow their respective
+ parsers to apply specific cleanup to the required files.
+ - Python has two different modules to deal with .tar and .zip files,
+ with similar-but-oh-so-different APIs, so we need to write
+ a ghetto-wrapper to avoid duplicating everything
+ - The combination of @staticmethod and @abstractstaticmethod is
+ required because for now, mypy doesn't know that
+ @abstractstaticmethod is, indeed, a static method.
+ - Mypy is too dumb (yet) to realise that a type A is valid under
+ the Union[A, B] constraint, hence the weird `# type: ignore`
+ annotations.
+ """
+ # Tarfiles can optionally support compression
+ # https://docs.python.org/3/library/tarfile.html#tarfile.open
+ compression = ''
+
def __init__(self, filename):
super().__init__(filename)
+ self.archive_class = None # type: Optional[ArchiveClass]
+ self.member_class = None # type: Optional[ArchiveMember]
# Those are the files that have a format that _isn't_
# supported by MAT2, but that we want to keep anyway.
@@ -32,10 +62,10 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# the archive?
self.unknown_member_policy = UnknownMemberPolicy.ABORT # type: UnknownMemberPolicy
- try: # better fail here than later
- zipfile.ZipFile(self.filename)
- except zipfile.BadZipFile:
- raise ValueError
+ self.is_archive_valid()
+
+ def is_archive_valid(self):
+ """Raise a ValueError is the current archive isn't a valid one."""
def _specific_cleanup(self, full_path: str) -> bool:
""" This method can be used to apply specific treatment
@@ -50,59 +80,64 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
return {} # pragma: no cover
@staticmethod
- def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
- zipinfo.create_system = 3 # Linux
- zipinfo.comment = b''
- zipinfo.date_time = (1980, 1, 1, 0, 0, 0) # this is as early as a zipfile can be
- return zipinfo
+ @abc.abstractstaticmethod
+ def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
+ """Return all the members of the archive."""
@staticmethod
- def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
- metadata = {}
- if zipinfo.create_system == 3: # this is Linux
- pass
- elif zipinfo.create_system == 2:
- metadata['create_system'] = 'Windows'
- else:
- metadata['create_system'] = 'Weird'
+ @abc.abstractstaticmethod
+ def _clean_member(member: ArchiveMember) -> ArchiveMember:
+ """Remove all the metadata for a given member."""
+
+ @staticmethod
+ @abc.abstractstaticmethod
+ def _get_member_meta(member: ArchiveMember) -> Dict[str, str]:
+ """Return all the metadata of a given member."""
- if zipinfo.comment:
- metadata['comment'] = zipinfo.comment # type: ignore
+ @staticmethod
+ @abc.abstractstaticmethod
+ def _get_member_name(member: ArchiveMember) -> str:
+ """Return the name of the given member."""
- if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
- metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
+ @staticmethod
+ @abc.abstractstaticmethod
+ def _add_file_to_archive(archive: ArchiveClass, member: ArchiveMember,
+ full_path: str):
+ """Add the file at full_path to the archive, via the given member."""
- return metadata
+ @staticmethod
+ def _set_member_permissions(member: ArchiveMember, permissions: int) -> ArchiveMember:
+ """Set the permission of the archive member."""
+ # pylint: disable=unused-argument
+ return member
def get_meta(self) -> Dict[str, Union[str, dict]]:
meta = dict() # type: Dict[str, Union[str, dict]]
- with zipfile.ZipFile(self.filename) as zin:
+ with self.archive_class(self.filename) as zin:
temp_folder = tempfile.mkdtemp()
- for item in zin.infolist():
- local_meta = dict() # type: Dict[str, Union[str, Dict]]
- for k, v in self._get_zipinfo_meta(item).items():
- local_meta[k] = v
+ for item in self._get_all_members(zin):
+ local_meta = self._get_member_meta(item)
+ member_name = self._get_member_name(item)
- if item.filename[-1] == '/': # pragma: no cover
+ if member_name[-1] == '/': # pragma: no cover
# `is_dir` is added in Python3.6
continue # don't keep empty folders
zin.extract(member=item, path=temp_folder)
- full_path = os.path.join(temp_folder, item.filename)
+ full_path = os.path.join(temp_folder, member_name)
+ os.chmod(full_path, stat.S_IRUSR)
- specific_meta = self._specific_get_meta(full_path, item.filename)
- for (k, v) in specific_meta.items():
- local_meta[k] = v
+ specific_meta = self._specific_get_meta(full_path, member_name)
+ local_meta = {**local_meta, **specific_meta}
- tmp_parser, _ = parser_factory.get_parser(full_path) # type: ignore
- if tmp_parser:
- for k, v in tmp_parser.get_meta().items():
- local_meta[k] = v
+ member_parser, _ = parser_factory.get_parser(full_path) # type: ignore
+ if member_parser:
+ local_meta = {**local_meta, **member_parser.get_meta()}
if local_meta:
- meta[item.filename] = local_meta
+ meta[member_name] = local_meta
shutil.rmtree(temp_folder)
return meta
@@ -110,17 +145,19 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
def remove_all(self) -> bool:
# pylint: disable=too-many-branches
- with zipfile.ZipFile(self.filename) as zin,\
- zipfile.ZipFile(self.output_filename, 'w') as zout:
+ with self.archive_class(self.filename) as zin,\
+ self.archive_class(self.output_filename, 'w' + self.compression) as zout:
temp_folder = tempfile.mkdtemp()
abort = False
- items = list() # type: List[zipfile.ZipInfo]
- for item in sorted(zin.infolist(), key=lambda z: z.filename):
+ # Sort the items to process, to reduce fingerprinting,
+ # and keep them in the `items` variable.
+ items = list() # type: List[ArchiveMember]
+ for item in sorted(self._get_all_members(zin), key=self._get_member_name):
# Some fileformats do require to have the `mimetype` file
# as the first file in the archive.
- if item.filename == 'mimetype':
+ if self._get_member_name(item) == 'mimetype':
items = [item] + items
else:
items.append(item)
@@ -128,53 +165,57 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# Since files order is a fingerprint factor,
# we're iterating (and thus inserting) them in lexicographic order.
for item in items:
- if item.filename[-1] == '/': # `is_dir` is added in Python3.6
+ member_name = self._get_member_name(item)
+ if member_name[-1] == '/': # `is_dir` is added in Python3.6
continue # don't keep empty folders
zin.extract(member=item, path=temp_folder)
- full_path = os.path.join(temp_folder, item.filename)
+ full_path = os.path.join(temp_folder, member_name)
+
+ original_permissions = os.stat(full_path).st_mode
+ os.chmod(full_path, original_permissions | stat.S_IWUSR | stat.S_IRUSR)
if self._specific_cleanup(full_path) is False:
logging.warning("Something went wrong during deep cleaning of %s",
- item.filename)
+ member_name)
abort = True
continue
- if any(map(lambda r: r.search(item.filename), self.files_to_keep)):
+ if any(map(lambda r: r.search(member_name), self.files_to_keep)):
# those files aren't supported, but we want to add them anyway
pass
- elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
+ elif any(map(lambda r: r.search(member_name), self.files_to_omit)):
continue
else: # supported files that we want to first clean, then add
- tmp_parser, mtype = parser_factory.get_parser(full_path) # type: ignore
- if not tmp_parser:
+ member_parser, mtype = parser_factory.get_parser(full_path) # type: ignore
+ if not member_parser:
if self.unknown_member_policy == UnknownMemberPolicy.OMIT:
logging.warning("In file %s, omitting unknown element %s (format: %s)",
- self.filename, item.filename, mtype)
+ self.filename, member_name, mtype)
continue
elif self.unknown_member_policy == UnknownMemberPolicy.KEEP:
logging.warning("In file %s, keeping unknown element %s (format: %s)",
- self.filename, item.filename, mtype)
+ self.filename, member_name, mtype)
else:
logging.error("In file %s, element %s's format (%s) " \
"isn't supported",
- self.filename, item.filename, mtype)
+ self.filename, member_name, mtype)
abort = True
continue
- if tmp_parser:
- if tmp_parser.remove_all() is False:
+ else:
+ if member_parser.remove_all() is False:
logging.warning("In file %s, something went wrong \
with the cleaning of %s \
(format: %s)",
- self.filename, item.filename, mtype)
+ self.filename, member_name, mtype)
abort = True
continue
- os.rename(tmp_parser.output_filename, full_path)
+ os.rename(member_parser.output_filename, full_path)
- zinfo = zipfile.ZipInfo(item.filename) # type: ignore
- clean_zinfo = self._clean_zipinfo(zinfo)
- with open(full_path, 'rb') as f:
- zout.writestr(clean_zinfo, f.read())
+ zinfo = self.member_class(member_name) # type: ignore
+ zinfo = self._set_member_permissions(zinfo, original_permissions)
+ clean_zinfo = self._clean_member(zinfo)
+ self._add_file_to_archive(zout, clean_zinfo, full_path)
shutil.rmtree(temp_folder)
if abort:
@@ -183,6 +224,188 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
return True
+class TarParser(ArchiveBasedAbstractParser):
+ mimetypes = {'application/x-tar'}
+ def __init__(self, filename):
+ super().__init__(filename)
+ # yes, it's tarfile.open and not tarfile.TarFile,
+ # as stated in the documentation:
+ # https://docs.python.org/3/library/tarfile.html#tarfile.TarFile
+ # This is required to support compressed archives.
+ self.archive_class = tarfile.open
+ self.member_class = tarfile.TarInfo
+
+ def is_archive_valid(self):
+ if tarfile.is_tarfile(self.filename) is False:
+ raise ValueError
+ self.__check_tarfile_safety()
+
+ def __check_tarfile_safety(self):
+ """Checks if the tarfile doesn't have any "suspicious" members.
+
+ This is a rewrite of this patch: https://bugs.python.org/file47826/safetarfile-4.diff
+ inspired by this bug from 2014: https://bugs.python.org/issue21109
+ because Python's stdlib doesn't provide a way to "safely" extract
+ things from a tar file.
+ """
+ names = set()
+ with tarfile.open(self.filename) as f:
+ members = f.getmembers()
+ for member in members:
+ name = member.name
+ if os.path.isabs(name):
+ raise ValueError("The archive %s contains a file with an " \
+ "absolute path: %s" % (self.filename, name))
+ elif os.path.normpath(name).startswith('../') or '/../' in name:
+ raise ValueError("The archive %s contains a file with an " \
+ "path traversal attack: %s" % (self.filename, name))
+
+ if name in names:
+ raise ValueError("The archive %s contains two times the same " \
+ "file: %s" % (self.filename, name))
+ else:
+ names.add(name)
+
+ if member.isfile():
+ if member.mode & stat.S_ISUID:
+ raise ValueError("The archive %s contains a setuid file: %s" % \
+ (self.filename, name))
+ elif member.mode & stat.S_ISGID:
+ raise ValueError("The archive %s contains a setgid file: %s" % \
+ (self.filename, name))
+ elif member.issym():
+ linkname = member.linkname
+ if os.path.normpath(linkname).startswith('..'):
+ raise ValueError('The archive %s contains a symlink pointing ' \
+ 'outside of the archive via a path traversal: %s -> %s' % \
+ (self.filename, name, linkname))
+ if os.path.isabs(linkname):
+ raise ValueError('The archive %s contains a symlink pointing ' \
+ 'outside of the archive: %s -> %s' % \
+ (self.filename, name, linkname))
+ elif member.isdev():
+ raise ValueError("The archive %s contains a non-regular " \
+ "file: %s" % (self.filename, name))
+ elif member.islnk():
+ raise ValueError("The archive %s contains a hardlink: %s" \
+ % (self.filename, name))
+
+ @staticmethod
+ def _clean_member(member: ArchiveMember) -> ArchiveMember:
+ assert isinstance(member, tarfile.TarInfo) # please mypy
+ member.mtime = member.uid = member.gid = 0
+ member.uname = member.gname = ''
+ return member
+
+ @staticmethod
+ def _get_member_meta(member: ArchiveMember) -> Dict[str, str]:
+ assert isinstance(member, tarfile.TarInfo) # please mypy
+ metadata = {}
+ if member.mtime != 0:
+ metadata['mtime'] = str(datetime.datetime.fromtimestamp(member.mtime))
+ if member.uid != 0:
+ metadata['uid'] = str(member.uid)
+ if member.gid != 0:
+ metadata['gid'] = str(member.gid)
+ if member.uname != '':
+ metadata['uname'] = member.uname
+ if member.gname != '':
+ metadata['gname'] = member.gname
+ return metadata
+
+ @staticmethod
+ def _add_file_to_archive(archive: ArchiveClass, member: ArchiveMember,
+ full_path: str):
+ assert isinstance(member, tarfile.TarInfo) # please mypy
+ assert isinstance(archive, tarfile.TarFile) # please mypy
+ archive.add(full_path, member.name, filter=TarParser._clean_member) # type: ignore
+
+ @staticmethod
+ def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
+ assert isinstance(archive, tarfile.TarFile) # please mypy
+ return archive.getmembers() # type: ignore
+
+ @staticmethod
+ def _get_member_name(member: ArchiveMember) -> str:
+ assert isinstance(member, tarfile.TarInfo) # please mypy
+ return member.name
+
+ @staticmethod
+ def _set_member_permissions(member: ArchiveMember, permissions: int) -> ArchiveMember:
+ assert isinstance(member, tarfile.TarInfo) # please mypy
+ member.mode = permissions
+ return member
+
+
+class TarGzParser(TarParser):
+ compression = ':gz'
+ mimetypes = {'application/x-tar+gz'}
+
+
+class TarBz2Parser(TarParser):
+ compression = ':bz2'
+ mimetypes = {'application/x-tar+bz2'}
+
+
+class TarXzParser(TarParser):
+ compression = ':xz'
+ mimetypes = {'application/x-tar+xz'}
+
class ZipParser(ArchiveBasedAbstractParser):
mimetypes = {'application/zip'}
+ def __init__(self, filename):
+ super().__init__(filename)
+ self.archive_class = zipfile.ZipFile
+ self.member_class = zipfile.ZipInfo
+
+ def is_archive_valid(self):
+ try:
+ zipfile.ZipFile(self.filename)
+ except zipfile.BadZipFile:
+ raise ValueError
+
+ @staticmethod
+ def _clean_member(member: ArchiveMember) -> ArchiveMember:
+ assert isinstance(member, zipfile.ZipInfo) # please mypy
+ member.create_system = 3 # Linux
+ member.comment = b''
+ member.date_time = (1980, 1, 1, 0, 0, 0) # this is as early as a zipfile can be
+ return member
+
+ @staticmethod
+ def _get_member_meta(member: ArchiveMember) -> Dict[str, str]:
+ assert isinstance(member, zipfile.ZipInfo) # please mypy
+ metadata = {}
+ if member.create_system == 3: # this is Linux
+ pass
+ elif member.create_system == 2:
+ metadata['create_system'] = 'Windows'
+ else:
+ metadata['create_system'] = 'Weird'
+
+ if member.comment:
+ metadata['comment'] = member.comment # type: ignore
+
+ if member.date_time != (1980, 1, 1, 0, 0, 0):
+ metadata['date_time'] = str(datetime.datetime(*member.date_time))
+
+ return metadata
+
+ @staticmethod
+ def _add_file_to_archive(archive: ArchiveClass, member: ArchiveMember,
+ full_path: str):
+ assert isinstance(archive, zipfile.ZipFile) # please mypy
+ assert isinstance(member, zipfile.ZipInfo) # please mypy
+ with open(full_path, 'rb') as f:
+ archive.writestr(member, f.read())
+
+ @staticmethod
+ def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
+ assert isinstance(archive, zipfile.ZipFile) # please mypy
+ return archive.infolist() # type: ignore
+
+ @staticmethod
+ def _get_member_name(member: ArchiveMember) -> str:
+ assert isinstance(member, zipfile.ZipInfo) # please mypy
+ return member.filename
=====================================
libmat2/audio.py
=====================================
@@ -38,6 +38,8 @@ class MP3Parser(MutagenParser):
metadata = {} # type: Dict[str, Union[str, dict]]
meta = mutagen.File(self.filename).tags
for key in meta:
+ if not hasattr(meta[key], 'text'): # pragma: no cover
+ continue
metadata[key.rstrip(' \t\r\n\0')] = ', '.join(map(str, meta[key].text))
return metadata
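The added `hasattr` guard in `MP3Parser.get_meta` skips ID3 frames that carry no `.text` attribute (binary frames such as embedded cover art). A sketch with stand-in frame objects instead of mutagen's types:

```python
from typing import Dict

class _TextFrame:
    """Stand-in for a mutagen text frame; only the .text attribute matters."""
    def __init__(self, *values):
        self.text = list(values)

def extract_tags(meta: Dict[str, object]) -> Dict[str, str]:
    """Mirror MP3Parser.get_meta: skip frames lacking .text, join the rest."""
    metadata = {}
    for key, frame in meta.items():
        if not hasattr(frame, 'text'):  # e.g. binary APIC (cover art) frames
            continue
        metadata[key.rstrip(' \t\r\n\0')] = ', '.join(map(str, frame.text))
    return metadata

# Illustrative tag dict, not real mutagen output:
tags = {'TIT2\x00': _TextFrame('Title'), 'APIC': object()}
# extract_tags(tags)  -> {'TIT2': 'Title'}
```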
=====================================
libmat2/epub.py
=====================================
@@ -5,7 +5,7 @@ import xml.etree.ElementTree as ET # type: ignore
from . import archive, office
-class EPUBParser(archive.ArchiveBasedAbstractParser):
+class EPUBParser(archive.ZipParser):
mimetypes = {'application/epub+zip', }
metadata_namespace = '{http://purl.org/dc/elements/1.1/}'
=====================================
libmat2/exiftool.py
=====================================
@@ -15,14 +15,14 @@ class ExiftoolParser(abstract.AbstractParser):
from a file, hence why several parsers are re-using its `get_meta`
method.
"""
- meta_whitelist = set() # type: Set[str]
+ meta_allowlist = set() # type: Set[str]
def get_meta(self) -> Dict[str, Union[str, dict]]:
out = subprocess.run([_get_exiftool_path(), '-json', self.filename],
input_filename=self.filename,
check=True, stdout=subprocess.PIPE).stdout
meta = json.loads(out.decode('utf-8'))[0]
- for key in self.meta_whitelist:
+ for key in self.meta_allowlist:
meta.pop(key, None)
return meta
=====================================
libmat2/images.py
=====================================
@@ -15,7 +15,7 @@ assert Set
class PNGParser(exiftool.ExiftoolParser):
mimetypes = {'image/png', }
- meta_whitelist = {'SourceFile', 'ExifToolVersion', 'FileName',
+ meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName',
'Directory', 'FileSize', 'FileModifyDate',
'FileAccessDate', 'FileInodeChangeDate',
'FilePermissions', 'FileType', 'FileTypeExtension',
@@ -44,7 +44,7 @@ class PNGParser(exiftool.ExiftoolParser):
class GIFParser(exiftool.ExiftoolParser):
mimetypes = {'image/gif'}
- meta_whitelist = {'AnimationIterations', 'BackgroundColor', 'BitsPerPixel',
+ meta_allowlist = {'AnimationIterations', 'BackgroundColor', 'BitsPerPixel',
'ColorResolutionDepth', 'Directory', 'Duration',
'ExifToolVersion', 'FileAccessDate',
'FileInodeChangeDate', 'FileModifyDate', 'FileName',
@@ -86,7 +86,7 @@ class GdkPixbufAbstractParser(exiftool.ExiftoolParser):
class JPGParser(GdkPixbufAbstractParser):
_type = 'jpeg'
mimetypes = {'image/jpeg'}
- meta_whitelist = {'SourceFile', 'ExifToolVersion', 'FileName',
+ meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName',
'Directory', 'FileSize', 'FileModifyDate',
'FileAccessDate', "FileInodeChangeDate",
'FilePermissions', 'FileType', 'FileTypeExtension',
@@ -99,7 +99,7 @@ class JPGParser(GdkPixbufAbstractParser):
class TiffParser(GdkPixbufAbstractParser):
_type = 'tiff'
mimetypes = {'image/tiff'}
- meta_whitelist = {'Compression', 'ExifByteOrder', 'ExtraSamples',
+ meta_allowlist = {'Compression', 'ExifByteOrder', 'ExtraSamples',
'FillOrder', 'PhotometricInterpretation',
'PlanarConfiguration', 'RowsPerStrip', 'SamplesPerPixel',
'StripByteCounts', 'StripOffsets', 'BitsPerSample',
=====================================
libmat2/office.py
=====================================
@@ -6,7 +6,7 @@ from typing import Dict, Set, Pattern, Tuple, Any
import xml.etree.ElementTree as ET # type: ignore
-from .archive import ArchiveBasedAbstractParser
+from .archive import ZipParser
# pylint: disable=line-too-long
@@ -43,7 +43,7 @@ def _sort_xml_attributes(full_path: str) -> bool:
return True
-class MSOfficeParser(ArchiveBasedAbstractParser):
+class MSOfficeParser(ZipParser):
mimetypes = {
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
@@ -89,7 +89,7 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
r'^word/theme',
r'^word/people\.xml$',
- # we have a whitelist in self.files_to_keep,
+ # we have an allowlist in self.files_to_keep,
# so we can trash everything else
r'^word/_rels/',
}))
@@ -100,7 +100,7 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
def __fill_files_to_keep_via_content_types(self) -> bool:
""" There is a suer-handy `[Content_Types].xml` file
in MS Office archives, describing what each other file contains.
- The self.content_types_to_keep member contains a type whitelist,
+ The self.content_types_to_keep member contains a type allowlist,
so we're using it to fill the self.files_to_keep one.
"""
with zipfile.ZipFile(self.filename) as zin:
@@ -220,7 +220,7 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
for file_to_omit in self.files_to_omit:
if file_to_omit.search(fname):
matches = map(lambda r: r.search(fname), self.files_to_keep)
- if any(matches): # the file is whitelisted
+ if any(matches): # the file is in the allowlist
continue
removed_fnames.add(fname)
break
@@ -312,7 +312,7 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
return {file_path: 'harmful content', }
-class LibreOfficeParser(ArchiveBasedAbstractParser):
+class LibreOfficeParser(ZipParser):
mimetypes = {
'application/vnd.oasis.opendocument.text',
'application/vnd.oasis.opendocument.spreadsheet',
=====================================
libmat2/parser_factory.py
=====================================
@@ -7,13 +7,10 @@ from typing import TypeVar, List, Tuple, Optional
from . import abstract, UNSUPPORTED_EXTENSIONS
-assert Tuple # make pyflakes happy
-
T = TypeVar('T', bound='abstract.AbstractParser')
mimetypes.add_type('application/epub+zip', '.epub')
-# EPUB Navigation Control XML File
-mimetypes.add_type('application/x-dtbncx+xml', '.ncx')
+mimetypes.add_type('application/x-dtbncx+xml', '.ncx') # EPUB Navigation Control XML File
def __load_all_parsers():
@@ -43,13 +40,17 @@ def _get_parsers() -> List[T]:
def get_parser(filename: str) -> Tuple[Optional[T], Optional[str]]:
- """ Return the appropriate parser for a giver filename. """
+ """ Return the appropriate parser for a given filename. """
mtype, _ = mimetypes.guess_type(filename)
_, extension = os.path.splitext(filename)
if extension.lower() in UNSUPPORTED_EXTENSIONS:
return None, mtype
+ if mtype == 'application/x-tar':
+ if extension[1:] in ('bz2', 'gz', 'xz'):
+ mtype = mtype + '+' + extension[1:]
+
for parser_class in _get_parsers(): # type: ignore
if mtype in parser_class.mimetypes:
try:
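
The hunk above teaches the parser factory about compressed tarballs by gluing the compression suffix onto the base mimetype. A minimal standalone sketch of that mapping (the helper name is ours, not mat2's):

```python
import mimetypes
import os


def guess_tar_mimetype(filename: str) -> str:
    """Mimic the parser_factory hunk: for .tar.bz2/.tar.gz/.tar.xz,
    append the compression suffix to 'application/x-tar'."""
    mtype, _ = mimetypes.guess_type(filename)
    _, extension = os.path.splitext(filename)
    if mtype == 'application/x-tar' and extension[1:] in ('bz2', 'gz', 'xz'):
        mtype = mtype + '+' + extension[1:]
    return mtype
```

This works because `mimetypes.guess_type` reports the compression as a separate *encoding* and still returns `application/x-tar` as the type, so the suffix has to be re-attached by hand for parser lookup.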
=====================================
libmat2/torrent.py
=====================================
@@ -6,7 +6,7 @@ from . import abstract
class TorrentParser(abstract.AbstractParser):
mimetypes = {'application/x-bittorrent', }
- whitelist = {b'announce', b'announce-list', b'info'}
+ allowlist = {b'announce', b'announce-list', b'info'}
def __init__(self, filename):
super().__init__(filename)
@@ -18,14 +18,14 @@ class TorrentParser(abstract.AbstractParser):
def get_meta(self) -> Dict[str, Union[str, dict]]:
metadata = {}
for key, value in self.dict_repr.items():
- if key not in self.whitelist:
+ if key not in self.allowlist:
metadata[key.decode('utf-8')] = value
return metadata
def remove_all(self) -> bool:
cleaned = dict()
for key, value in self.dict_repr.items():
- if key in self.whitelist:
+ if key in self.allowlist:
cleaned[key] = value
with open(self.output_filename, 'wb') as f:
f.write(_BencodeHandler().bencode(cleaned))
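
TorrentParser keeps only the three bencode keys in the allowlist and reports everything else as metadata. A sketch of that split over an already-decoded dict (function name hypothetical; mat2 performs the bencode round-trip itself):

```python
ALLOWLIST = {b'announce', b'announce-list', b'info'}


def split_torrent_dict(dict_repr):
    """Mirror TorrentParser: keep allowlisted bencode keys,
    collect everything else as metadata (keys decoded for display)."""
    cleaned, metadata = {}, {}
    for key, value in dict_repr.items():
        if key in ALLOWLIST:
            cleaned[key] = value
        else:
            metadata[key.decode('utf-8')] = value
    return cleaned, metadata
```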
=====================================
libmat2/video.py
=====================================
@@ -10,10 +10,10 @@ from . import subprocess
class AbstractFFmpegParser(exiftool.ExiftoolParser):
""" Abstract parser for all FFmpeg-based ones, mainly for video. """
# Some fileformats have mandatory metadata fields
- meta_key_value_whitelist = {} # type: Dict[str, Union[str, int]]
+ meta_key_value_allowlist = {} # type: Dict[str, Union[str, int]]
def remove_all(self) -> bool:
- if self.meta_key_value_whitelist:
+ if self.meta_key_value_allowlist:
logging.warning('The format of "%s" (%s) has some mandatory '
'metadata fields; mat2 filled them with standard '
'data.', self.filename, ', '.join(self.mimetypes))
@@ -45,8 +45,8 @@ class AbstractFFmpegParser(exiftool.ExiftoolParser):
ret = dict() # type: Dict[str, Union[str, dict]]
for key, value in meta.items():
- if key in self.meta_key_value_whitelist.keys():
- if value == self.meta_key_value_whitelist[key]:
+ if key in self.meta_key_value_allowlist.keys():
+ if value == self.meta_key_value_allowlist[key]:
continue
ret[key] = value
return ret
@@ -54,7 +54,7 @@ class AbstractFFmpegParser(exiftool.ExiftoolParser):
class WMVParser(AbstractFFmpegParser):
mimetypes = {'video/x-ms-wmv', }
- meta_whitelist = {'AudioChannels', 'AudioCodecID', 'AudioCodecName',
+ meta_allowlist = {'AudioChannels', 'AudioCodecID', 'AudioCodecName',
'ErrorCorrectionType', 'AudioSampleRate', 'DataPackets',
'Directory', 'Duration', 'ExifToolVersion',
'FileAccessDate', 'FileInodeChangeDate', 'FileLength',
@@ -64,7 +64,7 @@ class WMVParser(AbstractFFmpegParser):
'ImageWidth', 'MIMEType', 'MaxBitrate', 'MaxPacketSize',
'Megapixels', 'MinPacketSize', 'Preroll', 'SendDuration',
'SourceFile', 'StreamNumber', 'VideoCodecName', }
- meta_key_value_whitelist = { # some metadata are mandatory :/
+ meta_key_value_allowlist = { # some metadata are mandatory :/
'AudioCodecDescription': '',
'CreationDate': '0000:00:00 00:00:00Z',
'FileID': '00000000-0000-0000-0000-000000000000',
@@ -78,7 +78,7 @@ class WMVParser(AbstractFFmpegParser):
class AVIParser(AbstractFFmpegParser):
mimetypes = {'video/x-msvideo', }
- meta_whitelist = {'SourceFile', 'ExifToolVersion', 'FileName', 'Directory',
+ meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName', 'Directory',
'FileSize', 'FileModifyDate', 'FileAccessDate',
'FileInodeChangeDate', 'FilePermissions', 'FileType',
'FileTypeExtension', 'MIMEType', 'FrameRate', 'MaxDataRate',
@@ -98,7 +98,7 @@ class AVIParser(AbstractFFmpegParser):
class MP4Parser(AbstractFFmpegParser):
mimetypes = {'video/mp4', }
- meta_whitelist = {'AudioFormat', 'AvgBitrate', 'Balance', 'TrackDuration',
+ meta_allowlist = {'AudioFormat', 'AvgBitrate', 'Balance', 'TrackDuration',
'XResolution', 'YResolution', 'ExifToolVersion',
'FileAccessDate', 'FileInodeChangeDate', 'FileModifyDate',
'FileName', 'FilePermissions', 'MIMEType', 'FileType',
@@ -109,7 +109,7 @@ class MP4Parser(AbstractFFmpegParser):
'MovieDataSize', 'VideoFrameRate', 'MediaTimeScale',
'SourceImageHeight', 'SourceImageWidth',
'MatrixStructure', 'MediaDuration'}
- meta_key_value_whitelist = { # some metadata are mandatory :/
+ meta_key_value_allowlist = { # some metadata are mandatory :/
'CreateDate': '0000:00:00 00:00:00',
'CurrentTime': '0 s',
'MediaCreateDate': '0000:00:00 00:00:00',
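
The key/value allowlist lets mandatory-but-harmless fields pass only when they carry the expected placeholder value; a field with an allowlisted name but an unexpected value is still reported. A standalone sketch of that filter (names ours):

```python
def filter_mandatory_fields(meta, allowlist):
    """Drop keys whose value matches the mandated placeholder in the
    key/value allowlist; keep every other key, placeholder or not."""
    ret = {}
    for key, value in meta.items():
        if key in allowlist and value == allowlist[key]:
            continue  # expected mandatory placeholder, not metadata
        ret[key] = value
    return ret
```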
=====================================
libmat2/web.py
=====================================
@@ -37,15 +37,15 @@ class CSSParser(abstract.AbstractParser):
class AbstractHTMLParser(abstract.AbstractParser):
- tags_blacklist = set() # type: Set[str]
+ tags_blocklist = set() # type: Set[str]
# In some html/xml-based formats some tags are mandatory,
- # so we're keeping them, but are discaring their content
- tags_required_blacklist = set() # type: Set[str]
+ # so we're keeping them, but are discarding their content
+ tags_required_blocklist = set() # type: Set[str]
def __init__(self, filename):
super().__init__(filename)
- self.__parser = _HTMLParser(self.filename, self.tags_blacklist,
- self.tags_required_blacklist)
+ self.__parser = _HTMLParser(self.filename, self.tags_blocklist,
+ self.tags_required_blocklist)
with open(filename, encoding='utf-8') as f:
self.__parser.feed(f.read())
self.__parser.close()
@@ -58,14 +58,14 @@ class AbstractHTMLParser(abstract.AbstractParser):
class HTMLParser(AbstractHTMLParser):
- mimetypes = {'text/html', }
- tags_blacklist = {'meta', }
- tags_required_blacklist = {'title', }
+ mimetypes = {'text/html', 'application/xhtml+xml'}
+ tags_blocklist = {'meta', }
+ tags_required_blocklist = {'title', }
class DTBNCXParser(AbstractHTMLParser):
mimetypes = {'application/x-dtbncx+xml', }
- tags_required_blacklist = {'title', 'doctitle', 'meta'}
+ tags_required_blocklist = {'title', 'doctitle', 'meta'}
class _HTMLParser(parser.HTMLParser):
@@ -79,7 +79,7 @@ class _HTMLParser(parser.HTMLParser):
Also, gotcha: the `tag` parameters are always in lowercase.
"""
- def __init__(self, filename, blacklisted_tags, required_blacklisted_tags):
+ def __init__(self, filename, blocklisted_tags, required_blocklisted_tags):
super().__init__()
self.filename = filename
self.__textrepr = ''
@@ -90,24 +90,24 @@ class _HTMLParser(parser.HTMLParser):
self.__in_dangerous_but_required_tag = 0
self.__in_dangerous_tag = 0
- if required_blacklisted_tags & blacklisted_tags: # pragma: nocover
+ if required_blocklisted_tags & blocklisted_tags: # pragma: nocover
raise ValueError("There is an overlap between %s and %s" % (
- required_blacklisted_tags, blacklisted_tags))
- self.tag_required_blacklist = required_blacklisted_tags
- self.tag_blacklist = blacklisted_tags
+ required_blocklisted_tags, blocklisted_tags))
+ self.tag_required_blocklist = required_blocklisted_tags
+ self.tag_blocklist = blocklisted_tags
def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]):
original_tag = self.get_starttag_text()
self.__validation_queue.append(original_tag)
- if tag in self.tag_blacklist:
+ if tag in self.tag_blocklist:
self.__in_dangerous_tag += 1
if self.__in_dangerous_tag == 0:
if self.__in_dangerous_but_required_tag == 0:
self.__textrepr += original_tag
- if tag in self.tag_required_blacklist:
+ if tag in self.tag_required_blocklist:
self.__in_dangerous_but_required_tag += 1
def handle_endtag(self, tag: str):
@@ -123,7 +123,7 @@ class _HTMLParser(parser.HTMLParser):
"tag %s in %s" %
(tag, previous_tag, self.filename))
- if tag in self.tag_required_blacklist:
+ if tag in self.tag_required_blocklist:
self.__in_dangerous_but_required_tag -= 1
if self.__in_dangerous_tag == 0:
@@ -131,7 +131,7 @@ class _HTMLParser(parser.HTMLParser):
# There is no `get_endtag_text()` method :/
self.__textrepr += '</' + previous_tag + '>'
- if tag in self.tag_blacklist:
+ if tag in self.tag_blocklist:
self.__in_dangerous_tag -= 1
def handle_data(self, data: str):
@@ -141,14 +141,14 @@ class _HTMLParser(parser.HTMLParser):
self.__textrepr += escape(data)
def handle_startendtag(self, tag: str, attrs: List[Tuple[str, str]]):
- if tag in self.tag_required_blacklist | self.tag_blacklist:
+ if tag in self.tag_required_blocklist | self.tag_blocklist:
meta = {k:v for k, v in attrs}
name = meta.get('name', 'harmful metadata')
content = meta.get('content', 'harmful data')
self.__meta[name] = content
if self.__in_dangerous_tag == 0:
- if tag in self.tag_required_blacklist:
+ if tag in self.tag_required_blocklist:
self.__textrepr += '<' + tag + ' />'
return
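
The renamed counters above track how deeply the parser is nested inside blocklisted tags, so nested dangerous tags cannot leak content or truncate the output early. A toy version of the same counting idea, without the required-tags subtlety (class name ours, not mat2's):

```python
from html import escape, parser


class BlocklistStripper(parser.HTMLParser):
    """Drop blocklisted tags and their content, using a nesting
    counter so nested dangerous tags are handled correctly."""
    def __init__(self, blocklist):
        super().__init__()
        self.blocklist = blocklist
        self.depth = 0  # how many blocklisted tags we are inside
        self.out = ''

    def handle_starttag(self, tag, attrs):
        if tag in self.blocklist:
            self.depth += 1
        elif self.depth == 0:
            self.out += self.get_starttag_text()

    def handle_endtag(self, tag):
        if tag in self.blocklist:
            self.depth -= 1
        elif self.depth == 0:
            self.out += '</%s>' % tag

    def handle_data(self, data):
        if self.depth == 0:
            self.out += escape(data)
```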
=====================================
mat2
=====================================
@@ -15,7 +15,7 @@ except ValueError as e:
print(e)
sys.exit(1)
-__version__ = '0.8.0'
+__version__ = '0.9.0'
# Make pyflakes happy
assert Tuple
@@ -24,15 +24,20 @@ assert Union
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.WARNING)
-def __check_file(filename: str, mode: int=os.R_OK) -> bool:
+def __check_file(filename: str, mode: int = os.R_OK) -> bool:
if not os.path.exists(filename):
- print("[-] %s is doesn't exist." % filename)
+ print("[-] %s doesn't exist." % filename)
return False
elif not os.path.isfile(filename):
print("[-] %s is not a regular file." % filename)
return False
elif not os.access(filename, mode):
- print("[-] %s is not readable and writeable." % filename)
+        mode_str = []  # type: List[str]
+        if mode & os.R_OK:
+            mode_str.append('readable')
+        if mode & os.W_OK:
+            mode_str.append('writeable')
+        print("[-] %s is not %s." % (filename, ' nor '.join(mode_str)))
return False
return True
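
A standalone sketch of the access-mode message built in this hunk (helper name ours); using `append()` keeps whole words in the list, whereas `list += str` would extend it character by character:

```python
import os


def describe_missing_access(mode: int) -> str:
    """Build the 'readable nor writeable' fragment for the
    '[-] %s is not %s.' error message."""
    parts = []
    if mode & os.R_OK:
        parts.append('readable')
    if mode & os.W_OK:
        parts.append('writeable')
    return ' nor '.join(parts)
```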
@@ -49,8 +54,9 @@ def create_arg_parser() -> argparse.ArgumentParser:
parser.add_argument('-V', '--verbose', action='store_true',
help='show more verbose status information')
parser.add_argument('--unknown-members', metavar='policy', default='abort',
- help='how to handle unknown members of archive-style files (policy should' +
- ' be one of: %s)' % ', '.join(p.value for p in UnknownMemberPolicy))
+ help='how to handle unknown members of archive-style '
+ 'files (policy should be one of: %s) [Default: abort]' %
+ ', '.join(p.value for p in UnknownMemberPolicy))
info = parser.add_mutually_exclusive_group()
@@ -72,7 +78,7 @@ def show_meta(filename: str):
__print_meta(filename, p.get_meta())
-def __print_meta(filename: str, metadata: dict, depth: int=1):
+def __print_meta(filename: str, metadata: dict, depth: int = 1):
padding = " " * depth*2
if not metadata:
print(padding + "No metadata found")
@@ -100,7 +106,7 @@ def __print_meta(filename: str, metadata: dict, depth: int=1):
def clean_meta(filename: str, is_lightweight: bool, policy: UnknownMemberPolicy) -> bool:
- if not __check_file(filename, os.R_OK|os.W_OK):
+ if not __check_file(filename, os.R_OK):
return False
p, mtype = parser_factory.get_parser(filename) # type: ignore
=====================================
nautilus/mat2.py
=====================================
@@ -173,7 +173,7 @@ class Mat2Extension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationWid
if fname is None:
self.infobar_hbox.destroy()
self.infobar.hide()
- if len(self.failed_items):
+ if self.failed_items:
self.__infobar_failure()
if not processing_queue.empty():
print("Something went wrong, the queue isn't empty :/")
=====================================
setup.py
=====================================
@@ -5,7 +5,7 @@ with open("README.md", encoding='utf-8') as fh:
setuptools.setup(
name="mat2",
- version='0.8.0',
+ version='0.9.0',
author="Julien (jvoisin) Voisin",
author_email="julien.voisin+mat2 at dustri.org",
description="A handy tool to trash your metadata",
=====================================
tests/test_climat2.py
=====================================
@@ -1,7 +1,11 @@
+import random
import os
import shutil
import subprocess
import unittest
+import glob
+
+from libmat2 import images, parser_factory
mat2_binary = ['./mat2']
@@ -181,3 +185,57 @@ class TestControlCharInjection(unittest.TestCase):
stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
self.assertIn(b'Comment: GQ\n', stdout)
+
+
+class TestCommandLineParallel(unittest.TestCase):
+ iterations = 24
+
+ def test_same(self):
+ for i in range(self.iterations):
+ shutil.copy('./tests/data/dirty.jpg', './tests/data/dirty_%d.jpg' % i)
+
+ proc = subprocess.Popen(mat2_binary + ['./tests/data/dirty_%d.jpg' % i for i in range(self.iterations)],
+ stdout=subprocess.PIPE)
+ stdout, _ = proc.communicate()
+
+ for i in range(self.iterations):
+ path = './tests/data/dirty_%d.jpg' % i
+ p = images.JPGParser('./tests/data/dirty_%d.cleaned.jpg' % i)
+ self.assertEqual(p.get_meta(), {})
+ os.remove('./tests/data/dirty_%d.cleaned.jpg' % i)
+ os.remove(path)
+
+ def test_different(self):
+ shutil.copytree('./tests/data/', './tests/data/parallel')
+
+ proc = subprocess.Popen(mat2_binary + glob.glob('./tests/data/parallel/dirty.*'),
+ stdout=subprocess.PIPE)
+ stdout, _ = proc.communicate()
+
+        for i in glob.glob('./tests/data/parallel/dirty.cleaned.*'):
+            p, mime = parser_factory.get_parser(i)
+            self.assertIsNotNone(mime)
+            self.assertIsNotNone(p)
+            self.assertEqual(p.get_meta(), {})
+ shutil.rmtree('./tests/data/parallel')
+
+ def test_faulty(self):
+ for i in range(self.iterations):
+ shutil.copy('./tests/data/dirty.jpg', './tests/data/dirty_%d.jpg' % i)
+ shutil.copy('./tests/data/dirty.torrent', './tests/data/dirty_%d.docx' % i)
+
+ to_process = ['./tests/data/dirty_%d.jpg' % i for i in range(self.iterations)]
+ to_process.extend(['./tests/data/dirty_%d.docx' % i for i in range(self.iterations)])
+ random.shuffle(to_process)
+ proc = subprocess.Popen(mat2_binary + to_process,
+ stdout=subprocess.PIPE)
+ stdout, _ = proc.communicate()
+
+ for i in range(self.iterations):
+ path = './tests/data/dirty_%d.jpg' % i
+ p = images.JPGParser('./tests/data/dirty_%d.cleaned.jpg' % i)
+ self.assertEqual(p.get_meta(), {})
+ os.remove('./tests/data/dirty_%d.cleaned.jpg' % i)
+ os.remove(path)
+ os.remove('./tests/data/dirty_%d.docx' % i)
=====================================
tests/test_corrupted_files.py
=====================================
@@ -1,13 +1,16 @@
#!/usr/bin/env python3
import unittest
+import stat
+import time
import shutil
import os
import logging
import zipfile
+import tarfile
from libmat2 import pdf, images, audio, office, parser_factory, torrent
-from libmat2 import harmless, video, web
+from libmat2 import harmless, video, web, archive
# No need to log messages: should something go wrong,
# the testsuite _will_ fail.
@@ -278,7 +281,6 @@ class TestCorruptedFiles(unittest.TestCase):
p.remove_all()
os.remove('./tests/data/clean.html')
-
def test_epub(self):
with zipfile.ZipFile('./tests/data/clean.epub', 'w') as zout:
zout.write('./tests/data/dirty.jpg', 'OEBPS/content.opf')
@@ -291,3 +293,170 @@ class TestCorruptedFiles(unittest.TestCase):
self.assertFalse(p.remove_all())
os.remove('./tests/data/clean.epub')
+ def test_tar(self):
+ with tarfile.TarFile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.flac')
+ zout.add('./tests/data/dirty.docx')
+ zout.add('./tests/data/dirty.jpg')
+ zout.add('./tests/data/embedded_corrupted.docx')
+ tarinfo = tarfile.TarInfo(name='./tests/data/dirty.png')
+ tarinfo.mtime = time.time()
+ tarinfo.uid = 1337
+ tarinfo.gid = 1338
+ tarinfo.size = os.stat('./tests/data/dirty.png').st_size
+ with open('./tests/data/dirty.png', 'rb') as f:
+ zout.addfile(tarinfo, f)
+ p, mimetype = parser_factory.get_parser('./tests/data/clean.tar')
+ self.assertEqual(mimetype, 'application/x-tar')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.flac']['comments'], 'Thank you for using MAT !')
+ self.assertEqual(meta['./tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
+ self.assertFalse(p.remove_all())
+ os.remove('./tests/data/clean.tar')
+
+ shutil.copy('./tests/data/dirty.png', './tests/data/clean.tar')
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+class TestReadOnlyArchiveMembers(unittest.TestCase):
+ def test_onlymember_tar(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.png')
+ tarinfo = tarfile.TarInfo('./tests/data/dirty.jpg')
+ tarinfo.mtime = time.time()
+ tarinfo.uid = 1337
+ tarinfo.gid = 0
+ tarinfo.mode = 0o000
+ tarinfo.size = os.stat('./tests/data/dirty.jpg').st_size
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ p, mimetype = parser_factory.get_parser('./tests/data/clean.tar')
+ self.assertEqual(mimetype, 'application/x-tar')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.jpg']['uid'], '1337')
+ self.assertTrue(p.remove_all())
+
+ p = archive.TarParser('./tests/data/clean.cleaned.tar')
+ self.assertEqual(p.get_meta(), {})
+ os.remove('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.cleaned.tar')
+
+
+class TestPathTraversalArchiveMembers(unittest.TestCase):
+ def test_tar_traversal(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.png')
+ tarinfo = tarfile.TarInfo('./tests/data/dirty.jpg')
+ tarinfo.name = '../../../../../../../../../../tmp/mat2_test.png'
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+ def test_tar_absolute_path(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.png')
+ tarinfo = tarfile.TarInfo('./tests/data/dirty.jpg')
+ tarinfo.name = '/etc/passwd'
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+ def test_tar_duplicate_file(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ for _ in range(3):
+ zout.add('./tests/data/dirty.png')
+ tarinfo = tarfile.TarInfo('./tests/data/dirty.jpg')
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+ def test_tar_setuid(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.png')
+ tarinfo = tarfile.TarInfo('./tests/data/dirty.jpg')
+ tarinfo.mode |= stat.S_ISUID
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+ def test_tar_setgid(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.png')
+ tarinfo = tarfile.TarInfo('./tests/data/dirty.jpg')
+ tarinfo.mode |= stat.S_ISGID
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+ def test_tar_symlink_absolute(self):
+ os.symlink('/etc/passwd', './tests/data/symlink')
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/symlink')
+ tarinfo = tarfile.TarInfo('./tests/data/symlink')
+ tarinfo.linkname = '/etc/passwd'
+ tarinfo.type = tarfile.SYMTYPE
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+ os.remove('./tests/data/symlink')
+
+ def test_tar_symlink_ok(self):
+ shutil.copy('./tests/data/dirty.png', './tests/data/clean.png')
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.png')
+ t = tarfile.TarInfo('mydir')
+ t.type = tarfile.DIRTYPE
+ zout.addfile(t)
+ zout.add('./tests/data/clean.png')
+ t = tarfile.TarInfo('mylink')
+ t.type = tarfile.SYMTYPE
+ t.linkname = './tests/data/clean.png'
+ zout.addfile(t)
+ zout.add('./tests/data/dirty.jpg')
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.png')
+
+ def test_tar_symlink_relative(self):
+ os.symlink('../../../etc/passwd', './tests/data/symlink')
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('./tests/data/symlink')
+ tarinfo = tarfile.TarInfo('./tests/data/symlink')
+ with open('./tests/data/dirty.jpg', 'rb') as f:
+ zout.addfile(tarinfo=tarinfo, fileobj=f)
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+ os.remove('./tests/data/symlink')
+
+ def test_tar_device_file(self):
+ with tarfile.open('./tests/data/clean.tar', 'w') as zout:
+ zout.add('/dev/null')
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/clean.tar')
+ os.remove('./tests/data/clean.tar')
+
+ def test_tar_hardlink(self):
+ shutil.copy('./tests/data/dirty.png', './tests/data/clean.png')
+ os.link('./tests/data/clean.png', './tests/data/hardlink.png')
+ with tarfile.open('./tests/data/cleaner.tar', 'w') as zout:
+ zout.add('tests/data/clean.png')
+ zout.add('tests/data/hardlink.png')
+ with self.assertRaises(ValueError):
+ archive.TarParser('./tests/data/cleaner.tar')
+ os.remove('./tests/data/cleaner.tar')
+ os.remove('./tests/data/clean.png')
+ os.remove('./tests/data/hardlink.png')
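
The tests above probe TarParser's member validation: path traversal, absolute paths, setuid/setgid bits, and special file types are all rejected with `ValueError`. A sketch of the kinds of checks they exercise (our reconstruction for illustration, not the actual libmat2/archive.py code):

```python
import os
import stat
import tarfile


def check_tar_member(member: tarfile.TarInfo) -> None:
    """Raise ValueError for tar members the tests above treat
    as dangerous; accept plain files and directories."""
    if os.path.isabs(member.name):
        raise ValueError('absolute path: %s' % member.name)
    if '..' in member.name.split('/'):
        raise ValueError('path traversal: %s' % member.name)
    if member.mode & (stat.S_ISUID | stat.S_ISGID):
        raise ValueError('setuid/setgid bit on %s' % member.name)
    if not (member.isfile() or member.isdir()):
        raise ValueError('special member: %s' % member.name)
```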
=====================================
tests/test_libmat2.py
=====================================
@@ -4,6 +4,8 @@ import unittest
import shutil
import os
import re
+import tarfile
+import tempfile
import zipfile
from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
@@ -17,8 +19,8 @@ class TestCheckDependencies(unittest.TestCase):
except RuntimeError:
return # this happens if not every dependency is installed
- for value in ret.values():
- self.assertTrue(value)
+ for key, value in ret.items():
+ self.assertTrue(value, "The value for %s is False" % key)
class TestParserFactory(unittest.TestCase):
@@ -28,6 +30,14 @@ class TestParserFactory(unittest.TestCase):
self.assertEqual(mimetype, 'audio/mpeg')
self.assertEqual(parser.__class__, audio.MP3Parser)
+ def test_tarfile_double_extension_handling(self):
+ """ Test that our module auto-detection is handling sub-sub-classes """
+ with tarfile.TarFile.open('./tests/data/dirty.tar.bz2', 'w:bz2') as zout:
+ zout.add('./tests/data/dirty.jpg')
+ parser, mimetype = parser_factory.get_parser('./tests/data/dirty.tar.bz2')
+ self.assertEqual(mimetype, 'application/x-tar+bz2')
+ os.remove('./tests/data/dirty.tar.bz2')
+
class TestParameterInjection(unittest.TestCase):
def test_ver_injection(self):
@@ -195,6 +205,19 @@ class TestGetMeta(unittest.TestCase):
self.assertEqual(meta['version'], '1.0')
self.assertEqual(meta['harmful data'], 'underline is cool')
+ def test_tar(self):
+ with tarfile.TarFile('./tests/data/dirty.tar', 'w') as tout:
+ tout.add('./tests/data/dirty.flac')
+ tout.add('./tests/data/dirty.docx')
+ tout.add('./tests/data/dirty.jpg')
+ p, mimetype = parser_factory.get_parser('./tests/data/dirty.tar')
+ self.assertEqual(mimetype, 'application/x-tar')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.flac']['comments'], 'Thank you for using MAT !')
+ self.assertEqual(meta['./tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
+ os.remove('./tests/data/dirty.tar')
+
+
class TestRemovingThumbnails(unittest.TestCase):
def test_odt(self):
shutil.copy('./tests/data/revision.odt', './tests/data/clean.odt')
@@ -702,3 +725,143 @@ class TestCleaning(unittest.TestCase):
os.remove('./tests/data/clean.css')
os.remove('./tests/data/clean.cleaned.css')
os.remove('./tests/data/clean.cleaned.cleaned.css')
+
+ def test_tar(self):
+ with tarfile.TarFile.open('./tests/data/dirty.tar', 'w') as zout:
+ zout.add('./tests/data/dirty.flac')
+ zout.add('./tests/data/dirty.docx')
+ zout.add('./tests/data/dirty.jpg')
+ p = archive.TarParser('./tests/data/dirty.tar')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = archive.TarParser('./tests/data/dirty.cleaned.tar')
+ self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
+
+ tmp_dir = tempfile.mkdtemp()
+ with tarfile.open('./tests/data/dirty.cleaned.tar') as zout:
+ zout.extractall(path=tmp_dir)
+ zout.close()
+
+ number_of_files = 0
+ for root, _, fnames in os.walk(tmp_dir):
+ for f in fnames:
+ complete_path = os.path.join(root, f)
+ p, _ = parser_factory.get_parser(complete_path)
+ self.assertIsNotNone(p)
+ self.assertEqual(p.get_meta(), {})
+ number_of_files += 1
+ self.assertEqual(number_of_files, 3)
+
+ os.remove('./tests/data/dirty.tar')
+ os.remove('./tests/data/dirty.cleaned.tar')
+ os.remove('./tests/data/dirty.cleaned.cleaned.tar')
+
+ def test_targz(self):
+ with tarfile.TarFile.open('./tests/data/dirty.tar.gz', 'w:gz') as zout:
+ zout.add('./tests/data/dirty.flac')
+ zout.add('./tests/data/dirty.docx')
+ zout.add('./tests/data/dirty.jpg')
+ p = archive.TarParser('./tests/data/dirty.tar.gz')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = archive.TarParser('./tests/data/dirty.cleaned.tar.gz')
+ self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
+
+ tmp_dir = tempfile.mkdtemp()
+ with tarfile.open('./tests/data/dirty.cleaned.tar.gz') as zout:
+ zout.extractall(path=tmp_dir)
+ zout.close()
+
+ number_of_files = 0
+ for root, _, fnames in os.walk(tmp_dir):
+ for f in fnames:
+ complete_path = os.path.join(root, f)
+ p, _ = parser_factory.get_parser(complete_path)
+ self.assertIsNotNone(p)
+ self.assertEqual(p.get_meta(), {})
+ number_of_files += 1
+ self.assertEqual(number_of_files, 3)
+
+ os.remove('./tests/data/dirty.tar.gz')
+ os.remove('./tests/data/dirty.cleaned.tar.gz')
+ os.remove('./tests/data/dirty.cleaned.cleaned.tar.gz')
+
+ def test_tarbz2(self):
+ with tarfile.TarFile.open('./tests/data/dirty.tar.bz2', 'w:bz2') as zout:
+ zout.add('./tests/data/dirty.flac')
+ zout.add('./tests/data/dirty.docx')
+ zout.add('./tests/data/dirty.jpg')
+ p = archive.TarParser('./tests/data/dirty.tar.bz2')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = archive.TarParser('./tests/data/dirty.cleaned.tar.bz2')
+ self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
+
+ tmp_dir = tempfile.mkdtemp()
+ with tarfile.open('./tests/data/dirty.cleaned.tar.bz2') as zout:
+ zout.extractall(path=tmp_dir)
+ zout.close()
+
+ number_of_files = 0
+ for root, _, fnames in os.walk(tmp_dir):
+ for f in fnames:
+ complete_path = os.path.join(root, f)
+ p, _ = parser_factory.get_parser(complete_path)
+ self.assertIsNotNone(p)
+ self.assertEqual(p.get_meta(), {})
+ number_of_files += 1
+ self.assertEqual(number_of_files, 3)
+
+ os.remove('./tests/data/dirty.tar.bz2')
+ os.remove('./tests/data/dirty.cleaned.tar.bz2')
+ os.remove('./tests/data/dirty.cleaned.cleaned.tar.bz2')
+
+ def test_tarxz(self):
+ with tarfile.TarFile.open('./tests/data/dirty.tar.xz', 'w:xz') as zout:
+ zout.add('./tests/data/dirty.flac')
+ zout.add('./tests/data/dirty.docx')
+ zout.add('./tests/data/dirty.jpg')
+ p = archive.TarParser('./tests/data/dirty.tar.xz')
+ meta = p.get_meta()
+ self.assertEqual(meta['./tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = archive.TarParser('./tests/data/dirty.cleaned.tar.xz')
+ self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
+
+ tmp_dir = tempfile.mkdtemp()
+ with tarfile.open('./tests/data/dirty.cleaned.tar.xz') as zout:
+ zout.extractall(path=tmp_dir)
+ zout.close()
+
+ number_of_files = 0
+ for root, _, fnames in os.walk(tmp_dir):
+ for f in fnames:
+ complete_path = os.path.join(root, f)
+ p, _ = parser_factory.get_parser(complete_path)
+ self.assertIsNotNone(p)
+ self.assertEqual(p.get_meta(), {})
+ number_of_files += 1
+ self.assertEqual(number_of_files, 3)
+
+ os.remove('./tests/data/dirty.tar.xz')
+ os.remove('./tests/data/dirty.cleaned.tar.xz')
+ os.remove('./tests/data/dirty.cleaned.cleaned.tar.xz')
View it on GitLab: https://salsa.debian.org/pkg-privacy-team/mat2/commit/2b1985347bcda8acb960a497160a223ab48f7479