[Pkg-privacy-commits] [Git][pkg-privacy-team/mat2][upstream] New upstream version 0.7.0
Georg Faerber
gitlab at salsa.debian.org
Sun Feb 17 16:38:25 GMT 2019
Georg Faerber pushed to branch upstream at Privacy Maintainers / mat2
Commits:
8e51bae5 by Georg Faerber at 2019-02-17T16:36:36Z
New upstream version 0.7.0
- - - - -
26 changed files:
- .gitlab-ci.yml
- CHANGELOG.md
- CONTRIBUTING.md
- INSTALL.md
- README.md
- doc/mat2.1
- libmat2/__init__.py
- libmat2/abstract.py
- libmat2/archive.py
- libmat2/exiftool.py
- + libmat2/html.py
- libmat2/images.py
- libmat2/office.py
- libmat2/parser_factory.py
- + libmat2/subprocess.py
- libmat2/torrent.py
- libmat2/video.py
- mat2
- + nautilus/README.md
- nautilus/mat2.py
- setup.py
- + tests/data/dirty.gif
- + tests/data/dirty.html
- + tests/data/dirty.wmv
- tests/test_corrupted_files.py
- tests/test_libmat2.py
Changes:
=====================================
.gitlab-ci.yml
=====================================
@@ -42,6 +42,17 @@ tests:debian:
script:
- apt-get -qqy update
- apt-get -qqy install --no-install-recommends python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl python3-coverage ffmpeg
+ - apt-get -qqy purge bubblewrap
+ - python3-coverage run --branch -m unittest discover -s tests/
+ - python3-coverage report --fail-under=90 -m --include 'libmat2/*'
+
+tests:debian_with_bubblewrap:
+ stage: test
+ tags:
+ - whitewhale
+ script:
+ - apt-get -qqy update
+ - apt-get -qqy install --no-install-recommends python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl python3-coverage ffmpeg bubblewrap
- python3-coverage run --branch -m unittest discover -s tests/
- python3-coverage report --fail-under=100 -m --include 'libmat2/*'
=====================================
CHANGELOG.md
=====================================
@@ -1,3 +1,12 @@
+# 0.7.0 - 2019-02-17
+
+- Add support for wmv files
+- Add support for gif files
+- Add support for html files
+- Sandbox external processes via bubblewrap
+- Simplify archive-based formats processing
+- The Nautilus extension now plays nicer with other extensions
+
# 0.6.0 - 2018-11-10
- Add lightweight cleaning for jpeg
=====================================
CONTRIBUTING.md
=====================================
@@ -33,5 +33,5 @@ Since MAT2 is written in Python3, please conform as much as possible to the
10. Sign the tarball with `gpg --armor --detach-sign mat-$VERSION.tar.xz`
11. Upload the result on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
12. Announce the release on the [mailing list](https://mailman.boum.org/listinfo/mat-dev)
-13. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
+13. Upload the new version on pypi with `python3 setup.py sdist bdist_wheel` then `twine upload -s dist/*`
14. Do the secret release dance
=====================================
INSTALL.md
=====================================
@@ -1,5 +1,21 @@
+# Python ecosystem
+
+If you feel like running arbitrary code downloaded over the
+internet (pypi doesn't support gpg signatures [anymore](https://github.com/pypa/python-packaging-user-guide/pull/466)),
+mat2 is [available on pypi](https://pypi.org/project/mat2/), and can be
+installed like this:
+
+```
+pip3 install mat2
+```
+
# GNU/Linux
+## Optional dependencies
+
+When [bubblewrap](https://github.com/projectatomic/bubblewrap) is
+installed, MAT2 uses it to sandbox any external processes it invokes.
+
## Fedora
Thanks to [atenart](https://ack.tf/), there is a package available on
@@ -23,13 +39,14 @@ dnf -y install mat2 mat2-nautilus
## Debian
-There is currently no package for Debian. If you want to help to make this
-happen, there is an [issue](https://0xacab.org/jvoisin/mat2/issues/16) open.
+There is a package available in Debian *buster/sid*. The package [doesn't include
+the Nautilus extension yet](https://bugs.debian.org/910491).
-But fear not, there is a way to install it *manually*:
+For Debian 9 *stretch*, there is a way to install it *manually*:
```
-# apt install python3-mutagen python3-gi-cairo gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl gir1.2-glib-2.0 gir1.2-poppler-0.18
+# apt install python3-mutagen python3-gi-cairo gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl gir1.2-glib-2.0 gir1.2-poppler-0.18 ffmpeg
+# apt install bubblewrap # if you want sandboxing
$ git clone https://0xacab.org/jvoisin/mat2.git
$ cd mat2
$ ./mat2
@@ -38,7 +55,7 @@ $ ./mat2
and if you want to install the über-fancy Nautilus extension:
```
-# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev
+# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev python3-dev build-essential
$ git clone https://github.com/GNOME/nautilus-python
$ cd nautilus-python
$ PYTHON=/usr/bin/python3 ./autogen.sh
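For reference: the sandboxing described above is enabled purely by bubblewrap's presence. The new `libmat2/subprocess.py` (further down in this commit) looks for an executable at `/usr/bin/bwrap`, so a quick way to check which behaviour you will get is:

```
$ test -x /usr/bin/bwrap && echo "sandboxed" || echo "unsandboxed"
```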
=====================================
README.md
=====================================
@@ -42,6 +42,13 @@ doesn't run on [Debian Jessie](https://packages.debian.org/jessie/python3).
$ python3 -m unittest discover -v
```
+And if you want to see the coverage:
+
+```bash
+$ python3-coverage run --branch -m unittest discover -s tests/
+$ python3-coverage report -m --include 'libmat2/*'
+```
+
# How to use MAT2
```bash
@@ -82,6 +89,15 @@ complex file formats.
This is why you shouldn't rely on metadata's presence to decide if your file must
be cleaned or not.
+# Notes about the lightweight mode
+
+By default, mat2 may slightly alter the data of your files in order to remove
+as much metadata as possible. For example, text in a PDF might no longer be selectable,
+compressed images might get compressed again, …
+Since some users might be willing to trade the presence of some metadata
+for the guarantee that mat2 won't modify the data of their files, there is the
+`-L` flag that does precisely that.
+
# Related software
- The first iteration of [MAT](https://mat.boum.org)
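A quick illustration of the `-L` flag from the lightweight-mode note above, assuming the `.cleaned` output naming used throughout the test suite (the file name is illustrative):

```bash
$ mat2 -L dirty.jpg
$ ls dirty*
dirty.cleaned.jpg  dirty.jpg
```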
=====================================
doc/mat2.1
=====================================
@@ -1,4 +1,4 @@
-.TH MAT2 "1" "November 2018" "MAT2 0.6.0" "User Commands"
+.TH MAT2 "1" "February 2019" "MAT2 0.7.0" "User Commands"
.SH NAME
mat2 \- the metadata anonymisation toolkit 2
=====================================
libmat2/__init__.py
=====================================
@@ -39,12 +39,11 @@ DEPENDENCIES = {
}
-
def check_dependencies() -> Dict[str, bool]:
ret = collections.defaultdict(bool) # type: Dict[str, bool]
- ret['Exiftool'] = True if exiftool._get_exiftool_path() else False
- ret['Ffmpeg'] = True if video._get_ffmpeg_path() else False
+ ret['Exiftool'] = bool(exiftool._get_exiftool_path())
+ ret['Ffmpeg'] = bool(video._get_ffmpeg_path())
for key, value in DEPENDENCIES.items():
ret[value] = True
@@ -55,6 +54,7 @@ def check_dependencies() -> Dict[str, bool]:
return ret
+
@enum.unique
class UnknownMemberPolicy(enum.Enum):
ABORT = 'abort'
=====================================
libmat2/abstract.py
=====================================
@@ -37,4 +37,5 @@ class AbstractParser(abc.ABC):
"""
:raises RuntimeError: Raised if the cleaning process went wrong.
"""
+ # pylint: disable=unnecessary-pass
pass # pragma: no cover
=====================================
libmat2/archive.py
=====================================
@@ -4,7 +4,7 @@ import tempfile
import os
import logging
import shutil
-from typing import Dict, Set, Pattern, Union
+from typing import Dict, Set, Pattern, Union, Any
from . import abstract, UnknownMemberPolicy, parser_factory
@@ -42,6 +42,12 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# pylint: disable=unused-argument,no-self-use
return True # pragma: no cover
+ def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
+ """ This method can be used to extract specific metadata
+ from files present in the archive."""
+ # pylint: disable=unused-argument,no-self-use
+ return {} # pragma: no cover
+
@staticmethod
def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
zipinfo.create_system = 3 # Linux
@@ -74,6 +80,10 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
temp_folder = tempfile.mkdtemp()
for item in zin.infolist():
+ local_meta = dict() # type: Dict[str, Union[str, Dict]]
+ for k, v in self._get_zipinfo_meta(item).items():
+ local_meta[k] = v
+
if item.filename[-1] == '/': # pragma: no cover
# `is_dir` is added in Python3.6
continue # don't keep empty folders
@@ -81,11 +91,15 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
zin.extract(member=item, path=temp_folder)
full_path = os.path.join(temp_folder, item.filename)
+ specific_meta = self._specific_get_meta(full_path, item.filename)
+ for (k, v) in specific_meta.items():
+ local_meta[k] = v
+
tmp_parser, _ = parser_factory.get_parser(full_path) # type: ignore
- if not tmp_parser:
- continue
+ if tmp_parser:
+ for k, v in tmp_parser.get_meta().items():
+ local_meta[k] = v
- local_meta = tmp_parser.get_meta()
if local_meta:
meta[item.filename] = local_meta
@@ -132,7 +146,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
logging.warning("In file %s, keeping unknown element %s (format: %s)",
self.filename, item.filename, mtype)
else:
- logging.error("In file %s, element %s's format (%s) " +
+ logging.error("In file %s, element %s's format (%s) " \
"isn't supported",
self.filename, item.filename, mtype)
abort = True
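The new `_specific_get_meta()` hook is what `office.py` overrides later in this commit; a minimal hypothetical subclass (the class name and member name are illustrative, not part of the diff) could look like this:

```python
from typing import Any, Dict

from libmat2.archive import ArchiveBasedAbstractParser


class CommentedZipParser(ArchiveBasedAbstractParser):
    mimetypes = {'application/zip', }

    def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
        # file_path is the member's path inside the archive,
        # full_path is where it was extracted on disk.
        if file_path != 'comment.txt':
            return {}
        with open(full_path, encoding='utf-8') as f:
            return {'comment.txt': f.read()}
```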
=====================================
libmat2/exiftool.py
=====================================
@@ -1,10 +1,10 @@
import json
import logging
import os
-import subprocess
from typing import Dict, Union, Set
from . import abstract
+from . import subprocess
# Make pyflakes happy
assert Set
@@ -18,7 +18,9 @@ class ExiftoolParser(abstract.AbstractParser):
meta_whitelist = set() # type: Set[str]
def get_meta(self) -> Dict[str, Union[str, dict]]:
- out = subprocess.check_output([_get_exiftool_path(), '-json', self.filename])
+ out = subprocess.run([_get_exiftool_path(), '-json', self.filename],
+ input_filename=self.filename,
+ check=True, stdout=subprocess.PIPE).stdout
meta = json.loads(out.decode('utf-8'))[0]
for key in self.meta_whitelist:
meta.pop(key, None)
@@ -46,7 +48,9 @@ class ExiftoolParser(abstract.AbstractParser):
'-o', self.output_filename,
self.filename]
try:
- subprocess.check_call(cmd)
+ subprocess.run(cmd, check=True,
+ input_filename=self.filename,
+ output_filename=self.output_filename)
except subprocess.CalledProcessError as e: # pragma: no cover
logging.error("Something went wrong during the processing of %s: %s", self.filename, e)
return False
=====================================
libmat2/html.py
=====================================
@@ -0,0 +1,69 @@
+from html import parser
+from typing import Dict, Any, List, Tuple
+
+from . import abstract
+
+
+class HTMLParser(abstract.AbstractParser):
+ mimetypes = {'text/html', }
+ def __init__(self, filename):
+ super().__init__(filename)
+ self.__parser = _HTMLParser()
+ with open(filename) as f:
+ self.__parser.feed(f.read())
+ self.__parser.close()
+
+ def get_meta(self) -> Dict[str, Any]:
+ return self.__parser.get_meta()
+
+ def remove_all(self) -> bool:
+ return self.__parser.remove_all(self.output_filename)
+
+
+class _HTMLParser(parser.HTMLParser):
+ """Python doesn't have a validating html parser in its stdlib, so
+ we're using an internal queue to track all the opening/closing tags,
+ and hoping for the best.
+ """
+ def __init__(self):
+ super().__init__()
+ self.__textrepr = ''
+ self.__meta = {}
+ self.__validation_queue = []
+
+ def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]):
+ self.__textrepr += self.get_starttag_text()
+ self.__validation_queue.append(tag)
+
+ def handle_endtag(self, tag: str):
+ if not self.__validation_queue:
+ raise ValueError
+ elif tag != self.__validation_queue.pop():
+ raise ValueError
+ # There is no `get_endtag_text()` method :/
+ self.__textrepr += '</' + tag + '>\n'
+
+ def handle_data(self, data: str):
+ if data.strip():
+ self.__textrepr += data
+
+ def handle_startendtag(self, tag: str, attrs: List[Tuple[str, str]]):
+ if tag == 'meta':
+ meta = {k:v for k, v in attrs}
+ name = meta.get('name', 'harmful metadata')
+ content = meta.get('content', 'harmful data')
+ self.__meta[name] = content
+ else:
+ self.__textrepr += self.get_starttag_text()
+
+ def remove_all(self, output_filename: str) -> bool:
+ if self.__validation_queue:
+ raise ValueError
+ with open(output_filename, 'w') as f:
+ f.write(self.__textrepr)
+ return True
+
+ def get_meta(self) -> Dict[str, Any]:
+ if self.__validation_queue:
+ raise ValueError
+ return self.__meta
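The new parser follows the usual mat2 parser API; a minimal sketch, using the `dirty.html` test file added in this commit:

```python
from libmat2 import html

p = html.HTMLParser('./tests/data/dirty.html')
print(p.get_meta())   # {'generator': 'vim', 'author': 'jvoisin'}
p.remove_all()        # writes the stripped page to ./tests/data/dirty.cleaned.html
```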
=====================================
libmat2/images.py
=====================================
@@ -42,6 +42,21 @@ class PNGParser(exiftool.ExiftoolParser):
return True
+class GIFParser(exiftool.ExiftoolParser):
+ mimetypes = {'image/gif'}
+ meta_whitelist = {'AnimationIterations', 'BackgroundColor', 'BitsPerPixel',
+ 'ColorResolutionDepth', 'Directory', 'Duration',
+ 'ExifToolVersion', 'FileAccessDate',
+ 'FileInodeChangeDate', 'FileModifyDate', 'FileName',
+ 'FilePermissions', 'FileSize', 'FileType',
+ 'FileTypeExtension', 'FrameCount', 'GIFVersion',
+ 'HasColorMap', 'ImageHeight', 'ImageSize', 'ImageWidth',
+ 'MIMEType', 'Megapixels', 'SourceFile',}
+
+ def remove_all(self) -> bool:
+ return self._lightweight_cleanup()
+
+
class GdkPixbufAbstractParser(exiftool.ExiftoolParser):
""" GdkPixbuf can handle a lot of surfaces, so we're rending images on it,
this has the side-effect of completely removing metadata.
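Like the other exiftool-backed parsers, `GIFParser` only offers lightweight cleaning; a usage sketch based on the tests added in this commit:

```python
from libmat2 import images

p = images.GIFParser('./tests/data/dirty.gif')
print(p.get_meta().get('Comment'))   # 'this is a test comment' in the test file
p.remove_all()                       # lightweight cleanup via exiftool
```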
=====================================
libmat2/office.py
=====================================
@@ -2,7 +2,7 @@ import logging
import os
import re
import zipfile
-from typing import Dict, Set, Pattern, Tuple, Union
+from typing import Dict, Set, Pattern, Tuple, Any
import xml.etree.ElementTree as ET # type: ignore
@@ -266,7 +266,6 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
f.write(b'<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">')
f.write(b'</cp:coreProperties>')
-
if self.__remove_rsid(full_path) is False:
return False
@@ -296,26 +295,21 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
return True
- def get_meta(self) -> Dict[str, Union[str, dict]]:
+ def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
"""
Yes, I know that parsing xml with regexp ain't pretty,
be my guest and fix it if you want.
"""
- metadata = super().get_meta()
- zipin = zipfile.ZipFile(self.filename)
- for item in zipin.infolist():
- if item.filename.startswith('docProps/') and item.filename.endswith('.xml'):
- try:
- content = zipin.read(item).decode('utf-8')
- results = re.findall(r"<(.+)>(.+)</\1>", content, re.I|re.M)
- for (key, value) in results:
- metadata[key] = value
- except (TypeError, UnicodeDecodeError): # We didn't manage to parse the xml file
- metadata[item.filename] = 'harmful content'
- for key, value in self._get_zipinfo_meta(item).items():
- metadata[key] = value
- zipin.close()
- return metadata
+ if not file_path.startswith('docProps/') or not file_path.endswith('.xml'):
+ return {}
+
+ with open(full_path, encoding='utf-8') as f:
+ try:
+ results = re.findall(r"<(.+)>(.+)</\1>", f.read(), re.I|re.M)
+ return {k:v for (k, v) in results}
+ except (TypeError, UnicodeDecodeError):
+ # We didn't manage to parse the xml file
+ return {file_path: 'harmful content', }
class LibreOfficeParser(ArchiveBasedAbstractParser):
@@ -381,23 +375,17 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
return False
return True
- def get_meta(self) -> Dict[str, Union[str, dict]]:
+ def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
"""
Yes, I know that parsing xml with regexp ain't pretty,
be my guest and fix it if you want.
"""
- metadata = {}
- zipin = zipfile.ZipFile(self.filename)
- for item in zipin.infolist():
- if item.filename == 'meta.xml':
- try:
- content = zipin.read(item).decode('utf-8')
- results = re.findall(r"<((?:meta|dc|cp).+?)>(.+)</\1>", content, re.I|re.M)
- for (key, value) in results:
- metadata[key] = value
- except (TypeError, UnicodeDecodeError): # We didn't manage to parse the xml file
- metadata[item.filename] = 'harmful content'
- for key, value in self._get_zipinfo_meta(item).items():
- metadata[key] = value
- zipin.close()
- return metadata
+ if file_path != 'meta.xml':
+ return {}
+ with open(full_path, encoding='utf-8') as f:
+ try:
+ results = re.findall(r"<((?:meta|dc|cp).+?)[^>]*>(.+)</\1>", f.read(), re.I|re.M)
+ return {k:v for (k, v) in results}
+ except (TypeError, UnicodeDecodeError): # We didn't manage to parse the xml file
+ return {file_path: 'harmful content', }
=====================================
libmat2/parser_factory.py
=====================================
@@ -10,6 +10,7 @@ assert Tuple # make pyflakes happy
T = TypeVar('T', bound='abstract.AbstractParser')
+
def __load_all_parsers():
""" Loads every parser in a dynamic way """
current_dir = os.path.dirname(__file__)
@@ -24,8 +25,10 @@ def __load_all_parsers():
name, _ = os.path.splitext(basename)
importlib.import_module('.' + name, package='libmat2')
+
__load_all_parsers()
+
def _get_parsers() -> List[T]:
""" Get all our parsers!"""
def __get_parsers(cls):
=====================================
libmat2/subprocess.py
=====================================
@@ -0,0 +1,105 @@
+"""
+Wrapper around a subset of the subprocess module,
+that uses bwrap (bubblewrap) when it is available.
+
+Instead of importing subprocess, other modules should use this as follows:
+
+ from . import subprocess
+"""
+
+import os
+import shutil
+import subprocess
+import tempfile
+from typing import List, Optional
+
+
+__all__ = ['PIPE', 'run', 'CalledProcessError']
+PIPE = subprocess.PIPE
+CalledProcessError = subprocess.CalledProcessError
+
+
+def _get_bwrap_path() -> str:
+ bwrap_path = '/usr/bin/bwrap'
+ if os.path.isfile(bwrap_path):
+ if os.access(bwrap_path, os.X_OK):
+ return bwrap_path
+
+ raise RuntimeError("Unable to find bwrap") # pragma: no cover
+
+
+# pylint: disable=bad-whitespace
+def _get_bwrap_args(tempdir: str,
+ input_filename: str,
+ output_filename: Optional[str] = None) -> List[str]:
+ ro_bind_args = []
+ cwd = os.getcwd()
+
+ # XXX: use --ro-bind-try once all supported platforms
+ # have a bubblewrap recent enough to support it.
+ ro_bind_dirs = ['/usr', '/lib', '/lib64', '/bin', '/sbin', cwd]
+ for bind_dir in ro_bind_dirs:
+ if os.path.isdir(bind_dir): # pragma: no cover
+ ro_bind_args.extend(['--ro-bind', bind_dir, bind_dir])
+
+ ro_bind_files = ['/etc/ld.so.cache']
+ for bind_file in ro_bind_files:
+ if os.path.isfile(bind_file): # pragma: no cover
+ ro_bind_args.extend(['--ro-bind', bind_file, bind_file])
+
+ args = ro_bind_args + \
+ ['--dev', '/dev',
+ '--chdir', cwd,
+ '--unshare-all',
+ '--new-session',
+ # XXX: enable --die-with-parent once all supported platforms have
+ # a bubblewrap recent enough to support it.
+ # '--die-with-parent',
+ ]
+
+ if output_filename:
+ # Mount an empty temporary directory where the sandboxed
+ # process will create its output file
+ output_dirname = os.path.dirname(os.path.abspath(output_filename))
+ args.extend(['--bind', tempdir, output_dirname])
+
+ absolute_input_filename = os.path.abspath(input_filename)
+ args.extend(['--ro-bind', absolute_input_filename, absolute_input_filename])
+
+ return args
+
+
+# pylint: disable=bad-whitespace
+def run(args: List[str],
+ input_filename: str,
+ output_filename: Optional[str] = None,
+ **kwargs) -> subprocess.CompletedProcess:
+ """Wrapper around `subprocess.run`, that uses bwrap (bubblewrap) if it
+ is available.
+
+ Extra supported keyword arguments:
+
+ - `input_filename`, made available read-only in the sandbox
+ - `output_filename`, where the file created by the sandboxed process
+ is copied upon successful completion; an empty temporary directory
+ is made visible as the parent directory of this file in the sandbox.
+ Optional: one valid use case is to invoke an external process
+ to inspect metadata present in a file.
+ """
+ try:
+ bwrap_path = _get_bwrap_path()
+ except RuntimeError: # pragma: no cover
+ # bubblewrap is not installed ⇒ short-circuit
+ return subprocess.run(args, **kwargs)
+
+ with tempfile.TemporaryDirectory() as tempdir:
+ prefix_args = [bwrap_path] + \
+ _get_bwrap_args(input_filename=input_filename,
+ output_filename=output_filename,
+ tempdir=tempdir)
+ completed_process = subprocess.run(prefix_args + args, **kwargs)
+ if output_filename and completed_process.returncode == 0:
+ shutil.copy(os.path.join(tempdir, os.path.basename(output_filename)),
+ output_filename)
+
+ return completed_process
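For context, this is how callers use the wrapper, mirroring the `exiftool.py` change above (the exiftool path is hardcoded here for brevity; mat2 resolves it with `_get_exiftool_path()`):

```python
from libmat2 import subprocess

# Read-only inspection: only input_filename is bound into the sandbox.
out = subprocess.run(['/usr/bin/exiftool', '-json', 'dirty.jpg'],
                     input_filename='dirty.jpg',
                     check=True, stdout=subprocess.PIPE).stdout

# Producing a file: an empty tempdir is bound over the output directory,
# and the result is copied out after a successful run.
subprocess.run(['/usr/bin/exiftool', '-all=', '-o', 'dirty.cleaned.jpg', 'dirty.jpg'],
               input_filename='dirty.jpg',
               output_filename='dirty.cleaned.jpg',
               check=True)
```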
=====================================
libmat2/torrent.py
=====================================
@@ -3,6 +3,7 @@ from typing import Union, Tuple, Dict
from . import abstract
+
class TorrentParser(abstract.AbstractParser):
mimetypes = {'application/x-bittorrent', }
whitelist = {b'announce', b'announce-list', b'info'}
@@ -32,7 +33,7 @@ class TorrentParser(abstract.AbstractParser):
return True
-class _BencodeHandler(object):
+class _BencodeHandler():
"""
Since bencode isn't that hard to parse,
MAT2 comes with its own parser, based on the spec
=====================================
libmat2/video.py
=====================================
@@ -1,15 +1,22 @@
import os
-import subprocess
import logging
from typing import Dict, Union
from . import exiftool
+from . import subprocess
class AbstractFFmpegParser(exiftool.ExiftoolParser):
""" Abstract parser for all FFmpeg-based ones, mainly for video. """
+ # Some fileformats have mandatory metadata fields
+ meta_key_value_whitelist = {} # type: Dict[str, Union[str, int]]
+
def remove_all(self) -> bool:
+ if self.meta_key_value_whitelist:
+ logging.warning('The format of "%s" (%s) has some mandatory '
+ 'metadata fields; mat2 filled them with standard '
+ 'data.', self.filename, ', '.join(self.mimetypes))
cmd = [_get_ffmpeg_path(),
'-i', self.filename, # input file
'-y', # overwrite existing output file
@@ -25,12 +32,49 @@ class AbstractFFmpegParser(exiftool.ExiftoolParser):
'-flags:a', '+bitexact', # don't add any metadata
self.output_filename]
try:
- subprocess.check_call(cmd)
+ subprocess.run(cmd, check=True,
+ input_filename=self.filename,
+ output_filename=self.output_filename)
except subprocess.CalledProcessError as e:
logging.error("Something went wrong during the processing of %s: %s", self.filename, e)
return False
return True
+ def get_meta(self) -> Dict[str, Union[str, dict]]:
+ meta = super().get_meta()
+
+ ret = dict() # type: Dict[str, Union[str, dict]]
+ for key, value in meta.items():
+ if key in self.meta_key_value_whitelist.keys():
+ if value == self.meta_key_value_whitelist[key]:
+ continue
+ ret[key] = value
+ return ret
+
+
+class WMVParser(AbstractFFmpegParser):
+ mimetypes = {'video/x-ms-wmv', }
+ meta_whitelist = {'AudioChannels', 'AudioCodecID', 'AudioCodecName',
+ 'ErrorCorrectionType', 'AudioSampleRate', 'DataPackets',
+ 'Directory', 'Duration', 'ExifToolVersion',
+ 'FileAccessDate', 'FileInodeChangeDate', 'FileLength',
+ 'FileModifyDate', 'FileName', 'FilePermissions',
+ 'FileSize', 'FileType', 'FileTypeExtension',
+ 'FrameCount', 'FrameRate', 'ImageHeight', 'ImageSize',
+ 'ImageWidth', 'MIMEType', 'MaxBitrate', 'MaxPacketSize',
+ 'Megapixels', 'MinPacketSize', 'Preroll', 'SendDuration',
+ 'SourceFile', 'StreamNumber', 'VideoCodecName', }
+ meta_key_value_whitelist = { # some metadata are mandatory :/
+ 'AudioCodecDescription': '',
+ 'CreationDate': '0000:00:00 00:00:00Z',
+ 'FileID': '00000000-0000-0000-0000-000000000000',
+ 'Flags': 2, # FIXME: What is this? Why 2?
+ 'ModifyDate': '0000:00:00 00:00:00',
+ 'TimeOffset': '0 s',
+ 'VideoCodecDescription': '',
+ 'StreamType': 'Audio',
+ }
+
class AVIParser(AbstractFFmpegParser):
mimetypes = {'video/x-msvideo', }
@@ -51,6 +95,7 @@ class AVIParser(AbstractFFmpegParser):
'SampleRate', 'AvgBytesPerSec', 'BitsPerSample',
'Duration', 'ImageSize', 'Megapixels'}
+
class MP4Parser(AbstractFFmpegParser):
mimetypes = {'video/mp4', }
meta_whitelist = {'AudioFormat', 'AvgBitrate', 'Balance', 'TrackDuration',
@@ -84,23 +129,6 @@ class MP4Parser(AbstractFFmpegParser):
'TrackVolume': '0.00%',
}
- def remove_all(self) -> bool:
- logging.warning('The format of "%s" (video/mp4) has some mandatory '
- 'metadata fields; mat2 filled them with standard data.',
- self.filename)
- return super().remove_all()
-
- def get_meta(self) -> Dict[str, Union[str, dict]]:
- meta = super().get_meta()
-
- ret = dict() # type: Dict[str, Union[str, dict]]
- for key, value in meta.items():
- if key in self.meta_key_value_whitelist.keys():
- if value == self.meta_key_value_whitelist[key]:
- continue
- ret[key] = value
- return ret
-
def _get_ffmpeg_path() -> str: # pragma: no cover
ffmpeg_path = '/usr/bin/ffmpeg'
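The mandatory-field filtering that used to live only in `MP4Parser` now applies to every FFmpeg-based parser; a sketch based on the tests below:

```python
from libmat2 import video

p = video.WMVParser('./tests/data/dirty.wmv')
meta = p.get_meta()   # mandatory fields still set to their standard values are filtered out
p.remove_all()        # re-encodes via ffmpeg, warning that mandatory fields were refilled
```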
=====================================
mat2
=====================================
@@ -15,7 +15,7 @@ except ValueError as e:
print(e)
sys.exit(1)
-__version__ = '0.6.0'
+__version__ = '0.7.0'
# Make pyflakes happy
assert Tuple
@@ -118,7 +118,7 @@ def clean_meta(filename: str, is_lightweight: bool, policy: UnknownMemberPolicy)
-def show_parsers() -> bool:
+def show_parsers():
print('[+] Supported formats:')
formats = set() # Set[str]
for parser in parser_factory._get_parsers(): # type: ignore
@@ -133,7 +133,6 @@ def show_parsers() -> bool:
continue
formats.add(' - %s (%s)' % (mtype, ', '.join(extensions)))
print('\n'.join(sorted(formats)))
- return True
def __get_files_recursively(files: List[str]) -> Generator[str, None, None]:
@@ -156,7 +155,8 @@ def main() -> int:
if not args.files:
if args.list:
- return show_parsers()
+ show_parsers()
+ return 0
elif args.check_dependencies:
print("Dependencies required for MAT2 %s:" % __version__)
for key, value in sorted(check_dependencies().items()):
=====================================
nautilus/README.md
=====================================
@@ -0,0 +1,12 @@
+# mat2's Nautilus extension
+
+# Dependencies
+
+- Nautilus (now known as [Files](https://wiki.gnome.org/action/show/Apps/Files))
+- [nautilus-python](https://gitlab.gnome.org/GNOME/nautilus-python) >= 2.10
+
+# Installation
+
+Simply copy the `mat2.py` file to `~/.local/share/nautilus-python/extensions`,
+and launch Nautilus; you should now have a "Remove metadata" item in the
right-click menu on supported files.
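In shell form (a sketch; `nautilus -q` quits any running instance so the extension is loaded on the next launch):

```
$ mkdir -p ~/.local/share/nautilus-python/extensions
$ cp nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
$ nautilus -q
```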
=====================================
nautilus/mat2.py
=====================================
@@ -35,7 +35,7 @@ def _remove_metadata(fpath) -> Tuple[bool, Optional[str]]:
return False, mtype
return parser.remove_all(), mtype
-class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationWidgetProvider):
+class Mat2Extension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationWidgetProvider):
""" This class adds an item to the right-clic menu in Nautilus. """
def __init__(self):
=====================================
setup.py
=====================================
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
setuptools.setup(
name="mat2",
- version='0.6.0',
+ version='0.7.0',
author="Julien (jvoisin) Voisin",
author_email="julien.voisin+mat2 at dustri.org",
description="A handy tool to trash your metadata",
=====================================
tests/data/dirty.gif
=====================================
Binary files /dev/null and b/tests/data/dirty.gif differ
=====================================
tests/data/dirty.html
=====================================
@@ -0,0 +1,14 @@
+<html>
+ <head>
+ <meta content="vim" name="generator"/>
+ <meta content="jvoisin" name="author"/>
+</head>
+<body>
+ <p>
+ <h1>Hello</h1>
+ I am a web page.
+ Please <b>love</b> me.
+ Here, have a pretty picture: <img src='dirty.jpg' alt='a pretty picture'/>
+ </p>
+</body>
+</html>
=====================================
tests/data/dirty.wmv
=====================================
Binary files /dev/null and b/tests/data/dirty.wmv differ
=====================================
tests/test_corrupted_files.py
=====================================
@@ -7,7 +7,7 @@ import logging
import zipfile
from libmat2 import pdf, images, audio, office, parser_factory, torrent
-from libmat2 import harmless, video
+from libmat2 import harmless, video, html
# No need to log messages; should something go wrong,
# the testsuite _will_ fail.
@@ -67,15 +67,10 @@ class TestCorruptedEmbedded(unittest.TestCase):
os.remove('./tests/data/clean.docx')
def test_odt(self):
- expected = {
- 'create_system': 'Weird',
- 'date_time': '2018-06-10 17:18:18',
- 'meta.xml': 'harmful content'
- }
shutil.copy('./tests/data/embedded_corrupted.odt', './tests/data/clean.odt')
parser, _ = parser_factory.get_parser('./tests/data/clean.odt')
self.assertFalse(parser.remove_all())
- self.assertEqual(parser.get_meta(), expected)
+ self.assertTrue(parser.get_meta())
os.remove('./tests/data/clean.odt')
@@ -237,3 +232,40 @@ class TestCorruptedFiles(unittest.TestCase):
self.assertEqual(meta['tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
self.assertFalse(p.remove_all())
os.remove('./tests/data/dirty.zip')
+
+ def test_html(self):
+ shutil.copy('./tests/data/dirty.html', './tests/data/clean.html')
+ with open('./tests/data/clean.html', 'a') as f:
+ f.write('<open>but not</closed>')
+ with self.assertRaises(ValueError):
+ html.HTMLParser('./tests/data/clean.html')
+ os.remove('./tests/data/clean.html')
+
+ # Yes, we're able to deal with malformed html :/
+ shutil.copy('./tests/data/dirty.html', './tests/data/clean.html')
+ with open('./tests/data/clean.html', 'a') as f:
+ f.write('<meta name=\'this" is="weird"/>')
+ p = html.HTMLParser('./tests/data/clean.html')
+ self.assertTrue(p.remove_all())
+ p = html.HTMLParser('./tests/data/clean.cleaned.html')
+ self.assertEqual(p.get_meta(), {})
+ os.remove('./tests/data/clean.html')
+ os.remove('./tests/data/clean.cleaned.html')
+
+ with open('./tests/data/clean.html', 'w') as f:
+ f.write('</close>')
+ with self.assertRaises(ValueError):
+ html.HTMLParser('./tests/data/clean.html')
+ os.remove('./tests/data/clean.html')
+
+ with open('./tests/data/clean.html', 'w') as f:
+ f.write('<notclosed>')
+ p = html.HTMLParser('./tests/data/clean.html')
+ with self.assertRaises(ValueError):
+ p.get_meta()
+ p = html.HTMLParser('./tests/data/clean.html')
+ with self.assertRaises(ValueError):
+ p.remove_all()
+ os.remove('./tests/data/clean.html')
+
+
=====================================
tests/test_libmat2.py
=====================================
@@ -6,7 +6,7 @@ import os
import zipfile
from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
-from libmat2 import check_dependencies, video, archive
+from libmat2 import check_dependencies, video, archive, html
class TestCheckDependencies(unittest.TestCase):
@@ -131,21 +131,21 @@ class TestGetMeta(unittest.TestCase):
def test_docx(self):
p = office.MSOfficeParser('./tests/data/dirty.docx')
meta = p.get_meta()
- self.assertEqual(meta['cp:lastModifiedBy'], 'Julien Voisin')
- self.assertEqual(meta['dc:creator'], 'julien voisin')
- self.assertEqual(meta['Application'], 'LibreOffice/5.4.5.1$Linux_X86_64 LibreOffice_project/40m0$Build-1')
+ self.assertEqual(meta['docProps/core.xml']['cp:lastModifiedBy'], 'Julien Voisin')
+ self.assertEqual(meta['docProps/core.xml']['dc:creator'], 'julien voisin')
+ self.assertEqual(meta['docProps/app.xml']['Application'], 'LibreOffice/5.4.5.1$Linux_X86_64 LibreOffice_project/40m0$Build-1')
def test_libreoffice(self):
p = office.LibreOfficeParser('./tests/data/dirty.odt')
meta = p.get_meta()
- self.assertEqual(meta['meta:initial-creator'], 'jvoisin ')
- self.assertEqual(meta['meta:creation-date'], '2011-07-26T03:27:48')
- self.assertEqual(meta['meta:generator'], 'LibreOffice/3.3$Unix LibreOffice_project/330m19$Build-202')
+ self.assertEqual(meta['meta.xml']['meta:initial-creator'], 'jvoisin ')
+ self.assertEqual(meta['meta.xml']['meta:creation-date'], '2011-07-26T03:27:48')
+ self.assertEqual(meta['meta.xml']['meta:generator'], 'LibreOffice/3.3$Unix LibreOffice_project/330m19$Build-202')
p = office.LibreOfficeParser('./tests/data/weird_producer.odt')
meta = p.get_meta()
- self.assertEqual(meta['create_system'], 'Windows')
- self.assertEqual(meta['comment'], b'YAY FOR COMMENTS')
+ self.assertEqual(meta['mimetype']['create_system'], 'Windows')
+ self.assertEqual(meta['mimetype']['comment'], b'YAY FOR COMMENTS')
def test_txt(self):
p, mimetype = parser_factory.get_parser('./tests/data/dirty.txt')
@@ -165,6 +165,17 @@ class TestGetMeta(unittest.TestCase):
self.assertEqual(meta['tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
os.remove('./tests/data/dirty.zip')
+ def test_wmv(self):
+ p, mimetype = parser_factory.get_parser('./tests/data/dirty.wmv')
+ self.assertEqual(mimetype, 'video/x-ms-wmv')
+ meta = p.get_meta()
+ self.assertEqual(meta['EncodingSettings'], 'Lavf52.103.0')
+
+ def test_gif(self):
+ p, mimetype = parser_factory.get_parser('./tests/data/dirty.gif')
+ self.assertEqual(mimetype, 'image/gif')
+ meta = p.get_meta()
+ self.assertEqual(meta['Comment'], 'this is a test comment')
class TestRemovingThumbnails(unittest.TestCase):
def test_odt(self):
@@ -429,7 +440,7 @@ class TestCleaning(unittest.TestCase):
p = office.LibreOfficeParser('./tests/data/clean.odf')
meta = p.get_meta()
- self.assertEqual(meta['meta:creation-date'], '2018-04-23T00:18:59.438231281')
+ self.assertEqual(meta['meta.xml']['meta:creation-date'], '2018-04-23T00:18:59.438231281')
ret = p.remove_all()
self.assertTrue(ret)
@@ -447,7 +458,7 @@ class TestCleaning(unittest.TestCase):
p = office.LibreOfficeParser('./tests/data/clean.odg')
meta = p.get_meta()
- self.assertEqual(meta['dc:date'], '2018-04-23T00:26:59.385838550')
+ self.assertEqual(meta['meta.xml']['dc:date'], '2018-04-23T00:26:59.385838550')
ret = p.remove_all()
self.assertTrue(ret)
@@ -544,3 +555,62 @@ class TestCleaning(unittest.TestCase):
os.remove('./tests/data/clean.mp4')
os.remove('./tests/data/clean.cleaned.mp4')
os.remove('./tests/data/clean.cleaned.cleaned.mp4')
+
+ def test_wmv(self):
+ try:
+ video._get_ffmpeg_path()
+ except RuntimeError:
+ raise unittest.SkipTest
+
+ shutil.copy('./tests/data/dirty.wmv', './tests/data/clean.wmv')
+ p = video.WMVParser('./tests/data/clean.wmv')
+
+ meta = p.get_meta()
+ self.assertEqual(meta['EncodingSettings'], 'Lavf52.103.0')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = video.WMVParser('./tests/data/clean.cleaned.wmv')
+ self.assertNotIn('EncodingSettings', p.get_meta())
+ self.assertTrue(p.remove_all())
+
+ os.remove('./tests/data/clean.wmv')
+ os.remove('./tests/data/clean.cleaned.wmv')
+ os.remove('./tests/data/clean.cleaned.cleaned.wmv')
+
+ def test_gif(self):
+ shutil.copy('./tests/data/dirty.gif', './tests/data/clean.gif')
+ p = images.GIFParser('./tests/data/clean.gif')
+
+ meta = p.get_meta()
+ self.assertEqual(meta['Comment'], 'this is a test comment')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = images.GIFParser('./tests/data/clean.cleaned.gif')
+ self.assertNotIn('EncodingSettings', p.get_meta())
+ self.assertTrue(p.remove_all())
+
+ os.remove('./tests/data/clean.gif')
+ os.remove('./tests/data/clean.cleaned.gif')
+ os.remove('./tests/data/clean.cleaned.cleaned.gif')
+
+ def test_html(self):
+ shutil.copy('./tests/data/dirty.html', './tests/data/clean.html')
+ p = html.HTMLParser('./tests/data/clean.html')
+
+ meta = p.get_meta()
+ self.assertEqual(meta['author'], 'jvoisin')
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = html.HTMLParser('./tests/data/clean.cleaned.html')
+ self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
+
+ os.remove('./tests/data/clean.html')
+ os.remove('./tests/data/clean.cleaned.html')
+ os.remove('./tests/data/clean.cleaned.cleaned.html')
--
View it on GitLab: https://salsa.debian.org/pkg-privacy-team/mat2/commit/8e51bae522d272ff2d70ea58a3cef48d04af9ad2
You're receiving this email because of your account on salsa.debian.org.