[Pkg-privacy-commits] [Git][pkg-privacy-team/mat2][upstream] New upstream version 0.7.0

Georg Faerber gitlab at salsa.debian.org
Sun Feb 17 16:38:25 GMT 2019


Georg Faerber pushed to branch upstream at Privacy Maintainers / mat2


Commits:
8e51bae5 by Georg Faerber at 2019-02-17T16:36:36Z
New upstream version 0.7.0
- - - - -


26 changed files:

- .gitlab-ci.yml
- CHANGELOG.md
- CONTRIBUTING.md
- INSTALL.md
- README.md
- doc/mat2.1
- libmat2/__init__.py
- libmat2/abstract.py
- libmat2/archive.py
- libmat2/exiftool.py
- + libmat2/html.py
- libmat2/images.py
- libmat2/office.py
- libmat2/parser_factory.py
- + libmat2/subprocess.py
- libmat2/torrent.py
- libmat2/video.py
- mat2
- + nautilus/README.md
- nautilus/mat2.py
- setup.py
- + tests/data/dirty.gif
- + tests/data/dirty.html
- + tests/data/dirty.wmv
- tests/test_corrupted_files.py
- tests/test_libmat2.py


Changes:

=====================================
.gitlab-ci.yml
=====================================
@@ -42,6 +42,17 @@ tests:debian:
   script:
   - apt-get -qqy update
   - apt-get -qqy install --no-install-recommends python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl python3-coverage ffmpeg
+  - apt-get -qqy purge bubblewrap
+  - python3-coverage run --branch -m unittest discover -s tests/
+  - python3-coverage report --fail-under=90 -m --include 'libmat2/*'
+
+tests:debian_with_bubblewrap:
+  stage: test
+  tags:
+    - whitewhale
+  script:
+  - apt-get -qqy update
+  - apt-get -qqy install --no-install-recommends python3-mutagen python3-gi-cairo gir1.2-poppler-0.18 gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl python3-coverage ffmpeg bubblewrap
   - python3-coverage run --branch -m unittest discover -s tests/
   - python3-coverage report --fail-under=100 -m --include 'libmat2/*'
 


=====================================
CHANGELOG.md
=====================================
@@ -1,3 +1,12 @@
+# 0.7.0 - 2019-02-17
+
+- Add support for wmv files
+- Add support for gif files
+- Add support for html files
+- Sandbox external processes via bubblewrap
+- Simplify archive-based formats processing
+- The Nautilus extension now plays nicer with other extensions
+
 # 0.6.0 - 2018-11-10
 
 - Add lightweight cleaning for jpeg


=====================================
CONTRIBUTING.md
=====================================
@@ -33,5 +33,5 @@ Since MAT2 is written in Python3, please conform as much as possible to the
 10. Sign the tarball with `gpg --armor --detach-sign mat-$VERSION.tar.xz`
 11. Upload the result on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
 12. Announce the release on the [mailing list](https://mailman.boum.org/listinfo/mat-dev)
-13. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
+13. Upload the new version on pypi with `python3 setup.py sdist bdist_wheel` then `twine upload -s dist/*`
 14. Do the secret release dance


=====================================
INSTALL.md
=====================================
@@ -1,5 +1,21 @@
+# Python ecosystem
+
+If you feel like running arbitrary code downloaded over the
+internet (pypi doesn't support gpg signatures [anymore](https://github.com/pypa/python-packaging-user-guide/pull/466)),
+mat2 is [available on pypi](https://pypi.org/project/mat2/), and can be
+installed like this:
+
+```
+pip3 install mat2
+```
+
 # GNU/Linux
 
+## Optional dependencies
+
+When [bubblewrap](https://github.com/projectatomic/bubblewrap) is
+installed, MAT2 uses it to sandbox any external processes it invokes.
+
 ## Fedora
 
 Thanks to [atenart](https://ack.tf/), there is a package available on
@@ -23,13 +39,14 @@ dnf -y install mat2 mat2-nautilus
 
 ## Debian
 
-There is currently no package for Debian. If you want to help to make this
-happen, there is an [issue](https://0xacab.org/jvoisin/mat2/issues/16) open.
+There is a package available in Debian *buster/sid*. The package [doesn't include
+the Nautilus extension yet](https://bugs.debian.org/910491).
 
-But fear not, there is a way to install it *manually*:
+For Debian 9 *stretch*, there is a way to install it *manually*:
 
 ```
-# apt install python3-mutagen python3-gi-cairo gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl gir1.2-glib-2.0 gir1.2-poppler-0.18
+# apt install python3-mutagen python3-gi-cairo gir1.2-gdkpixbuf-2.0 libimage-exiftool-perl gir1.2-glib-2.0 gir1.2-poppler-0.18 ffmpeg
+# apt install bubblewrap  # if you want sandboxing
 $ git clone https://0xacab.org/jvoisin/mat2.git
 $ cd mat2
 $ ./mat2
@@ -38,7 +55,7 @@ $ ./mat2
 and if you want to install the über-fancy Nautilus extension:
 
 ```
-# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev
+# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev python3-dev build-essential
 $ git clone https://github.com/GNOME/nautilus-python
 $ cd nautilus-python
 $ PYTHON=/usr/bin/python3 ./autogen.sh

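[Editor's note, not part of the patch: the "optional dependencies" paragraph above boils down to a simple executable-file probe. A minimal sketch, with a hypothetical `find_bwrap` helper; mat2's real check lives in `libmat2/subprocess.py`:]

```python
import os

def find_bwrap(path='/usr/bin/bwrap'):
    """Return the bubblewrap binary path if it is an executable file, else None.

    Hypothetical helper mirroring the optional-dependency check above."""
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    return None

# mat2 degrades gracefully: sandboxing is only used when this is not None.
sandboxing_available = find_bwrap() is not None
```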

=====================================
README.md
=====================================
@@ -42,6 +42,13 @@ doesn't run on [Debian Jessie](https://packages.debian.org/jessie/python3).
 $ python3 -m unittest discover -v
 ```
 
+And if you want to see the coverage:
+
+```bash
+$ python3-coverage run --branch -m unittest discover -s tests/
+$ python3-coverage report -m --include 'libmat2/*'
+```
+
 # How to use MAT2
 
 ```bash
@@ -82,6 +89,15 @@ complex file formats.
 This is why you shouldn't rely on metadata's presence to decide if your file must
 be cleaned or not.
 
+# Notes about the lightweight mode
+
+By default, mat2 might slightly alter the data of your files in order to remove
+as much metadata as possible. For example, text in a PDF might no longer be
+selectable, compressed images might get compressed again, …
+Since some users might be willing to keep some metadata in exchange for the
+guarantee that mat2 won't modify the data of their files, there is the
+`-L` flag that does precisely that.
+
 # Related software
 
 - The first iteration of [MAT](https://mat.boum.org)


=====================================
doc/mat2.1
=====================================
@@ -1,4 +1,4 @@
-.TH MAT2 "1" "November 2018" "MAT2 0.6.0" "User Commands"
+.TH MAT2 "1" "February 2019" "MAT2 0.7.0" "User Commands"
 
 .SH NAME
 mat2 \- the metadata anonymisation toolkit 2


=====================================
libmat2/__init__.py
=====================================
@@ -39,12 +39,11 @@ DEPENDENCIES = {
     }
 
 
-
 def check_dependencies() -> Dict[str, bool]:
     ret = collections.defaultdict(bool)  # type: Dict[str, bool]
 
-    ret['Exiftool'] = True if exiftool._get_exiftool_path() else False
-    ret['Ffmpeg'] = True if video._get_ffmpeg_path() else False
+    ret['Exiftool'] = bool(exiftool._get_exiftool_path())
+    ret['Ffmpeg'] = bool(video._get_ffmpeg_path())
 
     for key, value in DEPENDENCIES.items():
         ret[value] = True
@@ -55,6 +54,7 @@ def check_dependencies() -> Dict[str, bool]:
 
     return ret
 
+
 @enum.unique
 class UnknownMemberPolicy(enum.Enum):
     ABORT = 'abort'


=====================================
libmat2/abstract.py
=====================================
@@ -37,4 +37,5 @@ class AbstractParser(abc.ABC):
         """
         :raises RuntimeError: Raised if the cleaning process went wrong.
         """
+        # pylint: disable=unnecessary-pass
         pass  # pragma: no cover


=====================================
libmat2/archive.py
=====================================
@@ -4,7 +4,7 @@ import tempfile
 import os
 import logging
 import shutil
-from typing import Dict, Set, Pattern, Union
+from typing import Dict, Set, Pattern, Union, Any
 
 from . import abstract, UnknownMemberPolicy, parser_factory
 
@@ -42,6 +42,12 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
         # pylint: disable=unused-argument,no-self-use
         return True  # pragma: no cover
 
+    def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
+        """ This method can be used to extract specific metadata
+        from files present in the archive."""
+        # pylint: disable=unused-argument,no-self-use
+        return {}  # pragma: no cover
+
     @staticmethod
     def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
         zipinfo.create_system = 3  # Linux
@@ -74,6 +80,10 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
             temp_folder = tempfile.mkdtemp()
 
             for item in zin.infolist():
+                local_meta = dict()  # type: Dict[str, Union[str, Dict]]
+                for k, v in self._get_zipinfo_meta(item).items():
+                    local_meta[k] = v
+
                 if item.filename[-1] == '/':  # pragma: no cover
                     # `is_dir` is added in Python3.6
                     continue  # don't keep empty folders
@@ -81,11 +91,15 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
                 zin.extract(member=item, path=temp_folder)
                 full_path = os.path.join(temp_folder, item.filename)
 
+                specific_meta = self._specific_get_meta(full_path, item.filename)
+                for (k, v) in specific_meta.items():
+                    local_meta[k] = v
+
                 tmp_parser, _ = parser_factory.get_parser(full_path)  # type: ignore
-                if not tmp_parser:
-                    continue
+                if tmp_parser:
+                    for k, v in tmp_parser.get_meta().items():
+                        local_meta[k] = v
 
-                local_meta = tmp_parser.get_meta()
                 if local_meta:
                     meta[item.filename] = local_meta
 
@@ -132,7 +146,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
                             logging.warning("In file %s, keeping unknown element %s (format: %s)",
                                             self.filename, item.filename, mtype)
                         else:
-                            logging.error("In file %s, element %s's format (%s) " +
+                            logging.error("In file %s, element %s's format (%s) " \
                                           "isn't supported",
                                           self.filename, item.filename, mtype)
                             abort = True

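[Editor's note, not part of the patch: the refactoring above introduces a template-method hook — the base archive walker merges per-member metadata returned by an overridable `_specific_get_meta`. A stripped-down sketch of that pattern, with illustrative class names rather than mat2's actual ones:]

```python
from typing import Any, Dict, List, Tuple

class ArchiveParserSketch:
    """Base walker: merge per-member metadata from an overridable hook."""
    def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
        # Default hook: no format-specific metadata.
        return {}

    def collect(self, members: List[Tuple[str, str]]) -> Dict[str, Dict[str, Any]]:
        meta = {}
        for full_path, file_path in members:
            local = self._specific_get_meta(full_path, file_path)
            if local:
                meta[file_path] = local
        return meta

class MetaXmlSketch(ArchiveParserSketch):
    """Subclass hook: only care about meta.xml, like LibreOfficeParser does."""
    def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
        if file_path != 'meta.xml':
            return {}
        return {'generator': 'example'}

result = MetaXmlSketch().collect(
    [('/tmp/a/meta.xml', 'meta.xml'), ('/tmp/a/content.xml', 'content.xml')])
# result == {'meta.xml': {'generator': 'example'}}
```

This is why `MSOfficeParser` and `LibreOfficeParser` can drop their near-identical `get_meta` overrides below: each only supplies its hook.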

=====================================
libmat2/exiftool.py
=====================================
@@ -1,10 +1,10 @@
 import json
 import logging
 import os
-import subprocess
 from typing import Dict, Union, Set
 
 from . import abstract
+from . import subprocess
 
 # Make pyflakes happy
 assert Set
@@ -18,7 +18,9 @@ class ExiftoolParser(abstract.AbstractParser):
     meta_whitelist = set()  # type: Set[str]
 
     def get_meta(self) -> Dict[str, Union[str, dict]]:
-        out = subprocess.check_output([_get_exiftool_path(), '-json', self.filename])
+        out = subprocess.run([_get_exiftool_path(), '-json', self.filename],
+                             input_filename=self.filename,
+                             check=True, stdout=subprocess.PIPE).stdout
         meta = json.loads(out.decode('utf-8'))[0]
         for key in self.meta_whitelist:
             meta.pop(key, None)
@@ -46,7 +48,9 @@ class ExiftoolParser(abstract.AbstractParser):
                '-o', self.output_filename,
                self.filename]
         try:
-            subprocess.check_call(cmd)
+            subprocess.run(cmd, check=True,
+                           input_filename=self.filename,
+                           output_filename=self.output_filename)
         except subprocess.CalledProcessError as e:  # pragma: no cover
             logging.error("Something went wrong during the processing of %s: %s", self.filename, e)
             return False


=====================================
libmat2/html.py
=====================================
@@ -0,0 +1,69 @@
+from html import parser
+from typing import Dict, Any, List, Tuple
+
+from . import abstract
+
+
+class HTMLParser(abstract.AbstractParser):
+    mimetypes = {'text/html', }
+    def __init__(self, filename):
+        super().__init__(filename)
+        self.__parser = _HTMLParser()
+        with open(filename) as f:
+            self.__parser.feed(f.read())
+        self.__parser.close()
+
+    def get_meta(self) -> Dict[str, Any]:
+        return self.__parser.get_meta()
+
+    def remove_all(self) -> bool:
+        return self.__parser.remove_all(self.output_filename)
+
+
+class _HTMLParser(parser.HTMLParser):
+    """Python doesn't have a validating html parser in its stdlib, so
+    we're using an internal queue to track all the opening/closing tags,
+    and hoping for the best.
+    """
+    def __init__(self):
+        super().__init__()
+        self.__textrepr = ''
+        self.__meta = {}
+        self.__validation_queue = []
+
+    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]):
+        self.__textrepr += self.get_starttag_text()
+        self.__validation_queue.append(tag)
+
+    def handle_endtag(self, tag: str):
+        if not self.__validation_queue:
+            raise ValueError
+        elif tag != self.__validation_queue.pop():
+            raise ValueError
+        # There is no `get_endtag_text()` method :/
+        self.__textrepr += '</' + tag + '>\n'
+
+    def handle_data(self, data: str):
+        if data.strip():
+            self.__textrepr += data
+
+    def handle_startendtag(self, tag: str, attrs: List[Tuple[str, str]]):
+        if tag == 'meta':
+            meta = {k:v for k, v in attrs}
+            name = meta.get('name', 'harmful metadata')
+            content = meta.get('content', 'harmful data')
+            self.__meta[name] = content
+        else:
+            self.__textrepr += self.get_starttag_text()
+
+    def remove_all(self, output_filename: str) -> bool:
+        if self.__validation_queue:
+            raise ValueError
+        with open(output_filename, 'w') as f:
+            f.write(self.__textrepr)
+        return True
+
+    def get_meta(self) -> Dict[str, Any]:
+        if self.__validation_queue:
+            raise ValueError
+        return self.__meta

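[Editor's note, not part of the patch: the validation-queue trick in `_HTMLParser` above — push each opening tag, pop on each closing tag, raise on mismatch — can be illustrated with a minimal sketch built on the same stdlib `html.parser`:]

```python
from html import parser

class TagBalanceChecker(parser.HTMLParser):
    """Stripped-down version of the validation-queue idea: push opening
    tags, pop on closing tags, and fail loudly on any mismatch."""
    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            raise ValueError('unbalanced closing tag: %s' % tag)

checker = TagBalanceChecker()
checker.feed('<p><b>fine</b></p>')
assert not checker.stack  # a balanced document leaves an empty stack
```

Since the stdlib parser is non-validating, this best-effort balancing is what lets `get_meta` and `remove_all` refuse to emit output for malformed input.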

=====================================
libmat2/images.py
=====================================
@@ -42,6 +42,21 @@ class PNGParser(exiftool.ExiftoolParser):
         return True
 
 
+class GIFParser(exiftool.ExiftoolParser):
+    mimetypes = {'image/gif'}
+    meta_whitelist = {'AnimationIterations', 'BackgroundColor', 'BitsPerPixel',
+                      'ColorResolutionDepth', 'Directory', 'Duration',
+                      'ExifToolVersion', 'FileAccessDate',
+                      'FileInodeChangeDate', 'FileModifyDate', 'FileName',
+                      'FilePermissions', 'FileSize', 'FileType',
+                      'FileTypeExtension', 'FrameCount', 'GIFVersion',
+                      'HasColorMap', 'ImageHeight', 'ImageSize', 'ImageWidth',
+                      'MIMEType', 'Megapixels', 'SourceFile',}
+
+    def remove_all(self) -> bool:
+        return self._lightweight_cleanup()
+
+
 class GdkPixbufAbstractParser(exiftool.ExiftoolParser):
 """ GdkPixbuf can handle a lot of surfaces, so we're rendering images on it,
         this has the side-effect of completely removing metadata.


=====================================
libmat2/office.py
=====================================
@@ -2,7 +2,7 @@ import logging
 import os
 import re
 import zipfile
-from typing import Dict, Set, Pattern, Tuple, Union
+from typing import Dict, Set, Pattern, Tuple, Any
 
 import xml.etree.ElementTree as ET  # type: ignore
 
@@ -266,7 +266,6 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
                 f.write(b'<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">')
                 f.write(b'</cp:coreProperties>')
 
-
         if self.__remove_rsid(full_path) is False:
             return False
 
@@ -296,26 +295,21 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
 
         return True
 
-    def get_meta(self) -> Dict[str, Union[str, dict]]:
+    def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
         """
         Yes, I know that parsing xml with regexp ain't pretty,
         be my guest and fix it if you want.
         """
-        metadata = super().get_meta()
-        zipin = zipfile.ZipFile(self.filename)
-        for item in zipin.infolist():
-            if item.filename.startswith('docProps/') and item.filename.endswith('.xml'):
-                try:
-                    content = zipin.read(item).decode('utf-8')
-                    results = re.findall(r"<(.+)>(.+)</\1>", content, re.I|re.M)
-                    for (key, value) in results:
-                        metadata[key] = value
-                except (TypeError, UnicodeDecodeError):  # We didn't manage to parse the xml file
-                    metadata[item.filename] = 'harmful content'
-            for key, value in self._get_zipinfo_meta(item).items():
-                metadata[key] = value
-        zipin.close()
-        return metadata
+        if not file_path.startswith('docProps/') or not file_path.endswith('.xml'):
+            return {}
+
+        with open(full_path, encoding='utf-8') as f:
+            try:
+                results = re.findall(r"<(.+)>(.+)</\1>", f.read(), re.I|re.M)
+                return {k:v for (k, v) in results}
+            except (TypeError, UnicodeDecodeError):
+                # We didn't manage to parse the xml file
+                return {file_path: 'harmful content', }
 
 
 class LibreOfficeParser(ArchiveBasedAbstractParser):
@@ -381,23 +375,17 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
                 return False
         return True
 
-    def get_meta(self) -> Dict[str, Union[str, dict]]:
+    def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
         """
         Yes, I know that parsing xml with regexp ain't pretty,
         be my guest and fix it if you want.
         """
-        metadata = {}
-        zipin = zipfile.ZipFile(self.filename)
-        for item in zipin.infolist():
-            if item.filename == 'meta.xml':
-                try:
-                    content = zipin.read(item).decode('utf-8')
-                    results = re.findall(r"<((?:meta|dc|cp).+?)>(.+)</\1>", content, re.I|re.M)
-                    for (key, value) in results:
-                        metadata[key] = value
-                except (TypeError, UnicodeDecodeError):  # We didn't manage to parse the xml file
-                    metadata[item.filename] = 'harmful content'
-            for key, value in self._get_zipinfo_meta(item).items():
-                metadata[key] = value
-        zipin.close()
-        return metadata
+        if file_path != 'meta.xml':
+            return {}
+        with open(full_path, encoding='utf-8') as f:
+            try:
+                results = re.findall(r"<((?:meta|dc|cp).+?)[^>]*>(.+)</\1>", f.read(), re.I|re.M)
+                return {k:v for (k, v) in results}
+            except (TypeError, UnicodeDecodeError):
+                # We didn't manage to parse the xml file
+                return {file_path: 'harmful content', }


=====================================
libmat2/parser_factory.py
=====================================
@@ -10,6 +10,7 @@ assert Tuple  # make pyflakes happy
 
 T = TypeVar('T', bound='abstract.AbstractParser')
 
+
 def __load_all_parsers():
     """ Loads every parser in a dynamic way """
     current_dir = os.path.dirname(__file__)
@@ -24,8 +25,10 @@ def __load_all_parsers():
         name, _ = os.path.splitext(basename)
         importlib.import_module('.' + name, package='libmat2')
 
+
 __load_all_parsers()
 
+
 def _get_parsers() -> List[T]:
     """ Get all our parsers!"""
     def __get_parsers(cls):

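[Editor's note, not part of the patch: `__load_all_parsers` above relies on a common dynamic-import idiom — import every module in a package so that parser classes register themselves simply by subclassing. A generic sketch with a hypothetical `load_plugins` helper, not mat2's API:]

```python
import glob
import importlib
import os

def load_plugins(package_dir, package):
    """Import every public .py module found in package_dir, so classes
    defined there become visible via BaseClass.__subclasses__()."""
    for path in glob.glob(os.path.join(package_dir, '*.py')):
        name, _ = os.path.splitext(os.path.basename(path))
        if name.startswith('_'):
            continue  # skip __init__.py and private helpers
        importlib.import_module('.' + name, package=package)
```

After loading, `_get_parsers` builds its registry by recursively walking `AbstractParser`'s subclasses.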

=====================================
libmat2/subprocess.py
=====================================
@@ -0,0 +1,105 @@
+"""
+Wrapper around a subset of the subprocess module,
+that uses bwrap (bubblewrap) when it is available.
+
+Instead of importing subprocess, other modules should use this as follows:
+
+  from . import subprocess
+"""
+
+import os
+import shutil
+import subprocess
+import tempfile
+from typing import List, Optional
+
+
+__all__ = ['PIPE', 'run', 'CalledProcessError']
+PIPE = subprocess.PIPE
+CalledProcessError = subprocess.CalledProcessError
+
+
+def _get_bwrap_path() -> str:
+    bwrap_path = '/usr/bin/bwrap'
+    if os.path.isfile(bwrap_path):
+        if os.access(bwrap_path, os.X_OK):
+            return bwrap_path
+
+    raise RuntimeError("Unable to find bwrap")  # pragma: no cover
+
+
+# pylint: disable=bad-whitespace
+def _get_bwrap_args(tempdir: str,
+                    input_filename: str,
+                    output_filename: Optional[str] = None) -> List[str]:
+    ro_bind_args = []
+    cwd = os.getcwd()
+
+    # XXX: use --ro-bind-try once all supported platforms
+    # have a bubblewrap recent enough to support it.
+    ro_bind_dirs = ['/usr', '/lib', '/lib64', '/bin', '/sbin', cwd]
+    for bind_dir in ro_bind_dirs:
+        if os.path.isdir(bind_dir):  # pragma: no cover
+            ro_bind_args.extend(['--ro-bind', bind_dir, bind_dir])
+
+    ro_bind_files = ['/etc/ld.so.cache']
+    for bind_file in ro_bind_files:
+        if os.path.isfile(bind_file):  # pragma: no cover
+            ro_bind_args.extend(['--ro-bind', bind_file, bind_file])
+
+    args = ro_bind_args + \
+        ['--dev', '/dev',
+         '--chdir', cwd,
+         '--unshare-all',
+         '--new-session',
+         # XXX: enable --die-with-parent once all supported platforms have
+         # a bubblewrap recent enough to support it.
+         # '--die-with-parent',
+        ]
+
+    if output_filename:
+        # Mount an empty temporary directory where the sandboxed
+        # process will create its output file
+        output_dirname = os.path.dirname(os.path.abspath(output_filename))
+        args.extend(['--bind', tempdir, output_dirname])
+
+    absolute_input_filename = os.path.abspath(input_filename)
+    args.extend(['--ro-bind', absolute_input_filename, absolute_input_filename])
+
+    return args
+
+
+# pylint: disable=bad-whitespace
+def run(args: List[str],
+        input_filename: str,
+        output_filename: Optional[str] = None,
+        **kwargs) -> subprocess.CompletedProcess:
+    """Wrapper around `subprocess.run`, that uses bwrap (bubblewrap) if it
+    is available.
+
+    Extra supported keyword arguments:
+
+     - `input_filename`, made available read-only in the sandbox
+     - `output_filename`, where the file created by the sandboxed process
+       is copied upon successful completion; an empty temporary directory
+       is made visible as the parent directory of this file in the sandbox.
+       Optional: one valid use case is to invoke an external process
+       to inspect metadata present in a file.
+    """
+    try:
+        bwrap_path = _get_bwrap_path()
+    except RuntimeError:  # pragma: no cover
+        # bubblewrap is not installed ⇒ short-circuit
+        return subprocess.run(args, **kwargs)
+
+    with tempfile.TemporaryDirectory() as tempdir:
+        prefix_args = [bwrap_path] + \
+            _get_bwrap_args(input_filename=input_filename,
+                            output_filename=output_filename,
+                            tempdir=tempdir)
+        completed_process = subprocess.run(prefix_args + args, **kwargs)
+        if output_filename and completed_process.returncode == 0:
+            shutil.copy(os.path.join(tempdir, os.path.basename(output_filename)),
+                        output_filename)
+
+        return completed_process


=====================================
libmat2/torrent.py
=====================================
@@ -3,6 +3,7 @@ from typing import Union, Tuple, Dict
 
 from . import abstract
 
+
 class TorrentParser(abstract.AbstractParser):
     mimetypes = {'application/x-bittorrent', }
     whitelist = {b'announce', b'announce-list', b'info'}
@@ -32,7 +33,7 @@ class TorrentParser(abstract.AbstractParser):
         return True
 
 
-class _BencodeHandler(object):
+class _BencodeHandler():
     """
     Since bencode isn't that hard to parse,
     MAT2 comes with its own parser, based on the spec


=====================================
libmat2/video.py
=====================================
@@ -1,15 +1,22 @@
 import os
-import subprocess
 import logging
 
 from typing import Dict, Union
 
 from . import exiftool
+from . import subprocess
 
 
 class AbstractFFmpegParser(exiftool.ExiftoolParser):
     """ Abstract parser for all FFmpeg-based ones, mainly for video. """
+    # Some fileformats have mandatory metadata fields
+    meta_key_value_whitelist = {}  # type: Dict[str, Union[str, int]]
+
     def remove_all(self) -> bool:
+        if self.meta_key_value_whitelist:
+            logging.warning('The format of "%s" (%s) has some mandatory '
+                            'metadata fields; mat2 filled them with standard '
+                            'data.', self.filename, ', '.join(self.mimetypes))
         cmd = [_get_ffmpeg_path(),
                '-i', self.filename,      # input file
                '-y',                     # overwrite existing output file
@@ -25,12 +32,49 @@ class AbstractFFmpegParser(exiftool.ExiftoolParser):
                '-flags:a', '+bitexact',  # don't add any metadata
                self.output_filename]
         try:
-            subprocess.check_call(cmd)
+            subprocess.run(cmd, check=True,
+                           input_filename=self.filename,
+                           output_filename=self.output_filename)
         except subprocess.CalledProcessError as e:
             logging.error("Something went wrong during the processing of %s: %s", self.filename, e)
             return False
         return True
 
+    def get_meta(self) -> Dict[str, Union[str, dict]]:
+        meta = super().get_meta()
+
+        ret = dict()  # type: Dict[str, Union[str, dict]]
+        for key, value in meta.items():
+            if key in self.meta_key_value_whitelist.keys():
+                if value == self.meta_key_value_whitelist[key]:
+                    continue
+            ret[key] = value
+        return ret
+
+
+class WMVParser(AbstractFFmpegParser):
+    mimetypes = {'video/x-ms-wmv', }
+    meta_whitelist = {'AudioChannels', 'AudioCodecID', 'AudioCodecName',
+                      'ErrorCorrectionType', 'AudioSampleRate', 'DataPackets',
+                      'Directory', 'Duration', 'ExifToolVersion',
+                      'FileAccessDate', 'FileInodeChangeDate', 'FileLength',
+                      'FileModifyDate', 'FileName', 'FilePermissions',
+                      'FileSize', 'FileType', 'FileTypeExtension',
+                      'FrameCount', 'FrameRate', 'ImageHeight', 'ImageSize',
+                      'ImageWidth', 'MIMEType', 'MaxBitrate', 'MaxPacketSize',
+                      'Megapixels', 'MinPacketSize', 'Preroll', 'SendDuration',
+                      'SourceFile', 'StreamNumber', 'VideoCodecName', }
+    meta_key_value_whitelist = {  # some metadata are mandatory :/
+        'AudioCodecDescription': '',
+        'CreationDate': '0000:00:00 00:00:00Z',
+        'FileID': '00000000-0000-0000-0000-000000000000',
+        'Flags': 2,  # FIXME: What is this? Why 2?
+        'ModifyDate': '0000:00:00 00:00:00',
+        'TimeOffset': '0 s',
+        'VideoCodecDescription': '',
+        'StreamType': 'Audio',
+        }
+
 
 class AVIParser(AbstractFFmpegParser):
     mimetypes = {'video/x-msvideo', }
@@ -51,6 +95,7 @@ class AVIParser(AbstractFFmpegParser):
                       'SampleRate', 'AvgBytesPerSec', 'BitsPerSample',
                       'Duration', 'ImageSize', 'Megapixels'}
 
+
 class MP4Parser(AbstractFFmpegParser):
     mimetypes = {'video/mp4', }
     meta_whitelist = {'AudioFormat', 'AvgBitrate', 'Balance', 'TrackDuration',
@@ -84,23 +129,6 @@ class MP4Parser(AbstractFFmpegParser):
         'TrackVolume': '0.00%',
     }
 
-    def remove_all(self) -> bool:
-        logging.warning('The format of "%s" (video/mp4) has some mandatory '
-                        'metadata fields; mat2 filled them with standard data.',
-                        self.filename)
-        return super().remove_all()
-
-    def get_meta(self) -> Dict[str, Union[str, dict]]:
-        meta = super().get_meta()
-
-        ret = dict()  # type: Dict[str, Union[str, dict]]
-        for key, value in meta.items():
-            if key in self.meta_key_value_whitelist.keys():
-                if value == self.meta_key_value_whitelist[key]:
-                    continue
-            ret[key] = value
-        return ret
-
 
 def _get_ffmpeg_path() -> str:  # pragma: no cover
     ffmpeg_path = '/usr/bin/ffmpeg'

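[Editor's note, not part of the patch: the `get_meta` override hoisted into `AbstractFFmpegParser` above filters out metadata whose value equals a known mandatory default, so only unexpected values are reported. A minimal sketch of that filter, using an illustrative subset of the WMV whitelist:]

```python
from typing import Dict, Union

# Illustrative subset of WMVParser.meta_key_value_whitelist.
MANDATORY = {'TimeOffset': '0 s', 'Flags': 2}

def drop_expected(meta: Dict[str, Union[str, int]]) -> Dict[str, Union[str, int]]:
    """Keep only entries whose value differs from the mandatory default."""
    return {k: v for k, v in meta.items() if MANDATORY.get(k) != v}

cleaned = drop_expected({'TimeOffset': '0 s', 'Author': 'jvoisin'})
# cleaned == {'Author': 'jvoisin'}
```

Hoisting this into the abstract class is what lets `MP4Parser` drop its own copies of `remove_all` and `get_meta` below.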

=====================================
mat2
=====================================
@@ -15,7 +15,7 @@ except ValueError as e:
     print(e)
     sys.exit(1)
 
-__version__ = '0.6.0'
+__version__ = '0.7.0'
 
 # Make pyflakes happy
 assert Tuple
@@ -118,7 +118,7 @@ def clean_meta(filename: str, is_lightweight: bool, policy: UnknownMemberPolicy)
 
 
 
-def show_parsers() -> bool:
+def show_parsers():
     print('[+] Supported formats:')
     formats = set()  # Set[str]
     for parser in parser_factory._get_parsers():  # type: ignore
@@ -133,7 +133,6 @@ def show_parsers() -> bool:
                 continue
             formats.add('  - %s (%s)' % (mtype, ', '.join(extensions)))
     print('\n'.join(sorted(formats)))
-    return True
 
 
 def __get_files_recursively(files: List[str]) -> Generator[str, None, None]:
@@ -156,7 +155,8 @@ def main() -> int:
 
     if not args.files:
         if args.list:
-            return show_parsers()
+            show_parsers()
+            return 0
         elif args.check_dependencies:
             print("Dependencies required for MAT2 %s:" % __version__)
             for key, value in sorted(check_dependencies().items()):


=====================================
nautilus/README.md
=====================================
@@ -0,0 +1,12 @@
+# mat2's Nautilus extension
+
+# Dependencies
+
+- Nautilus (now known as [Files](https://wiki.gnome.org/action/show/Apps/Files))
+- [nautilus-python](https://gitlab.gnome.org/GNOME/nautilus-python) >= 2.10
+
+# Installation
+
+Simply copy the `mat2.py` file to `~/.local/share/nautilus-python/extensions`,
+and launch Nautilus; you should now have a "Remove metadata" item in the
+right-click menu on supported files.


=====================================
nautilus/mat2.py
=====================================
@@ -35,7 +35,7 @@ def _remove_metadata(fpath) -> Tuple[bool, Optional[str]]:
         return False, mtype
     return parser.remove_all(), mtype
 
-class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationWidgetProvider):
+class Mat2Extension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationWidgetProvider):
     """ This class adds an item to the right-click menu in Nautilus. """
 
     def __init__(self):


=====================================
setup.py
=====================================
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
 
 setuptools.setup(
     name="mat2",
-    version='0.6.0',
+    version='0.7.0',
     author="Julien (jvoisin) Voisin",
     author_email="julien.voisin+mat2 at dustri.org",
     description="A handy tool to trash your metadata",


=====================================
tests/data/dirty.gif
=====================================
Binary files /dev/null and b/tests/data/dirty.gif differ


=====================================
tests/data/dirty.html
=====================================
@@ -0,0 +1,14 @@
+<html>
+	<head>
+		<meta content="vim" name="generator"/>
+		<meta content="jvoisin" name="author"/>
+</head>
+<body>
+	<p>
+		<h1>Hello</h1>
+		I am a web page.
+		Please <b>love</b> me.
+		Here, have a pretty picture: <img src='dirty.jpg' alt='a pretty picture'/>
+	</p>
+</body>
+</html>


=====================================
tests/data/dirty.wmv
=====================================
Binary files /dev/null and b/tests/data/dirty.wmv differ


=====================================
tests/test_corrupted_files.py
=====================================
@@ -7,7 +7,7 @@ import logging
 import zipfile
 
 from libmat2 import pdf, images, audio, office, parser_factory, torrent
-from libmat2 import harmless, video
+from libmat2 import harmless, video, html
 
 # No need to logging messages, should something go wrong,
 # the testsuite _will_ fail.
@@ -67,15 +67,10 @@ class TestCorruptedEmbedded(unittest.TestCase):
         os.remove('./tests/data/clean.docx')
 
     def test_odt(self):
-        expected = {
-                'create_system': 'Weird',
-                'date_time': '2018-06-10 17:18:18',
-                'meta.xml': 'harmful content'
-                }
         shutil.copy('./tests/data/embedded_corrupted.odt', './tests/data/clean.odt')
         parser, _ = parser_factory.get_parser('./tests/data/clean.odt')
         self.assertFalse(parser.remove_all())
-        self.assertEqual(parser.get_meta(), expected)
+        self.assertTrue(parser.get_meta())
         os.remove('./tests/data/clean.odt')
 
 
@@ -237,3 +232,40 @@ class TestCorruptedFiles(unittest.TestCase):
         self.assertEqual(meta['tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
         self.assertFalse(p.remove_all())
         os.remove('./tests/data/dirty.zip')
+
+    def test_html(self):
+        shutil.copy('./tests/data/dirty.html', './tests/data/clean.html')
+        with open('./tests/data/clean.html', 'a') as f:
+            f.write('<open>but not</closed>')
+        with self.assertRaises(ValueError):
+            html.HTMLParser('./tests/data/clean.html')
+        os.remove('./tests/data/clean.html')
+
+        # Yes, we're able to deal with malformed html :/
+        shutil.copy('./tests/data/dirty.html', './tests/data/clean.html')
+        with open('./tests/data/clean.html', 'a') as f:
+            f.write('<meta name=\'this" is="weird"/>')
+        p = html.HTMLParser('./tests/data/clean.html')
+        self.assertTrue(p.remove_all())
+        p = html.HTMLParser('./tests/data/clean.cleaned.html')
+        self.assertEqual(p.get_meta(), {})
+        os.remove('./tests/data/clean.html')
+        os.remove('./tests/data/clean.cleaned.html')
+
+        with open('./tests/data/clean.html', 'w') as f:
+            f.write('</close>')
+        with self.assertRaises(ValueError):
+            html.HTMLParser('./tests/data/clean.html')
+        os.remove('./tests/data/clean.html')
+
+        with open('./tests/data/clean.html', 'w') as f:
+            f.write('<notclosed>')
+        p = html.HTMLParser('./tests/data/clean.html')
+        with self.assertRaises(ValueError):
+            p.get_meta()
+        p = html.HTMLParser('./tests/data/clean.html')
+        with self.assertRaises(ValueError):
+            p.remove_all()
+        os.remove('./tests/data/clean.html')
+
+


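The new HTML test cases above exercise strict tag-balance checking: a mismatched or stray closing tag raises `ValueError` as soon as the file is parsed, while tags left open only fail once `get_meta()` or `remove_all()` is called. A minimal, self-contained sketch of that validation on top of the standard library's `html.parser` — not mat2's actual `libmat2/html.py`, which additionally writes the cleaned file and copes with some malformed-markup edge cases:

```python
from html import parser

class MetaHTMLParser(parser.HTMLParser):
    """Sketch: collect <meta name=... content=...> pairs and check that
    every closing tag matches the most recently opened one."""

    def __init__(self, filename: str):
        super().__init__()
        self.filename = filename
        self.__meta = {}
        self.__stack = []  # currently open tags
        with open(filename, encoding='utf-8') as f:
            self.feed(f.read())  # raises ValueError on mismatched tags

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta .../>) go through handle_startendtag,
        # which by default calls handle_starttag then handle_endtag.
        self.__stack.append(tag)
        if tag == 'meta':
            attrs = dict(attrs)
            if 'name' in attrs and 'content' in attrs:
                self.__meta[attrs['name']] = attrs['content']

    def handle_endtag(self, tag):
        if not self.__stack or self.__stack.pop() != tag:
            raise ValueError("unexpected closing tag </%s> in %s"
                             % (tag, self.filename))

    def get_meta(self) -> dict:
        if self.__stack:  # tags opened but never closed
            raise ValueError("%d unclosed tag(s) in %s"
                             % (len(self.__stack), self.filename))
        return self.__meta
```

Note that HTML void elements written without a trailing slash (e.g. `<br>`) would need extra handling in this sketch; the test data uses XHTML-style self-closing tags throughout.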
=====================================
tests/test_libmat2.py
=====================================
@@ -6,7 +6,7 @@ import os
 import zipfile
 
 from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
-from libmat2 import check_dependencies, video, archive
+from libmat2 import check_dependencies, video, archive, html
 
 
 class TestCheckDependencies(unittest.TestCase):
@@ -131,21 +131,21 @@ class TestGetMeta(unittest.TestCase):
     def test_docx(self):
         p = office.MSOfficeParser('./tests/data/dirty.docx')
         meta = p.get_meta()
-        self.assertEqual(meta['cp:lastModifiedBy'], 'Julien Voisin')
-        self.assertEqual(meta['dc:creator'], 'julien voisin')
-        self.assertEqual(meta['Application'], 'LibreOffice/5.4.5.1$Linux_X86_64 LibreOffice_project/40m0$Build-1')
+        self.assertEqual(meta['docProps/core.xml']['cp:lastModifiedBy'], 'Julien Voisin')
+        self.assertEqual(meta['docProps/core.xml']['dc:creator'], 'julien voisin')
+        self.assertEqual(meta['docProps/app.xml']['Application'], 'LibreOffice/5.4.5.1$Linux_X86_64 LibreOffice_project/40m0$Build-1')
 
     def test_libreoffice(self):
         p = office.LibreOfficeParser('./tests/data/dirty.odt')
         meta = p.get_meta()
-        self.assertEqual(meta['meta:initial-creator'], 'jvoisin ')
-        self.assertEqual(meta['meta:creation-date'], '2011-07-26T03:27:48')
-        self.assertEqual(meta['meta:generator'], 'LibreOffice/3.3$Unix LibreOffice_project/330m19$Build-202')
+        self.assertEqual(meta['meta.xml']['meta:initial-creator'], 'jvoisin ')
+        self.assertEqual(meta['meta.xml']['meta:creation-date'], '2011-07-26T03:27:48')
+        self.assertEqual(meta['meta.xml']['meta:generator'], 'LibreOffice/3.3$Unix LibreOffice_project/330m19$Build-202')
 
         p = office.LibreOfficeParser('./tests/data/weird_producer.odt')
         meta = p.get_meta()
-        self.assertEqual(meta['create_system'], 'Windows')
-        self.assertEqual(meta['comment'], b'YAY FOR COMMENTS')
+        self.assertEqual(meta['mimetype']['create_system'], 'Windows')
+        self.assertEqual(meta['mimetype']['comment'], b'YAY FOR COMMENTS')
 
     def test_txt(self):
         p, mimetype = parser_factory.get_parser('./tests/data/dirty.txt')
@@ -165,6 +165,17 @@ class TestGetMeta(unittest.TestCase):
         self.assertEqual(meta['tests/data/dirty.docx']['word/media/image1.png']['Comment'], 'This is a comment, be careful!')
         os.remove('./tests/data/dirty.zip')
 
+    def test_wmv(self):
+        p, mimetype = parser_factory.get_parser('./tests/data/dirty.wmv')
+        self.assertEqual(mimetype, 'video/x-ms-wmv')
+        meta = p.get_meta()
+        self.assertEqual(meta['EncodingSettings'], 'Lavf52.103.0')
+
+    def test_gif(self):
+        p, mimetype = parser_factory.get_parser('./tests/data/dirty.gif')
+        self.assertEqual(mimetype, 'image/gif')
+        meta = p.get_meta()
+        self.assertEqual(meta['Comment'], 'this is a test comment')
 
 class TestRemovingThumbnails(unittest.TestCase):
     def test_odt(self):
@@ -429,7 +440,7 @@ class TestCleaning(unittest.TestCase):
         p = office.LibreOfficeParser('./tests/data/clean.odf')
 
         meta = p.get_meta()
-        self.assertEqual(meta['meta:creation-date'], '2018-04-23T00:18:59.438231281')
+        self.assertEqual(meta['meta.xml']['meta:creation-date'], '2018-04-23T00:18:59.438231281')
 
         ret = p.remove_all()
         self.assertTrue(ret)
@@ -447,7 +458,7 @@ class TestCleaning(unittest.TestCase):
         p = office.LibreOfficeParser('./tests/data/clean.odg')
 
         meta = p.get_meta()
-        self.assertEqual(meta['dc:date'], '2018-04-23T00:26:59.385838550')
+        self.assertEqual(meta['meta.xml']['dc:date'], '2018-04-23T00:26:59.385838550')
 
         ret = p.remove_all()
         self.assertTrue(ret)
@@ -544,3 +555,62 @@ class TestCleaning(unittest.TestCase):
         os.remove('./tests/data/clean.mp4')
         os.remove('./tests/data/clean.cleaned.mp4')
         os.remove('./tests/data/clean.cleaned.cleaned.mp4')
+
+    def test_wmv(self):
+        try:
+            video._get_ffmpeg_path()
+        except RuntimeError:
+            raise unittest.SkipTest
+
+        shutil.copy('./tests/data/dirty.wmv', './tests/data/clean.wmv')
+        p = video.WMVParser('./tests/data/clean.wmv')
+
+        meta = p.get_meta()
+        self.assertEqual(meta['EncodingSettings'], 'Lavf52.103.0')
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        p = video.WMVParser('./tests/data/clean.cleaned.wmv')
+        self.assertNotIn('EncodingSettings', p.get_meta())
+        self.assertTrue(p.remove_all())
+
+        os.remove('./tests/data/clean.wmv')
+        os.remove('./tests/data/clean.cleaned.wmv')
+        os.remove('./tests/data/clean.cleaned.cleaned.wmv')
+
+    def test_gif(self):
+        shutil.copy('./tests/data/dirty.gif', './tests/data/clean.gif')
+        p = images.GIFParser('./tests/data/clean.gif')
+
+        meta = p.get_meta()
+        self.assertEqual(meta['Comment'], 'this is a test comment')
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        p = images.GIFParser('./tests/data/clean.cleaned.gif')
+        self.assertNotIn('EncodingSettings', p.get_meta())
+        self.assertTrue(p.remove_all())
+
+        os.remove('./tests/data/clean.gif')
+        os.remove('./tests/data/clean.cleaned.gif')
+        os.remove('./tests/data/clean.cleaned.cleaned.gif')
+
+    def test_html(self):
+        shutil.copy('./tests/data/dirty.html', './tests/data/clean.html')
+        p = html.HTMLParser('./tests/data/clean.html')
+
+        meta = p.get_meta()
+        self.assertEqual(meta['author'], 'jvoisin')
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        p = html.HTMLParser('./tests/data/clean.cleaned.html')
+        self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
+
+        os.remove('./tests/data/clean.html')
+        os.remove('./tests/data/clean.cleaned.html')
+        os.remove('./tests/data/clean.cleaned.cleaned.html')



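The updated assertions in `tests/test_libmat2.py` reflect a layout change in 0.7.0: for zip-based office formats, `get_meta()` now nests metadata under the archive member it was found in (e.g. `meta['docProps/core.xml']['dc:creator']`) instead of returning one flat dict. A rough standard-library illustration of that nested shape — namespaces are stripped here for brevity, whereas mat2's real parsers keep prefixed keys such as `cp:lastModifiedBy` and dispatch each member to a dedicated parser:

```python
import zipfile
import xml.etree.ElementTree as ET

def nested_office_meta(path: str) -> dict:
    """Return {zip_member: {tag: text}} for the usual XML property
    members of a zip-based office document (illustrative only)."""
    meta = {}
    with zipfile.ZipFile(path) as z:
        for member in z.namelist():
            if member not in ('docProps/core.xml', 'docProps/app.xml',
                              'meta.xml'):
                continue
            root = ET.fromstring(z.read(member))
            entries = {}
            for elem in root.iter():
                if elem.text and elem.text.strip():
                    # drop the '{namespace}' prefix for readability
                    entries[elem.tag.split('}')[-1]] = elem.text
            if entries:
                meta[member] = entries
    return meta
```

Keying by member also lets callers tell apart identically-named properties coming from different files inside the archive, which the old flat dict silently merged.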
View it on GitLab: https://salsa.debian.org/pkg-privacy-team/mat2/commit/8e51bae522d272ff2d70ea58a3cef48d04af9ad2
