[Pkg-privacy-commits] [Git][pkg-privacy-team/mat2][upstream] New upstream version 0.4.0

Georg Faerber gitlab at salsa.debian.org
Wed Oct 3 20:04:52 BST 2018


Georg Faerber pushed to branch upstream at Privacy Maintainers / mat2


Commits:
2ebbdb39 by Georg Faerber at 2018-10-03T18:11:33Z
New upstream version 0.4.0
- - - - -


26 changed files:

- .gitlab-ci.yml
- + .mailmap
- CHANGELOG.md
- CONTRIBUTING.md
- INSTALL.md
- README.md
- doc/implementation_notes.md
- doc/mat.1 → doc/mat2.1
- libmat2/__init__.py
- + libmat2/archive.py
- libmat2/images.py
- libmat2/office.py
- libmat2/pdf.py
- libmat2/torrent.py
- mat2
- nautilus/mat2.py
- setup.py
- + tests/data/broken_xml_content_types.docx
- + tests/data/malformed_content_types.docx
- + tests/data/no_content_types.docx
- + tests/data/office_revision_session_ids.docx
- tests/test_climat2.py
- tests/test_corrupted_files.py
- + tests/test_deep_cleaning.py
- tests/test_libmat2.py
- + tests/test_policy.py


Changes:

=====================================
.gitlab-ci.yml
=====================================
@@ -49,6 +49,8 @@ tests:debian:
 tests:fedora:
   image: fedora
   stage: test
+  tags:
+    - whitewhale
   script:
   - dnf install -y python3 python3-mutagen python3-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 gdk-pixbuf2-modules cairo-gobject cairo python3-cairo perl-Image-ExifTool mailcap
   - gdk-pixbuf-query-loaders-64 > /usr/lib64/gdk-pixbuf-2.0/2.10.0/loaders.cache
@@ -57,6 +59,8 @@ tests:fedora:
 tests:archlinux:
   image: archlinux/base
   stage: test
+  tags:
+    - whitewhale
   script:
   - pacman -Sy --noconfirm python-mutagen python-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 python-cairo perl-image-exiftool python-setuptools mailcap
   - python3 setup.py test


=====================================
.mailmap
=====================================
@@ -0,0 +1,5 @@
+Julien (jvoisin) Voisin <julien.voisin+mat2 at dustri.org> totallylegit <totallylegit at dustri.org>
+Julien (jvoisin) Voisin <julien.voisin+mat2 at dustri.org> jvoisin <julien.voisin at dustri.org>
+Julien (jvoisin) Voisin <julien.voisin+mat2 at dustri.org> jvoisin <jvoisin at riseup.net>
+
+Daniel Kahn Gillmor <dkg at fifthhorseman.net> dkg <dkg at fifthhorseman.net>


=====================================
CHANGELOG.md
=====================================
@@ -1,3 +1,20 @@
+# 0.4.0 - 2018-10-03
+
+- There is now a policy, for advanced users, to deal with unknown embedded fileformats
+- Improve the documentation
+- Various minor refactoring
+- Improve how corrupted PNGs are handled
+- Dangerous/advanced CLI options no longer have short versions
+- Significant improvements to office files anonymisation
+	- Archive members are sorted lexicographically
+	- XML attributes are sorted lexicographically too
+	- RSID are now stripped
+	- Dangling references in [Content_Types].xml are now removed
+- Significant improvements to office files support
+- Anonymised office files can now be opened by MS Office without warnings
+- The CLI isn't threaded anymore, since it was causing issues
+- Various misc typo fixes
+
 # 0.3.1 - 2018-09-01
 
 - Document how to install MAT2 for various distributions


=====================================
CONTRIBUTING.md
=====================================
@@ -24,10 +24,13 @@ Since MAT2 is written in Python3, please conform as much as possible to the
 1. Update the [changelog](https://0xacab.org/jvoisin/mat2/blob/master/CHANGELOG.md)
 2. Update the version in the [mat2](https://0xacab.org/jvoisin/mat2/blob/master/mat2) file
 3. Update the version in the [setup.py](https://0xacab.org/jvoisin/mat2/blob/master/setup.py) file
-4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat.1)
+4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat2.1)
 5. Commit the changelog, man page, mat2 and setup.py files
 6. Create a tag with `git tag -s $VERSION`
 7. Push the commit with `git push origin master`
 8. Push the tag with `git push --tags`
-9. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
-10. Do the secret release dance
+9. Create the signed tarball with `git archive --format=tar.xz --prefix=mat-$VERSION/ $VERSION > mat-$VERSION.tar.xz`
+10. Sign the tarball with `gpg --armor --detach-sign mat-$VERSION.tar.xz`
+11. Upload the result on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
+12. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
+13. Do the secret release dance


=====================================
INSTALL.md
=====================================
@@ -38,13 +38,14 @@ $ ./mat2
 and if you want to install the über-fancy Nautilus extension:
 
 ```
-# apt install python-gi-dev
+# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev
 $ git clone https://github.com/GNOME/nautilus-python
 $ cd nautilus-python
 $ PYTHON=/usr/bin/python3 ./autogen.sh
 $ make
 # make install
-$ cp ./nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
+$ mkdir -p ~/.local/share/nautilus-python/extensions/
+$ cp ../nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
 $ PYTHONPATH=/home/$USER/mat2 PYTHON=/usr/bin/python3 nautilus
 ```
 
@@ -52,3 +53,7 @@ $ PYTHONPATH=/home/$USER/mat2 PYTHON=/usr/bin/python3 nautilus
 
 Thanks to [Francois_B](https://www.sciunto.org/), there is a package available on
 [Arch linux's AUR](https://aur.archlinux.org/packages/mat2/).
+
+## Gentoo
+
+MAT2 is available in the [torbrowser overlay](https://github.com/MeisterP/torbrowser-overlay).


=====================================
README.md
=====================================
@@ -44,22 +44,33 @@ $ python3 -m unittest discover -v
 # How to use MAT2
 
 ```bash
-usage: mat2 [-h] [-v] [-l] [-s | -L] [files [files ...]]
+usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]
+            [--unknown-members policy] [-s | -L]
+            [files [files ...]]
 
 Metadata anonymisation toolkit 2
 
 positional arguments:
-  files
+  files                 the files to process
 
 optional arguments:
-  -h, --help         show this help message and exit
-  -v, --version      show program's version number and exit
-  -l, --list         list all supported fileformats
-  -s, --show         list all the harmful metadata of a file without removing
-                     them
-  -L, --lightweight  remove SOME metadata
+  -h, --help            show this help message and exit
+  -v, --version         show program's version number and exit
+  -l, --list            list all supported fileformats
+  --check-dependencies  check if MAT2 has all the dependencies it needs
+  -V, --verbose         show more verbose status information
+  --unknown-members policy
+                        how to handle unknown members of archive-style files
+                        (policy should be one of: abort, omit, keep)
+  -s, --show            list harmful metadata detectable by MAT2 without
+                        removing them
+  -L, --lightweight     remove SOME metadata
 ```
 
+Note that MAT2 **will not** clean files in-place; instead, it produces, for
+example, from a file named "myfile.png", a cleaned version named
+"myfile.cleaned.png".
+
 # Notes about detecting metadata
 
 While MAT2 is doing its very best to display metadata when the `--show` flag is
@@ -78,12 +89,15 @@ be cleaned or not.
 	tries to deal with *printer dots* too.
 - [pdfparanoia](https://github.com/kanzure/pdfparanoia), that removes
 	watermarks from PDF.
+- [Scrambled Exif](https://f-droid.org/packages/com.jarsilio.android.scrambledeggsif/),
+	an open-source Android application to remove metadata from pictures.
 
 # Contact
 
-If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues).
-If you think that a more private contact is needed (eg. for reporting security issues),
-you can email Julien (jvoisin) Voisin at `julien.voisin+mat at dustri.org`,
+If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues)
+or the [mailing list](https://mailman.boum.org/listinfo/mat-dev).
+Should a more private contact be needed (e.g. for reporting security issues),
+you can email Julien (jvoisin) Voisin at `julien.voisin+mat2 at dustri.org`,
 using the gpg key `9FCDEE9E1A381F311EA62A7404D041E8171901CC`.
 
 # License

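The in-place note in the README above boils down to a simple naming convention. A minimal sketch of it in Python (the helper name `cleaned_filename` is ours, for illustration, not part of mat2's API):

```python
import os


def cleaned_filename(path: str) -> str:
    # Derive the output name the documentation describes:
    # "myfile.png" -> "myfile.cleaned.png"
    root, ext = os.path.splitext(path)
    return root + '.cleaned' + ext


print(cleaned_filename('myfile.png'))  # myfile.cleaned.png
```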

=====================================
doc/implementation_notes.md
=====================================
@@ -61,3 +61,11 @@ Images handling
 When possible, images are handled like PDF: rendered on a surface, then saved
 to the filesystem. This ensures that every metadata is removed.
 
+XML attacks
+-----------
+
+Since our threat model conveniently excludes files crafted to specifically
+bypass MAT2, fileformats containing harmful XML are out of our scope.
+But since MAT2 is using [etree](https://docs.python.org/3/library/xml.html#xml-vulnerabilities)
+to process XML, it's "only" vulnerable to DoS, and not memory corruption:
+odds are that the user will notice that the cleaning didn't succeed.


=====================================
doc/mat.1 → doc/mat2.1
=====================================
@@ -1,16 +1,20 @@
-.TH MAT2 "1" "September 2018" "MAT2 0.3.1" "User Commands"
+.TH MAT2 "1" "October 2018" "MAT2 0.4.0" "User Commands"
 
 .SH NAME
 mat2 \- the metadata anonymisation toolkit 2
 
 .SH SYNOPSIS
-mat2 [\-h] [\-v] [\-l] [\-c] [\-s | \-L]\fR [files [files ...]]
+\fBmat2\fR [\-h] [\-v] [\-l] [\-V] [-s | -L] [\fIfiles\fR [\fIfiles ...\fR]]
 
 .SH DESCRIPTION
 .B mat2
 removes metadata from various fileformats. It supports a wide variety of file 
 formats, audio, office, images, …
 
+Careful: mat2 does not clean files in-place; instead, it produces a file with the word
+"cleaned" between the filename and its extension, for example "filename.cleaned.png"
+for a file named "filename.png".
+
 .SH OPTIONS
 .SS "positional arguments:"
 .TP
@@ -27,9 +31,15 @@ show program's version number and exit
 \fB\-l\fR, \fB\-\-list\fR
 list all supported fileformats
 .TP
-\fB\-c\fR, \fB\-\-check\-dependencies\fR
+\fB\-\-check\-dependencies\fR
 check if MAT2 has all the dependencies it needs
 .TP
+\fB\-V\fR, \fB\-\-verbose\fR
+show more verbose status information
+.TP
+\fB\-\-unknown-members\fR \fIpolicy\fR
+how to handle unknown members of archive-style files (policy should be one of: abort, omit, keep)
+.TP
 \fB\-s\fR, \fB\-\-show\fR
 list harmful metadata detectable by MAT2 without
 removing them


=====================================
libmat2/__init__.py
=====================================
@@ -2,6 +2,7 @@
 
 import os
 import collections
+import enum
 import importlib
 from typing import Dict, Optional
 
@@ -35,16 +36,16 @@ DEPENDENCIES = {
     'mutagen': 'Mutagen',
     }
 
-def _get_exiftool_path() -> Optional[str]:
+def _get_exiftool_path() -> Optional[str]:  # pragma: no cover
     exiftool_path = '/usr/bin/exiftool'
     if os.path.isfile(exiftool_path):
-        if os.access(exiftool_path, os.X_OK):  # pragma: no cover
+        if os.access(exiftool_path, os.X_OK):
             return exiftool_path
 
     # ArchLinux
     exiftool_path = '/usr/bin/vendor_perl/exiftool'
     if os.path.isfile(exiftool_path):
-        if os.access(exiftool_path, os.X_OK):  # pragma: no cover
+        if os.access(exiftool_path, os.X_OK):
             return exiftool_path
 
     return None
@@ -62,3 +63,9 @@ def check_dependencies() -> dict:
             ret[value] = False  # pragma: no cover
 
     return ret
+
+@enum.unique
+class UnknownMemberPolicy(enum.Enum):
+    ABORT = 'abort'
+    OMIT = 'omit'
+    KEEP = 'keep'

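The new `UnknownMemberPolicy` enum above maps directly onto the `--unknown-members` CLI option. A hedged sketch of how such an enum can back an argparse flag — the argparse wiring here is illustrative, not necessarily how mat2 itself does it:

```python
import argparse
import enum


@enum.unique
class UnknownMemberPolicy(enum.Enum):
    ABORT = 'abort'
    OMIT = 'omit'
    KEEP = 'keep'


parser = argparse.ArgumentParser()
# The enum's values double as the set of accepted strings on the CLI.
parser.add_argument('--unknown-members', metavar='policy', default='abort',
                    choices=[p.value for p in UnknownMemberPolicy])

args = parser.parse_args(['--unknown-members', 'omit'])
policy = UnknownMemberPolicy(args.unknown_members)
print(policy)  # UnknownMemberPolicy.OMIT
```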

=====================================
libmat2/archive.py
=====================================
@@ -0,0 +1,127 @@
+import zipfile
+import datetime
+import tempfile
+import os
+import logging
+import shutil
+from typing import Dict, Set, Pattern
+
+from . import abstract, UnknownMemberPolicy, parser_factory
+
+# Make pyflakes happy
+assert Set
+assert Pattern
+
+
+class ArchiveBasedAbstractParser(abstract.AbstractParser):
+    """ Office files (.docx, .odt, …) are zipped files. """
+    def __init__(self, filename):
+        super().__init__(filename)
+
+        # Those are the files that have a format that _isn't_
+        # supported by MAT2, but that we want to keep anyway.
+        self.files_to_keep = set()  # type: Set[Pattern]
+
+        # Those are the files that we _do not_ want to keep,
+        # no matter if they are supported or not.
+        self.files_to_omit = set()  # type: Set[Pattern]
+
+        # what should the parser do if it encounters an unknown file in
+        # the archive?
+        self.unknown_member_policy = UnknownMemberPolicy.ABORT  # type: UnknownMemberPolicy
+
+        try:  # better fail here than later
+            zipfile.ZipFile(self.filename)
+        except zipfile.BadZipFile:
+            raise ValueError
+
+    def _specific_cleanup(self, full_path: str) -> bool:
+        """ This method can be used to apply specific treatment
+        to files present in the archive."""
+        # pylint: disable=unused-argument,no-self-use
+        return True  # pragma: no cover
+
+    @staticmethod
+    def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
+        zipinfo.create_system = 3  # Linux
+        zipinfo.comment = b''
+        zipinfo.date_time = (1980, 1, 1, 0, 0, 0)  # this is as early as a zipfile can be
+        return zipinfo
+
+    @staticmethod
+    def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
+        metadata = {}
+        if zipinfo.create_system == 3:  # this is Linux
+            pass
+        elif zipinfo.create_system == 2:
+            metadata['create_system'] = 'Windows'
+        else:
+            metadata['create_system'] = 'Weird'
+
+        if zipinfo.comment:
+            metadata['comment'] = zipinfo.comment  # type: ignore
+
+        if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
+            metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
+
+        return metadata
+
+    def remove_all(self) -> bool:
+        # pylint: disable=too-many-branches
+
+        with zipfile.ZipFile(self.filename) as zin,\
+             zipfile.ZipFile(self.output_filename, 'w') as zout:
+
+            temp_folder = tempfile.mkdtemp()
+            abort = False
+
+            # Since files order is a fingerprint factor,
+            # we're iterating (and thus inserting) them in lexicographic order.
+            for item in sorted(zin.infolist(), key=lambda z: z.filename):
+                if item.filename[-1] == '/':  # `is_dir` is added in Python3.6
+                    continue  # don't keep empty folders
+
+                zin.extract(member=item, path=temp_folder)
+                full_path = os.path.join(temp_folder, item.filename)
+
+                if self._specific_cleanup(full_path) is False:
+                    logging.warning("Something went wrong during deep cleaning of %s",
+                                    item.filename)
+                    abort = True
+                    continue
+
+                if any(map(lambda r: r.search(item.filename), self.files_to_keep)):
+                    # those files aren't supported, but we want to add them anyway
+                    pass
+                elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
+                    continue
+                else:  # supported files that we want to first clean, then add
+                    tmp_parser, mtype = parser_factory.get_parser(full_path)  # type: ignore
+                    if not tmp_parser:
+                        if self.unknown_member_policy == UnknownMemberPolicy.OMIT:
+                            logging.warning("In file %s, omitting unknown element %s (format: %s)",
+                                            self.filename, item.filename, mtype)
+                            continue
+                        elif self.unknown_member_policy == UnknownMemberPolicy.KEEP:
+                            logging.warning("In file %s, keeping unknown element %s (format: %s)",
+                                            self.filename, item.filename, mtype)
+                        else:
+                            logging.error("In file %s, element %s's format (%s) " +
+                                          "isn't supported",
+                                          self.filename, item.filename, mtype)
+                            abort = True
+                            continue
+                    if tmp_parser:
+                        tmp_parser.remove_all()
+                        os.rename(tmp_parser.output_filename, full_path)
+
+                zinfo = zipfile.ZipInfo(item.filename)  # type: ignore
+                clean_zinfo = self._clean_zipinfo(zinfo)
+                with open(full_path, 'rb') as f:
+                    zout.writestr(clean_zinfo, f.read())
+
+        shutil.rmtree(temp_folder)
+        if abort:
+            os.remove(self.output_filename)
+            return False
+        return True

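The `_clean_zipinfo` normalisation introduced above can be exercised on its own. A small sketch (plain stdlib `zipfile`, as in the parser) showing that an entry written with a scrubbed `ZipInfo` round-trips with the epoch timestamp:

```python
import io
import zipfile


def clean_zipinfo(name: str) -> zipfile.ZipInfo:
    # Mirror of the normalisation in ArchiveBasedAbstractParser:
    # fixed creator system, no comment, and the earliest timestamp
    # a zip entry can carry.
    zi = zipfile.ZipInfo(name)
    zi.create_system = 3  # Linux
    zi.comment = b''
    zi.date_time = (1980, 1, 1, 0, 0, 0)
    return zi


buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zout:
    zout.writestr(clean_zipinfo('hello.txt'), b'hello')

with zipfile.ZipFile(buf) as zin:
    info = zin.getinfo('hello.txt')
    print(info.date_time)  # (1980, 1, 1, 0, 0, 0)
```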

=====================================
libmat2/images.py
=====================================
@@ -62,9 +62,13 @@ class PNGParser(_ImageParser):
 
     def __init__(self, filename):
         super().__init__(filename)
+
+        if imghdr.what(filename) != 'png':
+            raise ValueError
+
         try:  # better fail here than later
             cairo.ImageSurface.create_from_png(self.filename)
-        except MemoryError:
+        except MemoryError:  # pragma: no cover
             raise ValueError
 
     def remove_all(self):


=====================================
libmat2/office.py
=====================================
@@ -1,149 +1,166 @@
+import logging
 import os
 import re
-import shutil
-import tempfile
-import datetime
 import zipfile
-import logging
 from typing import Dict, Set, Pattern
 
-try:  # protect against DoS
-    from defusedxml import ElementTree as ET  # type: ignore
-except ImportError:
-    import xml.etree.ElementTree as ET  # type: ignore
+import xml.etree.ElementTree as ET  # type: ignore
 
+from .archive import ArchiveBasedAbstractParser
 
-from . import abstract, parser_factory
+# pylint: disable=line-too-long
 
 # Make pyflakes happy
 assert Set
 assert Pattern
 
 def _parse_xml(full_path: str):
-    """ This function parse XML, with namespace support. """
+    """ This function parses XML, with namespace support. """
 
     namespace_map = dict()
     for _, (key, value) in ET.iterparse(full_path, ("start-ns", )):
+        # The ns[0-9]+ namespaces are reserved for internal usage, so
+        # we have to use another nomenclature.
+        if re.match('^ns[0-9]+$', key, re.I):  # pragma: no cover
+            key = 'mat' + key[2:]
+
         namespace_map[key] = value
         ET.register_namespace(key, value)
 
     return ET.parse(full_path), namespace_map
 
 
-class ArchiveBasedAbstractParser(abstract.AbstractParser):
-    """ Office files (.docx, .odt, …) are zipped files. """
-    # Those are the files that have a format that _isn't_
-    # supported by MAT2, but that we want to keep anyway.
-    files_to_keep = set()  # type: Set[str]
+def _sort_xml_attributes(full_path: str) -> bool:
+    """ Sort xml attributes lexicographically,
+    because it's possible to fingerprint producers (MS Office, Libreoffice, …)
+    since they are all using different orders.
+    """
+    tree = ET.parse(full_path)
+
+    for c in tree.getroot():
+        c[:] = sorted(c, key=lambda child: (child.tag, child.get('desc')))
+
+    tree.write(full_path, xml_declaration=True)
+    return True
+
+
+class MSOfficeParser(ArchiveBasedAbstractParser):
+    mimetypes = {
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+        'application/vnd.openxmlformats-officedocument.presentationml.presentation'
+    }
+    content_types_to_keep = {
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml',  # /word/endnotes.xml
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml',  # /word/footnotes.xml
+        'application/vnd.openxmlformats-officedocument.extended-properties+xml',  # /docProps/app.xml
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml',  # /word/document.xml
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml',  # /word/fontTable.xml
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml',  # /word/footer.xml
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml',  # /word/header.xml
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml',  # /word/styles.xml
+        'application/vnd.openxmlformats-package.core-properties+xml',  # /docProps/core.xml
+
+        # Do we want to keep the following ones?
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml',
+
+        # See https://0xacab.org/jvoisin/mat2/issues/71
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml',  # /word/numbering.xml
+    }
 
-    # Those are the files that we _do not_ want to keep,
-    # no matter if they are supported or not.
-    files_to_omit = set() # type: Set[Pattern]
 
     def __init__(self, filename):
         super().__init__(filename)
-        try:  # better fail here than later
-            zipfile.ZipFile(self.filename)
-        except zipfile.BadZipFile:
+
+        self.files_to_keep = set(map(re.compile, {  # type: ignore
+            r'^\[Content_Types\]\.xml$',
+            r'^_rels/\.rels$',
+            r'^word/_rels/document\.xml\.rels$',
+            r'^word/_rels/footer[0-9]*\.xml\.rels$',
+            r'^word/_rels/header[0-9]*\.xml\.rels$',
+
+            # https://msdn.microsoft.com/en-us/library/dd908153(v=office.12).aspx
+            r'^word/stylesWithEffects\.xml$',
+        }))
+        self.files_to_omit = set(map(re.compile, {  # type: ignore
+            r'^customXml/',
+            r'webSettings\.xml$',
+            r'^docProps/custom\.xml$',
+            r'^word/printerSettings/',
+            r'^word/theme',
+
+            # we have a whitelist in self.files_to_keep,
+            # so we can trash everything else
+            r'^word/_rels/',
+        }))
+
+        if self.__fill_files_to_keep_via_content_types() is False:
             raise ValueError
 
-    def _specific_cleanup(self, full_path: str) -> bool:
-        """ This method can be used to apply specific treatment
-        to files present in the archive."""
-        # pylint: disable=unused-argument,no-self-use
-        return True  # pragma: no cover
+    def __fill_files_to_keep_via_content_types(self) -> bool:
+        """ There is a suer-handy `[Content_Types].xml` file
+        in MS Office archives, describing what each other file contains.
+        The self.content_types_to_keep member contains a type whitelist,
+        so we're using it to fill the self.files_to_keep one.
+        """
+        with zipfile.ZipFile(self.filename) as zin:
+            if '[Content_Types].xml' not in zin.namelist():
+                return False
+            xml_data = zin.read('[Content_Types].xml')
 
-    @staticmethod
-    def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
-        zipinfo.create_system = 3  # Linux
-        zipinfo.comment = b''
-        zipinfo.date_time = (1980, 1, 1, 0, 0, 0)  # this is as early as a zipfile can be
-        return zipinfo
+        self.content_types = dict()  # type: Dict[str, str]
+        try:
+            tree = ET.fromstring(xml_data)
+        except ET.ParseError:
+            return False
+        for c in tree:
+            if 'PartName' not in c.attrib or 'ContentType' not in c.attrib:
+                continue
+            elif c.attrib['ContentType'] in self.content_types_to_keep:
+                fname = c.attrib['PartName'][1:]  # remove leading `/`
+                re_fname = re.compile('^' + re.escape(fname) + '$')
+                self.files_to_keep.add(re_fname)  # type: ignore
+        return True
 
     @staticmethod
-    def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
-        metadata = {}
-        if zipinfo.create_system == 3:  # this is Linux
-            pass
-        elif zipinfo.create_system == 2:
-            metadata['create_system'] = 'Windows'
-        else:
-            metadata['create_system'] = 'Weird'
-
-        if zipinfo.comment:
-            metadata['comment'] = zipinfo.comment  # type: ignore
+    def __remove_rsid(full_path: str) -> bool:
+        """ The method will remove "revision session ID".  We're '}rsid'
+        instead of proper parsing, since rsid can have multiple forms, like
+        `rsidRDefault`, `rsidR`, `rsids`, …
 
-        if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
-            metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
+        We're removing rsid tags in two passes, because we can't modify
+        the xml while we're iterating over it.
 
-        return metadata
-
-    def remove_all(self) -> bool:
-        with zipfile.ZipFile(self.filename) as zin,\
-             zipfile.ZipFile(self.output_filename, 'w') as zout:
+        For more details, see
+        - https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
+        - https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/
+        """
+        try:
+            tree, namespace = _parse_xml(full_path)
+        except ET.ParseError:
+            return False
 
-            temp_folder = tempfile.mkdtemp()
+        # rsid, tags or attributes, are always under the `w` namespace
+        if 'w' not in namespace.keys():
+            return True
 
-            for item in zin.infolist():
-                if item.filename[-1] == '/':  # `is_dir` is added in Python3.6
-                    continue  # don't keep empty folders
+        parent_map = {c:p for p in tree.iter() for c in p}
 
-                zin.extract(member=item, path=temp_folder)
-                full_path = os.path.join(temp_folder, item.filename)
+        elements_to_remove = list()
+        for item in tree.iterfind('.//', namespace):
+            if '}rsid' in item.tag.strip().lower():  # rsid as tag
+                elements_to_remove.append(item)
+                continue
+            for key in list(item.attrib.keys()):  # rsid as attribute
+                if '}rsid' in key.lower():
+                    del item.attrib[key]
 
-                if self._specific_cleanup(full_path) is False:
-                    shutil.rmtree(temp_folder)
-                    os.remove(self.output_filename)
-                    logging.warning("Something went wrong during deep cleaning of %s",
-                                    item.filename)
-                    return False
+        for element in elements_to_remove:
+            parent_map[element].remove(element)
 
-                if item.filename in self.files_to_keep:
-                    # those files aren't supported, but we want to add them anyway
-                    pass
-                elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
-                    continue
-                else:
-                    # supported files that we want to clean then add
-                    tmp_parser, mtype = parser_factory.get_parser(full_path)  # type: ignore
-                    if not tmp_parser:
-                        shutil.rmtree(temp_folder)
-                        os.remove(self.output_filename)
-                        logging.error("In file %s, element %s's format (%s) " +
-                                      "isn't supported",
-                                      self.filename, item.filename, mtype)
-                        return False
-                    tmp_parser.remove_all()
-                    os.rename(tmp_parser.output_filename, full_path)
-
-                zinfo = zipfile.ZipInfo(item.filename)  # type: ignore
-                clean_zinfo = self._clean_zipinfo(zinfo)
-                with open(full_path, 'rb') as f:
-                    zout.writestr(clean_zinfo, f.read())
-
-        shutil.rmtree(temp_folder)
+        tree.write(full_path, xml_declaration=True)
         return True
 
-
-class MSOfficeParser(ArchiveBasedAbstractParser):
-    mimetypes = {
-        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
-        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
-        'application/vnd.openxmlformats-officedocument.presentationml.presentation'
-    }
-    files_to_keep = {
-        '[Content_Types].xml',
-        '_rels/.rels',
-        'word/_rels/document.xml.rels',
-        'word/document.xml',
-        'word/fontTable.xml',
-        'word/settings.xml',
-        'word/styles.xml',
-    }
-    files_to_omit = set(map(re.compile, {  # type: ignore
-        '^docProps/',
-    }))
-
     @staticmethod
     def __remove_revisions(full_path: str) -> bool:
         """ In this function, we're changing the XML document in several
@@ -152,7 +169,8 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
         """
         try:
             tree, namespace = _parse_xml(full_path)
-        except ET.ParseError:
+        except ET.ParseError as e:
+            logging.error("Unable to parse %s: %s", full_path, e)
             return False
 
         # Revisions are either deletions (`w:del`) or
@@ -172,7 +190,7 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
 
         elements = list()
         for element in tree.iterfind('.//w:ins', namespace):
-            for position, item in enumerate(tree.iter()):  #pragma: no cover
+            for position, item in enumerate(tree.iter()):  # pragma: no cover
                 if item == element:
                     for children in element.iterfind('./*'):
                         elements.append((element, position, children))
@@ -182,13 +200,100 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
             parent_map[element].remove(element)
 
         tree.write(full_path, xml_declaration=True)
+        return True
 
+    def __remove_content_type_members(self, full_path: str) -> bool:
+        """ The method will remove the dangling references
+        form the [Content_Types].xml file, since MS office doesn't like them
+        """
+        try:
+            tree, namespace = _parse_xml(full_path)
+        except ET.ParseError:  # pragma: no cover
+            return False
+
+        if len(namespace.items()) != 1:
+            return False  # there should be only one namespace for Types
+
+        removed_fnames = set()
+        with zipfile.ZipFile(self.filename) as zin:
+            for fname in [item.filename for item in zin.infolist()]:
+                for file_to_omit in self.files_to_omit:
+                    if file_to_omit.search(fname):
+                        matches = map(lambda r: r.search(fname), self.files_to_keep)
+                        if any(matches):  # the file is whitelisted
+                            continue
+                        removed_fnames.add(fname)
+                        break
+
+        root = tree.getroot()
+        for item in root.findall('{%s}Override' % namespace['']):
+            name = item.attrib['PartName'][1:]  # remove the leading '/'
+            if name in removed_fnames:
+                root.remove(item)
+
+        tree.write(full_path, xml_declaration=True)
         return True
 
     def _specific_cleanup(self, full_path: str) -> bool:
-        if full_path.endswith('/word/document.xml'):
+        # pylint: disable=too-many-return-statements
+        if os.stat(full_path).st_size == 0:  # Don't process empty files
+            return True
+
+        if not full_path.endswith('.xml'):
+            return True
+
+        if full_path.endswith('/[Content_Types].xml'):
+            # this file contains references to files that we might
+            # remove, and MS Office doesn't like dangling references
+            if self.__remove_content_type_members(full_path) is False:
+                return False
+        elif full_path.endswith('/word/document.xml'):
             # this file contains the revisions
-            return self.__remove_revisions(full_path)
+            if self.__remove_revisions(full_path) is False:
+                return False
+        elif full_path.endswith('/docProps/app.xml'):
+            # This file must be present and valid,
+            # so we're removing as much as we can.
+            with open(full_path, 'wb') as f:
+                f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
+                f.write(b'<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">')
+                f.write(b'</Properties>')
+        elif full_path.endswith('/docProps/core.xml'):
+            # This file must be present and valid,
+            # so we're removing as much as we can.
+            with open(full_path, 'wb') as f:
+                f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
+                f.write(b'<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">')
+                f.write(b'</cp:coreProperties>')
+
+
+        if self.__remove_rsid(full_path) is False:
+            return False
+
+        try:
+            _sort_xml_attributes(full_path)
+        except ET.ParseError as e:  # pragma: no cover
+            logging.error("Unable to parse %s: %s", full_path, e)
+            return False
+
+        # This is awful, I'm sorry.
+        #
+        # Microsoft Office isn't happy when the `mc:Ignorable`
+        # attribute references namespaces that aren't present in the xml
+        # file, so instead of trying to remove this specific attribute
+        # with etree, we're removing it with a regexp.
+        #
+        # Since we're the ones producing this file, via the call to
+        # _sort_xml_attributes, there won't be any "funny tricks".
+        # Worst case, the tag isn't present, and everything is fine.
+        #
+        # see: https://docs.microsoft.com/en-us/dotnet/framework/wpf/advanced/mc-ignorable-attribute
+        with open(full_path, 'rb') as f:
+            text = f.read()
+            out = re.sub(b'mc:Ignorable="[^"]*"', b'', text, 1)
+        with open(full_path, 'wb') as f:
+            f.write(out)
+
         return True
 
     def get_meta(self) -> Dict[str, str]:
@@ -223,26 +328,31 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
         'application/vnd.oasis.opendocument.formula',
         'application/vnd.oasis.opendocument.image',
     }
-    files_to_keep = {
-        'META-INF/manifest.xml',
-        'content.xml',
-        'manifest.rdf',
-        'mimetype',
-        'settings.xml',
-        'styles.xml',
-    }
-    files_to_omit = set(map(re.compile, {  # type: ignore
-        r'^meta\.xml$',
-        '^Configurations2/',
-        '^Thumbnails/',
-    }))
 
 
+    def __init__(self, filename):
+        super().__init__(filename)
+
+        self.files_to_keep = set(map(re.compile, {  # type: ignore
+            r'^META-INF/manifest\.xml$',
+            r'^content\.xml$',
+            r'^manifest\.rdf$',
+            r'^mimetype$',
+            r'^settings\.xml$',
+            r'^styles\.xml$',
+        }))
+        self.files_to_omit = set(map(re.compile, {  # type: ignore
+            r'^meta\.xml$',
+            r'^Configurations2/',
+            r'^Thumbnails/',
+        }))
+
     @staticmethod
     def __remove_revisions(full_path: str) -> bool:
         try:
             tree, namespace = _parse_xml(full_path)
-        except ET.ParseError:
+        except ET.ParseError as e:
+            logging.error("Unable to parse %s: %s", full_path, e)
             return False
 
         if 'office' not in namespace.keys():  # no revisions in the current file
@@ -253,12 +363,22 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
                 text.remove(changes)
 
         tree.write(full_path, xml_declaration=True)
-
         return True
 
     def _specific_cleanup(self, full_path: str) -> bool:
-        if os.path.basename(full_path) == 'content.xml':
-            return self.__remove_revisions(full_path)
+        if os.stat(full_path).st_size == 0:  # Don't process empty files
+            return True
+
+        if os.path.basename(full_path).endswith('.xml'):
+            if os.path.basename(full_path) == 'content.xml':
+                if self.__remove_revisions(full_path) is False:
+                    return False
+
+            try:
+                _sort_xml_attributes(full_path)
+            except ET.ParseError as e:
+                logging.error("Unable to parse %s: %s", full_path, e)
+                return False
         return True
 
     def get_meta(self) -> Dict[str, str]:


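The `__remove_content_type_members` hunk above prunes `<Override>` entries from `[Content_Types].xml` for archive members that were removed, so the cleaned file keeps no dangling references. A minimal standalone sketch of that pruning step (the sample XML, the function name, and parsing from a string instead of a file path are illustrative simplifications, not mat2's actual API):

```python
import io
import xml.etree.ElementTree as ET

CT_NS = 'http://schemas.openxmlformats.org/package/2006/content-types'

SAMPLE = (
    '<?xml version="1.0"?>'
    '<Types xmlns="%s">'
    '<Override PartName="/word/document.xml" ContentType="a"/>'
    '<Override PartName="/docProps/custom.xml" ContentType="b"/>'
    '</Types>' % CT_NS
)

def drop_overrides(xml_text, removed_fnames):
    """Remove <Override> entries whose PartName (minus the leading '/')
    names a removed archive member, mirroring the logic in the hunk above."""
    tree = ET.parse(io.StringIO(xml_text))
    root = tree.getroot()
    for item in root.findall('{%s}Override' % CT_NS):
        name = item.attrib['PartName'][1:]  # strip the leading '/'
        if name in removed_fnames:
            root.remove(item)
    return ET.tostring(root, encoding='unicode')

out = drop_overrides(SAMPLE, {'docProps/custom.xml'})
print('docProps/custom.xml' in out)  # False: the dangling reference is gone
```

Note that `findall()` returns a list, so removing elements from `root` while iterating over it is safe.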
=====================================
libmat2/pdf.py
=====================================
@@ -17,7 +17,7 @@ from gi.repository import Poppler, GLib
 from . import abstract
 
 poppler_version = Poppler.get_version()
-if LooseVersion(poppler_version) < LooseVersion('0.46'): # pragma: no cover
+if LooseVersion(poppler_version) < LooseVersion('0.46'):  # pragma: no cover
     raise ValueError("MAT2 needs at least Poppler version 0.46 to work. \
 The installed version is %s." % poppler_version)  # pragma: no cover
 
@@ -118,7 +118,6 @@ class PDFParser(abstract.AbstractParser):
         document.save('file://' + os.path.abspath(out_file))
         return True
 
-
     @staticmethod
     def __parse_metadata_field(data: str) -> dict:
         metadata = {}


=====================================
libmat2/torrent.py
=====================================
@@ -21,7 +21,6 @@ class TorrentParser(abstract.AbstractParser):
                 metadata[key.decode('utf-8')] = value
         return metadata
 
-
     def remove_all(self) -> bool:
         cleaned = dict()
         for key, value in self.dict_repr.items():


=====================================
mat2
=====================================
@@ -1,21 +1,20 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
 
 import os
 from typing import Tuple
 import sys
-import itertools
 import mimetypes
 import argparse
-import multiprocessing
 import logging
 
 try:
-    from libmat2 import parser_factory, UNSUPPORTED_EXTENSIONS, check_dependencies
+    from libmat2 import parser_factory, UNSUPPORTED_EXTENSIONS
+    from libmat2 import check_dependencies, UnknownMemberPolicy
 except ValueError as e:
     print(e)
     sys.exit(1)
 
-__version__ = '0.3.1'
+__version__ = '0.4.0'
 
 def __check_file(filename: str, mode: int=os.R_OK) -> bool:
     if not os.path.exists(filename):
@@ -37,10 +36,13 @@ def create_arg_parser():
                         version='MAT2 %s' % __version__)
     parser.add_argument('-l', '--list', action='store_true',
                         help='list all supported fileformats')
-    parser.add_argument('-c', '--check-dependencies', action='store_true',
+    parser.add_argument('--check-dependencies', action='store_true',
                         help='check if MAT2 has all the dependencies it needs')
     parser.add_argument('-V', '--verbose', action='store_true',
                         help='show more verbose status information')
+    parser.add_argument('--unknown-members', metavar='policy', default='abort',
+                        help='how to handle unknown members of archive-style files (policy should' +
+                        ' be one of: %s)' % ', '.join(p.value for p in UnknownMemberPolicy))
 
 
     info = parser.add_mutually_exclusive_group()
@@ -67,8 +69,8 @@ def show_meta(filename: str):
         except UnicodeEncodeError:
             print("  %s: harmful content" % k)
 
-def clean_meta(params: Tuple[str, bool]) -> bool:
-    filename, is_lightweight = params
+def clean_meta(params: Tuple[str, bool, UnknownMemberPolicy]) -> bool:
+    filename, is_lightweight, unknown_member_policy = params
     if not __check_file(filename, os.R_OK|os.W_OK):
         return False
 
@@ -76,6 +78,7 @@ def clean_meta(params: Tuple[str, bool]) -> bool:
     if p is None:
         print("[-] %s's format (%s) is not supported" % (filename, mtype))
         return False
+    p.unknown_member_policy = unknown_member_policy
     if is_lightweight:
         return p.remove_all_lightweight()
     return p.remove_all()
@@ -133,12 +136,16 @@ def main():
         return 0
 
     else:
-        p = multiprocessing.Pool()
-        mode = (args.lightweight is True)
-        l = zip(__get_files_recursively(args.files), itertools.repeat(mode))
+        unknown_member_policy = UnknownMemberPolicy(args.unknown_members)
+        if unknown_member_policy == UnknownMemberPolicy.KEEP:
+            logging.warning('Keeping unknown member files may leak metadata in the resulting file!')
+
+        no_failure = True
+        for f in __get_files_recursively(args.files):
+            if clean_meta([f, args.lightweight, unknown_member_policy]) is False:
+                no_failure = False
+        return 0 if no_failure is True else -1
 
-        ret = list(p.imap_unordered(clean_meta, list(l)))
-        return 0 if all(ret) else -1
 
 if __name__ == '__main__':
     sys.exit(main())


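The rewritten `main()` above validates `--unknown-members` simply by constructing an `UnknownMemberPolicy` from the user's string. A minimal sketch of how such an `enum.Enum`-based policy validates CLI input; the member names are assumptions inferred from the diff, which only shows `KEEP` and the `'abort'` default:

```python
import enum

class UnknownMemberPolicy(enum.Enum):
    # ABORT and KEEP appear in the diff; OMIT is an assumed third policy.
    ABORT = 'abort'
    OMIT = 'omit'
    KEEP = 'keep'

# Constructing the enum from the CLI string validates it in one step:
policy = UnknownMemberPolicy('abort')
print(policy is UnknownMemberPolicy.ABORT)  # True

# An unknown string raises ValueError, so a bad --unknown-members value
# fails fast instead of silently falling through:
try:
    UnknownMemberPolicy('shred')
except ValueError:
    print('rejected')
```

This is also why the help string can enumerate the valid policies with `', '.join(p.value for p in UnknownMemberPolicy)`.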
=====================================
nautilus/mat2.py
=====================================
@@ -104,7 +104,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
         box.add(self.__create_treeview())
         window.show_all()
 
-
     @staticmethod
     def __validate(fileinfo) -> Tuple[bool, str]:
         """ Validate if a given file FileInfo `fileinfo` can be processed.
@@ -115,7 +114,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
             return False, "Not writeable"
         return True, ""
 
-
     def __create_treeview(self) -> Gtk.TreeView:
         liststore = Gtk.ListStore(GdkPixbuf.Pixbuf, str, str)
         treeview = Gtk.TreeView(model=liststore)
@@ -148,7 +146,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
         treeview.show_all()
         return treeview
 
-
     def __create_progressbar(self) -> Gtk.ProgressBar:
         """ Create the progressbar used to notify that files are currently
         being processed.
@@ -211,7 +208,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
         processing_queue.put(None)  # signal that we processed all the files
         return True
 
-
     def __cb_menu_activate(self, menu, files):
         """ This method is called when the user clicked the "clean metadata"
         menu item.
@@ -228,7 +224,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
         thread.daemon = True
         thread.start()
 
-
     def get_background_items(self, window, file):
         """ https://bugzilla.gnome.org/show_bug.cgi?id=784278 """
         return None


=====================================
setup.py
=====================================
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
 
 setuptools.setup(
     name="mat2",
-    version='0.3.1',
+    version='0.4.0',
     author="Julien (jvoisin) Voisin",
     author_email="julien.voisin+mat2 at dustri.org",
     description="A handy tool to trash your metadata",
@@ -20,7 +20,7 @@ setuptools.setup(
         'pycairo',
     ],
     packages=setuptools.find_packages(exclude=('tests', )),
-    classifiers=(
+    classifiers=[
         "Development Status :: 3 - Alpha",
         "Environment :: Console",
         "License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)",
@@ -28,7 +28,7 @@ setuptools.setup(
         "Programming Language :: Python :: 3 :: Only",
         "Topic :: Security",
         "Intended Audience :: End Users/Desktop",
-    ),
+    ],
     project_urls={
         'bugtacker': 'https://0xacab.org/jvoisin/mat2/issues',
     },


=====================================
tests/data/broken_xml_content_types.docx
=====================================
Binary files /dev/null and b/tests/data/broken_xml_content_types.docx differ


=====================================
tests/data/malformed_content_types.docx
=====================================
Binary files /dev/null and b/tests/data/malformed_content_types.docx differ


=====================================
tests/data/no_content_types.docx
=====================================
Binary files /dev/null and b/tests/data/no_content_types.docx differ


=====================================
tests/data/office_revision_session_ids.docx
=====================================
Binary files /dev/null and b/tests/data/office_revision_session_ids.docx differ


=====================================
tests/test_climat2.py
=====================================
@@ -8,12 +8,16 @@ class TestHelp(unittest.TestCase):
     def test_help(self):
         proc = subprocess.Popen(['./mat2', '--help'], stdout=subprocess.PIPE)
         stdout, _ = proc.communicate()
-        self.assertIn(b'usage: mat2 [-h] [-v] [-l] [-c] [-V] [-s | -L] [files [files ...]]', stdout)
+        self.assertIn(b'usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]',
+                      stdout)
+        self.assertIn(b'[--unknown-members policy] [-s | -L]', stdout)
 
     def test_no_arg(self):
         proc = subprocess.Popen(['./mat2'], stdout=subprocess.PIPE)
         stdout, _ = proc.communicate()
-        self.assertIn(b'usage: mat2 [-h] [-v] [-l] [-c] [-V] [-s | -L] [files [files ...]]', stdout)
+        self.assertIn(b'usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]',
+                      stdout)
+        self.assertIn(b'[--unknown-members policy] [-s | -L]', stdout)
 
 
 class TestVersion(unittest.TestCase):
@@ -46,7 +50,10 @@ class TestReturnValue(unittest.TestCase):
 
 class TestCleanFolder(unittest.TestCase):
     def test_jpg(self):
-        os.mkdir('./tests/data/folder/')
+        try:
+            os.mkdir('./tests/data/folder/')
+        except FileExistsError:
+            pass
         shutil.copy('./tests/data/dirty.jpg', './tests/data/folder/clean1.jpg')
         shutil.copy('./tests/data/dirty.jpg', './tests/data/folder/clean2.jpg')
 
@@ -70,7 +77,6 @@ class TestCleanFolder(unittest.TestCase):
         shutil.rmtree('./tests/data/folder/')
 
 
-
 class TestCleanMeta(unittest.TestCase):
     def test_jpg(self):
         shutil.copy('./tests/data/dirty.jpg', './tests/data/clean.jpg')


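The `test_jpg` change above guards `os.mkdir` with a `FileExistsError` handler so a leftover folder from a previous run doesn't abort the test. A quick sketch of that idempotence (the path is illustrative); `os.makedirs(..., exist_ok=True)` is an equivalent, terser alternative:

```python
import os
import tempfile

d = os.path.join(tempfile.mkdtemp(), 'folder')

# Same effect as the try/except around os.mkdir in the hunk above:
os.makedirs(d, exist_ok=True)
os.makedirs(d, exist_ok=True)  # second call is a no-op, no exception
print(os.path.isdir(d))  # True
```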
=====================================
tests/test_corrupted_files.py
=====================================
@@ -1,11 +1,17 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
 
 import unittest
 import shutil
 import os
+import logging
 
 from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
 
+# No need to log messages: should something go wrong,
+# the testsuite _will_ fail.
+logger = logging.getLogger()
+logger.setLevel(logging.FATAL)
+
 
 class TestInexistentFiles(unittest.TestCase):
     def test_ro(self):
@@ -53,16 +59,21 @@ class TestUnsupportedFiles(unittest.TestCase):
 class TestCorruptedEmbedded(unittest.TestCase):
     def test_docx(self):
         shutil.copy('./tests/data/embedded_corrupted.docx', './tests/data/clean.docx')
-        parser, mimetype = parser_factory.get_parser('./tests/data/clean.docx')
+        parser, _ = parser_factory.get_parser('./tests/data/clean.docx')
         self.assertFalse(parser.remove_all())
         self.assertIsNotNone(parser.get_meta())
         os.remove('./tests/data/clean.docx')
 
     def test_odt(self):
+        expected = {
+                'create_system': 'Weird',
+                'date_time': '2018-06-10 17:18:18',
+                'meta.xml': 'harmful content'
+                }
         shutil.copy('./tests/data/embedded_corrupted.odt', './tests/data/clean.odt')
-        parser, mimetype = parser_factory.get_parser('./tests/data/clean.odt')
+        parser, _ = parser_factory.get_parser('./tests/data/clean.odt')
         self.assertFalse(parser.remove_all())
-        self.assertEqual(parser.get_meta(), {'create_system': 'Weird', 'date_time': '2018-06-10 17:18:18', 'meta.xml': 'harmful content'})
+        self.assertEqual(parser.get_meta(), expected)
         os.remove('./tests/data/clean.odt')
 
 
@@ -75,6 +86,26 @@ class TestExplicitelyUnsupportedFiles(unittest.TestCase):
         os.remove('./tests/data/clean.py')
 
 
+class TestWrongContentTypesFileOffice(unittest.TestCase):
+    def test_office_incomplete(self):
+        shutil.copy('./tests/data/malformed_content_types.docx', './tests/data/clean.docx')
+        p = office.MSOfficeParser('./tests/data/clean.docx')
+        self.assertIsNotNone(p)
+        self.assertFalse(p.remove_all())
+        os.remove('./tests/data/clean.docx')
+
+    def test_office_broken(self):
+        shutil.copy('./tests/data/broken_xml_content_types.docx', './tests/data/clean.docx')
+        with self.assertRaises(ValueError):
+            office.MSOfficeParser('./tests/data/clean.docx')
+        os.remove('./tests/data/clean.docx')
+
+    def test_office_absent(self):
+        shutil.copy('./tests/data/no_content_types.docx', './tests/data/clean.docx')
+        with self.assertRaises(ValueError):
+            office.MSOfficeParser('./tests/data/clean.docx')
+        os.remove('./tests/data/clean.docx')
+
 class TestCorruptedFiles(unittest.TestCase):
     def test_pdf(self):
         shutil.copy('./tests/data/dirty.png', './tests/data/clean.png')
@@ -90,7 +121,7 @@ class TestCorruptedFiles(unittest.TestCase):
 
     def test_png2(self):
         shutil.copy('./tests/test_libmat2.py', './tests/clean.png')
-        parser, mimetype = parser_factory.get_parser('./tests/clean.png')
+        parser, _ = parser_factory.get_parser('./tests/clean.png')
         self.assertIsNone(parser)
         os.remove('./tests/clean.png')
 
@@ -134,25 +165,26 @@ class TestCorruptedFiles(unittest.TestCase):
 
     def test_bmp(self):
         shutil.copy('./tests/data/dirty.png', './tests/data/clean.bmp')
-        harmless.HarmlessParser('./tests/data/clean.bmp')
+        ret = harmless.HarmlessParser('./tests/data/clean.bmp')
+        self.assertIsNotNone(ret)
         os.remove('./tests/data/clean.bmp')
 
     def test_docx(self):
         shutil.copy('./tests/data/dirty.png', './tests/data/clean.docx')
         with self.assertRaises(ValueError):
-             office.MSOfficeParser('./tests/data/clean.docx')
+            office.MSOfficeParser('./tests/data/clean.docx')
         os.remove('./tests/data/clean.docx')
 
     def test_flac(self):
         shutil.copy('./tests/data/dirty.png', './tests/data/clean.flac')
         with self.assertRaises(ValueError):
-             audio.FLACParser('./tests/data/clean.flac')
+            audio.FLACParser('./tests/data/clean.flac')
         os.remove('./tests/data/clean.flac')
 
     def test_mp3(self):
         shutil.copy('./tests/data/dirty.png', './tests/data/clean.mp3')
         with self.assertRaises(ValueError):
-             audio.MP3Parser('./tests/data/clean.mp3')
+            audio.MP3Parser('./tests/data/clean.mp3')
         os.remove('./tests/data/clean.mp3')
 
     def test_jpg(self):


=====================================
tests/test_deep_cleaning.py
=====================================
@@ -0,0 +1,134 @@
+#!/usr/bin/env python3
+
+import unittest
+import shutil
+import os
+import zipfile
+import tempfile
+
+from libmat2 import office, parser_factory
+
+class TestZipMetadata(unittest.TestCase):
+    def __check_deep_meta(self, p):
+        tempdir = tempfile.mkdtemp()
+        zipin = zipfile.ZipFile(p.filename)
+        zipin.extractall(tempdir)
+
+        for subdir, dirs, files in os.walk(tempdir):
+            for f in files:
+                complete_path = os.path.join(subdir, f)
+                inside_p, _ = parser_factory.get_parser(complete_path)
+                if inside_p is None:
+                    continue
+                self.assertEqual(inside_p.get_meta(), {})
+        shutil.rmtree(tempdir)
+
+    def __check_zip_meta(self, p):
+        zipin = zipfile.ZipFile(p.filename)
+        for item in zipin.infolist():
+            self.assertEqual(item.comment, b'')
+            self.assertEqual(item.date_time, (1980, 1, 1, 0, 0, 0))
+            self.assertEqual(item.create_system, 3)  # 3 is UNIX
+
+    def test_office(self):
+        shutil.copy('./tests/data/dirty.docx', './tests/data/clean.docx')
+        p = office.MSOfficeParser('./tests/data/clean.docx')
+
+        meta = p.get_meta()
+        self.assertIsNotNone(meta)
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        p = office.MSOfficeParser('./tests/data/clean.cleaned.docx')
+        self.assertEqual(p.get_meta(), {})
+
+        self.__check_zip_meta(p)
+        self.__check_deep_meta(p)
+
+        os.remove('./tests/data/clean.docx')
+        os.remove('./tests/data/clean.cleaned.docx')
+
+    def test_libreoffice(self):
+        shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
+        p = office.LibreOfficeParser('./tests/data/clean.odt')
+
+        meta = p.get_meta()
+        self.assertIsNotNone(meta)
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        p = office.LibreOfficeParser('./tests/data/clean.cleaned.odt')
+        self.assertEqual(p.get_meta(), {})
+
+        self.__check_zip_meta(p)
+        self.__check_deep_meta(p)
+
+        os.remove('./tests/data/clean.odt')
+        os.remove('./tests/data/clean.cleaned.odt')
+
+
+class TestZipOrder(unittest.TestCase):
+    def test_libreoffice(self):
+        shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
+        p = office.LibreOfficeParser('./tests/data/clean.odt')
+
+        meta = p.get_meta()
+        self.assertIsNotNone(meta)
+
+        is_unordered = False
+        with zipfile.ZipFile('./tests/data/clean.odt') as zin:
+            previous_name = ''
+            for item in zin.infolist():
+                if previous_name == '':
+                    previous_name = item.filename
+                    continue
+                elif item.filename < previous_name:
+                    is_unordered = True
+                    break
+        self.assertTrue(is_unordered)
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        with zipfile.ZipFile('./tests/data/clean.cleaned.odt') as zin:
+            previous_name = ''
+            for item in zin.infolist():
+                if previous_name == '':
+                    previous_name = item.filename
+                    continue
+                self.assertGreaterEqual(item.filename, previous_name)
+
+        os.remove('./tests/data/clean.odt')
+        os.remove('./tests/data/clean.cleaned.odt')
+
+class TestRsidRemoval(unittest.TestCase):
+    def test_office(self):
+        shutil.copy('./tests/data/office_revision_session_ids.docx', './tests/data/clean.docx')
+        p = office.MSOfficeParser('./tests/data/clean.docx')
+
+        meta = p.get_meta()
+        self.assertIsNotNone(meta)
+
+        how_many_rsid = 0
+        with zipfile.ZipFile('./tests/data/clean.docx') as zin:
+            for item in zin.infolist():
+                if not item.filename.endswith('.xml'):
+                    continue
+                num = zin.read(item).decode('utf-8').lower().count('w:rsid')
+                how_many_rsid += num
+        self.assertEqual(how_many_rsid, 11)
+
+        ret = p.remove_all()
+        self.assertTrue(ret)
+
+        with zipfile.ZipFile('./tests/data/clean.cleaned.docx') as zin:
+            for item in zin.infolist():
+                if not item.filename.endswith('.xml'):
+                    continue
+                num = zin.read(item).decode('utf-8').lower().count('w:rsid')
+                self.assertEqual(num, 0)
+
+        os.remove('./tests/data/clean.docx')
+        os.remove('./tests/data/clean.cleaned.docx')


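The `__check_zip_meta` helper above asserts that every cleaned archive member carries an empty comment, the DOS-epoch timestamp `(1980, 1, 1, 0, 0, 0)`, and `create_system == 3` (UNIX), and `TestZipOrder` asserts the members are sorted. A standalone sketch of producing an archive with exactly those properties (illustrative only, not mat2's actual archive code):

```python
import os
import tempfile
import zipfile

def normalize_zip(src: str, dst: str) -> None:
    """Rewrite `src` into `dst` with sorted, metadata-free members --
    the properties asserted by __check_zip_meta and TestZipOrder above."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, 'w') as zout:
        for item in sorted(zin.infolist(), key=lambda i: i.filename):
            clean = zipfile.ZipInfo(item.filename,
                                    date_time=(1980, 1, 1, 0, 0, 0))
            clean.create_system = 3  # 3 means UNIX
            zout.writestr(clean, zin.read(item))  # comment defaults to b''

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'dirty.zip')
dst = os.path.join(tmpdir, 'clean.zip')
with zipfile.ZipFile(src, 'w') as z:
    z.writestr('b.txt', 'bbb')  # deliberately out of order
    z.writestr('a.txt', 'aaa')

normalize_zip(src, dst)
with zipfile.ZipFile(dst) as z:
    cleaned = [(i.filename, i.date_time, i.create_system, i.comment)
               for i in z.infolist()]
print(cleaned[0])  # ('a.txt', (1980, 1, 1, 0, 0, 0), 3, b'')
```

Building a fresh `ZipInfo` per member, rather than reusing the incoming one, is what drops the original timestamps and comments.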
=====================================
tests/test_libmat2.py
=====================================
@@ -1,10 +1,9 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
 
 import unittest
 import shutil
 import os
 import zipfile
-import tempfile
 
 from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
 from libmat2 import check_dependencies
@@ -13,7 +12,7 @@ from libmat2 import check_dependencies
 class TestCheckDependencies(unittest.TestCase):
     def test_deps(self):
         ret = check_dependencies()
-        for key, value in ret.items():
+        for value in ret.values():
             self.assertTrue(value)
 
 
@@ -56,8 +55,8 @@ class TestGetMeta(unittest.TestCase):
         self.assertEqual(meta['producer'], 'pdfTeX-1.40.14')
         self.assertEqual(meta['creator'], "'Certified by IEEE PDFeXpress at 03/19/2016 2:56:07 AM'")
         self.assertEqual(meta['DocumentID'], "uuid:4a1a79c8-404e-4d38-9580-5bc081036e61")
-        self.assertEqual(meta['PTEX.Fullbanner'], "This is pdfTeX, Version " \
-                "3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea " \
+        self.assertEqual(meta['PTEX.Fullbanner'], "This is pdfTeX, Version "
+                "3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea "
                 "version 6.1.1")
 
     def test_torrent(self):
@@ -182,70 +181,6 @@ class TestRevisionsCleaning(unittest.TestCase):
         os.remove('./tests/data/revision_clean.docx')
         os.remove('./tests/data/revision_clean.cleaned.docx')
 
-
-class TestDeepCleaning(unittest.TestCase):
-    def __check_deep_meta(self, p):
-        tempdir = tempfile.mkdtemp()
-        zipin = zipfile.ZipFile(p.filename)
-        zipin.extractall(tempdir)
-
-        for subdir, dirs, files in os.walk(tempdir):
-            for f in files:
-                complete_path = os.path.join(subdir, f)
-                inside_p, _ = parser_factory.get_parser(complete_path)
-                if inside_p is None:
-                    continue
-                self.assertEqual(inside_p.get_meta(), {})
-        shutil.rmtree(tempdir)
-
-
-    def __check_zip_meta(self, p):
-        zipin = zipfile.ZipFile(p.filename)
-        for item in zipin.infolist():
-            self.assertEqual(item.comment, b'')
-            self.assertEqual(item.date_time, (1980, 1, 1, 0, 0, 0))
-            self.assertEqual(item.create_system, 3)  # 3 is UNIX
-
-
-    def test_office(self):
-        shutil.copy('./tests/data/dirty.docx', './tests/data/clean.docx')
-        p = office.MSOfficeParser('./tests/data/clean.docx')
-
-        meta = p.get_meta()
-        self.assertIsNotNone(meta)
-
-        ret = p.remove_all()
-        self.assertTrue(ret)
-
-        p = office.MSOfficeParser('./tests/data/clean.cleaned.docx')
-        self.assertEqual(p.get_meta(), {})
-
-        self.__check_zip_meta(p)
-        self.__check_deep_meta(p)
-
-        os.remove('./tests/data/clean.docx')
-        os.remove('./tests/data/clean.cleaned.docx')
-
-
-    def test_libreoffice(self):
-        shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
-        p = office.LibreOfficeParser('./tests/data/clean.odt')
-
-        meta = p.get_meta()
-        self.assertIsNotNone(meta)
-
-        ret = p.remove_all()
-        self.assertTrue(ret)
-
-        p = office.LibreOfficeParser('./tests/data/clean.cleaned.odt')
-        self.assertEqual(p.get_meta(), {})
-
-        self.__check_zip_meta(p)
-        self.__check_deep_meta(p)
-
-        os.remove('./tests/data/clean.odt')
-        os.remove('./tests/data/clean.cleaned.odt')
-
 class TestLightWeightCleaning(unittest.TestCase):
     def test_pdf(self):
         shutil.copy('./tests/data/dirty.pdf', './tests/data/clean.pdf')
@@ -294,9 +229,11 @@ class TestCleaning(unittest.TestCase):
         p = pdf.PDFParser('./tests/data/clean.cleaned.pdf')
         expected_meta = {'creation-date': -1, 'format': 'PDF-1.5', 'mod-date': -1}
         self.assertEqual(p.get_meta(), expected_meta)
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.pdf')
         os.remove('./tests/data/clean.cleaned.pdf')
+        os.remove('./tests/data/clean.cleaned.cleaned.pdf')
 
     def test_png(self):
         shutil.copy('./tests/data/dirty.png', './tests/data/clean.png')
@@ -310,9 +247,11 @@ class TestCleaning(unittest.TestCase):
 
         p = images.PNGParser('./tests/data/clean.cleaned.png')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.png')
         os.remove('./tests/data/clean.cleaned.png')
+        os.remove('./tests/data/clean.cleaned.cleaned.png')
 
     def test_jpg(self):
         shutil.copy('./tests/data/dirty.jpg', './tests/data/clean.jpg')
@@ -326,9 +265,11 @@ class TestCleaning(unittest.TestCase):
 
         p = images.JPGParser('./tests/data/clean.cleaned.jpg')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.jpg')
         os.remove('./tests/data/clean.cleaned.jpg')
+        os.remove('./tests/data/clean.cleaned.cleaned.jpg')
 
     def test_mp3(self):
         shutil.copy('./tests/data/dirty.mp3', './tests/data/clean.mp3')
@@ -342,9 +283,11 @@ class TestCleaning(unittest.TestCase):
 
         p = audio.MP3Parser('./tests/data/clean.cleaned.mp3')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.mp3')
         os.remove('./tests/data/clean.cleaned.mp3')
+        os.remove('./tests/data/clean.cleaned.cleaned.mp3')
 
     def test_ogg(self):
         shutil.copy('./tests/data/dirty.ogg', './tests/data/clean.ogg')
@@ -358,9 +301,11 @@ class TestCleaning(unittest.TestCase):
 
         p = audio.OGGParser('./tests/data/clean.cleaned.ogg')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.ogg')
         os.remove('./tests/data/clean.cleaned.ogg')
+        os.remove('./tests/data/clean.cleaned.cleaned.ogg')
 
     def test_flac(self):
         shutil.copy('./tests/data/dirty.flac', './tests/data/clean.flac')
@@ -374,9 +319,11 @@ class TestCleaning(unittest.TestCase):
 
         p = audio.FLACParser('./tests/data/clean.cleaned.flac')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.flac')
         os.remove('./tests/data/clean.cleaned.flac')
+        os.remove('./tests/data/clean.cleaned.cleaned.flac')
 
     def test_office(self):
         shutil.copy('./tests/data/dirty.docx', './tests/data/clean.docx')
@@ -390,10 +337,11 @@ class TestCleaning(unittest.TestCase):
 
         p = office.MSOfficeParser('./tests/data/clean.cleaned.docx')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.docx')
         os.remove('./tests/data/clean.cleaned.docx')
-
+        os.remove('./tests/data/clean.cleaned.cleaned.docx')
 
     def test_libreoffice(self):
         shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
@@ -407,9 +355,11 @@ class TestCleaning(unittest.TestCase):
 
         p = office.LibreOfficeParser('./tests/data/clean.cleaned.odt')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.odt')
         os.remove('./tests/data/clean.cleaned.odt')
+        os.remove('./tests/data/clean.cleaned.cleaned.odt')
 
     def test_tiff(self):
         shutil.copy('./tests/data/dirty.tiff', './tests/data/clean.tiff')
@@ -423,9 +373,11 @@ class TestCleaning(unittest.TestCase):
 
         p = images.TiffParser('./tests/data/clean.cleaned.tiff')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.tiff')
         os.remove('./tests/data/clean.cleaned.tiff')
+        os.remove('./tests/data/clean.cleaned.cleaned.tiff')
 
     def test_bmp(self):
         shutil.copy('./tests/data/dirty.bmp', './tests/data/clean.bmp')
@@ -439,9 +391,11 @@ class TestCleaning(unittest.TestCase):
 
         p = harmless.HarmlessParser('./tests/data/clean.cleaned.bmp')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.bmp')
         os.remove('./tests/data/clean.cleaned.bmp')
+        os.remove('./tests/data/clean.cleaned.cleaned.bmp')
 
     def test_torrent(self):
         shutil.copy('./tests/data/dirty.torrent', './tests/data/clean.torrent')
@@ -455,9 +409,11 @@ class TestCleaning(unittest.TestCase):
 
         p = torrent.TorrentParser('./tests/data/clean.cleaned.torrent')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.torrent')
         os.remove('./tests/data/clean.cleaned.torrent')
+        os.remove('./tests/data/clean.cleaned.cleaned.torrent')
 
     def test_odf(self):
         shutil.copy('./tests/data/dirty.odf', './tests/data/clean.odf')
@@ -471,10 +427,11 @@ class TestCleaning(unittest.TestCase):
 
         p = office.LibreOfficeParser('./tests/data/clean.cleaned.odf')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.odf')
         os.remove('./tests/data/clean.cleaned.odf')
-
+        os.remove('./tests/data/clean.cleaned.cleaned.odf')
 
     def test_odg(self):
         shutil.copy('./tests/data/dirty.odg', './tests/data/clean.odg')
@@ -488,9 +445,11 @@ class TestCleaning(unittest.TestCase):
 
         p = office.LibreOfficeParser('./tests/data/clean.cleaned.odg')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.odg')
         os.remove('./tests/data/clean.cleaned.odg')
+        os.remove('./tests/data/clean.cleaned.cleaned.odg')
 
     def test_txt(self):
         shutil.copy('./tests/data/dirty.txt', './tests/data/clean.txt')
@@ -504,6 +463,8 @@ class TestCleaning(unittest.TestCase):
 
         p = harmless.HarmlessParser('./tests/data/clean.cleaned.txt')
         self.assertEqual(p.get_meta(), {})
+        self.assertTrue(p.remove_all())
 
         os.remove('./tests/data/clean.txt')
         os.remove('./tests/data/clean.cleaned.txt')
+        os.remove('./tests/data/clean.cleaned.cleaned.txt')


=====================================
tests/test_policy.py
=====================================
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+
+import unittest
+import shutil
+import os
+
+from libmat2 import office, UnknownMemberPolicy
+
+class TestPolicy(unittest.TestCase):
+    def test_policy_omit(self):
+        shutil.copy('./tests/data/embedded.docx', './tests/data/clean.docx')
+        p = office.MSOfficeParser('./tests/data/clean.docx')
+        p.unknown_member_policy = UnknownMemberPolicy.OMIT
+        self.assertTrue(p.remove_all())
+        os.remove('./tests/data/clean.docx')
+        os.remove('./tests/data/clean.cleaned.docx')
+
+    def test_policy_keep(self):
+        shutil.copy('./tests/data/embedded.docx', './tests/data/clean.docx')
+        p = office.MSOfficeParser('./tests/data/clean.docx')
+        p.unknown_member_policy = UnknownMemberPolicy.KEEP
+        self.assertTrue(p.remove_all())
+        os.remove('./tests/data/clean.docx')
+        os.remove('./tests/data/clean.cleaned.docx')
+
+    def test_policy_unknown(self):
+        shutil.copy('./tests/data/embedded.docx', './tests/data/clean.docx')
+        p = office.MSOfficeParser('./tests/data/clean.docx')
+        with self.assertRaises(ValueError):
+            p.unknown_member_policy = UnknownMemberPolicy('unknown_policy_name_totally_invalid')
+        os.remove('./tests/data/clean.docx')



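The new tests above lean on mat2's output-naming convention: `remove_all()` writes the sanitized copy next to the input, inserting `.cleaned` before the extension, which is why cleaning an already-cleaned file produces `clean.cleaned.cleaned.docx` and so on. A minimal sketch of that naming convention follows; the helper name is hypothetical and not part of libmat2 itself:

```python
import os


def cleaned_filename(path: str) -> str:
    """Mirror mat2's output naming: insert '.cleaned' before the extension.

    e.g. clean.docx -> clean.cleaned.docx; applying it again yields
    clean.cleaned.cleaned.docx, matching the filenames removed in the tests.
    """
    base, ext = os.path.splitext(path)
    return base + '.cleaned' + ext
```

Chaining the helper twice reproduces the double-cleaned filenames the updated tests clean up after asserting that `remove_all()` succeeds on an already-sanitized file.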
View it on GitLab: https://salsa.debian.org/pkg-privacy-team/mat2/commit/2ebbdb392594d41156aaac6cd3b5eecaa0d556a7
