[Pkg-privacy-commits] [Git][pkg-privacy-team/mat2][master] 3 commits: New upstream version 0.11.0

Sun Mar 29 12:51:59 BST 2020


Georg Faerber pushed to branch master at Privacy Maintainers / mat2


Commits:
93f06385 by Georg Faerber at 2020-03-29T11:35:31+00:00
New upstream version 0.11.0
- - - - -
e110b697 by Georg Faerber at 2020-03-29T11:35:47+00:00
Update upstream source from tag 'upstream/0.11.0'

Update to upstream version '0.11.0'
with Debian dir a17914655c9f6b08e626eebc725875cba5926ff7
- - - - -
c326f552 by Georg Faerber at 2020-03-29T11:38:54+00:00
debian/changelog: Debian release 0.11.0-1

- - - - -


14 changed files:

- .gitlab-ci.yml
- CHANGELOG.md
- README.md
- debian/changelog
- doc/mat2.1
- libmat2/archive.py
- libmat2/bubblewrap.py
- libmat2/exiftool.py
- libmat2/office.py
- libmat2/video.py
- mat2
- setup.py
- + tests/data/narrated_powerpoint_presentation.pptx
- tests/test_libmat2.py


Changes:

=====================================
.gitlab-ci.yml
=====================================
@@ -16,7 +16,7 @@ linting:bandit:
   script:  # TODO: remove B405 and B314
     - bandit ./mat2 --format txt --skip B101
     - bandit -r ./nautilus/ --format txt --skip B101
-    - bandit -r ./libmat2 --format txt --skip B101,B404,B603,B405,B314,B108
+    - bandit -r ./libmat2 --format txt --skip B101,B404,B603,B405,B314,B108,B311
 
 linting:codespell:
   image: $CONTAINER_REGISTRY:linting


=====================================
CHANGELOG.md
=====================================
@@ -1,3 +1,8 @@
+# 0.11.0 - 2020-03-29
+
+- Improve significantly MS Office formats support
+- Refactor how mat2 looks for executables
+
 # 0.10.1 - 2020-02-09
 
 - Improve the documentation and the manpage


=====================================
README.md
=====================================
@@ -41,6 +41,12 @@ Nautilus, the default file manager of GNOME.
 
 Please note that mat2 requires at least Python3.5.
 
+# Requirements setup on macOS (OS X) using [Homebrew](https://brew.sh/)
+
+```bash
+brew install exiftool cairo pygobject3 poppler gdk-pixbuf librsvg ffmpeg
+```
+
 # Running the test suite
 
 ```bash
@@ -74,7 +80,7 @@ optional arguments:
                         (policy should be one of: abort, omit, keep) [Default:
                         abort]
   --inplace             clean in place, without backup
-  --no-sandbox          Disable bubblewrap's sandboxing.
+  --no-sandbox          Disable bubblewrap's sandboxing
   -v, --version         show program's version number and exit
   -l, --list            list all supported fileformats
   --check-dependencies  check if mat2 has all the dependencies it needs
@@ -146,6 +152,8 @@ Copyright 2016 Marie-Rose for mat2's logo
 The `tests/data/dirty_with_nsid.docx` file is licensed under GPLv3,
 and was borrowed from the Calibre project: https://calibre-ebook.com/downloads/demos/demo.docx
 
+The `narrated_powerpoint_presentation.pptx` file is in the public domain.
+
 # Thanks
 
 mat2 wouldn't exist without:


=====================================
debian/changelog
=====================================
@@ -1,3 +1,9 @@
+mat2 (0.11.0-1) unstable; urgency=medium
+
+  * New upstream version 0.11.0.
+
+ -- Georg Faerber <georg at debian.org>  Sun, 29 Mar 2020 11:38:44 +0000
+
 mat2 (0.10.1-1) unstable; urgency=medium
 
   * New upstream version 0.10.1:


=====================================
doc/mat2.1
=====================================
@@ -1,4 +1,4 @@
-.TH mat2 "1" "February 2020" "mat2 0.10.1" "User Commands"
+.TH mat2 "1" "March 2020" "mat2 0.11.0" "User Commands"
 
 .SH NAME
 mat2 \- the metadata anonymisation toolkit 2


=====================================
libmat2/archive.py
=====================================
@@ -82,6 +82,13 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
         # pylint: disable=unused-argument,no-self-use
         return {}  # pragma: no cover
 
+    def _final_checks(self) -> bool:
+        """ This method is invoked after the file has been cleaned,
+        allowing to run final verifications.
+        """
+        # pylint: disable=unused-argument,no-self-use
+        return True
+
     @staticmethod
     @abc.abstractmethod
     def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
@@ -223,6 +230,8 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
         if abort:
             os.remove(self.output_filename)
             return False
+        if not self._final_checks():
+            return False  # pragma: no cover
         return True
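
Note on the hunk above: `_final_checks` is a hook that defaults to `True` in the base parser and is consulted once cleaning has finished. The sketch below illustrates that pattern in isolation. `BaseCleaner`, `StrictCleaner` and the `ids` attribute are hypothetical names invented for this example and are not part of libmat2.

```python
from typing import Iterable, Set


class BaseCleaner:
    """Hypothetical stand-in for an archive-based parser."""

    def _final_checks(self) -> bool:
        # Default hook: nothing to verify, so cleaning succeeds.
        return True

    def remove_all(self) -> bool:
        # (the actual cleaning work would happen here)
        if not self._final_checks():
            return False
        return True


class StrictCleaner(BaseCleaner):
    """Hypothetical subclass that rejects non-contiguous identifiers."""

    def __init__(self, ids: Iterable[int]) -> None:
        self.ids: Set[int] = set(ids)

    def _final_checks(self) -> bool:
        # An empty set passes; otherwise the size must equal the maximum.
        return not self.ids or len(self.ids) == max(self.ids)
```

In this sketch the subclass turns a failed check into a failed `remove_all()`, whereas the `MSOfficeParser` override in this commit only logs a warning for now.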
 
 


=====================================
libmat2/bubblewrap.py
=====================================
@@ -22,10 +22,9 @@ CalledProcessError = subprocess.CalledProcessError
 
 
 def _get_bwrap_path() -> str:
-    bwrap_path = '/usr/bin/bwrap'
-    if os.path.isfile(bwrap_path):
-        if os.access(bwrap_path, os.X_OK):
-            return bwrap_path
+    which_path = shutil.which('bwrap')
+    if which_path:
+        return which_path
 
     raise RuntimeError("Unable to find bwrap")  # pragma: no cover
 


=====================================
libmat2/exiftool.py
=====================================
@@ -2,6 +2,7 @@ import functools
 import json
 import logging
 import os
+import shutil
 import subprocess
 from typing import Dict, Union, Set
 
@@ -71,14 +72,12 @@ class ExiftoolParser(abstract.AbstractParser):
 
 @functools.lru_cache()
 def _get_exiftool_path() -> str:  # pragma: no cover
-    possible_pathes = {
-        '/usr/bin/exiftool',              # debian/fedora
-        '/usr/bin/vendor_perl/exiftool',  # archlinux
-    }
+    which_path = shutil.which('exiftool')
+    if which_path:
+        return which_path
 
-    for possible_path in possible_pathes:
-        if os.path.isfile(possible_path):
-            if os.access(possible_path, os.X_OK):
-                return possible_path
+    # Exiftool on Arch Linux has a weird path
+    if os.access('/usr/bin/vendor_perl/exiftool', os.X_OK):
+        return '/usr/bin/vendor_perl/exiftool'
 
     raise RuntimeError("Unable to find exiftool")
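
Note on the lookup change above: the three `_get_*_path` helpers now share one shape, which is to ask `shutil.which` first and then optionally try a known fixed location. A minimal sketch of that shape as a single function follows. `find_executable` and its `fallbacks` parameter are assumptions made for illustration; the commit itself keeps three separate helpers.

```python
import os
import shutil
from typing import Sequence


def find_executable(name: str, fallbacks: Sequence[str] = ()) -> str:
    """Return the path to `name`, searching PATH first, then `fallbacks`."""
    found = shutil.which(name)
    if found:
        return found

    # Fixed locations that are not always on PATH,
    # e.g. '/usr/bin/vendor_perl/exiftool' on Arch Linux.
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    raise RuntimeError("Unable to find %s" % name)
```

Under these assumptions, `find_executable('exiftool', ['/usr/bin/vendor_perl/exiftool'])` would mirror the behaviour of the new `_get_exiftool_path`.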


=====================================
libmat2/office.py
=====================================
@@ -1,3 +1,5 @@
+import random
+import uuid
 import logging
 import os
 import re
@@ -74,6 +76,14 @@ class MSOfficeParser(ZipParser):
     def __init__(self, filename):
         super().__init__(filename)
 
+        # MSOffice documents are using various counters for cross-references,
+        # we collect them all, to make sure that they're effectively counters,
+        # and not unique id used for fingerprinting.
+        self.__counters = {
+            'cNvPr': set(),
+            'rid': set(),
+            }
+
         self.files_to_keep = set(map(re.compile, {  # type: ignore
             r'^\[Content_Types\]\.xml$',
             r'^_rels/\.rels$',
@@ -81,9 +91,16 @@ class MSOfficeParser(ZipParser):
             r'^(?:word|ppt)/_rels/footer[0-9]*\.xml\.rels$',
             r'^(?:word|ppt)/_rels/header[0-9]*\.xml\.rels$',
             r'^ppt/slideLayouts/_rels/slideLayout[0-9]+\.xml\.rels$',
-
+            r'^ppt/slideLayouts/slideLayout[0-9]+\.xml$',
+            r'^(?:word|ppt)/tableStyles\.xml$',
+            r'^ppt/slides/_rels/slide[0-9]*\.xml\.rels$',
+            r'^ppt/slides/slide[0-9]*\.xml$',
             # https://msdn.microsoft.com/en-us/library/dd908153(v=office.12).aspx
             r'^(?:word|ppt)/stylesWithEffects\.xml$',
+            r'^ppt/presentation\.xml$',
+            # TODO: check if p:bgRef can be randomized
+            r'^ppt/slideMasters/slideMaster[0-9]+\.xml',
+            r'^ppt/slideMasters/_rels/slideMaster[0-9]+\.xml\.rels',
         }))
         self.files_to_omit = set(map(re.compile, {  # type: ignore
             r'^customXml/',
@@ -93,6 +110,7 @@ class MSOfficeParser(ZipParser):
             r'^(?:word|ppt)/theme',
             r'^(?:word|ppt)/people\.xml$',
             r'^(?:word|ppt)/numbering\.xml$',
+            r'^(?:word|ppt)/tags/',
             # View properties like view mode, last viewed slide etc
             r'^(?:word|ppt)/viewProps\.xml$',
             # Additional presentation-wide properties like printing properties,
@@ -144,7 +162,7 @@ class MSOfficeParser(ZipParser):
         """
         try:
             tree, namespace = _parse_xml(full_path)
-        except ET.ParseError as e:
+        except ET.ParseError as e:  # pragma: no cover
             logging.error("Unable to parse %s: %s", full_path, e)
             return False
 
@@ -204,7 +222,7 @@ class MSOfficeParser(ZipParser):
     def __remove_revisions(full_path: str) -> bool:
         try:
             tree, namespace = _parse_xml(full_path)
-        except ET.ParseError as e:
+        except ET.ParseError as e:  # pragma: no cover
             logging.error("Unable to parse %s: %s", full_path, e)
             return False
 
@@ -270,14 +288,71 @@ class MSOfficeParser(ZipParser):
         tree.write(full_path, xml_declaration=True)
         return True
 
+    def _final_checks(self) -> bool:
+        for k, v in self.__counters.items():
+            if v and len(v) != max(v):
+                # TODO: make this an error and return False
+                # once the ability to correct the counters is implemented
+                logging.warning("%s contains invalid %s: %s", self.filename, k, v)
+                return True
+        return True
+
+    def __collect_counters(self, full_path: str):
+        with open(full_path, encoding='utf-8') as f:
+            content = f.read()
+            # "relationship Id"
+            for i in re.findall(r'(?:\s|r:)[iI][dD]="rId([0-9]+)"(?:\s|/)', content):
+                self.__counters['rid'].add(int(i))
+            # "connector for Non-visual property"
+            for i in re.findall(r'<p:cNvPr id="([0-9]+)"', content):
+                self.__counters['cNvPr'].add(int(i))
+
+
+    @staticmethod
+    def __randomize_creationId(full_path: str) -> bool:
+        try:
+            tree, namespace = _parse_xml(full_path)
+        except ET.ParseError as e:  # pragma: no cover
+            logging.error("Unable to parse %s: %s", full_path, e)
+            return False
+
+        if 'p14' not in namespace.keys():
+            return True  # pragma: no cover
+
+        for item in tree.iterfind('.//p14:creationId', namespace):
+            item.set('val', '%s' % random.randint(0, 2**32))
+        tree.write(full_path, xml_declaration=True)
+        return True
+
+    @staticmethod
+    def __randomize_sldMasterId(full_path: str) -> bool:
+        try:
+            tree, namespace = _parse_xml(full_path)
+        except ET.ParseError as e:  # pragma: no cover
+            logging.error("Unable to parse %s: %s", full_path, e)
+            return False
+
+        if 'p' not in namespace.keys():
+            return True  # pragma: no cover
+
+        for item in tree.iterfind('.//p:sldMasterId', namespace):
+            item.set('id', '%s' % random.randint(0, 2**32))
+        tree.write(full_path, xml_declaration=True)
+        return True
+
     def _specific_cleanup(self, full_path: str) -> bool:
-        # pylint: disable=too-many-return-statements
+        # pylint: disable=too-many-return-statements,too-many-branches
         if os.stat(full_path).st_size == 0:  # Don't process empty files
             return True
 
         if not full_path.endswith('.xml'):
             return True
 
+        if self.__randomize_creationId(full_path) is False:
+            return False
+
+        self.__collect_counters(full_path)
+
         if full_path.endswith('/[Content_Types].xml'):
             # this file contains references to files that we might
             # remove, and MS Office doesn't like dangling references
@@ -286,7 +361,7 @@ class MSOfficeParser(ZipParser):
         elif full_path.endswith('/word/document.xml'):
             # this file contains the revisions
             if self.__remove_revisions(full_path) is False:
-                return False
+                return False  # pragma: no cover
         elif full_path.endswith('/docProps/app.xml'):
             # This file must be present and valid,
             # so we're removing as much as we can.
@@ -301,9 +376,19 @@ class MSOfficeParser(ZipParser):
                 f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
                 f.write(b'<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">')
                 f.write(b'</cp:coreProperties>')
+        elif full_path.endswith('/ppt/tableStyles.xml'):  # pragma: no cover
+            # This file must be present and valid,
+            # so we're removing as much as we can.
+            with open(full_path, 'wb') as f:
+                f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
+                uid = str(uuid.uuid4()).encode('utf-8')
+                f.write(b'<a:tblStyleLst def="{%s}" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"/>' % uid)
+        elif full_path.endswith('ppt/presentation.xml'):
+            if self.__randomize_sldMasterId(full_path) is False:
+                return False  # pragma: no cover
 
         if self.__remove_rsid(full_path) is False:
-            return False
+            return False  # pragma: no cover
 
         if self.__remove_nsid(full_path) is False:
             return False  # pragma: no cover
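
Note on the counter logic above: `__collect_counters` gathers the numeric part of every `rId` it sees, and `_final_checks` treats a set as a plain counter when its size equals its maximum (for distinct integers starting at 1, that means exactly 1..n). The sketch below reproduces that check outside the class. `collect_rids` and `looks_like_counter` are hypothetical helper names, and the XML snippets are made up for the example.

```python
import re
from typing import Set

# Same pattern as in the diff: matches id="rIdN" and r:id="rIdN".
RID_PATTERN = re.compile(r'(?:\s|r:)[iI][dD]="rId([0-9]+)"(?:\s|/)')


def collect_rids(content: str) -> Set[int]:
    """Collect the numeric part of every relationship id in `content`."""
    return {int(i) for i in RID_PATTERN.findall(content)}


def looks_like_counter(values: Set[int]) -> bool:
    """Mirror the diff's test: empty sets pass, else size must equal max."""
    return not values or len(values) == max(values)


contiguous = '<a r:id="rId1" /><b r:id="rId2"/><c r:id="rId3" />'
sparse = '<a r:id="rId1" /><b r:id="rId2"/><c r:id="rId7" />'

print(looks_like_counter(collect_rids(contiguous)))
print(looks_like_counter(collect_rids(sparse)))
```

A gap in the sequence, as in `sparse`, is the situation the commit currently reports with `logging.warning` rather than failing the cleanup.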


=====================================
libmat2/video.py
=====================================
@@ -1,6 +1,6 @@
 import subprocess
 import functools
-import os
+import shutil
 import logging
 
 from typing import Dict, Union
@@ -137,9 +137,8 @@ class MP4Parser(AbstractFFmpegParser):
 
 @functools.lru_cache()
 def _get_ffmpeg_path() -> str:  # pragma: no cover
-    ffmpeg_path = '/usr/bin/ffmpeg'
-    if os.path.isfile(ffmpeg_path):
-        if os.access(ffmpeg_path, os.X_OK):
-            return ffmpeg_path
+    which_path = shutil.which('ffmpeg')
+    if which_path:
+        return which_path
 
     raise RuntimeError("Unable to find ffmpeg")


=====================================
mat2
=====================================
@@ -17,7 +17,7 @@ except ValueError as e:
     print(e)
     sys.exit(1)
 
-__version__ = '0.10.1'
+__version__ = '0.11.0'
 
 # Make pyflakes happy
 assert Set
@@ -57,7 +57,7 @@ def create_arg_parser() -> argparse.ArgumentParser:
     parser.add_argument('--inplace', action='store_true',
                         help='clean in place, without backup')
     parser.add_argument('--no-sandbox', dest='sandbox', action='store_true',
-                        default=False, help='Disable bubblewrap\'s sandboxing.')
+                        default=False, help='Disable bubblewrap\'s sandboxing')
 
     excl_group = parser.add_mutually_exclusive_group()
     excl_group.add_argument('files', nargs='*', help='the files to process',


=====================================
setup.py
=====================================
@@ -5,7 +5,7 @@ with open("README.md", encoding='utf-8') as fh:
 
 setuptools.setup(
     name="mat2",
-    version='0.10.1',
+    version='0.11.0',
     author="Julien (jvoisin) Voisin",
     author_email="julien.voisin+mat2 at dustri.org",
     description="A handy tool to trash your metadata",


=====================================
tests/data/narrated_powerpoint_presentation.pptx
=====================================
Binary files /dev/null and b/tests/data/narrated_powerpoint_presentation.pptx differ


=====================================
tests/test_libmat2.py
=====================================
@@ -777,3 +777,13 @@ class TestNoSandbox(unittest.TestCase):
         os.remove('./tests/data/clean.png')
         os.remove('./tests/data/clean.cleaned.png')
         os.remove('./tests/data/clean.cleaned.cleaned.png')
+
+class TestComplexOfficeFiles(unittest.TestCase):
+    def test_complex_pptx(self):
+        target = './tests/data/clean.pptx'
+        shutil.copy('./tests/data/narrated_powerpoint_presentation.pptx', target)
+        p = office.MSOfficeParser(target)
+        self.assertTrue(p.remove_all())
+
+        os.remove(target)
+        os.remove(p.output_filename)



View it on GitLab: https://salsa.debian.org/pkg-privacy-team/mat2/-/compare/6d22c612d728ad207f4b6ed11e27c7cf2effe2d8...c326f552b1b05fa955c2e03d9ecaf6c65ef60981
