[Pkg-privacy-commits] [Git][pkg-privacy-team/mat2][upstream] New upstream version 0.4.0
Georg Faerber
gitlab at salsa.debian.org
Wed Oct 3 20:04:52 BST 2018
Georg Faerber pushed to branch upstream at Privacy Maintainers / mat2
Commits:
2ebbdb39 by Georg Faerber at 2018-10-03T18:11:33Z
New upstream version 0.4.0
- - - - -
26 changed files:
- .gitlab-ci.yml
- + .mailmap
- CHANGELOG.md
- CONTRIBUTING.md
- INSTALL.md
- README.md
- doc/implementation_notes.md
- doc/mat.1 → doc/mat2.1
- libmat2/__init__.py
- + libmat2/archive.py
- libmat2/images.py
- libmat2/office.py
- libmat2/pdf.py
- libmat2/torrent.py
- mat2
- nautilus/mat2.py
- setup.py
- + tests/data/broken_xml_content_types.docx
- + tests/data/malformed_content_types.docx
- + tests/data/no_content_types.docx
- + tests/data/office_revision_session_ids.docx
- tests/test_climat2.py
- tests/test_corrupted_files.py
- + tests/test_deep_cleaning.py
- tests/test_libmat2.py
- + tests/test_policy.py
Changes:
=====================================
.gitlab-ci.yml
=====================================
@@ -49,6 +49,8 @@ tests:debian:
tests:fedora:
image: fedora
stage: test
+ tags:
+ - whitewhale
script:
- dnf install -y python3 python3-mutagen python3-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 gdk-pixbuf2-modules cairo-gobject cairo python3-cairo perl-Image-ExifTool mailcap
- gdk-pixbuf-query-loaders-64 > /usr/lib64/gdk-pixbuf-2.0/2.10.0/loaders.cache
@@ -57,6 +59,8 @@ tests:fedora:
tests:archlinux:
image: archlinux/base
stage: test
+ tags:
+ - whitewhale
script:
- pacman -Sy --noconfirm python-mutagen python-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 python-cairo perl-image-exiftool python-setuptools mailcap
- python3 setup.py test
=====================================
.mailmap
=====================================
@@ -0,0 +1,5 @@
+Julien (jvoisin) Voisin <julien.voisin+mat2 at dustri.org> totallylegit <totallylegit at dustri.org>
+Julien (jvoisin) Voisin <julien.voisin+mat2 at dustri.org> jvoisin <julien.voisin at dustri.org>
+Julien (jvoisin) Voisin <julien.voisin+mat2 at dustri.org> jvoisin <jvoisin at riseup.net>
+
+Daniel Kahn Gillmor <dkg at fifthhorseman.net> dkg <dkg at fifthhorseman.net>
=====================================
CHANGELOG.md
=====================================
@@ -1,3 +1,20 @@
+# 0.4.0 - 2018-10-03
+
+- There is now a policy, for advanced users, to deal with unknown embedded fileformats
+- Improve the documentation
+- Various minor refactoring
+- Improve how corrupted PNG files are handled
+- Dangerous/advanced CLI options no longer have short versions
+- Significant improvements to office files anonymisation
+ - Archive members are sorted lexicographically
+ - XML attributes are sorted lexicographically too
+ - RSIDs are now stripped
+ - Dangling references in [Content_Types].xml are now removed
+- Significant improvements to office files support
+- Anonymised office files can now be opened by MS Office without warnings
+- The CLI isn't threaded anymore, for it was causing issues
+- Various miscellaneous typo fixes
+
# 0.3.1 - 2018-09-01
- Document how to install MAT2 for various distributions
=====================================
CONTRIBUTING.md
=====================================
@@ -24,10 +24,13 @@ Since MAT2 is written in Python3, please conform as much as possible to the
1. Update the [changelog](https://0xacab.org/jvoisin/mat2/blob/master/CHANGELOG.md)
2. Update the version in the [mat2](https://0xacab.org/jvoisin/mat2/blob/master/mat2) file
3. Update the version in the [setup.py](https://0xacab.org/jvoisin/mat2/blob/master/setup.py) file
-4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat.1)
+4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat2.1)
5. Commit the changelog, man page, mat2 and setup.py files
6. Create a tag with `git tag -s $VERSION`
7. Push the commit with `git push origin master`
8. Push the tag with `git push --tags`
-9. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
-10. Do the secret release dance
+9. Create the tarball with `git archive --format=tar.xz --prefix=mat-$VERSION/ $VERSION > mat-$VERSION.tar.xz`
+10. Sign the tarball with `gpg --armor --detach-sign mat-$VERSION.tar.xz`
+11. Upload the result on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
+12. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
+13. Do the secret release dance
=====================================
INSTALL.md
=====================================
@@ -38,13 +38,14 @@ $ ./mat2
and if you want to install the über-fancy Nautilus extension:
```
-# apt install python-gi-dev
+# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev
$ git clone https://github.com/GNOME/nautilus-python
$ cd nautilus-python
$ PYTHON=/usr/bin/python3 ./autogen.sh
$ make
# make install
-$ cp ./nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
+$ mkdir -p ~/.local/share/nautilus-python/extensions/
+$ cp ../nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
$ PYTHONPATH=/home/$USER/mat2 PYTHON=/usr/bin/python3 nautilus
```
@@ -52,3 +53,7 @@ $ PYTHONPATH=/home/$USER/mat2 PYTHON=/usr/bin/python3 nautilus
Thanks to [Francois_B](https://www.sciunto.org/), there is a package available on
[Arch linux's AUR](https://aur.archlinux.org/packages/mat2/).
+
+## Gentoo
+
+MAT2 is available in the [torbrowser overlay](https://github.com/MeisterP/torbrowser-overlay).
=====================================
README.md
=====================================
@@ -44,22 +44,33 @@ $ python3 -m unittest discover -v
# How to use MAT2
```bash
-usage: mat2 [-h] [-v] [-l] [-s | -L] [files [files ...]]
+usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]
+ [--unknown-members policy] [-s | -L]
+ [files [files ...]]
Metadata anonymisation toolkit 2
positional arguments:
- files
+ files the files to process
optional arguments:
- -h, --help show this help message and exit
- -v, --version show program's version number and exit
- -l, --list list all supported fileformats
- -s, --show list all the harmful metadata of a file without removing
- them
- -L, --lightweight remove SOME metadata
+ -h, --help show this help message and exit
+ -v, --version show program's version number and exit
+ -l, --list list all supported fileformats
+ --check-dependencies check if MAT2 has all the dependencies it needs
+ -V, --verbose show more verbose status information
+ --unknown-members policy
+ how to handle unknown members of archive-style files
+ (policy should be one of: abort, omit, keep)
+ -s, --show list harmful metadata detectable by MAT2 without
+ removing them
+ -L, --lightweight remove SOME metadata
```
+Note that MAT2 **will not** clean files in-place, but will produce, for
+example, with a file named "myfile.png" a cleaned version named
+"myfile.cleaned.png".
+
# Notes about detecting metadata
While MAT2 is doing its very best to display metadata when the `--show` flag is
@@ -78,12 +89,15 @@ be cleaned or not.
tries to deal with *printer dots* too.
- [pdfparanoia](https://github.com/kanzure/pdfparanoia), that removes
watermarks from PDF.
+- [Scrambled Exif](https://f-droid.org/packages/com.jarsilio.android.scrambledeggsif/),
+ an open-source Android application to remove metadata from pictures.
# Contact
-If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues).
-If you think that a more private contact is needed (eg. for reporting security issues),
-you can email Julien (jvoisin) Voisin at `julien.voisin+mat at dustri.org`,
+If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues)
+or the [mailing list](https://mailman.boum.org/listinfo/mat-dev).
+Should a more private contact be needed (e.g. for reporting security issues),
+you can email Julien (jvoisin) Voisin at `julien.voisin+mat2 at dustri.org`,
using the gpg key `9FCDEE9E1A381F311EA62A7404D041E8171901CC`.
# License
=====================================
doc/implementation_notes.md
=====================================
@@ -61,3 +61,11 @@ Images handling
When possible, images are handled like PDF: rendered on a surface, then saved
to the filesystem. This ensures that every metadata is removed.
+XML attacks
+-----------
+
+Since our threat model conveniently excludes files crafted to specifically
+bypass MAT2, fileformats containing harmful XML are out of our scope.
+But since MAT2 is using [etree](https://docs.python.org/3/library/xml.html#xml-vulnerabilities)
+to process XML, it's "only" vulnerable to DoS, and not memory corruption:
+odds are that the user will notice that the cleaning didn't succeed.
=====================================
doc/mat.1 → doc/mat2.1
=====================================
@@ -1,16 +1,20 @@
-.TH MAT2 "1" "September 2018" "MAT2 0.3.1" "User Commands"
+.TH MAT2 "1" "October 2018" "MAT2 0.4.0" "User Commands"
.SH NAME
mat2 \- the metadata anonymisation toolkit 2
.SH SYNOPSIS
-mat2 [\-h] [\-v] [\-l] [\-c] [\-s | \-L]\fR [files [files ...]]
+\fBmat2\fR [\-h] [\-v] [\-l] [\-V] [-s | -L] [\fIfiles\fR [\fIfiles ...\fR]]
.SH DESCRIPTION
.B mat2
removes metadata from various fileformats. It supports a wide variety of file
formats, audio, office, images, …
+Careful: mat2 does not clean files in-place. Instead, it will produce a file with the word
+"cleaned" between the filename and its extension, for example "filename.cleaned.png"
+for a file named "filename.png".
+
.SH OPTIONS
.SS "positional arguments:"
.TP
@@ -27,9 +31,15 @@ show program's version number and exit
\fB\-l\fR, \fB\-\-list\fR
list all supported fileformats
.TP
-\fB\-c\fR, \fB\-\-check\-dependencies\fR
+\fB\-\-check\-dependencies\fR
check if MAT2 has all the dependencies it needs
.TP
+\fB\-V\fR, \fB\-\-verbose\fR
+show more verbose status information
+.TP
+\fB\-\-unknown-members\fR \fIpolicy\fR
+how to handle unknown members of archive-style files (policy should be one of: abort, omit, keep)
+.TP
\fB\-s\fR, \fB\-\-show\fR
list harmful metadata detectable by MAT2 without
removing them
=====================================
libmat2/__init__.py
=====================================
@@ -2,6 +2,7 @@
import os
import collections
+import enum
import importlib
from typing import Dict, Optional
@@ -35,16 +36,16 @@ DEPENDENCIES = {
'mutagen': 'Mutagen',
}
-def _get_exiftool_path() -> Optional[str]:
+def _get_exiftool_path() -> Optional[str]: # pragma: no cover
exiftool_path = '/usr/bin/exiftool'
if os.path.isfile(exiftool_path):
- if os.access(exiftool_path, os.X_OK): # pragma: no cover
+ if os.access(exiftool_path, os.X_OK):
return exiftool_path
# ArchLinux
exiftool_path = '/usr/bin/vendor_perl/exiftool'
if os.path.isfile(exiftool_path):
- if os.access(exiftool_path, os.X_OK): # pragma: no cover
+ if os.access(exiftool_path, os.X_OK):
return exiftool_path
return None
@@ -62,3 +63,9 @@ def check_dependencies() -> dict:
ret[value] = False # pragma: no cover
return ret
+
+@enum.unique
+class UnknownMemberPolicy(enum.Enum):
+ ABORT = 'abort'
+ OMIT = 'omit'
+ KEEP = 'keep'
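
A minimal sketch of how a command-line string could be mapped onto the enum added above. The enum is repeated so the example is self-contained; `parse_policy` is a hypothetical helper for illustration and is not part of libmat2.

```python
import enum


@enum.unique
class UnknownMemberPolicy(enum.Enum):
    ABORT = 'abort'
    OMIT = 'omit'
    KEEP = 'keep'


def parse_policy(value: str) -> UnknownMemberPolicy:
    """Hypothetical helper: turn a CLI string into a policy member."""
    try:
        # Enum lookup by value: 'omit' -> UnknownMemberPolicy.OMIT
        return UnknownMemberPolicy(value)
    except ValueError:
        allowed = ', '.join(p.value for p in UnknownMemberPolicy)
        raise ValueError('policy should be one of: %s' % allowed) from None
```

Looking the member up by value keeps the accepted strings and the enum definition in one place, so adding a new policy only requires a new enum member.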
=====================================
libmat2/archive.py
=====================================
@@ -0,0 +1,127 @@
+import zipfile
+import datetime
+import tempfile
+import os
+import logging
+import shutil
+from typing import Dict, Set, Pattern
+
+from . import abstract, UnknownMemberPolicy, parser_factory
+
+# Make pyflakes happy
+assert Set
+assert Pattern
+
+
+class ArchiveBasedAbstractParser(abstract.AbstractParser):
+ """ Office files (.docx, .odt, …) are zipped files. """
+ def __init__(self, filename):
+ super().__init__(filename)
+
+ # Those are the files that have a format that _isn't_
+ # supported by MAT2, but that we want to keep anyway.
+ self.files_to_keep = set() # type: Set[Pattern]
+
+ # Those are the files that we _do not_ want to keep,
+ # no matter if they are supported or not.
+ self.files_to_omit = set() # type: Set[Pattern]
+
+ # what should the parser do if it encounters an unknown file in
+ # the archive?
+ self.unknown_member_policy = UnknownMemberPolicy.ABORT # type: UnknownMemberPolicy
+
+ try: # better fail here than later
+ zipfile.ZipFile(self.filename)
+ except zipfile.BadZipFile:
+ raise ValueError
+
+ def _specific_cleanup(self, full_path: str) -> bool:
+ """ This method can be used to apply specific treatment
+ to files present in the archive."""
+ # pylint: disable=unused-argument,no-self-use
+ return True # pragma: no cover
+
+ @staticmethod
+ def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
+ zipinfo.create_system = 3 # Linux
+ zipinfo.comment = b''
+ zipinfo.date_time = (1980, 1, 1, 0, 0, 0) # this is as early as a zipfile can be
+ return zipinfo
+
+ @staticmethod
+ def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
+ metadata = {}
+ if zipinfo.create_system == 3: # this is Linux
+ pass
+ elif zipinfo.create_system == 2:
+ metadata['create_system'] = 'Windows'
+ else:
+ metadata['create_system'] = 'Weird'
+
+ if zipinfo.comment:
+ metadata['comment'] = zipinfo.comment # type: ignore
+
+ if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
+ metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
+
+ return metadata
+
+ def remove_all(self) -> bool:
+ # pylint: disable=too-many-branches
+
+ with zipfile.ZipFile(self.filename) as zin,\
+ zipfile.ZipFile(self.output_filename, 'w') as zout:
+
+ temp_folder = tempfile.mkdtemp()
+ abort = False
+
+ # Since files order is a fingerprint factor,
+ # we're iterating (and thus inserting) them in lexicographic order.
+ for item in sorted(zin.infolist(), key=lambda z: z.filename):
+ if item.filename[-1] == '/': # `is_dir` is added in Python3.6
+ continue # don't keep empty folders
+
+ zin.extract(member=item, path=temp_folder)
+ full_path = os.path.join(temp_folder, item.filename)
+
+ if self._specific_cleanup(full_path) is False:
+ logging.warning("Something went wrong during deep cleaning of %s",
+ item.filename)
+ abort = True
+ continue
+
+ if any(map(lambda r: r.search(item.filename), self.files_to_keep)):
+ # those files aren't supported, but we want to add them anyway
+ pass
+ elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
+ continue
+ else: # supported files that we want to first clean, then add
+ tmp_parser, mtype = parser_factory.get_parser(full_path) # type: ignore
+ if not tmp_parser:
+ if self.unknown_member_policy == UnknownMemberPolicy.OMIT:
+ logging.warning("In file %s, omitting unknown element %s (format: %s)",
+ self.filename, item.filename, mtype)
+ continue
+ elif self.unknown_member_policy == UnknownMemberPolicy.KEEP:
+ logging.warning("In file %s, keeping unknown element %s (format: %s)",
+ self.filename, item.filename, mtype)
+ else:
+ logging.error("In file %s, element %s's format (%s) " +
+ "isn't supported",
+ self.filename, item.filename, mtype)
+ abort = True
+ continue
+ if tmp_parser:
+ tmp_parser.remove_all()
+ os.rename(tmp_parser.output_filename, full_path)
+
+ zinfo = zipfile.ZipInfo(item.filename) # type: ignore
+ clean_zinfo = self._clean_zipinfo(zinfo)
+ with open(full_path, 'rb') as f:
+ zout.writestr(clean_zinfo, f.read())
+
+ shutil.rmtree(temp_folder)
+ if abort:
+ os.remove(self.output_filename)
+ return False
+ return True
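
The archive normalisation done by `remove_all` above (sorted members, neutral `ZipInfo` fields) can be sketched in isolation. This is a simplified illustration working on in-memory bytes: `normalise_zip` is a hypothetical name, and the per-member cleaning and unknown-member policy handling of the real method are deliberately left out.

```python
import io
import zipfile


def normalise_zip(data: bytes) -> bytes:
    """Rewrite a zip archive with sorted members and neutral metadata."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as zin, \
            zipfile.ZipFile(out, 'w') as zout:
        # Member order is a fingerprint factor, so insert in sorted order.
        for item in sorted(zin.infolist(), key=lambda z: z.filename):
            if item.filename.endswith('/'):
                continue  # skip directory entries
            # A fresh ZipInfo defaults to the date 1980-01-01 00:00:00.
            zinfo = zipfile.ZipInfo(item.filename)
            zinfo.create_system = 3  # same value as in _clean_zipinfo
            zinfo.comment = b''
            zout.writestr(zinfo, zin.read(item))
    return out.getvalue()
```

Building a new `ZipInfo` from the filename alone, rather than copying the original one, means timestamps, comments and extra fields from the source archive are never carried over.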
=====================================
libmat2/images.py
=====================================
@@ -62,9 +62,13 @@ class PNGParser(_ImageParser):
def __init__(self, filename):
super().__init__(filename)
+
+ if imghdr.what(filename) != 'png':
+ raise ValueError
+
try: # better fail here than later
cairo.ImageSurface.create_from_png(self.filename)
- except MemoryError:
+ except MemoryError: # pragma: no cover
raise ValueError
def remove_all(self):
=====================================
libmat2/office.py
=====================================
@@ -1,149 +1,166 @@
+import logging
import os
import re
-import shutil
-import tempfile
-import datetime
import zipfile
-import logging
from typing import Dict, Set, Pattern
-try: # protect against DoS
- from defusedxml import ElementTree as ET # type: ignore
-except ImportError:
- import xml.etree.ElementTree as ET # type: ignore
+import xml.etree.ElementTree as ET # type: ignore
+from .archive import ArchiveBasedAbstractParser
-from . import abstract, parser_factory
+# pylint: disable=line-too-long
# Make pyflakes happy
assert Set
assert Pattern
def _parse_xml(full_path: str):
- """ This function parse XML, with namespace support. """
+ """ This function parses XML, with namespace support. """
namespace_map = dict()
for _, (key, value) in ET.iterparse(full_path, ("start-ns", )):
+ # The ns[0-9]+ namespaces are reserved for internal usage, so
+ # we have to use another nomenclature.
+ if re.match('^ns[0-9]+$', key, re.I): # pragma: no cover
+ key = 'mat' + key[2:]
+
namespace_map[key] = value
ET.register_namespace(key, value)
return ET.parse(full_path), namespace_map
-class ArchiveBasedAbstractParser(abstract.AbstractParser):
- """ Office files (.docx, .odt, …) are zipped files. """
- # Those are the files that have a format that _isn't_
- # supported by MAT2, but that we want to keep anyway.
- files_to_keep = set() # type: Set[str]
+def _sort_xml_attributes(full_path: str) -> bool:
+ """ Sort xml attributes lexicographically,
+ because it's possible to fingerprint producers (MS Office, Libreoffice, …)
+ since they are all using different orders.
+ """
+ tree = ET.parse(full_path)
+
+ for c in tree.getroot():
+ c[:] = sorted(c, key=lambda child: (child.tag, child.get('desc')))
+
+ tree.write(full_path, xml_declaration=True)
+ return True
+
+
+class MSOfficeParser(ArchiveBasedAbstractParser):
+ mimetypes = {
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+ 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+ 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
+ }
+ content_types_to_keep = {
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml', # /word/endnotes.xml
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml', # /word/footnotes.xml
+ 'application/vnd.openxmlformats-officedocument.extended-properties+xml', # /docProps/app.xml
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml', # /word/document.xml
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml', # /word/fontTable.xml
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml', # /word/footer.xml
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml', # /word/header.xml
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml', # /word/styles.xml
+ 'application/vnd.openxmlformats-package.core-properties+xml', # /docProps/core.xml
+
+ # Do we want to keep the following ones?
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml',
+
+ # See https://0xacab.org/jvoisin/mat2/issues/71
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml', # /word/numbering.xml
+ }
- # Those are the files that we _do not_ want to keep,
- # no matter if they are supported or not.
- files_to_omit = set() # type: Set[Pattern]
def __init__(self, filename):
super().__init__(filename)
- try: # better fail here than later
- zipfile.ZipFile(self.filename)
- except zipfile.BadZipFile:
+
+ self.files_to_keep = set(map(re.compile, { # type: ignore
+ r'^\[Content_Types\]\.xml$',
+ r'^_rels/\.rels$',
+ r'^word/_rels/document\.xml\.rels$',
+ r'^word/_rels/footer[0-9]*\.xml\.rels$',
+ r'^word/_rels/header[0-9]*\.xml\.rels$',
+
+ # https://msdn.microsoft.com/en-us/library/dd908153(v=office.12).aspx
+ r'^word/stylesWithEffects\.xml$',
+ }))
+ self.files_to_omit = set(map(re.compile, { # type: ignore
+ r'^customXml/',
+ r'webSettings\.xml$',
+ r'^docProps/custom\.xml$',
+ r'^word/printerSettings/',
+ r'^word/theme',
+
+ # we have a whitelist in self.files_to_keep,
+ # so we can trash everything else
+ r'^word/_rels/',
+ }))
+
+ if self.__fill_files_to_keep_via_content_types() is False:
raise ValueError
- def _specific_cleanup(self, full_path: str) -> bool:
- """ This method can be used to apply specific treatment
- to files present in the archive."""
- # pylint: disable=unused-argument,no-self-use
- return True # pragma: no cover
+ def __fill_files_to_keep_via_content_types(self) -> bool:
+ """ There is a super-handy `[Content_Types].xml` file
+ in MS Office archives, describing what each other file contains.
+ The self.content_types_to_keep member contains a type whitelist,
+ so we're using it to fill the self.files_to_keep one.
+ """
+ with zipfile.ZipFile(self.filename) as zin:
+ if '[Content_Types].xml' not in zin.namelist():
+ return False
+ xml_data = zin.read('[Content_Types].xml')
- @staticmethod
- def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
- zipinfo.create_system = 3 # Linux
- zipinfo.comment = b''
- zipinfo.date_time = (1980, 1, 1, 0, 0, 0) # this is as early as a zipfile can be
- return zipinfo
+ self.content_types = dict() # type: Dict[str, str]
+ try:
+ tree = ET.fromstring(xml_data)
+ except ET.ParseError:
+ return False
+ for c in tree:
+ if 'PartName' not in c.attrib or 'ContentType' not in c.attrib:
+ continue
+ elif c.attrib['ContentType'] in self.content_types_to_keep:
+ fname = c.attrib['PartName'][1:] # remove leading `/`
+ re_fname = re.compile('^' + re.escape(fname) + '$')
+ self.files_to_keep.add(re_fname) # type: ignore
+ return True
@staticmethod
- def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
- metadata = {}
- if zipinfo.create_system == 3: # this is Linux
- pass
- elif zipinfo.create_system == 2:
- metadata['create_system'] = 'Windows'
- else:
- metadata['create_system'] = 'Weird'
-
- if zipinfo.comment:
- metadata['comment'] = zipinfo.comment # type: ignore
+ def __remove_rsid(full_path: str) -> bool:
+ """ This method removes "revision session IDs". We're matching on '}rsid'
+ instead of proper parsing, since rsid can have multiple forms, like
+ `rsidRDefault`, `rsidR`, `rsids`, …
- if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
- metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
+ We're removing rsid tags in two passes, because we can't modify
+ the xml while we're iterating on it.
- return metadata
-
- def remove_all(self) -> bool:
- with zipfile.ZipFile(self.filename) as zin,\
- zipfile.ZipFile(self.output_filename, 'w') as zout:
+ For more details, see
+ - https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
+ - https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/
+ """
+ try:
+ tree, namespace = _parse_xml(full_path)
+ except ET.ParseError:
+ return False
- temp_folder = tempfile.mkdtemp()
+ # rsid, tags or attributes, are always under the `w` namespace
+ if 'w' not in namespace.keys():
+ return True
- for item in zin.infolist():
- if item.filename[-1] == '/': # `is_dir` is added in Python3.6
- continue # don't keep empty folders
+ parent_map = {c:p for p in tree.iter() for c in p}
- zin.extract(member=item, path=temp_folder)
- full_path = os.path.join(temp_folder, item.filename)
+ elements_to_remove = list()
+ for item in tree.iterfind('.//', namespace):
+ if '}rsid' in item.tag.strip().lower(): # rsid as tag
+ elements_to_remove.append(item)
+ continue
+ for key in list(item.attrib.keys()): # rsid as attribute
+ if '}rsid' in key.lower():
+ del item.attrib[key]
- if self._specific_cleanup(full_path) is False:
- shutil.rmtree(temp_folder)
- os.remove(self.output_filename)
- logging.warning("Something went wrong during deep cleaning of %s",
- item.filename)
- return False
+ for element in elements_to_remove:
+ parent_map[element].remove(element)
- if item.filename in self.files_to_keep:
- # those files aren't supported, but we want to add them anyway
- pass
- elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
- continue
- else:
- # supported files that we want to clean then add
- tmp_parser, mtype = parser_factory.get_parser(full_path) # type: ignore
- if not tmp_parser:
- shutil.rmtree(temp_folder)
- os.remove(self.output_filename)
- logging.error("In file %s, element %s's format (%s) " +
- "isn't supported",
- self.filename, item.filename, mtype)
- return False
- tmp_parser.remove_all()
- os.rename(tmp_parser.output_filename, full_path)
-
- zinfo = zipfile.ZipInfo(item.filename) # type: ignore
- clean_zinfo = self._clean_zipinfo(zinfo)
- with open(full_path, 'rb') as f:
- zout.writestr(clean_zinfo, f.read())
-
- shutil.rmtree(temp_folder)
+ tree.write(full_path, xml_declaration=True)
return True
-
-class MSOfficeParser(ArchiveBasedAbstractParser):
- mimetypes = {
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
- 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
- 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
- }
- files_to_keep = {
- '[Content_Types].xml',
- '_rels/.rels',
- 'word/_rels/document.xml.rels',
- 'word/document.xml',
- 'word/fontTable.xml',
- 'word/settings.xml',
- 'word/styles.xml',
- }
- files_to_omit = set(map(re.compile, { # type: ignore
- '^docProps/',
- }))
-
@staticmethod
def __remove_revisions(full_path: str) -> bool:
""" In this function, we're changing the XML document in several
@@ -152,7 +169,8 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
"""
try:
tree, namespace = _parse_xml(full_path)
- except ET.ParseError:
+ except ET.ParseError as e:
+ logging.error("Unable to parse %s: %s", full_path, e)
return False
# Revisions are either deletions (`w:del`) or
@@ -172,7 +190,7 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
elements = list()
for element in tree.iterfind('.//w:ins', namespace):
- for position, item in enumerate(tree.iter()): #pragma: no cover
+ for position, item in enumerate(tree.iter()): # pragma: no cover
if item == element:
for children in element.iterfind('./*'):
elements.append((element, position, children))
@@ -182,13 +200,100 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True)
+ return True
+ def __remove_content_type_members(self, full_path: str) -> bool:
+ """ The method will remove the dangling references
+ from the [Content_Types].xml file, since MS Office doesn't like them
+ """
+ try:
+ tree, namespace = _parse_xml(full_path)
+ except ET.ParseError: # pragma: no cover
+ return False
+
+ if len(namespace.items()) != 1:
+ return False # there should be only one namespace for Types
+
+ removed_fnames = set()
+ with zipfile.ZipFile(self.filename) as zin:
+ for fname in [item.filename for item in zin.infolist()]:
+ for file_to_omit in self.files_to_omit:
+ if file_to_omit.search(fname):
+ matches = map(lambda r: r.search(fname), self.files_to_keep)
+ if any(matches): # the file is whitelisted
+ continue
+ removed_fnames.add(fname)
+ break
+
+ root = tree.getroot()
+ for item in root.findall('{%s}Override' % namespace['']):
+ name = item.attrib['PartName'][1:] # remove the leading '/'
+ if name in removed_fnames:
+ root.remove(item)
+
+ tree.write(full_path, xml_declaration=True)
return True
def _specific_cleanup(self, full_path: str) -> bool:
- if full_path.endswith('/word/document.xml'):
+ # pylint: disable=too-many-return-statements
+ if os.stat(full_path).st_size == 0: # Don't process empty files
+ return True
+
+ if not full_path.endswith('.xml'):
+ return True
+
+ if full_path.endswith('/[Content_Types].xml'):
+ # this file contains references to files that we might
+ # remove, and MS Office doesn't like dangling references
+ if self.__remove_content_type_members(full_path) is False:
+ return False
+ elif full_path.endswith('/word/document.xml'):
# this file contains the revisions
- return self.__remove_revisions(full_path)
+ if self.__remove_revisions(full_path) is False:
+ return False
+ elif full_path.endswith('/docProps/app.xml'):
+ # This file must be present and valid,
+ # so we're removing as much as we can.
+ with open(full_path, 'wb') as f:
+ f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
+ f.write(b'<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">')
+ f.write(b'</Properties>')
+ elif full_path.endswith('/docProps/core.xml'):
+ # This file must be present and valid,
+ # so we're removing as much as we can.
+ with open(full_path, 'wb') as f:
+ f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
+ f.write(b'<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">')
+ f.write(b'</cp:coreProperties>')
+
+
+ if self.__remove_rsid(full_path) is False:
+ return False
+
+ try:
+ _sort_xml_attributes(full_path)
+ except ET.ParseError as e: # pragma: no cover
+ logging.error("Unable to parse %s: %s", full_path, e)
+ return False
+
+ # This is awful, I'm sorry.
+ #
+ # Microsoft Office isn't happy when we have the `mc:Ignorable`
+ # tag containing namespaces that aren't present in the xml file,
+ # so instead of trying to remove this specific attribute with etree,
+ # we're removing it with a regexp.
+ #
+ # Since we're the ones producing this file, via the call to
+ # _sort_xml_attributes, there won't be any "funny tricks".
+ # Worst case, the tag isn't present, and everything is fine.
+ #
+ # see: https://docs.microsoft.com/en-us/dotnet/framework/wpf/advanced/mc-ignorable-attribute
+ with open(full_path, 'rb') as f:
+ text = f.read()
+ out = re.sub(b'mc:Ignorable="[^"]*"', b'', text, 1)
+ with open(full_path, 'wb') as f:
+ f.write(out)
+
return True
def get_meta(self) -> Dict[str, str]:
@@ -223,26 +328,31 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
'application/vnd.oasis.opendocument.formula',
'application/vnd.oasis.opendocument.image',
}
- files_to_keep = {
- 'META-INF/manifest.xml',
- 'content.xml',
- 'manifest.rdf',
- 'mimetype',
- 'settings.xml',
- 'styles.xml',
- }
- files_to_omit = set(map(re.compile, { # type: ignore
- r'^meta\.xml$',
- '^Configurations2/',
- '^Thumbnails/',
- }))
+ def __init__(self, filename):
+ super().__init__(filename)
+
+ self.files_to_keep = set(map(re.compile, { # type: ignore
+ r'^META-INF/manifest\.xml$',
+ r'^content\.xml$',
+ r'^manifest\.rdf$',
+ r'^mimetype$',
+ r'^settings\.xml$',
+ r'^styles\.xml$',
+ }))
+ self.files_to_omit = set(map(re.compile, { # type: ignore
+ r'^meta\.xml$',
+ r'^Configurations2/',
+ r'^Thumbnails/',
+ }))
+
@staticmethod
def __remove_revisions(full_path: str) -> bool:
try:
tree, namespace = _parse_xml(full_path)
- except ET.ParseError:
+ except ET.ParseError as e:
+ logging.error("Unable to parse %s: %s", full_path, e)
return False
if 'office' not in namespace.keys(): # no revisions in the current file
@@ -253,12 +363,22 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
text.remove(changes)
tree.write(full_path, xml_declaration=True)
-
return True
def _specific_cleanup(self, full_path: str) -> bool:
- if os.path.basename(full_path) == 'content.xml':
- return self.__remove_revisions(full_path)
+ if os.stat(full_path).st_size == 0: # Don't process empty files
+ return True
+
+ if os.path.basename(full_path).endswith('.xml'):
+ if os.path.basename(full_path) == 'content.xml':
+ if self.__remove_revisions(full_path) is False:
+ return False
+
+ try:
+ _sort_xml_attributes(full_path)
+ except ET.ParseError as e:
+ logging.error("Unable to parse %s: %s", full_path, e)
+ return False
return True
def get_meta(self) -> Dict[str, str]:
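
The two-pass rsid removal used by `MSOfficeParser` in this file can be illustrated on a small in-memory document. This is a sketch under assumptions, not the libmat2 code: `strip_rsid` is a hypothetical function that takes and returns a string, and it assumes the root element itself is not an rsid tag.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, conventionally bound to the `w` prefix.
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def strip_rsid(xml_text: str) -> str:
    """Remove rsid elements and rsid attributes from an XML string."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_map = {child: parent for parent in root.iter() for child in parent}

    # First pass: collect rsid elements, delete rsid attributes in place.
    to_remove = []
    for item in root.iter():
        if '}rsid' in item.tag.lower():  # rsid as tag
            to_remove.append(item)
            continue
        for key in list(item.attrib):  # rsid as attribute
            if '}rsid' in key.lower():
                del item.attrib[key]

    # Second pass: detach the collected elements, now that iteration is over.
    for element in to_remove:
        parent_map[element].remove(element)

    return ET.tostring(root, encoding='unicode')
```

Matching on `'}rsid'` relies on ElementTree's `{namespace}localname` form for tags and attribute keys, which is why variants such as `rsidR` and `rsids` are all caught by one substring test.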
=====================================
libmat2/pdf.py
=====================================
@@ -17,7 +17,7 @@ from gi.repository import Poppler, GLib
from . import abstract
poppler_version = Poppler.get_version()
-if LooseVersion(poppler_version) < LooseVersion('0.46'): # pragma: no cover
+if LooseVersion(poppler_version) < LooseVersion('0.46'): # pragma: no cover
raise ValueError("MAT2 needs at least Poppler version 0.46 to work. \
The installed version is %s." % poppler_version) # pragma: no cover
@@ -118,7 +118,6 @@ class PDFParser(abstract.AbstractParser):
document.save('file://' + os.path.abspath(out_file))
return True
-
@staticmethod
def __parse_metadata_field(data: str) -> dict:
metadata = {}
=====================================
libmat2/torrent.py
=====================================
@@ -21,7 +21,6 @@ class TorrentParser(abstract.AbstractParser):
metadata[key.decode('utf-8')] = value
return metadata
-
def remove_all(self) -> bool:
cleaned = dict()
for key, value in self.dict_repr.items():
=====================================
mat2
=====================================
@@ -1,21 +1,20 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
import os
from typing import Tuple
import sys
-import itertools
import mimetypes
import argparse
-import multiprocessing
import logging
try:
- from libmat2 import parser_factory, UNSUPPORTED_EXTENSIONS, check_dependencies
+ from libmat2 import parser_factory, UNSUPPORTED_EXTENSIONS
+ from libmat2 import check_dependencies, UnknownMemberPolicy
except ValueError as e:
print(e)
sys.exit(1)
-__version__ = '0.3.1'
+__version__ = '0.4.0'
def __check_file(filename: str, mode: int=os.R_OK) -> bool:
if not os.path.exists(filename):
@@ -37,10 +36,13 @@ def create_arg_parser():
version='MAT2 %s' % __version__)
parser.add_argument('-l', '--list', action='store_true',
help='list all supported fileformats')
- parser.add_argument('-c', '--check-dependencies', action='store_true',
+ parser.add_argument('--check-dependencies', action='store_true',
help='check if MAT2 has all the dependencies it needs')
parser.add_argument('-V', '--verbose', action='store_true',
help='show more verbose status information')
+ parser.add_argument('--unknown-members', metavar='policy', default='abort',
+ help='how to handle unknown members of archive-style files (policy should' +
+ ' be one of: %s)' % ', '.join(p.value for p in UnknownMemberPolicy))
info = parser.add_mutually_exclusive_group()
@@ -67,8 +69,8 @@ def show_meta(filename: str):
except UnicodeEncodeError:
print(" %s: harmful content" % k)
-def clean_meta(params: Tuple[str, bool]) -> bool:
- filename, is_lightweight = params
+def clean_meta(params: Tuple[str, bool, UnknownMemberPolicy]) -> bool:
+ filename, is_lightweight, unknown_member_policy = params
if not __check_file(filename, os.R_OK|os.W_OK):
return False
@@ -76,6 +78,7 @@ def clean_meta(params: Tuple[str, bool]) -> bool:
if p is None:
print("[-] %s's format (%s) is not supported" % (filename, mtype))
return False
+ p.unknown_member_policy = unknown_member_policy
if is_lightweight:
return p.remove_all_lightweight()
return p.remove_all()
@@ -133,12 +136,16 @@ def main():
return 0
else:
- p = multiprocessing.Pool()
- mode = (args.lightweight is True)
- l = zip(__get_files_recursively(args.files), itertools.repeat(mode))
+ unknown_member_policy = UnknownMemberPolicy(args.unknown_members)
+ if unknown_member_policy == UnknownMemberPolicy.KEEP:
+ logging.warning('Keeping unknown member files may leak metadata in the resulting file!')
+
+ no_failure = True
+ for f in __get_files_recursively(args.files):
+ if clean_meta([f, args.lightweight, unknown_member_policy]) is False:
+ no_failure = False
+ return 0 if no_failure is True else -1
- ret = list(p.imap_unordered(clean_meta, list(l)))
- return 0 if all(ret) else -1
if __name__ == '__main__':
sys.exit(main())
=====================================
nautilus/mat2.py
=====================================
@@ -104,7 +104,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
box.add(self.__create_treeview())
window.show_all()
-
@staticmethod
def __validate(fileinfo) -> Tuple[bool, str]:
""" Validate if a given file FileInfo `fileinfo` can be processed.
@@ -115,7 +114,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
return False, "Not writeable"
return True, ""
-
def __create_treeview(self) -> Gtk.TreeView:
liststore = Gtk.ListStore(GdkPixbuf.Pixbuf, str, str)
treeview = Gtk.TreeView(model=liststore)
@@ -148,7 +146,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
treeview.show_all()
return treeview
-
def __create_progressbar(self) -> Gtk.ProgressBar:
""" Create the progressbar used to notify that files are currently
being processed.
@@ -211,7 +208,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
processing_queue.put(None) # signal that we processed all the files
return True
-
def __cb_menu_activate(self, menu, files):
""" This method is called when the user clicked the "clean metadata"
menu item.
@@ -228,7 +224,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
thread.daemon = True
thread.start()
-
def get_background_items(self, window, file):
""" https://bugzilla.gnome.org/show_bug.cgi?id=784278 """
return None
=====================================
setup.py
=====================================
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
setuptools.setup(
name="mat2",
- version='0.3.1',
+ version='0.4.0',
author="Julien (jvoisin) Voisin",
author_email="julien.voisin+mat2 at dustri.org",
description="A handy tool to trash your metadata",
@@ -20,7 +20,7 @@ setuptools.setup(
'pycairo',
],
packages=setuptools.find_packages(exclude=('tests', )),
- classifiers=(
+ classifiers=[
"Development Status :: 3 - Alpha",
"Environment :: Console",
"License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)",
@@ -28,7 +28,7 @@ setuptools.setup(
"Programming Language :: Python :: 3 :: Only",
"Topic :: Security",
"Intended Audience :: End Users/Desktop",
- ),
+ ],
project_urls={
'bugtacker': 'https://0xacab.org/jvoisin/mat2/issues',
},
=====================================
tests/data/broken_xml_content_types.docx
=====================================
Binary files /dev/null and b/tests/data/broken_xml_content_types.docx differ
=====================================
tests/data/malformed_content_types.docx
=====================================
Binary files /dev/null and b/tests/data/malformed_content_types.docx differ
=====================================
tests/data/no_content_types.docx
=====================================
Binary files /dev/null and b/tests/data/no_content_types.docx differ
=====================================
tests/data/office_revision_session_ids.docx
=====================================
Binary files /dev/null and b/tests/data/office_revision_session_ids.docx differ
=====================================
tests/test_climat2.py
=====================================
@@ -8,12 +8,16 @@ class TestHelp(unittest.TestCase):
def test_help(self):
proc = subprocess.Popen(['./mat2', '--help'], stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
- self.assertIn(b'usage: mat2 [-h] [-v] [-l] [-c] [-V] [-s | -L] [files [files ...]]', stdout)
+ self.assertIn(b'usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]',
+ stdout)
+ self.assertIn(b'[--unknown-members policy] [-s | -L]', stdout)
def test_no_arg(self):
proc = subprocess.Popen(['./mat2'], stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
- self.assertIn(b'usage: mat2 [-h] [-v] [-l] [-c] [-V] [-s | -L] [files [files ...]]', stdout)
+ self.assertIn(b'usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]',
+ stdout)
+ self.assertIn(b'[--unknown-members policy] [-s | -L]', stdout)
class TestVersion(unittest.TestCase):
@@ -46,7 +50,10 @@ class TestReturnValue(unittest.TestCase):
class TestCleanFolder(unittest.TestCase):
def test_jpg(self):
- os.mkdir('./tests/data/folder/')
+ try:
+ os.mkdir('./tests/data/folder/')
+ except FileExistsError:
+ pass
shutil.copy('./tests/data/dirty.jpg', './tests/data/folder/clean1.jpg')
shutil.copy('./tests/data/dirty.jpg', './tests/data/folder/clean2.jpg')
@@ -70,7 +77,6 @@ class TestCleanFolder(unittest.TestCase):
shutil.rmtree('./tests/data/folder/')
-
class TestCleanMeta(unittest.TestCase):
def test_jpg(self):
shutil.copy('./tests/data/dirty.jpg', './tests/data/clean.jpg')
=====================================
tests/test_corrupted_files.py
=====================================
@@ -1,11 +1,17 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
import unittest
import shutil
import os
+import logging
from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
+# No need for logging messages: should something go wrong,
+# the testsuite _will_ fail.
+logger = logging.getLogger()
+logger.setLevel(logging.FATAL)
+
class TestInexistentFiles(unittest.TestCase):
def test_ro(self):
@@ -53,16 +59,21 @@ class TestUnsupportedFiles(unittest.TestCase):
class TestCorruptedEmbedded(unittest.TestCase):
def test_docx(self):
shutil.copy('./tests/data/embedded_corrupted.docx', './tests/data/clean.docx')
- parser, mimetype = parser_factory.get_parser('./tests/data/clean.docx')
+ parser, _ = parser_factory.get_parser('./tests/data/clean.docx')
self.assertFalse(parser.remove_all())
self.assertIsNotNone(parser.get_meta())
os.remove('./tests/data/clean.docx')
def test_odt(self):
+ expected = {
+ 'create_system': 'Weird',
+ 'date_time': '2018-06-10 17:18:18',
+ 'meta.xml': 'harmful content'
+ }
shutil.copy('./tests/data/embedded_corrupted.odt', './tests/data/clean.odt')
- parser, mimetype = parser_factory.get_parser('./tests/data/clean.odt')
+ parser, _ = parser_factory.get_parser('./tests/data/clean.odt')
self.assertFalse(parser.remove_all())
- self.assertEqual(parser.get_meta(), {'create_system': 'Weird', 'date_time': '2018-06-10 17:18:18', 'meta.xml': 'harmful content'})
+ self.assertEqual(parser.get_meta(), expected)
os.remove('./tests/data/clean.odt')
@@ -75,6 +86,26 @@ class TestExplicitelyUnsupportedFiles(unittest.TestCase):
os.remove('./tests/data/clean.py')
+class TestWrongContentTypesFileOffice(unittest.TestCase):
+ def test_office_incomplete(self):
+ shutil.copy('./tests/data/malformed_content_types.docx', './tests/data/clean.docx')
+ p = office.MSOfficeParser('./tests/data/clean.docx')
+ self.assertIsNotNone(p)
+ self.assertFalse(p.remove_all())
+ os.remove('./tests/data/clean.docx')
+
+ def test_office_broken(self):
+ shutil.copy('./tests/data/broken_xml_content_types.docx', './tests/data/clean.docx')
+ with self.assertRaises(ValueError):
+ office.MSOfficeParser('./tests/data/clean.docx')
+ os.remove('./tests/data/clean.docx')
+
+ def test_office_absent(self):
+ shutil.copy('./tests/data/no_content_types.docx', './tests/data/clean.docx')
+ with self.assertRaises(ValueError):
+ office.MSOfficeParser('./tests/data/clean.docx')
+ os.remove('./tests/data/clean.docx')
+
class TestCorruptedFiles(unittest.TestCase):
def test_pdf(self):
shutil.copy('./tests/data/dirty.png', './tests/data/clean.png')
@@ -90,7 +121,7 @@ class TestCorruptedFiles(unittest.TestCase):
def test_png2(self):
shutil.copy('./tests/test_libmat2.py', './tests/clean.png')
- parser, mimetype = parser_factory.get_parser('./tests/clean.png')
+ parser, _ = parser_factory.get_parser('./tests/clean.png')
self.assertIsNone(parser)
os.remove('./tests/clean.png')
@@ -134,25 +165,26 @@ class TestCorruptedFiles(unittest.TestCase):
def test_bmp(self):
shutil.copy('./tests/data/dirty.png', './tests/data/clean.bmp')
- harmless.HarmlessParser('./tests/data/clean.bmp')
+ ret = harmless.HarmlessParser('./tests/data/clean.bmp')
+ self.assertIsNotNone(ret)
os.remove('./tests/data/clean.bmp')
def test_docx(self):
shutil.copy('./tests/data/dirty.png', './tests/data/clean.docx')
with self.assertRaises(ValueError):
- office.MSOfficeParser('./tests/data/clean.docx')
+ office.MSOfficeParser('./tests/data/clean.docx')
os.remove('./tests/data/clean.docx')
def test_flac(self):
shutil.copy('./tests/data/dirty.png', './tests/data/clean.flac')
with self.assertRaises(ValueError):
- audio.FLACParser('./tests/data/clean.flac')
+ audio.FLACParser('./tests/data/clean.flac')
os.remove('./tests/data/clean.flac')
def test_mp3(self):
shutil.copy('./tests/data/dirty.png', './tests/data/clean.mp3')
with self.assertRaises(ValueError):
- audio.MP3Parser('./tests/data/clean.mp3')
+ audio.MP3Parser('./tests/data/clean.mp3')
os.remove('./tests/data/clean.mp3')
def test_jpg(self):
=====================================
tests/test_deep_cleaning.py
=====================================
@@ -0,0 +1,134 @@
+#!/usr/bin/env python3
+
+import unittest
+import shutil
+import os
+import zipfile
+import tempfile
+
+from libmat2 import office, parser_factory
+
+class TestZipMetadata(unittest.TestCase):
+ def __check_deep_meta(self, p):
+ tempdir = tempfile.mkdtemp()
+ zipin = zipfile.ZipFile(p.filename)
+ zipin.extractall(tempdir)
+
+ for subdir, dirs, files in os.walk(tempdir):
+ for f in files:
+ complete_path = os.path.join(subdir, f)
+ inside_p, _ = parser_factory.get_parser(complete_path)
+ if inside_p is None:
+ continue
+ self.assertEqual(inside_p.get_meta(), {})
+ shutil.rmtree(tempdir)
+
+ def __check_zip_meta(self, p):
+ zipin = zipfile.ZipFile(p.filename)
+ for item in zipin.infolist():
+ self.assertEqual(item.comment, b'')
+ self.assertEqual(item.date_time, (1980, 1, 1, 0, 0, 0))
+ self.assertEqual(item.create_system, 3) # 3 is UNIX
+
+ def test_office(self):
+ shutil.copy('./tests/data/dirty.docx', './tests/data/clean.docx')
+ p = office.MSOfficeParser('./tests/data/clean.docx')
+
+ meta = p.get_meta()
+ self.assertIsNotNone(meta)
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = office.MSOfficeParser('./tests/data/clean.cleaned.docx')
+ self.assertEqual(p.get_meta(), {})
+
+ self.__check_zip_meta(p)
+ self.__check_deep_meta(p)
+
+ os.remove('./tests/data/clean.docx')
+ os.remove('./tests/data/clean.cleaned.docx')
+
+ def test_libreoffice(self):
+ shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
+ p = office.LibreOfficeParser('./tests/data/clean.odt')
+
+ meta = p.get_meta()
+ self.assertIsNotNone(meta)
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ p = office.LibreOfficeParser('./tests/data/clean.cleaned.odt')
+ self.assertEqual(p.get_meta(), {})
+
+ self.__check_zip_meta(p)
+ self.__check_deep_meta(p)
+
+ os.remove('./tests/data/clean.odt')
+ os.remove('./tests/data/clean.cleaned.odt')
+
+
+class TestZipOrder(unittest.TestCase):
+ def test_libreoffice(self):
+ shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
+ p = office.LibreOfficeParser('./tests/data/clean.odt')
+
+ meta = p.get_meta()
+ self.assertIsNotNone(meta)
+
+ is_unordered = False
+ with zipfile.ZipFile('./tests/data/clean.odt') as zin:
+ previous_name = ''
+ for item in zin.infolist():
+ if previous_name == '':
+ previous_name = item.filename
+ continue
+ elif item.filename < previous_name:
+ is_unordered = True
+ break
+ self.assertTrue(is_unordered)
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ with zipfile.ZipFile('./tests/data/clean.cleaned.odt') as zin:
+ previous_name = ''
+ for item in zin.infolist():
+ if previous_name == '':
+ previous_name = item.filename
+ continue
+ self.assertGreaterEqual(item.filename, previous_name)
+
+ os.remove('./tests/data/clean.odt')
+ os.remove('./tests/data/clean.cleaned.odt')
+
+class TestRsidRemoval(unittest.TestCase):
+ def test_office(self):
+ shutil.copy('./tests/data/office_revision_session_ids.docx', './tests/data/clean.docx')
+ p = office.MSOfficeParser('./tests/data/clean.docx')
+
+ meta = p.get_meta()
+ self.assertIsNotNone(meta)
+
+ how_many_rsid = 0
+ with zipfile.ZipFile('./tests/data/clean.docx') as zin:
+ for item in zin.infolist():
+ if not item.filename.endswith('.xml'):
+ continue
+ num = zin.read(item).decode('utf-8').lower().count('w:rsid')
+ how_many_rsid += num
+ self.assertEqual(how_many_rsid, 11)
+
+ ret = p.remove_all()
+ self.assertTrue(ret)
+
+ with zipfile.ZipFile('./tests/data/clean.cleaned.docx') as zin:
+ for item in zin.infolist():
+ if not item.filename.endswith('.xml'):
+ continue
+ num = zin.read(item).decode('utf-8').lower().count('w:rsid')
+ self.assertEqual(num, 0)
+
+ os.remove('./tests/data/clean.docx')
+ os.remove('./tests/data/clean.cleaned.docx')
=====================================
tests/test_libmat2.py
=====================================
@@ -1,10 +1,9 @@
-#!/usr/bin/python3
+#!/usr/bin/env python3
import unittest
import shutil
import os
import zipfile
-import tempfile
from libmat2 import pdf, images, audio, office, parser_factory, torrent, harmless
from libmat2 import check_dependencies
@@ -13,7 +12,7 @@ from libmat2 import check_dependencies
class TestCheckDependencies(unittest.TestCase):
def test_deps(self):
ret = check_dependencies()
- for key, value in ret.items():
+ for value in ret.values():
self.assertTrue(value)
@@ -56,8 +55,8 @@ class TestGetMeta(unittest.TestCase):
self.assertEqual(meta['producer'], 'pdfTeX-1.40.14')
self.assertEqual(meta['creator'], "'Certified by IEEE PDFeXpress at 03/19/2016 2:56:07 AM'")
self.assertEqual(meta['DocumentID'], "uuid:4a1a79c8-404e-4d38-9580-5bc081036e61")
- self.assertEqual(meta['PTEX.Fullbanner'], "This is pdfTeX, Version " \
- "3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea " \
+ self.assertEqual(meta['PTEX.Fullbanner'], "This is pdfTeX, Version "
+ "3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea "
"version 6.1.1")
def test_torrent(self):
@@ -182,70 +181,6 @@ class TestRevisionsCleaning(unittest.TestCase):
os.remove('./tests/data/revision_clean.docx')
os.remove('./tests/data/revision_clean.cleaned.docx')
-
-class TestDeepCleaning(unittest.TestCase):
- def __check_deep_meta(self, p):
- tempdir = tempfile.mkdtemp()
- zipin = zipfile.ZipFile(p.filename)
- zipin.extractall(tempdir)
-
- for subdir, dirs, files in os.walk(tempdir):
- for f in files:
- complete_path = os.path.join(subdir, f)
- inside_p, _ = parser_factory.get_parser(complete_path)
- if inside_p is None:
- continue
- self.assertEqual(inside_p.get_meta(), {})
- shutil.rmtree(tempdir)
-
-
- def __check_zip_meta(self, p):
- zipin = zipfile.ZipFile(p.filename)
- for item in zipin.infolist():
- self.assertEqual(item.comment, b'')
- self.assertEqual(item.date_time, (1980, 1, 1, 0, 0, 0))
- self.assertEqual(item.create_system, 3) # 3 is UNIX
-
-
- def test_office(self):
- shutil.copy('./tests/data/dirty.docx', './tests/data/clean.docx')
- p = office.MSOfficeParser('./tests/data/clean.docx')
-
- meta = p.get_meta()
- self.assertIsNotNone(meta)
-
- ret = p.remove_all()
- self.assertTrue(ret)
-
- p = office.MSOfficeParser('./tests/data/clean.cleaned.docx')
- self.assertEqual(p.get_meta(), {})
-
- self.__check_zip_meta(p)
- self.__check_deep_meta(p)
-
- os.remove('./tests/data/clean.docx')
- os.remove('./tests/data/clean.cleaned.docx')
-
-
- def test_libreoffice(self):
- shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
- p = office.LibreOfficeParser('./tests/data/clean.odt')
-
- meta = p.get_meta()
- self.assertIsNotNone(meta)
-
- ret = p.remove_all()
- self.assertTrue(ret)
-
- p = office.LibreOfficeParser('./tests/data/clean.cleaned.odt')
- self.assertEqual(p.get_meta(), {})
-
- self.__check_zip_meta(p)
- self.__check_deep_meta(p)
-
- os.remove('./tests/data/clean.odt')
- os.remove('./tests/data/clean.cleaned.odt')
-
class TestLightWeightCleaning(unittest.TestCase):
def test_pdf(self):
shutil.copy('./tests/data/dirty.pdf', './tests/data/clean.pdf')
@@ -294,9 +229,11 @@ class TestCleaning(unittest.TestCase):
p = pdf.PDFParser('./tests/data/clean.cleaned.pdf')
expected_meta = {'creation-date': -1, 'format': 'PDF-1.5', 'mod-date': -1}
self.assertEqual(p.get_meta(), expected_meta)
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.pdf')
os.remove('./tests/data/clean.cleaned.pdf')
+ os.remove('./tests/data/clean.cleaned.cleaned.pdf')
def test_png(self):
shutil.copy('./tests/data/dirty.png', './tests/data/clean.png')
@@ -310,9 +247,11 @@ class TestCleaning(unittest.TestCase):
p = images.PNGParser('./tests/data/clean.cleaned.png')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.png')
os.remove('./tests/data/clean.cleaned.png')
+ os.remove('./tests/data/clean.cleaned.cleaned.png')
def test_jpg(self):
shutil.copy('./tests/data/dirty.jpg', './tests/data/clean.jpg')
@@ -326,9 +265,11 @@ class TestCleaning(unittest.TestCase):
p = images.JPGParser('./tests/data/clean.cleaned.jpg')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.jpg')
os.remove('./tests/data/clean.cleaned.jpg')
+ os.remove('./tests/data/clean.cleaned.cleaned.jpg')
def test_mp3(self):
shutil.copy('./tests/data/dirty.mp3', './tests/data/clean.mp3')
@@ -342,9 +283,11 @@ class TestCleaning(unittest.TestCase):
p = audio.MP3Parser('./tests/data/clean.cleaned.mp3')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.mp3')
os.remove('./tests/data/clean.cleaned.mp3')
+ os.remove('./tests/data/clean.cleaned.cleaned.mp3')
def test_ogg(self):
shutil.copy('./tests/data/dirty.ogg', './tests/data/clean.ogg')
@@ -358,9 +301,11 @@ class TestCleaning(unittest.TestCase):
p = audio.OGGParser('./tests/data/clean.cleaned.ogg')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.ogg')
os.remove('./tests/data/clean.cleaned.ogg')
+ os.remove('./tests/data/clean.cleaned.cleaned.ogg')
def test_flac(self):
shutil.copy('./tests/data/dirty.flac', './tests/data/clean.flac')
@@ -374,9 +319,11 @@ class TestCleaning(unittest.TestCase):
p = audio.FLACParser('./tests/data/clean.cleaned.flac')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.flac')
os.remove('./tests/data/clean.cleaned.flac')
+ os.remove('./tests/data/clean.cleaned.cleaned.flac')
def test_office(self):
shutil.copy('./tests/data/dirty.docx', './tests/data/clean.docx')
@@ -390,10 +337,11 @@ class TestCleaning(unittest.TestCase):
p = office.MSOfficeParser('./tests/data/clean.cleaned.docx')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.docx')
os.remove('./tests/data/clean.cleaned.docx')
-
+ os.remove('./tests/data/clean.cleaned.cleaned.docx')
def test_libreoffice(self):
shutil.copy('./tests/data/dirty.odt', './tests/data/clean.odt')
@@ -407,9 +355,11 @@ class TestCleaning(unittest.TestCase):
p = office.LibreOfficeParser('./tests/data/clean.cleaned.odt')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.odt')
os.remove('./tests/data/clean.cleaned.odt')
+ os.remove('./tests/data/clean.cleaned.cleaned.odt')
def test_tiff(self):
shutil.copy('./tests/data/dirty.tiff', './tests/data/clean.tiff')
@@ -423,9 +373,11 @@ class TestCleaning(unittest.TestCase):
p = images.TiffParser('./tests/data/clean.cleaned.tiff')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.tiff')
os.remove('./tests/data/clean.cleaned.tiff')
+ os.remove('./tests/data/clean.cleaned.cleaned.tiff')
def test_bmp(self):
shutil.copy('./tests/data/dirty.bmp', './tests/data/clean.bmp')
@@ -439,9 +391,11 @@ class TestCleaning(unittest.TestCase):
p = harmless.HarmlessParser('./tests/data/clean.cleaned.bmp')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.bmp')
os.remove('./tests/data/clean.cleaned.bmp')
+ os.remove('./tests/data/clean.cleaned.cleaned.bmp')
def test_torrent(self):
shutil.copy('./tests/data/dirty.torrent', './tests/data/clean.torrent')
@@ -455,9 +409,11 @@ class TestCleaning(unittest.TestCase):
p = torrent.TorrentParser('./tests/data/clean.cleaned.torrent')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.torrent')
os.remove('./tests/data/clean.cleaned.torrent')
+ os.remove('./tests/data/clean.cleaned.cleaned.torrent')
def test_odf(self):
shutil.copy('./tests/data/dirty.odf', './tests/data/clean.odf')
@@ -471,10 +427,11 @@ class TestCleaning(unittest.TestCase):
p = office.LibreOfficeParser('./tests/data/clean.cleaned.odf')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.odf')
os.remove('./tests/data/clean.cleaned.odf')
-
+ os.remove('./tests/data/clean.cleaned.cleaned.odf')
def test_odg(self):
shutil.copy('./tests/data/dirty.odg', './tests/data/clean.odg')
@@ -488,9 +445,11 @@ class TestCleaning(unittest.TestCase):
p = office.LibreOfficeParser('./tests/data/clean.cleaned.odg')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.odg')
os.remove('./tests/data/clean.cleaned.odg')
+ os.remove('./tests/data/clean.cleaned.cleaned.odg')
def test_txt(self):
shutil.copy('./tests/data/dirty.txt', './tests/data/clean.txt')
@@ -504,6 +463,8 @@ class TestCleaning(unittest.TestCase):
p = harmless.HarmlessParser('./tests/data/clean.cleaned.txt')
self.assertEqual(p.get_meta(), {})
+ self.assertTrue(p.remove_all())
os.remove('./tests/data/clean.txt')
os.remove('./tests/data/clean.cleaned.txt')
+ os.remove('./tests/data/clean.cleaned.cleaned.txt')
=====================================
tests/test_policy.py
=====================================
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+
+import unittest
+import shutil
+import os
+
+from libmat2 import office, UnknownMemberPolicy
+
+class TestPolicy(unittest.TestCase):
+ def test_policy_omit(self):
+ shutil.copy('./tests/data/embedded.docx', './tests/data/clean.docx')
+ p = office.MSOfficeParser('./tests/data/clean.docx')
+ p.unknown_member_policy = UnknownMemberPolicy.OMIT
+ self.assertTrue(p.remove_all())
+ os.remove('./tests/data/clean.docx')
+ os.remove('./tests/data/clean.cleaned.docx')
+
+ def test_policy_keep(self):
+ shutil.copy('./tests/data/embedded.docx', './tests/data/clean.docx')
+ p = office.MSOfficeParser('./tests/data/clean.docx')
+ p.unknown_member_policy = UnknownMemberPolicy.KEEP
+ self.assertTrue(p.remove_all())
+ os.remove('./tests/data/clean.docx')
+ os.remove('./tests/data/clean.cleaned.docx')
+
+ def test_policy_unknown(self):
+ shutil.copy('./tests/data/embedded.docx', './tests/data/clean.docx')
+ p = office.MSOfficeParser('./tests/data/clean.docx')
+ with self.assertRaises(ValueError):
+ p.unknown_member_policy = UnknownMemberPolicy('unknown_policy_name_totally_invalid')
+ os.remove('./tests/data/clean.docx')
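
Note on the policy handling exercised by tests/test_policy.py and the new --unknown-members option: the following is a minimal, self-contained sketch of how an enum-backed command-line option can reject unknown values. It is not the libmat2 implementation. The enum values are assumptions (only 'abort' is visible in this diff, as the CLI default), and parse_policy is a hypothetical helper introduced for illustration.

```python
import argparse
import enum
from typing import List, Optional


class UnknownMemberPolicy(enum.Enum):
    # Assumed values: the diff only shows 'abort' (the CLI default)
    # and the member names ABORT, OMIT and KEEP.
    ABORT = 'abort'
    OMIT = 'omit'
    KEEP = 'keep'


def parse_policy(argv: Optional[List[str]] = None) -> UnknownMemberPolicy:
    """Hypothetical helper: parse --unknown-members into an enum member."""
    parser = argparse.ArgumentParser(prog='policy-demo')
    parser.add_argument(
        '--unknown-members', metavar='policy', default='abort',
        help='one of: %s' % ', '.join(p.value for p in UnknownMemberPolicy))
    args = parser.parse_args(argv)
    # Looking an Enum up by value raises ValueError for unknown names,
    # which is the behaviour test_policy_unknown relies on.
    return UnknownMemberPolicy(args.unknown_members)
```

Because the lookup happens after argument parsing, an invalid policy name surfaces as a ValueError rather than as an argparse usage error; passing choices= to add_argument would be an alternative design that fails earlier.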
View it on GitLab: https://salsa.debian.org/pkg-privacy-team/mat2/commit/2ebbdb392594d41156aaac6cd3b5eecaa0d556a7
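
Note on TestZipOrder in tests/test_deep_cleaning.py: as shown in the diff, previous_name is assigned only for the first archive member, so later members are compared against the first name rather than against their immediate predecessor. A check that compares adjacent members could be sketched as below. The helper names (member_names, is_sorted, build_zip) are hypothetical and not part of libmat2; the sketch uses only the standard zipfile module and in-memory archives.

```python
import io
import zipfile
from typing import Iterable, List


def build_zip(names: Iterable[str]) -> bytes:
    """Build an in-memory archive whose members appear in the given order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zout:
        for name in names:
            zout.writestr(name, b'')
    return buf.getvalue()


def member_names(data: bytes) -> List[str]:
    """Return member names in the order they are stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zin:
        return [item.filename for item in zin.infolist()]


def is_sorted(names: List[str]) -> bool:
    """True if every name is >= the one immediately before it."""
    return all(prev <= cur for prev, cur in zip(names, names[1:]))
```

For example, is_sorted(member_names(build_zip(['a.xml', 'c.xml', 'b.xml']))) is False, whereas a loop that only compares against the first member would accept that order.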