Bug#898022: diffoscope: Traceback when comparing paths with invalid unicode characters

Mattia Rizzolo mattia at debian.org
Thu May 10 17:01:50 BST 2018

On Thu, May 10, 2018 at 04:43:37PM +0100, Chris Lamb wrote:
> > Do you think this would be fine?
> Whilst this works, would it not be better if we could use bytes for
> filenames throughout? I mean, AIUI there is no assumption that
> filesystems need to have any form of valid encoding whatsoever, let
> alone UTF-8.

That was my initial idea as well, but apparently the Python developers
are of a different opinion. Check out the PEP I linked in my previous
email: https://www.python.org/dev/peps/pep-0383/

Together with the argparse bug I also linked:
https://bugs.python.org/issue21416 - apparently it's "hard" (more like
impossible?) to get bytes from the CLI...

I believe that, as that bug shows, we should just specify
    type=os.fsencode    # https://docs.python.org/3/library/os.html#os.fsencode
in the parser.add_argument() calls that take a filename (to make sure
argparse doesn't change output), and then re-encode them before passing
them to functions that can't handle surrogate-escaped strings, such as
the magic module.
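
To make that concrete, here is a minimal sketch of the idea, not a patch
against diffoscope. The names build_parser() and printable() and the
positional names path1/path2 are made up for illustration. It assumes a
POSIX system, where Python decodes argv with the surrogateescape handler
(PEP 383), so os.fsencode() gives back the original bytes.

```python
import argparse
import os


def build_parser():
    # Hypothetical parser: the positional names are assumptions for this sketch.
    parser = argparse.ArgumentParser()
    # argparse hands each argument to the type callable as a str;
    # os.fsencode turns that str back into the bytes the shell passed in.
    parser.add_argument("path1", type=os.fsencode)
    parser.add_argument("path2", type=os.fsencode)
    return parser


def printable(path):
    # Hypothetical helper: build a str that is safe to hand to code that
    # cannot cope with lone surrogates. Undecodable bytes become U+FFFD.
    return path.decode("utf-8", errors="replace")


# Simulate what Python puts in sys.argv for a filename that is not valid UTF-8.
raw = b"caf\xe9.txt"
argv_entry = os.fsdecode(raw)

args = build_parser().parse_args([argv_entry, "plain.txt"])
print(args.path1 == raw)        # the original bytes survive the round trip
print(printable(args.path1))    # lossy, but contains no surrogates
```

The bytes would then be used for anything that touches the filesystem,
and the lossy str only for display or for libraries that reject
surrogates.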

> However, somewhat happy to see this in diffoscope as it certainly
> improves the current state of affairs. If you do commit it, please
> include my testcase (or something based on it) that I added in:
>   https://bugs.debian.org/898022#5

Of course.

                        Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540      .''`.
more about me:  https://mapreri.org                             : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-