Bug#833375: python3-debian: Cannot parse input of bytes under Python 3
stuart at debian.org
Mon Dec 26 08:02:03 UTC 2016
Control: tags -1 + patch
I'm very suspicious of automagical decoding from bytes to strings -- it is
asking for trouble. These are exactly the sorts of places where we make
mojibake or choke by double-decoding. Then again, we want python-debian to be
as useful as possible to people and I might have a workable plan.
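A small illustration of both failure modes, using a made-up string rather than anything from the bug report. Guessing the wrong encoding either "succeeds" and yields mojibake, or fails outright:

```python
text = "caf\u00e9"                     # 'café', an arbitrary example string
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'

# Wrong guess: UTF-8 bytes read as Latin-1 decode without error into mojibake.
mojibake = utf8_bytes.decode("latin-1")

# Wrong guess the other way: these Latin-1 bytes are not valid UTF-8,
# so the decode step raises instead of returning garbage.
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
    choked = False
except UnicodeDecodeError:
    choked = True

print(repr(mojibake), choked)
```

Neither outcome can be detected by the parser on its own, which is why the encoding has to come from somewhere explicit.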
Adding an extra acceptable input to `iter_paragraphs` actually looks easier
than where you have ended up; the attached patch seems to work OK. I'd
appreciate it if you could test it to confirm that it does indeed solve your
use case. (python-debian comaintainers, I'd appreciate an ack from you too)
To the example code in question:
> with open('Sources', 'rb') as f:
>     content = f.read()
This has explicitly opened the file as a *binary* file with no specified
encoding, and then the data is given to a *text* processor. Text parsers want
text, not bytes. Binary→text can't happen automatically without assuming an
encoding. That it (accidentally) works while being imprecise about encodings
with Python 2 is exactly the sort of thing that the str/unicode v str/bytes
changes between Python 2 and 3 were supposed to catch -- as it did in the
I have a definite preference for being explicit and decoding/encoding at the
program boundary with the caller dealing with the requested encoding (i.e.
using `open(…, encoding=…)` in Python 3 or `codecs` in Python 2).
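A minimal Python 3 sketch of decoding at the boundary. The file contents here are made-up sample data written to a temporary directory, not a real Sources file; the point is only the difference between the binary and the explicit-encoding open:

```python
import os
import tempfile

# Made-up sample data standing in for a real Sources file.
sample = (
    "Package: hello\n"
    "Version: 2.10-1\n"
    "Maintainer: Jos\u00e9 Example <jose@example.org>\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "Sources")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Binary mode: the caller gets bytes and no encoding has been stated.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with an explicit encoding: decoding happens once, here,
    # and everything downstream only ever sees str.
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(type(raw).__name__, type(content).__name__)
```

With the second form the caller, who knows where the file came from, is the one stating the encoding.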
> for src in Sources.iter_paragraphs(content):
>     print((src['Package'], src['Version']))
Now for me, the redemption is that `iter_paragraphs` is already a bit of a
DWIM interface that accepts a scary number of different things *and* it has an
`encoding` keyword argument with a default value of 'utf-8'... having a
specified encoding available enables automatic conversion.
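To show why an available `encoding` argument makes the conversion safe, here is a toy sketch. It is not the attached patch and not python-debian's parser; `ensure_text` and `split_paragraphs` are hypothetical names invented for this illustration, and the stanza splitting ignores continuation lines:

```python
def ensure_text(data, encoding="utf-8"):
    """Hypothetical helper: return str, decoding bytes with the stated encoding."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def split_paragraphs(data, encoding="utf-8"):
    """Toy stand-in for a paragraph iterator (assumes simple 'Key: value' lines)."""
    text = ensure_text(data, encoding)
    for block in text.split("\n\n"):
        block = block.strip()
        if block:
            yield dict(line.split(": ", 1) for line in block.splitlines())


# Made-up input: the same two stanzas as bytes and as text.
sample_bytes = b"Package: hello\nVersion: 2.10-1\n\nPackage: caf\xc3\xa9\nVersion: 1.0-1\n"
sample_text = sample_bytes.decode("utf-8")

from_bytes = list(split_paragraphs(sample_bytes, encoding="utf-8"))
from_text = list(split_paragraphs(sample_text))

for para in from_bytes:
    print((para["Package"], para["Version"]))
```

The bytes are decoded exactly once, with an encoding the caller can override, so there is no guessing and no second decode.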
As an aside, is there a real need in the example code to force the use of
deb822's internal 822 parser rather than using the much faster one from apt's
TagFile? The example snippet is clearly part of something bigger and perhaps
you need the content separately -- you may find that passing the fd straight to
iter_paragraphs is worthwhile.
Stuart Prescott http://www.nanonanonano.net/ stuart at nanonanonano.net
Debian Developer http://www.debian.org/ stuart at debian.org
GPG fingerprint 90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7