Bug#833375: python3-debian: Cannot parse input of bytes under Python 3

Mon Dec 26 08:02:03 UTC 2016

Control: tags -1 + patch

Hi Chris!

I'm very suspicious of automagical decoding from bytes to strings -- it is 
asking for trouble. These are exactly the sorts of places where we create 
mojibake or choke by double-decoding. Then again, we want python-debian to be 
as useful as possible to people, and I might have a workable plan.

Adding an extra acceptable input to `iter_paragraphs` actually looks easier 
than where you have ended up; the attached patch seems to work OK. I'd 
appreciate it if you could test it to confirm that it does indeed solve your 
use case. (python-debian co-maintainers, I'd appreciate an ack from you too.)

To the example code in question:

>     with open('Sources', 'rb') as f:
>         content = f.read()

This explicitly opens the file as a *binary* file with no specified 
encoding, and then hands the data to a *text* processor. Text parsers want 
text, not bytes. Binary→text can't happen automatically without assuming an 
encoding. That it (accidentally) works in Python 2 while being imprecise about 
encodings is exactly the sort of thing that the str/unicode v str/bytes 
changes between Python 2 and 3 were supposed to catch -- as they did in the 
example.
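
To make the "assuming an encoding" point concrete, here is a minimal sketch. 
The sample string is made up for illustration and is not taken from the bug 
report; it only shows that the same bytes become different text depending on 
the encoding that is guessed, and that Python 3 refuses to mix the two types.

```python
# The same bytes yield different text depending on the encoding assumed.
raw = "M\u00fcller".encode("utf-8")   # illustrative sample: "Müller" as UTF-8 bytes

as_utf8 = raw.decode("utf-8")         # the intended text
as_latin1 = raw.decode("latin-1")     # mojibake: each UTF-8 byte becomes one character

print(as_utf8)
print(as_latin1)

# Python 3 will not silently combine bytes and str.
try:
    raw + "text"
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)
```

Neither decode call raises an error, which is why a wrong guess tends to go 
unnoticed until the output looks odd.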

I have a definite preference for being explicit and decoding/encoding at the 
program boundary with the caller dealing with the requested encoding (i.e. 
using `open(…, encoding=…)` in Python 3 or `codecs` in Python 2). 
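
As a rough sketch of what "decoding at the program boundary" could look like 
in Python 3: the file contents, the maintainer name and the temporary file 
below are invented for illustration, and 'utf-8' is an assumption about what 
the caller knows of the data. Under Python 2 the `codecs` module would play 
the role that `open(…, encoding=…)` plays here.

```python
import os
import tempfile

# A made-up stand-in for a Sources file, written out so the example is self-contained.
sample = (
    "Package: hello\n"
    "Version: 2.10-1\n"
    "Maintainer: Jos\u00e9 Example <jose@example.org>\n"
)
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".Sources", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

# Decode once, at the boundary: the caller names the encoding,
# and everything downstream only ever sees str.
with open(path, encoding="utf-8") as f:
    content = f.read()

os.unlink(path)  # clean up the temporary file
print(type(content).__name__)
```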

>     for src in Sources.iter_paragraphs(content):
>          print((src['Package'], src['Version']))

Now for me, the redeeming feature is that `iter_paragraphs` is already a bit 
of a DWIM interface that accepts a scary number of different things *and* it 
has an `encoding` keyword argument with a default value of 'utf-8'. Having a 
specified encoding available is what makes automatic conversion possible.
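
The general idea can be sketched as follows. This is not the attached patch 
and not python-debian's code: `to_text_lines` is a hypothetical helper, 
written only to show how an `encoding` argument lets a function accept bytes 
without guessing.

```python
def to_text_lines(data, encoding="utf-8"):
    """Hypothetical helper: return a list of str lines from str or bytes input."""
    if isinstance(data, bytes):
        # The caller-supplied (or default) encoding makes the conversion explicit.
        data = data.decode(encoding)
    return data.splitlines()


# Illustrative input, not taken from a real Sources file.
raw = b"Package: hello\nVersion: 2.10-1\n"

lines_from_bytes = to_text_lines(raw)
lines_from_text = to_text_lines(raw.decode("utf-8"))

print(lines_from_bytes == lines_from_text)
```

A caller whose data is not UTF-8 would pass a different `encoding` rather 
than rely on the default.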

As an aside, is there a real need in the example code to force the use of 
deb822's internal 822 parser rather than using the much faster one from apt's 
TagFile? The example snippet is clearly part of something bigger and perhaps 
you need the content separately -- you may find that passing the fd straight to 
iter_paragraphs is worthwhile.

cheers
Stuart

-- 
Stuart Prescott    http://www.nanonanonano.net/   stuart at nanonanonano.net
Debian Developer   http://www.debian.org/         stuart at debian.org
GPG fingerprint    90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Allow-iter_paragraphs-to-accept-bytes.patch
Type: text/x-patch
Size: 1913 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/pkg-python-debian-maint/attachments/20161226/82721383/attachment.bin>