Bug#586021: python-debian: can deb822.Sources can not handle Sources file with mixed data

sean finney seanius at debian.org
Wed Jun 16 07:14:05 UTC 2010


hi john,

On Tue, Jun 15, 2010 at 02:40:33PM -0600, John Wright wrote:
>   * Accept 'raw' as a Deb822 constructor encoding argument, or add a
>     raw_strings keyword argument, that turns off the unicode behavior
>     - Con: old code still breaks with mixed data - you have to change
>       your code to use the new constructor argument
>     - Pro: most consistent results (raw strings are only returned if you
>       explicitly ask for them)
>   
>   * Wrap unicode stuff in try/except, and use the raw string if
>     something goes wrong
>     - Con: not as consistent results as above option
>     - Pro: old code works out-of-box with mixed data
> 
> Which one do you think makes more sense?

the problem with the former is that since the input is typically outside
the control of the programmer, most people would end up passing the new
argument every time, which kinda defeats the purpose and also complicates the api.
And I agree about the consistency issue you raise with the latter,
so I don't think either of the two is that great.
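
to make the consistency concern concrete, here is a minimal sketch of the
second option, written for python 3 rather than the python 2 code under
discussion. `decode_or_raw` is a hypothetical helper, not part of deb822:
it tries the requested encoding and hands back the raw bytes when that
fails, so the type of the result ends up depending on the data:

```python
def decode_or_raw(raw, encoding="utf-8"):
    """Hypothetical helper: decode raw bytes, or return them untouched on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fallback from the second option: keep the raw bytes as-is.
        return raw


values = [
    decode_or_raw(b"Maintainer: Jos\xc3\xa9"),  # valid UTF-8
    decode_or_raw(b"Maintainer: Jos\xe9"),      # Latin-1 bytes, invalid as UTF-8
]
# Collect the type names to show that callers get two different types back.
types = [type(v).__name__ for v in values]
```

callers then have to check the type of every value they get back, which is
the inconsistency mentioned above.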

fwiw, after having looked at the code i have found a workaround in
the meantime, which may point at another option for a real solution.
since the encoding seems to be stored on each instance that the iterator
returns, explicitly setting it after catching the UnicodeDecodeError seems to
get around the problem:

slist = deb822.Sources.iter_paragraphs(fh)
for ent in slist:
    try:
        outf.write(ent.dump().encode('utf-8'))
    except UnicodeDecodeError:
        # this paragraph is not valid utf-8, so retry it as latin-1
        ent.encoding = 'latin-1'
        outf.write(ent.dump().encode('utf-8'))
    outf.write("\n")

however trying to do this:

    outf.write(ent.dump(encoding='latin-1').encode('utf-8'))

does not work, as it seems there's still something somewhere using the
instance's encoding attribute instead of the function parameter.  if *that*
could be fixed, i don't think we'd have a bug here.  i.e.:

slist = deb822.Sources.iter_paragraphs(fh)
for ent in slist:
    try:
        outf.write(ent.dump().encode('utf-8'))
    except UnicodeDecodeError:
        outf.write(ent.dump(encoding='latin-1').encode('utf-8'))
    outf.write("\n")

is pretty much what I would have expected to need in my code knowing
that deb822 now uses unicode internally.  it feels very python like and
doesn't involve any extra API changes.  what do you think?
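
for illustration, a self-contained python 3 sketch of that behaviour. the
`Paragraph` class and the `dump_all` function are hypothetical stand-ins,
not deb822's actual code; the only assumption is that dump() lets an
explicit encoding argument override the per-instance attribute:

```python
class Paragraph:
    """Hypothetical stand-in for a deb822 paragraph; not the real class."""

    def __init__(self, raw_fields, encoding="utf-8"):
        self.raw_fields = raw_fields  # list of (bytes key, bytes value) pairs
        self.encoding = encoding      # per-instance default encoding

    def dump(self, encoding=None):
        # The explicit argument wins; the instance attribute is only a default.
        enc = encoding or self.encoding
        return "".join(
            "%s: %s\n" % (key.decode(enc), value.decode(enc))
            for key, value in self.raw_fields
        )


def dump_all(paragraphs):
    """Dump each paragraph as UTF-8 bytes, retrying as Latin-1 on decode errors."""
    out = []
    for para in paragraphs:
        try:
            text = para.dump()
        except UnicodeDecodeError:
            text = para.dump(encoding="latin-1")
        out.append(text.encode("utf-8") + b"\n")
    return b"".join(out)


# Mixed input: the first paragraph is UTF-8, the second is Latin-1.
mixed = [
    Paragraph([(b"Package", b"foo"), (b"Maintainer", b"Jos\xc3\xa9")]),
    Paragraph([(b"Package", b"bar"), (b"Maintainer", b"Jos\xe9")]),
]
result = dump_all(mixed)
```

with that in place, the latin-1 paragraph comes out re-encoded as utf-8
next to the one that was already valid utf-8, and the caller never has to
touch the instance's encoding attribute.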



	sean