Bug#420636: [xml/sgml-pkgs] Bug#420636: crashes on feeds that contain invalid utf-8 sequences

Mike Hommey mh at glandium.org
Mon Apr 23 18:57:17 UTC 2007


On Mon, Apr 23, 2007 at 02:34:25PM -0400, Joey Hess <joeyh at debian.org> wrote:
> Mike Hommey wrote:
> > On Mon, Apr 23, 2007 at 02:03:44PM -0400, Joey Hess <joeyh at debian.org> wrote:
> > > Mike Hommey wrote:
> > > > Such a lax xml parser is not an xml parser. This bug is therefore a
> > > > wishlist bug.
> > > 
> > > Not being able to parse as many xml feeds out in the wild as other
> > > languages' parsers is not a bug?
> > 
> > The fact is they are *not* xml feeds.
> 
> The fact is that they are out there in the wild and tools are needed to
> parse them. Do you read Planet Debian? I'll guarantee you that at least
> one feed currently on there is not valid xml.
> 
> It seems that you are more interested in arguing semantics than actually
> fixing the problem? (Which I've worked around in my code now anyway by
> calling Encode::decode_utf8 on the feed if XML::Feed crashes.)

And this is exactly how you should be dealing with the "problem". The
XML specifications place requirements on XML parsers. If you want an
XML parser to parse something other than valid XML data, you should
take your own steps so that your data becomes valid XML data.
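
A minimal sketch of such a pre-cleaning step, written in Python rather
than the Perl discussed in this thread. It assumes the feed is meant to
be UTF-8; the function name parse_feed_leniently and the sample bytes
are hypothetical, and this is not what XML::Feed or libxml2 do
internally. It tries a strict parse first and only rewrites the input
if that fails:

```python
import xml.etree.ElementTree as ET


def parse_feed_leniently(raw: bytes) -> ET.Element:
    """Parse strictly; on failure, replace invalid UTF-8 and retry once."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        # Assumption: the feed is supposed to be UTF-8. Invalid byte
        # sequences are replaced with U+FFFD so that the parser only
        # ever sees correctly encoded input.
        cleaned = raw.decode("utf-8", errors="replace").encode("utf-8")
        return ET.fromstring(cleaned)


# Hypothetical feed fragment containing a stray Latin-1 byte (0xE9).
bad_feed = b"<feed><title>caf\xe9</title></feed>"
root = parse_feed_leniently(bad_feed)
title = root.find("title").text
```

The parser itself stays strict here; the leniency lives in the calling
code, which is also the place to log or report the broken feed.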

It is because tools that tolerate invalid XML exist that there are
still invalid XML files around.

Mike




