[Python-modules-team] Bug#763990: white spaces changed to &nbsp_place_holder;

Steve B b.steve at gmx.com
Mon Oct 20 17:39:01 UTC 2014


On 10/18/2014 09:08 PM, Stefano Rivera wrote:
> Control: tag -1 upstream
>
> Hi Steve (2014.10.04_07:54:18_-0700)
>> http://www.nextinpact.com/news/90246-twitch-et-valve-veulent-plus-transparence-sur-contenus-sponsorises.htm
>
> Running it on that page doesn't seem to introduce any
> &nbsp_place_holder; entities.
>
> Can you find any HTML that reproduces this?
>
> SR
>

Hi Stefano.

My first description was misleading, sorry.

It is the website's RSS feed, not the page itself, that has to be parsed
with html2text, because I use html2text through rss2email.

Steps to reproduce:
wget http://www.nextinpact.com/rss/news.xml
python /usr/share/pyshared/html2text.py news.xml > news.html
python /usr/share/pyshared/html2text.py news.html
The first pass converts XML to HTML and the second one converts HTML to
plain text (I guess that's what rss2email does).

There you should see &nbsp_place_holder; entities.
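
To make the check less manual, the output of the second pass can be
scanned for the placeholder. This is only a minimal sketch in Python 3
using the standard library; the function name find_placeholders and the
context parameter are my own choices for illustration, not part of
html2text or rss2email.

```python
import re

# The literal string reported in this bug.
PLACEHOLDER = "&nbsp_place_holder;"


def find_placeholders(text, context=20):
    """Return each placeholder occurrence with a few surrounding characters."""
    hits = []
    for match in re.finditer(re.escape(PLACEHOLDER), text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append(text[start:end])
    return hits
```

Saving the output of the second command to a file and passing its
contents to find_placeholders() should return a non-empty list if the
bug is present.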

Looking further into this feed, I see other similar problems with
character entities:
the entity for é, which should produce "é", sometimes gives "e".
the entity for è, which should produce "è", sometimes gives "e".
the entity for ê, which should produce "ê", sometimes gives "e".
...

But this behavior is not consistent: I see words like "société" written
as "societe" and others like "vidéo" written as "video" but also as "vidéo".
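
One way to spot this kind of inconsistency in the converted text is to
group words by their accent-stripped form and report the groups that
contain more than one spelling. A rough Python 3 sketch, standard
library only; the helper names strip_accents and mixed_accent_words are
hypothetical and have nothing to do with html2text itself.

```python
import re
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Remove combining marks, e.g. 'vidéo' becomes 'video'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def mixed_accent_words(text):
    """Map each accent-stripped word to its spellings when there are several."""
    variants = defaultdict(set)
    # Normalize to NFC first so that accented letters are single characters.
    normalized = unicodedata.normalize("NFC", text).lower()
    for word in re.findall(r"\w+", normalized):
        variants[strip_accents(word)].add(word)
    return {base: sorted(forms) for base, forms in variants.items()
            if len(forms) > 1}
```

This only flags words that occur in both forms in the same text, such as
"video" and "vidéo"; a word that always loses its accent, like "societe",
would need a dictionary check instead.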


Steve


