[Python-modules-team] Bug#763990: white spaces changed to &nbsp_place_holder;
Steve B
b.steve at gmx.com
Mon Oct 20 17:39:01 UTC 2014
On 10/18/2014 09:08 PM, Stefano Rivera wrote:
> Control: tag -1 upstream
>
> Hi Steve (2014.10.04_07:54:18_-0700)
>> http://www.nextinpact.com/news/90246-twitch-et-valve-veulent-plus-transparence-sur-contenus-sponsorises.htm
>
> Running it on that page doesn't seem to introduce any
> &nbsp_place_holder; entities.
>
> Can you find any HTML that reproduces this?
>
> SR
>
Hi Stefano.
My first description was misleading, sorry.
It's the website's RSS feed, not the page itself, that should be parsed
with html2text, as I use it through rss2email.
Steps to reproduce:
wget http://www.nextinpact.com/rss/news.xml
python /usr/share/pyshared/html2text.py news.xml > news.html
python /usr/share/pyshared/html2text.py news.html
The first pass converts XML to HTML and the second one HTML to plain text (I
guess that's what rss2email does).
There you should see &nbsp_place_holder; entities.
Looking further into this feed I see other similar problems:
&eacute; which should produce "é" sometimes gives "e".
&egrave; which should produce "è" sometimes gives "e".
&ecirc; which should produce "ê" sometimes gives "e".
...
But this behavior is not consistent: I see words like "société" written
as "societe" and others like "vidéo" written as "video" but also as "vidéo".
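To make the expected versus observed output concrete, here is a minimal sketch using only the Python 3 standard library. It is not html2text's code: `expected_char` and `ascii_fold` are hypothetical helper names, and the accent-stripping step is only an assumption about the kind of lossy fallback that would turn "é" into "e".

```python
import unicodedata
from html.entities import name2codepoint


def expected_char(entity_name):
    """Return the character a named HTML entity stands for (e.g. 'eacute' -> 'é')."""
    return chr(name2codepoint[entity_name])


def ascii_fold(text):
    """Drop combining accents after decomposition (assumed lossy fallback, 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Compare what each entity should produce with its accent-stripped form.
for name in ("eacute", "egrave", "ecirc"):
    ch = expected_char(name)
    print("&%s; -> %s (folded: %s)" % (name, ch, ascii_fold(ch)))
```

Under that assumption, the first value on each line is the output I would expect and the folded value is what sometimes appears instead.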
Steve