[Python-modules-team] Bug#763990: white spaces changed to &nbsp_place_holder;
b.steve at gmx.com
Mon Oct 20 17:39:01 UTC 2014
On 10/18/2014 09:08 PM, Stefano Rivera wrote:
> Control: tag -1 upstream
> Hi Steve (2014.10.04_07:54:18_-0700)
> Running it on that page doesn't seem to introduce any
> &nbsp_place_holder; entities.
> Can you find any HTML that reproduces this?
My first description is misleading, sorry.
It's the RSS feed of the website that should be parsed with html2text,
as I use it through rss2email.
Steps to reproduce:
python /usr/share/pyshared/html2text.py news.xml > news.html
python /usr/share/pyshared/html2text.py news.html
The first pass converts XML to HTML and the second one converts HTML to
plain text (I guess that's what rss2email does).
There you should see &nbsp_place_holder; entities.
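
To check the output of the second pass without reading it by eye, something like the sketch below could be used. It only scans text that has already been converted; it does not call html2text itself. The marker string is assumed to be the literal `&nbsp_place_holder;`, and the helper names are made up for illustration.

```python
# Assumption: the leftover marker is exactly this literal string.
PLACEHOLDER = "&nbsp_place_holder;"


def count_placeholders(text, marker=PLACEHOLDER):
    """Return how many times the marker survives in the converted text."""
    return text.count(marker)


def lines_with_placeholder(text, marker=PLACEHOLDER):
    """Return (line_number, line) pairs for every line containing the marker."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if marker in line
    ]


if __name__ == "__main__":
    import sys

    # Reads the second-pass output from standard input.
    converted = sys.stdin.read()
    print(count_placeholders(converted), "placeholder(s) found")
    for number, line in lines_with_placeholder(converted):
        print("%d: %s" % (number, line))
```

Saved under a hypothetical name such as check_placeholder.py, it can be fed the output of the second command above through a pipe.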
Looking further into this feed, I see other similar problems:
&eacute; which should produce "é" sometimes gives "e".
&egrave; which should produce "è" sometimes gives "e".
&ecirc; which should produce "ê" sometimes gives "e".
But this behavior is not consistent: I see words like "société" written
as "societe" and others like "vidéo" written as "video" but also as "vidéo".
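
To make the inconsistency easier to measure, here is a small sketch using only the Python 3 standard library. It assumes the feed uses named entities such as `&eacute;`; `html.unescape` shows what they should decode to, and the report counts how often a word appears with and without its accents in the converted text. The function names are hypothetical and this says nothing about how html2text works internally.

```python
import html
import unicodedata


def expected_text(fragment):
    """Decode HTML entities the way a Unicode-preserving converter should."""
    return html.unescape(fragment)


def strip_accents(word):
    """Return the word with combining accents removed (e.g. for comparison)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_report(expected_words, actual_text):
    """Map each expected word to (accented_count, flattened_count).

    Counts are taken over whitespace-separated tokens of actual_text.
    """
    tokens = actual_text.split()
    report = {}
    for word in expected_words:
        flattened = strip_accents(word)
        report[word] = (tokens.count(word), tokens.count(flattened))
    return report
```

A word that shows non-zero counts in both positions is one that was converted inconsistently within the same output.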