Inconsistency in the Reuters dataset from python-nltk

Fri Jan 20 19:42:57 UTC 2017

Hello!

I recently encountered a problem while using the Reuters dataset from
python-nltk.

I'm currently running python-nltk 3.2.1-2 on Debian Stretch, and I
downloaded the Reuters corpus using nltk.download().

In the dataset, the character "<" (U+003C) is incorrectly represented as
the HTML entity "&lt;".

This inconsistency occurs only for this character, while all the other
special characters seem to be correctly encoded.

I generated a list of files presenting the problem, but I'm not sure
where to file the bug report.
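
For reference, here is a minimal sketch of how such a list can be
generated. It assumes the corpus was installed with nltk.download() and
that the stray text is the literal entity "&lt;". The helper names
(files_with_entity, load_reuters_raw) and the sample documents are made
up for illustration; only reuters.fileids(), reuters.raw() and
html.unescape() are real library calls.

```python
import html

ENTITY = "&lt;"  # HTML/SGML entity that should have been "<" (U+003C)


def files_with_entity(raw_texts, entity=ENTITY):
    """Return the sorted ids of documents whose raw text contains `entity`.

    `raw_texts` is a mapping of file id -> raw document text.
    """
    return sorted(fid for fid, text in raw_texts.items() if entity in text)


def load_reuters_raw():
    """Load every Reuters document as raw text (requires the NLTK corpus)."""
    # Imported here so the rest of the sketch runs without NLTK installed.
    from nltk.corpus import reuters
    return {fid: reuters.raw(fid) for fid in reuters.fileids()}


# Made-up sample documents standing in for the real corpus.
sample = {
    "example/0001": "ACME CORP &lt;ACME> SETS DIVIDEND",
    "example/0002": "No entity in this text",
}

affected = files_with_entity(sample)
print(affected)

# Possible workaround on the user side: decode the entity after loading.
fixed = html.unescape(sample["example/0001"])
print(fixed)
```

To run it against the real corpus, replace sample with
load_reuters_raw(). Note that html.unescape() decodes every entity it
finds, not only "&lt;", which should be harmless if the other special
characters are already stored correctly.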

I only have access to Debian computers, so I haven't tried to replicate
the problem on other systems. Since the character is represented in HTML
notation and the original Reuters dataset is stored in SGML format[1],
I suspect the problem was introduced when the files were generated from
it, so it might affect NLTK in general.

[1]
http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

Thank you for your attention,

Matteo Gamba

-- 

XMPP: argon at hipatia.net

----------------------------------------------------------------------------
Fingerprint: 1CD6 BCD3 582C 9107 3173 AE0C 1457 F9D5 E4DE AEB8
Public key: 0xE4DEAEB8 - http://pgp.mit.edu
----------------------------------------------------------------------------

http://guri.hipatia.net
http://www.hipatia.net