Inconsistency in the Reuters dataset from python-nltk
Matteo Gamba
argon at hipatia.net
Fri Jan 20 19:42:57 UTC 2017
Hello!
I recently encountered a problem while using the Reuters dataset from
python-nltk.
I'm currently running python-nltk 3.2.1-2 on Debian Stretch, and I
downloaded the Reuters corpus using nltk.download().
In the dataset, the character "<" (U+003C) is incorrectly represented as
"&lt;".
This inconsistency occurs only for this character, while all the other
special characters seem to be correctly encoded.
I generated a list of files presenting the problem, but I'm not sure
where to file the bug report.
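For anyone who wants to check their own copy, here is a minimal sketch of how such a list could be produced. It is not the exact script I used. The helper names files_with_escaped_lt and load_reuters_raw are hypothetical, and the sample text is made up for illustration rather than taken from the corpus. load_reuters_raw assumes the Reuters corpus has already been fetched with nltk.download().

```python
import html

ESCAPED_LT = "&lt;"  # HTML/SGML entity for "<" (U+003C)


def files_with_escaped_lt(raw_by_fileid):
    """Return the sorted file ids whose raw text contains the literal '&lt;'."""
    return sorted(fid for fid, text in raw_by_fileid.items() if ESCAPED_LT in text)


def load_reuters_raw():
    """Read every Reuters document through NLTK (assumes the corpus is installed)."""
    # Imported lazily so the helper above can be tried without NLTK.
    from nltk.corpus import reuters
    return {fid: reuters.raw(fid) for fid in reuters.fileids()}


# Made-up sample standing in for the corpus; not real Reuters text.
sample = {
    "test/1": "Shares of &lt;ACME CORP> rose.",
    "test/2": "No entity here.",
}

affected = files_with_escaped_lt(sample)
print(affected)

# html.unescape turns the entity back into the plain character.
print(html.unescape(sample["test/1"]))
```

To run it against the real data, call files_with_escaped_lt(load_reuters_raw()) instead of passing the sample.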
I only have access to Debian computers, so I haven't tried to reproduce
the problem on other systems. Since the character is represented in HTML
notation, I suspect the problem was introduced when the files were generated
from the original Reuters dataset, which is stored in SGML format[1],
so it might affect NLTK in general and not only the Debian package.
[1]
http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
Thank you for your attention,
Matteo Gamba
--
XMPP: argon at hipatia.net
----------------------------------------------------------------------------
Fingerprint: 1CD6 BCD3 582C 9107 3173 AE0C 1457 F9D5 E4DE AEB8
Public key: 0xE4DEAEB8 - http://pgp.mit.edu
----------------------------------------------------------------------------
http://guri.hipatia.net
http://www.hipatia.net