[Python-modules-team] Bug#720341: [python-sphinxcontrib.spelling] Spellchecker is not unicode aware in PythonBuiltinsFilter class

Slavko linux at slavino.sk
Tue Aug 20 18:08:42 UTC 2013


Package: python-sphinxcontrib.spelling
Version: 1.4-1
Severity: normal
Tags: patch

Hi,

the package has a problem with unicode strings/words in the
PythonBuiltinsFilter's _skip method. When I try to check rst
documents written in the Slovak language, I get:

Exception occurred:
  File "/usr/lib/pymodules/python2.7/sphinx/application.py", line 204, in build
    self.builder.build_update()
  File "/usr/lib/pymodules/python2.7/sphinx/builders/__init__.py", line 191, in build_update
    self.build(['__all__'], to_build)
  File "/usr/lib/pymodules/python2.7/sphinx/builders/__init__.py", line 252, in build
    self.write(docnames, list(updated_docnames), method)
  File "/usr/lib/pymodules/python2.7/sphinx/builders/__init__.py", line 292, in write
    self.write_doc(docname, doctree)
  File "/usr/lib/pymodules/python2.7/sphinxcontrib/spelling.py", line 295, in write_doc
    for word, suggestions in self.checker.check(node.astext()):
  File "/usr/lib/pymodules/python2.7/sphinxcontrib/spelling.py", line 203, in check
    for word, pos in self.tokenizer(text):
  File "/usr/lib/python2.7/dist-packages/enchant/tokenize/__init__.py", line 389, in next
    (word,pos) = next(self._tokenizer)
  File "/usr/lib/python2.7/dist-packages/enchant/tokenize/__init__.py", line 389, in next
    (word,pos) = next(self._tokenizer)
  File "/usr/lib/python2.7/dist-packages/enchant/tokenize/__init__.py", line 390, in next
    while self._skip(word):
  File "/usr/lib/pymodules/python2.7/sphinxcontrib/spelling.py", line 150, in _skip
    return hasattr(__builtin__, word)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 2: ordinal not in range(128)

After some inspection I found that sphinx passes all words as unicode
strings (<type 'unicode'>), regardless of whether they contain
non-ASCII characters. When a word does contain non-ASCII characters
there is a problem, because the hasattr function expects a *str*
argument and implicitly encodes the unicode word with the ascii
codec. The solution seems to be to add "encode...", converting the
word from unicode to str (at line 149):

   return hasattr(__builtin__, word.encode("utf-8"))

I am not sure whether this is a workaround or a proper solution, but
it seems to work for English texts too. Patch attached.
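The attached patch is not shown inline, so here is a minimal sketch of
the described fix as a standalone helper. The function name
skip_builtin is mine (the original is the _skip method of
PythonBuiltinsFilter), and the version check is my addition so the
same code also runs unchanged on Python 3:

```python
import sys

try:
    import builtins as _builtins        # Python 3
except ImportError:
    import __builtin__ as _builtins     # Python 2

def skip_builtin(word):
    """Return True if `word` names a Python builtin (e.g. len, print).

    On Python 2, hasattr() implicitly encodes a unicode argument with
    the ascii codec, so a non-ASCII word such as u'dl\xe1ha' raises
    UnicodeEncodeError. Encoding to UTF-8 up front avoids that, which
    is the change proposed above for spelling.py line 149.
    """
    if sys.version_info[0] < 3 and isinstance(word, unicode):
        word = word.encode("utf-8")
    return hasattr(_builtins, word)
```

With this change a non-ASCII word simply fails the hasattr lookup and
is reported to the spellchecker as a normal word, instead of aborting
the whole build with a traceback.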

regards

--- System information. ---
Architecture: amd64
Kernel:       Linux 3.10-2-amd64

Debian Release: jessie/sid

--- Package information. ---
Depends               (Version) | Installed
===============================-+-============
python                          | 2.7.5-2
python-support      (>= 0.90.0) | 1.0.15
python-docutils                 | 0.10-3
python-enchant                  | 1.6.5-2
python-sphinx                   | 1.1.3+dfsg-8


-- 
Slavko
http://slavino.sk

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spelling_unicode.patch
Type: text/x-diff
Size: 447 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/python-modules-team/attachments/20130820/79cf5ace/attachment.patch>