[Python-modules-commits] [python-bleach] 01/05: Imported Upstream version 2.1.2

Scott Kitterman kitterman at moszumanska.debian.org
Wed Jan 10 04:39:33 UTC 2018


This is an automated email from the git hooks/post-receive script.

kitterman pushed a commit to branch debian/master
in repository python-bleach.

commit 0bf95a330ef15873da77462572f2fb78f5b3e633
Author: Scott Kitterman <scott at kitterman.com>
Date:   Tue Jan 9 23:25:10 2018 -0500

    Imported Upstream version 2.1.2
---
 .gitignore                               |   1 +
 CHANGES                                  | 262 ++++++++++++++----
 CODE_OF_CONDUCT.rst                      |   9 +
 CONTRIBUTORS                             |  10 +-
 MANIFEST.in                              |   3 +
 README.rst                               |  36 ++-
 bleach/__init__.py                       |  28 +-
 bleach/callbacks.py                      |  10 +-
 bleach/encoding.py                       |  62 -----
 bleach/linkifier.py                      |  19 +-
 bleach/sanitizer.py                      | 317 ++++++++++++++++++++-
 bleach/utils.py                          |  21 ++
 bleach/version.py                        |   6 -
 docs/clean.rst                           |  36 ++-
 docs/conf.py                             |  14 +-
 docs/dev.rst                             |  34 ++-
 docs/goals.rst                           |  76 +++++-
 setup.py                                 |  27 +-
 tests/data/13.test.out                   |   2 +-
 tests/data/14.test.out                   |   2 +-
 tests/data/15.test.out                   |   2 +-
 tests/data/16.test.out                   |   2 +-
 tests/data/17.test.out                   |   2 +-
 tests/data/18.test.out                   |   2 +-
 tests/data/19.test.out                   |   3 +-
 tests/data/20.test                       |   1 +
 tests/data/20.test.out                   |   1 +
 tests/test_basics.py                     | 365 -------------------------
 tests/test_callbacks.py                  |  63 +++++
 tests/test_clean.py                      | 454 +++++++++++++++++++++++++++++++
 tests/{test_links.py => test_linkify.py} |  33 ++-
 tests/test_security.py                   |  42 ++-
 tests_website/.gitignore                 |   1 +
 tests_website/README.rst                 |  28 ++
 tests_website/data_to_json.py            |  53 ++++
 tests_website/index.html                 | 149 ++++++++++
 tests_website/open_test_page.py          |  36 +++
 tests_website/server.py                  |  52 ++++
 tox.ini                                  |  61 ++++-
 39 files changed, 1753 insertions(+), 572 deletions(-)

diff --git a/.gitignore b/.gitignore
index f5adb54..26bbdf8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,4 @@ build
 docs/_build/
 .cache/
 .eggs/
+.*env*/
diff --git a/CHANGES b/CHANGES
index 7caa99f..47bf390 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,9 +1,108 @@
 Bleach Changes
 ==============
 
+Version 2.1.2 (December 7th, 2017)
+----------------------------------
+
+**Security fixes**
+
+None
+
+**Backwards incompatible changes**
+
+None
+
+**Features**
+
+None
+
+**Bug fixes**
+
+* Support html5lib-python 1.0.1. (#337)
+
+* Add deprecation warning for supporting html5lib-python < 1.0.
+
+* Switch to semver.
+
+
+Version 2.1.1 (October 2nd, 2017)
+---------------------------------
+
+**Security fixes**
+
+None
+
+**Backwards incompatible changes**
+
+None
+
+**Features**
+
+None
+
+**Bug fixes**
+
+* Fix ``setup.py`` opening files when ``LANG=``. (#324)
+
+
+Version 2.1 (September 28th, 2017)
+----------------------------------
+
+**Security fixes**
+
+* Convert control characters (backspace particularly) to "?" preventing
+  malicious copy-and-paste situations. (#298)
+
+  See `<https://github.com/mozilla/bleach/issues/298>`_ for more details.
+
+  This affects all previous versions of Bleach. Check the comments on that
+  issue for ways to alleviate the issue if you can't upgrade to Bleach 2.1.
+
+
+**Backwards incompatible changes**
+
+* Redid versioning. ``bleach.VERSION`` is no longer available. Use the string
+  version at ``bleach.__version__`` and parse it with
+  ``pkg_resources.parse_version``. (#307)
+
+* clean, linkify: linkify and clean should only accept text types; thank you,
+  Janusz! (#292)
+
+* clean, linkify: accept only unicode or utf-8-encoded str (#176)
+
+
+**Features**
+
+
+**Bug fixes**
+
+* ``bleach.clean()`` no longer unescapes entities including ones that are missing
+  a ``;`` at the end which can happen in urls and other places. (#143)
+
+* linkify: fix http links inside of mailto links; thank you, sedrubal! (#300)
+
+* clarify security policy in docs (#303)
+
+* fix dependency specification for html5lib 1.0b8, 1.0b9, and 1.0b10; thank you,
+  Zoltán! (#268)
+
+* add Bleach vs. html5lib comparison to README; thank you, Stu Cox! (#278)
+
+* fix KeyError exceptions on tags without href attr; thank you, Alex Defsen!
+  (#273)
+
+* add test website and scripts to test ``bleach.clean()`` output in browser;
+  thank you, Greg Guthe!
+
+
 Version 2.0 (March 8th, 2017)
 -----------------------------
 
+**Security fixes**
+
+* None
+
+
 **Backwards incompatible changes**
 
 * Removed support for Python 2.6. #206
@@ -15,6 +114,11 @@ Version 2.0 (March 8th, 2017)
   This version is a rewrite to use the new sanitizing API since the old
   one was dropped in html5lib 0.99999999 (8 9s).
 
+  If you're using 0.9999999 (7 9s) upgrade to 0.99999999 (8 9s) or higher.
+
+  If you're using 1.0b8 (equivalent to 0.9999999 (7 9s)), upgrade to 1.0b9
+  (equivalent to 0.99999999 (8 9s)) or higher.
+
 * ``bleach.clean`` and friends were rewritten
 
   ``clean`` was reimplemented as an html5lib filter and happens at a different
@@ -105,9 +209,13 @@ Version 2.0 (March 8th, 2017)
 Version 1.5 (November 4th, 2016)
 --------------------------------
 
+**Security fixes**
+
+* None
+
 **Backwards incompatible changes**
 
-- clean: The list of ``ALLOWED_PROTOCOLS`` now defaults to http, https and
+* clean: The list of ``ALLOWED_PROTOCOLS`` now defaults to http, https and
   mailto.
 
   Previously it was a long list of protocols something like ed2k, ftp, http,
@@ -116,28 +224,40 @@ Version 1.5 (November 4th, 2016)
 
 **Changes**
 
-- clean: Added ``protocols`` to arguments list to let you override the list of
+* clean: Added ``protocols`` to arguments list to let you override the list of
   allowed protocols. Thank you, Andreas Malecki! #149
-- linkify: Fix a bug involving periods at the end of an email address. Thank you,
+
+* linkify: Fix a bug involving periods at the end of an email address. Thank you,
   Lorenz Schori! #219
-- linkify: Fix linkification of non-ascii ports. Thank you Alexandre, Macabies!
+
+* linkify: Fix linkification of non-ascii ports. Thank you, Alexandre Macabies!
   #207
-- linkify: Fix linkify inappropriately removing node tails when dropping nodes.
+
+* linkify: Fix linkify inappropriately removing node tails when dropping nodes.
   #132
-- Fixed a test that failed periodically. #161
-- Switched from nose to py.test. #204
-- Add test matrix for all supported Python and html5lib versions. #230
-- Limit to html5lib ``>=0.999,!=0.9999,!=0.99999,<0.99999999`` because 0.9999
+
+* Fixed a test that failed periodically. #161
+
+* Switched from nose to py.test. #204
+
+* Add test matrix for all supported Python and html5lib versions. #230
+
+* Limit to html5lib ``>=0.999,!=0.9999,!=0.99999,<0.99999999`` because 0.9999
   and 0.99999 are busted.
-- Add support for ``python setup.py test``. #97
+
+* Add support for ``python setup.py test``. #97
 
 
 Version 1.4.3 (May 23rd, 2016)
 ------------------------------
 
+**Security fixes**
+
+* None
+
 **Changes**
 
-- Limit to html5lib ``>=0.999,<0.99999999`` because of impending change to
+* Limit to html5lib ``>=0.999,<0.99999999`` because of impending change to
   sanitizer api. #195
 
 
@@ -146,10 +266,13 @@ Version 1.4.2 (September 11, 2015)
 
 **Changes**
 
-- linkify: Fix hang in linkify with ``parse_email=True``. #124
-- linkify: Fix crash in linkify when removing a link that is a first-child. #136
-- Updated TLDs.
-- linkify: Don't remove exterior brackets when linkifying. #146
+* linkify: Fix hang in linkify with ``parse_email=True``. #124
+
+* linkify: Fix crash in linkify when removing a link that is a first-child. #136
+
+* Updated TLDs.
+
+* linkify: Don't remove exterior brackets when linkifying. #146
 
 
 Version 1.4.1 (December 15, 2014)
@@ -157,8 +280,9 @@ Version 1.4.1 (December 15, 2014)
 
 **Changes**
 
-- Consistent order of attributes in output.
-- Python 3.4 support.
+* Consistent order of attributes in output.
+
+* Python 3.4 support.
 
 
 Version 1.4 (January 12, 2014)
@@ -166,44 +290,54 @@ Version 1.4 (January 12, 2014)
 
 **Changes**
 
-- linkify: Update linkify to use etree type Treewalker instead of simpletree.
-- Updated html5lib to version ``>=0.999``.
-- Update all code to be compatible with Python 3 and 2 using six.
-- Switch to Apache License.
+* linkify: Update linkify to use etree type Treewalker instead of simpletree.
+
+* Updated html5lib to version ``>=0.999``.
+
+* Update all code to be compatible with Python 3 and 2 using six.
+
+* Switch to Apache License.
 
 
 Version 1.3
 -----------
 
-- Used by Python 3-only fork.
+* Used by Python 3-only fork.
 
 
 Version 1.2.2 (May 18, 2013)
 ----------------------------
 
-- Pin html5lib to version 0.95 for now due to major API break.
+* Pin html5lib to version 0.95 for now due to major API break.
+
 
 Version 1.2.1 (February 19, 2013)
 ---------------------------------
 
-- clean() no longer considers ``feed:`` an acceptable protocol due to
+* ``clean()`` no longer considers ``feed:`` an acceptable protocol due to
   inconsistencies in browser behavior.
 
 
 Version 1.2 (January 28, 2013)
 ------------------------------
 
-- linkify() has changed considerably. Many keyword arguments have been
-  replaced with a single callbacks list. Please see the documentation
-  for more information.
-- Bleach will no longer consider unacceptable protocols when linkifying.
-- linkify() now takes a tokenizer argument that allows it to skip
+* ``linkify()`` has changed considerably. Many keyword arguments have been
+  replaced with a single callbacks list. Please see the documentation for more
+  information.
+
+* Bleach will no longer consider unacceptable protocols when linkifying.
+
+* ``linkify()`` now takes a tokenizer argument that allows it to skip
   sanitization.
-- delinkify() is gone.
-- Removed exception handling from _render. clean() and linkify() may now
-  throw.
-- linkify() correctly ignores case for protocols and domain names.
-- linkify() correctly handles markup within an <a> tag.
+
+* ``delinkify()`` is gone.
+
+* Removed exception handling from ``_render``. ``clean()`` and ``linkify()`` may
+  now throw.
+
+* ``linkify()`` correctly ignores case for protocols and domain names.
+
+* ``linkify()`` correctly handles markup within an <a> tag.
 
 
 Version 1.1.5
@@ -217,61 +351,75 @@ Version 1.1.4
 Version 1.1.3 (July 10, 2012)
 -----------------------------
 
-- Fix parsing bare URLs when parse_email=True.
+* Fix parsing bare URLs when parse_email=True.
 
 
 Version 1.1.2 (June 1, 2012)
 ----------------------------
 
-- Fix hang in style attribute sanitizer. (#61)
-- Allow '/' in style attribute values.
+* Fix hang in style attribute sanitizer. (#61)
+
+* Allow ``/`` in style attribute values.
 
 
 Version 1.1.1 (February 17, 2012)
 ---------------------------------
 
-- Fix tokenizer for html5lib 0.9.5.
+* Fix tokenizer for html5lib 0.9.5.
 
 
 Version 1.1.0 (October 24, 2011)
 --------------------------------
 
-- linkify() now understands port numbers. (#38)
-- Documented character encoding behavior. (#41)
-- Add an optional target argument to linkify().
-- Add delinkify() method. (#45)
-- Support subdomain whitelist for delinkify(). (#47, #48)
+* ``linkify()`` now understands port numbers. (#38)
+
+* Documented character encoding behavior. (#41)
+
+* Add an optional target argument to ``linkify()``.
+
+* Add ``delinkify()`` method. (#45)
+
+* Support subdomain whitelist for ``delinkify()``. (#47, #48)
 
 
 Version 1.0.4 (September 2, 2011)
 ---------------------------------
 
-- Switch to SemVer git tags.
-- Make linkify() smarter about trailing punctuation. (#30)
-- Pass exc_info to logger during rendering issues.
-- Add wildcard key for attributes. (#19)
-- Make linkify() use the HTMLSanitizer tokenizer. (#36)
-- Fix URLs wrapped in parentheses. (#23)
-- Make linkify() UTF-8 safe. (#33)
+* Switch to SemVer git tags.
+
+* Make ``linkify()`` smarter about trailing punctuation. (#30)
+
+* Pass ``exc_info`` to logger during rendering issues.
+
+* Add wildcard key for attributes. (#19)
+
+* Make ``linkify()`` use the ``HTMLSanitizer`` tokenizer. (#36)
+
+* Fix URLs wrapped in parentheses. (#23)
+
+* Make ``linkify()`` UTF-8 safe. (#33)
 
 
 Version 1.0.3 (June 14, 2011)
 -----------------------------
 
-- linkify() works with 3rd level domains. (#24)
-- clean() supports vendor prefixes in style values. (#31, #32)
-- Fix linkify() email escaping.
+* ``linkify()`` works with 3rd level domains. (#24)
+
+* ``clean()`` supports vendor prefixes in style values. (#31, #32)
+
+* Fix ``linkify()`` email escaping.
 
 
 Version 1.0.2 (June 6, 2011)
 ----------------------------
 
-- linkify() supports email addresses.
-- clean() supports callables in attributes filter.
+* ``linkify()`` supports email addresses.
+
+* ``clean()`` supports callables in attributes filter.
 
 
 Version 1.0.1 (April 12, 2011)
 ------------------------------
 
-- linkify() doesn't drop trailing slashes. (#21)
-- linkify() won't linkify 'libgl.so.1'. (#22)
+* ``linkify()`` doesn't drop trailing slashes. (#21)
+* ``linkify()`` won't linkify 'libgl.so.1'. (#22)
diff --git a/CODE_OF_CONDUCT.rst b/CODE_OF_CONDUCT.rst
new file mode 100644
index 0000000..da20d8d
--- /dev/null
+++ b/CODE_OF_CONDUCT.rst
@@ -0,0 +1,9 @@
+Code of conduct
+===============
+
+This project and repository is governed by Mozilla's code of conduct and
+etiquette guidelines. For more details please see the `Mozilla Community
+Participation Guidelines
+<https://www.mozilla.org/about/governance/policies/participation/>`_ and
+`Developer Etiquette Guidelines
+<https://bugzilla.mozilla.org/page.cgi?id=etiquette.html>`_.
diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index 4c90ae5..9427624 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -1,10 +1,12 @@
 Bleach was originally written and maintained by James Socol and various
 contributors within and without the Mozilla Corporation and Foundation.
-It is currently maintained by Jannis Leidel and Will Kahn-Greene.
+
+It is currently maintained by Will Kahn-Greene and Greg Guthe.
 
 Maintainers:
 
 - Will Kahn-Greene <willkg at mozilla.com>
+- Greg Guthe <gguthe at mozilla.com>
 
 Maintainer emeritus:
 
@@ -18,6 +20,7 @@ Contributors:
 - Alek
 - Alexandre Macabies
 - Alexandr N. Zamaraev
+- Alex Defsen
 - Alex Ehlke
 - Alireza Savand
 - Andreas Malecki
@@ -28,11 +31,14 @@ Contributors:
 - Erik Rose
 - Gaurav Dadhania
 - Geoffrey Sneddon
+- Greg Guthe
 - Istvan Albert
 - Jaime Irurzun
 - James Socol
 - Jannis Leidel
+- Janusz Kamieński
 - Jeff Balogh
+- Jonathan Vanasco
 - Lee, Cheon-il
 - Les Orchard
 - Lorenz Schori
@@ -48,8 +54,10 @@ Contributors:
 - Ricky Rosario
 - Ryan Niemeyer
 - Sébastien Fievet
+- sedrubal
 - Tim Dumol
 - Timothy Fitz
 - Vitaly Volkov
 - Will Kahn-Greene
+- Zoltán
 - zyegfryed
diff --git a/MANIFEST.in b/MANIFEST.in
index d8329f6..1ae68e2 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,6 +1,7 @@
 include CHANGES
 include CONTRIBUTORS
 include CONTRIBUTING.rst
+include CODE_OF_CONDUCT.rst
 include requirements.txt
 include tox.ini
 include LICENSE
@@ -12,3 +13,5 @@ include docs/Makefile
 recursive-include docs *.rst
 
 recursive-include tests *.py *.test *.out
+
+recursive-include tests_website *.html *.py *.rst
diff --git a/README.rst b/README.rst
index 08dd886..9668789 100644
--- a/README.rst
+++ b/README.rst
@@ -8,7 +8,7 @@ Bleach
 .. image:: https://badge.fury.io/py/bleach.svg
    :target: http://badge.fury.io/py/bleach
 
-Bleach is a allowed-list-based HTML sanitizing library that escapes or strips
+Bleach is an allowed-list-based HTML sanitizing library that escapes or strips
 markup and attributes.
 
 Bleach can also linkify text safely, applying filters that Django's ``urlize``
@@ -62,14 +62,6 @@ Or with ``easy_install``::
 
     $ easy_install bleach
 
-Or by cloning the repo from GitHub_::
-
-    $ git clone git://github.com/mozilla/bleach.git
-
-Then install it by running::
-
-    $ python setup.py install
-
 
 Upgrading Bleach
 ================
@@ -97,6 +89,32 @@ The simplest way to use Bleach is:
     u'an <a href="http://example.com" rel="nofollow">http://example.com</a> url
 
 
+Security
+========
+
+Bleach is a security-related library.
+
+We have a responsible security vulnerability reporting process. Please use
+that if you're reporting a security issue.
+
+Security issues are fixed in private. After we land such a fix, we'll do a
+release.
+
+For every release, we mark security issues we've fixed in the ``CHANGES`` in
+the **Security fixes** section. We include relevant CVE links.
+
+
+Code of conduct
+===============
+
+This project and repository is governed by Mozilla's code of conduct and
+etiquette guidelines. For more details please see the `Mozilla Community
+Participation Guidelines
+<https://www.mozilla.org/about/governance/policies/participation/>`_ and
+`Developer Etiquette Guidelines
+<https://bugzilla.mozilla.org/page.cgi?id=etiquette.html>`_.
+
+
 .. _html5lib: https://github.com/html5lib/html5lib-python
 .. _GitHub: https://github.com/mozilla/bleach
 .. _ReadTheDocs: https://bleach.readthedocs.io/
diff --git a/bleach/__init__.py b/bleach/__init__.py
index c9a7fe4..6ebdc20 100644
--- a/bleach/__init__.py
+++ b/bleach/__init__.py
@@ -2,20 +2,42 @@
 
 from __future__ import unicode_literals
 
+import warnings
+from pkg_resources import parse_version
+
 from bleach.linkifier import (
     DEFAULT_CALLBACKS,
     Linker,
-    LinkifyFilter,
 )
 from bleach.sanitizer import (
     ALLOWED_ATTRIBUTES,
     ALLOWED_PROTOCOLS,
     ALLOWED_STYLES,
     ALLOWED_TAGS,
-    BleachSanitizerFilter,
     Cleaner,
 )
-from bleach.version import __version__, VERSION # flake8: noqa
+
+
+import html5lib
+try:
+    _html5lib_version = html5lib.__version__.split('.')
+    if len(_html5lib_version) < 2:
+        _html5lib_version = _html5lib_version + ['0']
+except Exception:
+    _html5lib_version = ['unknown', 'unknown']
+
+
+# Bleach 3.0.0 won't support html5lib-python < 1.0.0.
+if _html5lib_version < ['1', '0'] or 'b' in _html5lib_version[1]:
+    warnings.warn('Support for html5lib-python < 1.0.0 is deprecated.', DeprecationWarning)
+
+
+# yyyymmdd
+__releasedate__ = '20171207'
+# x.y.z or x.y.z.dev0 -- semver
+__version__ = '2.1.2'
+VERSION = parse_version(__version__)
+
 
 __all__ = ['clean', 'linkify']
 
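
The html5lib check added to bleach/__init__.py above compares lists of version-string components rather than parsed version objects. A minimal sketch of that logic as a standalone function may make the behaviour easier to see. The name html5lib_is_deprecated is hypothetical and not part of Bleach, which performs the check inline at import time and emits a DeprecationWarning instead of returning a value.

```python
def html5lib_is_deprecated(version_string):
    """Mirror the check in the hunk above for a given html5lib version string.

    Hypothetical helper for illustration only; not part of Bleach.
    """
    parts = version_string.split('.')
    if len(parts) < 2:
        # Pad a bare major version such as '1' to ['1', '0'].
        parts = parts + ['0']

    # Lists of strings compare element by element, and each element is
    # compared lexicographically as text, not numerically.
    # Beta releases such as '1.0b10' are caught by the 'b' test.
    return parts < ['1', '0'] or 'b' in parts[1]


if __name__ == '__main__':
    for version in ('0.999999999', '1.0b10', '1.0.1'):
        print(version, html5lib_is_deprecated(version))
```

Because the components are compared as strings, this is only as precise as the original check; it is shown here to illustrate the hunk, not as a general version comparison.
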
diff --git a/bleach/callbacks.py b/bleach/callbacks.py
index d2ba101..99d56b8 100644
--- a/bleach/callbacks.py
+++ b/bleach/callbacks.py
@@ -4,7 +4,11 @@ from __future__ import unicode_literals
 
 def nofollow(attrs, new=False):
     href_key = (None, u'href')
-    if href_key not in attrs or attrs[href_key].startswith(u'mailto:'):
+
+    if href_key not in attrs:
+        return attrs
+
+    if attrs[href_key].startswith(u'mailto:'):
         return attrs
 
     rel_key = (None, u'rel')
@@ -18,6 +22,10 @@ def nofollow(attrs, new=False):
 
 def target_blank(attrs, new=False):
     href_key = (None, u'href')
+
+    if href_key not in attrs:
+        return attrs
+
     if attrs[href_key].startswith(u'mailto:'):
         return attrs
 
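
Both callbacks above now return early when a tag has no href attribute. A callback written against the same attrs convention, where keys are (namespace, name) tuples, could apply the same two guards. The sketch below assumes that convention; add_title is a hypothetical callback and not part of Bleach.

```python
def add_title(attrs, new=False):
    """Hypothetical linkify callback that sets a title on non-mailto links."""
    href_key = (None, u'href')

    # Guard 1: tags without an href are passed through untouched, which
    # avoids the KeyError that indexing attrs[href_key] would raise.
    if href_key not in attrs:
        return attrs

    # Guard 2: leave mailto links alone, as nofollow and target_blank do.
    if attrs[href_key].startswith(u'mailto:'):
        return attrs

    attrs[(None, u'title')] = u'external link'
    return attrs
```
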
diff --git a/bleach/encoding.py b/bleach/encoding.py
deleted file mode 100644
index 707adaa..0000000
--- a/bleach/encoding.py
+++ /dev/null
@@ -1,62 +0,0 @@
-import datetime
-from decimal import Decimal
-import types
-import six
-
-
-def is_protected_type(obj):
-    """Determine if the object instance is of a protected type.
-
-    Objects of protected types are preserved as-is when passed to
-    force_unicode(strings_only=True).
-    """
-    return isinstance(obj, (
-        six.integer_types +
-        (types.NoneType,
-         datetime.datetime, datetime.date, datetime.time,
-         float, Decimal))
-    )
-
-
-def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'):
-    """
-    Similar to smart_text, except that lazy instances are resolved to
-    strings, rather than kept as lazy objects.
-
-    If strings_only is True, don't convert (some) non-string-like objects.
-    """
-    # Handle the common case first, saves 30-40% when s is an instance of
-    # six.text_type. This function gets called often in that setting.
-    if isinstance(s, six.text_type):
-        return s
-    if strings_only and is_protected_type(s):
-        return s
-    try:
-        if not isinstance(s, six.string_types):
-            if hasattr(s, '__unicode__'):
-                s = s.__unicode__()
-            else:
-                if six.PY3:
-                    if isinstance(s, bytes):
-                        s = six.text_type(s, encoding, errors)
-                    else:
-                        s = six.text_type(s)
-                else:
-                    s = six.text_type(bytes(s), encoding, errors)
-        else:
-            # Note: We use .decode() here, instead of six.text_type(s,
-            # encoding, errors), so that if s is a SafeBytes, it ends up being
-            # a SafeText at the end.
-            s = s.decode(encoding, errors)
-    except UnicodeDecodeError as e:
-        if not isinstance(s, Exception):
-            raise UnicodeDecodeError(*e.args)
-        else:
-            # If we get to here, the caller has passed in an Exception
-            # subclass populated with non-ASCII bytestring data without a
-            # working unicode method. Try to handle this without raising a
-            # further exception by individually forcing the exception args
-            # to unicode.
-            s = ' '.join([force_unicode(arg, encoding, strings_only,
-                          errors) for arg in s])
-    return s
diff --git a/bleach/linkifier.py b/bleach/linkifier.py
index fc346c3..849443c 100644
--- a/bleach/linkifier.py
+++ b/bleach/linkifier.py
@@ -1,5 +1,6 @@
 from __future__ import unicode_literals
 import re
+import six
 
 import html5lib
 from html5lib.filters.base import Filter
@@ -7,8 +8,7 @@ from html5lib.filters.sanitizer import allowed_protocols
 from html5lib.serializer import HTMLSerializer
 
 from bleach import callbacks as linkify_callbacks
-from bleach.encoding import force_unicode
-from bleach.utils import alphabetize_attributes
+from bleach.utils import alphabetize_attributes, force_unicode
 
 
 #: List of default callbacks
@@ -134,7 +134,12 @@ class Linker(object):
 
         :returns: linkified text as unicode
 
+        :raises TypeError: if ``text`` is not a text type
+
         """
+        if not isinstance(text, six.string_types):
+            raise TypeError('argument must be of text type')
+
         text = force_unicode(text)
 
         if not text:
@@ -344,7 +349,17 @@ class LinkifyFilter(Filter):
 
     def handle_links(self, src_iter):
         """Handle links in character tokens"""
+        in_a = False  # True while inside an <a> tag, e.g. one created for an email when parse_email=True
         for token in src_iter:
+            if in_a:
+                if token['type'] == 'EndTag' and token['name'] == 'a':
+                    in_a = False
+                yield token
+                continue
+            elif token['type'] == 'StartTag' and token['name'] == 'a':
+                in_a = True
+                yield token
+                continue
             if token['type'] == 'Characters':
                 text = token['data']
                 new_tokens = []
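
The in_a flag added to handle_links above passes every token between an opening and a closing a tag through unchanged, so URLs inside an existing link are not linkified a second time. A sketch of just that bookkeeping follows, assuming tokens are dicts with 'type' and, for tags, 'name' keys as in the hunk. The function mark_tokens_inside_links is hypothetical and does no linkification itself.

```python
def mark_tokens_inside_links(tokens):
    """Yield (token, inside_link) pairs for a stream of html5lib-style tokens.

    Hypothetical helper: it reproduces only the in_a state tracking from the
    hunk above. Tokens flagged True are the ones handle_links would yield
    unchanged.
    """
    in_a = False
    for token in tokens:
        if in_a:
            # The closing tag itself is still passed through unchanged.
            if token['type'] == 'EndTag' and token['name'] == 'a':
                in_a = False
            yield token, True
            continue
        elif token['type'] == 'StartTag' and token['name'] == 'a':
            in_a = True
            yield token, True
            continue
        # Outside any link: this is where Characters tokens get examined.
        yield token, False
```
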
diff --git a/bleach/sanitizer.py b/bleach/sanitizer.py
index 539711a..81df765 100644
--- a/bleach/sanitizer.py
+++ b/bleach/sanitizer.py
@@ -1,15 +1,34 @@
 from __future__ import unicode_literals
+from itertools import chain
 import re
+import string
+
+import six
 from xml.sax.saxutils import unescape
 
 import html5lib
-from html5lib.constants import namespaces
+from html5lib.constants import (
+    entities,
+    namespaces,
+    prefixes,
+    tokenTypes,
+)
+try:
+    from html5lib.constants import ReparseException
+except ImportError:
+    # html5lib-python 1.0 changed the name
+    from html5lib.constants import _ReparseException as ReparseException
+from html5lib.filters.base import Filter
 from html5lib.filters import sanitizer
 from html5lib.serializer import HTMLSerializer
+from html5lib._tokenizer import HTMLTokenizer
+from html5lib._trie import Trie
+
+from bleach.utils import alphabetize_attributes, force_unicode
 
-from bleach.encoding import force_unicode
-from bleach.utils import alphabetize_attributes
 
+#: Trie of html entity string -> character representation
+ENTITIES_TRIE = Trie(entities)
 
 #: List of allowed tags
 ALLOWED_TAGS = [
@@ -44,6 +63,52 @@ ALLOWED_STYLES = []
 ALLOWED_PROTOCOLS = ['http', 'https', 'mailto']
 
 
+AMP_SPLIT_RE = re.compile('(&)')
+
+#: Invisible characters--0 to and including 31 except 9 (tab), 10 (lf), and 13 (cr)
+INVISIBLE_CHARACTERS = ''.join([chr(c) for c in chain(range(0, 9), range(11, 13), range(14, 32))])
+
+#: Regexp for characters that are invisible
+INVISIBLE_CHARACTERS_RE = re.compile(
+    '[' + INVISIBLE_CHARACTERS + ']',
+    re.UNICODE
+)
+
+#: String to replace invisible characters with. This can be a character, a
+#: string, or even a function that takes a Python re matchobj
+INVISIBLE_REPLACEMENT_CHAR = '?'
+
+
+class BleachHTMLTokenizer(HTMLTokenizer):
+    def consumeEntity(self, allowedChar=None, fromAttribute=False):
+        # We don't want to consume and convert entities, so this overrides the
+        # html5lib tokenizer's consumeEntity so that it's now a no-op.
+        #
+        # However, when that gets called, it's consumed an &, so we put that in
+        # the stream.
+        if fromAttribute:
+            self.currentToken['data'][-1][1] += '&'
+
+        else:
+            self.tokenQueue.append({"type": tokenTypes['Characters'], "data": '&'})
+
+
+class BleachHTMLParser(html5lib.HTMLParser):
+    def _parse(self, stream, innerHTML=False, container="div", scripting=False, **kwargs):
+        # Override HTMLParser so we can swap out the tokenizer for our own.
+        self.innerHTMLMode = innerHTML
+        self.container = container
+        self.scripting = scripting
+        self.tokenizer = BleachHTMLTokenizer(stream, parser=self, **kwargs)
+        self.reset()
+
+        try:
+            self.mainLoop()
+        except ReparseException:
+            self.reset()
+            self.mainLoop()
+
+
 class Cleaner(object):
     """Cleaner for cleaning HTML fragments of malicious content
 
@@ -104,11 +169,16 @@ class Cleaner(object):
         self.strip_comments = strip_comments
         self.filters = filters or []
 
-        self.parser = html5lib.HTMLParser(namespaceHTMLElements=False)
+        self.parser = BleachHTMLParser(namespaceHTMLElements=False)
         self.walker = html5lib.getTreeWalker('etree')
-        self.serializer = HTMLSerializer(
+        self.serializer = BleachHTMLSerializer(
             quote_attr_values='always',
             omit_optional_tags=False,
+            escape_lt_in_attrs=True,
+
+            # We want to leave entities as they are without escaping or
+            # resolving or expanding
+            resolve_entities=False,
 
             # Bleach has its own sanitizer, so don't use the html5lib one
             sanitize=False,
@@ -124,7 +194,14 @@ class Cleaner(object):
 
         :returns: sanitized text as unicode
 
+        :raises TypeError: if ``text`` is not a text type
+
         """
+        if not isinstance(text, six.string_types):
+            message = "argument cannot be of '{name}' type, must be of text type".format(
+                name=text.__class__.__name__)
+            raise TypeError(message)
+
         if not text:
             return u''
 
@@ -194,6 +271,79 @@ def attribute_filter_factory(attributes):
     raise ValueError('attributes needs to be a callable, a list or a dict')
 
 
+def match_entity(stream):
+    """Returns first entity in stream or None if no entity exists
+
+    Note: For Bleach purposes, entities must start with a "&" and end with
+    a ";".
+
+    :arg stream: the character stream
+
+    :returns: ``None`` or the entity string without "&" or ";"
+
+    """
+    # Nix the & at the beginning
+    if stream[0] != '&':
+        raise ValueError('Stream should begin with "&"')
+
+    stream = stream[1:]
+
+    stream = list(stream)
+    possible_entity = ''
+    end_characters = '<&=;' + string.whitespace
+
+    # Handle number entities
+    if stream and stream[0] == '#':
+        possible_entity = '#'
+        stream.pop(0)
+
+        if stream and stream[0] in ('x', 'X'):
+            allowed = '0123456789abcdefABCDEF'
+            possible_entity += stream.pop(0)
+        else:
+            allowed = '0123456789'
+
+        # FIXME(willkg): Do we want to make sure these are valid number
+        # entities? This doesn't do that currently.
+        while stream and stream[0] not in end_characters:
+            c = stream.pop(0)
+            if c not in allowed:
+                break
+            possible_entity += c
+
+        if possible_entity and stream and stream[0] == ';':
+            return possible_entity
+        return None
+
+    # Handle character entities
+    while stream and stream[0] not in end_characters:
+        c = stream.pop(0)
+        if not ENTITIES_TRIE.has_keys_with_prefix(possible_entity):
+            break
+        possible_entity += c
+
+    if possible_entity and stream and stream[0] == ';':
+        return possible_entity
+
+    return None
+
+
+def next_possible_entity(text):
+    """Takes a text and generates a list of possible entities
+
+    :arg text: the text to look at
+
+    :returns: generator where each part (except the first) starts with an
+        "&"
+
+    """
+    for i, part in enumerate(AMP_SPLIT_RE.split(text)):
+        if i == 0:
+            yield part
+        elif i % 2 == 0:
+            yield '&' + part
+
+
 class BleachSanitizerFilter(sanitizer.Filter):
     """html5lib Filter that sanitizes text
... 2115 lines suppressed ...
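
The sanitizer.py hunk above defines INVISIBLE_CHARACTERS_RE and INVISIBLE_REPLACEMENT_CHAR for the control-character fix listed under version 2.1 in CHANGES, but the code that applies them falls in the suppressed part of the diff. The sketch below rebuilds the constants for Python 3 and applies them with a plain re.sub call. The helper replace_invisible is hypothetical, and this usage is an assumption about how the pattern is meant to be used.

```python
from itertools import chain
import re

# Code points 0-31 except tab (9), LF (10), and CR (13), as in the hunk above.
INVISIBLE_CHARACTERS = ''.join(
    chr(c) for c in chain(range(0, 9), range(11, 13), range(14, 32))
)
INVISIBLE_CHARACTERS_RE = re.compile('[' + INVISIBLE_CHARACTERS + ']', re.UNICODE)
INVISIBLE_REPLACEMENT_CHAR = '?'


def replace_invisible(text):
    """Hypothetical helper: swap each invisible character for the replacement."""
    return INVISIBLE_CHARACTERS_RE.sub(INVISIBLE_REPLACEMENT_CHAR, text)


if __name__ == '__main__':
    # A backspace (0x08) hidden in the text becomes a visible '?'.
    print(replace_invisible('1\x08+1'))
```

Tab, line feed, and carriage return are left alone because they are excluded from the character class.
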

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/python-modules/packages/python-bleach.git


