[Python-modules-commits] [python-bleach] 01/05: Imported Upstream version 2.1.2
Scott Kitterman
kitterman at moszumanska.debian.org
Wed Jan 10 04:39:33 UTC 2018
This is an automated email from the git hooks/post-receive script.
kitterman pushed a commit to branch debian/master
in repository python-bleach.
commit 0bf95a330ef15873da77462572f2fb78f5b3e633
Author: Scott Kitterman <scott at kitterman.com>
Date: Tue Jan 9 23:25:10 2018 -0500
Imported Upstream version 2.1.2
---
.gitignore | 1 +
CHANGES | 262 ++++++++++++++----
CODE_OF_CONDUCT.rst | 9 +
CONTRIBUTORS | 10 +-
MANIFEST.in | 3 +
README.rst | 36 ++-
bleach/__init__.py | 28 +-
bleach/callbacks.py | 10 +-
bleach/encoding.py | 62 -----
bleach/linkifier.py | 19 +-
bleach/sanitizer.py | 317 ++++++++++++++++++++-
bleach/utils.py | 21 ++
bleach/version.py | 6 -
docs/clean.rst | 36 ++-
docs/conf.py | 14 +-
docs/dev.rst | 34 ++-
docs/goals.rst | 76 +++++-
setup.py | 27 +-
tests/data/13.test.out | 2 +-
tests/data/14.test.out | 2 +-
tests/data/15.test.out | 2 +-
tests/data/16.test.out | 2 +-
tests/data/17.test.out | 2 +-
tests/data/18.test.out | 2 +-
tests/data/19.test.out | 3 +-
tests/data/20.test | 1 +
tests/data/20.test.out | 1 +
tests/test_basics.py | 365 -------------------------
tests/test_callbacks.py | 63 +++++
tests/test_clean.py | 454 +++++++++++++++++++++++++++++++
tests/{test_links.py => test_linkify.py} | 33 ++-
tests/test_security.py | 42 ++-
tests_website/.gitignore | 1 +
tests_website/README.rst | 28 ++
tests_website/data_to_json.py | 53 ++++
tests_website/index.html | 149 ++++++++++
tests_website/open_test_page.py | 36 +++
tests_website/server.py | 52 ++++
tox.ini | 61 ++++-
39 files changed, 1753 insertions(+), 572 deletions(-)
diff --git a/.gitignore b/.gitignore
index f5adb54..26bbdf8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,4 @@ build
docs/_build/
.cache/
.eggs/
+.*env*/
diff --git a/CHANGES b/CHANGES
index 7caa99f..47bf390 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,9 +1,108 @@
Bleach Changes
==============
+Version 2.1.2 (December 7th, 2017)
+----------------------------------
+
+**Security fixes**
+
+None
+
+**Backwards incompatible changes**
+
+None
+
+**Features**
+
+None
+
+**Bug fixes**
+
+* Support html5lib-python 1.0.1. (#337)
+
+* Add deprecation warning for supporting html5lib-python < 1.0.
+
+* Switch to semver.
+
+
+Version 2.1.1 (October 2nd, 2017)
+---------------------------------
+
+**Security fixes**
+
+None
+
+**Backwards incompatible changes**
+
+None
+
+**Features**
+
+None
+
+**Bug fixes**
+
+* Fix ``setup.py`` opening files when ``LANG=``. (#324)
+
+
+Version 2.1 (September 28th, 2017)
+----------------------------------
+
+**Security fixes**
+
+* Convert control characters (backspace particularly) to "?" preventing
+ malicious copy-and-paste situations. (#298)
+
+ See `<https://github.com/mozilla/bleach/issues/298>`_ for more details.
+
+ This affects all previous versions of Bleach. Check the comments on that
+ issue for ways to alleviate the issue if you can't upgrade to Bleach 2.1.
+
+
+**Backwards incompatible changes**
+
+* Redid versioning. ``bleach.VERSION`` is no longer available. Use the string
+ version at ``bleach.__version__`` and parse it with
+ ``pkg_resources.parse_version``. (#307)
+
+* clean, linkify: linkify and clean should only accept text types; thank you,
+ Janusz! (#292)
+
+* clean, linkify: accept only unicode or utf-8-encoded str (#176)
+
+
+**Features**
+
+
+**Bug fixes**
+
+* ``bleach.clean()`` no longer unescapes entities including ones that are missing
+ a ``;`` at the end which can happen in urls and other places. (#143)
+
+* linkify: fix http links inside of mailto links; thank you, sedrubal! (#300)
+
+* clarify security policy in docs (#303)
+
+* fix dependency specification for html5lib 1.0b8, 1.0b9, and 1.0b10; thank you,
+ Zoltán! (#268)
+
+* add Bleach vs. html5lib comparison to README; thank you, Stu Cox! (#278)
+
+* fix KeyError exceptions on tags without href attr; thank you, Alex Defsen!
+ (#273)
+
+* add test website and scripts to test ``bleach.clean()`` output in browser;
+ thank you, Greg Guthe!
+
+
Version 2.0 (March 8th, 2017)
-----------------------------
+**Security fixes**
+
+* None
+
+
**Backwards incompatible changes**
* Removed support for Python 2.6. #206
@@ -15,6 +114,11 @@ Version 2.0 (March 8th, 2017)
This version is a rewrite to use the new sanitizing API since the old
one was dropped in html5lib 0.99999999 (8 9s).
+ If you're using 0.9999999 (7 9s) upgrade to 0.99999999 (8 9s) or higher.
+
+ If you're using 1.0b8 (equivalent to 0.9999999 (7 9s)), upgrade to 1.0b9
+ (equivalent to 0.99999999 (8 9s)) or higher.
+
* ``bleach.clean`` and friends were rewritten
``clean`` was reimplemented as an html5lib filter and happens at a different
@@ -105,9 +209,13 @@ Version 2.0 (March 8th, 2017)
Version 1.5 (November 4th, 2016)
--------------------------------
+**Security fixes**
+
+* None
+
**Backwards incompatible changes**
-- clean: The list of ``ALLOWED_PROTOCOLS`` now defaults to http, https and
+* clean: The list of ``ALLOWED_PROTOCOLS`` now defaults to http, https and
mailto.
Previously it was a long list of protocols something like ed2k, ftp, http,
@@ -116,28 +224,40 @@ Version 1.5 (November 4th, 2016)
**Changes**
-- clean: Added ``protocols`` to arguments list to let you override the list of
+* clean: Added ``protocols`` to arguments list to let you override the list of
allowed protocols. Thank you, Andreas Malecki! #149
-- linkify: Fix a bug involving periods at the end of an email address. Thank you,
+
+* linkify: Fix a bug involving periods at the end of an email address. Thank you,
Lorenz Schori! #219
-- linkify: Fix linkification of non-ascii ports. Thank you Alexandre, Macabies!
+
+* linkify: Fix linkification of non-ascii ports. Thank you, Alexandre Macabies!
#207
-- linkify: Fix linkify inappropriately removing node tails when dropping nodes.
+
+* linkify: Fix linkify inappropriately removing node tails when dropping nodes.
#132
-- Fixed a test that failed periodically. #161
-- Switched from nose to py.test. #204
-- Add test matrix for all supported Python and html5lib versions. #230
-- Limit to html5lib ``>=0.999,!=0.9999,!=0.99999,<0.99999999`` because 0.9999
+
+* Fixed a test that failed periodically. #161
+
+* Switched from nose to py.test. #204
+
+* Add test matrix for all supported Python and html5lib versions. #230
+
+* Limit to html5lib ``>=0.999,!=0.9999,!=0.99999,<0.99999999`` because 0.9999
and 0.99999 are busted.
-- Add support for ``python setup.py test``. #97
+
+* Add support for ``python setup.py test``. #97
Version 1.4.3 (May 23rd, 2016)
------------------------------
+**Security fixes**
+
+* None
+
**Changes**
-- Limit to html5lib ``>=0.999,<0.99999999`` because of impending change to
+* Limit to html5lib ``>=0.999,<0.99999999`` because of impending change to
sanitizer api. #195
@@ -146,10 +266,13 @@ Version 1.4.2 (September 11, 2015)
**Changes**
-- linkify: Fix hang in linkify with ``parse_email=True``. #124
-- linkify: Fix crash in linkify when removing a link that is a first-child. #136
-- Updated TLDs.
-- linkify: Don't remove exterior brackets when linkifying. #146
+* linkify: Fix hang in linkify with ``parse_email=True``. #124
+
+* linkify: Fix crash in linkify when removing a link that is a first-child. #136
+
+* Updated TLDs.
+
+* linkify: Don't remove exterior brackets when linkifying. #146
Version 1.4.1 (December 15, 2014)
@@ -157,8 +280,9 @@ Version 1.4.1 (December 15, 2014)
**Changes**
-- Consistent order of attributes in output.
-- Python 3.4 support.
+* Consistent order of attributes in output.
+
+* Python 3.4 support.
Version 1.4 (January 12, 2014)
@@ -166,44 +290,54 @@ Version 1.4 (January 12, 2014)
**Changes**
-- linkify: Update linkify to use etree type Treewalker instead of simpletree.
-- Updated html5lib to version ``>=0.999``.
-- Update all code to be compatible with Python 3 and 2 using six.
-- Switch to Apache License.
+* linkify: Update linkify to use etree type Treewalker instead of simpletree.
+
+* Updated html5lib to version ``>=0.999``.
+
+* Update all code to be compatible with Python 3 and 2 using six.
+
+* Switch to Apache License.
Version 1.3
-----------
-- Used by Python 3-only fork.
+* Used by Python 3-only fork.
Version 1.2.2 (May 18, 2013)
----------------------------
-- Pin html5lib to version 0.95 for now due to major API break.
+* Pin html5lib to version 0.95 for now due to major API break.
+
Version 1.2.1 (February 19, 2013)
---------------------------------
-- clean() no longer considers ``feed:`` an acceptable protocol due to
+* ``clean()`` no longer considers ``feed:`` an acceptable protocol due to
inconsistencies in browser behavior.
Version 1.2 (January 28, 2013)
------------------------------
-- linkify() has changed considerably. Many keyword arguments have been
- replaced with a single callbacks list. Please see the documentation
- for more information.
-- Bleach will no longer consider unacceptable protocols when linkifying.
-- linkify() now takes a tokenizer argument that allows it to skip
+* ``linkify()`` has changed considerably. Many keyword arguments have been
+ replaced with a single callbacks list. Please see the documentation for more
+ information.
+
+* Bleach will no longer consider unacceptable protocols when linkifying.
+
+* ``linkify()`` now takes a tokenizer argument that allows it to skip
sanitization.
-- delinkify() is gone.
-- Removed exception handling from _render. clean() and linkify() may now
- throw.
-- linkify() correctly ignores case for protocols and domain names.
-- linkify() correctly handles markup within an <a> tag.
+
+* ``delinkify()`` is gone.
+
+* Removed exception handling from ``_render``. ``clean()`` and ``linkify()`` may
+ now throw.
+
+* ``linkify()`` correctly ignores case for protocols and domain names.
+
+* ``linkify()`` correctly handles markup within an <a> tag.
Version 1.1.5
@@ -217,61 +351,75 @@ Version 1.1.4
Version 1.1.3 (July 10, 2012)
-----------------------------
-- Fix parsing bare URLs when parse_email=True.
+* Fix parsing bare URLs when parse_email=True.
Version 1.1.2 (June 1, 2012)
----------------------------
-- Fix hang in style attribute sanitizer. (#61)
-- Allow '/' in style attribute values.
+* Fix hang in style attribute sanitizer. (#61)
+
+* Allow ``/`` in style attribute values.
Version 1.1.1 (February 17, 2012)
---------------------------------
-- Fix tokenizer for html5lib 0.9.5.
+* Fix tokenizer for html5lib 0.9.5.
Version 1.1.0 (October 24, 2011)
--------------------------------
-- linkify() now understands port numbers. (#38)
-- Documented character encoding behavior. (#41)
-- Add an optional target argument to linkify().
-- Add delinkify() method. (#45)
-- Support subdomain whitelist for delinkify(). (#47, #48)
+* ``linkify()`` now understands port numbers. (#38)
+
+* Documented character encoding behavior. (#41)
+
+* Add an optional target argument to ``linkify()``.
+
+* Add ``delinkify()`` method. (#45)
+
+* Support subdomain whitelist for ``delinkify()``. (#47, #48)
Version 1.0.4 (September 2, 2011)
---------------------------------
-- Switch to SemVer git tags.
-- Make linkify() smarter about trailing punctuation. (#30)
-- Pass exc_info to logger during rendering issues.
-- Add wildcard key for attributes. (#19)
-- Make linkify() use the HTMLSanitizer tokenizer. (#36)
-- Fix URLs wrapped in parentheses. (#23)
-- Make linkify() UTF-8 safe. (#33)
+* Switch to SemVer git tags.
+
+* Make ``linkify()`` smarter about trailing punctuation. (#30)
+
+* Pass ``exc_info`` to logger during rendering issues.
+
+* Add wildcard key for attributes. (#19)
+
+* Make ``linkify()`` use the ``HTMLSanitizer`` tokenizer. (#36)
+
+* Fix URLs wrapped in parentheses. (#23)
+
+* Make ``linkify()`` UTF-8 safe. (#33)
Version 1.0.3 (June 14, 2011)
-----------------------------
-- linkify() works with 3rd level domains. (#24)
-- clean() supports vendor prefixes in style values. (#31, #32)
-- Fix linkify() email escaping.
+* ``linkify()`` works with 3rd level domains. (#24)
+
+* ``clean()`` supports vendor prefixes in style values. (#31, #32)
+
+* Fix ``linkify()`` email escaping.
Version 1.0.2 (June 6, 2011)
----------------------------
-- linkify() supports email addresses.
-- clean() supports callables in attributes filter.
+* ``linkify()`` supports email addresses.
+
+* ``clean()`` supports callables in attributes filter.
Version 1.0.1 (April 12, 2011)
------------------------------
-- linkify() doesn't drop trailing slashes. (#21)
-- linkify() won't linkify 'libgl.so.1'. (#22)
+* ``linkify()`` doesn't drop trailing slashes. (#21)
+* ``linkify()`` won't linkify 'libgl.so.1'. (#22)
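The 2.1 security fix listed in the changelog above (control characters converted to "?") can be illustrated with a small standalone sketch. It mirrors the character range that the bleach/sanitizer.py hunk of this commit defines (code points 0 to 31, except tab, LF, and CR). The helper name replace_invisible is hypothetical and not part of Bleach; this is an illustration under those assumptions, not the library's implementation.

```python
import re
from itertools import chain

# Code points 0-31 except 9 (tab), 10 (LF), and 13 (CR), as in the diff.
INVISIBLE_CHARACTERS = ''.join(
    chr(c) for c in chain(range(0, 9), range(11, 13), range(14, 32))
)
INVISIBLE_CHARACTERS_RE = re.compile('[' + INVISIBLE_CHARACTERS + ']')


def replace_invisible(text, replacement='?'):
    """Hypothetical helper: swap each invisible character for `replacement`."""
    return INVISIBLE_CHARACTERS_RE.sub(replacement, text)


# A backspace (\x08) is replaced; tab and newline are left alone.
cleaned = replace_invisible('safe\x08text\twith\ntabs')
print(repr(cleaned))  # 'safe?text\twith\ntabs'
```

The replacement is applied per character, so a run of several control characters becomes a run of "?" of the same length.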
diff --git a/CODE_OF_CONDUCT.rst b/CODE_OF_CONDUCT.rst
new file mode 100644
index 0000000..da20d8d
--- /dev/null
+++ b/CODE_OF_CONDUCT.rst
@@ -0,0 +1,9 @@
+Code of conduct
+===============
+
+This project and repository is governed by Mozilla's code of conduct and
+etiquette guidelines. For more details please see the `Mozilla Community
+Participation Guidelines
+<https://www.mozilla.org/about/governance/policies/participation/>`_ and
+`Developer Etiquette Guidelines
+<https://bugzilla.mozilla.org/page.cgi?id=etiquette.html>`_.
diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index 4c90ae5..9427624 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -1,10 +1,12 @@
Bleach was originally written and maintained by James Socol and various
contributors within and without the Mozilla Corporation and Foundation.
-It is currently maintained by Jannis Leidel and Will Kahn-Greene.
+
+It is currently maintained by Will Kahn-Greene and Greg Guthe.
Maintainers:
- Will Kahn-Greene <willkg at mozilla.com>
+- Greg Guthe <gguthe at mozilla.com>
Maintainer emeritus:
@@ -18,6 +20,7 @@ Contributors:
- Alek
- Alexandre Macabies
- Alexandr N. Zamaraev
+- Alex Defsen
- Alex Ehlke
- Alireza Savand
- Andreas Malecki
@@ -28,11 +31,14 @@ Contributors:
- Erik Rose
- Gaurav Dadhania
- Geoffrey Sneddon
+- Greg Guthe
- Istvan Albert
- Jaime Irurzun
- James Socol
- Jannis Leidel
+- Janusz Kamieński
- Jeff Balogh
+- Jonathan Vanasco
- Lee, Cheon-il
- Les Orchard
- Lorenz Schori
@@ -48,8 +54,10 @@ Contributors:
- Ricky Rosario
- Ryan Niemeyer
- Sébastien Fievet
+- sedrubal
- Tim Dumol
- Timothy Fitz
- Vitaly Volkov
- Will Kahn-Greene
+- Zoltán
- zyegfryed
diff --git a/MANIFEST.in b/MANIFEST.in
index d8329f6..1ae68e2 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,6 +1,7 @@
include CHANGES
include CONTRIBUTORS
include CONTRIBUTING.rst
+include CODE_OF_CONDUCT.rst
include requirements.txt
include tox.ini
include LICENSE
@@ -12,3 +13,5 @@ include docs/Makefile
recursive-include docs *.rst
recursive-include tests *.py *.test *.out
+
+recursive-include tests_website *.html *.py *.rst
diff --git a/README.rst b/README.rst
index 08dd886..9668789 100644
--- a/README.rst
+++ b/README.rst
@@ -8,7 +8,7 @@ Bleach
.. image:: https://badge.fury.io/py/bleach.svg
:target: http://badge.fury.io/py/bleach
-Bleach is a allowed-list-based HTML sanitizing library that escapes or strips
+Bleach is an allowed-list-based HTML sanitizing library that escapes or strips
markup and attributes.
Bleach can also linkify text safely, applying filters that Django's ``urlize``
@@ -62,14 +62,6 @@ Or with ``easy_install``::
$ easy_install bleach
-Or by cloning the repo from GitHub_::
-
- $ git clone git://github.com/mozilla/bleach.git
-
-Then install it by running::
-
- $ python setup.py install
-
Upgrading Bleach
================
@@ -97,6 +89,32 @@ The simplest way to use Bleach is:
u'an <a href="http://example.com" rel="nofollow">http://example.com</a> url
+Security
+========
+
+Bleach is a security-related library.
+
+We have a responsible security vulnerability reporting process. Please use
+that if you're reporting a security issue.
+
+Security issues are fixed in private. After we land such a fix, we'll do a
+release.
+
+For every release, we mark security issues we've fixed in the ``CHANGES`` in
+the **Security fixes** section. We include relevant CVE links.
+
+
+Code of conduct
+===============
+
+This project and repository is governed by Mozilla's code of conduct and
+etiquette guidelines. For more details please see the `Mozilla Community
+Participation Guidelines
+<https://www.mozilla.org/about/governance/policies/participation/>`_ and
+`Developer Etiquette Guidelines
+<https://bugzilla.mozilla.org/page.cgi?id=etiquette.html>`_.
+
+
.. _html5lib: https://github.com/html5lib/html5lib-python
.. _GitHub: https://github.com/mozilla/bleach
.. _ReadTheDocs: https://bleach.readthedocs.io/
diff --git a/bleach/__init__.py b/bleach/__init__.py
index c9a7fe4..6ebdc20 100644
--- a/bleach/__init__.py
+++ b/bleach/__init__.py
@@ -2,20 +2,42 @@
from __future__ import unicode_literals
+import warnings
+from pkg_resources import parse_version
+
 from bleach.linkifier import (
     DEFAULT_CALLBACKS,
     Linker,
-    LinkifyFilter,
 )
 from bleach.sanitizer import (
     ALLOWED_ATTRIBUTES,
     ALLOWED_PROTOCOLS,
     ALLOWED_STYLES,
     ALLOWED_TAGS,
-    BleachSanitizerFilter,
     Cleaner,
 )
-from bleach.version import __version__, VERSION # flake8: noqa
+
+
+import html5lib
+try:
+    _html5lib_version = html5lib.__version__.split('.')
+    if len(_html5lib_version) < 2:
+        _html5lib_version = _html5lib_version + ['0']
+except Exception:
+    _html5lib_version = ['unknown', 'unknown']
+
+
+# Bleach 3.0.0 won't support html5lib-python < 1.0.0.
+if _html5lib_version < ['1', '0'] or 'b' in _html5lib_version[1]:
+    warnings.warn('Support for html5lib-python < 1.0.0 is deprecated.', DeprecationWarning)
+
+
+# yyyymmdd
+__releasedate__ = '20171207'
+# x.y.z or x.y.z.dev0 -- semver
+__version__ = '2.1.2'
+VERSION = parse_version(__version__)
+
__all__ = ['clean', 'linkify']
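The html5lib deprecation check in the bleach/__init__.py hunk above can be exercised in isolation. The sketch below wraps the same comparison in a function that takes a version string, so it runs without html5lib installed. The function name is_deprecated_html5lib is hypothetical and does not exist in Bleach.

```python
def is_deprecated_html5lib(version_string):
    """Hypothetical wrapper around the check shown in the diff."""
    parts = version_string.split('.')
    if len(parts) < 2:
        # Pad a bare major version such as "1" to ["1", "0"].
        parts = parts + ['0']
    # Lists of strings compare element by element, lexicographically.
    return parts < ['1', '0'] or 'b' in parts[1]


for version in ('0.999999999', '1.0b10', '1.0.1'):
    print(version, is_deprecated_html5lib(version))
```

Because the comparison works on lists of strings, it is lexicographic rather than numeric. Beta releases such as 1.0b10 do not compare as lower than 1.0 and are caught by the separate 'b' test instead.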
diff --git a/bleach/callbacks.py b/bleach/callbacks.py
index d2ba101..99d56b8 100644
--- a/bleach/callbacks.py
+++ b/bleach/callbacks.py
@@ -4,7 +4,11 @@ from __future__ import unicode_literals
 def nofollow(attrs, new=False):
     href_key = (None, u'href')
-    if href_key not in attrs or attrs[href_key].startswith(u'mailto:'):
+
+    if href_key not in attrs:
+        return attrs
+
+    if attrs[href_key].startswith(u'mailto:'):
         return attrs
     rel_key = (None, u'rel')
@@ -18,6 +22,10 @@ def nofollow(attrs, new=False):
 def target_blank(attrs, new=False):
     href_key = (None, u'href')
+
+    if href_key not in attrs:
+        return attrs
+
     if attrs[href_key].startswith(u'mailto:'):
         return attrs
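The guard added to both callbacks above addresses the changelog item about KeyError on tags without an href attribute (#273). A minimal sketch of a callback with that guard follows. The name nofollow_sketch is hypothetical, and the rel handling after the two guards is an assumption for illustration, since the diff does not show the rest of the function body.

```python
def nofollow_sketch(attrs, new=False):
    """Hypothetical callback: attrs maps (namespace, name) tuples to values."""
    href_key = (None, 'href')

    # Guard from the diff: anchors without href are returned unchanged
    # instead of raising KeyError on the lookup below.
    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith('mailto:'):
        return attrs

    # Assumed continuation: add "nofollow" to rel if it is not already there.
    rel_key = (None, 'rel')
    rel_values = [v for v in attrs.get(rel_key, '').split(' ') if v]
    if 'nofollow' not in [v.lower() for v in rel_values]:
        rel_values.append('nofollow')
    attrs[rel_key] = ' '.join(rel_values)
    return attrs


print(nofollow_sketch({(None, 'name'): 'top'}))
print(nofollow_sketch({(None, 'href'): 'http://example.com'}))
```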
diff --git a/bleach/encoding.py b/bleach/encoding.py
deleted file mode 100644
index 707adaa..0000000
--- a/bleach/encoding.py
+++ /dev/null
@@ -1,62 +0,0 @@
-import datetime
-from decimal import Decimal
-import types
-import six
-
-
-def is_protected_type(obj):
-    """Determine if the object instance is of a protected type.
-
-    Objects of protected types are preserved as-is when passed to
-    force_unicode(strings_only=True).
-    """
-    return isinstance(obj, (
-        six.integer_types +
-        (types.NoneType,
-         datetime.datetime, datetime.date, datetime.time,
-         float, Decimal))
-    )
-
-
-def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'):
-    """
-    Similar to smart_text, except that lazy instances are resolved to
-    strings, rather than kept as lazy objects.
-
-    If strings_only is True, don't convert (some) non-string-like objects.
-    """
-    # Handle the common case first, saves 30-40% when s is an instance of
-    # six.text_type. This function gets called often in that setting.
-    if isinstance(s, six.text_type):
-        return s
-    if strings_only and is_protected_type(s):
-        return s
-    try:
-        if not isinstance(s, six.string_types):
-            if hasattr(s, '__unicode__'):
-                s = s.__unicode__()
-            else:
-                if six.PY3:
-                    if isinstance(s, bytes):
-                        s = six.text_type(s, encoding, errors)
-                    else:
-                        s = six.text_type(s)
-                else:
-                    s = six.text_type(bytes(s), encoding, errors)
-        else:
-            # Note: We use .decode() here, instead of six.text_type(s,
-            # encoding, errors), so that if s is a SafeBytes, it ends up being
-            # a SafeText at the end.
-            s = s.decode(encoding, errors)
-    except UnicodeDecodeError as e:
-        if not isinstance(s, Exception):
-            raise UnicodeDecodeError(*e.args)
-        else:
-            # If we get to here, the caller has passed in an Exception
-            # subclass populated with non-ASCII bytestring data without a
-            # working unicode method. Try to handle this without raising a
-            # further exception by individually forcing the exception args
-            # to unicode.
-            s = ' '.join([force_unicode(arg, encoding, strings_only,
-                                        errors) for arg in s])
-    return s
diff --git a/bleach/linkifier.py b/bleach/linkifier.py
index fc346c3..849443c 100644
--- a/bleach/linkifier.py
+++ b/bleach/linkifier.py
@@ -1,5 +1,6 @@
from __future__ import unicode_literals
import re
+import six
import html5lib
from html5lib.filters.base import Filter
@@ -7,8 +8,7 @@ from html5lib.filters.sanitizer import allowed_protocols
from html5lib.serializer import HTMLSerializer
from bleach import callbacks as linkify_callbacks
-from bleach.encoding import force_unicode
-from bleach.utils import alphabetize_attributes
+from bleach.utils import alphabetize_attributes, force_unicode
#: List of default callbacks
@@ -134,7 +134,12 @@ class Linker(object):
         :returns: linkified text as unicode
+        :raises TypeError: if ``text`` is not a text type
+
         """
+        if not isinstance(text, six.string_types):
+            raise TypeError('argument must be of text type')
+
         text = force_unicode(text)
         if not text:
@@ -344,7 +349,17 @@ class LinkifyFilter(Filter):
     def handle_links(self, src_iter):
         """Handle links in character tokens"""
+        in_a = False  # happens, if parse_email=True and if a mail was found
         for token in src_iter:
+            if in_a:
+                if token['type'] == 'EndTag' and token['name'] == 'a':
+                    in_a = False
+                yield token
+                continue
+            elif token['type'] == 'StartTag' and token['name'] == 'a':
+                in_a = True
+                yield token
+                continue
             if token['type'] == 'Characters':
                 text = token['data']
                 new_tokens = []
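The in_a flag added to handle_links above passes every token between an opening and closing a tag through untouched, so text that is already inside a link is not linkified again. A standalone sketch of that pattern follows. The names pass_through_links and handle_text are hypothetical, and the tokens are plain dicts shaped like the ones in the diff rather than real html5lib tree-walker output.

```python
def pass_through_links(tokens, handle_text):
    """Yield tokens, applying handle_text only to text outside <a>...</a>."""
    in_a = False
    for token in tokens:
        if in_a:
            # Inside an existing link: emit as-is until the closing tag.
            if token['type'] == 'EndTag' and token['name'] == 'a':
                in_a = False
            yield token
            continue
        elif token['type'] == 'StartTag' and token['name'] == 'a':
            in_a = True
            yield token
            continue

        if token['type'] == 'Characters':
            for new_token in handle_text(token):
                yield new_token
        else:
            yield token


def shout(token):
    # Stand-in for the real link handling: upper-case the text.
    return [dict(token, data=token['data'].upper())]


tokens = [
    {'type': 'Characters', 'data': 'see '},
    {'type': 'StartTag', 'name': 'a', 'data': {}},
    {'type': 'Characters', 'data': 'http://example.com'},
    {'type': 'EndTag', 'name': 'a'},
    {'type': 'Characters', 'data': ' now'},
]
for token in pass_through_links(tokens, shout):
    print(token)
```

Only the first and last Characters tokens are transformed; the one inside the link is left alone.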
diff --git a/bleach/sanitizer.py b/bleach/sanitizer.py
index 539711a..81df765 100644
--- a/bleach/sanitizer.py
+++ b/bleach/sanitizer.py
@@ -1,15 +1,34 @@
from __future__ import unicode_literals
+from itertools import chain
import re
+import string
+
+import six
from xml.sax.saxutils import unescape
import html5lib
-from html5lib.constants import namespaces
+from html5lib.constants import (
+    entities,
+    namespaces,
+    prefixes,
+    tokenTypes,
+)
+try:
+    from html5lib.constants import ReparseException
+except ImportError:
+    # html5lib-python 1.0 changed the name
+    from html5lib.constants import _ReparseException as ReparseException
+from html5lib.filters.base import Filter
from html5lib.filters import sanitizer
from html5lib.serializer import HTMLSerializer
+from html5lib._tokenizer import HTMLTokenizer
+from html5lib._trie import Trie
+
+from bleach.utils import alphabetize_attributes, force_unicode
-from bleach.encoding import force_unicode
-from bleach.utils import alphabetize_attributes
+#: Trie of html entity string -> character representation
+ENTITIES_TRIE = Trie(entities)
#: List of allowed tags
ALLOWED_TAGS = [
@@ -44,6 +63,52 @@ ALLOWED_STYLES = []
ALLOWED_PROTOCOLS = ['http', 'https', 'mailto']
+AMP_SPLIT_RE = re.compile('(&)')
+
+#: Invisible characters--0 to and including 31 except 9 (tab), 10 (lf), and 13 (cr)
+INVISIBLE_CHARACTERS = ''.join([chr(c) for c in chain(range(0, 9), range(11, 13), range(14, 32))])
+
+#: Regexp for characters that are invisible
+INVISIBLE_CHARACTERS_RE = re.compile(
+    '[' + INVISIBLE_CHARACTERS + ']',
+    re.UNICODE
+)
+
+#: String to replace invisible characters with. This can be a character, a
+#: string, or even a function that takes a Python re matchobj
+INVISIBLE_REPLACEMENT_CHAR = '?'
+
+
+class BleachHTMLTokenizer(HTMLTokenizer):
+    def consumeEntity(self, allowedChar=None, fromAttribute=False):
+        # We don't want to consume and convert entities, so this overrides the
+        # html5lib tokenizer's consumeEntity so that it's now a no-op.
+        #
+        # However, when that gets called, it's consumed an &, so we put that in
+        # the stream.
+        if fromAttribute:
+            self.currentToken['data'][-1][1] += '&'
+
+        else:
+            self.tokenQueue.append({"type": tokenTypes['Characters'], "data": '&'})
+
+
+class BleachHTMLParser(html5lib.HTMLParser):
+    def _parse(self, stream, innerHTML=False, container="div", scripting=False, **kwargs):
+        # Override HTMLParser so we can swap out the tokenizer for our own.
+        self.innerHTMLMode = innerHTML
+        self.container = container
+        self.scripting = scripting
+        self.tokenizer = BleachHTMLTokenizer(stream, parser=self, **kwargs)
+        self.reset()
+
+        try:
+            self.mainLoop()
+        except ReparseException:
+            self.reset()
+            self.mainLoop()
+
+
 class Cleaner(object):
     """Cleaner for cleaning HTML fragments of malicious content
@@ -104,11 +169,16 @@ class Cleaner(object):
         self.strip_comments = strip_comments
         self.filters = filters or []
-        self.parser = html5lib.HTMLParser(namespaceHTMLElements=False)
+        self.parser = BleachHTMLParser(namespaceHTMLElements=False)
         self.walker = html5lib.getTreeWalker('etree')
-        self.serializer = HTMLSerializer(
+        self.serializer = BleachHTMLSerializer(
             quote_attr_values='always',
             omit_optional_tags=False,
+            escape_lt_in_attrs=True,
+
+            # We want to leave entities as they are without escaping or
+            # resolving or expanding
+            resolve_entities=False,
             # Bleach has its own sanitizer, so don't use the html5lib one
             sanitize=False,
@@ -124,7 +194,14 @@ class Cleaner(object):
         :returns: sanitized text as unicode
+        :raises TypeError: if ``text`` is not a text type
+
         """
+        if not isinstance(text, six.string_types):
+            message = "argument cannot be of '{name}' type, must be of text type".format(
+                name=text.__class__.__name__)
+            raise TypeError(message)
+
         if not text:
             return u''
@@ -194,6 +271,79 @@ def attribute_filter_factory(attributes):
     raise ValueError('attributes needs to be a callable, a list or a dict')
+def match_entity(stream):
+    """Returns first entity in stream or None if no entity exists
+
+    Note: For Bleach purposes, entities must start with a "&" and end with
+    a ";".
+
+    :arg stream: the character stream
+
+    :returns: ``None`` or the entity string without "&" or ";"
+
+    """
+    # Nix the & at the beginning
+    if stream[0] != '&':
+        raise ValueError('Stream should begin with "&"')
+
+    stream = stream[1:]
+
+    stream = list(stream)
+    possible_entity = ''
+    end_characters = '<&=;' + string.whitespace
+
+    # Handle number entities
+    if stream and stream[0] == '#':
+        possible_entity = '#'
+        stream.pop(0)
+
+        if stream and stream[0] in ('x', 'X'):
+            allowed = '0123456789abcdefABCDEF'
+            possible_entity += stream.pop(0)
+        else:
+            allowed = '0123456789'
+
+        # FIXME(willkg): Do we want to make sure these are valid number
+        # entities? This doesn't do that currently.
+        while stream and stream[0] not in end_characters:
+            c = stream.pop(0)
+            if c not in allowed:
+                break
+            possible_entity += c
+
+        if possible_entity and stream and stream[0] == ';':
+            return possible_entity
+        return None
+
+    # Handle character entities
+    while stream and stream[0] not in end_characters:
+        c = stream.pop(0)
+        if not ENTITIES_TRIE.has_keys_with_prefix(possible_entity):
+            break
+        possible_entity += c
+
+    if possible_entity and stream and stream[0] == ';':
+        return possible_entity
+
+    return None
+
+
+def next_possible_entity(text):
+    """Takes a text and generates a list of possible entities
+
+    :arg text: the text to look at
+
+    :returns: generator where each part (except the first) starts with an
+        "&"
+
+    """
+    for i, part in enumerate(AMP_SPLIT_RE.split(text)):
+        if i == 0:
+            yield part
+        elif i % 2 == 0:
+            yield '&' + part
+
+
 class BleachSanitizerFilter(sanitizer.Filter):
     """html5lib Filter that sanitizes text
... 2115 lines suppressed ...
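The next_possible_entity helper added in the bleach/sanitizer.py hunk above relies on re.split with a capturing group, which keeps each "&" as its own list item at the odd indexes. The sketch below copies the helper and AMP_SPLIT_RE from the diff so the splitting can be tried standalone. The sample input is made up for illustration.

```python
import re

# Capturing group: re.split keeps each "&" as a separate item.
AMP_SPLIT_RE = re.compile('(&)')


def next_possible_entity(text):
    """Yield text parts; every part after the first starts with "&"."""
    for i, part in enumerate(AMP_SPLIT_RE.split(text)):
        if i == 0:
            yield part
        elif i % 2 == 0:
            # Odd indexes hold the "&" separators; re-attach one to the
            # text that followed it.
            yield '&' + part


parts = list(next_possible_entity('a &amp; b &c'))
print(parts)  # ['a ', '&amp; b ', '&c']
```

Each part that starts with "&" is then a candidate for an entity check such as match_entity, which in the diff accepts only candidates that end with ";".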
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/python-modules/packages/python-bleach.git