[Python-modules-commits] [python-idna] 05/07: Import python-idna_2.6.orig.tar.gz
From: Ondřej Nový <onovy at debian.org>
Date: Mon Dec 11 20:52:40 UTC 2017
This is an automated email from the git hooks/post-receive script.
onovy pushed a commit to branch master
in repository python-idna.
commit e4e478f3e0c9345634e7efe010fdb9ef0b264d41
Author: Ondřej Nový <onovy at debian.org>
Date: Mon Dec 11 21:45:45 2017 +0100
Import python-idna_2.6.orig.tar.gz
---
HISTORY.rst | 14 +
PKG-INFO | 37 ++-
README.rst | 35 +++
idna.egg-info/PKG-INFO | 37 ++-
idna.egg-info/SOURCES.txt | 5 +-
idna.egg-info/pbr.json | 1 -
idna/__init__.py | 1 +
idna/idnadata.py | 3 +-
idna/package_data.py | 2 +
idna/uts46data.py | 3 +-
setup.py | 6 +-
tools/build-idnadata.py | 110 --------
tools/build-uts46data.py | 106 --------
tools/idna-data | 671 ++++++++++++++++++++++++++++++++++++++++++++++
14 files changed, 805 insertions(+), 226 deletions(-)
diff --git a/HISTORY.rst b/HISTORY.rst
index 23cc93d..95a0a38 100644
--- a/HISTORY.rst
+++ b/HISTORY.rst
@@ -3,6 +3,20 @@
History
-------
+2.6 (2017-08-08)
+++++++++++++++++
+
+- Allows generation of IDNA and UTS 46 table data for different
+ versions of Unicode, by deriving properties directly from
+ Unicode data.
+- Ability to generate RFC 5892/IANA-style table data
+- Diagnostic output of IDNA-related Unicode properties and
+ derived calculations for a given codepoint
+- Support for idna.__version__ to report version
+- Support for idna.idnadata.__version__ and
+ idna.uts46data.__version__ to report Unicode version of
+ underlying IDNA and UTS 46 data respectively.
+
2.5 (2017-03-07)
++++++++++++++++
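The 2.6 changelog above introduces version-reporting attributes. A minimal sketch of reading them from an installed copy of the package (the attribute names come from the diff further below; the printed values are only illustrative):

    import idna
    import idna.idnadata
    import idna.uts46data

    # idna.__version__ is re-exported from idna/package_data.py
    print(idna.__version__)            # e.g. '2.6'
    # The data modules report the Unicode version their tables were built from
    print(idna.idnadata.__version__)   # e.g. '6.3.0'
    print(idna.uts46data.__version__)  # e.g. '6.3.0'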
diff --git a/PKG-INFO b/PKG-INFO
index e1d7205..68e24ab 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,6 +1,6 @@
Metadata-Version: 1.1
Name: idna
-Version: 2.5
+Version: 2.6
Summary: Internationalized Domain Names in Applications (IDNA)
Home-page: https://github.com/kjd/idna
Author: Kim Davies
@@ -171,6 +171,41 @@ Description: Internationalized Domain Names in Applications (IDNA)
when the codepoint is illegal based on its positional context (i.e. it is CONTEXTO
or CONTEXTJ but the contextual requirements are not satisfied.)
+ Building and Diagnostics
+ ------------------------
+
+ The IDNA and UTS 46 functionality relies upon pre-calculated lookup tables for
+ performance. These tables are derived from computing against eligibility criteria
+ in the respective standards. These tables are computed using the command-line
+ script ``tools/idna-data``.
+
+ This tool will fetch relevant tables from the Unicode Consortium and perform the
+ required calculations to identify eligibility. It has three main modes:
+
+ * ``idna-data make-libdata``. Generates ``idnadata.py`` and ``uts46data.py``,
+  the pre-calculated lookup tables used for IDNA and UTS 46 conversions. Implementors
+ who wish to track this library against a different Unicode version may use this tool
+ to manually generate a different version of the ``idnadata.py`` and ``uts46data.py``
+ files.
+
+ * ``idna-data make-table``. Generate a table of the IDNA disposition
+ (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix B.1 of RFC
+ 5892 and the pre-computed tables published by `IANA <http://iana.org/>`_.
+
+ * ``idna-data U+0061``. Prints debugging output on the various properties
+  associated with an individual Unicode codepoint (in this case, U+0061) that are
+ used to assess the IDNA and UTS 46 status of a codepoint. This is helpful in debugging
+ or analysis.
+
+ The tool accepts a number of arguments, described using ``idna-data -h``. Most notably,
+ the ``--version`` argument allows the specification of the version of Unicode to use
+ in computing the table data. For example, ``idna-data --version 9.0.0 make-libdata``
+ will generate library data against Unicode 9.0.0.
+
+ Note that this script requires Python 3, but all generated library data will work
+ in Python 2.6+.
+
+
Testing
-------
diff --git a/README.rst b/README.rst
index 5f3ea8f..3a9b2b5 100644
--- a/README.rst
+++ b/README.rst
@@ -163,6 +163,41 @@ an illegal character in an IDN label (i.e. INVALID); and ``idna.InvalidCodepoint
when the codepoint is illegal based on its positional context (i.e. it is CONTEXTO
or CONTEXTJ but the contextual requirements are not satisfied.)
+Building and Diagnostics
+------------------------
+
+The IDNA and UTS 46 functionality relies upon pre-calculated lookup tables for
+performance. These tables are derived from computing against eligibility criteria
+in the respective standards. These tables are computed using the command-line
+script ``tools/idna-data``.
+
+This tool will fetch relevant tables from the Unicode Consortium and perform the
+required calculations to identify eligibility. It has three main modes:
+
+* ``idna-data make-libdata``. Generates ``idnadata.py`` and ``uts46data.py``,
+  the pre-calculated lookup tables used for IDNA and UTS 46 conversions. Implementors
+ who wish to track this library against a different Unicode version may use this tool
+ to manually generate a different version of the ``idnadata.py`` and ``uts46data.py``
+ files.
+
+* ``idna-data make-table``. Generate a table of the IDNA disposition
+ (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix B.1 of RFC
+ 5892 and the pre-computed tables published by `IANA <http://iana.org/>`_.
+
+* ``idna-data U+0061``. Prints debugging output on the various properties
+  associated with an individual Unicode codepoint (in this case, U+0061) that are
+ used to assess the IDNA and UTS 46 status of a codepoint. This is helpful in debugging
+ or analysis.
+
+The tool accepts a number of arguments, described using ``idna-data -h``. Most notably,
+the ``--version`` argument allows the specification of the version of Unicode to use
+in computing the table data. For example, ``idna-data --version 9.0.0 make-libdata``
+will generate library data against Unicode 9.0.0.
+
+Note that this script requires Python 3, but all generated library data will work
+in Python 2.6+.
+
+
Testing
-------
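The README section above documents the ``tools/idna-data`` command line. A sketch of driving it from Python, assuming a checkout of the repository, a Python 3 interpreter, and network access to unicode.org (the README states that ``make-libdata`` regenerates ``idnadata.py`` and ``uts46data.py`` for the requested Unicode version):

    import subprocess
    import sys

    # Regenerate the library tables against Unicode 9.0.0, mirroring the
    # README example; sys.executable is assumed to be a Python 3 interpreter.
    subprocess.run(
        [sys.executable, 'tools/idna-data', '--version', '9.0.0', 'make-libdata'],
        check=True,
    )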
diff --git a/idna.egg-info/PKG-INFO b/idna.egg-info/PKG-INFO
index e1d7205..68e24ab 100644
--- a/idna.egg-info/PKG-INFO
+++ b/idna.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
Metadata-Version: 1.1
Name: idna
-Version: 2.5
+Version: 2.6
Summary: Internationalized Domain Names in Applications (IDNA)
Home-page: https://github.com/kjd/idna
Author: Kim Davies
@@ -171,6 +171,41 @@ Description: Internationalized Domain Names in Applications (IDNA)
when the codepoint is illegal based on its positional context (i.e. it is CONTEXTO
or CONTEXTJ but the contextual requirements are not satisfied.)
+ Building and Diagnostics
+ ------------------------
+
+ The IDNA and UTS 46 functionality relies upon pre-calculated lookup tables for
+ performance. These tables are derived from computing against eligibility criteria
+ in the respective standards. These tables are computed using the command-line
+ script ``tools/idna-data``.
+
+ This tool will fetch relevant tables from the Unicode Consortium and perform the
+ required calculations to identify eligibility. It has three main modes:
+
+ * ``idna-data make-libdata``. Generates ``idnadata.py`` and ``uts46data.py``,
+  the pre-calculated lookup tables used for IDNA and UTS 46 conversions. Implementors
+ who wish to track this library against a different Unicode version may use this tool
+ to manually generate a different version of the ``idnadata.py`` and ``uts46data.py``
+ files.
+
+ * ``idna-data make-table``. Generate a table of the IDNA disposition
+ (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix B.1 of RFC
+ 5892 and the pre-computed tables published by `IANA <http://iana.org/>`_.
+
+ * ``idna-data U+0061``. Prints debugging output on the various properties
+  associated with an individual Unicode codepoint (in this case, U+0061) that are
+ used to assess the IDNA and UTS 46 status of a codepoint. This is helpful in debugging
+ or analysis.
+
+ The tool accepts a number of arguments, described using ``idna-data -h``. Most notably,
+ the ``--version`` argument allows the specification of the version of Unicode to use
+ in computing the table data. For example, ``idna-data --version 9.0.0 make-libdata``
+ will generate library data against Unicode 9.0.0.
+
+ Note that this script requires Python 3, but all generated library data will work
+ in Python 2.6+.
+
+
Testing
-------
diff --git a/idna.egg-info/SOURCES.txt b/idna.egg-info/SOURCES.txt
index 4aaa732..7976040 100644
--- a/idna.egg-info/SOURCES.txt
+++ b/idna.egg-info/SOURCES.txt
@@ -10,11 +10,11 @@ idna/compat.py
idna/core.py
idna/idnadata.py
idna/intranges.py
+idna/package_data.py
idna/uts46data.py
idna.egg-info/PKG-INFO
idna.egg-info/SOURCES.txt
idna.egg-info/dependency_links.txt
-idna.egg-info/pbr.json
idna.egg-info/top_level.txt
tests/IdnaTest.txt.gz
tests/__init__.py
@@ -23,6 +23,5 @@ tests/test_idna_codec.py
tests/test_idna_compat.py
tests/test_idna_uts46.py
tests/test_intranges.py
-tools/build-idnadata.py
-tools/build-uts46data.py
+tools/idna-data
tools/intranges.py
\ No newline at end of file
diff --git a/idna.egg-info/pbr.json b/idna.egg-info/pbr.json
deleted file mode 100644
index a5e6b0f..0000000
--- a/idna.egg-info/pbr.json
+++ /dev/null
@@ -1 +0,0 @@
-{"is_release": true, "git_version": "0088bfc"}
\ No newline at end of file
diff --git a/idna/__init__.py b/idna/__init__.py
index bb67a43..847bf93 100644
--- a/idna/__init__.py
+++ b/idna/__init__.py
@@ -1 +1,2 @@
+from .package_data import __version__
from .core import *
diff --git a/idna/idnadata.py b/idna/idnadata.py
index 2ff30fe..c48f1b5 100644
--- a/idna/idnadata.py
+++ b/idna/idnadata.py
@@ -1,5 +1,6 @@
-# This file is automatically generated by build-idnadata.py
+# This file is automatically generated by tools/idna-data
+__version__ = "6.3.0"
scripts = {
'Greek': (
0x37000000374,
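The regenerated ``idna/idnadata.py`` stores script ranges as packed integers such as ``0x37000000374``. The helpers in ``idna/intranges.py`` are not shown in this diff, so treat the layout as an assumption: the value looks like a start codepoint in the upper 32 bits and an end codepoint in the lower 32 bits, which a hypothetical helper could split as follows:

    def split_packed_range(value):
        # Assumed encoding: upper 32 bits = range start, lower 32 bits = range end.
        return value >> 32, value & 0xFFFFFFFF

    print(tuple(hex(x) for x in split_packed_range(0x37000000374)))
    # ('0x370', '0x374')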
diff --git a/idna/package_data.py b/idna/package_data.py
new file mode 100644
index 0000000..fc33139
--- /dev/null
+++ b/idna/package_data.py
@@ -0,0 +1,2 @@
+__version__ = '2.6'
+
diff --git a/idna/uts46data.py b/idna/uts46data.py
index 48da840..f9b3236 100644
--- a/idna/uts46data.py
+++ b/idna/uts46data.py
@@ -1,9 +1,10 @@
-# This file is automatically generated by tools/build-uts46data.py
+# This file is automatically generated by tools/idna-data
# vim: set fileencoding=utf-8 :
"""IDNA Mapping Table from UTS46."""
+__version__ = "6.3.0"
def _seg_0():
return [
(0x0, '3'),
diff --git a/setup.py b/setup.py
index 4147ccb..2442f74 100644
--- a/setup.py
+++ b/setup.py
@@ -9,7 +9,6 @@ the "encodings.idna" module.
import io, sys
from setuptools import setup
-version = "2.5"
def main():
@@ -17,10 +16,13 @@ def main():
if python_version < (2,6):
raise SystemExit("Sorry, Python 2.6 or newer required")
+ package_data = {}
+ exec(open('idna/package_data.py').read(), package_data)
+
arguments = {
'name': 'idna',
'packages': ['idna'],
- 'version': version,
+ 'version': package_data['__version__'],
'description': 'Internationalized Domain Names in Applications (IDNA)',
'long_description': io.open("README.rst", encoding="UTF-8").read(),
'author': 'Kim Davies',
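The setup.py hunk above replaces the hard-coded version string with a read from ``idna/package_data.py``. A minimal standalone sketch of that single-source-of-version pattern, assuming it is run from the root of the source tree:

    # Execute idna/package_data.py into a throwaway namespace and pull
    # out __version__, much as setup.py now does.
    package_data = {}
    with open('idna/package_data.py') as f:
        exec(f.read(), package_data)

    print(package_data['__version__'])  # '2.6' for this release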
diff --git a/tools/build-idnadata.py b/tools/build-idnadata.py
deleted file mode 100755
index 3b6e24f..0000000
--- a/tools/build-idnadata.py
+++ /dev/null
@@ -1,110 +0,0 @@
-#!/usr/bin/env python
-
-from __future__ import print_function
-
-try:
- from urllib.request import urlopen
-except ImportError:
- from urllib2 import urlopen
-import xml.etree.ElementTree as etree
-
-from intranges import intranges_from_list
-
-UNICODE_VERSION = '6.3.0'
-
-SCRIPTS_URL = "http://www.unicode.org/Public/{version}/ucd/Scripts.txt"
-JOININGTYPES_URL = "http://www.unicode.org/Public/{version}/ucd/ArabicShaping.txt"
-IDNATABLES_URL = "http://www.iana.org/assignments/idna-tables-{version}/idna-tables-{version}.xml"
-IDNATABLES_NS = "http://www.iana.org/assignments"
-
-# These scripts are needed to compute IDNA contextual rules, see
-# https://www.iana.org/assignments/idna-tables-6.3.0#idna-tables-context
-
-SCRIPT_WHITELIST = sorted(['Greek', 'Han', 'Hebrew', 'Hiragana', 'Katakana'])
-
-
-def print_optimised_list(d):
- print("(")
- for value in intranges_from_list(d):
- print(" {},".format(hex(value)))
- print(" ),")
-
-
-def build_idnadata(version):
-
- print("# This file is automatically generated by build-idnadata.py\n")
-
- #
- # Script classifications are used by some CONTEXTO rules in RFC 5891
- #
- print("scripts = {")
- scripts = {}
- for line in urlopen(SCRIPTS_URL.format(version=version)).readlines():
- line = line.decode('utf-8')
- line = line.strip()
- if not line or line[0] == '#':
- continue
- if line.find('#'):
- line = line.split('#')[0]
- (codepoints, scriptname) = [x.strip() for x in line.split(';')]
- if not scriptname in scripts:
- scripts[scriptname] = set()
- if codepoints.find('..') > 0:
- (begin, end) = [int(x, 16) for x in codepoints.split('..')]
- for cp in range(begin, end+1):
- scripts[scriptname].add(cp)
- else:
- scripts[scriptname].add(int(codepoints, 16))
-
- for script in SCRIPT_WHITELIST:
- print(" '{0}':".format(script), end=' ')
- print_optimised_list(scripts[script])
-
- print("}")
-
- #
- # Joining types are used by CONTEXTJ rule A.1
- #
- print("joining_types = {")
- scripts = {}
- for line in urlopen(JOININGTYPES_URL.format(version=version)).readlines():
- line = line.decode('utf-8')
- line = line.strip()
- if not line or line[0] == '#':
- continue
- (codepoint, name, joiningtype, group) = [x.strip() for x in line.split(';')]
- print(" {0}: {1},".format(hex(int(codepoint, 16)), ord(joiningtype)))
- print("}")
-
- #
- # These are the classification of codepoints into PVALID, CONTEXTO, CONTEXTJ, etc.
- #
- print("codepoint_classes = {")
- classes = {}
-
- namespace = "{{{0}}}".format(IDNATABLES_NS)
- idntables_data = urlopen(IDNATABLES_URL.format(version=version)).read()
- root = etree.fromstring(idntables_data)
-
- for record in root.findall('{0}registry[@id="idna-tables-properties"]/{0}record'.format(namespace)):
- codepoint = record.find("{0}codepoint".format(namespace)).text
- prop = record.find("{0}property".format(namespace)).text
- if prop in ('UNASSIGNED', 'DISALLOWED'):
- continue
- if not prop in classes:
- classes[prop] = set()
- if codepoint.find('-') > 0:
- (begin, end) = [int(x, 16) for x in codepoint.split('-')]
- for cp in range(begin, end+1):
- classes[prop].add(cp)
- else:
- classes[prop].add(int(codepoint, 16))
-
- for prop in classes:
- print(" '{0}':".format(prop), end=' ')
- print_optimised_list(classes[prop])
-
- print("}")
-
-if __name__ == "__main__":
- build_idnadata(UNICODE_VERSION)
diff --git a/tools/build-uts46data.py b/tools/build-uts46data.py
deleted file mode 100755
index a53bb55..0000000
--- a/tools/build-uts46data.py
+++ /dev/null
@@ -1,106 +0,0 @@
-#!/usr/bin/env python
-
-"""Create a Python version of the IDNA Mapping Table from UTS46."""
-
-import re
-import sys
-
-# pylint: disable=unused-import,import-error,undefined-variable
-if sys.version_info[0] == 3:
- from urllib.request import urlopen
- unichr = chr
-else:
- from urllib2 import urlopen
-# pylint: enable=unused-import,import-error,undefined-variable
-
-UNICODE_VERSION = '6.3.0'
-SEGMENT_SIZE = 100
-
-DATA_URL = "http://www.unicode.org/Public/idna/{version}/IdnaMappingTable.txt"
-RE_CHAR_RANGE = re.compile(br"([0-9a-fA-F]{4,6})(?:\.\.([0-9a-fA-F]{4,6}))?$")
-STATUSES = {
- b"valid": ("V", False),
- b"ignored": ("I", False),
- b"mapped": ("M", True),
- b"deviation": ("D", True),
- b"disallowed": ("X", False),
- b"disallowed_STD3_valid": ("3", False),
- b"disallowed_STD3_mapped": ("3", True)
-}
-
-
-def parse_idna_mapping_table(inputstream):
- """Parse IdnaMappingTable.txt and return a list of tuples."""
- ranges = []
- last_code = -1
- last = (None, None)
- for line in inputstream:
- line = line.strip()
- if b"#" in line:
- line = line.split(b"#", 1)[0]
- if not line:
- continue
- fields = [field.strip() for field in line.split(b";")]
- char_range = RE_CHAR_RANGE.match(fields[0])
- if not char_range:
- raise ValueError(
- "Invalid character or range {!r}".format(fields[0]))
- start = int(char_range.group(1), 16)
- if start != last_code + 1:
- raise ValueError(
- "Code point {!r} is not continguous".format(fields[0]))
- if char_range.lastindex == 2:
- last_code = int(char_range.group(2), 16)
- else:
- last_code = start
- status, mapping = STATUSES[fields[1]]
- if mapping:
- mapping = (u"".join(unichr(int(codepoint, 16))
- for codepoint in fields[2].split()).
- replace("\\", "\\\\").replace("'", "\\'"))
- else:
- mapping = None
- if start > 255 and (status, mapping) == last:
- continue
- last = (status, mapping)
- while True:
- if mapping is not None:
- ranges.append(u"(0x{0:X}, '{1}', u'{2}')".format(
- start, status, mapping))
- else:
- ranges.append(u"(0x{0:X}, '{1}')".format(start, status))
- start += 1
- if start > 255 or start > last_code:
- break
- return ranges
-
-
-def build_uts46data(version):
- """Fetch the mapping table, parse it, and rewrite idna/uts46data.py."""
- ranges = parse_idna_mapping_table(urlopen(DATA_URL.format(version=version)))
- with open("idna/uts46data.py", "wb") as outputstream:
- outputstream.write(b'''\
-# This file is automatically generated by tools/build-uts46data.py
-# vim: set fileencoding=utf-8 :
-
-"""IDNA Mapping Table from UTS46."""
-
-
-''')
- for idx, row in enumerate(ranges):
- if idx % SEGMENT_SIZE == 0:
- if idx!=0:
- outputstream.write(b" ]\n\n")
- outputstream.write(u"def _seg_{0}():\n return [\n".format(idx/SEGMENT_SIZE).encode("utf8"))
- outputstream.write(u" {0},\n".format(row).encode("utf8"))
- outputstream.write(b" ]\n\n")
- outputstream.write(b"uts46data = tuple(\n")
-
- outputstream.write(b" _seg_0()\n")
- for i in xrange(1, (len(ranges)-1)/SEGMENT_SIZE+1):
- outputstream.write(u" + _seg_{0}()\n".format(i).encode("utf8"))
- outputstream.write(b")\n")
-
-
-if __name__ == "__main__":
- build_uts46data(UNICODE_VERSION)
diff --git a/tools/idna-data b/tools/idna-data
new file mode 100755
index 0000000..0da5aa9
--- /dev/null
+++ b/tools/idna-data
@@ -0,0 +1,671 @@
+#!/usr/bin/env python3
+
+import argparse, collections, datetime, os, re, sys, unicodedata
+from urllib.request import urlopen
+from intranges import intranges_from_list
+
+if sys.version_info[0] < 3:
+ print("Only Python 3 supported.")
+ sys.exit(2)
+
+# PREFERRED_VERSION = 'latest' # https://github.com/kjd/idna/issues/8
+PREFERRED_VERSION = '6.3.0'
+UCD_URL = 'http://www.unicode.org/Public/{version}/ucd/{filename}'
+UTS46_URL = 'http://www.unicode.org/Public/idna/{version}/{filename}'
+
+DEFAULT_CACHE_DIR = '~/.cache/unidata'
+
+# Scripts affected by IDNA contextual rules
+SCRIPT_WHITELIST = sorted(['Greek', 'Han', 'Hebrew', 'Hiragana', 'Katakana'])
+
+# Used to piece apart UTS#46 data for Jython compatibility
+UTS46_SEGMENT_SIZE = 100
+
+UTS46_STATUSES = {
+ "valid": ("V", False),
+ "ignored": ("I", False),
+ "mapped": ("M", True),
+ "deviation": ("D", True),
+ "disallowed": ("X", False),
+ "disallowed_STD3_valid": ("3", False),
+ "disallowed_STD3_mapped": ("3", True)
+}
+
+# Exceptions are manually assigned in Section 2.6 of RFC 5892.
+exceptions = {
+ 0x00DF: 'PVALID', # LATIN SMALL LETTER SHARP S
+ 0x03C2: 'PVALID', # GREEK SMALL LETTER FINAL SIGMA
+ 0x06FD: 'PVALID', # ARABIC SIGN SINDHI AMPERSAND
+ 0x06FE: 'PVALID', # ARABIC SIGN SINDHI POSTPOSITION MEN
+ 0x0F0B: 'PVALID', # TIBETAN MARK INTERSYLLABIC TSHEG
+ 0x3007: 'PVALID', # IDEOGRAPHIC NUMBER ZERO
+ 0x00B7: 'CONTEXTO', # MIDDLE DOT
+ 0x0375: 'CONTEXTO', # GREEK LOWER NUMERAL SIGN (KERAIA)
+ 0x05F3: 'CONTEXTO', # HEBREW PUNCTUATION GERESH
+ 0x05F4: 'CONTEXTO', # HEBREW PUNCTUATION GERSHAYIM
+ 0x30FB: 'CONTEXTO', # KATAKANA MIDDLE DOT
+ 0x0660: 'CONTEXTO', # ARABIC-INDIC DIGIT ZERO
+ 0x0661: 'CONTEXTO', # ARABIC-INDIC DIGIT ONE
+ 0x0662: 'CONTEXTO', # ARABIC-INDIC DIGIT TWO
+ 0x0663: 'CONTEXTO', # ARABIC-INDIC DIGIT THREE
+ 0x0664: 'CONTEXTO', # ARABIC-INDIC DIGIT FOUR
+ 0x0665: 'CONTEXTO', # ARABIC-INDIC DIGIT FIVE
+ 0x0666: 'CONTEXTO', # ARABIC-INDIC DIGIT SIX
+ 0x0667: 'CONTEXTO', # ARABIC-INDIC DIGIT SEVEN
+ 0x0668: 'CONTEXTO', # ARABIC-INDIC DIGIT EIGHT
+ 0x0669: 'CONTEXTO', # ARABIC-INDIC DIGIT NINE
+ 0x06F0: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT ZERO
+ 0x06F1: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT ONE
+ 0x06F2: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT TWO
+ 0x06F3: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT THREE
+ 0x06F4: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT FOUR
+ 0x06F5: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT FIVE
+ 0x06F6: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT SIX
+ 0x06F7: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT SEVEN
+ 0x06F8: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT EIGHT
+ 0x06F9: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT NINE
+ 0x0640: 'DISALLOWED', # ARABIC TATWEEL
+ 0x07FA: 'DISALLOWED', # NKO LAJANYALAN
+ 0x302E: 'DISALLOWED', # HANGUL SINGLE DOT TONE MARK
+ 0x302F: 'DISALLOWED', # HANGUL DOUBLE DOT TONE MARK
+ 0x3031: 'DISALLOWED', # VERTICAL KANA REPEAT MARK
+ 0x3032: 'DISALLOWED', # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
+ 0x3033: 'DISALLOWED', # VERTICAL KANA REPEAT MARK UPPER HALF
+ 0x3034: 'DISALLOWED', # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA
+ 0x3035: 'DISALLOWED', # VERTICAL KANA REPEAT MARK LOWER HALF
+ 0x303B: 'DISALLOWED', # VERTICAL IDEOGRAPHIC ITERATION MARK
+}
+backwardscompatible = {}
+
+
+def hexrange(start, end):
+ return range(int(start, 16), int(end, 16) + 1)
+
+def hexvalue(value):
+ return int(value, 16)
+
+
+class UnicodeVersion(object):
+
+ def __init__(self, version):
+ result = re.match('^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$', version)
+ if result:
+ self.major = int(result.group('major'))
+ self.minor = int(result.group('minor'))
+ self.patch = int(result.group('patch'))
+ self.numerical = (self.major << 8) + (self.minor << 4) + self.patch
+ self.latest = False
+ elif version == 'latest':
+ self.latest = True
+ else:
+ raise ValueError('Unrecognized Unicode version')
+
+ def __repr__(self, with_date=True):
+ if self.latest:
+ if with_date:
+ return 'latest@{}'.format(datetime.datetime.now().strftime('%Y-%m-%d'))
+ else:
+ return 'latest'
+ else:
+ return "{}.{}.{}".format(self.major, self.minor, self.patch)
+
+ @property
+ def tag(self):
+ return self.__repr__(with_date=False)
+
+ def __gt__(self, other):
+ if self.latest:
+ return True
+ return self.numerical > other.numerical
+
+ def __eq__(self, other):
+ if self.latest:
+ return False
+ return self.numerical == other.numerical
+
+
+class UnicodeData(object):
+
+ def __init__(self, version, cache, args):
+ self.version = UnicodeVersion(version)
+ self.system_version = UnicodeVersion(unicodedata.unidata_version)
+ self.source = args.source
+ self.cache = cache
+ self.max = 0
+
+ if self.system_version < self.version:
+ print("Warning: Character stability not guaranteed as Python Unicode data {}"
+ " older than requested {}".format(self.system_version, self.version))
+
+ self._load_unicodedata()
+ self._load_proplist()
+ self._load_derivedcoreprops()
+ self._load_blocks()
+ self._load_casefolding()
+ self._load_hangulst()
+ self._load_arabicshaping()
+ self._load_scripts()
+ self._load_uts46mapping()
+
+ def _load_unicodedata(self):
+
+ f_ud = self._ucdfile('UnicodeData.txt')
+ self.ucd_data = {}
+ range_begin = None
+ for line in f_ud.splitlines():
+ fields = line.split(';')
+ value = int(fields[0], 16)
+ start_marker = re.match('^<(?P<name>.*?), First>$', fields[1])
+ end_marker = re.match('^<(?P<name>.*?), Last>$', fields[1])
+ if start_marker:
+ range_begin = value
+ elif end_marker:
+ for i in range(range_begin, value+1):
+ fields[1] = '<{}>'.format(end_marker.group('name'))
+ self.ucd_data[i] = fields[1:]
+ range_begin = None
+ else:
+ self.ucd_data[value] = fields[1:]
+
+ def _load_proplist(self):
+
+ f_pl = self._ucdfile('PropList.txt')
+ self.ucd_props = collections.defaultdict(list)
+ for line in f_pl.splitlines():
+ result = re.match(
+ '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<prop>\S+)\s*(|\#.*)$',
+ line)
+ if result:
+ if result.group('end'):
+ for i in hexrange(result.group('start'), result.group('end')):
+ self.ucd_props[i].append(result.group('prop'))
+ else:
+ i = hexvalue(result.group('start'))
+ self.ucd_props[i].append(result.group('prop'))
+
+ def _load_derivedcoreprops(self):
+
+ f_dcp = self._ucdfile('DerivedCoreProperties.txt')
+ for line in f_dcp.splitlines():
+ result = re.match(
+ '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<prop>\S+)\s*(|\#.*)$',
+ line)
+ if result:
+ if result.group('end'):
+ for i in hexrange(result.group('start'), result.group('end')):
+ self.ucd_props[i].append(result.group('prop'))
+ else:
+ i = hexvalue(result.group('start'))
+ self.ucd_props[i].append(result.group('prop'))
+
+ def _load_blocks(self):
+
+ self.ucd_block = {}
+ f_b = self._ucdfile('Blocks.txt')
+ for line in f_b.splitlines():
+ result = re.match(
+ '^(?P<start>[0-9A-F]{4,6})\.\.(?P<end>[0-9A-F]{4,6})\s*;\s*(?P<block>.*)\s*$',
+ line)
+ if result:
+ for i in hexrange(result.group('start'), result.group('end')):
+ self.ucd_block[i] = result.group('block')
+ self.max = max(self.max, i)
+
+ def _load_casefolding(self):
+
+ self.ucd_cf = {}
+ f_cf = self._ucdfile('CaseFolding.txt')
+ for line in f_cf.splitlines():
+ result = re.match(
+ '^(?P<cp>[0-9A-F]{4,6})\s*;\s*(?P<type>\S+)\s*;\s*(?P<subst>[0-9A-F\s]+)\s*',
+ line)
+ if result:
+ if result.group('type') in ('C', 'F'):
+ self.ucd_cf[int(result.group('cp'), 16)] = \
+ ''.join([chr(int(x, 16)) for x in result.group('subst').split(' ')])
+
+ def _load_hangulst(self):
+
+ self.ucd_hst = {}
+ f_hst = self._ucdfile('HangulSyllableType.txt')
+ for line in f_hst.splitlines():
+ result = re.match(
+ '^(?P<start>[0-9A-F]{4,6})\.\.(?P<end>[0-9A-F]{4,6})\s*;\s*(?P<type>\S+)\s*(|\#.*)$',
+ line)
+ if result:
+ for i in hexrange(result.group('start'), result.group('end')):
+ self.ucd_hst[i] = result.group('type')
+
+ def _load_arabicshaping(self):
+
+ self.ucd_as = {}
+ f_as = self._ucdfile('ArabicShaping.txt')
+ for line in f_as.splitlines():
+ result = re.match('^(?P<cp>[0-9A-F]{4,6})\s*;\s*.*?\s*;\s*(?P<jt>\S+)\s*;', line)
+ if result:
+ self.ucd_as[int(result.group('cp'), 16)] = result.group('jt')
+
+ def _load_scripts(self):
+
+ self.ucd_s = {}
+ f_s = self._ucdfile('Scripts.txt')
+ for line in f_s.splitlines():
+ result = re.match(
+ '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<script>\S+)\s*(|\#.*)$',
+ line)
+ if result:
+ if not result.group('script') in self.ucd_s:
+ self.ucd_s[result.group('script')] = set()
+ if result.group('end'):
+ for i in hexrange(result.group('start'), result.group('end')):
+ self.ucd_s[result.group('script')].add(i)
+ else:
+ i = hexvalue(result.group('start'))
+ self.ucd_s[result.group('script')].add(i)
+
+ def _load_uts46mapping(self):
+
+ self.ucd_idnamt = {}
+ f_idnamt = self._ucdfile('IdnaMappingTable.txt', urlbase=UTS46_URL)
+ for line in f_idnamt.splitlines():
+ result = re.match(
+ '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<fields>[^#]+)',
+ line)
+ if result:
+ fields = [x.strip() for x in result.group('fields').split(';')]
+ if result.group('end'):
+ for i in hexrange(result.group('start'), result.group('end')):
+ self.ucd_idnamt[i] = fields
+ else:
+ i = hexvalue(result.group('start'))
+ self.ucd_idnamt[i] = fields
+
+ def _ucdfile(self, filename, urlbase=UCD_URL):
+ if self.source:
+ f = open("{}/{}".format(self.source, filename))
+ return f.read()
+ else:
+ cache_file = None
+ if self.cache:
+ cache_file = os.path.expanduser("{}/{}/{}".format(
+ self.cache, self.version.tag, filename))
+ if os.path.isfile(cache_file):
+ f = open(cache_file)
+ return f.read()
+
+ version_path = self.version.tag
+ if version_path == 'latest':
+ version_path = 'UCD/latest'
+ url = urlbase.format(
+ version=version_path,
+ filename=filename,
+ )
+ content = urlopen(url).read()
+
+ if cache_file:
+ if not os.path.isdir(os.path.dirname(cache_file)):
+ os.makedirs(os.path.dirname(cache_file))
+ f = open(cache_file, 'wb')
+ f.write(content)
+ f.close()
+
+ return str(content)
+
+ def codepoints(self):
+ for i in range(0, self.max + 1):
+ yield CodePoint(i, ucdata=self)
+
+
+class CodePoint:
+
+ def __init__(self, value=None, ucdata=None):
+ self.value = value
+ self.ucdata = ucdata
+
+ def _casefold(self, s):
+ r = ''
+ for c in s:
+ r += self.ucdata.ucd_cf.get(ord(c), c)
+ return r
+
+ @property
+ def exception_value(self):
+ return exceptions.get(self.value, False)
+
+ @property
+ def compat_value(self):
+ return backwardscompatible.get(self.value, False)
+
+ @property
+ def name(self):
+ if self.value in self.ucdata.ucd_data:
+ return self.ucdata.ucd_data[self.value][0]
+ elif 'Noncharacter_Code_Point' in self.ucdata.ucd_props[self.value]:
+ return '<noncharacter>'
+ else:
+ return '<reserved>'
+
+ @property
+ def general_category(self):
+ return self.ucdata.ucd_data.get(self.value, [None, None])[1]
+
+ @property
+ def unassigned(self):
+ return not ('Noncharacter_Code_Point' in self.ucdata.ucd_props[self.value] or \
+ self.value in self.ucdata.ucd_data)
+
+ @property
+ def ldh(self):
+ if self.value == 0x002d or \
+ self.value in range(0x0030, 0x0039+1) or \
+ self.value in range(0x0061, 0x007a+1):
+ return True
+ return False
+
+ @property
+ def join_control(self):
+ return 'Join_Control' in self.ucdata.ucd_props[self.value]
+
+ @property
+ def joining_type(self):
+ return self.ucdata.ucd_as.get(self.value, None)
+
+ @property
+ def char(self):
+ return chr(self.value)
+
+ @property
+ def nfkc_cf(self):
+ return unicodedata.normalize('NFKC',
+ self._casefold(unicodedata.normalize('NFKC', self.char)))
+
+ @property
+ def unstable(self):
+ return self.char != self.nfkc_cf
+
+ @property
+ def in_ignorableproperties(self):
+ for prop in ['Default_Ignorable_Code_Point', 'White_Space', 'Noncharacter_Code_Point']:
+ if prop in self.ucdata.ucd_props[self.value]:
+ return True
+ return False
+
+ @property
+ def in_ignorableblocks(self):
+ return self.ucdata.ucd_block.get(self.value) in (
+ 'Combining Diacritical Marks for Symbols', 'Musical Symbols',
+ 'Ancient Greek Musical Notation'
+ )
+
+ @property
+ def oldhanguljamo(self):
+ return self.ucdata.ucd_hst.get(self.value) in ('L', 'V', 'T')
+
+ @property
+ def in_lettersdigits(self):
+ return self.general_category in ('Ll', 'Lu', 'Lo', 'Nd', 'Lm', 'Mn', 'Mc')
+
+ @property
+ def idna2008_status(self):
+ if self.exception_value:
+ return self.exception_value
+ elif self.compat_value:
+ return self.compat_value
+ elif self.unassigned:
+ return 'UNASSIGNED'
+ elif self.ldh:
+ return 'PVALID'
+ elif self.join_control:
+ return 'CONTEXTJ'
+ elif self.unstable:
+ return 'DISALLOWED'
+ elif self.in_ignorableproperties:
+ return 'DISALLOWED'
+ elif self.in_ignorableblocks:
+ return 'DISALLOWED'
+ elif self.oldhanguljamo:
+ return 'DISALLOWED'
+ elif self.in_lettersdigits:
+ return 'PVALID'
+ else:
+ return 'DISALLOWED'
+
+ @property
+ def uts46_data(self):
+ return self.ucdata.ucd_idnamt.get(self.value, None)
+
+ @property
+ def uts46_status(self):
+ return ' '.join(self.uts46_data)
+
+
+def diagnose_codepoint(codepoint, args, ucdata):
+
+ cp = CodePoint(codepoint, ucdata=ucdata)
+
+ print("U+{:04X}:".format(codepoint))
+ print(" Name: {}".format(cp.name))
+ print("1 Exceptions: {}".format(exceptions.get(codepoint, False)))
+ print("2 Backwards Compat: {}".format(backwardscompatible.get(codepoint, False)))
+ print("3 Unassigned: {}".format(cp.unassigned))
+ print("4 LDH: {}".format(cp.ldh))
+ print(" Properties: {}".format(" ".join(sorted(ucdata.ucd_props.get(codepoint, ['None'])))))
+ print("5 .Join Control: {}".format(cp.join_control))
+ print(" NFKC CF: {}".format(" ".join(["U+{:04X}".format(ord(x)) for x in cp.nfkc_cf])))
+ print("6 .Unstable: {}".format(cp.unstable))
+ print("7 .Ignorable Prop: {}".format(cp.in_ignorableproperties))
+ print(" Block: {}".format(ucdata.ucd_block.get(codepoint, None)))
+ print("8 .Ignorable Block: {}".format(cp.in_ignorableblocks))
+ print(" Hangul Syll Type: {}".format(ucdata.ucd_hst.get(codepoint, None)))
+ print("9 .Old Hangul Jamo: {}".format(cp.oldhanguljamo))
+ print(" General Category: {}".format(cp.general_category))
... 210 lines suppressed ...
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/python-modules/packages/python-idna.git