[Python-apps-team] Bug#737498: [PATCH RFC] patch: when importing from email, RFC2047-decode From/Subject headers
Matt Mackall
mpm at selenic.com
Thu Mar 3 18:49:22 UTC 2016
On Thu, 2016-03-03 at 18:55 +0100, Julien Cristau wrote:
> # HG changeset patch
> # User Julien Cristau <julien.cristau at logilab.fr>
> # Date 1457026459 -3600
> # Thu Mar 03 18:34:19 2016 +0100
> # Node ID 6c153cbad4a032861417dbba9d1d90332964ab5f
> # Parent 549ff28a345f595cad7e06fb08c2ac6973e2f030
> patch: when importing from email, RFC2047-decode From/Subject headers
>
> I'm not too sure about the Subject part: it should be possible to use
> the charset information from the email (RFC2047 encoding and the
> Content-Type header), but mercurial seems to use its own encoding
> instead (in the test, that means the commit message ends up as "????"
> if the import is done without --encoding utf-8). Advice welcome.
>
> Reported at https://bugs.debian.org/737498
You should probably immediately relay such reports upstream.
> diff --git a/mercurial/patch.py b/mercurial/patch.py
> --- a/mercurial/patch.py
> +++ b/mercurial/patch.py
> @@ -201,19 +201,28 @@ def extract(ui, fileobj):
>      # (this heuristic is borrowed from quilt)
>      diffre = re.compile(r'^(?:Index:[ \t]|diff[ \t]|RCS file: |'
>                          r'retrieving revision [0-9]+(\.[0-9]+)*$|'
>                          r'---[ \t].*?^\+\+\+[ \t]|'
>                          r'\*\*\*[ \t].*?^---[ \t])', re.MULTILINE|re.DOTALL)
> +    def decode_header(header):
FYI, names with underbars are against our coding convention;
contrib/check-commit ought to warn about this.
> +        if header is None:
> +            return None
> +        parts = []
> +        for part, charset in email.Header.decode_header(header):
> +            if charset is None:
> +                charset = 'ascii'
This will almost certainly explode on some emails. We should probably do
something like this:
- attempt to decode using the charset declared in the header
- attempt to decode with UTF-8
- assume Latin-1 (not ascii)
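
A minimal sketch of that fallback order, for illustration only. It assumes
Python 3 and the standard email.header module, whereas the patch under review
targets Python 2's email.Header. The names decodepart and decodeheader are
hypothetical, and the sketch returns plain text rather than converting to
Mercurial's local encoding.

```python
from email.header import decode_header


def decodepart(part, charset):
    """Decode one header fragment, trying progressively looser encodings.

    Hypothetical helper, not part of Mercurial.
    """
    if isinstance(part, str):
        # Headers without RFC 2047 encoded words come back as text already.
        return part
    candidates = []
    if charset:
        candidates.append(charset)  # 1. charset declared in the header
    candidates.append('utf-8')      # 2. UTF-8
    for name in candidates:
        try:
            return part.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next candidate.
            continue
    # 3. Latin-1 maps every byte to a code point, so this cannot fail.
    return part.decode('latin-1')


def decodeheader(header):
    """Return the header as text, or None if the header is missing."""
    if header is None:
        return None
    return ''.join(decodepart(part, charset)
                   for part, charset in decode_header(header))
```

Catching LookupError as well as UnicodeDecodeError covers headers that name a
charset Python does not know, which is one of the ways the 'ascii' default
could otherwise blow up on real mail.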
> +            parts.append(part.decode(charset))
> +        return encoding.tolocal(u' '.join(parts).encode('utf-8'))
Using Unicode objects outside of encoding.py is strongly discouraged. If you
must, it'd be great to unambiguously mark them all with a leading u on the
variable name. This helper isn't a good fit for encoding.py, since it uses a
third encoding besides UTF-8 and local; it probably belongs in mail.py.
--
Mathematics is the supreme nostalgia of our time.