Bug#843776: dpkg-buildpackage should set LC_COLLATE=C.UTF-8

Ian Jackson ijackson at chiark.greenend.org.uk
Wed Nov 9 13:43:21 UTC 2016


Package: dpkg-dev
Version: 1.18.12

According to POSIX, the meaning of glob patterns is unspecified in
locales other than `the POSIX locale'. [1]  It's not easy to see from
the spec, but the relevant env var is LC_COLLATE.

Many package build rules, dh rules, etc., rely on shell globbing.
This shell globbing needs to be predictable.

The output of a package build ought not to depend on the locale at
all, really.  (This is one of the things that the reproducible builds
people are trying to ensure.)  But we don't want to set LC_MESSAGES,
at least, because we want people to be able to debug builds in their
native language, as far as possible.

It is difficult to imagine a situation where honouring a user's
LC_COLLATE during a package build would be beneficial.

In practice, nonstandard LC_COLLATE values can break perfectly
sensible-looking build code.  For example, chiark-utils 5.0.0+exp1
FTBFS in current stretch when LC_COLLATE=fr_CH.UTF-8 because of this:
  $ touch 11 pp qq
  $ LC_COLLATE=fr_CH.UTF-8 bash -c 'echo [!A-Z]*[!~]'
  11
  $
(Interestingly, many of these FTBFS problems will be hidden if /bin/sh
is dash, because dash does not honour locales for globbing.  This is
clearly legal according to the spec, and probably a good decision.)
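
For comparison, here is a minimal sketch of the same glob run with the
collation order pinned to C.  The likely reason `pp' and `qq' vanish
above is that in the fr_CH.UTF-8 collation order lowercase letters sort
between A and Z, so [!A-Z] rejects them.  The helper name
glob_under_collate is made up for illustration.  It calls bash
explicitly, since dash would hide the effect, and results for non-C
locales will depend on the bash version and on which locales are
installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): expand the pattern from
# the example above in a scratch directory, under a given LC_COLLATE.
glob_under_collate() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        touch 11 pp qq
        # LC_ALL would override LC_COLLATE, so clear it for the child shell.
        unset LC_ALL
        LC_COLLATE="$1" bash -c 'echo [!A-Z]*[!~]'
    )
    status=$?
    rm -rf "$dir"
    return "$status"
}

# With C collation, A-Z covers only the uppercase ASCII letters,
# so all three file names match.
glob_under_collate C
```

Running glob_under_collate fr_CH.UTF-8 on an affected system should
reproduce the shorter output shown above.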

In principle this bug might be fixable by asking (almost) every
package to set LC_COLLATE in debian/rules.  But ISTM that it would be
much better to fix this in dpkg-buildpackage.

I suggest that dpkg-buildpackage should do as follows:

 * Unconditionally set one of the following
       LC_COLLATE=C.UTF-8
       LC_COLLATE=C
   Colin Watson tells me that C.UTF-8 has been in libc since
   approximately squeeze.  C is theoretically UB (!) for high-bit
   set octets but in practice works just fine (and it would be
   intolerable if it didn't).

 * Check the effective LC_COLLATE using locale(1), and produce
   a warning if the result is not m/^C(?=\.|$)/.  (This is useful
   because some misguided user might set LC_ALL.)
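
The two steps above could look roughly like this.  It is only a shell
sketch of the logic: dpkg-buildpackage itself is a Perl program, and
the function names collate_is_c and effective_lc_collate are invented
here.  It assumes locale(1) prints a line of the form LC_COLLATE=value,
with the value in double quotes when it is implied by LANG or LC_ALL
rather than set directly.

```shell
#!/bin/sh
# Sketch only; the function names are illustrative, not dpkg code.

# Succeed if the value is "C" or starts with "C.", mirroring the
# regexp m/^C(?=\.|$)/ from the proposal.
collate_is_c() {
    case "$1" in
        C|C.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Print the effective LC_COLLATE as reported by locale(1), unquoted.
effective_lc_collate() {
    locale | sed -n 's/^LC_COLLATE=//p' | tr -d '"'
}

# Step 1: set the collation order unconditionally.
LC_COLLATE=C.UTF-8
export LC_COLLATE

# Step 2: warn if something (typically LC_ALL) still overrides it.
effective=$(effective_lc_collate)
if ! collate_is_c "$effective"; then
    echo "warning: effective LC_COLLATE is '$effective', not C or C.*" >&2
fi
```

The check is kept separate from the assignment because setting
LC_COLLATE alone has no effect when LC_ALL is present in the
environment.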

In the meantime the reproducible builds folks may want to consider
explicitly setting LC_COLLATE to something sane in their 2nd build.

Thanks for your attention.

Regards,
Ian.

[1]
Shell path glob patterns are mostly like normal glob patterns:
  http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_03

Glob patterns' bracketed [] character sets are mostly like regexp ones:
  http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_02

Regexp bracketed character sets with ranges depend on locale.
Point 7 of:
  http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

-- 
Ian Jackson <ijackson at chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.


