[Pkg-shadow-devel] Musings about Usernames in adduser and Debian

Sat Nov 23 07:48:10 GMT 2024

Marc Haber left as an exercise for the reader:
> (1)
> Should Debian allow UTF-8 user names in the first place or should we
> restrict names for regular users to some us-ascii near set as well? (I
> think yes, we should)

I feel strongly that yes, we should allow UTF-8, despite POSIX
admonitions (quoted elsewhere in this thread) and sure breakage
in any number of places. I think a test plan would be very
desirable: off the top of my head, we'd want to check login, the
DMs, PAM, OpenSSH, passwd, w, framebuffer console input, etc. It
would probably also be a good idea to loop in other distributions.

I recommend Chapter 7 of my free book, "Hacking the Planet with
Notcurses: A Guide to TUIs and Character Semigraphics" for the
full story (as I understand it) regarding Unicode presentation:
https://nick-black.com/htp-notcurses.pdf (starts on page 41).

Some serious concerns:

 * any upstream tool could say "bad idea" and refuse patches,
   leaving Debian to carry those patches long term,
 * the Linux framebuffer console is pretty limited in what
   glyphs it has available, and the number of glyphs it can
   support,
 * you want installer support if you intend to do this right,
 * ubiquitous input for UTF-8 is a pretty complicated story, and
 * broken localization (or failure to call setlocale()) could be
   a bigger problem, especially for root/system accounts.

Other concerns:

You'll likely now be linking libunistring into some
binaries where it wasn't previously used.

Regarding the subset of Unicode characters you'd want to allow,
this would be best decided using the General Category trait.
Each codepoint is assigned one of a finite set of General
Categories. We would probably want to allow Letters, Marks, and
Numbers, and perhaps a whitelist from Punctuation and Symbols
(Punctuation, connector and Punctuation, dash are probably all
we'd want) extended from currently supported ispunct(3)
characters. This data is available from libunistring (and
probably other places). This eliminates a great swath of known
security issues.
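
A rough sketch of such a filter, in Python for brevity, with the
standard unicodedata module standing in for libunistring. The
names ALLOWED_MAJOR, ALLOWED_EXACT, char_allowed and name_allowed
are made up for illustration, and the policy is only the one
floated above (Letters, Marks, Numbers, plus Pc and Pd). It does
not reproduce the existing ASCII rules, and its answers depend on
the Unicode version the interpreter ships.

```python
import unicodedata

# Assumed policy: whole major classes L (Letter), M (Mark), N (Number),
# plus exactly two punctuation categories, Pc (connector) and Pd (dash).
ALLOWED_MAJOR = {"L", "M", "N"}
ALLOWED_EXACT = {"Pc", "Pd"}


def char_allowed(ch: str) -> bool:
    """True if the code point's General Category is on the allow list."""
    cat = unicodedata.category(ch)  # two-letter code such as "Ll" or "Zs"
    return cat[0] in ALLOWED_MAJOR or cat in ALLOWED_EXACT


def name_allowed(name: str) -> bool:
    """True if the name is non-empty and every code point is allowed."""
    return bool(name) and all(char_allowed(ch) for ch in name)
```

Under this policy spaces (Zs), slashes (Po), and bidirectional
override characters (Cf) are all rejected, since none of them is
in an allowed category.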

Names containing invalid UTF-8 sequences ought be rejected.
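
As a minimal sketch of that check (Python again; decode_name is
a hypothetical helper, not anything adduser has today), a strict
decode of the raw bytes is enough to catch malformed input:

```python
def decode_name(raw: bytes):
    """Return the decoded name, or None if raw is not well-formed UTF-8."""
    try:
        # Strict decoding rejects stray continuation bytes, overlong
        # encodings, and encoded surrogates.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

A Latin-1 encoded name such as b"jos\xe9" comes back as None
here, as does the overlong sequence b"\xc0\xaf".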

Characters 0-127 would presumably be allowed iff they are now;
UTF-8 preserves US-ASCII.

We ought support combining characters up through the Extended
Grapheme Cluster (a single user-perceived character, roughly a
glyph, made up of one or more encoded characters). Generally a
single backspace ought map to an entire EGC.
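
To illustrate the backspace point, here is a deliberately crude
approximation in Python. It treats "one base code point plus any
trailing Marks" as a cluster, which is NOT the full Extended
Grapheme Cluster algorithm (that also covers Hangul jamo, ZWJ
sequences, regional indicators, and more). The function name
backspace is hypothetical.

```python
import unicodedata


def backspace(text: str) -> str:
    """Remove the last base character together with its combining marks.

    Approximation only: a cluster here is one code point followed by
    zero or more code points whose General Category starts with "M".
    """
    i = len(text)
    # Skip backwards over trailing combining marks.
    while i > 0 and unicodedata.category(text[i - 1]).startswith("M"):
        i -= 1
    # Then drop the base character they were attached to, if any.
    if i > 0:
        i -= 1
    return text[:i]
```

With this, backspacing over "e" followed by U+0301 removes both
code points at once rather than leaving a bare "e" behind.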

Regarding canonicalization/normalization, this is a complex
question without a necessarily correct technical answer. I think
you'd want to follow the Principle of Least Astonishment; as to what
would astonish the least, I'd like to hear wider input. But
Unicode definitely defines multiple normal forms and equivalency
classes.
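
A small Python demonstration of why this matters: two spellings
of the same visible name can differ byte for byte, and which
normal form you pick changes what counts as "the same". The
helper same_name is hypothetical, and nothing here argues for
one form over another.

```python
import unicodedata

composed = "jos\u00e9"     # precomposed U+00E9
decomposed = "jose\u0301"  # "e" followed by combining acute U+0301


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Different code point sequences, hence different bytes in /etc/passwd.
print(composed == decomposed)
# Canonically equivalent, so NFC (or NFD) makes them compare equal.
print(same_name(composed, decomposed))
# The ligature U+FB01 is only *compatibility* equivalent to "fi":
# NFC keeps the two apart, NFKC folds them together.
print(same_name("\ufb01le", "file", "NFC"), same_name("\ufb01le", "file", "NFKC"))
```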

You now have glyphs which occupy more than one column. Are your
columnar/tabular programs prepared for that? ﷽𒁭𒐫
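
For a sense of what a columnar program has to do, here is a
rough width estimate in Python. It is an assumption-laden
stand-in for wcwidth(3): wide and fullwidth East Asian
characters count as two columns, nonspacing and enclosing marks
and format characters as zero, everything else as one.
estimated_columns is a made-up name, and a table lookup like
this may not match what a given terminal actually draws.

```python
import unicodedata


def estimated_columns(text: str) -> int:
    """Estimate how many terminal columns text will occupy."""
    cols = 0
    for ch in text:
        # Nonspacing marks, enclosing marks and format characters are
        # assumed to take no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide (W) and Fullwidth (F) are assumed to take two.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols
```

A program that pads a username column with len() instead of
something like this will misalign on the first CJK name.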

> (2)
> If the answer to (1) is "allow UTF-8", should we also do that for system
> users? (I think no, we should not)

I think you should, simply because otherwise you have two paths
in more places.

> (2a)
> Which UTF-8 subset / code point classes should we allow and which should
> we reject? (I don't have an opinion about that)

Answered above.

> (3)
> I think that 32 characters/bytes (it's the same if we don't allow UTF-8)
> is a good limitation for a system user name. But, should we increase
> that for regular user names? (I think yes)

I hesitate to comment here because who really cares, but does 32
save us something over 128? 128 seems the default "enough for
everybody" these days, looking at IPv6 and ZFS.

My printer is administered by i̸̒n̴͛e̵̎l̴͝u̷̾c̴̉t̵́å̵b̷͋l̷͐e̴̋m̸̆o̷̚d̴̐ä̸́l̶͝i̷̋t̷͗ẏ̷ȏ̵f̸̃t̶͘h̷͗e̴̿v̶͘i̷̛s̸̈́ì̵b̷̃l̶̎e̷͊.
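
Whatever the number ends up being, it matters what is being
counted. A quick Python sketch (name_sizes and fits are
hypothetical helpers, and the choice to count UTF-8 bytes is an
assumption, not something decided in this thread):

```python
def name_sizes(name: str):
    """Return (UTF-8 byte count, code point count) for a name."""
    return len(name.encode("utf-8")), len(name)


def fits(name: str, limit_bytes: int = 32) -> bool:
    """True if the UTF-8 encoding of name is within limit_bytes."""
    return len(name.encode("utf-8")) <= limit_bytes


# One perceived character, two code points, three bytes.
print(name_sizes("e\u0301"))
# Sixteen precomposed U+00E9 fill a 32-byte limit exactly.
print(fits("\u00e9" * 16), fits("\u00e9" * 17), fits("\u00e9" * 17, 128))
```

A name piled high with combining marks runs through a byte
budget far faster than its visible length suggests.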

> (5)
> Is it right to say "the user name in /etc/passwd is UTF-8 encoded" or
> should I better say "the user name in /etc/passwd can be UTF-8 encoded"?

"It is UTF-8 encoded."

> (6)
> Does it still make sense to give non-UTF-8-locales special handling
> (which one?), or can adduser safely assume that any non-ascii locale is
> UTF-8? Or must I check for locale and reject UTF-8 user names on
> non-UTF-8 locales? (I hope that we can safely assume UTF-8)

It cannot. "C" is not UTF-8. Assumption of UTF-8 requires a
properly set LANG and programs calling setlocale(). This, as
alluded to above, has the potential for a big mess.
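
A sketch of the kind of check this implies, in Python on a
Unix-like system. current_codeset and codeset_is_utf8 are
hypothetical names. Note that the Python interpreter does some
locale setup of its own at startup, so a C program that never
calls setlocale() may see something different from what this
reports.

```python
import locale


def codeset_is_utf8(codeset: str) -> bool:
    """True if a codeset name such as "UTF-8" or "utf8" denotes UTF-8."""
    return codeset.replace("-", "").replace("_", "").lower() == "utf8"


def current_codeset() -> str:
    """Adopt the environment's locale, then report its character encoding."""
    try:
        # Equivalent of setlocale(LC_ALL, "") in C: honor LANG / LC_*.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # environment names a locale that is not installed
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    cs = current_codeset()
    print(cs, "is UTF-8" if codeset_is_utf8(cs) else "is not UTF-8")
```

On glibc the plain "C" locale reports ANSI_X3.4-1968 as its
codeset, which this check correctly treats as not UTF-8.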

-- 
nick black -=- https://nick-black.com
to make an apple pie from scratch,
you need first invent a universe.