[Pkg-shadow-devel] Bug#478771: passwd: shadow libraries ignore stale locks based only on PID

Wed May 14 16:56:38 UTC 2008

On Wed, May 14, 2008 at 06:00:51PM +0400, xrgtn at yandex.ru wrote:
> Hello,
> 
> On 01.05.08, 02:07, Castor Fu wrote:
> 
> > When creating or deleting users or groups, the shadow utilities
> > creates lock files like /etc/group.lock, /etc/shadow.lock, etc.
> >
> > These lock files contain the PID of the locker, so that if the
> > lock older dies without unlocking the file, the lock requestor
> > can break the lock and take ownership.
> >
> > One scenario that we've encountered is that if the system is 
> > restarted soon after creating new groups, the attempt to delete
> > the locks is lost.
> >
> > The low PID then is used upon startup by a long-running daemon,
> > and then the lock is never freed.
> 
> Also the low PID can be reused on systems with random
> PID allocation.
> 
> > A potential workaround would be to check and see if the mod date 
> > is older than the system uptime. Then one is limited to cases
> > where the PID has wrapped and collided which is much less
> > likely.
> 
> Basically, there are 3 methods for solving the problem:
> 
> 1. mtime vs. uptime workaround (dunno how to code this
>    in C though. Maybe a separate clean-passwd-lock-after-reboot
>    shellcript will do better ;))
> 
> 2. check that the PID from /etc/group.lock, /etc/shadow.lock
>    is not only alive (kill -O $PID), but also owned by one
>    of the shadow utils. In Perl, I do this next way:
>    
>         if ($pid and $pid != $$ and kill(0, $pid)) {
>             my @ps = split m|$/|, qx/ps -fp $pid/
>                 || die("ERROR: ps utility not available: $!");
>             s/^\s+// for @ps;   # leading spaces confuse us
>             my @ps_header = split(/\s+/, $ps[0]);
>             @ps = split /\s+/, $ps[1], $#ps_header + 1;
>             $pid = undef unless scalar grep /$pattern/, $ps[$#ps_header];
>         };
>         return $pid;
> 
>    $pattern is passed as a parameter into the subroutine,
>    it specifies "our" process/processes. Sorry for ugly
>    code, it's actually taken and modified from my own
>    2-way raceless .pid/.lock handshake routine. I know
>    there exists at least one "ps -fp"- or "ps -p"-based
>    implementation on CPAN, the above code was borrowed
>    from there long time ago.
> 
>    This way is not universally portable, also it's not easy
>    to code this in C.
> 
> 3. Try either of flock/lockf (whatever is supported on the
>    target platform). This is not portable across Unices
>    and prone to race conditions (ps -fp/PID checks too).
>    But this is more easy to program in C.
> 
> 4. Forcibly overwrite .lock files after some timeout
>    (e.g. around 2 minutes).
> 
> Maybe we should try combined 1/3 approach. 1st on flock-less
> platforms, 3rd of flock-able ones. What's your opinion,

Now, I remember I forgot to reply ;)

My idea was to let the admin remove the lock file herself.

If the system is restarted, or a locker process died for any reason, it
might be better to warn the admin, and let her do the cleanup.
The user/group databases might be inconsistent, and only the admin may
know what append, when, and what operation was ongoing.

I don't know what is the error message currently, but it might be good to
review it (them) to make sure it is helpful (at least to indicate that a
lock file exists).

It would however be too much to indicate that:
 * another process might be using the database and she has to retry later
 * one process might have die and left lock file. In that case, the user
   must audit the system and then remove the lock file manually.
 * there might be a bug in a shadow tool or another tool that avoided the
   lock file removal.

But that could be mentioned in the manpages (FILES section).

Kind Regards,
-- 
Nekral