Bug#807244: Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)

Tue Dec 8 09:23:05 UTC 2015

Hi,

On 2015-12-07 23:26, Andreas Beckmann wrote:
> Dear libc maintainers,
> 
> we recently got a bug report regarding the TSX-NI / lock elision bug in
> combination with the non-free nvidia driver (#807244). Since that is
> supposed to be fixed with the libc in experimental (and now sid as
> well), perhaps you could take a look why this still happens.
> Several forum posts denote that "compiling glibc without
> --enable-lock-elision" works around that issue.

I disagree that it is supposed to be fixed. Intel had a few bugs in their
TSX-NI implementation for Haswell, Broadwell and possibly early
versions of Skylake, and to avoid data loss we have therefore disabled
lock elision for some CPU revisions. That said, the bugs in the Intel
implementation are corner cases, and it took quite some time for them to
get discovered. If your program crashes reproducibly, it's definitely not
an issue with the TSX-NI implementation. Building glibc without
--enable-lock-elision is just a workaround for the real issue. People are
now starting to have CPUs with a working TSX-NI implementation, which is
therefore not blacklisted, and thus the problem is appearing again.

> A few ideas from my side, but since I don't have the hardware to test, I
> cannot check anything:
> * that specific CPU needs to be blacklisted / is incorrectly whitelisted

As said above, that can't be the cause.

> * nvidia utilizes a code path in libc that is not covered by the current
> patch (and that code path is not used by any other application)
> * nvidia does call something it shouldn't call directly ... thus
> circumventing the runtime-disabling of the specific routines in libc6

According to the backtrace the problem is typical of a call to
mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
before. Nvidia should fix the bug there.

> * nvidia code does issue the problematic instructions itself (but the
> backtrace points to libc, so this sounds unlikely)
> 
> Is there some way to check at runtime how lock elision is handled by
> libc (on a concrete system)?

What do you mean by "how is it handled"? I have attached a small program
which demonstrates the issue. You can use it to check whether your system is
using lock elision or not. When running this program under ltrace, it's quite
easy to spot the unlock call on an already unlocked mutex. I wonder if it's
doable to do the same with the whole Nvidia stack.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien at aurel32.net                 http://www.aurel32.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mutex_crash_tsx.c
Type: text/x-csrc
Size: 248 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/pkg-nvidia-devel/attachments/20151208/536a64eb/attachment.c>