Bug#684941: Conversation with NVIDIA, expect fix in next 304.xx

Mon Aug 27 23:21:56 UTC 2012

I sent a bug report to linux-bugs at nvidia.com regarding the problem. I
had a very nice conversation with Pierre-Loup A. Griffais, who
reports that the problem has been fixed and that the fix can be expected
in the next 304.xx release.

/Daniel

Our conversation:

Quoting "Pierre-Loup A. Griffais":
> Hi Daniel,
>
> I don't have an estimate, but it made it into the current release
> series (304.xx), so whenever there is a new release from that.
>
> Feel free to update the bug report with this information.
>
> Thanks a lot,
> - Pierre-Loup
>
> On 08/24/2012 08:27 PM, Daniel Anderberg wrote:
>> Hi Pierre-Loup,
>>
>> Excellent! Do you dare to venture a guess as to when this will be
>> available in a released driver?
>>
>> Would it be OK if I attached our conversation below to the Debian  
>> bug report?
>>
>> Best regards
>> Daniel Anderberg
>>
>> Quoting "Pierre-Loup A. Griffais":
>>
>>> Hi Daniel,
>>>
>>> This should now be fixed; again, thanks a lot for your detailed report.
>>>  - Pierre-Loup
>>>
>>> On 08/20/2012 06:49 PM, Pierre-Loup Griffais wrote:
>>>> Hi Daniel,
>>>>
>>>> Thanks a lot for your bug report and your in-depth research! I agree
>>>> with your initial findings; the result of the LoaderSymbol call not
>>>> being cached in this case just looks like an oversight. I'm sorry for
>>>> the inconvenience that this has caused you and will look into fixing it.
>>>>
>>>> Thanks,
>>>>  - Pierre-Loup
>>>>
>>>> On 08/20/2012 09:02 AM, Daniel Anderberg wrote:
>>>>> Hello,
>>>>>
>>>>> I have been having problems with X locking up, requiring me to log in
>>>>> from another computer and kill X. This lockup is sporadic but quite
>>>>> frequent.
>>>>> nvidia-bug-report.log.gz attached.
>>>>>
>>>>> Problems verified both with 302.17 and 304.37 (the current Debian
>>>>> testing and unstable packages).
>>>>>
>>>>> I have already filed a bug report with Debian
>>>>> (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=684941). The Debian
>>>>> maintainer suggested I start a thread over at nvnews.net, but since I
>>>>> still can't post there (nvnews support staff have been contacted), I
>>>>> figured that a mail to linux-bugs might not be a bad idea.
>>>>>
>>>>> In analyzing the problem I have arrived at the following points of data:
>>>>>
>>>>> When X freezes, if I attach gdb and get a back-trace I ALWAYS
>>>>> get the same:
>>>>>   - X calls into nvidia_drv.so
>>>>>   - nvidia_drv.so calls LoaderSymbol("rrPrivKeyRec")
>>>>>   - LoaderSymbol calls dlsym, which locks a mutex
>>>>>   - A signal interrupts
>>>>>   - X calls into nvidia_drv.so
>>>>>   - nvidia_drv.so calls LoaderSymbol("rrPrivKeyRec")
>>>>>   - LoaderSymbol calls dlsym, which attempts to lock the same mutex
>>>>>   * Deadlock
>>>>>
>>>>> Example bt attached.
>>>>>
>>>>> Instrumenting dlsym via LD_AUDIT shows that while I move the mouse
>>>>> pointer, X will resolve the symbol "rrPrivKeyRec" approx. 1000 times
>>>>> every second.
>>>>>
>>>>> If I LD_PRELOAD the attached (admittedly crude and not generally
>>>>> applicable) nvidia_workaround when starting the X server, the problem
>>>>> goes away (it used to occur every few hours, and X has now been running
>>>>> well for half a week).
>>>>>
>>>>> My conclusions are the following:
>>>>>
>>>>> - nvidia_drv.so indirectly uses dlsym in a signal handler.
>>>>> - dlsym is not among the async-signal-safe functions according to
>>>>>    POSIX.1-2008.
>>>>> - violating the async-signal-safe rule hundreds of times per second
>>>>>    greatly increases the risk of deadlock.
>>>>> - From an efficiency standpoint, using LoaderSymbol in a hot path seems
>>>>>    suboptimal, and caching the return value from
>>>>>    LoaderSymbol("rrPrivKeyRec") seems prudent.
>>>>>
>>>>> Best wishes
>>>>> Daniel Anderberg
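
The caching suggested in the last conclusion above could look roughly
like the sketch below. It is only an illustration, not NVIDIA's actual
fix: slow_lookup(), resolve_symbols_once() and get_rr_priv_key() are
hypothetical names, and slow_lookup() merely stands in for
LoaderSymbol()/dlsym() so that the number of lookups can be counted.

```c
#include <stddef.h>

/* Dummy object standing in for the real rrPrivKeyRec symbol. */
static int rr_priv_key_rec;

/* Counts how often the slow lookup path is taken. */
static unsigned long lookup_calls;

/* Hypothetical stand-in for LoaderSymbol(); the real function ends up
 * in dlsym(), which takes a lock and is not async-signal-safe. */
static void *slow_lookup(const char *name)
{
    (void)name;
    lookup_calls++;
    return &rr_priv_key_rec;
}

/* Result of the one-time lookup; NULL until resolved. */
static void *cached_key;

/* Call once from normal (non-signal) context, e.g. during setup. */
static void resolve_symbols_once(void)
{
    if (cached_key == NULL)
        cached_key = slow_lookup("rrPrivKeyRec");
}

/* Hot path: only reads the cached pointer, never calls the loader,
 * so it takes no lock even when reached from a signal handler. */
static void *get_rr_priv_key(void)
{
    return cached_key;
}
```

With the lookup done once up front, the code that runs on every pointer
movement never reaches dlsym, so it can no longer try to re-take the
loader's mutex from inside a signal handler.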