Bug#684941: Conversation with NVIDIA, expect fix in next 304.xx
Daniel Anderberg
daniel.anderberg at fagotten.org
Mon Aug 27 23:21:56 UTC 2012
I sent a bug report to linux-bugs at nvidia.com regarding the problem. I
had a very nice conversation with Pierre-Loup A. Griffais, who reports
that the problem has been fixed and that the fix can be expected in the
next 304.xx release.
/Daniel
Our conversation:
Quoting "Pierre-Loup A. Griffais":
> Hi Daniel,
>
> I don't have an estimate, but it made it into the current release
> series (304.xx), so whenever there is a new release from that.
>
> Feel free to update the bug report with this information.
>
> Thanks a lot,
> - Pierre-Loup
>
> On 08/24/2012 08:27 PM, Daniel Anderberg wrote:
>> Hi Pierre-Loup,
>>
>> Excellent! Do you dare to venture a guess as to when this will be
>> available in a released driver?
>>
>> Would it be OK if I attached our conversation below to the Debian
>> bug report?
>>
>> Best regards
>> Daniel Anderberg
>>
>> Quoting "Pierre-Loup A. Griffais":
>>
>>> Hi Daniel,
>>>
>>> This should now be fixed; again, thanks a lot for your detailed report.
>>> - Pierre-Loup
>>>
>>> On 08/20/2012 06:49 PM, Pierre-Loup Griffais wrote:
>>>> Hi Daniel,
>>>>
>>>> Thanks a lot for your bug report and your in-depth research! I agree
>>>> with your initial findings; the result of the LoaderSymbol call not
>>>> being cached in this case just looks like an oversight. I'm sorry for
>>>> the inconvenience that this has caused you and will look into fixing it.
>>>>
>>>> Thanks,
>>>> - Pierre-Loup
>>>>
>>>> On 08/20/2012 09:02 AM, Daniel Anderberg wrote:
>>>>> Hello,
>>>>>
>>>>> I have been having problems with X locking up, requiring me to log in
>>>>> from another computer and kill X. This lockup is sporadic but quite
>>>>> frequent.
>>>>> nvidia-bug-report.log.gz attached.
>>>>>
>>>>> Problems verified both with 302.17 and 304.37 (the current Debian
>>>>> testing and unstable packages).
>>>>>
>>>>> I have already filed a bug report with Debian
>>>>> (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=684941). The Debian
>>>>> maintainer suggested I start a thread over at nvnews.net, but since I
>>>>> still can't post there (nvnews support staff have been contacted), I
>>>>> figured that a mail to linux-bugs might not be a bad idea.
>>>>>
>>>>> In analyzing the problem I have arrived at the following points of data:
>>>>>
>>>>> When X freezes, if I attach gdb and get a backtrace, I ALWAYS
>>>>> get the same:
>>>>> - X calls into nvidia_drv.so
>>>>> - nvidia_drv.so calls LoaderSymbol("rrPrivKeyRec")
>>>>> - LoaderSymbol calls dlsym that locks a mutex
>>>>> - Signal interrupts
>>>>> - X calls into nvidia_drv.so
>>>>> - nvidia_drv.so calls LoaderSymbol("rrPrivKeyRec")
>>>>> - LoaderSymbol calls dlsym that attempts to lock the same mutex
>>>>> * Deadlock
>>>>>
>>>>> Example bt attached.
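
The sequence in the backtrace above can be reproduced in miniature. The following is only a sketch under stated assumptions: `lock_held` is a hypothetical stand-in for the mutex that dlsym takes internally, `on_signal` stands in for whatever handler ends up calling LoaderSymbol, and SIGUSR1 is an arbitrary choice of signal. None of these names come from the driver or the X server. Instead of actually hanging, the handler records that it found the lock already taken.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the mutex that dlsym takes internally. */
static volatile sig_atomic_t lock_held = 0;
/* Set by the handler when it finds the lock already taken. */
static volatile sig_atomic_t would_deadlock = 0;

/* Stand-in for a signal handler that ends up in a symbol lookup. */
static void on_signal(int signo)
{
    (void)signo;
    if (lock_held) {
        /* A real non-recursive mutex would block here forever: its owner
         * is the interrupted code, which cannot resume until this handler
         * returns. Here we only record the situation. */
        would_deadlock = 1;
    }
}

/* Returns 1 if a signal arriving inside the critical section would have
 * deadlocked, 0 otherwise. */
int simulate_interrupted_lookup(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void)rc;

    would_deadlock = 0;
    lock_held = 1;      /* "LoaderSymbol calls dlsym that locks a mutex" */
    raise(SIGUSR1);     /* "Signal interrupts"; handler runs before raise returns */
    lock_held = 0;      /* in the real deadlock this line is never reached */
    return would_deadlock;
}
```

Calling `simulate_interrupted_lookup()` from an ordinary single-threaded program returns 1, which corresponds to the "Deadlock" step in the list: the second lock attempt happens while the first holder is still suspended underneath it.
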
>>>>>
>>>>> Instrumenting dlsym via LD_AUDIT shows that while I move the mouse
>>>>> pointer, X will resolve the symbol "rrPrivKeyRec" approximately 1000
>>>>> times every second.
>>>>>
>>>>> If I LD_PRELOAD the attached (admittedly crude and not generally
>>>>> applicable) nvidia_workaround when starting the X server the problem
>>>>> goes away (used to occur every few hours, now running fine for half a
>>>>> week).
>>>>>
>>>>> My conclusions are the following:
>>>>>
>>>>> - nvidia_drv.so indirectly uses dlsym in a signal handler.
>>>>> - dlsym is not among the async-signal-safe functions according to
>>>>> POSIX.1-2008.
>>>>> - violating the async-signal-safe rule hundreds of times per second
>>>>> greatly increases the risk of deadlock.
>>>>> - From an efficiency standpoint, using LoaderSymbol in a hot path seems
>>>>> suboptimal, and caching the return value from
>>>>> LoaderSymbol("rrPrivKeyRec") seems prudent.
>>>>>
>>>>> Best wishes
>>>>> Daniel Anderberg
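
The caching suggested in the last conclusion above might look roughly like the sketch below. It is not NVIDIA's actual fix, whose code is not shown in this thread. `slow_symbol_lookup` is a hypothetical stand-in for LoaderSymbol, `fake_priv_key` stands in for the storage behind "rrPrivKeyRec", and the function names are invented for illustration. The idea is to resolve the symbol once from normal (non-signal) context and let the hot path only read the stored pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int lookup_calls = 0;   /* counts slow lookups, for illustration only */
static int fake_priv_key;      /* hypothetical storage behind the symbol */

/* Hypothetical stand-in for LoaderSymbol(): imagine this is slow and
 * takes a lock internally, as dlsym does. */
static void *slow_symbol_lookup(const char *name)
{
    assert(name != NULL);
    lookup_calls++;
    return strcmp(name, "rrPrivKeyRec") == 0 ? (void *)&fake_priv_key : NULL;
}

/* Cached result; NULL until init_symbol_cache() has run. */
static void *cached_rr_priv_key = NULL;

/* Call once from normal context, for example during driver setup. */
void init_symbol_cache(void)
{
    if (cached_rr_priv_key == NULL)
        cached_rr_priv_key = slow_symbol_lookup("rrPrivKeyRec");
}

/* Hot path: reads the stored pointer and takes no lock. */
void *get_rr_priv_key(void)
{
    return cached_rr_priv_key;
}

/* Exposes the counter so the effect of caching can be checked. */
int get_lookup_calls(void)
{
    return lookup_calls;
}
```

After one call to `init_symbol_cache()`, any number of calls to `get_rr_priv_key()` return the same pointer and `get_lookup_calls()` stays at 1. Because the hot path no longer reaches dlsym, a signal arriving there has no loader mutex to collide with, which addresses both the deadlock and the efficiency point.
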