Bug#852013: Patch to prevent segfaults on signal

Brett Smith debbug at brettcsmith.org
Sun Jan 29 15:59:56 UTC 2017

On 01/29/2017 04:21 AM, Mattia Rizzolo wrote:
> Yesterday I tried several times with what was HEAD (with your patch) and
> before that, and it appeared to not clean up in both cases, without a
> segfault reported.  Today it seems to always clean up with your patches
> but always segfault (which is weird, as it never segfaulted reliably for
> me) without.  meh.

So there's some more background I found during debugging that might be
relevant here.  While I was debugging, I added the line
`traceback.print_stack(stack_frame, file=sys.stderr)` to the top of
sigterm_handler, before the exit.
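
For anyone who wants to repeat that debugging step, here is a minimal
sketch of a handler instrumented that way.  It is not the actual
diffoscope code: the exit status of 2 and the handler body are
assumptions for illustration, and only the print_stack line comes from
the description above.

```python
import signal
import sys
import traceback


def sigterm_handler(signo, stack_frame):
    # stack_frame is the frame the main thread was executing when the
    # signal was delivered, so this shows where it was interrupted.
    traceback.print_stack(stack_frame, file=sys.stderr)
    # Assumed exit status, for illustration only.
    sys.exit(2)


if __name__ == "__main__":
    # Python runs signal handlers in the main thread only.
    signal.signal(signal.SIGTERM, sigterm_handler)
```

Note that this only shows the main thread's stack; the other threads
are not visible from the frame passed to the handler.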

When Python *did* segfault, it was always in the middle of trying to
acquire a threading.Lock object.  Both threading.Thread objects and
Queue objects (used by the ExThread class) use locks underneath, so
there were a few ways to get to that point.  Here are a few
traceback excerpts from segfaults.  I didn't keep great notes about the
state of the code when I got them, so some of these might come from
development states of the code that aren't reflected in Git anywhere:

  File "/home/brett/repos/diffoscope/diffoscope/diff.py", line 238, in
  File "/home/brett/repos/diffoscope/diffoscope/diff.py", line 224, in join
    ex = self.wait_for_exc_info()
  File "/home/brett/repos/diffoscope/diffoscope/diff.py", line 221, in
    return self.__status_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
  File "/usr/lib/python3.4/threading.py", line 290, in wait

  File "/home/brett/repos/diffoscope/diffoscope/diff.py", line 221, in
  File "/usr/lib/python3.4/threading.py", line 855, in start
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 290, in wait

  File "/home/brett/repos/diffoscope/diffoscope/diff.py", line 225, in
  File "/home/brett/repos/diffoscope/diffoscope/diff.py", line 211, in join
  File "/usr/lib/python3.4/threading.py", line 1060, in join
  File "/usr/lib/python3.4/threading.py", line 1076, in
    elif lock.acquire(block, timeout):
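
To make the call paths in those excerpts easier to follow, here is a
rough sketch of a thread class shaped like the one the tracebacks
suggest.  It is a hypothetical stand-in, not the ExThread source: the
class name StatusThread and the method bodies are assumptions, and only
the names join, wait_for_exc_info, and the status queue are taken from
the tracebacks.  The comments mark where a lock is acquired.

```python
import queue
import sys
import threading


class StatusThread(threading.Thread):
    """Hypothetical stand-in for a thread that reports errors via a Queue."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._status_queue = queue.Queue()

    def run(self):
        try:
            super().run()
        except Exception:
            # Hand the exception over to whoever joins this thread.
            self._status_queue.put(sys.exc_info())
        else:
            self._status_queue.put(None)

    def wait_for_exc_info(self):
        # Queue.get() blocks on a threading.Condition, which wraps a lock.
        return self._status_queue.get()

    def join(self):
        exc_info = self.wait_for_exc_info()
        # Thread.join() waits on the thread's own internal lock.
        super().join()
        if exc_info is not None:
            raise exc_info[1]
```

Both blocking calls in join() end up waiting on a lock, which matches
the point where each excerpt stops.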

Segfaults also always wrote this in my syslog:

diffoscope[21517]: segfault at 0 ip 00007fa568ab9049 sp 00007fa56089a168
error 6 in libc-2.19.so[7fa568a27000+1a1000]

The "error 6 in libc-2.19.so[7fa568a27000+1a1000]" bit was always the same.

I bring all this up to say, I'm not shocked that we might see different
results on different systems.  Python itself and potentially glibc are
involved here too.  For example, one theory I had is that the glibc call
underlying the lock.acquire() method is not reentrant, but Python was
trying to reenter it after handling SIGTERM in the main thread.

Another note is that whether the crash happens seems to depend on where
the non-main threads are in their execution when the signal is handled.
Unfortunately I wasn't sure how to get tracebacks for them, so I don't
have those.  But I did see some hints that changing the timing of events
affected behavior.  If I traced all execution using the trace module, I
could never reproduce the issue (although I didn't try for long, because
it was slow).  Adding more debugging calls to sigterm_handler also
tended to make segfaults rarer, I'm guessing because that gave the
non-main threads time to get past the troublesome point in their execution.
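
One possible way to get tracebacks for the non-main threads, offered as
an untried sketch rather than something used during this debugging:
sys._current_frames() returns the current top frame of every thread, so
a helper can print each of them.  The function name
dump_all_thread_stacks is made up for illustration.

```python
import sys
import threading
import traceback


def dump_all_thread_stacks(file=None):
    """Print a traceback for every running thread, not just the caller."""
    if file is None:
        file = sys.stderr
    # Map thread identifiers to names so the output is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        print("Thread %s (%d):" % (name, thread_id), file=file)
        traceback.print_stack(frame, file=file)
```

Calling this from a signal handler would add yet more work to the
handler, so it could shift the timing just like the other debugging
calls did.  The standard library's faulthandler module (Python 3.3 and
later) is another option: faulthandler.enable() asks Python to dump
tracebacks when the process receives SIGSEGV.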

I started my patch by trying to clean up some of the other issues I
noted in the commit log, because doing so helped me wrap my head around
what the code was doing.  Once I had all the file handling on one side
of the thread boundary, that resolved the issue for me, so I thought it
was worth proposing the patch at that point.  But I admittedly can't
explain the connection between that theory and some of the messages I
was seeing during debugging.

Brett Smith
