Bug#880883: Some findings about g_closure_ref()/g_closure_unref()
Punit Agrawal
punitagrawal at gmail.com
Sat Dec 2 20:48:08 UTC 2017
I had a chance to dig into this further and think I now know what's going on.
I ran the test on an AMD Seattle system (same as arm-arm-04) and
noticed that the closures test (tests/refcount/closures) still hadn't
finished after more than 150 minutes. Profiling the test using "perf" pointed to
g_closure_ref() and g_closure_unref() as taking the bulk of execution
time. This confirms Emilio's suspicion in #20.
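For reference, the profile was gathered along these lines (the
cycles:uppp event is the one shown in the report header):

$ perf record -e cycles:uppp tests/refcount/closures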
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt
Copying in the annotated profile of g_closure_ref() below; the first
column is the percentage of samples at each instruction as observed by
"perf record":
       │
       │    INC_ASSIGN (closure, ref_count, &new_ref_count);
  0.20 │ 30:   ldr   w1, [x19]
  0.31 │       str   w1, [x29, #32]
  1.06 │       ldr   w0, [x29, #32]
  0.32 │       ldr   w0, [x29, #32]
  0.24 │       ldr   w2, [x29, #32]
       │       add   w0, w0, #0x1        <------- A
       │       bfxil w2, w0, #0, #15
  0.13 │       str   w2, [x29, #32]
  2.35 │       ldr   w0, [x29, #32]
  0.46 │       ldr   w2, [x29, #32]
       │       and   w0, w0, #0x7fff
 54.73 │ 5c:   ldxr  w3, [x19]           <------- B
       │       cmp   w3, w1
  0.02 │     ↓ b.ne  70
  0.97 │       stlxr w4, w2, [x19]
       │     ↑ cbnz  w4, 5c
 38.56 │ 70:   dmb   ish
       │     ↑ b.ne  30
       │    g_return_val_if_fail (new_ref_count > 1, NULL);
  0.00 │       cmp   w0, #0x1
       │     ↓ b.ls  c8
       │
       │    return closure;
The ldxr and dmb instructions together make up ~93% of all the samples.
The generated code also looks quite bad - notice the redundant loads
into w0/w2. It's as if the code were compiled with optimisations turned
off - which is not the case (it's compiled with -O2).
The duplicate loads also lead to a large window between incrementing
the reference[A] and trying to do a compare and swap starting at [B].
Another thread performing the same accesses between A and B will cause
the cache line to bounce between the two cores and also fail the ldxr,
restarting the loop.
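To see this in isolation, here is a minimal standalone sketch of the
same pattern (an illustration, not the actual glib code): two threads
incrementing the same volatile 15-bit bit field via compare-and-swap on
the containing word.

#include <pthread.h>
#include <stdio.h>

/* Mimics the troublesome layout: a volatile 15-bit ref_count sharing a
   word with other bit fields, updated by CAS on the whole word. */
typedef union {
  struct {
    volatile unsigned int ref_count : 15;
    volatile unsigned int flags     : 17;
  } f;
  unsigned int word;
} Shared;

static Shared shared;

static void *
hammer (void *arg)
{
  int i;
  for (i = 0; i < 10000000; i++)
    {
      Shared tmp;
      unsigned int old_word;
      do
        {
          tmp.word = old_word = shared.word;  /* snapshot, as at [A] */
          tmp.f.ref_count += 1;               /* volatile bitfield update */
        }
      while (!__sync_bool_compare_and_swap (&shared.word,   /* CAS, as at [B] */
                                            old_word, tmp.word));
    }
  return arg;
}

int
main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, hammer, NULL);
  pthread_create (&t2, NULL, hammer, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  /* ref_count wraps at 2^15; the interesting output is the run time
     and the perf profile, not the final value. */
  printf ("ref_count = %u\n", (unsigned int) shared.f.ref_count);
  return 0;
}

Built with gcc -O2 -pthread, profiling this should show the same
ldxr/dmb hotspots when the two threads land on separate cores.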
Aside: The x86 assembly for ref/unref also has duplicate loads, but x86
doesn't seem to mind them much.
On further inspection, I noticed the following definition of the
GClosure structure:
struct _GClosure
{
  /*< private >*/
  volatile guint ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical
     reasons as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile guint meta_marshal_nouse : 1;
  volatile guint n_guards : 1;
  volatile guint n_fnotifiers : 2;  /* finalization notifiers */
  volatile guint n_inotifiers : 8;  /* invalidation notifiers */
  volatile guint in_inotify : 1;
  volatile guint floating : 1;
  /*< protected >*/
  volatile guint derivative_flag : 1;
  /*< public >*/
  volatile guint in_marshal : 1;
  volatile guint is_invalid : 1;
  ...
};
Specifically, the ref_count is defined as a volatile bit field. That,
coupled with the implementation of the increment, explains the bad
assembly sequence generated for g_closure_ref()/g_closure_unref().
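For reference, the increment boils down to something like this (a
paraphrase of the INC_ASSIGN/CHANGE_FIELD machinery in
gobject/gclosure.c, not the literal source):

#include <glib-object.h>

typedef union {
  GClosure closure;
  gint     vint;
} ClosureInt;

static guint
closure_ref_count_inc (GClosure *closure)
{
  ClosureInt *cunion = (ClosureInt *) closure;
  gint old_int, new_int;
  guint new_ref_count;

  /* Snapshot the word holding the bit fields, bump ref_count in the
     local copy, then try to publish the copy with a compare-and-swap,
     retrying if another thread got there first. */
  do
    {
      ClosureInt tmp;

      tmp.vint = old_int = cunion->vint;
      tmp.closure.ref_count += 1;            /* the add + bfxil above */
      new_ref_count = tmp.closure.ref_count;
      new_int = tmp.vint;
    }
  while (!g_atomic_int_compare_and_exchange (&cunion->vint, old_int, new_int));

  return new_ref_count;
}

Because every bit field in tmp (a local variable!) is volatile, the
compiler is forced to keep it in its stack slot - [x29, #32] above -
and reload it on every access, which is where the redundant loads come
from.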
AFAICT, there is no need for the ref_count (or the other bit fields in
the structure) to be volatile, especially as they seem to be updated
with gcc __sync builtins.
I have a few ideas on how to improve the code by re-factoring
INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in
gobject/gclosure.c. I'll give those a shot and see if they lead to an
improvement in performance. Happy to discuss any suggestions or try
patches and report back.
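As a rough sketch of the direction I have in mind (untested, and
assuming the usual little-endian GCC layout where ref_count occupies
the low 15 bits of the first word):

static guint
closure_ref_count_inc_fast (GClosure *closure)
{
  gint *word = (gint *) closure;   /* the word containing the bit fields */
  guint old_u, new_u;

  /* Keep the snapshot in plain, non-volatile locals so the compiler
     can hold them in registers, shrinking the window between the
     increment and the compare-and-swap. */
  do
    {
      old_u = (guint) *word;
      new_u = (old_u & ~0x7fffu) | ((old_u + 1) & 0x7fffu);
    }
  while (!g_atomic_int_compare_and_exchange (word, (gint) old_u, (gint) new_u));

  return new_u & 0x7fff;
}

This keeps the wrap-around semantics of the 15-bit field and leaves
the other flags untouched, but only measurements will tell whether it
actually helps.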
It might also be worth raising these findings with upstream. I am not
at all familiar with glib, and they might be able to help explain the
code structure and have suggestions for improvement.
Since I hadn't dug into this enough... ;)
I also ran a few iterations of the test on some other ARM64 platforms,
Juno R2 and Juno R0 (this one is the same as arm-arm-01). On the R2
the test always finished in a couple of minutes. But on Juno R0, it
sometimes finished in a couple of minutes and sometimes got tied up in
knots, still running after 15 minutes (at which point I terminated the
test). After
some experiments with restricting the cores the test ran on (using
taskset), I concluded that when the test ran on the A57s (incidentally
the same cores as on the AMD Seattle system) it took a long time but
not when it was running on the A53s.
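For example (core numbering is board-specific; 0,1 here is just
illustrative):

$ taskset -c 0,1 tests/refcount/closures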
So it seems the micro-architectural behaviour combined with the poor
code sequence makes this especially painful on some cores, while
others chug through without breaking a sweat.