Bug#880883: Some findings about g_closure_ref()/g_closure_unref()
Punit Agrawal
punitagrawal at gmail.com
Sat Dec 2 20:48:08 UTC 2017
I had a chance to dig into this further and think I now know what's going on.
I ran the test on an AMD Seattle system (same as arm-arm-04) and
noticed that the closures test (tests/refcount/closures) still hadn't
finished after more than 150 minutes. Profiling the test using "perf" pointed to
g_closure_ref() and g_closure_unref() as taking the bulk of execution
time. This confirms Emilio's suspicion in #20.
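For reference, the profile was gathered along these lines (the
cycles:uppp event is the one shown in the report header):

$ perf record -e cycles:uppp tests/refcount/closures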
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt
Copying in the annotated profile of g_closure_ref() below; the first
column is the percentage of samples at each instruction as observed by
"perf record":
       │
       │    INC_ASSIGN (closure, ref_count, &new_ref_count);
  0.20 │ 30:   ldr   w1, [x19]
  0.31 │       str   w1, [x29, #32]
  1.06 │       ldr   w0, [x29, #32]
  0.32 │       ldr   w0, [x29, #32]
  0.24 │       ldr   w2, [x29, #32]
       │       add   w0, w0, #0x1        <------- A
       │       bfxil w2, w0, #0, #15
  0.13 │       str   w2, [x29, #32]
  2.35 │       ldr   w0, [x29, #32]
  0.46 │       ldr   w2, [x29, #32]
       │       and   w0, w0, #0x7fff
 54.73 │ 5c:   ldxr  w3, [x19]           <------- B
       │       cmp   w3, w1
  0.02 │     ↓ b.ne  70
  0.97 │       stlxr w4, w2, [x19]
       │     ↑ cbnz  w4, 5c
 38.56 │ 70:   dmb   ish
       │     ↑ b.ne  30
       │    g_return_val_if_fail (new_ref_count > 1, NULL);
  0.00 │       cmp   w0, #0x1
       │     ↓ b.ls  c8
       │
       │    return closure;
The ldxr and dmb instructions together make up ~93% of all the samples.
The generated code also looks quite bad - notice the redundant loads
into w0/w2. It's as if the code were compiled with optimisations turned
off - which is not the case (it's compiled with -O2).
The duplicate loads also lead to a large window between incrementing
the reference[A] and trying to do a compare and swap starting at [B].
Another thread performing the same accesses between A and B will cause
the cache line to bounce between the two cores and also fail the ldxr,
restarting the loop.
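To see this in isolation, here is a minimal standalone sketch of the
same pattern (an illustration, not the actual glib code): two threads
incrementing the same volatile 15-bit bit field via compare-and-swap on
the containing word.

#include <pthread.h>
#include <stdio.h>

/* Mimics the troublesome layout: a volatile 15-bit ref_count sharing a
   word with other bit fields, updated by CAS on the whole word. */
typedef union {
  struct {
    volatile unsigned int ref_count : 15;
    volatile unsigned int flags     : 17;
  } f;
  unsigned int word;
} Shared;

static Shared shared;

static void *
hammer (void *arg)
{
  int i;
  for (i = 0; i < 10000000; i++)
    {
      Shared tmp;
      unsigned int old_word;
      do
        {
          tmp.word = old_word = shared.word;  /* snapshot, as at [A] */
          tmp.f.ref_count += 1;               /* volatile bitfield update */
        }
      while (!__sync_bool_compare_and_swap (&shared.word,   /* CAS, as at [B] */
                                            old_word, tmp.word));
    }
  return arg;
}

int
main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, hammer, NULL);
  pthread_create (&t2, NULL, hammer, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  /* ref_count wraps at 2^15; the interesting output is the run time
     and the perf profile, not the final value. */
  printf ("ref_count = %u\n", (unsigned int) shared.f.ref_count);
  return 0;
}

Built with gcc -O2 -pthread, profiling this should show the same
ldxr/dmb hotspots when the two threads land on separate cores.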
Aside: The x86 assembly for ref/unref also has duplicate loads, but x86
doesn't seem to mind them much.
On further inspection, I noticed the following definition of the
GClosure structure:
struct _GClosure
{
  /*< private >*/
  volatile guint ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical
     reasons as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile guint meta_marshal_nouse : 1;
  volatile guint n_guards : 1;
  volatile guint n_fnotifiers : 2;  /* finalization notifiers */
  volatile guint n_inotifiers : 8;  /* invalidation notifiers */
  volatile guint in_inotify : 1;
  volatile guint floating : 1;
  /*< protected >*/
  volatile guint derivative_flag : 1;
  /*< public >*/
  volatile guint in_marshal : 1;
  volatile guint is_invalid : 1;
  ...
};
Specifically, the ref_count is defined as a volatile bit field. That,
coupled with the implementation of the increment, explains the bad
assembly sequence generated for g_closure_ref()/g_closure_unref().
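For reference, the increment boils down to something like this (a
paraphrase of the INC_ASSIGN/CHANGE_FIELD machinery in
gobject/gclosure.c, not the literal source):

#include <glib-object.h>

typedef union {
  GClosure closure;
  gint     vint;
} ClosureInt;

static guint
closure_ref_count_inc (GClosure *closure)
{
  ClosureInt *cunion = (ClosureInt *) closure;
  gint old_int, new_int;
  guint new_ref_count;

  /* Snapshot the word holding the bit fields, bump ref_count in the
     local copy, then try to publish the copy with a compare-and-swap,
     retrying if another thread got there first. */
  do
    {
      ClosureInt tmp;

      tmp.vint = old_int = cunion->vint;
      tmp.closure.ref_count += 1;            /* the add + bfxil above */
      new_ref_count = tmp.closure.ref_count;
      new_int = tmp.vint;
    }
  while (!g_atomic_int_compare_and_exchange (&cunion->vint, old_int, new_int));

  return new_ref_count;
}

Because every bit field in tmp (a local variable!) is volatile, the
compiler is forced to keep it in its stack slot - [x29, #32] above -
and reload it on every access, which is where the redundant loads come
from.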
AFAICT, there is no need for the ref_count (or the other bit fields in
the structure) to be volatile, especially as they seem to be updated
with gcc __sync builtins.
I have a few ideas on how to improve the code by re-factoring
INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in
gobject/gclosure.c. I'll give those a shot and see if they lead to an
improvement in performance. Happy to discuss any suggestions or try
patches and report back.
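As a rough sketch of the direction I have in mind (untested, and
assuming the usual little-endian GCC layout where ref_count occupies
the low 15 bits of the first word):

static guint
closure_ref_count_inc_fast (GClosure *closure)
{
  gint *word = (gint *) closure;   /* the word containing the bit fields */
  guint old_u, new_u;

  /* Keep the snapshot in plain, non-volatile locals so the compiler
     can hold them in registers, shrinking the window between the
     increment and the compare-and-swap. */
  do
    {
      old_u = (guint) *word;
      new_u = (old_u & ~0x7fffu) | ((old_u + 1) & 0x7fffu);
    }
  while (!g_atomic_int_compare_and_exchange (word, (gint) old_u, (gint) new_u));

  return new_u & 0x7fff;
}

This keeps the wrap-around semantics of the 15-bit field and leaves
the other flags untouched, but only measurements will tell whether it
actually helps.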
It might also be worth raising these findings with upstream. I am not
at all familiar with glib, and they might be able to help explain the
code structure and have suggestions for improvement.
Since I hadn't dug into this enough... ;)
I also ran a few iterations of the test on some other ARM64 platforms,
Juno R2 and Juno R0 (this one is the same as arm-arm-01). On the R2
the test always finished in a couple of minutes. But on Juno R0, it
sometimes finished in a couple of minutes and sometimes got tied up in
knots, still running after 15 minutes (at which point I terminated the
test). After
some experiments with restricting the cores the test ran on (using
taskset), I concluded that when the test ran on the A57s (incidentally
the same cores as on the AMD Seattle system) it took a long time but
not when it was running on the A53s.
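For example (core numbering is board-specific; 0,1 here is just
illustrative):

$ taskset -c 0,1 tests/refcount/closures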
So it seems the micro-architectural behaviour combined with the poor
code sequence makes this especially painful on some cores, while
others chug through without breaking a sweat.