[Pkg-zfsonlinux-devel] Bug#983331: zfs-linux: zfs_zrele_async can cause txg sync deadlocks

Mon Feb 22 13:32:54 GMT 2021

Source: zfs-linux
Severity: important

Dear Maintainer,

For certain ZFS workloads, we can see hung task timeouts in the kernel logs due
to a transaction group deadlock. Userspace process will hang and display stack
traces similar to the one below:
[49181.619711] clnt_server     D    0 21699  28868 0x00000320
[49181.619715] Call Trace:
[49181.619725]  __schedule+0x24e/0x880
[49181.619730]  schedule+0x2c/0x80
[49181.619750]  cv_wait_common+0x11e/0x140 [spl]
[49181.619763]  ? wait_woken+0x80/0x80
[49181.619775]  __cv_wait+0x15/0x20 [spl]
[49181.619872]  zil_commit.part.14+0x80/0x8c0 [zfs]
[49181.619884]  ? _cond_resched+0x19/0x40
[49181.619887]  ? mutex_lock+0x12/0x40
[49181.619959]  zil_commit+0x17/0x20 [zfs]
[49181.620026]  zfs_fsync+0x77/0xe0 [zfs]
[49181.620093]  zpl_fsync+0x68/0xa0 [zfs]
[49181.620100]  vfs_fsync_range+0x51/0xb0
[49181.620105]  do_fsync+0x3d/0x70
[49181.620109]  SyS_fsync+0x10/0x20
[49181.620114]  do_syscall_64+0x73/0x130
[49181.620119]  entry_SYSCALL_64_after_hwframe+0x41/0xa6

We also might see a kworker thread blocking in the zfs writeback/evict path:
[49181.881570] kworker/u17:3   D    0  4915      2 0x80000000
[49181.881576] Workqueue: writeback wb_workfn (flush-zfs-10)
[49181.881577] Call Trace:
[49181.881580]  __schedule+0x24e/0x880
[49181.881582]  ? atomic_t_wait+0x60/0x60
[49181.881584]  schedule+0x2c/0x80
[49181.881588]  bit_wait+0x11/0x60
[49181.881592]  __wait_on_bit+0x4c/0x90
[49181.881596]  ? atomic_t_wait+0x60/0x60
[49181.881599]  __inode_wait_for_writeback+0xb9/0xf0
[49181.881601]  ? bit_waitqueue+0x40/0x40
[49181.881605]  inode_wait_for_writeback+0x26/0x40
[49181.881609]  evict+0xb5/0x1a0
[49181.881611]  iput+0x19c/0x230
[49181.881648]  zfs_iput_async+0x1d/0x80 [zfs]
[49181.881682]  zfs_get_data+0x1d4/0x2a0 [zfs]
[49181.881718]  zil_commit.part.14+0x640/0x8c0 [zfs]
[49181.881752]  zil_commit+0x17/0x20 [zfs]
[49181.881784]  zpl_writepages+0xd5/0x160 [zfs]
[49181.881787]  do_writepages+0x4b/0xe0
[49181.881790]  __writeback_single_inode+0x45/0x350
[49181.881792]  ? __writeback_single_inode+0x45/0x350
[49181.881794]  writeback_sb_inodes+0x1d7/0x530
[49181.881796]  wb_writeback+0xfb/0x300
[49181.881799]  wb_workfn+0xad/0x400
[49181.881800]  ? wb_workfn+0xad/0x400
[49181.881803]  ? __switch_to_asm+0x35/0x70
[49181.881809]  process_one_work+0x1de/0x420
[49181.881811]  worker_thread+0x32/0x410
[49181.881813]  kthread+0x121/0x140
[49181.881815]  ? process_one_work+0x420/0x420
[49181.881817]  ? kthread_create_worker_on_cpu+0x70/0x70
[49181.881819]  ret_from_fork+0x35/0x40

This is caused by a race between ZFS writeback and evict threads, usually during
a transaction group sync operation. It's possible to have two iput() threads
racing for the same inode: one of them scheduled async and the other executed
synchronously as part of the writeback path. If the writeback thread tries to
evict the inode while the async thread is running, it might re-enter the block
layer for the same inode due to ZFS counters being in an inconsistent
state. This then causes the kworker thread to stall the writeback, which in turn
prevents the transaction group sync to complete and locks other ZFS threads.

This is fixed by the upstream commit:
- Fix zrele race in zrele_async that can cause hang (2921ad6cba54) [0]

[0] https://github.com/openzfs/zfs/pull/11530