
[PATCHv2,1/1] xfs: Add cond_resched in xfs_bunmapi_range loop

Message ID f7d3db235a2c7e16681a323a99bb0ce50a92296a.1714033516.git.ritesh.list@gmail.com (mailing list archive)
State Accepted, archived
Series xfs: soft lockups while unmapping large no. of extents

Commit Message

Ritesh Harjani (IBM) April 25, 2024, 8:35 a.m. UTC
An async dio write to a sparse file can generate a lot of extents
and when we unlink this file (using rm), the kernel can be busy
unmapping and freeing those extents as part of transaction processing.
Add cond_resched() in xfs_bunmapi_range() to avoid soft lockup
messages like these. Here is a call trace of such a soft lockup.

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
Call Trace:
  xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
  xfs_alloc_compute_aligned+0x5c/0x144
  xfs_alloc_ag_vextent_size+0x238/0x8d4
  xfs_alloc_fix_freelist+0x540/0x694
  xfs_free_extent_fix_freelist+0x84/0xe0
  __xfs_free_extent+0x74/0x1ec
  xfs_extent_free_finish_item+0xcc/0x214
  xfs_defer_finish_one+0x194/0x388
  xfs_defer_finish_noroll+0x1b4/0x5c8
  xfs_defer_finish+0x2c/0xc4
  xfs_bunmapi_range+0xa4/0x100
  xfs_itruncate_extents_flags+0x1b8/0x2f4
  xfs_inactive_truncate+0xe0/0x124
  xfs_inactive+0x30c/0x3e0
  xfs_inodegc_worker+0x140/0x234
  process_scheduled_works+0x240/0x57c
  worker_thread+0x198/0x468
  kthread+0x138/0x140
  start_kernel_thread+0x14/0x18

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 1 +
 1 file changed, 1 insertion(+)

--
2.44.0
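
For context, a rough sketch of the loop the one-line change sits in
(abridged from fs/xfs/libxfs/xfs_bmap.c; the argument list and helper
calls are paraphrased from memory and only approximate -- the
authoritative change is the diff at the bottom of this page):

int
xfs_bunmapi_range(
        struct xfs_trans        **tpp,
        struct xfs_inode        *ip,
        uint32_t                flags,
        xfs_fileoff_t           startoff,
        xfs_fileoff_t           endoff)
{
        xfs_filblks_t           unmap_len = endoff - startoff + 1;
        int                     error = 0;

        while (unmap_len > 0) {
                /* unmap a bounded batch of extents per iteration */
                error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len, flags,
                                XFS_ITRUNC_MAX_EXTENTS);
                if (error)
                        goto out;

                /* free the just-unmapped extents, rolling the transaction */
                error = xfs_defer_finish(tpp);
                if (error)
                        goto out;
                cond_resched();         /* new: give the scheduler a chance */
        }
out:
        return error;
}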

Comments

Christoph Hellwig April 25, 2024, 12:18 p.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

(and I can't wait for the day that preempt models are consolidated and
cond_resched() is gone..)
Ritesh Harjani (IBM) April 29, 2024, 8:44 a.m. UTC | #2
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:

> An async dio write to a sparse file can generate a lot of extents
> and when we unlink this file (using rm), the kernel can be busy
> unmapping and freeing those extents as part of transaction processing.
> Add cond_resched() in xfs_bunmapi_range() to avoid soft lockup
> messages like these. Here is a call trace of such a soft lockup.
>
> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
> CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
> Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
> NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
> LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
> Call Trace:
>   xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
>   xfs_alloc_compute_aligned+0x5c/0x144
>   xfs_alloc_ag_vextent_size+0x238/0x8d4
>   xfs_alloc_fix_freelist+0x540/0x694
>   xfs_free_extent_fix_freelist+0x84/0xe0
>   __xfs_free_extent+0x74/0x1ec
>   xfs_extent_free_finish_item+0xcc/0x214
>   xfs_defer_finish_one+0x194/0x388
>   xfs_defer_finish_noroll+0x1b4/0x5c8
>   xfs_defer_finish+0x2c/0xc4
>   xfs_bunmapi_range+0xa4/0x100
>   xfs_itruncate_extents_flags+0x1b8/0x2f4
>   xfs_inactive_truncate+0xe0/0x124
>   xfs_inactive+0x30c/0x3e0
>   xfs_inodegc_worker+0x140/0x234
>   process_scheduled_works+0x240/0x57c
>   worker_thread+0x198/0x468
>   kthread+0x138/0x140
>   start_kernel_thread+0x14/0x18
>

My v1 patch had cond_resched() in xfs_defer_finish_noroll(), since I
suspected it is a common point where we loop for many other operations,
and Dave had initially suggested the same [1]. But I was not fully
convinced, given that the only problematic path I had hit so far was
unmapping extents. So this patch keeps the cond_resched() in the
xfs_bunmapi_range() loop.

[1]: https://lore.kernel.org/all/ZZ8OaNnp6b%2FPJzsb@dread.disaster.area/
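
For illustration, an abridged sketch of where the v1 cond_resched() sat
(the body of xfs_defer_finish_noroll() is paraphrased here, so names and
structure are approximate rather than a faithful copy):

int
xfs_defer_finish_noroll(
        struct xfs_trans        **tp)
{
        struct xfs_defer_pending        *dfp;
        int                             error = 0;
        LIST_HEAD(dop_pending);

        /* Until we run out of pending deferred work... */
        while (!list_empty(&dop_pending) || !list_empty(&(*tp)->t_dfops)) {
                /* log intent items and move new work to the pending list */
                xfs_defer_create_intents(*tp);
                list_splice_init(&(*tp)->t_dfops, &dop_pending);

                /* roll to a clean transaction, then finish one item */
                error = xfs_defer_trans_roll(tp);
                if (error)
                        break;

                dfp = list_first_entry(&dop_pending, struct xfs_defer_pending,
                                dfp_list);
                error = xfs_defer_finish_one(*tp, dfp);
                if (error && error != -EAGAIN)
                        break;

                cond_resched();         /* v1: between deferred work items */
        }
        return error;
}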

However, I was able to reproduce a problem in the reflink remapping path
both on Power (with 64k bs) and on x86 (with preempt=none and with KASAN
enabled). I actually noticed it while doing regression testing of some
of the iomap changes with KASAN enabled. The issue was seen with
generic/175 on both Power and x86.

Do you think we should keep the cond_resched() inside the
xfs_defer_finish_noroll() loop like we had in v1 [2]? If yes, I can
rebase v1 on the latest upstream tree and also update the commit message
with both call stacks.

[2]: https://lore.kernel.org/all/0bfaf740a2d10cc846616ae05963491316850c52.1713674899.git.ritesh.list@gmail.com/


<call stack on Power>
======================
 run fstests generic/175 at 2024-02-02 04:40:21
 <...>
[   C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
 watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
 CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G               X     6.4.0-150600.5-default #1
 NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs]
 LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs]
 Call Trace:
  0xc000000014107c00 (unreliable)
  __xfs_btree_updkeys+0x8c/0x2c0 [xfs]
  xfs_btree_update_keys+0x150/0x170 [xfs]
  xfs_btree_lshift+0x534/0x660 [xfs]
  xfs_btree_make_block_unfull+0x19c/0x240 [xfs]
  xfs_btree_insrec+0x4e4/0x630 [xfs]
  xfs_btree_insert+0x104/0x2d0 [xfs]
  xfs_rmap_insert+0xc4/0x260 [xfs]
  xfs_rmap_map_shared+0x228/0x630 [xfs]
  xfs_rmap_finish_one+0x2d4/0x350 [xfs]
  xfs_rmap_update_finish_item+0x44/0xc0 [xfs]
  xfs_defer_finish_noroll+0x2e4/0x740 [xfs]
  __xfs_trans_commit+0x1f4/0x400 [xfs]
  xfs_reflink_remap_extent+0x2d8/0x650 [xfs]
  xfs_reflink_remap_blocks+0x154/0x320 [xfs]
  xfs_file_remap_range+0x138/0x3a0 [xfs]
  do_clone_file_range+0x11c/0x2f0
  vfs_clone_file_range+0x60/0x1c0
  ioctl_file_clone+0x78/0x140
  sys_ioctl+0x934/0x1270
  system_call_exception+0x158/0x320
  system_call_vectored_common+0x15c/0x2ec


<call stack on x86 with KASAN>
===============================
 watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [xfs_io:3438095]
 CPU: 6 PID: 3438095 Comm: xfs_io Not tainted 6.9.0-rc5-xfstests-perf-00008-g4e2752e99f55 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.o4
 RIP: 0010:_raw_spin_unlock_irqrestore+0x3c/0x60
 Code: 10 48 89 fb 48 83 c7 18 e8 31 7c 10 fd 48 89 df e8 79 f5 10 fd f7 c5 00 02 00 00 74 06 e8 0c 6
 Call Trace:
  <IRQ>
  ? watchdog_timer_fn+0x2dc/0x3a0
  ? __pfx_watchdog_timer_fn+0x10/0x10
  ? __hrtimer_run_queues+0x4a1/0x870
  ? __pfx___hrtimer_run_queues+0x10/0x10
  ? kvm_clock_get_cycles+0x18/0x30
  ? ktime_get_update_offsets_now+0xc6/0x2f0
  ? hrtimer_interrupt+0x2b8/0x7a0
  ? __sysvec_apic_timer_interrupt+0xca/0x390
  ? sysvec_apic_timer_interrupt+0x65/0x80
  </IRQ>
  <TASK>
  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  ? _raw_spin_unlock_irqrestore+0x34/0x60
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? _raw_spin_unlock_irqrestore+0x34/0x60
  get_partial_node.part.0+0x1af/0x340
  ___slab_alloc+0xc07/0x1250
  ? do_vfs_ioctl+0xe5c/0x1660
  ? __x64_sys_ioctl+0xd5/0x1b0
 ? __alloc_object+0x39/0x660
 ? __pfx___might_resched+0x10/0x10
 ? __alloc_object+0x39/0x660
 ? kmem_cache_alloc+0x3cd/0x410
 ? should_failslab+0xe/0x20
 kmem_cache_alloc+0x3cd/0x410
 __alloc_object+0x39/0x660
 __create_object+0x22/0x90
 kmem_cache_alloc+0x324/0x410
 xfs_bui_init+0x1b/0x150
 xfs_bmap_update_create_intent+0x48/0x110
 ? __pfx_xfs_bmap_update_create_intent+0x10/0x10
 xfs_defer_create_intent+0xcc/0x1b0
 xfs_defer_create_intents+0x8f/0x230
 xfs_defer_finish_noroll+0x1c0/0x1160
 ? xfs_inode_item_precommit+0x2c1/0x880
 ? __create_object+0x5e/0x90
 ? __pfx_xfs_defer_finish_noroll+0x10/0x10
 ? xfs_trans_run_precommits+0x126/0x200
 __xfs_trans_commit+0x767/0xbe0
 ? inode_maybe_inc_iversion+0xe2/0x150
 ? __pfx___xfs_trans_commit+0x10/0x10
 xfs_reflink_remap_extent+0x654/0xd40
 ? __pfx_xfs_reflink_remap_extent+0x10/0x10
 ? __pfx_down_read_nested+0x10/0x10
 xfs_reflink_remap_blocks+0x21a/0x850
 ? __pfx_xfs_reflink_remap_blocks+0x10/0x10
 ? _raw_spin_unlock+0x23/0x40
 ? xfs_reflink_remap_prep+0x47d/0x900
 xfs_file_remap_range+0x296/0xb40
 ? __pfx_xfs_file_remap_range+0x10/0x10
 ? __pfx_lock_acquire+0x10/0x10
 ? __pfx___might_resched+0x10/0x10
 vfs_clone_file_range+0x260/0xc20
 ioctl_file_clone+0x49/0xb0
 do_vfs_ioctl+0xe5c/0x1660
 ? __pfx_do_vfs_ioctl+0x10/0x10
 ? trace_irq_enable.constprop.0+0xd2/0x110
 ? kasan_quarantine_put+0x7e/0x1d0
 ? do_sys_openat2+0x120/0x170
 ? lock_acquire+0x43b/0x4f0
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_do_sys_openat2+0x10/0x10
 ? __do_sys_newfstatat+0x94/0xe0
 ? __fget_files+0x1ce/0x330
 __x64_sys_ioctl+0xd5/0x1b0
 do_syscall_64+0x6a/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7ffff7d1a94f
 </TASK>


-ritesh



> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 656c95a22f2e..44d5381bc66f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -6354,6 +6354,7 @@ xfs_bunmapi_range(
>  		error = xfs_defer_finish(tpp);
>  		if (error)
>  			goto out;
> +		cond_resched();
>  	}
>  out:
>  	return error;
> --
> 2.44.0
Darrick J. Wong April 29, 2024, 3:11 p.m. UTC | #3
On Mon, Apr 29, 2024 at 02:14:46PM +0530, Ritesh Harjani wrote:
> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
> 
> > An async dio write to a sparse file can generate a lot of extents
> > and when we unlink this file (using rm), the kernel can be busy
> > unmapping and freeing those extents as part of transaction processing.
> > Add cond_resched() in xfs_bunmapi_range() to avoid soft lockup
> > messages like these. Here is a call trace of such a soft lockup.
> >
> > watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
> > CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
> > Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
> > NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
> > LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
> > Call Trace:
> >   xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
> >   xfs_alloc_compute_aligned+0x5c/0x144
> >   xfs_alloc_ag_vextent_size+0x238/0x8d4
> >   xfs_alloc_fix_freelist+0x540/0x694
> >   xfs_free_extent_fix_freelist+0x84/0xe0
> >   __xfs_free_extent+0x74/0x1ec
> >   xfs_extent_free_finish_item+0xcc/0x214
> >   xfs_defer_finish_one+0x194/0x388
> >   xfs_defer_finish_noroll+0x1b4/0x5c8
> >   xfs_defer_finish+0x2c/0xc4
> >   xfs_bunmapi_range+0xa4/0x100
> >   xfs_itruncate_extents_flags+0x1b8/0x2f4
> >   xfs_inactive_truncate+0xe0/0x124
> >   xfs_inactive+0x30c/0x3e0
> >   xfs_inodegc_worker+0x140/0x234
> >   process_scheduled_works+0x240/0x57c
> >   worker_thread+0x198/0x468
> >   kthread+0x138/0x140
> >   start_kernel_thread+0x14/0x18
> >
> 
> My v1 patch had cond_resched() in xfs_defer_finish_noroll(), since I
> suspected it is a common point where we loop for many other operations,
> and Dave had initially suggested the same [1]. But I was not fully
> convinced, given that the only problematic path I had hit so far was
> unmapping extents. So this patch keeps the cond_resched() in the
> xfs_bunmapi_range() loop.
> 
> [1]: https://lore.kernel.org/all/ZZ8OaNnp6b%2FPJzsb@dread.disaster.area/
> 
> However, I was able to reproduce a problem in the reflink remapping path
> both on Power (with 64k bs) and on x86 (with preempt=none and with KASAN
> enabled). I actually noticed it while doing regression testing of some
> of the iomap changes with KASAN enabled. The issue was seen with
> generic/175 on both Power and x86.
> 
> Do you think we should keep the cond_resched() inside the
> xfs_defer_finish_noroll() loop like we had in v1 [2]? If yes, I can
> rebase v1 on the latest upstream tree and also update the commit message
> with both call stacks.

I think there ought to be one in xfs_defer_finish_noroll (or even
xfs_trans_roll) when we're between dirty transactions, since long
running transaction chains can indeed stall.  That'll hopefully solve
the problem for bunmapi and long running online repair operations.

But that said, the main FICLONE loop continually creates new transaction
chains, so you probably need one there too.  Hence the grumblings about
cond_resched whackamole.
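
For illustration, a paraphrased sketch of the FICLONE-loop suggestion
(the function name below is hypothetical and its signature only
approximates the shape of xfs_reflink_remap_blocks(); this is not a
proposed patch):

static int
xfs_reflink_remap_blocks_sketch(
        struct xfs_inode        *src,
        xfs_fileoff_t           srcoff,
        struct xfs_inode        *dest,
        xfs_fileoff_t           destoff,
        xfs_filblks_t           len)
{
        struct xfs_bmbt_irec    imap;
        int                     nimaps;
        int                     error = 0;

        while (len > 0) {
                /* read the next source mapping */
                nimaps = 1;
                error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
                if (error)
                        break;

                /* remap this extent; it commits its own transaction chain */
                error = xfs_reflink_remap_extent(dest, &imap, destoff);
                if (error)
                        break;

                srcoff += imap.br_blockcount;
                destoff += imap.br_blockcount;
                len -= imap.br_blockcount;

                cond_resched();         /* between per-extent chains */
        }
        return error;
}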

Thoughts?

--D

> [2]: https://lore.kernel.org/all/0bfaf740a2d10cc846616ae05963491316850c52.1713674899.git.ritesh.list@gmail.com/
> 
> 
> <call stack on Power>
> ======================
>  run fstests generic/175 at 2024-02-02 04:40:21
>  <...>
> [   C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
>  watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
>  CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G               X     6.4.0-150600.5-default #1
>  NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs]
>  LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs]
>  Call Trace:
>   0xc000000014107c00 (unreliable)
>   __xfs_btree_updkeys+0x8c/0x2c0 [xfs]
>   xfs_btree_update_keys+0x150/0x170 [xfs]
>   xfs_btree_lshift+0x534/0x660 [xfs]
>   xfs_btree_make_block_unfull+0x19c/0x240 [xfs]
>   xfs_btree_insrec+0x4e4/0x630 [xfs]
>   xfs_btree_insert+0x104/0x2d0 [xfs]
>   xfs_rmap_insert+0xc4/0x260 [xfs]
>   xfs_rmap_map_shared+0x228/0x630 [xfs]
>   xfs_rmap_finish_one+0x2d4/0x350 [xfs]
>   xfs_rmap_update_finish_item+0x44/0xc0 [xfs]
>   xfs_defer_finish_noroll+0x2e4/0x740 [xfs]
>   __xfs_trans_commit+0x1f4/0x400 [xfs]
>   xfs_reflink_remap_extent+0x2d8/0x650 [xfs]
>   xfs_reflink_remap_blocks+0x154/0x320 [xfs]
>   xfs_file_remap_range+0x138/0x3a0 [xfs]
>   do_clone_file_range+0x11c/0x2f0
>   vfs_clone_file_range+0x60/0x1c0
>   ioctl_file_clone+0x78/0x140
>   sys_ioctl+0x934/0x1270
>   system_call_exception+0x158/0x320
>   system_call_vectored_common+0x15c/0x2ec
> 
> 
> <call stack on x86 with KASAN>
> ===============================
>  watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [xfs_io:3438095]
>  CPU: 6 PID: 3438095 Comm: xfs_io Not tainted 6.9.0-rc5-xfstests-perf-00008-g4e2752e99f55 #1
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.o4
>  RIP: 0010:_raw_spin_unlock_irqrestore+0x3c/0x60
>  Code: 10 48 89 fb 48 83 c7 18 e8 31 7c 10 fd 48 89 df e8 79 f5 10 fd f7 c5 00 02 00 00 74 06 e8 0c 6
>  Call Trace:
>   <IRQ>
>   ? watchdog_timer_fn+0x2dc/0x3a0
>   ? __pfx_watchdog_timer_fn+0x10/0x10
>   ? __hrtimer_run_queues+0x4a1/0x870
>   ? __pfx___hrtimer_run_queues+0x10/0x10
>   ? kvm_clock_get_cycles+0x18/0x30
>   ? ktime_get_update_offsets_now+0xc6/0x2f0
>   ? hrtimer_interrupt+0x2b8/0x7a0
>   ? __sysvec_apic_timer_interrupt+0xca/0x390
>   ? sysvec_apic_timer_interrupt+0x65/0x80
>   </IRQ>
>   <TASK>
>   ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>   ? _raw_spin_unlock_irqrestore+0x34/0x60
>   ? _raw_spin_unlock_irqrestore+0x3c/0x60
>   ? _raw_spin_unlock_irqrestore+0x34/0x60
>   get_partial_node.part.0+0x1af/0x340
>   ___slab_alloc+0xc07/0x1250
>   ? do_vfs_ioctl+0xe5c/0x1660
>   ? __x64_sys_ioctl+0xd5/0x1b0
>  ? __alloc_object+0x39/0x660
>  ? __pfx___might_resched+0x10/0x10
>  ? __alloc_object+0x39/0x660
>  ? kmem_cache_alloc+0x3cd/0x410
>  ? should_failslab+0xe/0x20
>  kmem_cache_alloc+0x3cd/0x410
>  __alloc_object+0x39/0x660
>  __create_object+0x22/0x90
>  kmem_cache_alloc+0x324/0x410
>  xfs_bui_init+0x1b/0x150
>  xfs_bmap_update_create_intent+0x48/0x110
>  ? __pfx_xfs_bmap_update_create_intent+0x10/0x10
>  xfs_defer_create_intent+0xcc/0x1b0
>  xfs_defer_create_intents+0x8f/0x230
>  xfs_defer_finish_noroll+0x1c0/0x1160
>  ? xfs_inode_item_precommit+0x2c1/0x880
>  ? __create_object+0x5e/0x90
>  ? __pfx_xfs_defer_finish_noroll+0x10/0x10
>  ? xfs_trans_run_precommits+0x126/0x200
>  __xfs_trans_commit+0x767/0xbe0
>  ? inode_maybe_inc_iversion+0xe2/0x150
>  ? __pfx___xfs_trans_commit+0x10/0x10
>  xfs_reflink_remap_extent+0x654/0xd40
>  ? __pfx_xfs_reflink_remap_extent+0x10/0x10
>  ? __pfx_down_read_nested+0x10/0x10
>  xfs_reflink_remap_blocks+0x21a/0x850
>  ? __pfx_xfs_reflink_remap_blocks+0x10/0x10
>  ? _raw_spin_unlock+0x23/0x40
>  ? xfs_reflink_remap_prep+0x47d/0x900
>  xfs_file_remap_range+0x296/0xb40
>  ? __pfx_xfs_file_remap_range+0x10/0x10
>  ? __pfx_lock_acquire+0x10/0x10
>  ? __pfx___might_resched+0x10/0x10
>  vfs_clone_file_range+0x260/0xc20
>  ioctl_file_clone+0x49/0xb0
>  do_vfs_ioctl+0xe5c/0x1660
>  ? __pfx_do_vfs_ioctl+0x10/0x10
>  ? trace_irq_enable.constprop.0+0xd2/0x110
>  ? kasan_quarantine_put+0x7e/0x1d0
>  ? do_sys_openat2+0x120/0x170
>  ? lock_acquire+0x43b/0x4f0
>  ? __pfx_lock_release+0x10/0x10
>  ? __pfx_do_sys_openat2+0x10/0x10
>  ? __do_sys_newfstatat+0x94/0xe0
>  ? __fget_files+0x1ce/0x330
>  __x64_sys_ioctl+0xd5/0x1b0
>  do_syscall_64+0x6a/0x140
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7ffff7d1a94f
>  </TASK>
> 
> 
> -ritesh
> 
> 
> 
> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 656c95a22f2e..44d5381bc66f 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -6354,6 +6354,7 @@ xfs_bunmapi_range(
> >  		error = xfs_defer_finish(tpp);
> >  		if (error)
> >  			goto out;
> > +		cond_resched();
> >  	}
> >  out:
> >  	return error;
> > --
> > 2.44.0
>
Ritesh Harjani (IBM) April 30, 2024, 5:04 a.m. UTC | #4
"Darrick J. Wong" <djwong@kernel.org> writes:

> On Mon, Apr 29, 2024 at 02:14:46PM +0530, Ritesh Harjani wrote:
>> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
>> 
>> > An async dio write to a sparse file can generate a lot of extents
>> > and when we unlink this file (using rm), the kernel can be busy
>> > unmapping and freeing those extents as part of transaction processing.
>> > Add cond_resched() in xfs_bunmapi_range() to avoid soft lockup
>> > messages like these. Here is a call trace of such a soft lockup.
>> >
>> > watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
>> > CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
>> > Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
>> > NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
>> > LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
>> > Call Trace:
>> >   xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
>> >   xfs_alloc_compute_aligned+0x5c/0x144
>> >   xfs_alloc_ag_vextent_size+0x238/0x8d4
>> >   xfs_alloc_fix_freelist+0x540/0x694
>> >   xfs_free_extent_fix_freelist+0x84/0xe0
>> >   __xfs_free_extent+0x74/0x1ec
>> >   xfs_extent_free_finish_item+0xcc/0x214
>> >   xfs_defer_finish_one+0x194/0x388
>> >   xfs_defer_finish_noroll+0x1b4/0x5c8
>> >   xfs_defer_finish+0x2c/0xc4
>> >   xfs_bunmapi_range+0xa4/0x100
>> >   xfs_itruncate_extents_flags+0x1b8/0x2f4
>> >   xfs_inactive_truncate+0xe0/0x124
>> >   xfs_inactive+0x30c/0x3e0
>> >   xfs_inodegc_worker+0x140/0x234
>> >   process_scheduled_works+0x240/0x57c
>> >   worker_thread+0x198/0x468
>> >   kthread+0x138/0x140
>> >   start_kernel_thread+0x14/0x18
>> >
>> 
>> My v1 patch had cond_resched() in xfs_defer_finish_noroll(), since I
>> suspected it is a common point where we loop for many other operations,
>> and Dave had initially suggested the same [1]. But I was not fully
>> convinced, given that the only problematic path I had hit so far was
>> unmapping extents. So this patch keeps the cond_resched() in the
>> xfs_bunmapi_range() loop.
>> 
>> [1]: https://lore.kernel.org/all/ZZ8OaNnp6b%2FPJzsb@dread.disaster.area/
>> 
>> However, I was able to reproduce a problem in the reflink remapping path
>> both on Power (with 64k bs) and on x86 (with preempt=none and with KASAN
>> enabled). I actually noticed it while doing regression testing of some
>> of the iomap changes with KASAN enabled. The issue was seen with
>> generic/175 on both Power and x86.
>> 
>> Do you think we should keep the cond_resched() inside the
>> xfs_defer_finish_noroll() loop like we had in v1 [2]? If yes, I can
>> rebase v1 on the latest upstream tree and also update the commit message
>> with both call stacks.
>
> I think there ought to be one in xfs_defer_finish_noroll (or even
> xfs_trans_roll) when we're between dirty transactions, since long
> running transaction chains can indeed stall.  That'll hopefully solve
> the problem for bunmapi and long running online repair operations.
>
> But that said, the main FICLONE loop continually creates new transaction
> chains, so you probably need one there too.

We shouldn't need one there, right? As long as we have one in
xfs_defer_finish_noroll(), because we commit everything before returning
from xfs_reflink_remap_extent(). And we do have XFS_TRANS_PERM_LOG_RES
set, so xfs_defer_finish_noroll() should always be called.
(Apologies if I am not making sense; I am new to the XFS code.)
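
To illustrate that reasoning, a rough, paraphrased fragment (the
function name below is made up and the conditions are simplified; it is
not a verbatim copy of __xfs_trans_commit()):

/*
 * A transaction holding a permanent log reservation finishes its
 * deferred work through xfs_defer_finish_noroll() at commit time, so a
 * cond_resched() inside that loop also covers the per-extent commits
 * done by the reflink remap path.
 */
static int
xfs_trans_commit_sketch(
        struct xfs_trans        *tp)
{
        int                     error;

        if (tp->t_flags & XFS_TRANS_PERM_LOG_RES) {
                error = xfs_defer_finish_noroll(&tp);
                if (error)
                        return error;   /* the real code also unreserves tp */
        }

        /* ... format log items and commit the transaction to the CIL ... */
        return 0;
}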

I gave generic/175 an overnight run with cond_resched() call inside
xfs_defer_finish_noroll() loop. I couldn't reproduce the soft lockup bug.

So does it sound ok to have cond_resched() within
xfs_defer_finish_noroll()? If so, I can submit the new version.

This should solve the soft lockup problems in both unmapping of extents
causing large transaction chains and also reflink remapping path.

> Hence the grumblings about cond_resched whackamole.
> Thoughts?

Yes, I agree that life will be easier once we don't have to resort to
the hack of sprinkling such calls inside tight loops. But until that
infrastructure is ready, I was hoping we could find a correct fix for
these problems for current upstream and older stable kernels.


Thanks for the review and suggestions!

-ritesh

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..44d5381bc66f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6354,6 +6354,7 @@  xfs_bunmapi_range(
 		error = xfs_defer_finish(tpp);
 		if (error)
 			goto out;
+		cond_resched();
 	}
 out:
 	return error;