
xfs: explicitly call cond_resched in xfs_itruncate_extents_flags

Message ID 20240110071347.3711925-1-wenjian1@xiaomi.com (mailing list archive)
State New
Series xfs: explicitly call cond_resched in xfs_itruncate_extents_flags

Commit Message

Jian Wen Jan. 10, 2024, 7:13 a.m. UTC
From: Jian Wen <wenjianhn@gmail.com>

Deleting a file with lots of extents may cause a soft lockup if the
preemption model is none (CONFIG_PREEMPT_NONE=y or preempt=none is set
on the kernel cmdline). The Alibaba Cloud kernel and the Oracle UEK
container kernel are affected by the issue, since they select
CONFIG_PREEMPT_NONE=y.

Explicitly call cond_resched() in xfs_itruncate_extents_flags() to avoid
the softlockup warning below.
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:13:139]
CPU: 0 PID: 139 Comm: kworker/0:13 Not tainted 6.7.0-rc8-g610a9b8f49fb #23
 Workqueue: xfs-inodegc/vda1 xfs_inodegc_worker
 Call Trace:
  _raw_spin_lock+0x30/0x80
  ? xfs_extent_busy_trim+0x38/0x200
  xfs_extent_busy_trim+0x38/0x200
  xfs_alloc_compute_aligned+0x38/0xd0
  xfs_alloc_ag_vextent_size+0x1f1/0x870
  xfs_alloc_fix_freelist+0x58a/0x770
  xfs_free_extent_fix_freelist+0x60/0xa0
  __xfs_free_extent+0x66/0x1d0
  xfs_trans_free_extent+0xac/0x290
  xfs_extent_free_finish_item+0xf/0x40
  xfs_defer_finish_noroll+0x1db/0x7f0
  xfs_defer_finish+0x10/0xa0
  xfs_itruncate_extents_flags+0x169/0x4c0
  xfs_inactive_truncate+0xba/0x140
  xfs_inactive+0x239/0x2a0
  xfs_inodegc_worker+0xa3/0x210
  ? process_scheduled_works+0x273/0x550
  process_scheduled_works+0x2da/0x550
  worker_thread+0x180/0x350

Most Linux distributions default to voluntary preemption, where
might_sleep() yields the CPU if needed, so they are not affected. The
perf probe traces below show __cond_resched() being reached from those
sleep points:
kworker/0:24+xf     298 [000]  7294.810021: probe:__cond_resched:
  __cond_resched+0x5 ([kernel.kallsyms])
  __kmem_cache_alloc_node+0x17c ([kernel.kallsyms])
  __kmalloc+0x4d ([kernel.kallsyms])
  kmem_alloc+0x7a ([kernel.kallsyms])
  xfs_extent_busy_insert_list+0x2e ([kernel.kallsyms])
  __xfs_free_extent+0xd3 ([kernel.kallsyms])
  xfs_trans_free_extent+0x93 ([kernel.kallsyms])
  xfs_extent_free_finish_item+0xf ([kernel.kallsyms])

kworker/0:24+xf     298 [000]  7294.810045: probe:__cond_resched:
  __cond_resched+0x5 ([kernel.kallsyms])
  down+0x11 ([kernel.kallsyms])
  xfs_buf_lock+0x2c ([kernel.kallsyms])
  xfs_buf_find_lock+0x62 ([kernel.kallsyms])
  xfs_buf_get_map+0x1fd ([kernel.kallsyms])
  xfs_buf_read_map+0x51 ([kernel.kallsyms])
  xfs_trans_read_buf_map+0x1c5 ([kernel.kallsyms])
  xfs_btree_read_buf_block.constprop.0+0xa1 ([kernel.kallsyms])
  xfs_btree_lookup_get_block+0x97 ([kernel.kallsyms])
  xfs_btree_lookup+0xc5 ([kernel.kallsyms])
  xfs_alloc_lookup_eq+0x18 ([kernel.kallsyms])
  xfs_free_ag_extent+0x63f ([kernel.kallsyms])
  __xfs_free_extent+0xa7 ([kernel.kallsyms])
  xfs_trans_free_extent+0x93 ([kernel.kallsyms])
  xfs_extent_free_finish_item+0xf ([kernel.kallsyms])
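
The reason voluntary kernels get away with it is how might_sleep() is
wired up per preemption model -- simplified from include/linux/kernel.h
(the exact definitions vary by kernel version, and PREEMPT_DYNAMIC
replaces them with runtime-patched calls):

#ifdef CONFIG_PREEMPT_VOLUNTARY
# define might_resched()	__cond_resched()	/* yields the CPU here */
#else
# define might_resched()	do { } while (0)	/* PREEMPT_NONE: no-op */
#endif

#define might_sleep() \
	do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)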

Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
---
 fs/xfs/xfs_inode.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Darrick J. Wong Jan. 10, 2024, 5:45 p.m. UTC | #1
On Wed, Jan 10, 2024 at 03:13:47PM +0800, Jian Wen wrote:
> From: Jian Wen <wenjianhn@gmail.com>
> 
> Deleting a file with lots of extents may cause a soft lockup if the
> preemption model is none(CONFIG_PREEMPT_NONE=y or preempt=none is set
> in the kernel cmdline). Alibaba cloud kernel and Oracle UEK container
> kernel are affected by the issue, since they select CONFIG_PREEMPT_NONE=y.
> 
> Explicitly call cond_resched in xfs_itruncate_extents_flags avoid
> the below softlockup warning.
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:13:139]

Wowwee, how many extents does this file have, anyway?

> CPU: 0 PID: 139 Comm: kworker/0:13 Not tainted 6.7.0-rc8-g610a9b8f49fb #23
>  Workqueue: xfs-inodegc/vda1 xfs_inodegc_worker
>  Call Trace:
>   _raw_spin_lock+0x30/0x80
>   ? xfs_extent_busy_trim+0x38/0x200
>   xfs_extent_busy_trim+0x38/0x200
>   xfs_alloc_compute_aligned+0x38/0xd0
>   xfs_alloc_ag_vextent_size+0x1f1/0x870
>   xfs_alloc_fix_freelist+0x58a/0x770
>   xfs_free_extent_fix_freelist+0x60/0xa0
>   __xfs_free_extent+0x66/0x1d0
>   xfs_trans_free_extent+0xac/0x290
>   xfs_extent_free_finish_item+0xf/0x40
>   xfs_defer_finish_noroll+0x1db/0x7f0
>   xfs_defer_finish+0x10/0xa0
>   xfs_itruncate_extents_flags+0x169/0x4c0
>   xfs_inactive_truncate+0xba/0x140
>   xfs_inactive+0x239/0x2a0
>   xfs_inodegc_worker+0xa3/0x210
>   ? process_scheduled_works+0x273/0x550
>   process_scheduled_works+0x2da/0x550
>   worker_thread+0x180/0x350
> 
> Most of the Linux distributions default to voluntary preemption,
> might_sleep() would yield CPU if needed. Thus they are not affected.
> kworker/0:24+xf     298 [000]  7294.810021: probe:__cond_resched:
>   __cond_resched+0x5 ([kernel.kallsyms])
>   __kmem_cache_alloc_node+0x17c ([kernel.kallsyms])
>   __kmalloc+0x4d ([kernel.kallsyms])
>   kmem_alloc+0x7a ([kernel.kallsyms])
>   xfs_extent_busy_insert_list+0x2e ([kernel.kallsyms])
>   __xfs_free_extent+0xd3 ([kernel.kallsyms])
>   xfs_trans_free_extent+0x93 ([kernel.kallsyms])
>   xfs_extent_free_finish_item+0xf ([kernel.kallsyms])
> 
> kworker/0:24+xf     298 [000]  7294.810045: probe:__cond_resched:
>   __cond_resched+0x5 ([kernel.kallsyms])
>   down+0x11 ([kernel.kallsyms])
>   xfs_buf_lock+0x2c ([kernel.kallsyms])
>   xfs_buf_find_lock+0x62 ([kernel.kallsyms])
>   xfs_buf_get_map+0x1fd ([kernel.kallsyms])
>   xfs_buf_read_map+0x51 ([kernel.kallsyms])
>   xfs_trans_read_buf_map+0x1c5 ([kernel.kallsyms])
>   xfs_btree_read_buf_block.constprop.0+0xa1 ([kernel.kallsyms])
>   xfs_btree_lookup_get_block+0x97 ([kernel.kallsyms])
>   xfs_btree_lookup+0xc5 ([kernel.kallsyms])
>   xfs_alloc_lookup_eq+0x18 ([kernel.kallsyms])
>   xfs_free_ag_extent+0x63f ([kernel.kallsyms])
>   __xfs_free_extent+0xa7 ([kernel.kallsyms])
>   xfs_trans_free_extent+0x93 ([kernel.kallsyms])
>   xfs_extent_free_finish_item+0xf ([kernel.kallsyms])
> 
> Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
> ---
>  fs/xfs/xfs_inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c0f1c89786c2..194381e10472 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -4,6 +4,7 @@
>   * All Rights Reserved.
>   */
>  #include <linux/iversion.h>
> +#include <linux/sched.h>
>  
>  #include "xfs.h"
>  #include "xfs_fs.h"
> @@ -1383,6 +1384,8 @@ xfs_itruncate_extents_flags(
>  		error = xfs_defer_finish(&tp);
>  		if (error)
>  			goto out;
> +
> +		cond_resched();

Is there a particular reason why truncation uses a giant chained
transaction for all extents (with xfs_defer_finish) instead of, say, one
chain per extent?

(If you've actually studied this and know why then I guess cond_resched
is a reasonable bandaid, but I don't want to play cond_resched
whackamole here.)

--D

>  	}
>  
>  	if (whichfork == XFS_DATA_FORK) {
> -- 
> 2.25.1
> 
>
Dave Chinner Jan. 10, 2024, 9:38 p.m. UTC | #2
On Wed, Jan 10, 2024 at 03:13:47PM +0800, Jian Wen wrote:
> From: Jian Wen <wenjianhn@gmail.com>
> 
> Deleting a file with lots of extents may cause a soft lockup if the
> preemption model is none(CONFIG_PREEMPT_NONE=y or preempt=none is set
> in the kernel cmdline). Alibaba cloud kernel and Oracle UEK container
> kernel are affected by the issue, since they select CONFIG_PREEMPT_NONE=y.

Time for them to move to CONFIG_PREEMPT_DYNAMIC?

Also there has been recent action towards removing
CONFIG_PREEMPT_NONE/VOLUNTARY and cond_resched() altogether because
the lazy preemption model present in the RTPREEMPT patchset
solves the performance issues with full preemption that PREEMPT_NONE
works around...

https://lwn.net/Articles/944686/
https://lwn.net/Articles/945422/

Further, Thomas Gleixner has stated in those discussions that:

	"Though definitely I'm putting a permanent NAK in place for
	 any attempts to duct tape the preempt=NONE model any
	 further by sprinkling more cond*() and whatever warts
	 around."

https://lwn.net/ml/linux-kernel/87jzshhexi.ffs@tglx/

> Explicitly call cond_resched in xfs_itruncate_extents_flags avoid
> the below softlockup warning.

IOWs, this is no longer considered an acceptable solution by core
kernel maintainers.

Regardless of these policy issues, the code change:

> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c0f1c89786c2..194381e10472 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -4,6 +4,7 @@
>   * All Rights Reserved.
>   */
>  #include <linux/iversion.h>
> +#include <linux/sched.h>

Global includes like this go in fs/xfs/xfs_linux.h, but I don't
think that's even necessary because we have cond_resched() calls
elsewhere in XFS with the same include list as xfs_inode.c...

>  #include "xfs.h"
>  #include "xfs_fs.h"
> @@ -1383,6 +1384,8 @@ xfs_itruncate_extents_flags(
>  		error = xfs_defer_finish(&tp);
>  		if (error)
>  			goto out;
> +
> +		cond_resched();
>  	}

Shouldn't this go in xfs_defer_finish() so that we capture all the
cases where we loop indefinitely over a range continually rolling a
permanent transaction via xfs_defer_finish()?
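
I.e. something along these lines -- a rough sketch only, with the
existing final transaction roll and cleanup in that function elided:

int
xfs_defer_finish(
	struct xfs_trans	**tp)
{
	int			error;

	error = xfs_defer_finish_noroll(tp);
	if (error)
		return error;

	/* ... existing final roll and cleanup elided ... */

	/*
	 * Yield once per finished chain so every caller that loops
	 * calling xfs_defer_finish() gets a resched point.
	 */
	cond_resched();
	return 0;
}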

-Dave.
Jian Wen Jan. 11, 2024, 12:40 p.m. UTC | #3
On Thu, Jan 11, 2024 at 1:45 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jan 10, 2024 at 03:13:47PM +0800, Jian Wen wrote:
> > From: Jian Wen <wenjianhn@gmail.com>
> >
> > Deleting a file with lots of extents may cause a soft lockup if the
> > preemption model is none(CONFIG_PREEMPT_NONE=y or preempt=none is set
> > in the kernel cmdline). Alibaba cloud kernel and Oracle UEK container
> > kernel are affected by the issue, since they select CONFIG_PREEMPT_NONE=y.
> >
> > Explicitly call cond_resched in xfs_itruncate_extents_flags avoid
> > the below softlockup warning.
> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:13:139]
>
> Wowwee, how many extents does this file have, anyway?
I deleted a file with 346915 extents in my test environment.
I should have mentioned lockdep was enabled: CONFIG_LOCKDEP=y
Without CONFIG_LOCKDEP the scheduling latency was several seconds.

P99 latency of one of our services increased from a few ms to 280 ms
because of the issue.

Below are the steps to make a file badly fragmented.
root@xfsunlink:~/vda# uname -r
6.7.0-rc8-g610a9b8f49fb

root@xfsunlink:~/vda# img=test.raw ; qemu-img create -f raw -o
extent_size_hint=0 ${img} 10G && qemu-img bench -f raw -t none -n -w
${img} -c 1000000 -S 8192 -o 0 && filefrag ${img}
Formatting 'test.raw', fmt=raw size=10737418240 extent_size_hint=0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel
(starting at offset 0, step size 8192)
Run completed in 42.506 seconds.
test.raw: 346915 extents found

>
> > CPU: 0 PID: 139 Comm: kworker/0:13 Not tainted 6.7.0-rc8-g610a9b8f49fb #23
> >  Workqueue: xfs-inodegc/vda1 xfs_inodegc_worker
> >  Call Trace:
> >   _raw_spin_lock+0x30/0x80
> >   ? xfs_extent_busy_trim+0x38/0x200
> >   xfs_extent_busy_trim+0x38/0x200
> >   xfs_alloc_compute_aligned+0x38/0xd0
> >   xfs_alloc_ag_vextent_size+0x1f1/0x870
> >   xfs_alloc_fix_freelist+0x58a/0x770
> >   xfs_free_extent_fix_freelist+0x60/0xa0
> >   __xfs_free_extent+0x66/0x1d0
> >   xfs_trans_free_extent+0xac/0x290
> >   xfs_extent_free_finish_item+0xf/0x40
> >   xfs_defer_finish_noroll+0x1db/0x7f0
> >   xfs_defer_finish+0x10/0xa0
> >   xfs_itruncate_extents_flags+0x169/0x4c0
> >   xfs_inactive_truncate+0xba/0x140
> >   xfs_inactive+0x239/0x2a0
> >   xfs_inodegc_worker+0xa3/0x210
> >   ? process_scheduled_works+0x273/0x550
> >   process_scheduled_works+0x2da/0x550
> >   worker_thread+0x180/0x350
> >
> > Most of the Linux distributions default to voluntary preemption,
> > might_sleep() would yield CPU if needed. Thus they are not affected.
> > kworker/0:24+xf     298 [000]  7294.810021: probe:__cond_resched:
> >   __cond_resched+0x5 ([kernel.kallsyms])
> >   __kmem_cache_alloc_node+0x17c ([kernel.kallsyms])
> >   __kmalloc+0x4d ([kernel.kallsyms])
> >   kmem_alloc+0x7a ([kernel.kallsyms])
> >   xfs_extent_busy_insert_list+0x2e ([kernel.kallsyms])
> >   __xfs_free_extent+0xd3 ([kernel.kallsyms])
> >   xfs_trans_free_extent+0x93 ([kernel.kallsyms])
> >   xfs_extent_free_finish_item+0xf ([kernel.kallsyms])
> >
> > kworker/0:24+xf     298 [000]  7294.810045: probe:__cond_resched:
> >   __cond_resched+0x5 ([kernel.kallsyms])
> >   down+0x11 ([kernel.kallsyms])
> >   xfs_buf_lock+0x2c ([kernel.kallsyms])
> >   xfs_buf_find_lock+0x62 ([kernel.kallsyms])
> >   xfs_buf_get_map+0x1fd ([kernel.kallsyms])
> >   xfs_buf_read_map+0x51 ([kernel.kallsyms])
> >   xfs_trans_read_buf_map+0x1c5 ([kernel.kallsyms])
> >   xfs_btree_read_buf_block.constprop.0+0xa1 ([kernel.kallsyms])
> >   xfs_btree_lookup_get_block+0x97 ([kernel.kallsyms])
> >   xfs_btree_lookup+0xc5 ([kernel.kallsyms])
> >   xfs_alloc_lookup_eq+0x18 ([kernel.kallsyms])
> >   xfs_free_ag_extent+0x63f ([kernel.kallsyms])
> >   __xfs_free_extent+0xa7 ([kernel.kallsyms])
> >   xfs_trans_free_extent+0x93 ([kernel.kallsyms])
> >   xfs_extent_free_finish_item+0xf ([kernel.kallsyms])
> >
> > Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
> > ---
> >  fs/xfs/xfs_inode.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c0f1c89786c2..194381e10472 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -4,6 +4,7 @@
> >   * All Rights Reserved.
> >   */
> >  #include <linux/iversion.h>
> > +#include <linux/sched.h>
> >
> >  #include "xfs.h"
> >  #include "xfs_fs.h"
> > @@ -1383,6 +1384,8 @@ xfs_itruncate_extents_flags(
> >               error = xfs_defer_finish(&tp);
> >               if (error)
> >                       goto out;
> > +
> > +             cond_resched();
>
> Is there a particular reason why truncation uses a giant chained
> transaction for all extents (with xfs_defer_finish) instead of, say, one
> chain per extent?
I have no clue yet.
>
> (If you've actually studied this and know why then I guess cond_resched
> is a reasonable bandaid, but I don't want to play cond_resched
> whackamole here.)
>
> --D
>
> >       }
> >
> >       if (whichfork == XFS_DATA_FORK) {
> > --
> > 2.25.1
> >
> >
Jian Wen Jan. 11, 2024, 12:52 p.m. UTC | #4
On Thu, Jan 11, 2024 at 5:38 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Jan 10, 2024 at 03:13:47PM +0800, Jian Wen wrote:
> > From: Jian Wen <wenjianhn@gmail.com>
> >
> > Deleting a file with lots of extents may cause a soft lockup if the
> > preemption model is none(CONFIG_PREEMPT_NONE=y or preempt=none is set
> > in the kernel cmdline). Alibaba cloud kernel and Oracle UEK container
> > kernel are affected by the issue, since they select CONFIG_PREEMPT_NONE=y.
>
> Time for them to move to CONFIG_PREEMPT_DYNAMIC?
I had asked one of them to support CONFIG_PREEMPT_DYNAMIC before
sending the patch.
>
> Also there has been recent action towards removing
> CONFIG_PREEMPT_NONE/VOLUNTARY and cond_resched() altogether because
> the lazy preemption model coming present in the RTPREEMPT patchset
> solves the performance issues with full preemption that PREEMPT_NONE
> works around...
>
> https://lwn.net/Articles/944686/
> https://lwn.net/Articles/945422/
>
> Further, Thomas Gleixner has stated in those discussions that:
>
>         "Though definitely I'm putting a permanent NAK in place for
>          any attempts to duct tape the preempt=NONE model any
>          further by sprinkling more cond*() and whatever warts
>          around."
>
> https://lwn.net/ml/linux-kernel/87jzshhexi.ffs@tglx/
>
> > Explicitly call cond_resched in xfs_itruncate_extents_flags avoid
> > the below softlockup warning.
>
> IOWs, this is no longer considered an acceptible solution by core
> kernel maintainers.
Understood. I will only build a hotfix for our production kernel then.
>
> Regardless of these policy issues, the code change:
>
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c0f1c89786c2..194381e10472 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -4,6 +4,7 @@
> >   * All Rights Reserved.
> >   */
> >  #include <linux/iversion.h>
> > +#include <linux/sched.h>
>
> Global includes like this go in fs/xfs/xfs_linux.h, but I don't
> think that's even necessary because we have cond_resched() calls
> elsewhere in XFS with the same include list as xfs_inode.c...
>
> >  #include "xfs.h"
> >  #include "xfs_fs.h"
> > @@ -1383,6 +1384,8 @@ xfs_itruncate_extents_flags(
> >               error = xfs_defer_finish(&tp);
> >               if (error)
> >                       goto out;
> > +
> > +             cond_resched();
> >       }
>
> Shouldn't this go in xfs_defer_finish() so that we capture all the
> cases where we loop indefinitely over a range continually rolling a
> permanent transaction via xfs_defer_finish()?
It seems xfs_collapse_file_space and xfs_insert_file_space also need
to yield CPU.
I don't have use cases for them yet.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner Jan. 11, 2024, 8:16 p.m. UTC | #5
On Wed, Jan 10, 2024 at 09:45:01AM -0800, Darrick J. Wong wrote:
> On Wed, Jan 10, 2024 at 03:13:47PM +0800, Jian Wen wrote:
> > From: Jian Wen <wenjianhn@gmail.com>
> > 
> > Deleting a file with lots of extents may cause a soft lockup if the
> > preemption model is none(CONFIG_PREEMPT_NONE=y or preempt=none is set
> > in the kernel cmdline). Alibaba cloud kernel and Oracle UEK container
> > kernel are affected by the issue, since they select CONFIG_PREEMPT_NONE=y.
> > 
> > Explicitly call cond_resched in xfs_itruncate_extents_flags avoid
> > the below softlockup warning.
> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:13:139]
> 
> Wowwee, how many extents does this file have, anyway?

IIRC, extent removal runs at about 50,000 extents/s on a typical
modern CPU on a production kernel build when everything is cached.
23s then equates to a bit above a million extents....

> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c0f1c89786c2..194381e10472 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -4,6 +4,7 @@
> >   * All Rights Reserved.
> >   */
> >  #include <linux/iversion.h>
> > +#include <linux/sched.h>
> >  
> >  #include "xfs.h"
> >  #include "xfs_fs.h"
> > @@ -1383,6 +1384,8 @@ xfs_itruncate_extents_flags(
> >  		error = xfs_defer_finish(&tp);
> >  		if (error)
> >  			goto out;
> > +
> > +		cond_resched();
> 
> Is there a particular reason why truncation uses a giant chained
> transaction for all extents (with xfs_defer_finish) instead of, say, one
> chain per extent?

If I understand the question correctly, extent removal for a given range
being truncated has to be seen externally as an atomic change, so the
ILOCK has to be held across the entire removal operation.

This means it has to be executed as a single rolling transaction as
transaction reservation cannot be done with the ILOCK held and the
atomicity of extent removal prevents the ILOCK from being dropped.
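
In code terms the constraint looks roughly like this (a sketch, not the
literal xfs_itruncate_extents_flags() body; unmap_len stands in for the
remaining range to truncate):

	/* xfs_trans_alloc() reservation is taken before the ILOCK */
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, 0);

	while (unmap_len > 0) {
		/* unmap a batch of extents, queueing the frees on tp */

		/* rolls tp to a fresh log reservation; ILOCK stays held */
		error = xfs_defer_finish(&tp);
		if (error)
			break;
	}

	xfs_iunlock(ip, XFS_ILOCK_EXCL);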

This atomicity only really matters for active operations
like truncate, hole punching, etc when other operations may also be
trying to act on the extent list. However, in inodegc, we are
essentially guaranteed that nobody else will need to access the
extent list whilst we are truncating them away, so it could be done
as a bunch of fine grained transactions.

However, this doesn't solve the problem of the CPU never being yielded
if everything is cached and no locks or transaction reservations are
ever contended. It will still hold the CPU for the same length of
time. Hence I'm not sure changing the way the transactions are run
will really change anything material w.r.t. the issue at hand.

-Dave.
Dave Chinner Jan. 11, 2024, 8:27 p.m. UTC | #6
[cc Thomas, lkml]

On Thu, Jan 11, 2024 at 08:52:22PM +0800, Jian Wen wrote:
> On Thu, Jan 11, 2024 at 5:38 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Jan 10, 2024 at 03:13:47PM +0800, Jian Wen wrote:
> > > From: Jian Wen <wenjianhn@gmail.com>
> > >
> > > Deleting a file with lots of extents may cause a soft lockup if the
> > > preemption model is none(CONFIG_PREEMPT_NONE=y or preempt=none is set
> > > in the kernel cmdline). Alibaba cloud kernel and Oracle UEK container
> > > kernel are affected by the issue, since they select CONFIG_PREEMPT_NONE=y.
> >
> > Time for them to move to CONFIG_PREEMPT_DYNAMIC?
> I had asked one of them to support CONFIG_PREEMPT_DYNAMIC before
> sending the patch.

OK.

> > Also there has been recent action towards removing
> > CONFIG_PREEMPT_NONE/VOLUNTARY and cond_resched() altogether because
> > the lazy preemption model coming present in the RTPREEMPT patchset
> > solves the performance issues with full preemption that PREEMPT_NONE
> > works around...
> >
> > https://lwn.net/Articles/944686/
> > https://lwn.net/Articles/945422/
> >
> > Further, Thomas Gleixner has stated in those discussions that:
> >
> >         "Though definitely I'm putting a permanent NAK in place for
> >          any attempts to duct tape the preempt=NONE model any
> >          further by sprinkling more cond*() and whatever warts
> >          around."
> >
> > https://lwn.net/ml/linux-kernel/87jzshhexi.ffs@tglx/
> >
> > > Explicitly call cond_resched in xfs_itruncate_extents_flags avoid
> > > the below softlockup warning.
> >
> > IOWs, this is no longer considered an acceptible solution by core
> > kernel maintainers.
> Understood. I will only build a hotfix for our production kernel then.

Yeah, that may be your best short term fix. We'll need to clarify
what the current policy is on adding cond_resched points before we
go any further in this direction.

Thomas, any update on what is happening with cond_resched() - is
there an ETA on it going away/being unnecessary?

> > Regardless of these policy issues, the code change:
> >
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index c0f1c89786c2..194381e10472 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -4,6 +4,7 @@
> > >   * All Rights Reserved.
> > >   */
> > >  #include <linux/iversion.h>
> > > +#include <linux/sched.h>
> >
> > Global includes like this go in fs/xfs/xfs_linux.h, but I don't
> > think that's even necessary because we have cond_resched() calls
> > elsewhere in XFS with the same include list as xfs_inode.c...
> >
> > >  #include "xfs.h"
> > >  #include "xfs_fs.h"
> > > @@ -1383,6 +1384,8 @@ xfs_itruncate_extents_flags(
> > >               error = xfs_defer_finish(&tp);
> > >               if (error)
> > >                       goto out;
> > > +
> > > +             cond_resched();
> > >       }
> >
> > Shouldn't this go in xfs_defer_finish() so that we capture all the
> > cases where we loop indefinitely over a range continually rolling a
> > permanent transaction via xfs_defer_finish()?
> It seems xfs_collapse_file_space and xfs_insert_file_space also need
> to yield CPU.
> I don't have use cases for them yet.

Yup, they do, but they also call xfs_defer_finish(), so having the
cond_resched() in that function will capture them as well.

Also, the current upstream tree has moved this code from
xfs_itruncate_extents_flags() to xfs_bunmapi_range(), so the
cond_resched() has to be moved, anyway. We may as well put it in
xfs_defer_finish() if we end up doing this.

Cheers,

Dave.
Thomas Gleixner Jan. 12, 2024, 1:01 p.m. UTC | #7
On Fri, Jan 12 2024 at 07:27, Dave Chinner wrote:

Cc: Ankur

> On Thu, Jan 11, 2024 at 08:52:22PM +0800, Jian Wen wrote:
>> On Thu, Jan 11, 2024 at 5:38 AM Dave Chinner <david@fromorbit.com> wrote:
>> > IOWs, this is no longer considered an acceptible solution by core
>> > kernel maintainers.
>> Understood. I will only build a hotfix for our production kernel then.
>
> Yeah, that may be your best short term fix. We'll need to clarify
> what the current policy is on adding cond_resched points before we
> go any further in this direction.

Well, right now until the scheduler situation is sorted there is no
other solution than to add the cond_resched() muck.

> Thomas, any update on what is happening with cond_resched() - is
> there an ETA on it going away/being unnecessary?

Ankur is working on that...
Ankur Arora Jan. 23, 2024, 7:01 a.m. UTC | #8
[ Missed this email until now. ]

Thomas Gleixner writes:

>On Fri, Jan 12 2024 at 07:27, Dave Chinner wrote:
>
>Cc: Ankur
>
> > On Thu, Jan 11, 2024 at 08:52:22PM +0800, Jian Wen wrote:
> >> On Thu, Jan 11, 2024 at 5:38 AM Dave Chinner <david@fromorbit.com> wrote:
> >> > IOWs, this is no longer considered an acceptible solution by core
> >> > kernel maintainers.
> >> Understood. I will only build a hotfix for our production kernel then.
> >
> > Yeah, that may be your best short term fix. We'll need to clarify
> > what the current policy is on adding cond_resched points before we
> > go any further in this direction.
>
> Well, right now until the scheduler situation is sorted there is no
> other solution than to add the cond_resched() muck.
>
> > Thomas, any update on what is happening with cond_resched() - is
> > there an ETA on it going away/being unnecessary?
>
> Ankur is working on that...

Yeah, running through a final round of tests before sending out the series.

Dave, on the status of cond_resched(): the work on this adds a new scheduling
model (as Thomas implemented in his PoC) under which cond_resched() would
basically be a stub.

However, given that other preemption models continue to use cond_resched(),
we would need to live with cond_resched() for a while -- at least while
this model works well enough under a wide enough variety of loads.

Thanks
Ankur

Patch

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c0f1c89786c2..194381e10472 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -4,6 +4,7 @@ 
  * All Rights Reserved.
  */
 #include <linux/iversion.h>
+#include <linux/sched.h>
 
 #include "xfs.h"
 #include "xfs_fs.h"
@@ -1383,6 +1384,8 @@  xfs_itruncate_extents_flags(
 		error = xfs_defer_finish(&tp);
 		if (error)
 			goto out;
+
+		cond_resched();
 	}
 
 	if (whichfork == XFS_DATA_FORK) {
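
For context, the hunk above lands in the unmap loop of
xfs_itruncate_extents_flags(); with the patch applied the loop reads
roughly as follows (asserts and surrounding setup abbreviated):

	while (unmap_len > 0) {
		error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len,
				      flags, XFS_ITRUNC_MAX_EXTENTS);
		if (error)
			goto out;

		/* free the just unmapped extents */
		error = xfs_defer_finish(&tp);
		if (error)
			goto out;

		/* yield so a huge truncate cannot hog the CPU under PREEMPT_NONE */
		cond_resched();
	}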