diff mbox series

[for-akpm,for-6.14-rcX] NFS: fix nfs_release_folio() to not call nfs_wb_folio() from kcompactd

Message ID 20250225003301.25693-1-snitzer@kernel.org (mailing list archive)
State Handled Elsewhere
Headers show
Series [for-akpm,for-6.14-rcX] NFS: fix nfs_release_folio() to not call nfs_wb_folio() from kcompactd | expand

Commit Message

Mike Snitzer Feb. 25, 2025, 12:33 a.m. UTC
Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for
it so nfs_release_folio() can skip calling nfs_wb_folio() from
kcompactd.

Otherwise NFS can deadlock waiting for kcompactd enduced writeback
which recurses back to NFS (which triggers writeback to NFSD via
NFS loopback mount on the same host, NFSD blocks waiting for XFS's
call to __filemap_get_folio):

6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds.

{---
[58] "kcompactd0"
[<0>] folio_wait_bit+0xe8/0x200
[<0>] folio_wait_writeback+0x2b/0x80
[<0>] nfs_wb_folio+0x80/0x1b0 [nfs]
[<0>] nfs_release_folio+0x68/0x130 [nfs]
[<0>] split_huge_page_to_list_to_order+0x362/0x840
[<0>] migrate_pages_batch+0x43d/0xb90
[<0>] migrate_pages_sync+0x9a/0x240
[<0>] migrate_pages+0x93c/0x9f0
[<0>] compact_zone+0x8e2/0x1030
[<0>] compact_node+0xdb/0x120
[<0>] kcompactd+0x121/0x2e0
[<0>] kthread+0xcf/0x100
[<0>] ret_from_fork+0x31/0x40
[<0>] ret_from_fork_asm+0x1a/0x30
---}

Fixes: 96780ca55e3cb ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/file.c              | 3 ++-
 include/linux/compaction.h | 5 +++++
 include/linux/sched.h      | 2 +-
 mm/compaction.c            | 3 +++
 4 files changed, 11 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Feb. 26, 2025, 2:58 p.m. UTC | #1
On Mon, Feb 24, 2025 at 07:33:01PM -0500, Mike Snitzer wrote:
> Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for
> it so nfs_release_folio() can skip calling nfs_wb_folio() from
> kcompactd.
> 
> Otherwise NFS can deadlock waiting for kcompactd enduced writeback
> which recurses back to NFS (which triggers writeback to NFSD via
> NFS loopback mount on the same host, NFSD blocks waiting for XFS's
> call to __filemap_get_folio):

Having a flag for a specific kernel thread feels wrong.  I'm not an
expert in this area, but as fast as I can tell the problem is that
kcompactd should be calling into ->release_folio without __GFP_IO
set.
Mike Snitzer Feb. 26, 2025, 3:44 p.m. UTC | #2
On Wed, Feb 26, 2025 at 06:58:11AM -0800, Christoph Hellwig wrote:
> On Mon, Feb 24, 2025 at 07:33:01PM -0500, Mike Snitzer wrote:
> > Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for
> > it so nfs_release_folio() can skip calling nfs_wb_folio() from
> > kcompactd.
> > 
> > Otherwise NFS can deadlock waiting for kcompactd enduced writeback
> > which recurses back to NFS (which triggers writeback to NFSD via
> > NFS loopback mount on the same host, NFSD blocks waiting for XFS's
> > call to __filemap_get_folio):
> 
> Having a flag for a specific kernel thread feels wrong.  I'm not an
> expert in this area, but as fast as I can tell the problem is that
> kcompactd should be calling into ->release_folio without __GFP_IO
> set.

We can easily remove PF_KCOMPACTD and current_is_kcompactd() if/when
more analysis and a cleaner fix emerges (from you, mm experts, etc).
I'm not saying you're wrong about GFP flags needing to be clamped by
kcompactd's call to ->release_folio, but it seems more finishing work
needed.

As you can tell from my patch, we already have kswapd specialization
with PF_KSWAPD and current_is_kswapd().  A consumer of large folios
knowing that it being called as kcompactd is useful, hence my fix for
deadlock seen with NFS loopback mounts on memory constrained systems.

Mike
Mike Snitzer Feb. 26, 2025, 3:55 p.m. UTC | #3
On Wed, Feb 26, 2025 at 10:44:15AM -0500, Mike Snitzer wrote:
> On Wed, Feb 26, 2025 at 06:58:11AM -0800, Christoph Hellwig wrote:
> > On Mon, Feb 24, 2025 at 07:33:01PM -0500, Mike Snitzer wrote:
> > > Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for
> > > it so nfs_release_folio() can skip calling nfs_wb_folio() from
> > > kcompactd.
> > > 
> > > Otherwise NFS can deadlock waiting for kcompactd enduced writeback
> > > which recurses back to NFS (which triggers writeback to NFSD via
> > > NFS loopback mount on the same host, NFSD blocks waiting for XFS's
> > > call to __filemap_get_folio):
> > 
> > Having a flag for a specific kernel thread feels wrong.  I'm not an
> > expert in this area, but as fast as I can tell the problem is that
> > kcompactd should be calling into ->release_folio without __GFP_IO
> > set.
> 
> We can easily remove PF_KCOMPACTD and current_is_kcompactd() if/when
> more analysis and a cleaner fix emerges (from you, mm experts, etc).
> I'm not saying you're wrong about GFP flags needing to be clamped by
> kcompactd's call to ->release_folio, but it seems more finishing work
> needed.
> 
> As you can tell from my patch, we already have kswapd specialization
> with PF_KSWAPD and current_is_kswapd().  A consumer of large folios
> knowing that it being called as kcompactd is useful, hence my fix for
> deadlock seen with NFS loopback mounts on memory constrained systems.

Also, as followup to my fix: Trond and I have started exploring giving
NFS the ability to do more meaningful work on behalf of kswapd and
kcompactd from ->release_folio (rather than simply punt, by returning
false, like your proposed GFP flag clamping would do), see:

https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=kernel-6.12/nfs-next&id=b8fd0d3864e4bd43eaca1045002e89bede8fd91f

(I'm still working on testing/refining this patch...)
Trond Myklebust Feb. 26, 2025, 5:22 p.m. UTC | #4
On Wed, 2025-02-26 at 06:58 -0800, Christoph Hellwig wrote:
> On Mon, Feb 24, 2025 at 07:33:01PM -0500, Mike Snitzer wrote:
> > Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check
> > for
> > it so nfs_release_folio() can skip calling nfs_wb_folio() from
> > kcompactd.
> > 
> > Otherwise NFS can deadlock waiting for kcompactd enduced writeback
> > which recurses back to NFS (which triggers writeback to NFSD via
> > NFS loopback mount on the same host, NFSD blocks waiting for XFS's
> > call to __filemap_get_folio):
> 
> Having a flag for a specific kernel thread feels wrong.  I'm not an
> expert in this area, but as fast as I can tell the problem is that
> kcompactd should be calling into ->release_folio without __GFP_IO
> set.

I've had that conversation before with several mm folks before we added
the PF_KSWAPD flag, and they appeared to disagree.
However perhaps these daemons could set PF_MEMALLOC_NOFS before calling
into release_folio()?
Andrew Morton Feb. 26, 2025, 10:02 p.m. UTC | #5
On Wed, 26 Feb 2025 06:58:11 -0800 Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Feb 24, 2025 at 07:33:01PM -0500, Mike Snitzer wrote:
> > Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for
> > it so nfs_release_folio() can skip calling nfs_wb_folio() from
> > kcompactd.
> > 
> > Otherwise NFS can deadlock waiting for kcompactd enduced writeback
> > which recurses back to NFS (which triggers writeback to NFSD via
> > NFS loopback mount on the same host, NFSD blocks waiting for XFS's
> > call to __filemap_get_folio):
> 
> Having a flag for a specific kernel thread feels wrong.

Why do you think this?

We are getting a bit short of space in tsk->flags.  Five PF__HOLE__'s remain.

For this patch and for kswapd I guess we could simply record the
task_struct*'s in a static array, something like

static struct task_struct nice_name[max nodes];

	...

	if (current == nice_name[my node])
		...

It's something we can look at if we start getting really tight on PF_ space.
diff mbox series

Patch

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 1bb646752e466..033feeab8c346 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -29,6 +29,7 @@ 
 #include <linux/pagemap.h>
 #include <linux/gfp.h>
 #include <linux/swap.h>
+#include <linux/compaction.h>
 
 #include <linux/uaccess.h>
 #include <linux/filelock.h>
@@ -457,7 +458,7 @@  static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
 	/* If the private flag is set, then the folio is not freeable */
 	if (folio_test_private(folio)) {
 		if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
-		    current_is_kswapd())
+		    current_is_kswapd() || current_is_kcompactd())
 			return false;
 		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
 			return false;
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764960496..7bf0c521db634 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -80,6 +80,11 @@  static inline unsigned long compact_gap(unsigned int order)
 	return 2UL << order;
 }
 
+static inline int current_is_kcompactd(void)
+{
+	return current->flags & PF_KCOMPACTD;
+}
+
 #ifdef CONFIG_COMPACTION
 
 extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8982820dae213..0d1d70aded38f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1682,7 +1682,7 @@  extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF__HOLE__00010000	0x00010000
+#define PF_KCOMPACTD		0x00010000	/* I am kcompactd */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocations inherit GFP_NOFS. See memalloc_nfs_save() */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocations inherit GFP_NOIO. See memalloc_noio_save() */
diff --git a/mm/compaction.c b/mm/compaction.c
index 384e4672998e5..2dd03105e6898 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -3164,6 +3164,7 @@  static int kcompactd(void *p)
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 
+	tsk->flags | PF_KCOMPACTD;
 	set_freezable();
 
 	pgdat->kcompactd_max_order = 0;
@@ -3220,6 +3221,8 @@  static int kcompactd(void *p)
 			pgdat->proactive_compact_trigger = false;
 	}
 
+	tsk->flags &= ~PF_KCOMPACTD;
+
 	return 0;
 }