mm, percpu: do not consider sleepable allocations atomic

Message ID 20250206122633.167896-1-mhocko@kernel.org (mailing list archive)
State New
Series mm, percpu: do not consider sleepable allocations atomic

Commit Message

Michal Hocko Feb. 6, 2025, 12:26 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context")
has fixed a reclaim recursion for scoped GFP_NOFS contexts by avoiding
taking pcpu_alloc_mutex there. This is a correct solution because the
worker context, which runs with full GFP_KERNEL allocation/reclaim power
and uses the same lock, can then never block a NOFS-constrained
pcpu_alloc caller.

On the other hand this is a very conservative approach that can lead
to allocation failures because the lockless pcpu_alloc implementation is
quite limited.

We have a bug report about premature failures when a SCSI array of 193
devices is scanned. Sometimes (not consistently) the scanning aborts
because the iscsid daemon fails to create the queue for a random SCSI
device during the scan. iscsid itself runs with PR_SET_IO_FLUSHER set,
so all allocations from its process context are GFP_NOIO. This in turn
makes every pcpu_alloc lockless (without pcpu_alloc_mutex), which leads
to the premature failures.

It has turned out that iscsid has worked around this by dropping
PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382)
when scanning the host. But we can do better on the kernel side and use
pcpu_alloc_mutex for NOIO and NOFS constrained allocation scopes too.
We just need the WQ worker to never trigger IO/FS reclaim. Achieve that
by enforcing scoped GFP_NOIO for the whole execution of
pcpu_balance_workfn (this implies the NOFS constraint as well). This
removes the dependency chain and preserves the full allocation power of
the pcpu_alloc call.

While at it, make is_atomic really test for blockable allocations.
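
For illustration, assuming the stock gfpflags_allow_blocking() helper
(which just tests __GFP_DIRECT_RECLAIM), the two tests disagree for a
scoped NOIO caller:

	/* sketch only: old vs. new is_atomic for a caller under a NOIO scope */
	gfp = current_gfp_context(GFP_KERNEL);	/* yields GFP_NOIO in a NOIO scope */

	/* old: anything short of full GFP_KERNEL counted as atomic */
	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;	/* true, __GFP_IO/__GFP_FS cleared */

	/* new: only allocations that cannot block at all count as atomic */
	is_atomic = !gfpflags_allow_blocking(gfp);	/* false, __GFP_DIRECT_RECLAIM set */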

Fixes: 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/percpu.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Vlastimil Babka Feb. 11, 2025, 3:05 p.m. UTC | #1
On 2/6/25 13:26, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context")
> has fixed a reclaim recursion for scoped GFP_NOFS contexts by avoiding
> taking pcpu_alloc_mutex there. This is a correct solution because the
> worker context, which runs with full GFP_KERNEL allocation/reclaim power
> and uses the same lock, can then never block a NOFS-constrained
> pcpu_alloc caller.
> 
> On the other hand this is a very conservative approach that can lead
> to allocation failures because the lockless pcpu_alloc implementation is
> quite limited.
> 
> We have a bug report about premature failures when a SCSI array of 193
> devices is scanned. Sometimes (not consistently) the scanning aborts
> because the iscsid daemon fails to create the queue for a random SCSI
> device during the scan. iscsid itself runs with PR_SET_IO_FLUSHER set,
> so all allocations from its process context are GFP_NOIO. This in turn
> makes every pcpu_alloc lockless (without pcpu_alloc_mutex), which leads
> to the premature failures.
> 
> It has turned out that iscsid has worked around this by dropping
> PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382)
> when scanning the host. But we can do better on the kernel side and use
> pcpu_alloc_mutex for NOIO and NOFS constrained allocation scopes too.
> We just need the WQ worker to never trigger IO/FS reclaim. Achieve that
> by enforcing scoped GFP_NOIO for the whole execution of
> pcpu_balance_workfn (this implies the NOFS constraint as well). This
> removes the dependency chain and preserves the full allocation power of
> the pcpu_alloc call.
> 
> While at it, make is_atomic really test for blockable allocations.
> 
> Fixes: 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context")
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/percpu.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index d8dd31a2e407..192c2a8e901d 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1758,7 +1758,7 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>  	gfp = current_gfp_context(gfp);
>  	/* whitelisted flags that can be passed to the backing allocators */
>  	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> -	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
> +	is_atomic = !gfpflags_allow_blocking(gfp);
>  	do_warn = !(gfp & __GFP_NOWARN);
>  
>  	/*
> @@ -2204,7 +2204,12 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	 * to grow other chunks.  This then gives pcpu_reclaim_populated() time
>  	 * to move fully free chunks to the active list to be freed if
>  	 * appropriate.
> +	 *
> +	 * Enforce GFP_NOIO allocations because we have pcpu_alloc users
> +	 * constrained to GFP_NOIO/NOFS contexts and they could form lock
> +	 * dependency through pcpu_alloc_mutex
>  	 */
> +	unsigned int flags = memalloc_noio_save();
>  	mutex_lock(&pcpu_alloc_mutex);
>  	spin_lock_irq(&pcpu_lock);
>  
> @@ -2215,6 +2220,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  
>  	spin_unlock_irq(&pcpu_lock);
>  	mutex_unlock(&pcpu_alloc_mutex);
> +	memalloc_noio_restore(flags);
>  }
>  
>  /**
Tejun Heo Feb. 11, 2025, 8:55 p.m. UTC | #2
Hello, Michal.

On Thu, Feb 06, 2025 at 01:26:33PM +0100, Michal Hocko wrote:
...
> It has turned out that iscsid has worked around this by dropping
> PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382)
> when scanning host. But we can do better in this case on the kernel side

FWIW, requiring GFP_KERNEL context for probing doesn't sound too crazy to
me.

> @@ -2204,7 +2204,12 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	 * to grow other chunks.  This then gives pcpu_reclaim_populated() time
>  	 * to move fully free chunks to the active list to be freed if
>  	 * appropriate.
> +	 *
> +	 * Enforce GFP_NOIO allocations because we have pcpu_alloc users
> +	 * constrained to GFP_NOIO/NOFS contexts and they could form lock
> +	 * dependency through pcpu_alloc_mutex
>  	 */
> +	unsigned int flags = memalloc_noio_save();

Just for context, the reason why the allocation mask support was limited to
"GFP_KERNEL or not" rather than supporting the full range of GFP flags is
that percpu memory area expansion can involve page table allocations in the
vmalloc area, which always use GFP_KERNEL. memalloc_noio_save() masks the IO
part out of that, right? It might be worthwhile to explain why we aren't
passing down GFP flags throughout and instead depend on masking.

Also, doesn't the above always prevent percpu allocations from doing fs/io
reclaims? ie. Shouldn't the masking only be used if the passed in gfp
doesn't allow fs/io?

Thanks.
Michal Hocko Feb. 12, 2025, 4:57 p.m. UTC | #3
On Tue 11-02-25 10:55:20, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Feb 06, 2025 at 01:26:33PM +0100, Michal Hocko wrote:
> ...
> > It has turned out that iscsid has worked around this by dropping
> > PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382)
> > when scanning host. But we can do better in this case on the kernel side
> 
> FWIW, requiring GFP_KERNEL context for probing doesn't sound too crazy to
> me.
> 
> > @@ -2204,7 +2204,12 @@ static void pcpu_balance_workfn(struct work_struct *work)
> >  	 * to grow other chunks.  This then gives pcpu_reclaim_populated() time
> >  	 * to move fully free chunks to the active list to be freed if
> >  	 * appropriate.
> > +	 *
> > +	 * Enforce GFP_NOIO allocations because we have pcpu_alloc users
> > +	 * constrained to GFP_NOIO/NOFS contexts and they could form lock
> > +	 * dependency through pcpu_alloc_mutex
> >  	 */
> > +	unsigned int flags = memalloc_noio_save();
> 
> Just for context, the reason why the allocation mask support was limited to
> "GFP_KERNEL or not" rather than supporting the full range of GFP flags is
> that percpu memory area expansion can involve page table allocations in the
> vmalloc area, which always use GFP_KERNEL. memalloc_noio_save() masks the IO
> part out of that, right? It might be worthwhile to explain why we aren't
> passing down GFP flags throughout and instead depend on masking.

I have gone with masking because that seemed easier to review and a more
robust solution. vmalloc does support NOFS/NOIO contexts these days (it
will just use scoped masking in those cases). Propagating the gfp
throughout the worker code path is likely possible, but I haven't really
explored that in detail to be sure. Would that be preferable even if the
fix would be more involved?

> Also, doesn't the above always prevent percpu allocations from doing fs/io
> reclaims? 

Yes it does. Probably worth mentioning in the changelog. These
allocations should be rare so having a constrained reclaim didn't really
seem problematic to me. There should be kswapd running in the background
with the full reclaim power.

> ie. Shouldn't the masking only be used if the passed in gfp
> doesn't allow fs/io?

This is a good question. I have to admit that my understanding might be
incorrect, but wouldn't it be possible that we could get the lock
dependency chain if GFP_KERNEL and scoped NOFS pcpu_alloc calls are
competing?

					fs/io lock
					pcpu_alloc_noprof(NOFS/NOIO)
pcpu_alloc_noprof(GFP_KERNEL)
  pcpu_schedule_balance_work
    pcpu_alloc_mutex
    					  pcpu_alloc_mutex
      allocation_deadlock through fs/io lock

This is currently not possible because constrained allocations only do
trylock.
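
To illustrate, here is a simplified sketch of the current pcpu_alloc()
locking split as I understand it (not the verbatim upstream code):

	gfp = current_gfp_context(gfp);
	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;	/* scoped NOFS/NOIO => atomic */

	if (!is_atomic)
		mutex_lock(&pcpu_alloc_mutex);	/* may wait behind the balance worker */

	/*
	 * Atomic (constrained) callers skip the mutex entirely and can only
	 * use already populated chunks, which is why the chain above cannot
	 * form today, and also why such allocations fail prematurely.
	 */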

Makes sense?
Tejun Heo Feb. 12, 2025, 6:14 p.m. UTC | #4
Hello,

On Wed, Feb 12, 2025 at 05:57:04PM +0100, Michal Hocko wrote:
...
> I have gone with masking because that seemed easier to review and a more
> robust solution. vmalloc does support NOFS/NOIO contexts these days (it
> will just use scoped masking in those cases). Propagating the gfp

I see. Nice.

> throughout the worker code path is likely possible, but I haven't really
> explored that in detail to be sure. Would that be preferable even if the
> fix would be more involved?

Longer term, yeah, I think so.

> > Also, doesn't the above always prevent percpu allocations from doing fs/io
> > reclaims? 
> 
> Yes it does. Probably worth mentioning in the changelog. These
> allocations should be rare so having a constrained reclaim didn't really
> seem problematic to me. There should be kswapd running in the background
> with the full reclaim power.

Hmm... you'd be a better judge of whether that'd be okay or not, but it does
bother me that we might be increasing the chance of allocation failures for
GFP_KERNEL users, at least under memory pressure.

> > ie. Shouldn't the masking only be used if the passed in gfp
> > doesn't allow fs/io?
> 
> This is a good question. I have to admit that my understanding might be
> incorrect, but wouldn't it be possible that we could get the lock
> dependency chain if GFP_KERNEL and scoped NOFS pcpu_alloc calls are
> competing?
> 
> 					fs/io lock
> 					pcpu_alloc_noprof(NOFS/NOIO)
> pcpu_alloc_noprof(GFP_KERNEL)
>   pcpu_schedule_balance_work
>     pcpu_alloc_mutex
>     					  pcpu_alloc_mutex
>       allocation_deadlock through fs/io lock
> 
> This is currently not possible because constrained allocations only do
> trylock.

Right, the current locking in the expansion path is really simple because it
was assuming everyone would be doing GFP_KERNEL allocations. We'd have to
break up the locking so that allocations are done outside of it, which
hopefully shouldn't be too complicated.

Thanks.
Michal Hocko Feb. 12, 2025, 8:53 p.m. UTC | #5
On Wed 12-02-25 08:14:35, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 12, 2025 at 05:57:04PM +0100, Michal Hocko wrote:
> ...
> > I have gone with masking because that seemed easier to review and a more
> > robust solution. vmalloc does support NOFS/NOIO contexts these days (it
> > will just use scoped masking in those cases). Propagating the gfp
> 
> I see. Nice.
> 
> > throughout the worker code path is likely possible, but I haven't really
> > explored that in detail to be sure. Would that be preferable even if the
> > fix would be more involved?
> 
> Longer term, yeah, I think so.

I can invest more time in that direction if this is really the preferred
way. Not my call, but I would argue that the scope interface is actually a
good fit for the current implementation because it clearly defines the
allocation context for the whole scope in a single place. Ideally with a
good explanation of why that is (I guess I owe one in that regard).

> > > Also, doesn't the above always prevent percpu allocations from doing fs/io
> > > reclaims? 
> > 
> > Yes it does. Probably worth mentioning in the changelog. These
> > allocations should be rare so having a constrained reclaim didn't really
> > seem problematic to me. There should be kswapd running in the background
> > with the full reclaim power.
> 
> Hmm... you'd be a better judge of whether that'd be okay or not, but it does
> bother me that we might be increasing the chance of allocation failures for
> GFP_KERNEL users, at least under memory pressure.

Nope, this will not change the allocation failure mode. Reclaim
constraints do not change the failure mode; they just change how much the
allocation might struggle to reclaim enough memory to succeed.

My undocumented assumption (another debt on my end) is that pcp
allocations are not hot paths. So the worst case is that a GFP_KERNEL
pcp allocation could have been satisfied _easier_ (i.e. faster) because
it could have reclaimed fs/io caches and now it needs to rely on kswapd
to do that in memory-tight situations. On the other hand we have a
situation where NOIO/FS allocations fail prematurely so there are
certainly pros and cons.

As I've said I am no pcp allocator expert so I cannot really make proper
judgment calls. I can improve the changelog or move from scope to
specific gfp flags but I do not feel like I am positioned to make deeper
changes to the subsystem.
Tejun Heo Feb. 12, 2025, 9:30 p.m. UTC | #6
Hello,

On Wed, Feb 12, 2025 at 09:53:20PM +0100, Michal Hocko wrote:
...
> > Hmm... you'd be a better judge of whether that'd be okay or not, but it does
> > bother me that we might be increasing the chance of allocation failures for
> > GFP_KERNEL users, at least under memory pressure.
> 
> Nope, this will not change the allocation failure mode. Reclaim
> constraints do not change the failure mode; they just change how much the
> allocation might struggle to reclaim enough memory to succeed.
>
> My undocumented assumption (another debt on my end) is that pcp
> allocations are not hot paths. So the worst case is that a GFP_KERNEL
> pcp allocation could have been satisfied _easier_ (i.e. faster) because
> it could have reclaimed fs/io caches and now it needs to rely on kswapd
> to do that in memory-tight situations. On the other hand we have a
> situation where NOIO/FS allocations fail prematurely so there are
> certainly pros and cons.

I'm having a hard time following. Are you saying that it won't increase the
likelihood of allocation failures even under memory pressure but that it
might just make allocations take longer to succeed?

NOFS/IO prevents the allocation attempt from entering fs/io reclaim paths,
right? It would still trigger kswapd for reclaim, but can the allocation
attempt wait for that to finish? If so, wouldn't that constitute a
dependency cycle all the same?

All in all, percpu allocations taking longer under memory pressure is fine.
Becoming more prone to allocation failures, especially for GFP_KERNEL
callers, probably isn't great.

> As I've said I am no pcp allocator expert so I cannot really make proper
> judgment calls. I can improve the changelog or move from scope to
> specific gfp flags but I do not feel like I am positioned to make deeper
> changes to the subsystem.

I don't think deciding whether always using NOIO/FS is a good idea requires
knowing the percpu allocator that well. It's just depending on the
underlying page allocator for that part.

Thanks.
Dennis Zhou Feb. 12, 2025, 9:39 p.m. UTC | #7
Hello,

On Wed, Feb 12, 2025 at 11:30:08AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 12, 2025 at 09:53:20PM +0100, Michal Hocko wrote:
> ...
> > > Hmm... you'd be a better judge of whether that'd be okay or not, but it does
> > > bother me that we might be increasing the chance of allocation failures for
> > > GFP_KERNEL users, at least under memory pressure.
> > 
> > Nope, this will not change the allocation failure mode. Reclaim
> > constraints do not change the failure mode; they just change how much the
> > allocation might struggle to reclaim enough memory to succeed.
> >
> > My undocumented assumption (another debt on my end) is that pcp
> > allocations are not hot paths. So the worst case is that a GFP_KERNEL
> > pcp allocation could have been satisfied _easier_ (i.e. faster) because
> > it could have reclaimed fs/io caches and now it needs to rely on kswapd
> > to do that in memory-tight situations. On the other hand we have a
> > situation where NOIO/FS allocations fail prematurely so there are
> > certainly pros and cons.
> 
> I'm having a hard time following. Are you saying that it won't increase the
> likelihood of allocation failures even under memory pressure but that it
> might just make allocations take longer to succeed?
> 
> NOFS/IO prevents the allocation attempt from entering fs/io reclaim paths,
> right? It would still trigger kswapd for reclaim, but can the allocation
> attempt wait for that to finish? If so, wouldn't that constitute a
> dependency cycle all the same?
> 
> All in all, percpu allocations taking longer under memory pressure is fine.
> Becoming more prone to allocation failures, especially for GFP_KERNEL
> callers, probably isn't great.
> 

Wait, I think I'm interpreting this change differently. This is
preventing the worker from allocating backing pages via GFP_KERNEL. It
isn't preventing an allocation via alloc_percpu() from being GFP_KERNEL
and providing those flags down to the backing page code. alloc_percpu()
for GFP_KERNEL allocations will populate the pages before returning.
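
(For context, a minimal usage sketch of the two flavours, assuming the
stock alloc_percpu_gfp()/free_percpu() helpers; the variable names are
just for illustration:)

	/* blocking caller: backing pages are populated before return */
	int __percpu *counters = alloc_percpu_gfp(int, GFP_KERNEL);

	/* constrained/atomic caller: only already populated pages may be
	 * used, so this can fail early when the reserve runs low */
	int __percpu *atomic_counters = alloc_percpu_gfp(int, GFP_NOWAIT);

	free_percpu(counters);
	free_percpu(atomic_counters);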

I'm reading this as potentially making atomic percpu allocations fail as
we might be low on backing pages. This change makes the worker now need
to wait for kswapd to give it pages. Consequently, if there are a lot of
allocations coming in when it's low, we might burn a bit of cpu from the
worker now.

We could take the time to split out pcpu_alloc_mutex and pcpu_lock more
to provide finer-grained / concurrent allocations. But I don't currently
have a justification for it.

> > As I've said I am no pcp allocator expert so I cannot really make proper
> > judgment calls. I can improve the changelog or move from scope to
> > specific gfp flags but I do not feel like I am positioned to make deeper
> > changes to the subsystem.
> 
> I don't think deciding whether always using NOIO/FS is a good idea requires
> knowing the percpu allocator that well. It's just depending on the
> underlying page allocator for that part.
> 
> Thanks.
> 
> -- 
> tejun

Thanks,
Dennis
Michal Hocko Feb. 14, 2025, 3:43 p.m. UTC | #8
On Wed 12-02-25 11:30:08, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 12, 2025 at 09:53:20PM +0100, Michal Hocko wrote:
> ...
> > > Hmm... you'd be a better judge of whether that'd be okay or not, but it does
> > > bother me that we might be increasing the chance of allocation failures for
> > > GFP_KERNEL users, at least under memory pressure.
> > 
> > Nope, this will not change the allocation failure mode. Reclaim
> > constraints do not change the failure mode; they just change how much the
> > allocation might struggle to reclaim enough memory to succeed.
> >
> > My undocumented assumption (another debt on my end) is that pcp
> > allocations are not hot paths. So the worst case is that a GFP_KERNEL
> > pcp allocation could have been satisfied _easier_ (i.e. faster) because
> > it could have reclaimed fs/io caches and now it needs to rely on kswapd
> > to do that in memory-tight situations. On the other hand we have a
> > situation where NOIO/FS allocations fail prematurely so there are
> > certainly pros and cons.
> 
> I'm having a hard time following. Are you saying that it won't increase the
> likelihood of allocation failures even under memory pressure but that it
> might just make allocations take longer to succeed?

yes, this is like any other non-costly (<=PAGE_ALLOC_COSTLY_ORDER)
NOFS/NOIO allocation, which effectively never fails.
Michal Hocko Feb. 14, 2025, 3:52 p.m. UTC | #9
On Wed 12-02-25 13:39:31, Dennis Zhou wrote:
> Hello,
> 
> On Wed, Feb 12, 2025 at 11:30:08AM -1000, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Feb 12, 2025 at 09:53:20PM +0100, Michal Hocko wrote:
> > ...
> > > > Hmm... you'd be a better judge of whether that'd be okay or not, but it does
> > > > bother me that we might be increasing the chance of allocation failures for
> > > > GFP_KERNEL users, at least under memory pressure.
> > > 
> > > Nope, this will not change the allocation failure mode. Reclaim
> > > constraints do not change the failure mode; they just change how much the
> > > allocation might struggle to reclaim enough memory to succeed.
> > >
> > > My undocumented assumption (another debt on my end) is that pcp
> > > allocations are not hot paths. So the worst case is that a GFP_KERNEL
> > > pcp allocation could have been satisfied _easier_ (i.e. faster) because
> > > it could have reclaimed fs/io caches and now it needs to rely on kswapd
> > > to do that in memory-tight situations. On the other hand we have a
> > > situation where NOIO/FS allocations fail prematurely so there are
> > > certainly pros and cons.
> > 
> > I'm having a hard time following. Are you saying that it won't increase the
> > likelihood of allocation failures even under memory pressure but that it
> > might just make allocations take longer to succeed?
> > 
> > NOFS/IO prevents the allocation attempt from entering fs/io reclaim paths,
> > right? It would still trigger kswapd for reclaim, but can the allocation
> > attempt wait for that to finish? If so, wouldn't that constitute a
> > dependency cycle all the same?
> > 
> > All in all, percpu allocations taking longer under memory pressure is fine.
> > Becoming more prone to allocation failures, especially for GFP_KERNEL
> > callers, probably isn't great.
> > 
> 
> Wait, I think I'm interpreting this change differently. This is
> preventing the worker from allocating backing pages via GFP_KERNEL. It
> isn't preventing an allocation via alloc_percpu() from being GFP_KERNEL
> and providing those flags down to the backing page code. alloc_percpu()
> for GFP_KERNEL allocations will populate the pages before returning.

Correct.
 
> I'm reading this as potentially making atomic percpu allocations fail as
> we might be low on backing pages. This change makes the worker now need
> to wait for kswapd to give it pages. Consequently, if there are a lot of
> allocations coming in when it's low, we might burn a bit of cpu from the
> worker now.

Yes, this is a potential side effect. On the other hand, NOFS/NOIO requests
wouldn't be considered atomic anymore and they wouldn't fail that
easily. Maybe that is an odd case not worth the additional worker
overhead. As I've said, I am not familiar enough with the pcp internals to
know how often the worker is really required.

Patch

diff --git a/mm/percpu.c b/mm/percpu.c
index d8dd31a2e407..192c2a8e901d 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1758,7 +1758,7 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	gfp = current_gfp_context(gfp);
 	/* whitelisted flags that can be passed to the backing allocators */
 	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
-	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
+	is_atomic = !gfpflags_allow_blocking(gfp);
 	do_warn = !(gfp & __GFP_NOWARN);
 
 	/*
@@ -2204,7 +2204,12 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	 * to grow other chunks.  This then gives pcpu_reclaim_populated() time
 	 * to move fully free chunks to the active list to be freed if
 	 * appropriate.
+	 *
+	 * Enforce GFP_NOIO allocations because we have pcpu_alloc users
+	 * constrained to GFP_NOIO/NOFS contexts and they could form lock
+	 * dependency through pcpu_alloc_mutex
 	 */
+	unsigned int flags = memalloc_noio_save();
 	mutex_lock(&pcpu_alloc_mutex);
 	spin_lock_irq(&pcpu_lock);
 
@@ -2215,6 +2220,7 @@  static void pcpu_balance_workfn(struct work_struct *work)
 
 	spin_unlock_irq(&pcpu_lock);
 	mutex_unlock(&pcpu_alloc_mutex);
+	memalloc_noio_restore(flags);
 }
 
 /**