
[v2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations

Message ID 20221104142259.5hohev5hzvwanbi2@techsingularity.net (mailing list archive)
State New
Series [v2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations

Commit Message

Mel Gorman Nov. 4, 2022, 2:22 p.m. UTC
Changelog since v1
 o Use trylock in free_unref_page_list due to IO completion from softirq
   context

The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired using
pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
is impossible.

Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
common case at the cost of some IRQ allocations taking a slower path. If
the PCP lists need to be refilled, the zone lock still needs to disable
IRQs, but that will only happen on PCP refill and drain. If an IRQ is
raised when a PCP allocation is in progress, the trylock will fail and
the allocation will fall back to using the buddy lists directly. Note
that this may not be a universal win if an interrupt-intensive workload
also allocates heavily from interrupt context and contends heavily on
the zone->lock as a result.

[yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 122 ++++++++++++++++++++++++--------------------------------
 1 file changed, 53 insertions(+), 69 deletions(-)
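
Condensed, the allocation fast path after this patch has the following
shape (a sketch assembled from the hunks below, not a compilable excerpt;
declarations are elided, and the free path mirrors it with free_one_page()
as the fallback):

	/*
	 * Pin the task to this CPU, then trylock the per-cpu lock without
	 * touching the IRQ state.  The trylock can fail due to a parallel
	 * drain, or because the lock holder was interrupted and this is
	 * the IRQ/softirq context re-entering the allocator.
	 */
	pcp_trylock_prepare(UP_flags);
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (!pcp) {
		pcp_trylock_finish(UP_flags);
		/*
		 * Caller falls back to the buddy lists, which still take
		 * zone->lock with IRQs disabled.
		 */
		return NULL;
	}

	list = &pcp->lists[order_to_pindex(migratetype, order)];
	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
	pcp_spin_unlock(pcp);
	pcp_trylock_finish(UP_flags);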

Comments

Hugh Dickins Nov. 6, 2022, 4:42 p.m. UTC | #1
On Fri, 4 Nov 2022, Mel Gorman wrote:

> Changelog since v1
>  o Use trylock in free_unref_page_list due to IO completion from softirq
>    context
> 
> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> allocating from the PCP must not re-enter the allocator from IRQ context.
> In each instance where IRQ-reentrancy is possible, the lock is acquired using
> pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> is impossible.
> 
> Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> case at the cost of some IRQ allocations taking a slower path. If the PCP
> lists need to be refilled, the zone lock still needs to disable IRQs but
> that will only happen on PCP refill and drain. If an IRQ is raised when
> a PCP allocation is in progress, the trylock will fail and fallback to
> using the buddy lists directly. Note that this may not be a universal win
> if an interrupt-intensive workload also allocates heavily from interrupt
> context and contends heavily on the zone->lock as a result.
> 
> [yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hi Mel, I think you Cc'ed me for the purpose of giving this patch a
run, and reporting if it's not good.  That is the case, I'm afraid.

I first tried it on a v6.1-rc3, and very soon crashed under load with
something about PageBuddy in the output.  When I reverted, no problem;
I thought maybe it's dependent on other commits in akpm's tree.

Later I tried on current mm-unstable: which is living up to the name
in other ways, but once other issues were patched, it soon crashed under
load, GPF probably for non-canonical address 0xdead0000000000f8
in compact_zone < compact_zone_order < try_to_compact_pages <
.... < shmem_alloc_hugefolio < ...

I do try to exercise compaction as hard as I can, even to the point
of having a hack in what used to be called shmem_getpage_gfp(),
reverting to the stronger attempt to get huge pages, before Rik
weakened the effect of huge=always with vma_thp_gfp_mask() in 5.12:
so shmem is probably applying stronger flags for compaction than it
would in your tree - I'm using
GFP_TRANSHUGE_LIGHT | __GFP_RECLAIM | __GFP_NORETRY there.

Sorry for not giving you more info, I'm rather hoping that compaction
is relevant, and will give you a clue (maybe that capture code, which
surprised us once before??).  What I'm really trying to do is fix
the bug in Linus's rmap/TLB series, and its interaction with my
rmap series, and report back on his series (asking for temporary
drop), before next-20221107 goes down in flames.

I'd advocate for dropping this patch of yours too; but if it's giving
nobody else any trouble, I can easily continue to patch it out.

Hugh

> ---
>  mm/page_alloc.c | 122 ++++++++++++++++++++++++--------------------------------
>  1 file changed, 53 insertions(+), 69 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e20ade858e71..ae410adf36fb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -170,21 +170,12 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>  	_ret;								\
>  })
>  
> -#define pcpu_spin_lock_irqsave(type, member, ptr, flags)		\
> +#define pcpu_spin_trylock(type, member, ptr)				\
>  ({									\
>  	type *_ret;							\
>  	pcpu_task_pin();						\
>  	_ret = this_cpu_ptr(ptr);					\
> -	spin_lock_irqsave(&_ret->member, flags);			\
> -	_ret;								\
> -})
> -
> -#define pcpu_spin_trylock_irqsave(type, member, ptr, flags)		\
> -({									\
> -	type *_ret;							\
> -	pcpu_task_pin();						\
> -	_ret = this_cpu_ptr(ptr);					\
> -	if (!spin_trylock_irqsave(&_ret->member, flags)) {		\
> +	if (!spin_trylock(&_ret->member)) {				\
>  		pcpu_task_unpin();					\
>  		_ret = NULL;						\
>  	}								\
> @@ -197,27 +188,16 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>  	pcpu_task_unpin();						\
>  })
>  
> -#define pcpu_spin_unlock_irqrestore(member, ptr, flags)			\
> -({									\
> -	spin_unlock_irqrestore(&ptr->member, flags);			\
> -	pcpu_task_unpin();						\
> -})
> -
>  /* struct per_cpu_pages specific helpers. */
>  #define pcp_spin_lock(ptr)						\
>  	pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
>  
> -#define pcp_spin_lock_irqsave(ptr, flags)				\
> -	pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
> -
> -#define pcp_spin_trylock_irqsave(ptr, flags)				\
> -	pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
> +#define pcp_spin_trylock(ptr)						\
> +	pcpu_spin_trylock(struct per_cpu_pages, lock, ptr)
>  
>  #define pcp_spin_unlock(ptr)						\
>  	pcpu_spin_unlock(lock, ptr)
>  
> -#define pcp_spin_unlock_irqrestore(ptr, flags)				\
> -	pcpu_spin_unlock_irqrestore(lock, ptr, flags)
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>  DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
> @@ -1546,6 +1526,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  					struct per_cpu_pages *pcp,
>  					int pindex)
>  {
> +	unsigned long flags;
>  	int min_pindex = 0;
>  	int max_pindex = NR_PCP_LISTS - 1;
>  	unsigned int order;
> @@ -1561,8 +1542,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	/* Ensure requested pindex is drained first. */
>  	pindex = pindex - 1;
>  
> -	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
> -	spin_lock(&zone->lock);
> +	spin_lock_irqsave(&zone->lock, flags);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
>  
>  	while (count > 0) {
> @@ -1610,7 +1590,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		} while (count > 0 && !list_empty(list));
>  	}
>  
> -	spin_unlock(&zone->lock);
> +	spin_unlock_irqrestore(&zone->lock, flags);
>  }
>  
>  static void free_one_page(struct zone *zone,
> @@ -3124,10 +3104,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, unsigned int alloc_flags)
>  {
> +	unsigned long flags;
>  	int i, allocated = 0;
>  
> -	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
> -	spin_lock(&zone->lock);
> +	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
>  		struct page *page = __rmqueue(zone, order, migratetype,
>  								alloc_flags);
> @@ -3161,7 +3141,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> -	spin_unlock(&zone->lock);
> +	spin_unlock_irqrestore(&zone->lock, flags);
>  	return allocated;
>  }
>  
> @@ -3178,16 +3158,9 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
>  	if (to_drain > 0) {
> -		unsigned long flags;
> -
> -		/*
> -		 * free_pcppages_bulk expects IRQs disabled for zone->lock
> -		 * so even though pcp->lock is not intended to be IRQ-safe,
> -		 * it's needed in this context.
> -		 */
> -		spin_lock_irqsave(&pcp->lock, flags);
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -		spin_unlock_irqrestore(&pcp->lock, flags);
> +		spin_unlock(&pcp->lock);
>  	}
>  }
>  #endif
> @@ -3201,12 +3174,9 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  
>  	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>  	if (pcp->count) {
> -		unsigned long flags;
> -
> -		/* See drain_zone_pages on why this is disabling IRQs */
> -		spin_lock_irqsave(&pcp->lock, flags);
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, pcp->count, pcp, 0);
> -		spin_unlock_irqrestore(&pcp->lock, flags);
> +		spin_unlock(&pcp->lock);
>  	}
>  }
>  
> @@ -3472,7 +3442,6 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>   */
>  void free_unref_page(struct page *page, unsigned int order)
>  {
> -	unsigned long flags;
>  	unsigned long __maybe_unused UP_flags;
>  	struct per_cpu_pages *pcp;
>  	struct zone *zone;
> @@ -3500,10 +3469,10 @@ void free_unref_page(struct page *page, unsigned int order)
>  
>  	zone = page_zone(page);
>  	pcp_trylock_prepare(UP_flags);
> -	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
> +	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  	if (pcp) {
>  		free_unref_page_commit(zone, pcp, page, migratetype, order);
> -		pcp_spin_unlock_irqrestore(pcp, flags);
> +		pcp_spin_unlock(pcp);
>  	} else {
>  		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
>  	}
> @@ -3515,10 +3484,10 @@ void free_unref_page(struct page *page, unsigned int order)
>   */
>  void free_unref_page_list(struct list_head *list)
>  {
> +	unsigned long __maybe_unused UP_flags;
>  	struct page *page, *next;
>  	struct per_cpu_pages *pcp = NULL;
>  	struct zone *locked_zone = NULL;
> -	unsigned long flags;
>  	int batch_count = 0;
>  	int migratetype;
>  
> @@ -3547,11 +3516,26 @@ void free_unref_page_list(struct list_head *list)
>  
>  		/* Different zone, different pcp lock. */
>  		if (zone != locked_zone) {
> -			if (pcp)
> -				pcp_spin_unlock_irqrestore(pcp, flags);
> +			if (pcp) {
> +				pcp_spin_unlock(pcp);
> +				pcp_trylock_finish(UP_flags);
> +			}
>  
> +			/*
> +			 * trylock is necessary as pages may be getting freed
> +			 * from IRQ or SoftIRQ context after an IO completion.
> +			 */
> +			pcp_trylock_prepare(UP_flags);
> +			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> +			if (!pcp) {
> +				pcp_trylock_finish(UP_flags);
> +				list_del(&page->lru);
> +				free_one_page(page_zone(page), page,
> +					      page_to_pfn(page), 0, migratetype,
> +					      FPI_NONE);
> +				continue;
> +			}
>  			locked_zone = zone;
> -			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
>  		}
>  
>  		/*
> @@ -3566,18 +3550,23 @@ void free_unref_page_list(struct list_head *list)
>  		free_unref_page_commit(zone, pcp, page, migratetype, 0);
>  
>  		/*
> -		 * Guard against excessive IRQ disabled times when we get
> -		 * a large list of pages to free.
> +		 * Guard against excessive IRQ disabled times when freeing
> +		 * a large list of pages. Lock will be reacquired if
> +		 * necessary on the next iteration.
>  		 */
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
> -			pcp_spin_unlock_irqrestore(pcp, flags);
> +			pcp_spin_unlock(pcp);
> +			pcp_trylock_finish(UP_flags);
>  			batch_count = 0;
> -			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> +			pcp = NULL;
> +			locked_zone = NULL;
>  		}
>  	}
>  
> -	if (pcp)
> -		pcp_spin_unlock_irqrestore(pcp, flags);
> +	if (pcp) {
> +		pcp_spin_unlock(pcp);
> +		pcp_trylock_finish(UP_flags);
> +	}
>  }
>  
>  /*
> @@ -3778,15 +3767,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	struct per_cpu_pages *pcp;
>  	struct list_head *list;
>  	struct page *page;
> -	unsigned long flags;
>  	unsigned long __maybe_unused UP_flags;
>  
> -	/*
> -	 * spin_trylock may fail due to a parallel drain. In the future, the
> -	 * trylock will also protect against IRQ reentrancy.
> -	 */
> +	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
>  	pcp_trylock_prepare(UP_flags);
> -	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
> +	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  	if (!pcp) {
>  		pcp_trylock_finish(UP_flags);
>  		return NULL;
> @@ -3800,7 +3785,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	pcp->free_factor >>= 1;
>  	list = &pcp->lists[order_to_pindex(migratetype, order)];
>  	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> -	pcp_spin_unlock_irqrestore(pcp, flags);
> +	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>  	if (page) {
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -5368,7 +5353,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  			struct page **page_array)
>  {
>  	struct page *page;
> -	unsigned long flags;
>  	unsigned long __maybe_unused UP_flags;
>  	struct zone *zone;
>  	struct zoneref *z;
> @@ -5450,9 +5434,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	if (unlikely(!zone))
>  		goto failed;
>  
> -	/* Is a parallel drain in progress? */
> +	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
>  	pcp_trylock_prepare(UP_flags);
> -	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
> +	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  	if (!pcp)
>  		goto failed_irq;
>  
> @@ -5471,7 +5455,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
>  			if (!nr_account) {
> -				pcp_spin_unlock_irqrestore(pcp, flags);
> +				pcp_spin_unlock(pcp);
>  				goto failed_irq;
>  			}
>  			break;
> @@ -5486,7 +5470,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		nr_populated++;
>  	}
>  
> -	pcp_spin_unlock_irqrestore(pcp, flags);
> +	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>  
>  	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
>
Andrew Morton Nov. 6, 2022, 11:55 p.m. UTC | #2
On Sun, 6 Nov 2022 08:42:32 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> I'd advocate for dropping this patch of yours too; but if it's giving
> nobody else any trouble, I can easily continue to patch it out.

Thanks, I'll drop it for now.  Please keep plugging away at mm-unstable
until we can get it less so, then let's start moving forwards again.
Mel Gorman Nov. 7, 2022, 11:16 a.m. UTC | #3
On Sun, Nov 06, 2022 at 08:42:32AM -0800, Hugh Dickins wrote:
> On Fri, 4 Nov 2022, Mel Gorman wrote:
> 
> > Changelog since v1
> >  o Use trylock in free_unref_page_list due to IO completion from softirq
> >    context
> > 
> > The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> > allocating from the PCP must not re-enter the allocator from IRQ context.
> > In each instance where IRQ-reentrancy is possible, the lock is acquired using
> > pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> > is impossible.
> > 
> > Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> > case at the cost of some IRQ allocations taking a slower path. If the PCP
> > lists need to be refilled, the zone lock still needs to disable IRQs but
> > that will only happen on PCP refill and drain. If an IRQ is raised when
> > a PCP allocation is in progress, the trylock will fail and fallback to
> > using the buddy lists directly. Note that this may not be a universal win
> > if an interrupt-intensive workload also allocates heavily from interrupt
> > context and contends heavily on the zone->lock as a result.
> > 
> > [yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Hi Mel, I think you Cc'ed me for the purpose of giving this patch a
> run, and reporting if it's not good.  That is the case, I'm afraid.
> 

Thanks for testing and yes, you were cc'd in the hope you'd run it through a
stress test of some sort. A lot of the tests I run are performance
oriented and relatively few target functional issues.

> I first tried it on a v6.1-rc3, and very soon crashed under load with
> something about PageBuddy in the output.  When I reverted, no problem;
> I thought maybe it's dependent on other commits in akpm's tree.
> 

Can you tell me what sort of load it's under? I would like to add something
similar to the general battery of tests I run against all patches affecting
the page allocator. Even if this is just a crude shell script, it would
be enough for me to work with and incorporate into mmtests. Once it's in
place, if I find mm-unstable has its own problems, bisection can
brute-force the problem.

> Later I tried on current mm-unstable: which is living up to the name
> in other ways, but when other issues patched, it soon crashed under
> load, GPF probably for non-canonical address 0xdead0000000000f8
> in compact_zone < compact_zone_order < try_to_compact_pages <
> .... < shmem_alloc_hugefolio < ...
> 

0xdead000* looks like ILLEGAL_POINTER_VALUE, which is used as a poison
value, so a full list of the debugging options you apply for the stress
test would also be useful.
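
For reference, the poisoning that produces that prefix looks roughly like
this (quoted from include/linux/poison.h as I recall it, so treat the
details as approximate; x86_64 sets CONFIG_ILLEGAL_POINTER_VALUE to
0xdead000000000000):

	#ifdef CONFIG_ILLEGAL_POINTER_VALUE
	# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
	#else
	# define POISON_POINTER_DELTA 0
	#endif

	#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
	#define LIST_POISON2  ((void *) 0x122 + POISON_POINTER_DELTA)

An oops on a 0xdead... address at a small offset is consistent with
chasing a deleted or never-initialised list pointer through some struct
member.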

> I do try to exercise compaction as hard as I can, even to the point
> of having a hack in what used to be called shmem_getpage_gfp(),
> reverting to the stronger attempt to get huge pages, before Rik
> weakened the effect of huge=always with vma_thp_gfp_mask() in 5.12:
> so shmem is probably applying stronger flags for compaction than it
> would in your tree - I'm using
> GFP_TRANSHUGE_LIGHT | __GFP_RECLAIM | __GFP_NORETRY there.
> 
> Sorry for not giving you more info, I'm rather hoping that compaction
> is relevant, and will give you a clue (maybe that capture code, which
> surprised us once before??). 

While capture is a possibility, it's a bad fit for this patch because
pages are captured under task context.

> What I'm really trying to do is fix
> the bug in Linus's rmap/TLB series, and its interaction with my
> rmap series, and report back on his series (asking for temporary
> drop), before next-20221107 goes down in flames.
> 
> I'd advocate for dropping this patch of yours too; but if it's giving
> nobody else any trouble, I can easily continue to patch it out.
> 

Given that you tested the patch against v6.1-rc3, it's clear that the
patch on its own causes problems. Having a reproduction case will help
me figure out why.
Hugh Dickins Nov. 8, 2022, 9:40 a.m. UTC | #4
On Mon, 7 Nov 2022, Mel Gorman wrote:
> On Sun, Nov 06, 2022 at 08:42:32AM -0800, Hugh Dickins wrote:
> > On Fri, 4 Nov 2022, Mel Gorman wrote:
> > 
> > > Changelog since v1
> > >  o Use trylock in free_unref_page_list due to IO completion from softirq
> > >    context
> > > 
> > > The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> > > allocating from the PCP must not re-enter the allocator from IRQ context.
> > > In each instance where IRQ-reentrancy is possible, the lock is acquired using
> > > pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> > > is impossible.
> > > 
> > > Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> > > case at the cost of some IRQ allocations taking a slower path. If the PCP
> > > lists need to be refilled, the zone lock still needs to disable IRQs but
> > > that will only happen on PCP refill and drain. If an IRQ is raised when
> > > a PCP allocation is in progress, the trylock will fail and fallback to
> > > using the buddy lists directly. Note that this may not be a universal win
> > > if an interrupt-intensive workload also allocates heavily from interrupt
> > > context and contends heavily on the zone->lock as a result.
> > > 
> > > [yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > 
> > Hi Mel, I think you Cc'ed me for the purpose of giving this patch a
> > run, and reporting if it's not good.  That is the case, I'm afraid.
> > 
> 
> Thanks for testing and yes, you were cc'd in the hope you'd run it through a
> stress test of some sort. A lot of the test I run are performance orientated
> and relatively few target functional issues.
> 
> > I first tried it on a v6.1-rc3, and very soon crashed under load with
> > something about PageBuddy in the output.  When I reverted, no problem;
> > I thought maybe it's dependent on other commits in akpm's tree.
> > 
> 
> Can you tell me what sort of load it's under? I would like to add something
> similar to the general battery of tests I run all patches affecting the
> page allocator through. Even if this is just a crude shell script, it would
> be enough for me to work with and incorporate into mmtests. If it's there
> and I find mm-unstable has its own problems, bisection can brute force
> the problem.
> 
> > Later I tried on current mm-unstable: which is living up to the name
> > in other ways, but when other issues patched, it soon crashed under
> > load, GPF probably for non-canonical address 0xdead0000000000f8
> > in compact_zone < compact_zone_order < try_to_compact_pages <
> > .... < shmem_alloc_hugefolio < ...
> > 
> 
> 0xdead000* looks like ILLEGAL_POINTER_VALUE which is used as a poison
> value so a full list of debugging options you apply for the stress test
> would also be useful.
> 
> > I do try to exercise compaction as hard as I can, even to the point
> > of having a hack in what used to be called shmem_getpage_gfp(),
> > reverting to the stronger attempt to get huge pages, before Rik
> > weakened the effect of huge=always with vma_thp_gfp_mask() in 5.12:
> > so shmem is probably applying stronger flags for compaction than it
> > would in your tree - I'm using
> > GFP_TRANSHUGE_LIGHT | __GFP_RECLAIM | __GFP_NORETRY there.
> > 
> > Sorry for not giving you more info, I'm rather hoping that compaction
> > is relevant, and will give you a clue (maybe that capture code, which
> > surprised us once before??). 
> 
> While capture is a possibility, it's a bad fit for this patch because
> pages are captured under task context.
> 
> > What I'm really trying to do is fix
> > the bug in Linus's rmap/TLB series, and its interaction with my
> > rmap series, and report back on his series (asking for temporary
> > drop), before next-20221107 goes down in flames.
> > 
> > I'd advocate for dropping this patch of yours too; but if it's giving
> > nobody else any trouble, I can easily continue to patch it out.
> > 
> 
> Given that you tested the patch against v6.1-rc3, it's clear that the
> patch on its own causes problems. Having a reproduction case will help
> me figure out why.

Sorry for appearing to ignore your requests all day, Mel, but I just
had slightly more confidence in debugging it here, than in conveying
all the details of my load (some other time), and my config, and
actually enabling you to reproduce it.  Had to focus.

Got it at last: free_unref_page_list() has been surviving on the
"next" in its list_for_each_entry_safe() for years(?), without doing
a proper list_del() in that block: only with your list_del() before
free_one_page() did it start to go so very wrong.  (Or was there any
way in which it might already have been wrong, and needs backport?)

Here are a few things to fold into your patch: I've moved your
list_del() up to cover both cases, that's the important fix;
but prior to finding that, I did notice that a "locked_zone = NULL"
was needed, and was very disappointed when that didn't fix the issues;
zone instead of page_zone(page), batch_count = 0, and the lock hold
times comment were just improvements I noticed along the way.

Now running fine, and can be reinstated in akpm's tree.

Hugh

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3515,6 +3515,8 @@ void free_unref_page_list(struct list_he
 	list_for_each_entry_safe(page, next, list, lru) {
 		struct zone *zone = page_zone(page);
 
+		list_del(&page->lru);
+
 		/* Different zone, different pcp lock. */
 		if (zone != locked_zone) {
 			if (pcp) {
@@ -3530,13 +3532,13 @@ void free_unref_page_list(struct list_he
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (!pcp) {
 				pcp_trylock_finish(UP_flags);
-				list_del(&page->lru);
-				free_one_page(page_zone(page), page,
-					      page_to_pfn(page), 0, migratetype,
-					      FPI_NONE);
+				free_one_page(zone, page, page_to_pfn(page),
+					      0, migratetype, FPI_NONE);
+				locked_zone = NULL;
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 
 		/*
@@ -3551,7 +3553,7 @@ void free_unref_page_list(struct list_he
 		free_unref_page_commit(zone, pcp, page, migratetype, 0);
 
 		/*
-		 * Guard against excessive IRQ disabled times when freeing
+		 * Guard against excessive lock hold times when freeing
 		 * a large list of pages. Lock will be reacquired if
 		 * necessary on the next iteration.
 		 */
Mel Gorman Nov. 8, 2022, 1:57 p.m. UTC | #5
On Tue, Nov 08, 2022 at 01:40:48AM -0800, Hugh Dickins wrote:
> On Mon, 7 Nov 2022, Mel Gorman wrote:
> > On Sun, Nov 06, 2022 at 08:42:32AM -0800, Hugh Dickins wrote:
> > > On Fri, 4 Nov 2022, Mel Gorman wrote:
> > > What I'm really trying to do is fix
> > > the bug in Linus's rmap/TLB series, and its interaction with my
> > > rmap series, and report back on his series (asking for temporary
> > > drop), before next-20221107 goes down in flames.
> > > 
> > > I'd advocate for dropping this patch of yours too; but if it's giving
> > > nobody else any trouble, I can easily continue to patch it out.
> > > 
> > 
> > Given that you tested the patch against v6.1-rc3, it's clear that the
> > patch on its own causes problems. Having a reproduction case will help
> > me figure out why.
> 
> Sorry for appearing to ignore your requests all day, Mel, but I just
> had slightly more confidence in debugging it here, than in conveying
> all the details of my load (some other time), and my config, and
> actually enabling you to reproduce it.  Had to focus.
> 

Ok, understood. If you ever get the chance to give me even a rough
description, I'd appreciate it but I understand that it's a distraction
at the moment. Thanks for taking the time to debug this in your test
environment.

> Got it at last: free_unref_page_list() has been surviving on the
> "next" in its list_for_each_entry_safe() for years(?), without doing
> a proper list_del() in that block: only with your list_del() before
> free_one_page() did it start to go so very wrong.  (Or was there any
> way in which it might already have been wrong, and needs backport?)
> 

I think it happened to work by coincidence since forever because it was
always adding to the same list. Even though the temporary list was
thrashed, it was always either ignored or reinitialised afterwards.
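
To illustrate, a toy userspace model (hand-rolled stand-ins for the list
helpers, nothing here is kernel code): moving entries off a list with a
bare list_add() leaves the source list thrashed but harmless as long as
it is then ignored, while a later list_del() follows those stale
neighbour links and writes into whatever list the neighbour now belongs
to:

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
#define LIST_HEAD_INIT(n) { &(n), &(n) }

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;	/* writes through stale prev */
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;	/* the kernel poisons these */
}

struct page { int pfn; struct list_head lru; };

int main(void)
{
	struct list_head to_free = LIST_HEAD_INIT(to_free);
	struct list_head pcp = LIST_HEAD_INIT(pcp);
	struct page pg[3] = { { .pfn = 1 }, { .pfn = 2 }, { .pfn = 3 } };
	struct list_head *p;
	int n = 0;

	for (int i = 0; i < 3; i++)
		list_add(&pg[i].lru, &to_free);	/* walk order: pg[2], pg[1], pg[0] */

	/*
	 * "Iteration" 1, PCP path, old style: no list_del().  pg[2]'s links
	 * now belong to pcp, but to_free and pg[1].lru.prev still point at it.
	 */
	list_add(&pg[2].lru, &pcp);

	/*
	 * "Iteration" 2, free_one_page() path: list_del() follows pg[1]'s
	 * stale prev (pg[2], now a pcp entry) and rewrites its next, so pcp
	 * is corrupted even though to_free itself was disposable.
	 */
	list_del(&pg[1].lru);

	/* pcp should hold exactly one entry; a bounded walk says otherwise. */
	for (p = pcp.next; p != &pcp && n < 10; p = p->next)
		n++;
	printf("walked %d node(s)%s\n", n,
	       p == &pcp ? "" : ", never returning to the pcp list head");
	return 0;
}

The real loop survives the thrashing because list_for_each_entry_safe()
caches "next" before the body runs; making the list_del() unconditional
keeps the neighbour links sane for both paths.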

I've made this a standalone patch which is at the end of the mail. I can
change the Reported-by to a Signed-off-by if you agree. While it doesn't
fix anything today, it may still be worth documenting in git history why
that list_del exists.

> Here's a few things to fold into your patch: I've moved your
> list_del() up to cover both cases, that's the important fix;
> but prior to finding that, I did notice a "locked_zone = NULL"
> needed, and was very disappointed when that didn't fix the issues;

This is a real fix, although even without it the code should happen to
work in practice, which is less than ideal because it's fragile.

> zone instead of page_zone(page), batch_count = 0, lock hold times
> were just improvements I noticed along the way.
> 

The first is a small optimisation, the second addresses a corner case where
the lock may be released/reacquired too soon after switching from one zone
to another, and the comment fix is valid. I've simply folded these in
directly.

The standalone patch is below, I'm rerunning tests before posting a
short v3 series.

Thanks!

--8<--
mm/page_alloc: Always remove pages from temporary list

free_unref_page_list() has neglected to remove pages properly from the list
of pages to free since forever. It works by coincidence because list_add
happened to do the right thing adding the pages to just the PCP lists.
However, a later patch added pages to either the PCP list or the zone list
but only properly deleted the page from the list in one path leading to
list corruption and a subsequent failure. As a preparation patch, always
delete the pages from one list properly before adding to another.  On its
own, this fixes nothing although it adds a fractional amount of overhead
but is critical to the next patch.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 218b28ee49ed..1ec54173b8d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3546,6 +3546,8 @@ void free_unref_page_list(struct list_head *list)
 	list_for_each_entry_safe(page, next, list, lru) {
 		struct zone *zone = page_zone(page);
 
+		list_del(&page->lru);
+
 		/* Different zone, different pcp lock. */
 		if (zone != locked_zone) {
 			if (pcp)
Hugh Dickins Nov. 8, 2022, 3:47 p.m. UTC | #6
On Tue, 8 Nov 2022, Mel Gorman wrote:
> On Tue, Nov 08, 2022 at 01:40:48AM -0800, Hugh Dickins wrote:
> > On Mon, 7 Nov 2022, Mel Gorman wrote:
> > > On Sun, Nov 06, 2022 at 08:42:32AM -0800, Hugh Dickins wrote:
> > > > On Fri, 4 Nov 2022, Mel Gorman wrote:
> > > > What I'm really trying to do is fix
> > > > the bug in Linus's rmap/TLB series, and its interaction with my
> > > > rmap series, and report back on his series (asking for temporary
> > > > drop), before next-20221107 goes down in flames.
> > > > 
> > > > I'd advocate for dropping this patch of yours too; but if it's giving
> > > > nobody else any trouble, I can easily continue to patch it out.
> > > > 
> > > 
> > > Given that you tested the patch against v6.1-rc3, it's clear that the
> > > patch on its own causes problems. Having a reproduction case will help
> > > me figure out why.
> > 
> > Sorry for appearing to ignore your requests all day, Mel, but I just
> > had slightly more confidence in debugging it here, than in conveying
> > all the details of my load (some other time), and my config, and
> > actually enabling you to reproduce it.  Had to focus.
> > 
> 
> Ok, understood. If you ever get the chance to give me even a rough
> description, I'd appreciate it but I understand that it's a distraction
> at the moment. Thanks for taking the time to debug this in your test
> environment.

I have sent it out two or three times before (I prefer privately, so my
limited shell scripting skills are not on public view!): later in the day
I'll look out the last time I sent it, and just forward that to you.
No doubt I'll have tweaked it here and there in my own usage since then,
but that will be good enough to show what I try for.

Wonderful if it could get into something like mmtests, but I should
warn that attempts to incorporate it into other frameworks have not
been successful in the past.  Maybe it just needs too much handholding:
getting the balance right, so that it's stressing without quite OOMing,
is difficult in any new environment.  Perhaps it should restart after
OOM, I just never tried to go that way.

> 
> > Got it at last: free_unref_page_list() has been surviving on the
> > "next" in its list_for_each_entry_safe() for years(?), without doing
> > a proper list_del() in that block: only with your list_del() before
> > free_one_page() did it start to go so very wrong.  (Or was there any
> > way in which it might already have been wrong, and needs backport?)
> > 
> 
> I think it happened to work by coincidence since forever because it was
> always adding to the same list. Even though the temporary list was
> thrashed, it is always either ignored or reinitialised.
> 
> I've made this a standalone patch which is at the end of the mail. I can
> change the Reported-by to a Signed-off-by if you agree. While it doesn't
> fix anything today, it may still be worth documenting in git history why
> that list_del exists.

Yes, worth separating out to its own git commit.  Just continue with what
you already have, Reported-by me, Signed-off-by you, thanks for asking.

> 
> > Here's a few things to fold into your patch: I've moved your
> > list_del() up to cover both cases, that's the important fix;
> > but prior to finding that, I did notice a "locked_zone = NULL"
> > needed, and was very disappointed when that didn't fix the issues;
> 
> This is a real fix but it also should happen to work properly which is
> less than ideal because it's fragile.

I thought that if the next page's zone matched the stale locked_zone,
then it would head into free_unref_page_commit() with NULL pcp, and
oops there?  But I've certainly never seen that happen (despite first
assuming it was responsible for my crashes), so maybe I read it wrong.

> 
> > zone instead of page_zone(page), batch_count = 0, lock hold times
> > were just improvements I noticed along the way.
> > 
> 
> The first is a small optimisation, the second addresses a corner case where
> the lock may be released/reacquired too soon after switching from one zone to
> another and the comment fix is valid. I've simply folded these in directly.
> 
> The standalone patch is below, I'm rerunning tests before posting a
> short v3 series.

Great, thanks Mel.

Hugh

> 
> Thanks!
> 
> --8<--
> mm/page_alloc: Always remove pages from temporary list
> 
> free_unref_page_list() has neglected to remove pages properly from the list
> of pages to free since forever. It works by coincidence because list_add
> happened to do the right thing adding the pages to just the PCP lists.
> However, a later patch added pages to either the PCP list or the zone list
> but only properly deleted the page from the list in one path leading to
> list corruption and a subsequent failure. As a preparation patch, always
> delete the pages from one list properly before adding to another.  On its
> own, this fixes nothing although it adds a fractional amount of overhead
> but is critical to the next patch.
> 
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 218b28ee49ed..1ec54173b8d4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3546,6 +3546,8 @@ void free_unref_page_list(struct list_head *list)
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		struct zone *zone = page_zone(page);
>  
> +		list_del(&page->lru);
> +
>  		/* Different zone, different pcp lock. */
>  		if (zone != locked_zone) {
>  			if (pcp)
> 
> -- 
> Mel Gorman
> SUSE Labs
Mel Gorman Nov. 9, 2022, 2:24 p.m. UTC | #7
On Tue, Nov 08, 2022 at 07:47:12AM -0800, Hugh Dickins wrote:
> On Tue, 8 Nov 2022, Mel Gorman wrote:
> > On Tue, Nov 08, 2022 at 01:40:48AM -0800, Hugh Dickins wrote:
> > > On Mon, 7 Nov 2022, Mel Gorman wrote:
> > > > On Sun, Nov 06, 2022 at 08:42:32AM -0800, Hugh Dickins wrote:
> > > > > On Fri, 4 Nov 2022, Mel Gorman wrote:
> > > > > What I'm really trying to do is fix
> > > > > the bug in Linus's rmap/TLB series, and its interaction with my
> > > > > rmap series, and report back on his series (asking for temporary
> > > > > drop), before next-20221107 goes down in flames.
> > > > > 
> > > > > I'd advocate for dropping this patch of yours too; but if it's giving
> > > > > nobody else any trouble, I can easily continue to patch it out.
> > > > > 
> > > > 
> > > > Given that you tested the patch against v6.1-rc3, it's clear that the
> > > > patch on its own causes problems. Having a reproduction case will help
> > > > me figure out why.
> > > 
> > > Sorry for appearing to ignore your requests all day, Mel, but I just
> > > had slightly more confidence in debugging it here, than in conveying
> > > all the details of my load (some other time), and my config, and
> > > actually enabling you to reproduce it.  Had to focus.
> > > 
> > 
> > Ok, understood. If you ever get the chance to give me even a rough
> > description, I'd appreciate it but I understand that it's a distraction
> > at the moment. Thanks for taking the time to debug this in your test
> > environment.
> 
> I have sent it out two or three times before (I prefer privately, so my
> limited shell scripting skills are not on public view!): later in the day
> I'll look out the last time I sent it, and just forward that to you.

Thanks.

> No doubt I'll have tweaked it here and there in my own usage since then,
> but that will be good enough to show what I try for.
> 

It should be. I'll know it's at least partially successful if the v2
crashes in the same way you reported.

> Wonderful if it could get into something like mmtests, but I should
> warn that attempts to incorporate it into other frameworks have not
> been successful in the past. 

I'm stubborn enough to try. Even if the end result is ugly, it will not
be the first ugly hack in mmtests that still served a useful purpose.

> Maybe it just needs too much handholding:
> getting the balance right, so that it's stressing without quite OOMing,
> is difficult in any new environment.  Perhaps it should restart after
> OOM, I just never tried to go that way.
> 

As dmesg is always captured after a test, I can set it to detect either
an outright failure or an OOM kill in the dmesg and flag the results as
invalid. Even if this is less than perfect, I can add notes in the config
file on the hazards. If nothing else, it should note that even if it passes,
it's not a guarantee that everything is ok.

> > I've made this a standalone patch which is at the end of the mail. I can
> > change the Reported-by to a Signed-off-by if you agree. While it doesn't
> > fix anything today, it may still be worth documenting in git history why
> > that list_del exists.
> 
> Yes, worth separating out to its own git commit.  Just continue with what
> you already have, Reported-by me, Signed-off-by you, thanks for asking.
> 

Will do.

> > 
> > > Here's a few things to fold into your patch: I've moved your
> > > list_del() up to cover both cases, that's the important fix;
> > > but prior to finding that, I did notice a "locked_zone = NULL"
> > > needed, and was very disappointed when that didn't fix the issues;
> > 
> > This is a real fix but it also should happen to work properly which is
> > less than ideal because it's fragile.
> 
> I thought that if the next page's zone matched the stale locked_zone,
> then it would head into free_unref_page_commit() with NULL pcp, and
> oops there?  But I've certainly never seen that happen (despite first
> assuming it was responsible for my crashes), so maybe I read it wrong.
> 

It's possible that would happen, just very unlikely. Reclaim would generally
pass in pages from the same zone but something like release_pages could
hit it. Triggering a NULL reference would require that the list being
freed consist of pages belonging to two zones, interleaved, which requires
a fairly specific allocation pattern, and then that the free be interrupted
by an allocation request from IRQ context at just the right time.
MPOL_INTERLEAVE might make the necessary allocation pattern easier to hit,
I suppose. Either way, the fix is definitely needed even if I think the
bug would be hard to hit in practice.
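
For what it's worth, the sequence I have in mind goes something like this
(zones A and B purely illustrative, tracing the v2 loop without the
locked_zone reset):

	page 1, zone A: A != locked_zone (NULL); trylock succeeds,
	                pcp = A's pageset, locked_zone = A
	page 2, zone B: B != locked_zone; unlock A, trylock on B fails
	                because an IRQ allocation holds it, so the page
	                goes to free_one_page(); pcp is now NULL but
	                locked_zone still says A
	page 3, zone A: A == stale locked_zone; the relock is skipped and
	                free_unref_page_commit(zone, pcp, ...) runs with
	                pcp == NULL

Resetting locked_zone when the trylock fails closes that window.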

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e20ade858e71..ae410adf36fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -170,21 +170,12 @@  static DEFINE_MUTEX(pcp_batch_high_lock);
 	_ret;								\
 })
 
-#define pcpu_spin_lock_irqsave(type, member, ptr, flags)		\
+#define pcpu_spin_trylock(type, member, ptr)				\
 ({									\
 	type *_ret;							\
 	pcpu_task_pin();						\
 	_ret = this_cpu_ptr(ptr);					\
-	spin_lock_irqsave(&_ret->member, flags);			\
-	_ret;								\
-})
-
-#define pcpu_spin_trylock_irqsave(type, member, ptr, flags)		\
-({									\
-	type *_ret;							\
-	pcpu_task_pin();						\
-	_ret = this_cpu_ptr(ptr);					\
-	if (!spin_trylock_irqsave(&_ret->member, flags)) {		\
+	if (!spin_trylock(&_ret->member)) {				\
 		pcpu_task_unpin();					\
 		_ret = NULL;						\
 	}								\
@@ -197,27 +188,16 @@  static DEFINE_MUTEX(pcp_batch_high_lock);
 	pcpu_task_unpin();						\
 })
 
-#define pcpu_spin_unlock_irqrestore(member, ptr, flags)			\
-({									\
-	spin_unlock_irqrestore(&ptr->member, flags);			\
-	pcpu_task_unpin();						\
-})
-
 /* struct per_cpu_pages specific helpers. */
 #define pcp_spin_lock(ptr)						\
 	pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
 
-#define pcp_spin_lock_irqsave(ptr, flags)				\
-	pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
-
-#define pcp_spin_trylock_irqsave(ptr, flags)				\
-	pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+#define pcp_spin_trylock(ptr)						\
+	pcpu_spin_trylock(struct per_cpu_pages, lock, ptr)
 
 #define pcp_spin_unlock(ptr)						\
 	pcpu_spin_unlock(lock, ptr)
 
-#define pcp_spin_unlock_irqrestore(ptr, flags)				\
-	pcpu_spin_unlock_irqrestore(lock, ptr, flags)
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1546,6 +1526,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp,
 					int pindex)
 {
+	unsigned long flags;
 	int min_pindex = 0;
 	int max_pindex = NR_PCP_LISTS - 1;
 	unsigned int order;
@@ -1561,8 +1542,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
 
-	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	while (count > 0) {
@@ -1610,7 +1590,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (count > 0 && !list_empty(list));
 	}
 
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -3124,10 +3104,10 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
+	unsigned long flags;
 	int i, allocated = 0;
 
-	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -3161,7 +3141,7 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return allocated;
 }
 
@@ -3178,16 +3158,9 @@  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0) {
-		unsigned long flags;
-
-		/*
-		 * free_pcppages_bulk expects IRQs disabled for zone->lock
-		 * so even though pcp->lock is not intended to be IRQ-safe,
-		 * it's needed in this context.
-		 */
-		spin_lock_irqsave(&pcp->lock, flags);
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock_irqrestore(&pcp->lock, flags);
+		spin_unlock(&pcp->lock);
 	}
 }
 #endif
@@ -3201,12 +3174,9 @@  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count) {
-		unsigned long flags;
-
-		/* See drain_zone_pages on why this is disabling IRQs */
-		spin_lock_irqsave(&pcp->lock, flags);
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
-		spin_unlock_irqrestore(&pcp->lock, flags);
+		spin_unlock(&pcp->lock);
 	}
 }
 
@@ -3472,7 +3442,6 @@  static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
  */
 void free_unref_page(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
 	struct zone *zone;
@@ -3500,10 +3469,10 @@  void free_unref_page(struct page *page, unsigned int order)
 
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
 		free_unref_page_commit(zone, pcp, page, migratetype, order);
-		pcp_spin_unlock_irqrestore(pcp, flags);
+		pcp_spin_unlock(pcp);
 	} else {
 		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
 	}
@@ -3515,10 +3484,10 @@  void free_unref_page(struct page *page, unsigned int order)
  */
 void free_unref_page_list(struct list_head *list)
 {
+	unsigned long __maybe_unused UP_flags;
 	struct page *page, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
-	unsigned long flags;
 	int batch_count = 0;
 	int migratetype;
 
@@ -3547,11 +3516,26 @@  void free_unref_page_list(struct list_head *list)
 
 		/* Different zone, different pcp lock. */
 		if (zone != locked_zone) {
-			if (pcp)
-				pcp_spin_unlock_irqrestore(pcp, flags);
+			if (pcp) {
+				pcp_spin_unlock(pcp);
+				pcp_trylock_finish(UP_flags);
+			}
 
+			/*
+			 * trylock is necessary as pages may be getting freed
+			 * from IRQ or SoftIRQ context after an IO completion.
+			 */
+			pcp_trylock_prepare(UP_flags);
+			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
+			if (!pcp) {
+				pcp_trylock_finish(UP_flags);
+				list_del(&page->lru);
+				free_one_page(page_zone(page), page,
+					      page_to_pfn(page), 0, migratetype,
+					      FPI_NONE);
+				continue;
+			}
 			locked_zone = zone;
-			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
 		}
 
 		/*
@@ -3566,18 +3550,23 @@  void free_unref_page_list(struct list_head *list)
 		free_unref_page_commit(zone, pcp, page, migratetype, 0);
 
 		/*
-		 * Guard against excessive IRQ disabled times when we get
-		 * a large list of pages to free.
+		 * Guard against excessive IRQ disabled times when freeing
+		 * a large list of pages. Lock will be reacquired if
+		 * necessary on the next iteration.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			pcp_spin_unlock_irqrestore(pcp, flags);
+			pcp_spin_unlock(pcp);
+			pcp_trylock_finish(UP_flags);
 			batch_count = 0;
-			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
+			pcp = NULL;
+			locked_zone = NULL;
 		}
 	}
 
-	if (pcp)
-		pcp_spin_unlock_irqrestore(pcp, flags);
+	if (pcp) {
+		pcp_spin_unlock(pcp);
+		pcp_trylock_finish(UP_flags);
+	}
 }
 
 /*
@@ -3778,15 +3767,11 @@  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
 	struct page *page;
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 
-	/*
-	 * spin_trylock may fail due to a parallel drain. In the future, the
-	 * trylock will also protect against IRQ reentrancy.
-	 */
+	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (!pcp) {
 		pcp_trylock_finish(UP_flags);
 		return NULL;
@@ -3800,7 +3785,7 @@  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
-	pcp_spin_unlock_irqrestore(pcp, flags);
+	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -5368,7 +5353,6 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			struct page **page_array)
 {
 	struct page *page;
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 	struct zone *zone;
 	struct zoneref *z;
@@ -5450,9 +5434,9 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (unlikely(!zone))
 		goto failed;
 
-	/* Is a parallel drain in progress? */
+	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (!pcp)
 		goto failed_irq;
 
@@ -5471,7 +5455,7 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
-				pcp_spin_unlock_irqrestore(pcp, flags);
+				pcp_spin_unlock(pcp);
 				goto failed_irq;
 			}
 			break;
@@ -5486,7 +5470,7 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	pcp_spin_unlock_irqrestore(pcp, flags);
+	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);