
[1/1] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations

Message ID 20220824141802.23395-1-mgorman@techsingularity.net (mailing list archive)
State New
Headers show
Series [1/1] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations

Commit Message

Mel Gorman Aug. 24, 2022, 2:18 p.m. UTC
The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired using
pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
is impossible.

Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
case at the cost of some IRQ allocations taking a slower path. If the PCP
lists need to be refilled, the zone lock still needs to disable IRQs, but
that will only happen on PCP refill and drain. If an IRQ is raised while
a PCP allocation is in progress, the trylock will fail and the allocation
will fall back to using the buddy lists directly. Note that this may not be
a universal win if an interrupt-intensive workload also allocates heavily
from interrupt context and contends heavily on the zone->lock as a result.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 93 +++++++++++++++----------------------------------
 1 file changed, 28 insertions(+), 65 deletions(-)

Comments

Yu Zhao Aug. 25, 2022, 4:58 a.m. UTC | #1
On Wed, Aug 24, 2022 at 8:18 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> allocating from the PCP must not re-enter the allocator from IRQ context.
> In each instance where IRQ-reentrancy is possible, the lock is acquired using
> pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> is impossible.
>
> Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> case at the cost of some IRQ allocations taking a slower path. If the PCP
> lists need to be refilled, the zone lock still needs to disable IRQs but
> that will only happen on PCP refill and drain. If an IRQ is raised when
> a PCP allocation is in progress, the trylock will fail and fallback to
> using the buddy lists directly. Note that this may not be a universal win
> if an interrupt-intensive workload also allocates heavily from interrupt
> context and contends heavily on the zone->lock as a result.

Hi,

This patch caused the following warning. Please take a look.

Thanks.

  WARNING: inconsistent lock state
  6.0.0-dbg-DEV #1 Tainted: G S      W  O
  --------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  ksoftirqd/2/27 [HC0[0]:SC1[1]:HE0:SE0] takes:
  ffff9ce5002b8c58 (&pcp->lock){+.?.}-{2:2}, at:
free_unref_page_list+0x1ac/0x260
  {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0xb3/0x190
    _raw_spin_trylock+0x46/0x60
    rmqueue_pcplist+0x42/0x1d0
    rmqueue+0x58/0x590
    get_page_from_freelist+0x2c3/0x510
    __alloc_pages+0x126/0x210
    alloc_page_interleave+0x13/0x90
    alloc_pages+0xfb/0x250
    __get_free_pages+0x11/0x30
    __pte_alloc_kernel+0x1c/0xc0
    vmap_p4d_range+0x448/0x690
    ioremap_page_range+0xdc/0x130
    __ioremap_caller+0x258/0x320
    ioremap_cache+0x17/0x20
    acpi_os_map_iomem+0x12f/0x1d0
    acpi_os_map_memory+0xe/0x10
    acpi_tb_acquire_table+0x42/0x6e
    acpi_tb_validate_temp_table+0x43/0x55
    acpi_tb_verify_temp_table+0x31/0x238
    acpi_reallocate_root_table+0xe6/0x158
    acpi_early_init+0x4f/0xd1
    start_kernel+0x32a/0x44f
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x124/0x12b
    secondary_startup_64_no_verify+0xe6/0xeb
  irq event stamp: 961581
  hardirqs last  enabled at (961580): [<ffffffff95b2cde5>]
_raw_spin_unlock_irqrestore+0x35/0x50
  hardirqs last disabled at (961581): [<ffffffff951c1998>]
folio_rotate_reclaimable+0xf8/0x310
  softirqs last  enabled at (961490): [<ffffffff94fa40d8>]
run_ksoftirqd+0x48/0x90
  softirqs last disabled at (961495): [<ffffffff94fa40d8>]
run_ksoftirqd+0x48/0x90

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&pcp->lock);
    <Interrupt>
      lock(&pcp->lock);

   *** DEADLOCK ***

  1 lock held by ksoftirqd/2/27:
   #0: ffff9ce5002adab8 (lock#7){..-.}-{2:2}, at: local_lock_acquire+0x0/0x70

  stack backtrace:
  CPU: 2 PID: 27 Comm: ksoftirqd/2 Tainted: G S      W  O       6.0.0-dbg-DEV #1
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6c/0x9a
   dump_stack+0x10/0x12
   print_usage_bug+0x374/0x380
   mark_lock_irq+0x4a8/0x4c0
   ? save_trace+0x40/0x2c0
   mark_lock+0x137/0x1b0
   __lock_acquire+0x5bf/0x3540
   ? __SCT__tp_func_virtio_transport_recv_pkt+0x7/0x8
   ? lock_is_held_type+0x96/0x130
   ? rcu_read_lock_sched_held+0x49/0xa0
   lock_acquire+0xb3/0x190
   ? free_unref_page_list+0x1ac/0x260
   _raw_spin_lock+0x2f/0x40
   ? free_unref_page_list+0x1ac/0x260
   free_unref_page_list+0x1ac/0x260
   release_pages+0x90a/0xa70
   ? folio_batch_move_lru+0x138/0x190
   ? local_lock_acquire+0x70/0x70
   folio_batch_move_lru+0x147/0x190
   folio_rotate_reclaimable+0x168/0x310
   folio_end_writeback+0x5d/0x200
   end_page_writeback+0x18/0x40
   end_swap_bio_write+0x100/0x2b0
   ? bio_chain+0x30/0x30
   bio_endio+0xd8/0xf0
   blk_update_request+0x173/0x340
   scsi_end_request+0x2a/0x300
   scsi_io_completion+0x66/0x140
   scsi_finish_command+0xc0/0xf0
   scsi_complete+0xec/0x110
   blk_done_softirq+0x53/0x70
   __do_softirq+0x1e2/0x357
   ? run_ksoftirqd+0x48/0x90
   run_ksoftirqd+0x48/0x90
   smpboot_thread_fn+0x14b/0x1c0
   kthread+0xe6/0x100
   ? cpu_report_death+0x50/0x50
   ? kthread_blkcg+0x40/0x40
   ret_from_fork+0x1f/0x30
   </TASK>
Mel Gorman Aug. 25, 2022, 9:11 a.m. UTC | #2
On Wed, Aug 24, 2022 at 10:58:26PM -0600, Yu Zhao wrote:
> On Wed, Aug 24, 2022 at 8:18 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> > allocating from the PCP must not re-enter the allocator from IRQ context.
> > In each instance where IRQ-reentrancy is possible, the lock is acquired using
> > pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> > is impossible.
> >
> > Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> > case at the cost of some IRQ allocations taking a slower path. If the PCP
> > lists need to be refilled, the zone lock still needs to disable IRQs but
> > that will only happen on PCP refill and drain. If an IRQ is raised when
> > a PCP allocation is in progress, the trylock will fail and fallback to
> > using the buddy lists directly. Note that this may not be a universal win
> > if an interrupt-intensive workload also allocates heavily from interrupt
> > context and contends heavily on the zone->lock as a result.
> 
> Hi,
> 
> This patch caused the following warning. Please take a look.
> 

Thanks. I see the patch is already taken out of mm-unstable so back to
the drawing board!
Mel Gorman Oct. 10, 2022, 2:22 p.m. UTC | #3
On Wed, Aug 24, 2022 at 10:58:26PM -0600, Yu Zhao wrote:
> On Wed, Aug 24, 2022 at 8:18 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> > allocating from the PCP must not re-enter the allocator from IRQ context.
> > In each instance where IRQ-reentrancy is possible, the lock is acquired using
> > pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> > is impossible.
> >
> > Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> > case at the cost of some IRQ allocations taking a slower path. If the PCP
> > lists need to be refilled, the zone lock still needs to disable IRQs but
> > that will only happen on PCP refill and drain. If an IRQ is raised when
> > a PCP allocation is in progress, the trylock will fail and fallback to
> > using the buddy lists directly. Note that this may not be a universal win
> > if an interrupt-intensive workload also allocates heavily from interrupt
> > context and contends heavily on the zone->lock as a result.
> 
> Hi,
> 
> This patch caused the following warning. Please take a look.
> 
> Thanks.
> 
>   WARNING: inconsistent lock state
>   6.0.0-dbg-DEV #1 Tainted: G S      W  O
>   --------------------------------

I finally found time to take a closer look at this and I cannot reproduce
it against 6.0. What workload triggered the warning, on what platform, and
can you post the kernel config used, please? It would also help if you
can remember which git commit the patch was tested on.

Thanks and sorry for the long delay.
Vlastimil Babka Oct. 10, 2022, 8:45 p.m. UTC | #4
On 10/10/22 16:22, Mel Gorman wrote:
> On Wed, Aug 24, 2022 at 10:58:26PM -0600, Yu Zhao wrote:
>> On Wed, Aug 24, 2022 at 8:18 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>>>
>>> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
>>> allocating from the PCP must not re-enter the allocator from IRQ context.
>>> In each instance where IRQ-reentrancy is possible, the lock is acquired using
>>> pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
>>> is impossible.
>>>
>>> Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
>>> case at the cost of some IRQ allocations taking a slower path. If the PCP
>>> lists need to be refilled, the zone lock still needs to disable IRQs but
>>> that will only happen on PCP refill and drain. If an IRQ is raised when
>>> a PCP allocation is in progress, the trylock will fail and fallback to
>>> using the buddy lists directly. Note that this may not be a universal win
>>> if an interrupt-intensive workload also allocates heavily from interrupt
>>> context and contends heavily on the zone->lock as a result.
>>
>> Hi,
>>
>> This patch caused the following warning. Please take a look.
>>
>> Thanks.
>>
>>    WARNING: inconsistent lock state
>>    6.0.0-dbg-DEV #1 Tainted: G S      W  O
>>    --------------------------------
> 
> I finally found time to take a closer look at this and I cannot reproduce
> it against 6.0. What workload triggered the warning, on what platform and
> can you post the kernel config used please? It would also help if you
> can remember what git commit the patch was tested upon.
> 
> Thanks and sorry for the long delay.

I didn't (try to) reproduce this, but FWIW the report looked legit to 
me, as after the patch, pcp_spin_trylock() has to be used for both 
allocation and freeing to be IRQ safe. free_unref_page() uses it, so 
it's fine. But as the stack trace in the report shows, 
free_unref_page_list() does pcp_spin_lock() and not _trylock, and that's 
IMHO the problem.
Yu Zhao Oct. 10, 2022, 10:09 p.m. UTC | #5
On Mon, Oct 10, 2022 at 2:45 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/10/22 16:22, Mel Gorman wrote:
> > On Wed, Aug 24, 2022 at 10:58:26PM -0600, Yu Zhao wrote:
> >> On Wed, Aug 24, 2022 at 8:18 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >>>
> >>> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> >>> allocating from the PCP must not re-enter the allocator from IRQ context.
> >>> In each instance where IRQ-reentrancy is possible, the lock is acquired using
> >>> pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> >>> is impossible.
> >>>
> >>> Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> >>> case at the cost of some IRQ allocations taking a slower path. If the PCP
> >>> lists need to be refilled, the zone lock still needs to disable IRQs but
> >>> that will only happen on PCP refill and drain. If an IRQ is raised when
> >>> a PCP allocation is in progress, the trylock will fail and fallback to
> >>> using the buddy lists directly. Note that this may not be a universal win
> >>> if an interrupt-intensive workload also allocates heavily from interrupt
> >>> context and contends heavily on the zone->lock as a result.
> >>
> >> Hi,
> >>
> >> This patch caused the following warning. Please take a look.
> >>
> >> Thanks.
> >>
> >>    WARNING: inconsistent lock state
> >>    6.0.0-dbg-DEV #1 Tainted: G S      W  O
> >>    --------------------------------
> >
> > I finally found time to take a closer look at this and I cannot reproduce
> > it against 6.0. What workload triggered the warning, on what platform and
> > can you post the kernel config used please? It would also help if you
> > can remember what git commit the patch was tested upon.
> >
> > Thanks and sorry for the long delay.
>
> I didn't (try to) reproduce this, but FWIW the report looked legit to
> me, as after the patch, pcp_spin_trylock() has to be used for both
> allocation and freeing to be IRQ safe. free_unref_page() uses it, so
> it's fine. But as the stack trace in the report shows,
> free_unref_page_list() does pcp_spin_lock() and not _trylock, and that's
> IMHO the problem.

If this is not the case, please let me know and I'll try repro again.
Mel Gorman Oct. 11, 2022, 8:25 a.m. UTC | #6
On Mon, Oct 10, 2022 at 10:45:43PM +0200, Vlastimil Babka wrote:
> On 10/10/22 16:22, Mel Gorman wrote:
> > On Wed, Aug 24, 2022 at 10:58:26PM -0600, Yu Zhao wrote:
> > > On Wed, Aug 24, 2022 at 8:18 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > > > 
> > > > The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> > > > allocating from the PCP must not re-enter the allocator from IRQ context.
> > > > In each instance where IRQ-reentrancy is possible, the lock is acquired using
> > > > pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
> > > > is impossible.
> > > > 
> > > > Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> > > > case at the cost of some IRQ allocations taking a slower path. If the PCP
> > > > lists need to be refilled, the zone lock still needs to disable IRQs but
> > > > that will only happen on PCP refill and drain. If an IRQ is raised when
> > > > a PCP allocation is in progress, the trylock will fail and fallback to
> > > > using the buddy lists directly. Note that this may not be a universal win
> > > > if an interrupt-intensive workload also allocates heavily from interrupt
> > > > context and contends heavily on the zone->lock as a result.
> > > 
> > > Hi,
> > > 
> > > This patch caused the following warning. Please take a look.
> > > 
> > > Thanks.
> > > 
> > >    WARNING: inconsistent lock state
> > >    6.0.0-dbg-DEV #1 Tainted: G S      W  O
> > >    --------------------------------
> > 
> > I finally found time to take a closer look at this and I cannot reproduce
> > it against 6.0. What workload triggered the warning, on what platform and
> > can you post the kernel config used please? It would also help if you
> > can remember what git commit the patch was tested upon.
> > 
> > Thanks and sorry for the long delay.
> 
> I didn't (try to) reproduce this, but FWIW the report looked legit to me, as
> after the patch, pcp_spin_trylock() has to be used for both allocation and
> freeing to be IRQ safe. free_unref_page() uses it, so it's fine. But as the
> stack trace in the report shows, free_unref_page_list() does pcp_spin_lock()
> and not _trylock, and that's IMHO the problem.
> 

I completely agree. It was a surprise to me that IO completion would
happen in softirq context, even though blk_done_softirq indicates that
it is normal, and I didn't manage to trigger that case myself. I wondered
if there was an easy way to force it, which would have made testing this
easier. I can live without the reproduction case and will cc Yu Zhao after
6.1-rc1 comes out and I've fixed this.
Mel Gorman Oct. 13, 2022, 10:10 a.m. UTC | #7
On Mon, Oct 10, 2022 at 04:09:14PM -0600, Yu Zhao wrote:
> > I didn't (try to) reproduce this, but FWIW the report looked legit to
> > me, as after the patch, pcp_spin_trylock() has to be used for both
> > allocation and freeing to be IRQ safe. free_unref_page() uses it, so
> > it's fine. But as the stack trace in the report shows,
> > free_unref_page_list() does pcp_spin_lock() and not _trylock, and that's
> > IMHO the problem.
> 
> If this is not the case, please let me know and I'll try repro again.

Can you try testing this patch please on top of v6.0? It passed light
testing for me but I never got the lockdep warning.

--8<--
mm/page_alloc: Leave IRQs enabled for per-cpu page allocations

The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired using
pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy
is impossible.

Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
case at the cost of some IRQ allocations taking a slower path. If the PCP
lists need to be refilled, the zone lock still needs to disable IRQs, but
that will only happen on PCP refill and drain. If an IRQ is raised while
a PCP allocation is in progress, the trylock will fail and the allocation
will fall back to using the buddy lists directly. Note that this may not be
a universal win if an interrupt-intensive workload also allocates heavily
from interrupt context and contends heavily on the zone->lock as a result.

[yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 122 ++++++++++++++++++++++++--------------------------------
 1 file changed, 53 insertions(+), 69 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d04211f0ef0b..55feaacfebb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -169,21 +169,12 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
 	_ret;								\
 })
 
-#define pcpu_spin_lock_irqsave(type, member, ptr, flags)		\
+#define pcpu_spin_trylock(type, member, ptr)				\
 ({									\
 	type *_ret;							\
 	pcpu_task_pin();						\
 	_ret = this_cpu_ptr(ptr);					\
-	spin_lock_irqsave(&_ret->member, flags);			\
-	_ret;								\
-})
-
-#define pcpu_spin_trylock_irqsave(type, member, ptr, flags)		\
-({									\
-	type *_ret;							\
-	pcpu_task_pin();						\
-	_ret = this_cpu_ptr(ptr);					\
-	if (!spin_trylock_irqsave(&_ret->member, flags)) {		\
+	if (!spin_trylock(&_ret->member)) {				\
 		pcpu_task_unpin();					\
 		_ret = NULL;						\
 	}								\
@@ -196,27 +187,16 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
 	pcpu_task_unpin();						\
 })
 
-#define pcpu_spin_unlock_irqrestore(member, ptr, flags)			\
-({									\
-	spin_unlock_irqrestore(&ptr->member, flags);			\
-	pcpu_task_unpin();						\
-})
-
 /* struct per_cpu_pages specific helpers. */
 #define pcp_spin_lock(ptr)						\
 	pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
 
-#define pcp_spin_lock_irqsave(ptr, flags)				\
-	pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
-
-#define pcp_spin_trylock_irqsave(ptr, flags)				\
-	pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+#define pcp_spin_trylock(ptr)						\
+	pcpu_spin_trylock(struct per_cpu_pages, lock, ptr)
 
 #define pcp_spin_unlock(ptr)						\
 	pcpu_spin_unlock(lock, ptr)
 
-#define pcp_spin_unlock_irqrestore(ptr, flags)				\
-	pcpu_spin_unlock_irqrestore(lock, ptr, flags)
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1536,6 +1516,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp,
 					int pindex)
 {
+	unsigned long flags;
 	int min_pindex = 0;
 	int max_pindex = NR_PCP_LISTS - 1;
 	unsigned int order;
@@ -1551,8 +1532,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
 
-	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	while (count > 0) {
@@ -1601,7 +1581,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (count > 0 && !list_empty(list));
 	}
 
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -3118,10 +3098,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
+	unsigned long flags;
 	int i, allocated = 0;
 
-	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -3155,7 +3135,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return allocated;
 }
 
@@ -3172,16 +3152,9 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0) {
-		unsigned long flags;
-
-		/*
-		 * free_pcppages_bulk expects IRQs disabled for zone->lock
-		 * so even though pcp->lock is not intended to be IRQ-safe,
-		 * it's needed in this context.
-		 */
-		spin_lock_irqsave(&pcp->lock, flags);
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock_irqrestore(&pcp->lock, flags);
+		spin_unlock(&pcp->lock);
 	}
 }
 #endif
@@ -3195,12 +3168,9 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count) {
-		unsigned long flags;
-
-		/* See drain_zone_pages on why this is disabling IRQs */
-		spin_lock_irqsave(&pcp->lock, flags);
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
-		spin_unlock_irqrestore(&pcp->lock, flags);
+		spin_unlock(&pcp->lock);
 	}
 }
 
@@ -3466,7 +3436,6 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
  */
 void free_unref_page(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
 	struct zone *zone;
@@ -3494,10 +3463,10 @@ void free_unref_page(struct page *page, unsigned int order)
 
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
 		free_unref_page_commit(zone, pcp, page, migratetype, order);
-		pcp_spin_unlock_irqrestore(pcp, flags);
+		pcp_spin_unlock(pcp);
 	} else {
 		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
 	}
@@ -3509,10 +3478,10 @@ void free_unref_page(struct page *page, unsigned int order)
  */
 void free_unref_page_list(struct list_head *list)
 {
+	unsigned long __maybe_unused UP_flags;
 	struct page *page, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
-	unsigned long flags;
 	int batch_count = 0;
 	int migratetype;
 
@@ -3541,11 +3510,26 @@ void free_unref_page_list(struct list_head *list)
 
 		/* Different zone, different pcp lock. */
 		if (zone != locked_zone) {
-			if (pcp)
-				pcp_spin_unlock_irqrestore(pcp, flags);
+			if (pcp) {
+				pcp_spin_unlock(pcp);
+				pcp_trylock_finish(UP_flags);
+			}
 
+			/*
+			 * trylock is necessary as pages may be getting freed
+			 * from IRQ or SoftIRQ context after an IO completion.
+			 */
+			pcp_trylock_prepare(UP_flags);
+			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
+			if (!pcp) {
+				pcp_trylock_finish(UP_flags);
+				list_del(&page->lru);
+				free_one_page(page_zone(page), page,
+					      page_to_pfn(page), 0, migratetype,
+					      FPI_NONE);
+				continue;
+			}
 			locked_zone = zone;
-			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
 		}
 
 		/*
@@ -3560,18 +3544,23 @@ void free_unref_page_list(struct list_head *list)
 		free_unref_page_commit(zone, pcp, page, migratetype, 0);
 
 		/*
-		 * Guard against excessive IRQ disabled times when we get
-		 * a large list of pages to free.
+		 * Guard against excessive IRQ disabled times when freeing
+		 * a large list of pages. Lock will be reacquired if
+		 * necessary on the next iteration.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			pcp_spin_unlock_irqrestore(pcp, flags);
+			pcp_spin_unlock(pcp);
+			pcp_trylock_finish(UP_flags);
 			batch_count = 0;
-			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
+			pcp = NULL;
+			locked_zone = NULL;
 		}
 	}
 
-	if (pcp)
-		pcp_spin_unlock_irqrestore(pcp, flags);
+	if (pcp) {
+		pcp_spin_unlock(pcp);
+		pcp_trylock_finish(UP_flags);
+	}
 }
 
 /*
@@ -3783,15 +3772,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
 	struct page *page;
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 
-	/*
-	 * spin_trylock may fail due to a parallel drain. In the future, the
-	 * trylock will also protect against IRQ reentrancy.
-	 */
+	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (!pcp) {
 		pcp_trylock_finish(UP_flags);
 		return NULL;
@@ -3805,7 +3790,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
-	pcp_spin_unlock_irqrestore(pcp, flags);
+	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -5363,7 +5348,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			struct page **page_array)
 {
 	struct page *page;
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 	struct zone *zone;
 	struct zoneref *z;
@@ -5445,9 +5429,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (unlikely(!zone))
 		goto failed;
 
-	/* Is a parallel drain in progress? */
+	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (!pcp)
 		goto failed_irq;
 
@@ -5466,7 +5450,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
-				pcp_spin_unlock_irqrestore(pcp, flags);
+				pcp_spin_unlock(pcp);
 				goto failed_irq;
 			}
 			break;
@@ -5481,7 +5465,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	pcp_spin_unlock_irqrestore(pcp, flags);
+	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..6a8f07a0a548 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -169,21 +169,12 @@  static DEFINE_MUTEX(pcp_batch_high_lock);
 	_ret;								\
 })
 
-#define pcpu_spin_lock_irqsave(type, member, ptr, flags)		\
+#define pcpu_spin_trylock(type, member, ptr)				\
 ({									\
 	type *_ret;							\
 	pcpu_task_pin();						\
 	_ret = this_cpu_ptr(ptr);					\
-	spin_lock_irqsave(&_ret->member, flags);			\
-	_ret;								\
-})
-
-#define pcpu_spin_trylock_irqsave(type, member, ptr, flags)		\
-({									\
-	type *_ret;							\
-	pcpu_task_pin();						\
-	_ret = this_cpu_ptr(ptr);					\
-	if (!spin_trylock_irqsave(&_ret->member, flags)) {		\
+	if (!spin_trylock(&_ret->member)) {				\
 		pcpu_task_unpin();					\
 		_ret = NULL;						\
 	}								\
@@ -196,27 +187,16 @@  static DEFINE_MUTEX(pcp_batch_high_lock);
 	pcpu_task_unpin();						\
 })
 
-#define pcpu_spin_unlock_irqrestore(member, ptr, flags)			\
-({									\
-	spin_unlock_irqrestore(&ptr->member, flags);			\
-	pcpu_task_unpin();						\
-})
-
 /* struct per_cpu_pages specific helpers. */
 #define pcp_spin_lock(ptr)						\
 	pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
 
-#define pcp_spin_lock_irqsave(ptr, flags)				\
-	pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
-
-#define pcp_spin_trylock_irqsave(ptr, flags)				\
-	pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+#define pcp_spin_trylock(ptr)						\
+	pcpu_spin_trylock(struct per_cpu_pages, lock, ptr)
 
 #define pcp_spin_unlock(ptr)						\
 	pcpu_spin_unlock(lock, ptr)
 
-#define pcp_spin_unlock_irqrestore(ptr, flags)				\
-	pcpu_spin_unlock_irqrestore(lock, ptr, flags)
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1536,6 +1516,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp,
 					int pindex)
 {
+	unsigned long flags;
 	int min_pindex = 0;
 	int max_pindex = NR_PCP_LISTS - 1;
 	unsigned int order;
@@ -1551,8 +1532,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
 
-	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	while (count > 0) {
@@ -1601,7 +1581,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (count > 0 && !list_empty(list));
 	}
 
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -3118,10 +3098,10 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
+	unsigned long flags;
 	int i, allocated = 0;
 
-	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -3155,7 +3135,7 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return allocated;
 }
 
@@ -3172,16 +3152,9 @@  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0) {
-		unsigned long flags;
-
-		/*
-		 * free_pcppages_bulk expects IRQs disabled for zone->lock
-		 * so even though pcp->lock is not intended to be IRQ-safe,
-		 * it's needed in this context.
-		 */
-		spin_lock_irqsave(&pcp->lock, flags);
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock_irqrestore(&pcp->lock, flags);
+		spin_unlock(&pcp->lock);
 	}
 }
 #endif
@@ -3195,12 +3168,9 @@  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count) {
-		unsigned long flags;
-
-		/* See drain_zone_pages on why this is disabling IRQs */
-		spin_lock_irqsave(&pcp->lock, flags);
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
-		spin_unlock_irqrestore(&pcp->lock, flags);
+		spin_unlock(&pcp->lock);
 	}
 }
 
@@ -3466,7 +3436,6 @@  static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
  */
 void free_unref_page(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
 	struct zone *zone;
@@ -3494,10 +3463,10 @@  void free_unref_page(struct page *page, unsigned int order)
 
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
 		free_unref_page_commit(zone, pcp, page, migratetype, order);
-		pcp_spin_unlock_irqrestore(pcp, flags);
+		pcp_spin_unlock(pcp);
 	} else {
 		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
 	}
@@ -3512,7 +3481,6 @@  void free_unref_page_list(struct list_head *list)
 	struct page *page, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
-	unsigned long flags;
 	int batch_count = 0;
 	int migratetype;
 
@@ -3542,10 +3510,10 @@  void free_unref_page_list(struct list_head *list)
 		/* Different zone, different pcp lock. */
 		if (zone != locked_zone) {
 			if (pcp)
-				pcp_spin_unlock_irqrestore(pcp, flags);
+				pcp_spin_unlock(pcp);
 
 			locked_zone = zone;
-			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
+			pcp = pcp_spin_lock(locked_zone->per_cpu_pageset);
 		}
 
 		/*
@@ -3564,14 +3532,14 @@  void free_unref_page_list(struct list_head *list)
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			pcp_spin_unlock_irqrestore(pcp, flags);
+			pcp_spin_unlock(pcp);
 			batch_count = 0;
-			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
+			pcp = pcp_spin_lock(locked_zone->per_cpu_pageset);
 		}
 	}
 
 	if (pcp)
-		pcp_spin_unlock_irqrestore(pcp, flags);
+		pcp_spin_unlock(pcp);
 }
 
 /*
@@ -3783,15 +3751,11 @@  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
 	struct page *page;
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 
-	/*
-	 * spin_trylock may fail due to a parallel drain. In the future, the
-	 * trylock will also protect against IRQ reentrancy.
-	 */
+	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (!pcp) {
 		pcp_trylock_finish(UP_flags);
 		return NULL;
@@ -3805,7 +3769,7 @@  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
-	pcp_spin_unlock_irqrestore(pcp, flags);
+	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -5329,7 +5293,6 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			struct page **page_array)
 {
 	struct page *page;
-	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 	struct zone *zone;
 	struct zoneref *z;
@@ -5411,9 +5374,9 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (unlikely(!zone))
 		goto failed;
 
-	/* Is a parallel drain in progress? */
+	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (!pcp)
 		goto failed_irq;
 
@@ -5432,7 +5395,7 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
-				pcp_spin_unlock_irqrestore(pcp, flags);
+				pcp_spin_unlock(pcp);
 				goto failed_irq;
 			}
 			break;
@@ -5447,7 +5410,7 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	pcp_spin_unlock_irqrestore(pcp, flags);
+	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);