diff mbox series

[v2,2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().

Message ID 20230623171232.892937-3-bigeasy@linutronix.de (mailing list archive)
State New
Headers show
Series seqlock,mm: lockdep annotation + write_seqlock_irqsave() | expand

Commit Message

Sebastian Andrzej Siewior June 23, 2023, 5:12 p.m. UTC
__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on
PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
lock on PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in
one step.
printk_deferred_enter() has to be invoked in non-migrate-able context to
ensure that deferred printing is enabled and disabled on the same CPU.
This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
	local_irq_disable();
	printk_deferred_enter();
	write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the
lockdep annotation happens after the sequence count is incremented.
This is exceptional and subject to change.

It was also pointed that PREEMPT_RT can be affected by the printk
problem since its write_seqlock_irqsave() does not really disable
interrupts. This isn't the case because PREEMPT_RT's printk
implementation differs from the mainline implementation in two important
aspects:
- Printing happens in a dedicated threads and not at during the
  invocation of printk().
- In emergency cases where synchronous printing is used, a different
  driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.

Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/page_alloc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

Comments

Michal Hocko June 23, 2023, 6:17 p.m. UTC | #1
On Fri 23-06-23 19:12:32, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
> 
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
> 
> There was discussion on the first submission that the order should be:
> 	local_irq_disable();
> 	printk_deferred_enter();
> 	write_seqlock();
> 
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
> 
> It was also pointed that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in a dedicated threads and not at during the
>   invocation of printk().
> - In emergency cases where synchronous printing is used, a different
>   driver is used which does not use tty_port::lock.
> 
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
> 
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Thanks for extending the changelog. This is much more clearer IMO.

One nit below which I haven't noticed before. Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>  
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
>  	 * Explicitly disable this CPU's synchronous printk() before taking
>  	 * seqlock to prevent any printk() from trying to hold port->lock, for

This is not the case anymore because the locking ordering has flipped. I
would just extend the comment above by something like:

	 * Also disable synchronous printk() to prevent any printk() from trying
	 * to hold port->lock, for tty_insert_flip_string_and_push_buffer() on
	 * other CPU might be calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with
	 * port->lock held.

> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.40.1
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b7..99b7e7d09c5c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5808,11 +5808,10 @@  static void __build_all_zonelists(void *data)
 	unsigned long flags;
 
 	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+	 * The zonelist_update_seq must be acquired with irqsave because the
+	 * reader can be invoked from IRQ with GFP_ATOMIC.
 	 */
-	local_irq_save(flags);
+	write_seqlock_irqsave(&zonelist_update_seq, flags);
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
@@ -5820,7 +5819,6 @@  static void __build_all_zonelists(void *data)
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5857,9 +5855,8 @@  static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
-	local_irq_restore(flags);
+	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
 }
 
 static noinline void __init