
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

Message ID 6266b161-e4c3-7d65-6590-da6cc04d93ec@I-love.SAKURA.ne.jp (mailing list archive)
State New
Series mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

Commit Message

Tetsuo Handa April 4, 2023, 12:37 a.m. UTC
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.

We somehow need to prevent __alloc_pages_slowpath() from checking
this lock. Since Petr Mladek thinks that __build_all_zonelists() can
become a candidate for deferring printk() [2], let's make sure that
current CPU/thread won't reach __alloc_pages_slowpath() while this lock
is in use.

Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
---
 mm/page_alloc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Sergey Senozhatsky April 4, 2023, 2:11 a.m. UTC | #1
On (23/04/04 09:37), Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> We somehow need to prevent __alloc_pages_slowpath() from checking
> this lock. Since Petr Mladek thinks that __build_all_zonelists() can
> become a candidate for deferring printk() [2], let's make sure that
> current CPU/thread won't reach __alloc_pages_slowpath() while this lock
> is in use.
> 
> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>

Yeah that looks like one of those printk_deferred() cases.
FWIW
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Petr Mladek April 4, 2023, 7:43 a.m. UTC | #2
On Tue 2023-04-04 09:37:25, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> We somehow need to prevent __alloc_pages_slowpath() from checking
> this lock. Since Petr Mladek thinks that __build_all_zonelists() can
> become a candidate for deferring printk() [2], let's make sure that
> current CPU/thread won't reach __alloc_pages_slowpath() while this lock
> is in use.
> 
> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]

From the description it is far from obvious how printk() is involved.
It might make sense to paste the entire lockdep splat; the links
are not guaranteed to stay around.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
> +	 * odd have to be avoided.
> +	 *
> +	 * Explicitly disable local irqs in order to avoid calling
> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
> +	 * Also, explicitly prevent printk() from synchronously waiting for
> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
> +	 */
> +	local_irq_save(flags);

The comment above the printk_deferred_enter() definition in
include/linux/printk.h says that interrupts need to be disabled.

But strictly speaking, it should be enough to disable preemption
these days. The reason is that it uses a per-CPU reference counter.

Note that it used to be really important to disable interrupts
in the past. The messages were temporarily stored in a per-CPU buffer
and the lockless algorithm was not safe against reentrancy.

> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init

Otherwise, it looks fine from the printk() POV.

Best Regards,
Petr
Michal Hocko April 4, 2023, 7:54 a.m. UTC | #3
On Tue 04-04-23 09:37:25, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> We somehow need to prevent __alloc_pages_slowpath() from checking
> this lock. Since Petr Mladek thinks that __build_all_zonelists() can
> become a candidate for deferring printk() [2], let's make sure that
> current CPU/thread won't reach __alloc_pages_slowpath() while this lock
> is in use.

I agree with Petr that the lockdep splat and the lockup explanation
should be part of the changelog. Just reuse what you had in the
previous email.

> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>
> ---
>  mm/page_alloc.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7136c36c5d01..64fa77b8d24a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
> +	 * odd have to be avoided.
> +	 *
> +	 * Explicitly disable local irqs in order to avoid calling
> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
> +	 * Also, explicitly prevent printk() from synchronously waiting for
> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.

This explanation of local_irq_save() just doesn't make any sense. You do
not prevent any other CPU from entering the IRQ and doing the same
thing. If the sole purpose of local_irq_save() is to conform to
printk_deferred_enter(), then state that instead. Although it seems that
Petr believes preempt_disable() should be sufficient, and in that case it
would be preferred as well. This would require an update to the comment
for printk_deferred_enter() though.

Thanks!

> +	 */
> +	local_irq_save(flags);
> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.34.1
Tetsuo Handa April 4, 2023, 8:20 a.m. UTC | #4
On 2023/04/04 16:54, Michal Hocko wrote:
>> +	/*
>> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
>> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
>> +	 * odd have to be avoided.
>> +	 *
>> +	 * Explicitly disable local irqs in order to avoid calling
>> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
>> +	 * Also, explicitly prevent printk() from synchronously waiting for
>> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
>> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
> 
> This explanation of local_irq_save just doesn't make any sense. You do
> not prevent any other cpu from entering the IRQ and doing the same
> thing.

There is no need to prevent other CPUs from doing the same thing.
The intent of local_irq_save() here is to avoid the sequence below.

  CPU0
  ----
  __build_all_zonelists() {
    write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
    // e.g. timer interrupt handler runs at this moment
      some_timer_func() {
        kmalloc(GFP_ATOMIC) {
          __alloc_pages_slowpath() {
            read_seqbegin(&zonelist_update_seq) {
              // forever spins because zonelist_update_seq.seqcount is odd
            }
          }
        }
      }
    // e.g. timer interrupt handler finishes
    write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
  }

>        If the sole purpose of local_irq_save is to conform
> printk_deferred_enter then state that instead.

As described above, this local_irq_save() is not intended to conform to
printk_deferred_enter(). This is the same as set_mems_allowed() calling
local_irq_save() before write_seqcount_begin(&current->mems_allowed_seq).

>                                                Although it seems that
> Petr believes that preempt_disable should be sufficient and then it
> would be preferred as well. This would require update to the comment for
> printk_deferred_enter though.

This patch is supposed to be backported to stable kernels where commit
3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists
and page allocation") was backported.
Michal Hocko April 4, 2023, 11:05 a.m. UTC | #5
On Tue 04-04-23 17:20:02, Tetsuo Handa wrote:
> On 2023/04/04 16:54, Michal Hocko wrote:
> >> +	/*
> >> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
> >> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
> >> +	 * odd have to be avoided.
> >> +	 *
> >> +	 * Explicitly disable local irqs in order to avoid calling
> >> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
> >> +	 * Also, explicitly prevent printk() from synchronously waiting for
> >> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
> >> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
> > 
> > This explanation of local_irq_save just doesn't make any sense. You do
> > not prevent any other cpu from entering the IRQ and doing the same
> > thing.
> 
> There is no need to prevent other CPUs from doing the same thing.
> The intent of local_irq_save() here is to avoid below sequence.
> 
>   CPU0
>   ----
>   __build_all_zonelists() {
>     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>     // e.g. timer interrupt handler runs at this moment
>       some_timer_func() {
>         kmalloc(GFP_ATOMIC) {
>           __alloc_pages_slowpath() {
>             read_seqbegin(&zonelist_update_seq) {
>               // forever spins because zonelist_update_seq.seqcount is odd
>             }
>           }
>         }
>       }
>     // e.g. timer interrupt handler finishes
>     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>   }

OK, now we are on the same page. Your previous explanation in
bc630622-8b24-e4ca-2685-64880b5a7647@I-love.SAKURA.ne.jp had an
intermediary lock in the dependency (port->lock). I hadn't realized
that the actual scenario could be simpler than that.
But you are right that GFP_ATOMIC allocations from IRQ context are quite
common, so this is a more general situation that doesn't really need to
involve the TTY and some locking oddities there.

This all is quite subtle and easy to miss so please make sure to
describe both scenarios in the changelog. For the comment above I would
rephrase as follows:
	/*
	 * Explicitly disable interrupts before taking seqlock to prevent
	 * any IRQ handler from calling into the page allocator
	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and live
	 * lock.
	 */

Thanks!
Tetsuo Handa April 4, 2023, 11:19 a.m. UTC | #6
On 2023/04/04 20:05, Michal Hocko wrote:
> But you are right that GFP_ATOMIC allocations from IRQ context are quite
> common so this is a more general situation that doesn't really need to
> involve TTY and some locking oddities there.

Yes, that's why "[PATCH] mm/page_alloc: don't check zonelist_update_seq from
atomic allocations" was proposed; I consider that zonelist_update_seq should
not be checked from atomic context.

From a lockdep perspective, keeping the locking dependencies simpler is
better, even if we go with the printk_deferred_enter()/printk_deferred_exit()
approach.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..64fa77b8d24a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6632,7 +6632,21 @@  static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	unsigned long flags;
 
+	/*
+	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
+	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
+	 * odd has to be avoided.
+	 *
+	 * Explicitly disable local irqs in order to avoid calling
+	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
+	 * Also, explicitly prevent printk() from synchronously waiting for
+	 * port->lock because tty_insert_flip_string_and_push_buffer() might
+	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
+	 */
+	local_irq_save(flags);
+	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -6671,6 +6685,8 @@  static void __build_all_zonelists(void *data)
 	}
 
 	write_sequnlock(&zonelist_update_seq);
+	printk_deferred_exit();
+	local_irq_restore(flags);
 }
 
 static noinline void __init