
[v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

Message ID 8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp (mailing list archive)
State New
Series [v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

Commit Message

Tetsuo Handa April 4, 2023, 2:31 p.m. UTC
syzbot is reporting a circular locking dependency which involves the
zonelist_update_seq seqlock [1], because this lock is checked even by memory
allocation requests which do not need to be retried.

One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.

  CPU0
  ----
  __build_all_zonelists() {
    write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
    // e.g. timer interrupt handler runs at this moment
      some_timer_func() {
        kmalloc(GFP_ATOMIC) {
          __alloc_pages_slowpath() {
            read_seqbegin(&zonelist_update_seq) {
              // spins forever because zonelist_update_seq.seqcount is odd
            }
          }
        }
      }
    // e.g. timer interrupt handler finishes
    write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
  }

This deadlock scenario could be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable only to __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko is not sure whether we should go with this
approach.

Another deadlock scenario which syzbot is reporting is a race between
a kmalloc(GFP_ATOMIC) call from tty_insert_flip_string_and_push_buffer()
with port->lock held and a printk() call from __build_all_zonelists()
with zonelist_update_seq held.

  CPU0                                   CPU1
  ----                                   ----
  pty_write() {
    tty_insert_flip_string_and_push_buffer() {
                                         __build_all_zonelists() {
                                           write_seqlock(&zonelist_update_seq);
                                           build_zonelists() {
                                             printk() {
                                               vprintk() {
                                                 vprintk_default() {
                                                   vprintk_emit() {
                                                     console_unlock() {
                                                       console_flush_all() {
                                                         console_emit_next_record() {
                                                           con->write() = serial8250_console_write() {
      spin_lock_irqsave(&port->lock, flags);
      tty_insert_flip_string() {
        tty_insert_flip_string_fixed_flag() {
          __tty_buffer_request_room() {
            tty_buffer_alloc() {
              kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                __alloc_pages_slowpath() {
                  zonelist_iter_begin() {
                    read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
                                                             spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
                    }
                  }
                }
              }
            }
          }
        }
      }
      spin_unlock_irqrestore(&port->lock, flags);
                                                             // message is printed to console
                                                             spin_unlock_irqrestore(&port->lock, flags);
                                                           }
                                                         }
                                                       }
                                                     }
                                                   }
                                                 }
                                               }
                                             }
                                           }
                                           write_sequnlock(&zonelist_update_seq);
                                         }
    }
  }

This deadlock scenario can be eliminated by

  preventing interrupt context from calling kmalloc(GFP_ATOMIC)

and

  preventing printk() from calling console_flush_all()

while zonelist_update_seq.seqcount is odd.

Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by

  disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)

and

  disabling synchronous printk() in order to avoid console_flush_all().

As a side effect of minimizing the duration for which
zonelist_update_seq.seqcount is odd by disabling synchronous printk(),
latency at read_seqbegin(&zonelist_update_seq) will be reduced for both
!__GFP_DIRECT_RECLAIM and __GFP_DIRECT_RECLAIM allocation requests.
Although, from a lockdep perspective, not calling
read_seqbegin(&zonelist_update_seq) from interrupt context (i.e. not
recording an unnecessary locking dependency) would still be preferable,
even if we do not allow calling kmalloc(GFP_ATOMIC) inside the
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...

Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
---
Changes in v2:
  Update patch description and comment.

 mm/page_alloc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Michal Hocko April 4, 2023, 3:20 p.m. UTC | #1
On Tue 04-04-23 23:31:58, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
> 
>   CPU0
>   ----
>   __build_all_zonelists() {
>     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>     // e.g. timer interrupt handler runs at this moment
>       some_timer_func() {
>         kmalloc(GFP_ATOMIC) {
>           __alloc_pages_slowpath() {
>             read_seqbegin(&zonelist_update_seq) {
>               // spins forever because zonelist_update_seq.seqcount is odd
>             }
>           }
>         }
>       }
>     // e.g. timer interrupt handler finishes
>     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>   }
> 
> This deadlock scenario can be easily eliminated by not calling
> read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
> requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
> requests. But Michal Hocko does not know whether we should go with this
> approach.

It would have been more useful to explain why that is not preferred or
desirable.

> Another deadlock scenario which syzbot is reporting is a race between
> kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer()
> with port->lock held and printk() from __build_all_zonelists() with
> zonelist_update_seq held.
> 
>   CPU0                                   CPU1
>   ----                                   ----
>   pty_write() {
>     tty_insert_flip_string_and_push_buffer() {
>                                          __build_all_zonelists() {
>                                            write_seqlock(&zonelist_update_seq);
>                                            build_zonelists() {
>                                              printk() {
>                                                vprintk() {
>                                                  vprintk_default() {
>                                                    vprintk_emit() {
>                                                      console_unlock() {
>                                                        console_flush_all() {
>                                                          console_emit_next_record() {
>                                                            con->write() = serial8250_console_write() {
>       spin_lock_irqsave(&port->lock, flags);
>       tty_insert_flip_string() {
>         tty_insert_flip_string_fixed_flag() {
>           __tty_buffer_request_room() {
>             tty_buffer_alloc() {
>               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
>                 __alloc_pages_slowpath() {
>                   zonelist_iter_begin() {
>                     read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
>                                                              spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
>                     }
>                   }
>                 }
>               }
>             }
>           }
>         }
>       }
>       spin_unlock_irqrestore(&port->lock, flags);
>                                                              // message is printed to console
>                                                              spin_unlock_irqrestore(&port->lock, flags);
>                                                            }
>                                                          }
>                                                        }
>                                                      }
>                                                    }
>                                                  }
>                                                }
>                                              }
>                                            }
>                                            write_sequnlock(&zonelist_update_seq);
>                                          }
>     }
>   }
> 
> This deadlock scenario can be eliminated by
> 
>   preventing interrupt context from calling kmalloc(GFP_ATOMIC)
> 
> and
> 
>   preventing printk() from calling console_flush_all()
> 
> while zonelist_update_seq.seqcount is odd.
> 
> Since Petr Mladek thinks that __build_all_zonelists() can become a
> candidate for deferring printk() [2], let's address this problem by
> 
>   disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
> 
> and
> 
>   disabling synchronous printk() in order to avoid console_flush_all()
> 
> .
> 
> As a side effect of minimizing duration of zonelist_update_seq.seqcount
> being odd by disabling synchronous printk(), latency at
> read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
> __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
> lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
> do not record unnecessary locking dependency) from interrupt context is
> still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside
> write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
> section...

I have a really hard time wrapping my head around this changelog. I would
rephrase it as follows.

The syzbot has noticed the following deadlock scenario[1]
	CPU0					CPU1
   pty_write() {
     tty_insert_flip_string_and_push_buffer() {
                                          __build_all_zonelists() {
                                            write_seqlock(&zonelist_update_seq); (A)
                                            build_zonelists() {
                                              printk() {
                                                vprintk() {
                                                  vprintk_default() {
                                                    vprintk_emit() {
                                                      console_unlock() {
                                                        console_flush_all() {
                                                          console_emit_next_record() {
                                                            con->write() = serial8250_console_write() {
       spin_lock_irqsave(&port->lock, flags); (B)
                                                              spin_lock_irqsave(&port->lock, flags); <<< spinning on (B)
       tty_insert_flip_string() {
         tty_insert_flip_string_fixed_flag() {
           __tty_buffer_request_room() {
             tty_buffer_alloc() {
               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                 __alloc_pages_slowpath() {
                   zonelist_iter_begin() {
                     read_seqbegin(&zonelist_update_seq); <<< spinning on (A)

This can happen during memory hotplug operation. This means that
write_seqlock on zonelist_update_seq is not allowed to call into
synchronous printk code path. This can be avoided by using a deferred
printk context.

This is not the only problematic scenario though. Another one would be
   __build_all_zonelists() {
     write_seqlock(&zonelist_update_seq); <<< (A)
       <IRQ>
         kmalloc(GFP_ATOMIC) {
           __alloc_pages_slowpath() {
             read_seqbegin(&zonelist_update_seq) <<< spinning on (A)

Allocations from (soft)IRQ contexts are quite common. This can be
avoided by disabling interrupts for this path so we won't self livelock.

> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>

Anyway, the patch is correct.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changes in v2:
>   Update patch description and comment.
> 
>  mm/page_alloc.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7136c36c5d01..e8b4f294d763 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Explicitly disable this CPU's interrupts before taking seqlock
> +	 * to prevent any IRQ handler from calling into the page allocator
> +	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 */
> +	local_irq_save(flags);
> +	/*
> +	 * Explicitly disable this CPU's synchronous printk() before taking
> +	 * seqlock to prevent any printk() from trying to hold port->lock, for
> +	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> +	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> +	 */
> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.34.1
Andrew Morton April 4, 2023, 9:25 p.m. UTC | #2
On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.

I queued this, along with a note that an updated changelog is likely.

Do we feel that a -stable backport is warranted?  I think so, from your
earlier comments.  Please add the cc:stable to the changelog in this
situation.
Michal Hocko April 5, 2023, 8:28 a.m. UTC | #3
On Tue 04-04-23 14:25:28, Andrew Morton wrote:
> On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > syzbot is reporting circular locking dependency which involves
> > zonelist_update_seq seqlock [1], for this lock is checked by memory
> > allocation requests which do not need to be retried.
> 
> I queued this, along with a note that an updated changelog is likely.
> 
> Do we feel that a -stable backport is warranted?  I think so, from your
> earlier comments.  Please add the cc:stable to the changelog in this
> situation.

Memory hotplug is a pretty rare event, so the deadlock is quite unlikely.
On the other hand, the fix is pretty easy, so it shouldn't hurt to have it
in stable kernels.
Petr Mladek April 5, 2023, 8:53 a.m. UTC | #4
On Wed 2023-04-05 10:28:07, Michal Hocko wrote:
> On Tue 04-04-23 14:25:28, Andrew Morton wrote:
> > On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > 
> > > syzbot is reporting circular locking dependency which involves
> > > zonelist_update_seq seqlock [1], for this lock is checked by memory
> > > allocation requests which do not need to be retried.
> > 
> > I queued this, along with a note that an updated changelog is likely.
> > 
> > Do we feel that a -stable backport is warranted?  I think so, from your
> > earlier comments.  Please add the cc:stable to the changelog in this
> > situation.
> 
> Memory hotplug is pretty rare event so the deadlock is quite unlikely.
> On the other hand the fix is pretty easy so it shouldn't hurt to have it
> in stable kernels.

Note that printk_deferred_enter()/exit() were added in v5.15-rc1
by commit 85e3e7fbbb720b9897 ("printk: remove NMI tracking").

The commit has non-trivial dependencies. Any backport to an older stable
kernel would need a custom patch that just adds these two wrappers.

Best Regards,
Petr
Mel Gorman April 5, 2023, 9:02 a.m. UTC | #5
On Tue, Apr 04, 2023 at 05:20:35PM +0200, Michal Hocko wrote:
> On Tue 04-04-23 23:31:58, Tetsuo Handa wrote:
> > syzbot is reporting circular locking dependency which involves
> > zonelist_update_seq seqlock [1], for this lock is checked by memory
> > allocation requests which do not need to be retried.
> > 
> > One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
> > 
> >   CPU0
> >   ----
> >   __build_all_zonelists() {
> >     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
> >     // e.g. timer interrupt handler runs at this moment
> >       some_timer_func() {
> >         kmalloc(GFP_ATOMIC) {
> >           __alloc_pages_slowpath() {
> >             read_seqbegin(&zonelist_update_seq) {
> >               // spins forever because zonelist_update_seq.seqcount is odd
> >             }
> >           }
> >         }
> >       }
> >     // e.g. timer interrupt handler finishes
> >     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
> >   }
> > 
> > This deadlock scenario can be easily eliminated by not calling
> > read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
> > requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
> > requests. But Michal Hocko does not know whether we should go with this
> > approach.
> 
> It would have been more useful to explain why that is not preferred or
> desirable.
> 

It would have undesirable side-effects. !__GFP_DIRECT_RECLAIM does not mean
that the allocation is happening from interrupt context or is a re-entrant
allocation request from IRQ context. Special-casing !__GFP_DIRECT_RECLAIM
could allow a speculative allocation to simply fail prematurely due to
memory hotplug.

	This deadlock could be mitigated by not calling
	read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM
	allocation requests but it has undesirable consequences.
	!__GFP_DIRECT_RECLAIM applies to allocations other than
	atomic allocations from interrupt context, such as GFP_NOWAIT
	allocations. These types of allocations could prematurely fail due to
	a memory hotplug race or cpuset update and while this is probably
	harmless and recoverable, it's undesirable and unnecessary to fix
	the deadlock.

I imagine there could also be checks for callers from IRQ context, but on
some older chips checking whether IRQs are disabled is expensive, so that
is also an undesirable fix. Maybe the same is true for newer CPUs, but I
didn't check. The printk_deferred option seems the most harmless option
for now, and maybe printk_deferred will ultimately go away.

Other than potential changelog updates

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..e8b4f294d763 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6632,7 +6632,21 @@  static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	unsigned long flags;
 
+	/*
+	 * Explicitly disable this CPU's interrupts before taking seqlock
+	 * to prevent any IRQ handler from calling into the page allocator
+	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+	 */
+	local_irq_save(flags);
+	/*
+	 * Explicitly disable this CPU's synchronous printk() before taking
+	 * seqlock to prevent any printk() from trying to hold port->lock, for
+	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
+	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
+	 */
+	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -6671,6 +6685,8 @@  static void __build_all_zonelists(void *data)
 	}
 
 	write_sequnlock(&zonelist_update_seq);
+	printk_deferred_exit();
+	local_irq_restore(flags);
 }
 
 static noinline void __init