mm/page_alloc: don't check zonelist_update_seq from atomic allocations

Message ID: dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp
State: New
Series: mm/page_alloc: don't check zonelist_update_seq from atomic allocations

Commit Message

Tetsuo Handa April 2, 2023, 10:48 a.m. UTC
syzbot is reporting a circular locking dependency which involves the
zonelist_update_seq seqlock, because zonelist_update_seq is checked
while doing GFP_ATOMIC allocations.

Since zonelist_update_seq is checked in order to avoid an unnecessary OOM
kill, there is no need to check zonelist_update_seq for memory allocations
which will fail without invoking the OOM killer.

Therefore, let's keep the locking dependency simple by not checking
zonelist_update_seq from !__GFP_DIRECT_RECLAIM allocations.

Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Comments

Michal Hocko April 3, 2023, 8:15 a.m. UTC | #1
On Sun 02-04-23 19:48:29, Tetsuo Handa wrote:
> syzbot is reporting a circular locking dependency which involves the
> zonelist_update_seq seqlock, because zonelist_update_seq is checked
> while doing GFP_ATOMIC allocations.

Could you explain the deadlock scenario?

> Since zonelist_update_seq is checked in order to avoid an unnecessary OOM
> kill, there is no need to check zonelist_update_seq for memory allocations
> which will fail without invoking the OOM killer.
> 
> Therefore, let's keep the locking dependency simple by not checking
> zonelist_update_seq from !__GFP_DIRECT_RECLAIM allocations.
> 
> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>

Is this
https://lore.kernel.org/all/0000000000001d74d205f7c1821f@google.com/ the
underlying report?

> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tetsuo Handa April 3, 2023, 11:14 a.m. UTC | #2
On 2023/04/03 17:15, Michal Hocko wrote:
> Is this
> https://lore.kernel.org/all/0000000000001d74d205f7c1821f@google.com/ the
> underlying report?

Yes.

> Could you explain the deadlock scenario?

build_zonelists() from __build_all_zonelists() calls printk() with
zonelist_update_seq held.

printk() holds the console_owner lock for synchronous printing, and then, upon
unlock of the console_owner lock, printk() holds port_lock_key and port->lock.

tty_insert_flip_string_and_push_buffer() from pty_write() conditionally calls
kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. But since commit 3d36424b3b58,
zonelist_update_seq is checked by GFP_ATOMIC allocations (i.e. a new locking dependency
was added by that commit).

  CPU0                                                       CPU1
  pty_write() {
    tty_insert_flip_string_and_push_buffer() {
                                                             __build_all_zonelists() {
      spin_lock_irqsave(&port->lock, flags);
      tty_insert_flip_string() {
        tty_insert_flip_string_fixed_flag() {
          __tty_buffer_request_room() {
            tty_buffer_alloc() {
              kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                __alloc_pages_slowpath() {
                                                               write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
                                                               // interrupt handler starts
                                                                 handle_irq() {
                                                                   serial8250_interrupt() {
                                                                     serial8250_tx_chars() {
                                                                       tty_port_tty_get() {
                                                                         spin_lock_irqsave(&port->lock, flags); // spins here waiting for kmalloc() from tty_insert_flip_string() to complete
                  zonelist_iter_begin() {
                    read_seqbegin(&zonelist_update_seq) {
                      // spins here waiting for interrupt handler to complete if zonelist_update_seq.seqcount is odd
                                                                         tty = tty_kref_get(port->tty);
                                                                         spin_unlock_irqrestore(&port->lock, flags);
                                                                       }
                                                                     }
                                                                   }
                                                                 }
                                                               // interrupt handler ends
                                                               write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
                                                             }
                    }
                  }
                }
              }
            }
          }
        }
      }
      spin_unlock_irqrestore(&port->lock, flags);
    }
  }

Well, it seems that read_mems_allowed_begin() is protected by calling
local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).

Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
before write_seqlock(&zonelist_update_seq))?
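
For reference, the protection in question would look something like the sketch
below; write_seqlock_irqsave() is the existing seqlock helper that disables
local irqs before taking the write side, and the function body is only
illustrative, not the actual __build_all_zonelists():

  static void __build_all_zonelists(void *data)
  {
          unsigned long flags;

          /* Disable local irqs first, then take the seqlock write side. */
          write_seqlock_irqsave(&zonelist_update_seq, flags);
          /* ... rebuild the zonelists here ... */
          write_sequnlock_irqrestore(&zonelist_update_seq, flags);
  }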

But even if write_seqlock(&zonelist_update_seq) is called with local irqs disabled,
wouldn't port_lock_key make this warning appear again after all?



This bug report might be a suggestion that we want two versions of
__alloc_pages_slowpath(): one for atomic context, geared towards smaller
kernel stack usage and a simplified locking dependency (because atomic
allocations can happen from subtle contexts, including interrupt handlers),
and the other a noinline version for schedulable context, geared towards
larger kernel stack usage and the complicated locking dependencies needed
for rich retry paths including direct reclaim and OOM kill...
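
Very roughly, such a split might look like the following sketch (the
_atomic/_sleepable helpers are hypothetical names, not functions that exist
in the tree):

  /* Hypothetical dispatch only; neither helper exists today. */
  static struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                             struct alloc_context *ac)
  {
          if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
                  /* small stack, simple locking, no direct reclaim or OOM kill */
                  return __alloc_pages_slowpath_atomic(gfp_mask, order, ac);
          /* larger stack, full retry paths with direct reclaim and OOM kill */
          return __alloc_pages_slowpath_sleepable(gfp_mask, order, ac);
  }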
Michal Hocko April 3, 2023, 12:09 p.m. UTC | #3
On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
> On 2023/04/03 17:15, Michal Hocko wrote:
> > Is this
> > https://lore.kernel.org/all/0000000000001d74d205f7c1821f@google.com/ the
> > underlying report?
> 
> Yes.
> 
> > Could you explain the deadlock scenario?
> 
> build_zonelists() from __build_all_zonelists() calls printk() with
> zonelist_update_seq held.
> 
> printk() holds the console_owner lock for synchronous printing, and then, upon
> unlock of the console_owner lock, printk() holds port_lock_key and port->lock.
> 
> tty_insert_flip_string_and_push_buffer() from pty_write() conditionally calls
> kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. But since commit 3d36424b3b58,
> zonelist_update_seq is checked by GFP_ATOMIC allocations (i.e. a new locking dependency
> was added by that commit).
> 
>   CPU0                                                       CPU1
>   pty_write() {
>     tty_insert_flip_string_and_push_buffer() {
>                                                              __build_all_zonelists() {
>       spin_lock_irqsave(&port->lock, flags);
>       tty_insert_flip_string() {
>         tty_insert_flip_string_fixed_flag() {
>           __tty_buffer_request_room() {
>             tty_buffer_alloc() {
>               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
>                 __alloc_pages_slowpath() {
>                                                                write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>                                                                // interrupt handler starts
>                                                                  handle_irq() {
>                                                                    serial8250_interrupt() {
>                                                                      serial8250_tx_chars() {
>                                                                        tty_port_tty_get() {
>                                                                          spin_lock_irqsave(&port->lock, flags); // spins here waiting for kmalloc() from tty_insert_flip_string() to complete
>                   zonelist_iter_begin() {
>                     read_seqbegin(&zonelist_update_seq) {
>                       // spins here waiting for interrupt handler to complete if zonelist_update_seq.seqcount is odd
>                                                                          tty = tty_kref_get(port->tty);
>                                                                          spin_unlock_irqrestore(&port->lock, flags);
>                                                                        }
>                                                                      }
>                                                                    }
>                                                                  }
>                                                                // interrupt handler ends
>                                                                write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>                                                              }
>                     }
>                   }
>                 }
>               }
>             }
>           }
>         }
>       }
>       spin_unlock_irqrestore(&port->lock, flags);
>     }
>   }

Thank you! IIUC this can only happen when there is a race with
memory hotplug. So pretty much a very rare event. Also I am not really
sure this requires any changes at the allocator level. I would
much rather sacrifice the printk in build_zonelists or pull it out of
the locked section. Or would printk_deferred help in this case?
Tetsuo Handa April 3, 2023, 12:51 p.m. UTC | #4
On 2023/04/03 21:09, Michal Hocko wrote:
> On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
>> Well, it seems that read_mems_allowed_begin() is protected by calling
>> local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).
>> 
>> Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
>> before write_seqlock(&zonelist_update_seq))?
>> 
>> But even if write_seqlock(&zonelist_update_seq) is called with local irqs disabled,
>> wouldn't port_lock_key make this warning appear again after all?

Hmm, local_irq_save(flags) before write_seqlock(&zonelist_update_seq) won't help.
Synchronous printk() will try to hold port->lock from process context even if local
irqs are disabled, won't it? This is not limited to interrupt handlers: any synchronous
printk() inside the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
section is unsafe.

> Thank you! IIUC this can only happen when there is a race with
> memory hotplug. So pretty much a very rare event.

Right.

>                                                   Also I am not really
> sure this requires any changes at the allocator level. I would
> much rather sacrifice the printk in build_zonelists or pull it out of
> the locked section. Or would printk_deferred help in this case?

Just moving printk() out of the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
section is not sufficient. This problem will happen as long as an interrupt handler tries to hold port->lock.
Disabling local irqs will also be needed.

By the way, does this case qualify as a user of printk_deferred(), given that printk_deferred() says

  /*
   * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
   */
  __printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);

?

Since this is a problem introduced by an mm change, I think fixing it on the
mm side is cleaner. Can't there be a different approach? For example, can't we
replace

	cpuset_mems_cookie = read_mems_allowed_begin();
	zonelist_iter_cookie = zonelist_iter_begin();

and

	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
	    check_retry_zonelist(zonelist_iter_cookie))

with different conditions, like recalculating the cpuset/zonelist at the last second and
checking immediately before giving up the allocation or invoking the OOM killer whether they have changed?
Michal Hocko April 3, 2023, 1:44 p.m. UTC | #5
On Mon 03-04-23 21:51:29, Tetsuo Handa wrote:
> On 2023/04/03 21:09, Michal Hocko wrote:
> > On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
> >> Well, it seems that read_mems_allowed_begin() is protected by calling
> >> local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).
> >> 
> >> Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
> >> before write_seqlock(&zonelist_update_seq))?
> >> 
> >> But even if write_seqlock(&zonelist_update_seq) is called with local irqs disabled,
> >> wouldn't port_lock_key make this warning appear again after all?
> 
> Hmm, local_irq_save(flags) before write_seqlock(&zonelist_update_seq) won't help.
> Synchronous printk() will try to hold port->lock from process context even if local
> irqs are disabled, won't it? This is not limited to interrupt handlers: any synchronous
> printk() inside the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> section is unsafe.
> 
> > Thank you! IIUC this can only happen when there is a race with
> > memory hotplug. So pretty much a very rare event.
> 
> Right.
> 
> >                                                   Also I am not really
> > sure this requires any changes at the allocator level. I would
> > much rather sacrifice the printk in build_zonelists or pull it out of
> > the locked section. Or would printk_deferred help in this case?
> 
> Just moving printk() out of the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> section is not sufficient. This problem will happen as long as an interrupt handler tries to hold port->lock.

I do not follow. How is a printk outside of zonelist_update_seq going to
cause a dead/live lock? There shouldn't be any other locks (apart from
hotplug) taken in that path IIRC.

> Disabling local irqs will also be needed.

Why?

> By the way, does this case qualify as a user of printk_deferred(), given that printk_deferred() says
> 
>   /*
>    * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
>    */
>   __printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);
> 
> ?

Dunno, question for the printk maintainers. I know they want to limit the
usage. Maybe this qualifies as an exception-worthy case as well.


> Since this is a problem introduced by an mm change, I think fixing it on the
> mm side is cleaner.

Agreed. That would be one of the options I have mentioned. I do not
think the printk information serves such a big role that we couldn't live
without it.

> Can't there be a different approach? For example, can't we
> replace
> 
> 	cpuset_mems_cookie = read_mems_allowed_begin();
> 	zonelist_iter_cookie = zonelist_iter_begin();
> 
> and
> 
> 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> 	    check_retry_zonelist(zonelist_iter_cookie))
> 
> with different conditions, like recalculating the cpuset/zonelist at the last second and
> checking immediately before giving up the allocation or invoking the OOM killer whether they have changed?

Dunno, and honestly that is a subtle piece of code that I would rather not
touch just because we have limitations in printk usage. Especially
considering the above.
Petr Mladek April 3, 2023, 3:12 p.m. UTC | #6
On Mon 2023-04-03 15:44:34, Michal Hocko wrote:
> On Mon 03-04-23 21:51:29, Tetsuo Handa wrote:
> > On 2023/04/03 21:09, Michal Hocko wrote:
> > > On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
> > >> Well, it seems that read_mems_allowed_begin() is protected by calling
> > >> local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).
> > >> 
> > >> Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
> > >> before write_seqlock(&zonelist_update_seq))?
> > >> 
> > >> But even if write_seqlock(&zonelist_update_seq) is called with local irqs disabled,
> > >> wouldn't port_lock_key make this warning appear again after all?
> > 
> > Hmm, local_irq_save(flags) before write_seqlock(&zonelist_update_seq) won't help.
> > Synchronous printk() will try to hold port->lock from process context even if local
> > irqs are disabled, won't it? This is not limited to interrupt handlers: any synchronous
> > printk() inside the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> > section is unsafe.
> > 
> > > Thank you! IIUC this can only happen when there is a race with
> > > memory hotplug. So pretty much a very rare event.
> > 
> > Right.
> > 
> > >                                                   Also I am not really
> > > sure this requires any changes at the allocator level. I would
> > > much rather sacrifice the printk in build_zonelists or pull it out of
> > > the locked section. Or would printk_deferred help in this case?
> > 
> > Just moving printk() out of the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> > section is not sufficient. This problem will happen as long as an interrupt handler tries to hold port->lock.
> 
> I do not follow. How is a printk outside of zonelist_update_seq going to
> cause a dead/live lock? There shouldn't be any other locks (apart from
> hotplug) taken in that path IIRC.
> 
> > Disabling local irqs will also be needed.
> 
> Why?
> 
> > By the way, does this case qualify as a user of printk_deferred(), given that printk_deferred() says
> > 
> >   /*
> >    * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
> >    */
> >   __printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);
> > 
> > ?
> 
> Dunno, question for the printk maintainers. I know they want to limit the
> usage. Maybe this qualifies as an exception-worthy case as well.

Sigh, I am afraid that printk_deferred() is the best option at this
moment.

I see that there are more pr_info()/pr_cont() calls in build_zonelists.
You might want to use printk_deferred_enter()/exit() around them.
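
Something along these lines, as a sketch only (the exact placement inside
__build_all_zonelists() and the interaction with the irq state would need care):

  unsigned long flags;

  local_irq_save(flags);        /* keep irq handlers out of this allocator path */
  printk_deferred_enter();      /* defer console printing while the lock is write-held */
  write_seqlock(&zonelist_update_seq);

  /* ... build_zonelists() with its pr_info()/pr_cont() calls ... */

  write_sequnlock(&zonelist_update_seq);
  printk_deferred_exit();
  local_irq_restore(flags);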

These problems should disappear once printk() messages get processed
in a kthread. And pr_info() is not critical information by definition,
so it is a perfect candidate for deferring.

Unfortunately, introducing the kthreads takes ages. So, we will have
to live with printk_deferred() for some time.


Important:

printk_deferred() should still be used with care. The messages will
be pushed to the console in another random context that might be
atomic. The more messages we defer, the bigger the risk of softlockups
we create.

Rules of thumb:

  + printk_deferred() would make sense for printing a few lines
    in code where the dependency on the console driver
    code is really complicated. That seems to be the case here.

  + printk_deferred() is not a good solution for long reports. It
    would just increase the risk of softlockups and could break
    the system.

Best Regards,
Petr
Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..80ef79b23865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4759,17 +4759,17 @@  EXPORT_SYMBOL_GPL(fs_reclaim_release);
  */
 static DEFINE_SEQLOCK(zonelist_update_seq);
 
-static unsigned int zonelist_iter_begin(void)
+static unsigned int zonelist_iter_begin(gfp_t gfp_mask)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp_mask & __GFP_DIRECT_RECLAIM))
 		return read_seqbegin(&zonelist_update_seq);
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(unsigned int seq, gfp_t gfp_mask)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp_mask & __GFP_DIRECT_RECLAIM))
 		return read_seqretry(&zonelist_update_seq, seq);
 
 	return seq;
@@ -5083,7 +5083,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	zonelist_iter_cookie = zonelist_iter_begin();
+	zonelist_iter_cookie = zonelist_iter_begin(gfp_mask);
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5261,7 +5261,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(zonelist_iter_cookie, gfp_mask))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -5287,7 +5287,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(zonelist_iter_cookie, gfp_mask))
 		goto restart;
 
 	/*