
[v2] mm/page_alloc: don't check zonelist_update_seq from atomic allocations

Message ID 6cc13636-eda6-6a95-6564-db1c9ae76bb6@I-love.SAKURA.ne.jp (mailing list archive)
State New
Series [v2] mm/page_alloc: don't check zonelist_update_seq from atomic allocations

Commit Message

Tetsuo Handa Aug. 9, 2023, 11:03 a.m. UTC
Sebastian Andrzej Siewior reported that commit 1007843a9190
("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
is problematic for the CONFIG_PREEMPT_RT=y case, because PREEMPT_RT kernels
do not allow holding spinlocks with interrupts disabled; PREEMPT_RT manages
priority by making spinlocks sleepable.

Commit 3d36424b3b58 ("mm/page_alloc: fix race condition between
build_all_zonelists and page allocation") was obviously wrong in that it
introduced a spinlock into !__GFP_DIRECT_RECLAIM allocations without
acknowledging the reality that we cannot figure out all possible locking
dependencies. As commit 726ccdba1521 ("kasan,kmsan: remove
__GFP_KSWAPD_RECLAIM usage from kasan/kmsan") says, !__GFP_DIRECT_RECLAIM
allocations might happen with arbitrary locks held. But the page allocator
does not offer a gfp flag that opts out of taking the zonelist_update_seq
seqlock. Under such circumstances, the safer approach is not to opt in to
taking the zonelist_update_seq seqlock when sleeping is not permitted.

This is an updated and optimized version of [1]. This patch eliminates the

     while ((__seq = seqprop_sequence(s)) & 1)
     	   cpu_relax();

path from zonelist_iter_begin(), which is always called whenever
__alloc_pages_slowpath() is called. There is no need to wait for the
rebuilding of zonelists to complete, for a zonelist rebuild being in
flight does not mean that the allocation can never succeed. When the
allocation does not fail, this "while" loop is nothing but a waste of CPU
time. And it is very likely that no zonelist rebuild is in flight in the
first place.

This patch not only avoids the possibility of deadlock but also makes
zonelist_iter_begin() faster and simpler. This change is an improvement
even without considering the printk() and lockdep/KCSAN related problems.
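
For reference, the reader/writer protocol this patch relies on reduces to
the following standalone sketch (a userspace C11 model with illustrative
names and simplified memory ordering, not the kernel code):

  #include <stdatomic.h>
  #include <stdbool.h>

  static atomic_uint model_update_seq;

  /* Lockless snapshot; no waiting for an in-flight rebuild. */
  static unsigned int model_iter_begin(void)
  {
          return atomic_load_explicit(&model_update_seq, memory_order_acquire);
  }

  static bool model_check_retry(bool can_sleep, unsigned int seq)
  {
          unsigned int seq2;

          if (!can_sleep)
                  return false;   /* the atomic (!__GFP_DIRECT_RECLAIM) side never retries */

          seq2 = atomic_load_explicit(&model_update_seq, memory_order_acquire);
          /* Retry if a rebuild started or finished (seq != seq2), or is
             still in flight (seq2 is odd). */
          return seq != seq2 || (seq2 & 1);
  }

  /* Writer side: increments are serialized externally, and the writer must
     not sleep between the two increments. */
  static void model_rebuild(void)
  {
          atomic_fetch_add_explicit(&model_update_seq, 1, memory_order_seq_cst); /* -> odd */
          /* ... rebuild zonelists ... */
          atomic_fetch_add_explicit(&model_update_seq, 1, memory_order_seq_cst); /* -> even */
  }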

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lkml.kernel.org/r/20230621104034.HT6QnNkQ@linutronix.de
Link: https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp [1]
Fixes: 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
This patch replaces "mm/page_alloc: use write_seqlock_irqsave() instead
write_seqlock() + local_irq_save()." in linux-next.git tree.

 mm/page_alloc.c | 73 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 28 deletions(-)

Comments

Michal Hocko Aug. 9, 2023, 12:49 p.m. UTC | #1
On Wed 09-08-23 20:03:00, Tetsuo Handa wrote:
> Sebastian Andrzej Siewior reported that commit 1007843a9190
> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> is problematic for the CONFIG_PREEMPT_RT=y case, because PREEMPT_RT kernels
> do not allow holding spinlocks with interrupts disabled; PREEMPT_RT manages
> priority by making spinlocks sleepable.
> 
> Commit 3d36424b3b58 ("mm/page_alloc: fix race condition between
> build_all_zonelists and page allocation") was obviously wrong in that it
> introduced a spinlock into !__GFP_DIRECT_RECLAIM allocations without
> acknowledging the reality that we cannot figure out all possible locking
> dependencies. As commit 726ccdba1521 ("kasan,kmsan: remove
> __GFP_KSWAPD_RECLAIM usage from kasan/kmsan") says, !__GFP_DIRECT_RECLAIM
> allocations might happen with arbitrary locks held. But the page allocator
> does not offer a gfp flag that opts out of taking the zonelist_update_seq
> seqlock. Under such circumstances, the safer approach is not to opt in to
> taking the zonelist_update_seq seqlock when sleeping is not permitted.

Yes, it doesn't allow any constraint like that, so it is the kasan code that
needs to be fixed to not use the page allocator if it is executed from a
context like that. There is no bug in taking a non-sleeping lock from
GFP_NOWAIT-like allocations.

> This is an updated and optimized version of [1]. This patch eliminates the
> 
>      while ((__seq = seqprop_sequence(s)) & 1)
>      	   cpu_relax();
> 
> path from zonelist_iter_begin(), which is always called whenever
> __alloc_pages_slowpath() is called. There is no need to wait for the
> rebuilding of zonelists to complete, for a zonelist rebuild being in
> flight does not mean that the allocation can never succeed. When the
> allocation does not fail, this "while" loop is nothing but a waste of CPU
> time. And it is very likely that no zonelist rebuild is in flight in the
> first place.
> 
> This patch not only avoids the possibility of deadlock but also makes
> zonelist_iter_begin() faster and simpler. This change is an improvement
> even without considering the printk() and lockdep/KCSAN related problems.
> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Closes: https://lkml.kernel.org/r/20230621104034.HT6QnNkQ@linutronix.de
> Link: https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp [1]
> Fixes: 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

As per previous discussion http://lkml.kernel.org/r/ZMfETPzGfpPP7F79@dhcp22.suse.cz
Nacked-by: Michal Hocko <mhocko@suse.com>

And to be really honest, this whole thing is highly annoying. We already
have a patch to address the RT problem. You have nacked it because of a
highly theoretical concern you cannot really prove exists, and now you are
making the code unnecessarily more complex and harder to maintain as a
result. I will not speak for others, but this is not how _I_ would like to
see this code maintained.

> ---
> This patch replaces "mm/page_alloc: use write_seqlock_irqsave() instead
> write_seqlock() + local_irq_save()." in linux-next.git tree.
> 
>  mm/page_alloc.c | 73 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7d3460c7a480..5557d9a2ff2c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  
>  /*
>   * Zonelists may change due to hotplug during allocation. Detect when zonelists
> - * have been rebuilt so allocation retries. Reader side does not lock and
> - * retries the allocation if zonelist changes. Writer side is protected by the
> - * embedded spin_lock.
> + * have been rebuilt so __GFP_DIRECT_RECLAIM allocation retries. Writer side
> + * makes this sequence odd before rebuilding zonelists and makes this sequence
> + * even after rebuilding zonelists. Since writer side disables context switching
> + * when rebuilding zonelists, and !__GFP_DIRECT_RECLAIM allocation will not
> + * retry when zonelists changed, reader side does not need to hold a lock (but
> + * needs to use data_race() annotation), making locking dependency simpler.
>   */
> -static DEFINE_SEQLOCK(zonelist_update_seq);
> +static unsigned int zonelist_update_seq;
>  
>  static unsigned int zonelist_iter_begin(void)
>  {
>  	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> -		return read_seqbegin(&zonelist_update_seq);
> +		/* See comment above. */
> +		return data_race(READ_ONCE(zonelist_update_seq));
>  
>  	return 0;
>  }
>  
> -static unsigned int check_retry_zonelist(unsigned int seq)
> +static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
>  {
> -	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> -		return read_seqretry(&zonelist_update_seq, seq);
> +	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
> +		/* See comment above. */
> +		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
>  
> -	return seq;
> +		/*
> +		 * "seq != seq2" indicates that __build_all_zonelists() has
> +		 * started or has finished rebuilding zonelists, hence retry.
> +		 * "seq == seq2 && (seq2 & 1)" indicates that
> +		 * __build_all_zonelists() is still rebuilding zonelists
> +		 * with context switching disabled, hence retry.
> +		 * "seq == seq2 && !(seq2 & 1)" indicates that
> +		 * __build_all_zonelists() did not rebuild zonelists, hence
> +		 * no retry.
> +		 */
> +		return unlikely(seq != seq2 || (seq2 & 1));
> +	}
> +
> +	return 0;
>  }
>  
>  /* Perform direct synchronous page reclaim */
> @@ -4146,7 +4164,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * a unnecessary OOM kill.
>  	 */
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> -	    check_retry_zonelist(zonelist_iter_cookie))
> +	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
>  		goto restart;
>  
>  	/* Reclaim has failed us, start killing things */
> @@ -4172,7 +4190,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * a unnecessary OOM kill.
>  	 */
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> -	    check_retry_zonelist(zonelist_iter_cookie))
> +	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
>  		goto restart;
>  
>  	/*
> @@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	static DEFINE_SPINLOCK(lock);
>  	unsigned long flags;
>  
> -	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> -	 */
> -	local_irq_save(flags);
> -	/*
> -	 * Explicitly disable this CPU's synchronous printk() before taking
> -	 * seqlock to prevent any printk() from trying to hold port->lock, for
> -	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> -	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> -	 */
> -	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
> +#ifdef CONFIG_PREEMPT_RT
> +	migrate_disable();
> +#endif
> +	/* Serialize increments of zonelist_update_seq. */
> +	spin_lock_irqsave(&lock, flags);
> +	zonelist_update_seq++;
> +	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
> +	smp_wmb();
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5188,9 +5201,13 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> -	write_sequnlock(&zonelist_update_seq);
> -	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	/* Tell check_retry_zonelist() that we finished rebuilding zonelists. */
> +	smp_wmb();
> +	zonelist_update_seq++;
> +	spin_unlock_irqrestore(&lock, flags);
> +#ifdef CONFIG_PREEMPT_RT
> +	migrate_enable();
> +#endif
>  }
>  
>  static noinline void __init
> -- 
> 2.18.4
Sebastian Andrzej Siewior Aug. 10, 2023, 7:26 a.m. UTC | #2
On 2023-08-09 20:03:00 [+0900], Tetsuo Handa wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7d3460c7a480..5557d9a2ff2c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
…
> -static DEFINE_SEQLOCK(zonelist_update_seq);
> +static unsigned int zonelist_update_seq;
>  
>  static unsigned int zonelist_iter_begin(void)
>  {
>  	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> -		return read_seqbegin(&zonelist_update_seq);
> +		/* See comment above. */
> +		return data_race(READ_ONCE(zonelist_update_seq));

This is an open-coded raw_read_seqcount() while it should have been
raw_seqcount_begin().

>  	return 0;
>  }
>  
> -static unsigned int check_retry_zonelist(unsigned int seq)
> +static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
>  {
> -	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> -		return read_seqretry(&zonelist_update_seq, seq);
> +	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
> +		/* See comment above. */
> +		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
>  
> -	return seq;
> +		/*
> +		 * "seq != seq2" indicates that __build_all_zonelists() has
> +		 * started or has finished rebuilding zonelists, hence retry.
> +		 * "seq == seq2 && (seq2 & 1)" indicates that
> +		 * __build_all_zonelists() is still rebuilding zonelists
> +		 * with context switching disabled, hence retry.
> +		 * "seq == seq2 && !(seq2 & 1)" indicates that
> +		 * __build_all_zonelists() did not rebuild zonelists, hence
> +		 * no retry.
> +		 */
> +		return unlikely(seq != seq2 || (seq2 & 1));

An open-coded read_seqcount_retry().

> +	}
> +
> +	return 0;
>  }
>  
>  /* Perform direct synchronous page reclaim */
> @@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	static DEFINE_SPINLOCK(lock);
>  	unsigned long flags;
>  
> -	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> -	 */
> -	local_irq_save(flags);
> -	/*
> -	 * Explicitly disable this CPU's synchronous printk() before taking
> -	 * seqlock to prevent any printk() from trying to hold port->lock, for
> -	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> -	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> -	 */
> -	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
> +#ifdef CONFIG_PREEMPT_RT
> +	migrate_disable();
> +#endif

There is no justification/explanation why migrate_disable() is needed here
on PREEMPT_RT, and I don't see one.

There are two changes here:
- The replacement of seqlock_t with something open-coded
- A logic change when a retry is needed (the gfp mask is considered).

I am not a big fan of open coding things, especially when it is not needed,
and then there is this ifdef which is not needed either. I am not commenting
on the logic change.

Can we please put an end to this?

> +	/* Serialize increments of zonelist_update_seq. */
> +	spin_lock_irqsave(&lock, flags);
> +	zonelist_update_seq++;
> +	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
> +	smp_wmb();
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));

Sebastian
Tetsuo Handa Aug. 10, 2023, 9:58 a.m. UTC | #3
On 2023/08/10 16:26, Sebastian Andrzej Siewior wrote:
> On 2023-08-09 20:03:00 [+0900], Tetsuo Handa wrote:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7d3460c7a480..5557d9a2ff2c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
> …
>> -static DEFINE_SEQLOCK(zonelist_update_seq);
>> +static unsigned int zonelist_update_seq;
>>  
>>  static unsigned int zonelist_iter_begin(void)
>>  {
>>  	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
>> -		return read_seqbegin(&zonelist_update_seq);
>> +		/* See comment above. */
>> +		return data_race(READ_ONCE(zonelist_update_seq));
> 
> This is open coded raw_read_seqcount() while it should have been
> raw_seqcount_begin().

Not an open-coded raw_read_seqcount(). You explained to us at
https://lkml.kernel.org/r/20230623101111.7tuAg5p5@linutronix.de that
seqprop_sequence() behaves differently if CONFIG_PREEMPT_RT=y.

The point of my proposal is to get rid of

  spin_lock(s->lock);
  spin_unlock(s->lock);

 from zonelist_iter_begin().
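
Roughly, the combined effect of the begin loop and seqprop_sequence() with
CONFIG_PREEMPT_RT=y that I want to avoid looks like the following simplified
userspace model (my reading of seqlock.h, with illustrative names, not the
literal kernel code): when the sequence is odd, the reader acquires and
releases the writer's lock instead of only spinning, so that the sleepable
writer gets priority-boosted.

  #include <pthread.h>
  #include <stdatomic.h>

  struct model_seqlock {
          atomic_uint sequence;
          pthread_mutex_t lock;   /* stands in for the RT-mutex backed spinlock */
  };

  static unsigned int model_rt_read_begin(struct model_seqlock *s)
  {
          unsigned int seq;

          while ((seq = atomic_load_explicit(&s->sequence,
                                             memory_order_acquire)) & 1) {
                  /* This is the lock/unlock pair that I want to keep out of
                     zonelist_iter_begin() for atomic allocations. */
                  pthread_mutex_lock(&s->lock);
                  pthread_mutex_unlock(&s->lock);
          }
          return seq;
  }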

Also, my version avoids a KCSAN warning by using data_race(), and avoids
papering over KCSAN warnings between zonelist_iter_begin() and
check_retry_zonelist() by not using kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX).

  /*
   * The seqlock seqcount_t interface does not prescribe a precise sequence of
   * read begin/retry/end. For readers, typically there is a call to
   * read_seqcount_begin() and read_seqcount_retry(), however, there are more
   * esoteric cases which do not follow this pattern.
   *
   * As a consequence, we take the following best-effort approach for raw usage
   * via seqcount_t under KCSAN: upon beginning a seq-reader critical section,
   * pessimistically mark the next KCSAN_SEQLOCK_REGION_MAX memory accesses as
   * atomics; if there is a matching read_seqcount_retry() call, no following
   * memory operations are considered atomic. Usage of the seqlock_t interface
   * is not affected.
   */

The section between zonelist_iter_begin() and check_retry_zonelist() is very
complicated, and a concurrency bug unrelated to the zonelist counters could
be found and fixed there.

> 
>>  	return 0;
>>  }
>>  
>> -static unsigned int check_retry_zonelist(unsigned int seq)
>> +static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
>>  {
>> -	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
>> -		return read_seqretry(&zonelist_update_seq, seq);
>> +	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
>> +		/* See comment above. */
>> +		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
>>  
>> -	return seq;
>> +		/*
>> +		 * "seq != seq2" indicates that __build_all_zonelists() has
>> +		 * started or has finished rebuilding zonelists, hence retry.
>> +		 * "seq == seq2 && (seq2 & 1)" indicates that
>> +		 * __build_all_zonelists() is still rebuilding zonelists
>> +		 * with context switching disabled, hence retry.
>> +		 * "seq == seq2 && !(seq2 & 1)" indicates that
>> +		 * __build_all_zonelists() did not rebuild zonelists, hence
>> +		 * no retry.
>> +		 */
>> +		return unlikely(seq != seq2 || (seq2 & 1));
> 
> open coded read_seqcount_retry().

Not an open-coded read_seqcount_retry(), for read_seqcount_retry() checks
only the "seq != seq2" condition. We need to check the "seq2 & 1" condition too.
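
In other words (illustrative helpers, not kernel functions):

  /* What read_seqcount_retry() effectively compares: */
  static int retry_like_read_seqcount_retry(unsigned int seq, unsigned int seq2)
  {
          return seq != seq2;
  }

  /* What this patch needs to check, because the snapshot may be odd: */
  static int retry_in_this_patch(unsigned int seq, unsigned int seq2)
  {
          return seq != seq2 || (seq2 & 1);
  }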

> 
>> +	}
>> +
>> +	return 0;
>>  }
>>  
>>  /* Perform direct synchronous page reclaim */
>> @@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
>>  	int nid;
>>  	int __maybe_unused cpu;
>>  	pg_data_t *self = data;
>> +	static DEFINE_SPINLOCK(lock);
>>  	unsigned long flags;
>>  
>> -	/*
>> -	 * Explicitly disable this CPU's interrupts before taking seqlock
>> -	 * to prevent any IRQ handler from calling into the page allocator
>> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
>> -	 */
>> -	local_irq_save(flags);
>> -	/*
>> -	 * Explicitly disable this CPU's synchronous printk() before taking
>> -	 * seqlock to prevent any printk() from trying to hold port->lock, for
>> -	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
>> -	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>> -	 */
>> -	printk_deferred_enter();
>> -	write_seqlock(&zonelist_update_seq);
>> +#ifdef CONFIG_PREEMPT_RT
>> +	migrate_disable();
>> +#endif
> 
> There is no justification/ explanation why migrate_disable() here is
> needed on PREEMPT_RT and I don't see one.

This migrate_disable() compensates for removing

  spin_lock(s->lock);
  spin_unlock(s->lock);

from zonelist_iter_begin(). Since neither the proposed zonelist_iter_begin()
nor the proposed check_retry_zonelist() holds the spinlock, we need to
guarantee that the thread which has performed the opening zonelist_update_seq++
can continue execution until the closing zonelist_update_seq++ without sleeping.
Running interrupt handlers is fine, for interrupt handlers can't do
__GFP_DIRECT_RECLAIM allocations, which in turn guarantees that interrupt
handlers which interrupt the thread that has performed the opening
zonelist_update_seq++ won't deadlock.

> 
> There are two changes here:
> - The replacement of seqlock_t with something open coded 

Yes.

> - Logic change when a retry is needed (the gfp mask is considered).

Yes.

> 
> I am not a big fan of open coding things especially when not needed and
> then there is this ifdef which is not needed as well. I don't comment on
> the logic change.

If __build_all_zonelists() can run without being switched away to other
threads (except interrupt handlers), I consider that this approach works.

> 
> Can we please put an end to this?
> 
>> +	/* Serialize increments of zonelist_update_seq. */
>> +	spin_lock_irqsave(&lock, flags);
>> +	zonelist_update_seq++;
>> +	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
>> +	smp_wmb();
>>  
>>  #ifdef CONFIG_NUMA
>>  	memset(node_load, 0, sizeof(node_load));
> 
> Sebastian
>
Tetsuo Handa Aug. 10, 2023, 3:35 p.m. UTC | #4
On 2023/08/10 18:58, Tetsuo Handa wrote:
> If __build_all_zonelists() can run without being switched away to other
> threads (except interrupt handlers), I consider that this approach works.

If there is no way to make sure that the section between
write_seqlock(&zonelist_update_seq) and write_sequnlock(&zonelist_update_seq)
runs without context switching (interrupt handlers are fine), something like
the below could be used in order to keep

  spin_lock(s->lock);
  spin_unlock(s->lock);

away from seqprop_sequence() for atomic allocations. But I think that loses
the reason to replace read_seqbegin() with raw_seqcount_begin(); that would be
essentially the same as
https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .



diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..f2f79caab2cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3644,20 +3644,20 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
  */
 static DEFINE_SEQLOCK(zonelist_update_seq);
 
-static unsigned int zonelist_iter_begin(void)
+static unsigned int zonelist_iter_begin(gfp_t gfp)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM))
+		return data_race(raw_seqcount_begin(&zonelist_update_seq.seqcount));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM))
+		return data_race(read_seqcount_retry(&zonelist_update_seq.seqcount, seq));
 
-	return seq;
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -3968,7 +3968,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	zonelist_iter_cookie = zonelist_iter_begin();
+	zonelist_iter_cookie = zonelist_iter_begin(gfp_mask);
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -4146,7 +4146,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4172,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5138,20 +5138,7 @@ static void __build_all_zonelists(void *data)
 	pg_data_t *self = data;
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+	write_seqlock_irqsave(&zonelist_update_seq, flags);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5175,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
 }
 
 static noinline void __init

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..5557d9a2ff2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3638,26 +3638,44 @@  EXPORT_SYMBOL_GPL(fs_reclaim_release);
 
 /*
  * Zonelists may change due to hotplug during allocation. Detect when zonelists
- * have been rebuilt so allocation retries. Reader side does not lock and
- * retries the allocation if zonelist changes. Writer side is protected by the
- * embedded spin_lock.
+ * have been rebuilt so __GFP_DIRECT_RECLAIM allocation retries. Writer side
+ * makes this sequence odd before rebuilding zonelists and makes this sequence
+ * even after rebuilding zonelists. Since writer side disables context switching
+ * when rebuilding zonelists, and !__GFP_DIRECT_RECLAIM allocation will not
+ * retry when zonelists changed, reader side does not need to hold a lock (but
+ * needs to use data_race() annotation), making locking dependency simpler.
  */
-static DEFINE_SEQLOCK(zonelist_update_seq);
+static unsigned int zonelist_update_seq;
 
 static unsigned int zonelist_iter_begin(void)
 {
 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+		/* See comment above. */
+		return data_race(READ_ONCE(zonelist_update_seq));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
+		/* See comment above. */
+		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
 
-	return seq;
+		/*
+		 * "seq != seq2" indicates that __build_all_zonelists() has
+		 * started or has finished rebuilding zonelists, hence retry.
+		 * "seq == seq2 && (seq2 & 1)" indicates that
+		 * __build_all_zonelists() is still rebuilding zonelists
+		 * with context switching disabled, hence retry.
+		 * "seq == seq2 && !(seq2 & 1)" indicates that
+		 * __build_all_zonelists() did not rebuild zonelists, hence
+		 * no retry.
+		 */
+		return unlikely(seq != seq2 || (seq2 & 1));
+	}
+
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -4146,7 +4164,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4190,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5136,22 +5154,17 @@  static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_disable();
+#endif
+	/* Serialize increments of zonelist_update_seq. */
+	spin_lock_irqsave(&lock, flags);
+	zonelist_update_seq++;
+	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
+	smp_wmb();
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5201,13 @@  static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	/* Tell check_retry_zonelist() that we finished rebuilding zonelists. */
+	smp_wmb();
+	zonelist_update_seq++;
+	spin_unlock_irqrestore(&lock, flags);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_enable();
+#endif
 }
 
 static noinline void __init