diff mbox series

[v2] mm/page_alloc: don't wake kswapd from rmqueue() unless __GFP_KSWAPD_RECLAIM is specified

Message ID c3c3dacf-dd3b-77c9-f96a-d0982b4b2a4f@I-love.SAKURA.ne.jp (mailing list archive)
State New
Headers show
Series [v2] mm/page_alloc: don't wake kswapd from rmqueue() unless __GFP_KSWAPD_RECLAIM is specified | expand

Commit Message

Tetsuo Handa May 14, 2023, 12:28 a.m. UTC
Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
using ZONE_BOOSTED_WATERMARK flag.

Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
__GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
risk of deadlock. But since zone->flags is a shared variable, a thread
doing !__GFP_KSWAPD_RECLAIM allocation request might observe this flag
being set immediately after another thread doing __GFP_KSWAPD_RECLAIM
allocation request set this flag, causing possibility of deadlock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held")
---
Changes in v2:
  Check ALLOC_KSWAPD before checking ZONE_BOOSTED_WATERMARK and update
  description, suggested by Mel Gorman <mgorman@techsingularity.net>.

 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Huang, Ying May 15, 2023, 6:03 a.m. UTC | #1
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> writes:

> Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
> held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
> using ZONE_BOOSTED_WATERMARK flag.
>
> Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
> __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
> remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
> risk of deadlock. But since zone->flags is a shared variable, a thread
> doing !__GFP_KSWAPD_RECLAIM allocation request might observe this flag
> being set immediately after another thread doing __GFP_KSWAPD_RECLAIM
> allocation request set this flag, causing possibility of deadlock.

Sorry, I don't understand what is the deadlock here.

I checked commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with
zone lock held") and the corresponding mail thread.  From the below
mail,

https://lore.kernel.org/all/20190107204627.GA25526@cmpxchg.org/

commit 73444bc4d8f9 fixed a circular locking ordering as follows,

pi lock -> rq lock -> timer base lock -> zone lock -> wakeup lock
(kswapd_wait, fixed) -> pi lock

But I don't know what is the deadlock that your patch fixed.  Can you
teach me on that?

Best Regards,
Huang, Ying

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Fixes: 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held")
> ---
> Changes in v2:
>   Check ALLOC_KSWAPD before checking ZONE_BOOSTED_WATERMARK and update
>   description, suggested by Mel Gorman <mgorman@techsingularity.net>.
>
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b..ecad680cec53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3052,7 +3052,8 @@ struct page *rmqueue(struct zone *preferred_zone,
>  
>  out:
>  	/* Separate test+clear to avoid unnecessary atomics */
> -	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
> +	if ((alloc_flags & ALLOC_KSWAPD) &&
> +	    unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
>  		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
>  		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
>  	}
Huang, Ying May 15, 2023, 6:35 a.m. UTC | #2
Hi, Tetsuo,

"Huang, Ying" <ying.huang@intel.com> writes:

> Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> writes:
>
>> Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
>> held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
>> using ZONE_BOOSTED_WATERMARK flag.
>>
>> Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
>> __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
>> remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
>> risk of deadlock. But since zone->flags is a shared variable, a thread
>> doing !__GFP_KSWAPD_RECLAIM allocation request might observe this flag
>> being set immediately after another thread doing __GFP_KSWAPD_RECLAIM
>> allocation request set this flag, causing possibility of deadlock.
>
> Sorry, I don't understand what is the deadlock here.
>
> I checked commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with
> zone lock held") and the corresponding mail thread.  From the below
> mail,
>
> https://lore.kernel.org/all/20190107204627.GA25526@cmpxchg.org/
>
> commit 73444bc4d8f9 fixed a circular locking ordering as follows,
>
> pi lock -> rq lock -> timer base lock -> zone lock -> wakeup lock
> (kswapd_wait, fixed) -> pi lock
>
> But I don't know what is the deadlock that your patch fixed.  Can you
> teach me on that?

Just read your email in another thread related to this patch as follow,

https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/

Is that the deadlock that you tried to fix in this patch?

It appears that commit 73444bc4d8f9 didn't fix the deadlock above.  It
just convert the circular locking ordering to,

pi lock -> rq lock -> timer base lock -> wakeup lock (kswapd_wait,
fixed) -> pi lock

If so, I think that it's better to add corresponding information in
patch description to avoid the possible confusion.

Best Regards,
Huang, Ying

>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Fixes: 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held")
>> ---
>> Changes in v2:
>>   Check ALLOC_KSWAPD before checking ZONE_BOOSTED_WATERMARK and update
>>   description, suggested by Mel Gorman <mgorman@techsingularity.net>.
>>
>>  mm/page_alloc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 47421bedc12b..ecad680cec53 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3052,7 +3052,8 @@ struct page *rmqueue(struct zone *preferred_zone,
>>  
>>  out:
>>  	/* Separate test+clear to avoid unnecessary atomics */
>> -	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
>> +	if ((alloc_flags & ALLOC_KSWAPD) &&
>> +	    unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
>>  		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
>>  		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
>>  	}
Mel Gorman May 15, 2023, 7:38 a.m. UTC | #3
On Sun, May 14, 2023 at 09:28:56AM +0900, Tetsuo Handa wrote:
> Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
> held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
> using ZONE_BOOSTED_WATERMARK flag.
> 
> Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
> __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
> remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
> risk of deadlock.

kswapd_wait is a waitqueue so what is being held? It's safe for kswapd
to try wake itself as the waitqueue will be active when wakeup_kswapd()
is called so no wakeup occurs. If there is a deadlock, it needs a better
explanation. I believe I already stated why this patch is fixing a bug
but it wasn't deadlock related.
Tetsuo Handa May 15, 2023, 10:17 a.m. UTC | #4
On 2023/05/15 16:38, Mel Gorman wrote:
> On Sun, May 14, 2023 at 09:28:56AM +0900, Tetsuo Handa wrote:
>> Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
>> held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
>> using ZONE_BOOSTED_WATERMARK flag.
>>
>> Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
>> __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
>> remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
>> risk of deadlock.
> 
> kswapd_wait is a waitqueue so what is being held? It's safe for kswapd
> to try wake itself as the waitqueue will be active when wakeup_kswapd()
> is called so no wakeup occurs. If there is a deadlock, it needs a better
> explanation. I believe I already stated why this patch is fixing a bug
> but it wasn't deadlock related.
> 

I noticed this problem ( pgdat->kswapd_wait might be held without
__GFP_KSWAPD_RECLAIM ) when analyzing a different problem ( debugobject code
is holding pgdat->kswapd_wait due to __GFP_KSWAPD_RECLAIM ) reported at
https://syzkaller.appspot.com/bug?extid=fe0c72f0ccbb93786380 .

I'm calling pgdat->kswapd_wait a lock, as lockdep uses it as a lock name.

The latter was explained at
https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/
and we agreed that debugobject code needs to drop __GFP_KSWAPD_RECLAIM at
https://lore.kernel.org/linux-mm/87fs809fwk.ffs@tglx/ .

This patch is for making sure that debugobject code will not try to hold
pgdat->kswapd_wait after debugobject code dropped __GFP_KSWAPD_RECLAIM.

Thus, the problem this patch will fix is a deadlock related, isn't it?
Huang, Ying May 16, 2023, 1:44 a.m. UTC | #5
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> writes:

> On 2023/05/15 16:38, Mel Gorman wrote:
>> On Sun, May 14, 2023 at 09:28:56AM +0900, Tetsuo Handa wrote:
>>> Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
>>> held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
>>> using ZONE_BOOSTED_WATERMARK flag.
>>>
>>> Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
>>> __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
>>> remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
>>> risk of deadlock.
>> 
>> kswapd_wait is a waitqueue so what is being held? It's safe for kswapd
>> to try wake itself as the waitqueue will be active when wakeup_kswapd()
>> is called so no wakeup occurs. If there is a deadlock, it needs a better
>> explanation. I believe I already stated why this patch is fixing a bug
>> but it wasn't deadlock related.
>> 
>
> I noticed this problem ( pgdat->kswapd_wait might be held without
> __GFP_KSWAPD_RECLAIM ) when analyzing a different problem ( debugobject code
> is holding pgdat->kswapd_wait due to __GFP_KSWAPD_RECLAIM ) reported at
> https://syzkaller.appspot.com/bug?extid=fe0c72f0ccbb93786380 .
>
> I'm calling pgdat->kswapd_wait a lock, as lockdep uses it as a lock name.

This has confused me much before.  IIUC, the deadlock is unrelated with
kswapd wakeup itself.  pgdat->kswapd_wait is the lock name and the lock
in fact is the spinlock: pgdat->kswapd_wait.lock.  So the deadlock is,

pgdat->kswapd_wait.lock holders take the pi lock
pi lock holders take the rq lock
rq lock holders take the timer base lock
timer base lock holders take the pgdat->kswapd_wait.lock (missing __GFP_KSWAPD_RECLAIM check)

The above is based on analysis in

https://lore.kernel.org/all/20190107204627.GA25526@cmpxchg.org/
https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/

Tetsuo's patch avoids to take pgdat->kswapd_wait.lock when timer base
lock is held via adding check for __GFP_KSWAPD_RECLAIM, so breaks the
circular dependency chain.

> The latter was explained at
> https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/
> and we agreed that debugobject code needs to drop __GFP_KSWAPD_RECLAIM at
> https://lore.kernel.org/linux-mm/87fs809fwk.ffs@tglx/ .
>
> This patch is for making sure that debugobject code will not try to hold
> pgdat->kswapd_wait after debugobject code dropped __GFP_KSWAPD_RECLAIM.
>
> Thus, the problem this patch will fix is a deadlock related, isn't it?

Best Regards,
Huang, Ying
Tetsuo Handa May 22, 2023, 1:57 p.m. UTC | #6
On 2023/05/16 10:44, Huang, Ying wrote:
> Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> writes:
> 
>> On 2023/05/15 16:38, Mel Gorman wrote:
>>> On Sun, May 14, 2023 at 09:28:56AM +0900, Tetsuo Handa wrote:
>>>> Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
>>>> held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
>>>> using ZONE_BOOSTED_WATERMARK flag.
>>>>
>>>> Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
>>>> __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to
>>>> remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a
>>>> risk of deadlock.
>>>
>>> kswapd_wait is a waitqueue so what is being held? It's safe for kswapd
>>> to try wake itself as the waitqueue will be active when wakeup_kswapd()
>>> is called so no wakeup occurs. If there is a deadlock, it needs a better
>>> explanation. I believe I already stated why this patch is fixing a bug
>>> but it wasn't deadlock related.
>>>
>>
>> I noticed this problem ( pgdat->kswapd_wait might be held without
>> __GFP_KSWAPD_RECLAIM ) when analyzing a different problem ( debugobject code
>> is holding pgdat->kswapd_wait due to __GFP_KSWAPD_RECLAIM ) reported at
>> https://syzkaller.appspot.com/bug?extid=fe0c72f0ccbb93786380 .
>>
>> I'm calling pgdat->kswapd_wait a lock, as lockdep uses it as a lock name.
> 
> This has confused me much before.  IIUC, the deadlock is unrelated with
> kswapd wakeup itself.  pgdat->kswapd_wait is the lock name and the lock
> in fact is the spinlock: pgdat->kswapd_wait.lock.  So the deadlock is,
> 
> pgdat->kswapd_wait.lock holders take the pi lock
> pi lock holders take the rq lock
> rq lock holders take the timer base lock
> timer base lock holders take the pgdat->kswapd_wait.lock (missing __GFP_KSWAPD_RECLAIM check)

Yes, thank you for explanation.

> 
> The above is based on analysis in
> 
> https://lore.kernel.org/all/20190107204627.GA25526@cmpxchg.org/
> https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/
> 
> Tetsuo's patch avoids to take pgdat->kswapd_wait.lock when timer base
> lock is held via adding check for __GFP_KSWAPD_RECLAIM, so breaks the
> circular dependency chain.

Yes. Mel, any questions on this patch?

Thomas Gleixner described this lock as kswapd_wait::lock at
https://lkml.kernel.org/r/168476016890.404.6911447269153588182.tip-bot2@tip-bot2 .
Should I resubmit this patch with s/pgdat->kswapd_wait/pgdat->kswapd_wait.lock/ or
s/pgdat->kswapd_wait/kswapd_wait::lock/ ?

> 
>> The latter was explained at
>> https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/
>> and we agreed that debugobject code needs to drop __GFP_KSWAPD_RECLAIM at
>> https://lore.kernel.org/linux-mm/87fs809fwk.ffs@tglx/ .
>>
>> This patch is for making sure that debugobject code will not try to hold
>> pgdat->kswapd_wait after debugobject code dropped __GFP_KSWAPD_RECLAIM.
>>
>> Thus, the problem this patch will fix is a deadlock related, isn't it?
> 
> Best Regards,
> Huang, Ying
Mel Gorman May 22, 2023, 2:58 p.m. UTC | #7
On Mon, May 22, 2023 at 10:57:03PM +0900, Tetsuo Handa wrote:
> On 2023/05/16 10:44, Huang, Ying wrote:
> > Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> writes:
> > 
> 
> > 
> > The above is based on analysis in
> > 
> > https://lore.kernel.org/all/20190107204627.GA25526@cmpxchg.org/
> > https://lore.kernel.org/linux-mm/d642e597-cf7d-b410-16ce-22dff483fd8e@I-love.SAKURA.ne.jp/
> > 
> > Tetsuo's patch avoids to take pgdat->kswapd_wait.lock when timer base
> > lock is held via adding check for __GFP_KSWAPD_RECLAIM, so breaks the
> > circular dependency chain.
> 
> Yes. Mel, any questions on this patch?
> 

No, I do not, the deadlock issue is more clear now.

> Thomas Gleixner described this lock as kswapd_wait::lock at
> https://lkml.kernel.org/r/168476016890.404.6911447269153588182.tip-bot2@tip-bot2 .
> Should I resubmit this patch with s/pgdat->kswapd_wait/pgdat->kswapd_wait.lock/ or
> s/pgdat->kswapd_wait/kswapd_wait::lock/ ?
> 

I don't think a revision of what's in Andrew's tree is necessary. It could
be improved by outlining the exact deadlock similar based on this thread
but it's somewhat overkill as the fix has other obvious justifications on
its own.

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks.
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..ecad680cec53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3052,7 +3052,8 @@  struct page *rmqueue(struct zone *preferred_zone,
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
-	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
+	if ((alloc_flags & ALLOC_KSWAPD) &&
+	    unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 	}