diff mbox series

[1/2] mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

Message ID 20240418022000.3524229-2-linmiaohe@huawei.com (mailing list archive)
State New
Series mm/hugetlb: a few fixup patches for hugetlb

Commit Message

Miaohe Lin April 18, 2024, 2:19 a.m. UTC
When I did memory failure tests recently, below warning occurs:

DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
FS:  00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 lock_acquire+0xbe/0x2d0
 _raw_spin_lock_irqsave+0x3a/0x60
 hugepage_subpool_put_pages.part.0+0xe/0xc0
 free_huge_folio+0x253/0x3f0
 dissolve_free_huge_page+0x147/0x210
 __page_handle_poison+0x9/0x70
 memory_failure+0x4e6/0x8c0
 hard_offline_page_store+0x55/0xa0
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xbc/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
 </TASK>
Kernel panic - not syncing: kernel: panic_on_warn set ...
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 panic+0x326/0x350
 check_panic_on_warn+0x4f/0x50
 __warn+0x98/0x190
 report_bug+0x18e/0x1a0
 handle_bug+0x3d/0x70
 exc_invalid_op+0x18/0x70
 asm_exc_invalid_op+0x1a/0x20
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
 lock_acquire+0xbe/0x2d0
 _raw_spin_lock_irqsave+0x3a/0x60
 hugepage_subpool_put_pages.part.0+0xe/0xc0
 free_huge_folio+0x253/0x3f0
 dissolve_free_huge_page+0x147/0x210
 __page_handle_poison+0x9/0x70
 memory_failure+0x4e6/0x8c0
 hard_offline_page_store+0x55/0xa0
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xbc/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
 </TASK>

After git bisecting and digging into the code, I believe the root cause
is that the _deferred_list field of folio is unioned with the
_hugetlb_subpool field. In __update_and_free_hugetlb_folio(),
folio->_deferred_list is always initialized, which corrupts
folio->_hugetlb_subpool when the folio is hugetlb. Later,
free_huge_folio() uses the corrupted _hugetlb_subpool and the above
warning occurs. Fix this by initialising folio->_deferred_list only
when the folio is not hugetlb.

Fixes: b6952b6272dd ("mm: always initialise folio->_deferred_list")
CC: stable@vger.kernel.org
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Oscar Salvador April 18, 2024, 4:05 a.m. UTC | #1
On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 26ab9dfc7d63..1da9a14a5513 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  		destroy_compound_gigantic_folio(folio, huge_page_order(h));
>  		free_gigantic_folio(folio, huge_page_order(h));
>  	} else {
> -		INIT_LIST_HEAD(&folio->_deferred_list);
> +		if (!folio_test_hugetlb(folio))
> +			INIT_LIST_HEAD(&folio->_deferred_list);

Ok, it took me a bit to figure this out.

So we basically init _deferred_list when we know that
folio_put will not end up calling free_huge_folio,
because a previous call to remove_hugetlb_folio has already cleared the
bit.

Maybe Matthew thought that any folio ending up here would not end up in
free_huge_folio (which is the one fiddling with the subpool).

I mean, the fix looks good because if the hugetlb flag is cleared,
destroy_large_folio will go straight to free_the_page, but the
whole thing is a bit subtle.

And if we decide to go with this, I think we are going to need a
comment in there explaining what is going on, like "only init
_deferred_list if free_huge_folio cannot be called".
Miaohe Lin April 18, 2024, 8 a.m. UTC | #2
On 2024/4/18 12:05, Oscar Salvador wrote:
> On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 26ab9dfc7d63..1da9a14a5513 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>>  		destroy_compound_gigantic_folio(folio, huge_page_order(h));
>>  		free_gigantic_folio(folio, huge_page_order(h));
>>  	} else {
>> -		INIT_LIST_HEAD(&folio->_deferred_list);
>> +		if (!folio_test_hugetlb(folio))
>> +			INIT_LIST_HEAD(&folio->_deferred_list);
> 
> Ok, it took me a bit to figure this out.
> 
> So we basically init __deferred_list when we know that
> folio_put will not end up calling free_huge_folio
> because a previous call to remove_hugetlb_folio has already cleared the
> bit.
> 
> Maybe Matthew thought that any folio ending here would not end up in
> free_huge_folio (which is the one fiddling subpool).
> 
> I mean, fix looks good because if hugetlb flag is cleared,
> destroy_large_folio will go straight to free_the_page, but the
> whole thing is a bit subtle.

AFAICS, this is the most straightforward way to fix the issue. Do you have any suggestions
on how to fix this in a more graceful way?

> 
> And if we decide to go with this, I think we are going to need a comment
> in there explaining what is going on like "only init _deferred_list if
> free_huge_folio cannot be call".

Yes, this comment will help.
Thanks.
.

> 
>
Oscar Salvador April 18, 2024, 12:41 p.m. UTC | #3
On Thu, Apr 18, 2024 at 04:00:42PM +0800, Miaohe Lin wrote:
> On 2024/4/18 12:05, Oscar Salvador wrote:
> > On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 26ab9dfc7d63..1da9a14a5513 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >>  		destroy_compound_gigantic_folio(folio, huge_page_order(h));
> >>  		free_gigantic_folio(folio, huge_page_order(h));
> >>  	} else {
> >> -		INIT_LIST_HEAD(&folio->_deferred_list);
> >> +		if (!folio_test_hugetlb(folio))
> >> +			INIT_LIST_HEAD(&folio->_deferred_list);
> > 
> > Ok, it took me a bit to figure this out.
> > 
> > So we basically init __deferred_list when we know that
> > folio_put will not end up calling free_huge_folio
> > because a previous call to remove_hugetlb_folio has already cleared the
> > bit.
> > 
> > Maybe Matthew thought that any folio ending here would not end up in
> > free_huge_folio (which is the one fiddling subpool).
> > 
> > I mean, fix looks good because if hugetlb flag is cleared,
> > destroy_large_folio will go straight to free_the_page, but the
> > whole thing is a bit subtle.
> 
> AFAICS, this is the most straightforward way to fix the issue. Do you have any suggestions
> on how to fix this in a more graceful way?

Not off the top of my head.
Anyway, I have been thinking for a while that this code needs some love,
so I will check how this can be untangled.
Miaohe Lin April 19, 2024, 2 a.m. UTC | #4
On 2024/4/18 20:41, Oscar Salvador wrote:
> On Thu, Apr 18, 2024 at 04:00:42PM +0800, Miaohe Lin wrote:
>> On 2024/4/18 12:05, Oscar Salvador wrote:
>>> On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 26ab9dfc7d63..1da9a14a5513 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>>>>  		destroy_compound_gigantic_folio(folio, huge_page_order(h));
>>>>  		free_gigantic_folio(folio, huge_page_order(h));
>>>>  	} else {
>>>> -		INIT_LIST_HEAD(&folio->_deferred_list);
>>>> +		if (!folio_test_hugetlb(folio))
>>>> +			INIT_LIST_HEAD(&folio->_deferred_list);
>>>
>>> Ok, it took me a bit to figure this out.
>>>
>>> So we basically init __deferred_list when we know that
>>> folio_put will not end up calling free_huge_folio
>>> because a previous call to remove_hugetlb_folio has already cleared the
>>> bit.
>>>
>>> Maybe Matthew thought that any folio ending here would not end up in
>>> free_huge_folio (which is the one fiddling subpool).
>>>
>>> I mean, fix looks good because if hugetlb flag is cleared,
>>> destroy_large_folio will go straight to free_the_page, but the
>>> whole thing is a bit subtle.
>>
>> AFAICS, this is the most straightforward way to fix the issue. Do you have any suggestions
>> on how to fix this in a more graceful way?
> 
> Not from the top of my head.
> Anyway, I have been thinking for a while that this code needs some love,
> so I will check how this can be untangled.

That would be really nice. Thanks Oscar.
.

> 
>

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 26ab9dfc7d63..1da9a14a5513 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1788,7 +1788,8 @@  static void __update_and_free_hugetlb_folio(struct hstate *h,
 		destroy_compound_gigantic_folio(folio, huge_page_order(h));
 		free_gigantic_folio(folio, huge_page_order(h));
 	} else {
-		INIT_LIST_HEAD(&folio->_deferred_list);
+		if (!folio_test_hugetlb(folio))
+			INIT_LIST_HEAD(&folio->_deferred_list);
 		folio_put(folio);
 	}
 }