diff mbox series

[RFC] mm/thp: check and bail out if page in deferred queue already

Message ID 20221223135207.2275317-1-fengwei.yin@intel.com (mailing list archive)
State New
Headers show
Series [RFC] mm/thp: check and bail out if page in deferred queue already | expand

Commit Message

Yin, Fengwei Dec. 23, 2022, 1:52 p.m. UTC
Kernel build regression with LLVM was reported here:
https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
boundaries"). And the commit f35b5d7d676e was reverted.

It turned out the regression is related with madvise(MADV_DONTNEED)
was used by ld.lld. But with none PMD_SIZE aligned parameter len.
trace-bpfcc captured:
531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4

If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
trigger split_queue_lock contention raised significantly. perf showed
following data:
    14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
       entry_SYSCALL_64_after_hwframe
           11.52%
                entry_SYSCALL_64_after_hwframe
                do_syscall_64
                __x64_sys_madvise
                do_madvise.part.0
                zap_page_range
                unmap_single_vma
                unmap_page_range
                page_remove_rmap
                deferred_split_huge_page
                __lock_text_start
                native_queued_spin_lock_slowpath

If THP can't be removed from rmap as whole THP, partial THP will be
removed from rmap by removing sub-pages from rmap. Even the THP
head page is added to deferred queue already, the split_queue_lock
will be acquired and check whether the THP head page is in the queue
already. Thus, the contention of split_queue_lock is raised.

Before acquire split_queue_lock, check and bail out early if the THP
head page is in the queue already. The checking without holding
split_queue_lock could race with deferred_split_scan, but it doesn't
impact the correctness here.

Test result of building kernel with ld.lld:
commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:07.99 real,   26367.77 user,  5063.35 sys

commit f35b5d7d676e:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        7:22.15 real,   26235.03 user,  12504.55 sys

commit f35b5d7d676e with the fixing patch:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:08.49 real,   26520.15 user,  5047.91 sys

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
My first thought was to change the per node deferred queue to per cpu.
It's complicated and may be overkill.

For the race without lock acquired, I didn't see obvious issue here. But I
could miss something here. Let me know if I did. Thanks.


 mm/huge_memory.c | 3 +++
 1 file changed, 3 insertions(+)


base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be

Comments

Nathan Chancellor Dec. 27, 2022, 8:15 p.m. UTC | #1
On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
> 
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
> 
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>        entry_SYSCALL_64_after_hwframe
>            11.52%
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 __x64_sys_madvise
>                 do_madvise.part.0
>                 zap_page_range
>                 unmap_single_vma
>                 unmap_page_range
>                 page_remove_rmap
>                 deferred_split_huge_page
>                 __lock_text_start
>                 native_queued_spin_lock_slowpath
> 
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
> 
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
> 
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:07.99 real,   26367.77 user,  5063.35 sys
> 
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         7:22.15 real,   26235.03 user,  12504.55 sys
> 
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:08.49 real,   26520.15 user,  5047.91 sys
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

I cannot say whether or not this is a good idea or not but it does
resolve the regression I reported:

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
    Time (mean ± σ):     383.003 s ±  0.680 s    [User: 34737.850 s, System: 7287.079 s]
    Range (min … max):   382.218 s … 383.413 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
    Time (mean ± σ):     437.886 s ±  1.030 s    [User: 35888.658 s, System: 14048.871 s]
    Range (min … max):   436.865 s … 438.924 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
    Time (mean ± σ):     384.371 s ±  1.004 s    [User: 35402.880 s, System: 6401.691 s]
    Range (min … max):   383.547 s … 385.489 s    3 runs

Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
> 
> For the race without lock acquired, I didn't see obvious issue here. But I
> could miss something here. Let me know if I did. Thanks.
> 
> 
>  mm/huge_memory.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
>  	if (PageSwapCache(page))
>  		return;
>  
> +	if (!list_empty(page_deferred_list(page)))
> +		return;
> +
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	if (list_empty(page_deferred_list(page))) {
>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> 
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
> -- 
> 2.34.1
> 
> 
>
David Rientjes Dec. 28, 2022, 1:24 a.m. UTC | #2
On Fri, 23 Dec 2022, Yin Fengwei wrote:

> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
> 
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
> 
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>        entry_SYSCALL_64_after_hwframe
>            11.52%
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 __x64_sys_madvise
>                 do_madvise.part.0
>                 zap_page_range
>                 unmap_single_vma
>                 unmap_page_range
>                 page_remove_rmap
>                 deferred_split_huge_page
>                 __lock_text_start
>                 native_queued_spin_lock_slowpath
> 
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
> 
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
> 
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:07.99 real,   26367.77 user,  5063.35 sys
> 
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         7:22.15 real,   26235.03 user,  12504.55 sys
> 
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:08.49 real,   26520.15 user,  5047.91 sys
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

Acked-by: David Rientjes <rientjes@google.com>
Huang, Ying Dec. 28, 2022, 2:06 a.m. UTC | #3
Yin Fengwei <fengwei.yin@intel.com> writes:

> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
>
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>        entry_SYSCALL_64_after_hwframe
>            11.52%
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 __x64_sys_madvise
>                 do_madvise.part.0
>                 zap_page_range
>                 unmap_single_vma
>                 unmap_page_range
>                 page_remove_rmap
>                 deferred_split_huge_page
>                 __lock_text_start
>                 native_queued_spin_lock_slowpath
>
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
>
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
>
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:07.99 real,   26367.77 user,  5063.35 sys
>
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         7:22.15 real,   26235.03 user,  12504.55 sys
>
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:08.49 real,   26520.15 user,  5047.91 sys
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

Thank you for fixing.

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
>
> For the race without lock acquired, I didn't see obvious issue here. But I
> could miss something here. Let me know if I did. Thanks.
>
>
>  mm/huge_memory.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
>  	if (PageSwapCache(page))
>  		return;
>  
> +	if (!list_empty(page_deferred_list(page)))
> +		return;
> +
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	if (list_empty(page_deferred_list(page))) {
>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
Andrew Morton Dec. 28, 2022, 11:33 p.m. UTC | #4
On Fri, 23 Dec 2022 21:52:07 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
> 
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld.

Are we able to identify a Fixes: target for this?  Thanks.
Yin, Fengwei Dec. 29, 2022, 1:14 a.m. UTC | #5
On 12/28/2022 4:15 AM, Nathan Chancellor wrote:
> On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>>        entry_SYSCALL_64_after_hwframe
>>            11.52%
>>                 entry_SYSCALL_64_after_hwframe
>>                 do_syscall_64
>>                 __x64_sys_madvise
>>                 do_madvise.part.0
>>                 zap_page_range
>>                 unmap_single_vma
>>                 unmap_page_range
>>                 page_remove_rmap
>>                 deferred_split_huge_page
>>                 __lock_text_start
>>                 native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:07.99 real,   26367.77 user,  5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         7:22.15 real,   26235.03 user,  12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:08.49 real,   26520.15 user,  5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> 
> I cannot say whether or not this is a good idea or not but it does
> resolve the regression I reported:
> 
>   Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
>     Time (mean ± σ):     383.003 s ±  0.680 s    [User: 34737.850 s, System: 7287.079 s]
>     Range (min … max):   382.218 s … 383.413 s    3 runs
> 
>   Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
>     Time (mean ± σ):     437.886 s ±  1.030 s    [User: 35888.658 s, System: 14048.871 s]
>     Range (min … max):   436.865 s … 438.924 s    3 runs
> 
>   Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
>     Time (mean ± σ):     384.371 s ±  1.004 s    [User: 35402.880 s, System: 6401.691 s]
>     Range (min … max):   383.547 s … 385.489 s    3 runs
> 
> Tested-by: Nathan Chancellor <nathan@kernel.org>
Thanks for testing the patch.

Regards
Yin, Fengwei
Yin, Fengwei Dec. 29, 2022, 1:15 a.m. UTC | #6
On 12/28/2022 9:24 AM, David Rientjes wrote:
> On Fri, 23 Dec 2022, Yin Fengwei wrote:
> 
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>>        entry_SYSCALL_64_after_hwframe
>>            11.52%
>>                 entry_SYSCALL_64_after_hwframe
>>                 do_syscall_64
>>                 __x64_sys_madvise
>>                 do_madvise.part.0
>>                 zap_page_range
>>                 unmap_single_vma
>>                 unmap_page_range
>>                 page_remove_rmap
>>                 deferred_split_huge_page
>>                 __lock_text_start
>>                 native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:07.99 real,   26367.77 user,  5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         7:22.15 real,   26235.03 user,  12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:08.49 real,   26520.15 user,  5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> 
> Acked-by: David Rientjes <rientjes@google.com>

Thanks for reviewing the patch.

Regards
Yin, Fengwei
Yin, Fengwei Dec. 29, 2022, 1:15 a.m. UTC | #7
On 12/28/2022 10:06 AM, Huang, Ying wrote:
> Yin Fengwei <fengwei.yin@intel.com> writes:
> 
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>>        entry_SYSCALL_64_after_hwframe
>>            11.52%
>>                 entry_SYSCALL_64_after_hwframe
>>                 do_syscall_64
>>                 __x64_sys_madvise
>>                 do_madvise.part.0
>>                 zap_page_range
>>                 unmap_single_vma
>>                 unmap_page_range
>>                 page_remove_rmap
>>                 deferred_split_huge_page
>>                 __lock_text_start
>>                 native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:07.99 real,   26367.77 user,  5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         7:22.15 real,   26235.03 user,  12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:08.49 real,   26520.15 user,  5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> 
> Thank you for fixing.
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Thanks for reviewing the patch.

Regards
Yin, Fengwei
Yin, Fengwei Dec. 29, 2022, 1:29 a.m. UTC | #8
On 12/29/2022 7:33 AM, Andrew Morton wrote:
> On Fri, 23 Dec 2022 21:52:07 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> 
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld.
> 
> Are we able to identify a Fixes: target for this?  Thanks.
I am not sure. The commit f35b5d7d676e didn't introduce this issue. It
just exposed this issue and it's reverted already. The partial THP
behavior was like this from the first day and not a regression.


Regards
Yin, Fengwei
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..7cde9f702e63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2837,6 +2837,9 @@  void deferred_split_huge_page(struct page *page)
 	if (PageSwapCache(page))
 		return;
 
+	if (!list_empty(page_deferred_list(page)))
+		return;
+
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);