[v9,00/12] AMD broadcast TLB invalidation

Message ID: 20250206044346.3810242-1-riel@surriel.com
Message

Rik van Riel Feb. 6, 2025, 4:43 a.m. UTC
Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.
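
The heart of the series is a pair of tiny wrappers around the new
instructions. As a rough sketch (not the exact patch code; the
instructions are emitted as raw bytes so older toolchains can
assemble them, and the real helpers carry more flags than shown):

static inline void __invlpgb(unsigned long asid, unsigned long pcid,
			     unsigned long addr, u16 nr_pages,
			     bool pmd_stride, u8 flags)
{
	u32 edx = (pcid << 16) | asid;
	u32 ecx = (pmd_stride << 31) | nr_pages;
	u64 rax = addr | flags;

	/* INVLPGB: broadcast a TLB invalidation to all CPUs (0f 01 fe). */
	asm volatile(".byte 0x0f, 0x01, 0xfe"
		     : : "a" (rax), "c" (ecx), "d" (edx));
}

static inline void __tlbsync(void)
{
	/* TLBSYNC: wait for this CPU's INVLPGBs to complete (0f 01 ff). */
	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
}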

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being raised gradually as the PCID space gets closer to exhaustion.
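
As an illustration of that policy (names and numbers hypothetical;
the exact heuristic in the patches differs):

/*
 * Hypothetical sketch: require more simultaneously active CPUs per
 * process before handing out a global ASID as free ASIDs run low.
 */
static int broadcast_cpu_threshold(long free_asids, long total_asids)
{
	int threshold = 3;

	/* Raise the bar each time half the remaining space is used up. */
	while (total_asids > 1 && free_asids < total_asids / 2) {
		threshold *= 2;
		total_asids /= 2;
	}

	return threshold;
}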

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that
while TLB invalidation went down from 40% of total
CPU time to only around 4%, the contention simply
moved to the LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second for this case.

Some numbers closer to real-world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v9:
 - print warning when start or end address was rounded (Peter)
 - in the reclaim code, tlbsync at context switch time (Peter)
 - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
 - round start & end to handle non-page-aligned callers (Steven & Jan)
 - fix up changelog & add tested-by tags (Manali)
v7:
 - a few small code cleanups (Nadav)
 - fix spurious VM_WARN_ON_ONCE in mm_global_asid
 - code simplifications & better barriers (Peter & Dave)
v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32-bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.

Comments

Oleksandr Natalenko Feb. 6, 2025, 10:16 a.m. UTC | #1
Hello.

On Thursday, 6 February 2025 at 5:43:19 Central European Standard Time, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> 
> [...]
> 
> v9:
>  - print warning when start or end address was rounded (Peter)

OK, I've just hit one:

TLB flush not stride 200000 aligned. Start 7fffc0000000, end 7fffffe01000
WARNING: CPU: 31 PID: 411 at arch/x86/mm/tlb.c:1342 flush_tlb_mm_range+0x57b/0x600
Modules linked in:
CPU: 31 UID: 0 PID: 411 Comm: modprobe Not tainted 6.13.0-pf3 #1 1366679ca06f46d05d1e9d9c537b0c6b4c922b82
Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 4902 08/29/2024
RIP: 0010:flush_tlb_mm_range+0x57b/0x600
Code: 5f e9 39 b3 3f 00 e8 24 57 f5 ff e9 e9 fc ff ff 48 8b 0c 24 4c 89 e2 48 c7 c7 78 59 27 b0 c6 05 3d 1a 31 02 01 e8 85 e4 01 00 <0f> 0b e9 35 fb ff ff fa 0f 1f 44 00 00 48 89 df e8 a0 f4 ff ff fb
RSP: 0018:ffffc137c11e7a38 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9e6eaf1b5d80 RCX: 00000000ffffdfff
RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
RBP: ffff9e500244d800 R08: 00000000ffffdfff R09: ffff9e6eae1fffa8
R10: 00000000ffffdfff R11: 0000000000000003 R12: 00007fffc0000000
R13: 000000000000001f R14: 0000000000000015 R15: ffff9e6eaf180000
FS:  0000000000000000(0000) GS:ffff9e6eaf180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000109966000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 tlb_flush_mmu+0x125/0x1a0
 tlb_finish_mmu+0x41/0x80
 relocate_vma_down+0x183/0x200
 setup_arg_pages+0x201/0x390
 load_elf_binary+0x3a7/0x17d0
 bprm_execve+0x244/0x630
 kernel_execve+0x180/0x1f0
 call_usermodehelper_exec_async+0xd0/0x190
 ret_from_fork+0x34/0x50
 ret_from_fork_asm+0x1a/0x30
 </TASK>

What do I do with it?

Thank you.

Rik van Riel Feb. 6, 2025, 2:16 p.m. UTC | #2
On Thu, 2025-02-06 at 11:16 +0100, Oleksandr Natalenko wrote:
> Hello.
> 
> On Thursday, 6 February 2025 at 5:43:19 Central European Standard Time, Rik van Riel wrote:
> > 
> > v9:
> >  - print warning when start or end address was rounded (Peter)
> 
> OK, I've just hit one:
> 
> TLB flush not stride 200000 aligned. Start 7fffc0000000, end 7fffffe01000

Beautiful, the caller wants to zap 2MB pages, but
the end address is 4kB aligned.
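
(Indeed, 0x7fffc0000000 & 0x1fffff == 0, but 0x7fffffe01000 & 0x1fffff
== 0x1000: the end is exactly one 4kB page past a 2MB boundary.)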

> [...]
> 
> What do I do with it?

Reporting it is the right thing. Let me dig into what
setup_arg_pages and relocate_vma_down are doing to
come up with a 2MB page size area where the end is
not 2MB aligned.

Reading through the relocate_vma_down code, and the
free_pgd/p4d/pud/pmd_range code, it looks like that
code always adds PAGE_SIZE to the address being zapped,
even when zapping things at a larger granularity.

On the flip side, the code in relocate_vma_down and
free_pgd_range correctly sets the TLB page size to
the 4kB PAGE_SIZE.
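
That fixed PAGE_SIZE comes from the generic page table freeing hooks;
paraphrasing asm-generic/tlb.h from memory:

/*
 * Freeing a PMD clears a PUD entry, but only PAGE_SIZE is added
 * to the gathered flush range.
 */
#define pmd_free_tlb(tlb, pmdp, address)			\
	do {							\
		tlb_flush_pud_range(tlb, address, PAGE_SIZE);	\
		tlb->freed_tables = 1;				\
		__pmd_free_tlb(tlb, pmdp, address);		\
	} while (0)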

It looks like setting the stride_shift to something
larger is done transparently by the x86 tlb_flush()
implementation, specifically by tlb_get_unmap_shift(),
which looks at which page table level got freed to
determine what stride shift to use.
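
That helper looks roughly like this (again paraphrasing
asm-generic/tlb.h):

static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
{
	/* The lowest page table level cleared determines the stride. */
	if (tlb->cleared_ptes)
		return PAGE_SHIFT;
	if (tlb->cleared_pmds)
		return PMD_SHIFT;
	if (tlb->cleared_puds)
		return PUD_SHIFT;
	if (tlb->cleared_p4ds)
		return P4D_SHIFT;

	return PAGE_SHIFT;
}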

This can result in flush_tlb_mm_range being called
with a stride_shift for 2MB pages, but a range ending
on a 4kB aligned (not 2MB aligned) boundary.

Peter, how should we solve this one?

Should tlb_flush() round the start & end addresses
to match the found stride_shift?
Peter Zijlstra Feb. 6, 2025, 2:23 p.m. UTC | #3
On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:

> This can result in flush_tlb_mm_range being called
> with a stride_shift for 2MB pages, but a range ending
> on a 4kB aligned (not 2MB aligned) boundary.
> 
> Peter, how should we solve this one?

I don't think that's wrong per se, since all we really need is for end
to be past the end of the range; one byte, one page, or one stride past
doesn't matter much.

Anyway, I'm in desperate need of a break, so I'm not quite sure what the
best way forward is.
Rik van Riel Feb. 6, 2025, 2:48 p.m. UTC | #4
On Thu, 2025-02-06 at 15:23 +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
> 
> > This can result in flush_tlb_mm_range being called
> > with a stride_shift for 2MB pages, but a range ending
> > on a 4kB aligned (not 2MB aligned) boundary.
> > 
> > Peter, how should we solve this one?
> 
> I don't think that's wrong per se, since all we really need is for end
> to be past the end of the range; one byte, one page, or one stride past
> doesn't matter much.
> 
> Anyway, I'm in desperate need of a break, so I'm not quite sure what the
> best way forward is.
> 
Given that the tlb_flush() code is used only for
page table freeing, we can just round up the
end address to the nearest stride boundary
there, with a comment explaining why?
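
Roughly like this (sketch only; the real tlb_flush also has to handle
the fullmm and need_flush_all cases):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	unsigned int stride_shift = tlb_get_unmap_shift(tlb);
	unsigned long start = tlb->start;
	/*
	 * Page table freeing grows the range in PAGE_SIZE units,
	 * so the end may be 4kB aligned for a 2MB stride.
	 */
	unsigned long end = round_up(tlb->end, 1UL << stride_shift);

	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
			   tlb->freed_tables);
}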

Alternatively, we could just change the page
sizes used in pmd_free_tlb, pud_free_tlb,
and p4d_free_tlb, given that the functions
called now have parameters they didn't seem
to have back in 2014, when the linked email
in the comment was written?

Either way, no big hurry. The rounding up
is already being done in get_flush_tlb_info,
so all we have is a noisy WARN_ONCE :)
Peter Zijlstra Feb. 7, 2025, 8:16 a.m. UTC | #5
On Thu, Feb 06, 2025 at 09:48:25AM -0500, Rik van Riel wrote:
> On Thu, 2025-02-06 at 15:23 +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
> > 
> > > This can result in flush_tlb_mm_range being called
> > > with a stride_shift for 2MB pages, but a range ending
> > > on a 4kB aligned (not 2MB aligned) boundary.
> > > 
> > > Peter, how should we solve this one?
> > 
> > I don't think that's wrong per se, since all we really need is for end
> > to be past the end of the range; one byte, one page, or one stride past
> > doesn't matter much.
> > 
> > Anyway, I'm in desperate need of a break, so I'm not quite sure what the
> > best way forward is.
> > 
> Given that the tlb_flush() code is used only for
> page table freeing, 

mmu_gather is used for both page and page-table freeing.

> we can just round up the
> end address to the nearest stride boundary
> there, with a comment explaining why?

Well, why are we rounding at all? I don't think I've seen an explanation
for that anywhere yet.

What made you do this?

> Alternatively, we could just change the page
> sizes used in pmd_free_tlb, pud_free_tlb,
> and p4d_free_tlb, given that the functions
> called now have parameters they didn't seem
> to have back in 2014, when the linked email
> in the comment was written?

Well, it really shouldn't matter.

Notably, stride is simply the smallest size encountered during the
gather; not something that can be determined a priori.

Like I said, end is exclusive, so it doesn't really matter how much
further it is, and PAGE_SIZE is a natural enough step, seeing it is the
smallest granularity.

So start and end not being page aligned is a definite fail.
And start should be stride aligned.

But other than that, I don't see it matters much.
Rik van Riel Feb. 7, 2025, 5:46 p.m. UTC | #6
On Fri, 2025-02-07 at 09:16 +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 09:48:25AM -0500, Rik van Riel wrote:
> 
> 
> > we can just round up the
> > end address to the nearest stride boundary
> > there, with a comment explaining why?
> 
> Well, why are we rounding at all? I don't think I've seen an
> explanation for that anywhere yet.
> 
> What made you do this?

The only real reason is to not end up with a 0
value for nr in these loops:

	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		invlpgb_flush_addr_nosync(addr, nr);
	}

Which makes me think we could just insert a
"nr = max(nr, 1)" in there, and round up
partial pages that way?
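
That is, something like this (sketch; min/max typing glossed over):

	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		/* Round a partial final page up to one full page. */
		nr = max(nr, 1UL);
		invlpgb_flush_addr_nosync(addr, nr);
	}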

Looking further, I already did that on the
userspace flushing side. Let me do the same
thing on the kernel side, and remove the
other code related to partial pages being
passed in.
Brendan Jackman Feb. 7, 2025, 6:23 p.m. UTC | #7
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

I applied this to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm, to pick up
fixes"), booted it and ran some selftests on a Milan and a Genoa machine, no
issues, although I didn't have CONFIG_DEBUG_VM or anything like that. I did see
the warning reported by Oleksandr [0] on QEMU.

I tried to test Turin but there's some unrelated brokenness. Due to other
unrelated issues I also couldn't test with lockdep.

Anyway I guess that counts as:

Tested-by: Brendan Jackman <jackmanb@google.com>

I couldn't get any performance data either, more unrelated brokenness. Maybe I
can try again next week.

I posted some bikeshed review comments but I haven't grokked the interesting
bits yet, will try to do that next week too.

[0] https://lore.kernel.org/linux-kernel/12602226.O9o76ZdvQC@natalenko.name/

Cheers,
Brendan