Message ID: 20250206044346.3810242-1-riel@surriel.com
Series: AMD broadcast TLB invalidation
Hello.

On Thursday, February 6, 2025 at 5:43:19 CET, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
>
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
>
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
>
> - vanilla kernel:          527k loops/second
> - lru_add_drain removal:   731k loops/second
> - only INVLPGB:            527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
>
> Profiling with only the INVLPGB changes showed that while TLB
> invalidation went down from 40% of the total CPU time to only
> around 4%, the contention simply moved to the LRU lock.
>
> Fixing both at the same time roughly doubles the number of
> iterations per second in this case.
>
> Some numbers closer to real-world performance can be found at
> Phoronix, thanks to Michael:
>
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
>
> My current plan is to implement support for Intel's RAR
> (Remote Action Request) TLB flushing in a follow-up series,
> after this one has been merged into -tip. Making things any
> larger would just be unwieldy for reviewers.
>
> v9:
> - print warning when start or end address was rounded (Peter)

OK, I've just hit one:

TLB flush not stride 200000 aligned. Start 7fffc0000000, end 7fffffe01000
WARNING: CPU: 31 PID: 411 at arch/x86/mm/tlb.c:1342 flush_tlb_mm_range+0x57b/0x600
Modules linked in:
CPU: 31 UID: 0 PID: 411 Comm: modprobe Not tainted 6.13.0-pf3 #1 1366679ca06f46d05d1e9d9c537b0c6b4c922b82
Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 4902 08/29/2024
RIP: 0010:flush_tlb_mm_range+0x57b/0x600
Code: 5f e9 39 b3 3f 00 e8 24 57 f5 ff e9 e9 fc ff ff 48 8b 0c 24 4c 89 e2 48 c7 c7 78 59 27 b0 c6 05 3d 1a 31 02 01 e8 85 e4 01 00 <0f> 0b e9 35 fb ff ff fa 0f 1f 44 00 00 48 89 df e8 a0 f4 ff ff fb
RSP: 0018:ffffc137c11e7a38 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9e6eaf1b5d80 RCX: 00000000ffffdfff
RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
RBP: ffff9e500244d800 R08: 00000000ffffdfff R09: ffff9e6eae1fffa8
R10: 00000000ffffdfff R11: 0000000000000003 R12: 00007fffc0000000
R13: 000000000000001f R14: 0000000000000015 R15: ffff9e6eaf180000
FS:  0000000000000000(0000) GS:ffff9e6eaf180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000109966000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 tlb_flush_mmu+0x125/0x1a0
 tlb_finish_mmu+0x41/0x80
 relocate_vma_down+0x183/0x200
 setup_arg_pages+0x201/0x390
 load_elf_binary+0x3a7/0x17d0
 bprm_execve+0x244/0x630
 kernel_execve+0x180/0x1f0
 call_usermodehelper_exec_async+0xd0/0x190
 ret_from_fork+0x34/0x50
 ret_from_fork_asm+0x1a/0x30
 </TASK>

What do I do with it?

Thank you.
> - in the reclaim code, tlbsync at context switch time (Peter)
> - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
>
> v8:
> - round start & end to handle non-page-aligned callers (Steven & Jan)
> - fix up changelog & add tested-by tags (Manali)
>
> v7:
> - a few small code cleanups (Nadav)
> - fix spurious VM_WARN_ON_ONCE in mm_global_asid
> - code simplifications & better barriers (Peter & Dave)
>
> v6:
> - fix info->end check in flush_tlb_kernel_range (Michael)
> - disable broadcast TLB flushing on 32 bit x86
>
> v5:
> - use byte assembly for compatibility with older toolchains (Borislav, Michael)
> - ensure a panic on an invalid number of extra pages (Dave, Tom)
> - add cant_migrate() assertion to tlbsync (Jann)
> - a bunch more cleanups (Nadav)
> - key TCE enabling off X86_FEATURE_TCE (Andrew)
> - fix a race between reclaim and ASID transition (Jann)
>
> v4:
> - Use only bitmaps to track free global ASIDs (Nadav)
> - Improved AMD initialization (Borislav & Tom)
> - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
> - Fixes for subtle race conditions (Jann)
>
> v3:
> - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
> - More suggested cleanups and changelog fixes by Peter and Nadav
>
> v2:
> - Apply suggestions by Peter and Borislav (thank you!)
> - Fix bug in arch_tlbbatch_flush, where we need to do both the
>   TLBSYNC, and flush the CPUs that are in the cpumask.
> - Some updates to comments and changelogs based on questions.
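For context, the v9 warning Oleksandr hit boils down to an alignment
check on the flush range against the chosen stride. A minimal
standalone sketch, fed the values from the report above (names and
structure are simplified approximations, not the kernel's code):

	#include <stdio.h>

	int main(void)
	{
		unsigned long start  = 0x7fffc0000000UL; /* 2MB aligned */
		unsigned long end    = 0x7fffffe01000UL; /* only 4kB aligned */
		unsigned long stride = 1UL << 21;        /* 2MB pages */

		/* fire when either endpoint is not stride aligned */
		if ((start & (stride - 1)) || (end & (stride - 1)))
			printf("TLB flush not stride %lx aligned. Start %lx, end %lx\n",
			       stride, start, end);
		return 0;
	}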
On Thu, 2025-02-06 at 11:16 +0100, Oleksandr Natalenko wrote:
> Hello.
>
> On Thursday, February 6, 2025 at 5:43:19 CET, Rik van Riel wrote:
> >
> > v9:
> > - print warning when start or end address was rounded (Peter)
>
> OK, I've just hit one:
>
> TLB flush not stride 200000 aligned. Start 7fffc0000000, end 7fffffe01000

Beautiful, the caller wants to zap 2MB pages, but the end address is
4kB aligned.

> WARNING: CPU: 31 PID: 411 at arch/x86/mm/tlb.c:1342 flush_tlb_mm_range+0x57b/0x600
> [ ... ]
> Call Trace:
>  <TASK>
>  tlb_flush_mmu+0x125/0x1a0
>  tlb_finish_mmu+0x41/0x80
>  relocate_vma_down+0x183/0x200
>  setup_arg_pages+0x201/0x390
>  [ ... ]
>  </TASK>
>
> What do I do with it?

Reporting it is the right thing. Let me dig into what setup_arg_pages
and relocate_vma_down are doing to come up with a 2MB page size area
where the end is not 2MB aligned.

Reading through the relocate_vma_down code, and the
free_pgd/p4d/pud/pmd_range code, it looks like that code always adds
PAGE_SIZE to the address being zapped, even when zapping things at a
larger granularity.

On the flip side, the code in relocate_vma_down and free_pgd_range
correctly sets the TLB page size to the 4kB PAGE_SIZE.

It looks like setting the stride_shift to something larger is done
transparently by the x86 tlb_flush() implementation, specifically by
tlb_get_unmap_shift(), which looks at which page table level got
freed to determine what stride shift to use.

This can result in flush_tlb_mm_range being called with a
stride_shift for 2MB pages, but a range ending on a 4kB aligned (not
2MB aligned) boundary.

Peter, how should we solve this one? Should tlb_flush() round the
start & end addresses to match the found stride_shift?
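The derivation Rik describes looks roughly like the following
standalone sketch of the tlb_get_unmap_shift() idea: the mmu_gather
records which page-table levels were cleared, and the stride comes
from the smallest level seen. The real struct mmu_gather and function
live in the kernel; the fields here are simplified stand-ins
reconstructed from memory, not a verbatim copy:

	#include <stdio.h>
	#include <stdbool.h>

	#define PAGE_SHIFT 12
	#define PMD_SHIFT  21
	#define PUD_SHIFT  30

	struct mmu_gather_sketch {
		bool cleared_ptes;
		bool cleared_pmds;
		bool cleared_puds;
	};

	static unsigned int get_unmap_shift(const struct mmu_gather_sketch *tlb)
	{
		/* the smallest cleared level determines the flush stride */
		if (tlb->cleared_ptes)
			return PAGE_SHIFT;
		if (tlb->cleared_pmds)
			return PMD_SHIFT;
		if (tlb->cleared_puds)
			return PUD_SHIFT;
		return PAGE_SHIFT;
	}

	int main(void)
	{
		/* only PMD-level tables were freed during this gather ... */
		struct mmu_gather_sketch tlb = { .cleared_pmds = true };

		/* ... so the flush uses a 2MB stride, even though the end
		 * address was only ever advanced in PAGE_SIZE steps */
		printf("stride_shift = %u\n", get_unmap_shift(&tlb));
		return 0;
	}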
On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:

> This can result in flush_tlb_mm_range being called with a
> stride_shift for 2MB pages, but a range ending on a 4kB aligned (not
> 2MB aligned) boundary.
>
> Peter, how should we solve this one?

I don't think that's wrong per se, since all we really need is for
end to be past the end; one byte, one page, or one stride don't
matter much.

Anyway, I'm in desperate need of a break, so I'm not quite sure what
the best way forward is.
On Thu, 2025-02-06 at 15:23 +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
>
> > This can result in flush_tlb_mm_range being called with a
> > stride_shift for 2MB pages, but a range ending on a 4kB aligned
> > (not 2MB aligned) boundary.
> >
> > Peter, how should we solve this one?
>
> I don't think that's wrong per se, since all we really need is for
> end to be past the end; one byte, one page, or one stride don't
> matter much.
>
> Anyway, I'm in desperate need of a break, so I'm not quite sure what
> the best way forward is.

Given that the tlb_flush() code is used only for page table freeing,
we can just round up the end address to the nearest stride boundary
there, with a comment explaining why?

Alternatively, we could just change the page sizes used in
pmd_free_tlb, pud_free_tlb, and p4d_free_tlb, given that the
functions called now have parameters they didn't seem to have back in
2014, when the linked email in the comment was written?

Either way, no big hurry. The rounding up is already being done in
get_flush_tlb_info, so all we have is a noisy WARN_ONCE :)
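The first option amounts to rounding the range out to stride
boundaries before the flush is issued. A minimal standalone sketch of
that arithmetic, assuming the 2MB stride and the addresses from the
report (the kernel would use its own round_up()/round_down() helpers):

	#include <stdio.h>

	/* round x down/up to a power-of-two alignment a */
	static unsigned long round_down_to(unsigned long x, unsigned long a)
	{
		return x & ~(a - 1);
	}

	static unsigned long round_up_to(unsigned long x, unsigned long a)
	{
		return (x + a - 1) & ~(a - 1);
	}

	int main(void)
	{
		unsigned long stride = 1UL << 21;        /* 2MB stride */
		unsigned long start  = 0x7fffc0000000UL; /* already aligned */
		unsigned long end    = 0x7fffffe01000UL; /* 4kB aligned only */

		/* end is exclusive, so rounding it up only widens the flush */
		printf("start %lx, end %lx\n",
		       round_down_to(start, stride), round_up_to(end, stride));
		return 0;
	}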
On Thu, Feb 06, 2025 at 09:48:25AM -0500, Rik van Riel wrote:
> On Thu, 2025-02-06 at 15:23 +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
> >
> > > This can result in flush_tlb_mm_range being called with a
> > > stride_shift for 2MB pages, but a range ending on a 4kB aligned
> > > (not 2MB aligned) boundary.
> > >
> > > Peter, how should we solve this one?
> >
> > I don't think that's wrong per se, since all we really need is for
> > end to be past the end; one byte, one page, or one stride don't
> > matter much.
> >
> > Anyway, I'm in desperate need of a break, so I'm not quite sure
> > what the best way forward is.
>
> Given that the tlb_flush() code is used only for page table freeing,

mmu_gather is used for both page and page-table freeing.

> we can just round up the end address to the nearest stride boundary
> there, with a comment explaining why?

Well, why are we rounding at all? I don't think I've seen an
explanation for that anywhere yet.

What made you do this?

> Alternatively, we could just change the page sizes used in
> pmd_free_tlb, pud_free_tlb, and p4d_free_tlb, given that the
> functions called now have parameters they didn't seem to have back
> in 2014, when the linked email in the comment was written?

Well, it really shouldn't matter. Notably, stride is simply the
smallest size encountered during the gather; not something that can
be determined a priori.

Like I said, end is exclusive, so it doesn't really matter how much
further it is, and PAGE_SIZE is a natural enough step, seeing it is
the smallest granularity.

So start and end not being page aligned is a definite fail. And start
should be stride aligned. But other than that, I don't see it matters
much.
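Stated as code, the invariants Peter describes could be checked like
this (a standalone sketch of a relaxed sanity check, not the actual
patch; the range from Oleksandr's report passes):

	#include <assert.h>

	#define PAGE_SHIFT 12
	#define PAGE_SIZE  (1UL << PAGE_SHIFT)

	/* start and end must be page aligned, start additionally stride
	 * aligned; end is exclusive, so how far past the range it lands
	 * does not matter */
	static void check_flush_range(unsigned long start, unsigned long end,
				      unsigned int stride_shift)
	{
		assert((start & (PAGE_SIZE - 1)) == 0);
		assert((end & (PAGE_SIZE - 1)) == 0);
		assert((start & ((1UL << stride_shift) - 1)) == 0);
	}

	int main(void)
	{
		/* 2MB-aligned start, 4kB-aligned end: fine under these rules */
		check_flush_range(0x7fffc0000000UL, 0x7fffffe01000UL, 21);
		return 0;
	}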
On Fri, 2025-02-07 at 09:16 +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 09:48:25AM -0500, Rik van Riel wrote:
>
> > we can just round up the end address to the nearest stride
> > boundary there, with a comment explaining why?
>
> Well, why are we rounding at all? I don't think I've seen an
> explanation for that anywhere yet.
>
> What made you do this?

The only real reason is to not end up with a 0 value for nr in these
loops:

	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		invlpgb_flush_addr_nosync(addr, nr);
	}

Which makes me think we could just insert a "nr = max(nr, 1)" in
there, and round up partial pages that way?

Looking further, I already did that on the userspace flushing side.
Let me do the same thing on the kernel side, and remove the other
code related to partial pages being passed in.
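A sketch of that fix applied to the loop above (hypothetical; the
actual respin may look different):

	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		/* a partial trailing page would yield nr == 0 and a loop
		 * that never advances; round it up to one full page */
		nr = max(nr, 1UL);
		invlpgb_flush_addr_nosync(addr, nr);
	}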
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

I applied this to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm,
to pick up fixes"), booted it, and ran some selftests on a Milan and
a Genoa machine. No issues, although I didn't have CONFIG_DEBUG_VM or
anything like that. I did see the warning reported by Oleksandr [0]
on QEMU.

I tried to test Turin but there's some unrelated brokenness. Due to
other unrelated issues I also couldn't test with lockdep.

Anyway, I guess that counts as:

Tested-by: Brendan Jackman <jackmanb@google.com>

I couldn't get any performance data either, more unrelated
brokenness. Maybe I can try again next week.

I posted some bikeshed review comments, but I haven't grokked the
interesting bits yet; I'll try to do that next week too.

[0] https://lore.kernel.org/linux-kernel/12602226.O9o76ZdvQC@natalenko.name/

Cheers,
Brendan