Message ID | 20250221005345.2156760-1-riel@surriel.com (mailing list archive) |
---|---|
Series | AMD broadcast TLB invalidation |
Hello.

On Friday, 21 February 2025 1:52:59 CET, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
>
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
>
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
>
> - vanilla kernel: 527k loops/second
> - lru_add_drain removal: 731k loops/second
> - only INVLPGB: 527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
>
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.
>
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
>
> Some numbers closer to real world performance
> can be found at Phoronix, thanks to Michael:
>
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
>
> My current plan is to implement support for Intel's RAR
> (Remote Action Request) TLB flushing in a follow-up series,
> after this thing has been merged into -tip. Making things
> any larger would just be unwieldy for reviewers.
>
> v12:
> - make sure "nopcid" command line option turns off invlpgb (Brendan)
> - add "noinvlpgb" kernel command line option
> - split out kernel TLB flushing differently (Dave & Yosry)
> - split up the patch that does invlpgb flushing for user processes (Dave)
> - clean up get_flush_tlb_info (Boris)
> - move invlpgb_count_max initialization to get_cpu_cap (Boris)
> - bunch more comments as requested

Somehow, this iteration breaks resume from S3. I can see it even in a QEMU VM:

```
[ 24.373391] ACPI: PM: Low-level resume complete
[ 24.373929] ACPI: PM: Restoring platform NVS memory
[ 24.375024] Enabling non-boot CPUs ...
[ 24.375777] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 24.376463] BUG: unable to handle page fault for address: ffffffffa3ba4d60
[ 24.377383] #PF: supervisor write access in kernel mode
[ 24.377912] #PF: error_code(0x0003) - permissions violation
[ 24.378413] PGD 25427067 P4D 25427067 PUD 25428063 PMD 8000000024c001a1
[ 24.379020] Oops: Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 24.379503] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 6.14.0-pf0 #1 161e4891fb5044b2d7438cd1852eeaac0cdffab5
[ 24.380650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 24.381400] RIP: 0010:get_cpu_cap+0x39b/0x4f0
[ 24.381810] Code: 08 c7 44 24 08 00 00 00 00 48 8d 4c 24 0c e8 3c 00 04 00 90 8b 44 24 04 89 43 64 0f b7 44 24 0c 83 c0 01 81 7b 24 09 00 00 80 <66> 89 05 0e ab 8b 01 0f 86 18 fd ff ff c7 44 24 14 00 00 00 00 4c
[ 24.383629] RSP: 0000:ffffafbec00efe70 EFLAGS: 00010012
[ 24.384155] RAX: 0000000000000001 RBX: ffff8b3fbcb19020 RCX: 0000000000001001
[ 24.384862] RDX: 0000000000000000 RSI: ffffafbec00efe74 RDI: ffffafbec00efe78
[ 24.385603] RBP: ffffafbec00efe88 R08: ffffafbec00efe70 R09: ffffafbec00efe7c
[ 24.386318] R10: 0000000000002430 R11: ffff8b3fa5428000 R12: ffffafbec00efe8c
[ 24.387014] R13: ffffafbec00efe84 R14: ffffafbec00efe80 R15: ffffafbec00efe70
[ 24.387713] FS: 0000000000000000(0000) GS:ffff8b3fbcb00000(0000) knlGS:0000000000000000
[ 24.388502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.389074] CR2: ffffffffa3ba4d60 CR3: 0000000025422000 CR4: 0000000000350ef0
[ 24.389769] Call Trace:
[ 24.390020] <TASK>
[ 24.392234] identify_cpu+0xd4/0x890
[ 24.392593] identify_secondary_cpu+0x12/0x40
[ 24.393032] smp_store_cpu_info+0x49/0x60
[ 24.393430] start_secondary+0x7f/0x140
[ 24.393810] common_startup_64+0x13e/0x141
[ 24.394218] </TASK>

$ scripts/faddr2line arch/x86/kernel/cpu/common.o get_cpu_cap+0x39b
get_cpu_cap+0x39b/0x500:
get_cpu_cap at …/arch/x86/kernel/cpu/common.c:1063

1060 	if (c->extended_cpuid_level >= 0x80000008) {
1061 		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
1062 		c->x86_capability[CPUID_8000_0008_EBX] = ebx;
1063 		invlpgb_count_max = (edx & 0xffff) + 1;
1064 	}
```

Any idea what I'm looking at?

Thank you.

> v11:
> - resolve conflict with CONFIG_PT_RECLAIM code
> - a few more cleanups (Peter, Brendan, Nadav)
> v10:
> - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
> - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
> - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
> - various cleanups (Brendan)
> v9:
> - print warning when start or end address was rounded (Peter)
> - in the reclaim code, tlbsync at context switch time (Peter)
> - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
> v8:
> - round start & end to handle non-page-aligned callers (Steven & Jan)
> - fix up changelog & add tested-by tags (Manali)
> v7:
> - a few small code cleanups (Nadav)
> - fix spurious VM_WARN_ON_ONCE in mm_global_asid
> - code simplifications & better barriers (Peter & Dave)
> v6:
> - fix info->end check in flush_tlb_kernel_range (Michael)
> - disable broadcast TLB flushing on 32 bit x86
> v5:
> - use byte assembly for compatibility with older toolchains (Borislav, Michael)
> - ensure a panic on an invalid number of extra pages (Dave, Tom)
> - add cant_migrate() assertion to tlbsync (Jann)
> - a bunch more cleanups (Nadav)
> - key TCE enabling off X86_FEATURE_TCE (Andrew)
> - fix a race between reclaim and ASID transition (Jann)
> v4:
> - Use only bitmaps to track free global ASIDs (Nadav)
> - Improved AMD initialization (Borislav & Tom)
> - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
> - Fixes for subtle race conditions (Jann)
> v3:
> - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
> - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
> - Apply suggestions by Peter and Borislav (thank you!)
> - Fix bug in arch_tlbbatch_flush, where we need to do both
>   the TLBSYNC, and flush the CPUs that are in the cpumask.
> - Some updates to comments and changelogs based on questions.
On Saturday, 22 February 2025 12:29:54 CET, Oleksandr Natalenko wrote:
> Hello.
>
> On Friday, 21 February 2025 1:52:59 CET, Rik van Riel wrote:
> > Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> > [...]
> > v12:
> > - make sure "nopcid" command line option turns off invlpgb (Brendan)
> > - add "noinvlpgb" kernel command line option
> > - split out kernel TLB flushing differently (Dave & Yosry)
> > - split up the patch that does invlpgb flushing for user processes (Dave)
> > - clean up get_flush_tlb_info (Boris)
> > - move invlpgb_count_max initialization to get_cpu_cap (Boris)
> > - bunch more comments as requested
>
> Somehow, this iteration breaks resume from S3. I can see it even in a QEMU VM:

Can also reproduce this by simply offlining/onlining a CPU via `/sys/devices/system/cpu/cpuX/online`.

> [...]
On Sat, 2025-02-22 at 12:29 +0100, Oleksandr Natalenko wrote:
>
> [ 24.381400] RIP: 0010:get_cpu_cap+0x39b/0x4f0
>
> $ scripts/faddr2line arch/x86/kernel/cpu/common.o get_cpu_cap+0x39b
> get_cpu_cap+0x39b/0x500:
> get_cpu_cap at …/arch/x86/kernel/cpu/common.c:1063
>
> 1060 	if (c->extended_cpuid_level >= 0x80000008) {
> 1061 		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> 1062 		c->x86_capability[CPUID_8000_0008_EBX] = ebx;
> 1063 		invlpgb_count_max = (edx & 0xffff) + 1;
> 1064 	}
>
> Any idea what I'm looking at?

It's crashing when writing the value to the
invlpgb_count_max variable.

This would be because:
1) invlpgb_count_max is marked __ro_after_init, making
   it read-only after the system has finished booting, but
2) get_cpu_cap gets run at resume and CPU hotplug time!

Borislav, do you prefer I move the initialization of
invlpgb_count_max back to where it was before, or get
rid of the __ro_after_init thing?
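
For readers following along, the failure described above can be approximated in user space. The sketch below is not kernel code: it assumes Linux, uses an anonymous page plus mprotect() as a stand-in for __ro_after_init, and every name in it (count_max, decode_count_max, model_boot, model_late_write) is hypothetical. Only the arithmetic `(edx & 0xffff) + 1` is taken from the quoted source line.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the __ro_after_init variable; it lives
 * alone in its own page so the page can be sealed after "boot". */
static uint16_t *count_max;
static sigjmp_buf fault_jmp;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);   /* unwind out of the faulting store */
}

/* Same arithmetic as the quoted line 1063: low 16 bits of EDX, plus one. */
static unsigned int decode_count_max(uint32_t edx)
{
    return (edx & 0xffff) + 1;
}

/* Model of boot: the first write is legal, then the page is sealed,
 * loosely mirroring what happens to __ro_after_init data. */
static int model_boot(uint32_t edx)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    count_max = p;
    *count_max = (uint16_t)decode_count_max(edx);
    return mprotect(p, page, PROT_READ);
}

/* Model of a CPU coming back online after boot and repeating the store.
 * Returns 1 if the store faulted, 0 if it went through. */
static int model_late_write(uint32_t edx)
{
    struct sigaction sa, old;
    volatile int faulted = 0;

    assert(count_max != NULL);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(fault_jmp, 1) == 0)
        *(volatile uint16_t *)count_max = (uint16_t)decode_count_max(edx);
    else
        faulted = 1;

    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}
```

Calling model_boot() once and then model_late_write() plays the role of the resume or hotplug path: the second store hits a read-only page, which is the user-space analogue of the "supervisor write access ... permissions violation" in the oops.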
On Sat, Feb 22, 2025 at 11:05:41AM -0500, Rik van Riel wrote:
> It's crashing when writing the value to the
> invlpgb_count_max variable.
>
> This would be because:
> 1) invlpgb_count_max is marked __ro_after_init, making
>    it read-only after the system has finished booting, but
> 2) get_cpu_cap gets run at resume and CPU hotplug time!

Yet another side effect of us reading CPUID gazillion times. /facepalm.

> Borislav, do you prefer I move the initialization of
> invlpgb_count_max back to where it was before, or get
> rid of the __ro_after_init thing?

You probably could move it back to where it was - cpu_detect_tlb_amd - and
leave it __ro_after_init because cpu_detect_tlb() is run on the BSP only so
I'm guessing resume doesn't bootstrap that thing...

Thx.
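
A toy model may help show why the placement matters. The sketch below is plain user-space C, not kernel code, and all names in it (per_cpu_identify, boot_cpu_detect_tlb, simulate, the sealed flag) are hypothetical. It takes as given the guess made above, that the boot-CPU-only path is not run again at resume or hotplug, and simply counts stores attempted after the variable has been sealed.

```c
#include <stdbool.h>
#include <stdint.h>

static uint16_t count_max;   /* stand-in for invlpgb_count_max */
static bool sealed;          /* true once "boot" has finished */
static int late_writes;      /* stores attempted after sealing */

static void set_count_max(uint32_t edx)
{
    if (sealed) {
        late_writes++;       /* in the kernel this store would fault */
        return;
    }
    count_max = (uint16_t)((edx & 0xffff) + 1);
}

/* Modelled as running for every CPU that comes up, including at
 * resume and hotplug time (the get_cpu_cap-like path). */
static void per_cpu_identify(uint32_t edx, bool init_here)
{
    if (init_here)
        set_count_max(edx);
}

/* Modelled as running once, on the boot CPU only
 * (the cpu_detect_tlb-like path). */
static void boot_cpu_detect_tlb(uint32_t edx, bool init_here)
{
    if (init_here)
        set_count_max(edx);
}

/* Boot one CPU, seal the variable, then bring a second CPU online.
 * Returns the number of stores attempted after sealing. */
static int simulate(bool init_in_per_cpu_path, uint32_t edx)
{
    count_max = 0;
    sealed = false;
    late_writes = 0;

    per_cpu_identify(edx, init_in_per_cpu_path);
    boot_cpu_detect_tlb(edx, !init_in_per_cpu_path);
    sealed = true;

    per_cpu_identify(edx, init_in_per_cpu_path);   /* CPU 1 comes online */
    return late_writes;
}
```

With the initialization in the per-CPU path, simulate(true, edx) reports one late store; with it in the boot-only path, simulate(false, edx) reports none, so the variable can stay read-only after init in this model.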