[v12,00/16] AMD broadcast TLB invalidation

Series

AMD broadcast TLB invalidation

[v12,00/16] AMD broadcast TLB invalidation
[v12,01/16] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
[v12,02/16] x86/mm: remove pv_ops.mmu.tlb_remove_table call
[v12,03/16] x86/mm: consolidate full flush threshold decision
[v12,04/16] x86/mm: get INVLPGB count max from CPUID
[v12,05/16] x86/mm: add INVLPGB support code
[v12,06/16] x86/mm: use INVLPGB for kernel TLB flushes
[v12,07/16] x86/mm: use INVLPGB in flush_tlb_all
[v12,08/16] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
[v12,09/16] x86/mm: global ASID allocation helper functions
[v12,10/16] x86/mm: global ASID context switch & TLB flush handling
[v12,11/16] x86/mm: global ASID process exit helpers
[v12,12/16] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
[v12,13/16] x86/mm: do targeted broadcast flushing from tlbbatch code
[v12,14/16] x86/mm: enable AMD translation cache extensions
[v12,15/16] x86/mm: only invalidate final translations with INVLPGB
[v12,16/16] x86/mm: add noinvlpgb commandline option

Message ID

20250221005345.2156760-1-riel@surriel.com (mailing list archive)

Headers

From: Rik van Riel <riel@surriel.com>
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org,
	bp@alien8.de,
	peterz@infradead.org,
	dave.hansen@linux.intel.com,
	zhengqi.arch@bytedance.com,
	nadav.amit@gmail.com,
	thomas.lendacky@amd.com,
	kernel-team@meta.com,
	linux-mm@kvack.org,
	akpm@linux-foundation.org,
	jackmanb@google.com,
	jannh@google.com,
	mhklinux@outlook.com,
	andrew.cooper3@citrix.com,
	Manali.Shukla@amd.com
Subject: [PATCH v12 00/16] AMD broadcast TLB invalidation
Date: Thu, 20 Feb 2025 19:52:59 -0500
Message-ID: <20250221005345.2156760-1-riel@surriel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Message

Rik van Riel Feb. 21, 2025, 12:52 a.m. UTC

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4% of CPU time, the contention
simply moved to the LRU lock.

Fixing both at the same time about doubles the
number of iterations per second for this case.
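
The ratios behind "about doubles" can be checked directly from the four
figures above. This is only a small arithmetic sketch; the helper name
speedup() and the constant names are made up for illustration:

```c
#include <assert.h>

/* Throughput relative to a baseline; both values in k loops/second. */
static double speedup(double loops, double baseline)
{
	assert(baseline > 0.0);
	return loops / baseline;
}

/* Figures quoted in the list above, in k loops/second. */
static const double vanilla = 527.0;
static const double lru_only = 731.0;
static const double invlpgb_only = 527.0;
static const double both = 1157.0;
```

With these inputs, speedup(lru_only, vanilla) is roughly 1.39,
speedup(invlpgb_only, vanilla) is exactly 1.0, and speedup(both, vanilla)
is roughly 2.2.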

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v12:
 - make sure "nopcid" command line option turns off invlpgb (Brendan)
 - add "noinvlpgb" kernel command line option
 - split out kernel TLB flushing differently (Dave & Yosry)
 - split up the patch that does invlpgb flushing for user processes (Dave)
 - clean up get_flush_tlb_info (Boris)
 - move invlpgb_count_max initialization to get_cpu_cap (Boris)
 - bunch more comments as requested
v11:
 - resolve conflict with CONFIG_PT_RECLAIM code
 - a few more cleanups (Peter, Brendan, Nadav)
v10:
 - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
 - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
 - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
 - various cleanups (Brendan)
v9:
 - print warning when start or end address was rounded (Peter)
 - in the reclaim code, tlbsync at context switch time (Peter)
 - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
 - round start & end to handle non-page-aligned callers (Steven & Jan)
 - fix up changelog & add tested-by tags (Manali)
v7:
 - a few small code cleanups (Nadav)
 - fix spurious VM_WARN_ON_ONCE in mm_global_asid
 - code simplifications & better barriers (Peter & Dave)
v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32 bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.

Comments

Oleksandr Natalenko Feb. 22, 2025, 11:29 a.m. UTC | #1

Hello.

On Friday, 21 February 2025 1:52:59 Central European Standard Time, Rik van Riel wrote:
> v12:
>  - make sure "nopcid" command line option turns off invlpgb (Brendan)
>  - add "noinvlpgb" kernel command line option
>  - split out kernel TLB flushing differently (Dave & Yosry)
>  - split up the patch that does invlpgb flushing for user processes (Dave)
>  - clean up get_flush_tlb_info (Boris)
>  - move invlpgb_count_max initialization to get_cpu_cap (Boris)
>  - bunch more comments as requested

Somehow, this iteration breaks resume from S3. I can see it even in a QEMU VM:

```
[   24.373391] ACPI: PM: Low-level resume complete
[   24.373929] ACPI: PM: Restoring platform NVS memory
[   24.375024] Enabling non-boot CPUs ...
[   24.375777] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   24.376463] BUG: unable to handle page fault for address: ffffffffa3ba4d60
[   24.377383] #PF: supervisor write access in kernel mode
[   24.377912] #PF: error_code(0x0003) - permissions violation
[   24.378413] PGD 25427067 P4D 25427067 PUD 25428063 PMD 8000000024c001a1
[   24.379020] Oops: Oops: 0003 [#1] PREEMPT SMP NOPTI
[   24.379503] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 6.14.0-pf0 #1 161e4891fb5044b2d7438cd1852eeaac0cdffab5
[   24.380650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   24.381400] RIP: 0010:get_cpu_cap+0x39b/0x4f0
[   24.381810] Code: 08 c7 44 24 08 00 00 00 00 48 8d 4c 24 0c e8 3c 00 04 00 90 8b 44 24 04 89 43 64 0f b7 44 24 0c 83 c0 01 81 7b 24 09 00 00 80 <66> 89 05 0e ab 8b 01 0f 86 18 fd ff ff c7 44 24 14 00 00 00 00 4c
[   24.383629] RSP: 0000:ffffafbec00efe70 EFLAGS: 00010012
[   24.384155] RAX: 0000000000000001 RBX: ffff8b3fbcb19020 RCX: 0000000000001001
[   24.384862] RDX: 0000000000000000 RSI: ffffafbec00efe74 RDI: ffffafbec00efe78
[   24.385603] RBP: ffffafbec00efe88 R08: ffffafbec00efe70 R09: ffffafbec00efe7c
[   24.386318] R10: 0000000000002430 R11: ffff8b3fa5428000 R12: ffffafbec00efe8c
[   24.387014] R13: ffffafbec00efe84 R14: ffffafbec00efe80 R15: ffffafbec00efe70
[   24.387713] FS:  0000000000000000(0000) GS:ffff8b3fbcb00000(0000) knlGS:0000000000000000
[   24.388502] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.389074] CR2: ffffffffa3ba4d60 CR3: 0000000025422000 CR4: 0000000000350ef0
[   24.389769] Call Trace:
[   24.390020]  <TASK>
[   24.392234]  identify_cpu+0xd4/0x890
[   24.392593]  identify_secondary_cpu+0x12/0x40
[   24.393032]  smp_store_cpu_info+0x49/0x60
[   24.393430]  start_secondary+0x7f/0x140
[   24.393810]  common_startup_64+0x13e/0x141
[   24.394218]  </TASK>

$ scripts/faddr2line arch/x86/kernel/cpu/common.o get_cpu_cap+0x39b
get_cpu_cap+0x39b/0x500:
get_cpu_cap at …/arch/x86/kernel/cpu/common.c:1063

1060         if (c->extended_cpuid_level >= 0x80000008) {
1061                 cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
1062                 c->x86_capability[CPUID_8000_0008_EBX] = ebx;
1063                 invlpgb_count_max = (edx & 0xffff) + 1;
1064         }
```

Any idea what I'm looking at?

Thank you.
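
As a reading aid for the oops above, here is a minimal sketch that decodes
the two values that describe the fault: error_code(0x0003) and the PMD entry
8000000024c001a1. The bit positions are the architectural x86 ones; the
struct and function names (pf_info, decode_pf, entry_writable,
check_oops_values) are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the x86 page-fault error code. */
#define PF_PROT  (1u << 0)	/* 1: protection violation, 0: page not present */
#define PF_WRITE (1u << 1)	/* 1: write access, 0: read access */
#define PF_USER  (1u << 2)	/* 1: user mode, 0: supervisor mode */
#define PF_RSVD  (1u << 3)	/* 1: reserved bit set in a paging entry */
#define PF_INSTR (1u << 4)	/* 1: instruction fetch */

struct pf_info {
	bool protection, write, user, reserved_bit, instr_fetch;
};

static struct pf_info decode_pf(uint32_t error_code)
{
	struct pf_info pf = {
		.protection   = error_code & PF_PROT,
		.write        = error_code & PF_WRITE,
		.user         = error_code & PF_USER,
		.reserved_bit = error_code & PF_RSVD,
		.instr_fetch  = error_code & PF_INSTR,
	};
	return pf;
}

/* Bit 1 (R/W) of a paging-structure entry: set means writable. */
static bool entry_writable(uint64_t entry)
{
	return (entry >> 1) & 1;
}

/* Usage: check the values printed in the oops above. */
static void check_oops_values(void)
{
	struct pf_info pf = decode_pf(0x0003);

	/* Supervisor-mode write to a present page: a permissions violation. */
	assert(pf.protection && pf.write && !pf.user);
	/* The PMD entry from the oops has R/W clear, i.e. it maps read-only. */
	assert(!entry_writable(0x8000000024c001a1ULL));
}
```

Both results agree with the "#PF: supervisor write access in kernel mode"
and "permissions violation" lines in the log: the target address is mapped,
but not writable.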


Oleksandr Natalenko Feb. 22, 2025, 11:36 a.m. UTC | #2

On Saturday, 22 February 2025 12:29:54 Central European Standard Time, Oleksandr Natalenko wrote:
> Somehow, this iteration breaks resume from S3. I can see it even in a QEMU VM:

Can also reproduce this by simply offlining/onlining a CPU via `/sys/devices/system/cpu/cpuX/online`.


Rik van Riel Feb. 22, 2025, 4:05 p.m. UTC | #3

On Sat, 2025-02-22 at 12:29 +0100, Oleksandr Natalenko wrote:
> 
> [   24.381400] RIP: 0010:get_cpu_cap+0x39b/0x4f0
> 

> $ scripts/faddr2line arch/x86/kernel/cpu/common.o get_cpu_cap+0x39b
> get_cpu_cap+0x39b/0x500:
> get_cpu_cap at …/arch/x86/kernel/cpu/common.c:1063
> 
> 1060         if (c->extended_cpuid_level >= 0x80000008) {
> 1061                 cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> 1062                 c->x86_capability[CPUID_8000_0008_EBX] = ebx;
> 1063                 invlpgb_count_max = (edx & 0xffff) + 1;
> 1064         }
> ```
> 
> Any idea what I'm looking at?

It's crashing when writing the value to the
invlpgb_count_max variable.

This would be because:
1) invlpgb_count_max is marked __ro_after_init, making
   it read-only after the system has finished booting, but
2) get_cpu_cap gets run at resume and CPU hotplug time!
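
The two points above can be reproduced in miniature in userspace. The sketch
below is an analogy only, assuming POSIX mmap(), mprotect() and fork(): the
names count_max, count_from_edx, boot_and_seal and late_write_signal are
made up for illustration, and the kernel implements __ro_after_init through
its own section placement and page-table permissions, not through these
calls.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the __ro_after_init variable; it lives in its own page. */
static uint16_t *count_max;

/* Same arithmetic as the quoted common.c line 1063. */
static uint16_t count_from_edx(uint32_t edx)
{
	return (uint16_t)((edx & 0xffff) + 1);
}

/* "Boot": map a page, initialize the value once, then make it read-only. */
static int boot_and_seal(uint32_t edx)
{
	long page = sysconf(_SC_PAGESIZE);

	assert(page > 0);
	count_max = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (count_max == MAP_FAILED)
		return -1;
	*count_max = count_from_edx(edx);
	return mprotect(count_max, (size_t)page, PROT_READ);
}

/*
 * "Resume/hotplug": repeat the same write after sealing. The write is done
 * in a child process so the caller survives; the return value is the signal
 * that killed the child, 0 if it exited normally, or -1 on error.
 */
static int late_write_signal(uint32_t edx)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Expected to fault: the page is read-only by now. */
		*(volatile uint16_t *)count_max = count_from_edx(edx);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling boot_and_seal() once and then late_write_signal() mirrors the
sequence in the oops: the first write succeeds, the repeated write from a
later code path hits a read-only mapping. Either fix discussed here breaks
that sequence, by not repeating the write after sealing or by not sealing
the variable.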

Borislav, do you prefer I move the initialization of 
invlpgb_count_max back to where it was before, or get
rid of the __ro_after_init thing?

Borislav Petkov Feb. 22, 2025, 4:19 p.m. UTC | #4

On Sat, Feb 22, 2025 at 11:05:41AM -0500, Rik van Riel wrote:
> It's crashing when writing the value to the
> invlpgb_count_max variable.
> 
> This would be because:
> 1) invlpgb_count_max is marked __ro_after_init, making
>    it read-only after the system has finished booting, but
> 2) get_cpu_cap gets run at resume and CPU hotplug time!

Yet another side effect of us reading CPUID a gazillion times. /facepalm.

> Borislav, do you prefer I move the initialization of 
> invlpgb_count_max back to where it was before, or get
> rid of the __ro_after_init thing?

You probably could move it back to where it was - cpu_detect_tlb_amd - and
leave it __ro_after_init because cpu_detect_tlb() is run on the BSP only so
I'm guessing resume doesn't bootstrap that thing...

Thx.