Message ID | 20240304005104.622511517@linutronix.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86: Cure tons of sparse warnings (mostly __percpu) | expand |
Hi, On Mon, Mar 04, 2024 at 11:12:23AM +0100, Thomas Gleixner wrote: > On UP builds sparse complains rightfully about accesses to cpu_info with > per CPU accessors: > > cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces) > cacheinfo.c:282:30: sparse: expected void const [noderef] __percpu *__vpp_verify > cacheinfo.c:282:30: sparse: got unsigned int * > > The reason is that on UP builds cpu_info which is a per CPU variable on SMP > is mapped to boot_cpu_info which is a regular variable. There is a hideous > accessor cpu_data() which tries to hide this, but it's not sufficient as > some places require raw accessors and generates worse code than the regular > per CPU accessors. > > Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU > cpu_info unconditionally. This requires to update the CPU info on the boot > CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function > and implement exactly that. > > This allows to use regular per CPU accessors uncoditionally and paves the > way to remove the cpu_data() hackery. > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> This patch results in crashes when running the mainline kernel in qemu with nosmp builds and Intel CPUs. The problem is _not_ seen on tag x86-cleanups-2024-03-11; it is only seen in the mainline kernel. I didn't check all of them, but it looks like AMD CPUs are not affected. The initial bisect points to the merge of x86-cleanups-2024-03-11 into the mainline kernel. I rebased x86-cleanups-2024-03-11 on top of the mainline kernel; the second bisect uses that rebase as base. Reverting this patch from the mainline kernel fixes the problem. I don't know the code well enough to determine what is wrong. Please let me know what I can do to help debugging the problem. Thanks, Guenter ---- crash log: [ 3.291087] BUG: unable to handle page fault for address: ffff9cd801f3f2a0 [ 3.291087] #PF: supervisor write access in kernel mode [ 3.291087] #PF: error_code(0x0002) - not-present page [ 3.291087] PGD 60201067 P4D 60201067 PUD 0 [ 3.291087] Oops: 0002 [#1] PREEMPT PTI [ 3.291087] CPU: 0 PID: 1 Comm: swapper Not tainted 6.8.0-06619-ge5e038b7ae9d #1 [ 3.291087] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110 [ 3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff [ 3.291087] RSP: 0018:ffffa3d54001fdd0 EFLAGS: 00000246 [ 3.291087] RAX: ffff9cd001f3f200 RBX: ffff9cd001fb34a8 RCX: 0000000000000000 [ 3.291087] RDX: 00000000ffffffed RSI: 0000000000000001 RDI: ffff9cd001fb3550 [ 3.291087] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001 [ 3.291087] R10: 0000000000000001 R11: 0000000000018001 R12: 0000000000000000 [ 3.291087] R13: 000000000000009e R14: ffffffffb6808180 R15: ffffffffb86710e5 [ 3.291087] FS: 0000000000000000(0000) GS:ffffffffb8ab0000(0000) knlGS:0000000000000000 [ 3.291087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.291087] CR2: ffff9cd801f3f2a0 CR3: 000000005e6a2000 CR4: 00000000001506f0 [ 3.291087] Call Trace: [ 3.291087] <TASK> [ 3.291087] ? __die+0x1f/0x60 [ 3.291087] ? page_fault_oops+0x148/0x460 [ 3.291087] ? search_exception_tables+0x37/0x50 [ 3.291087] ? fixup_exception+0x21/0x320 [ 3.291087] ? exc_page_fault+0xca/0x150 [ 3.291087] ? asm_exc_page_fault+0x26/0x30 [ 3.291087] ? __pfx_rapl_cpu_online+0x10/0x10 [ 3.291087] ? rapl_cpu_online+0xf2/0x110 [ 3.291087] cpuhp_invoke_callback.constprop.0+0x117/0x3e0 [ 3.291087] __cpuhp_setup_state_cpuslocked+0x1b7/0x280 [ 3.291087] ? __pfx_rapl_cpu_online+0x10/0x10 [ 3.291087] rapl_pmu_init+0x189/0x2e0 [ 3.291087] ? __pfx_rapl_pmu_init+0x10/0x10 [ 3.291087] do_one_initcall+0x4f/0x210 [ 3.291087] kernel_init_freeable+0x166/0x290 [ 3.291087] ? __pfx_kernel_init+0x10/0x10 [ 3.291087] kernel_init+0x15/0x1b0 [ 3.291087] ret_from_fork+0x2f/0x50 [ 3.291087] ? __pfx_kernel_init+0x10/0x10 [ 3.291087] ret_from_fork_asm+0x19/0x30 [ 3.291087] </TASK> [ 3.291087] Modules linked in: [ 3.291087] CR2: ffff9cd801f3f2a0 [ 3.291087] ---[ end trace 0000000000000000 ]--- [ 3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110 [ 3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff [ 3.291087] RSP: 0018:ffffa3d54001fdd0 EFLAGS: 00000246 [ 3.291087] RAX: ffff9cd001f3f200 RBX: ffff9cd001fb34a8 RCX: 0000000000000000 [ 3.291087] RDX: 00000000ffffffed RSI: 0000000000000001 RDI: ffff9cd001fb3550 [ 3.291087] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001 [ 3.291087] R10: 0000000000000001 R11: 0000000000018001 R12: 0000000000000000 [ 3.291087] R13: 000000000000009e R14: ffffffffb6808180 R15: ffffffffb86710e5 [ 3.291087] FS: 0000000000000000(0000) GS:ffffffffb8ab0000(0000) knlGS:0000000000000000 [ 3.291087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.291087] CR2: ffff9cd801f3f2a0 CR3: 000000005e6a2000 CR4: 00000000001506f0 [ 3.291087] note: swapper[1] exited with irqs disabled [ 3.306047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 --- 1st bisect: # bad: [1bbeaf83dd7b5e3628b98bec66ff8fe2646e14aa] Merge tag 'perf-tools-for-v6.9-2024-03-13' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools # good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8 git bisect start 'HEAD' 'v6.8' # bad: [9187210eee7d87eea37b45ea93454a88681894a4] Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next git bisect bad 9187210eee7d87eea37b45ea93454a88681894a4 # bad: [a01c9fe32378636ae65bec8047b5de3fdb2ba5c8] Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux git bisect bad a01c9fe32378636ae65bec8047b5de3fdb2ba5c8 # bad: [691632f0e86973604678d193ccfa47b9614581aa] Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux git bisect bad 691632f0e86973604678d193ccfa47b9614581aa # good: [8ede842f669b6f78812349bbef4d1efd0fbdafce] Merge tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux git bisect good 8ede842f669b6f78812349bbef4d1efd0fbdafce # good: [bfdb395a7cde12d83a623949ed029b0ab38d765b] Merge tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good bfdb395a7cde12d83a623949ed029b0ab38d765b # bad: [685d98211273f60e38a6d361b62d7016c545297e] Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 685d98211273f60e38a6d361b62d7016c545297e # good: [b0402403e54ae9eb94ce1cbb53c7def776e97426] Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras git bisect good b0402403e54ae9eb94ce1cbb53c7def776e97426 # good: [cb81deefb59de01325ab822f900c13941bfaf67f] x86/idle: Sanitize X86_BUG_AMD_E400 handling git bisect good cb81deefb59de01325ab822f900c13941bfaf67f # good: [73f0d1d7b4abb4a46bae1a0d8caf66e23d1138d0] Merge tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good 73f0d1d7b4abb4a46bae1a0d8caf66e23d1138d0 # good: [65efc4dc12c5cc296374278673b89390eba79fe6] x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation git bisect good 65efc4dc12c5cc296374278673b89390eba79fe6 # good: [d69ad12c786f0a4593c48c0658043aa4a5116b09] Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good d69ad12c786f0a4593c48c0658043aa4a5116b09 # good: [35ce64922c8263448e58a2b9e8d15a64e11e9b2d] x86/idle: Select idle routine only once git bisect good 35ce64922c8263448e58a2b9e8d15a64e11e9b2d # good: [774a86f1c885460ade4334b901919fa1d8ae6ec6] x86/nmi: Drop unused declaration of proc_nmi_enabled() git bisect good 774a86f1c885460ade4334b901919fa1d8ae6ec6 # bad: [fcc196579aa1fc167d6778948bff69fae6116737] Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad fcc196579aa1fc167d6778948bff69fae6116737 # first bad commit: [fcc196579aa1fc167d6778948bff69fae6116737] Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip --- 2nd bisect: # bad: [6d70929c7019e265425f7a89cf72163a639d462e] x86/nmi: Drop unused declaration of proc_nmi_enabled() # good: [d69ad12c786f0a4593c48c0658043aa4a5116b09] Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect start 'HEAD' 'fcc196579aa1fc167d6778948bff69fae6116737~1' # good: [5c157d25918ef15454c2a9a9262f7b763d9c9add] x86/msr: Add missing __percpu annotations git bisect good 5c157d25918ef15454c2a9a9262f7b763d9c9add # bad: [f5a6b1e2d92d825a0f73ca11e960795da336d99e] x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() git bisect bad f5a6b1e2d92d825a0f73ca11e960795da336d99e # bad: [68907233f6d26a214bb79f892db8d999b15dda97] x86/percpu: Cure per CPU madness on UP git bisect bad 68907233f6d26a214bb79f892db8d999b15dda97 # good: [053df18ceb928c8631042317a884b2842a457f1b] smp: Consolidate smp_prepare_boot_cpu() git bisect good 053df18ceb928c8631042317a884b2842a457f1b # first bad commit: [68907233f6d26a214bb79f892db8d999b15dda97] x86/percpu: Cure per CPU madness on UP
On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote: > > [ 3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110 > [ 3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff The code is mov %rax,0x60(%rbx) call 0x125f5f mov 0x26184b4(%rip),%edx mov 0x3078f01(%rip),%rax movq $0xffffffffb68084e0,0x90(%rbx) mov %rbx,0x138(%rax,%rdx,8) <-- trapping instruction jmp <backwards> with %rdx being some index having the value 0xffffffed (-19). That's ENODEV. Without line numbers (if you have debug info for that kernel, it's good to run "scripts/decode_stacktrace.sh" on stack traces) it's hard to really know what's up, but I strongly suspect that it's this: rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu; because we have topology_logical_die_id(cpu) -> (cpu_data(cpu).topo.logical_die_id) and we have c->topo.logical_die_id = topology_get_logical_id(apicid, TOPO_DIE_DOMAIN); and topology_get_logical_id() does this: if (lvlid >= MAX_LOCAL_APIC) return -ERANGE; if (!test_bit(lvlid, apic_maps[at_level].map)) return -ENODEV; so that -ENODEV is not entirely unlikely for a UP run. This also explains why it *used* to work - that whole thing is new to the current merge window and came in through commit ca7e91776912 ("Merge tag 'x86-apic-2024-03-10' of ..."). Thomas, over to you. I wonder if maybe all those topology macros should just return 0 on an UP build, but that topology_get_logical_id() thing looks a bit wrong regardless. It really shouldn't depend on local apic data for configs that may not *have* a local apic. Linus
On 3/15/24 09:42, Linus Torvalds wrote: > On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote: >> >> [ 3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110 >> [ 3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff > > The code is > > mov %rax,0x60(%rbx) > call 0x125f5f > mov 0x26184b4(%rip),%edx > mov 0x3078f01(%rip),%rax > movq $0xffffffffb68084e0,0x90(%rbx) > mov %rbx,0x138(%rax,%rdx,8) <-- trapping instruction > jmp <backwards> > > with %rdx being some index having the value 0xffffffed (-19). > > That's ENODEV. > > Without line numbers (if you have debug info for that kernel, it's > good to run "scripts/decode_stacktrace.sh" on stack traces) it's hard Sorry, I should have done that. > to really know what's up, but I strongly suspect that it's this: > > rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu; Correct: [ 2.632164] RIP: 0010:rapl_cpu_online (arch/x86/events/rapl.c:581) which does point to that line. Here is a complete decoded backtrace: [ 2.632164] Call Trace: [ 2.632164] <TASK> [ 2.632164] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) [ 2.632164] ? page_fault_oops (arch/x86/mm/fault.c:713) [ 2.632164] ? search_exception_tables (kernel/extable.c:59) [ 2.632164] ? fixup_exception (arch/x86/mm/extable.c:328) [ 2.632164] ? exc_page_fault (arch/x86/mm/fault.c:1503 arch/x86/mm/fault.c:1563) [ 2.632164] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) [ 2.632164] ? __pfx_rapl_cpu_online (arch/x86/events/rapl.c:566) [ 2.632164] ? rapl_cpu_online (arch/x86/events/rapl.c:581) [ 2.632164] cpuhp_invoke_callback.constprop.0 (kernel/cpu.c:195) [ 2.632164] __cpuhp_setup_state_cpuslocked (kernel/cpu.c:2541) [ 2.632164] ? __pfx_rapl_cpu_online (arch/x86/events/rapl.c:566) [ 2.632164] rapl_pmu_init (./include/linux/cpuhotplug.h:274 arch/x86/events/rapl.c:843) [ 2.632164] ? __pfx_rapl_pmu_init (arch/x86/events/rapl.c:816) [ 2.632164] do_one_initcall (init/main.c:1241) [ 2.632164] kernel_init_freeable (init/main.c:1302 init/main.c:1319 init/main.c:1338 init/main.c:1551) [ 2.632164] ? __pfx_kernel_init (init/main.c:1432) [ 2.632164] kernel_init (init/main.c:1442) [ 2.632164] ret_from_fork (arch/x86/kernel/process.c:153) [ 2.632164] ? __pfx_kernel_init (init/main.c:1432) [ 2.632164] ret_from_fork_asm (arch/x86/entry/entry_64.S:256) Guenter
On Fri, Mar 15 2024 at 09:42, Linus Torvalds wrote: > On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote: > Without line numbers (if you have debug info for that kernel, it's > good to run "scripts/decode_stacktrace.sh" on stack traces) it's hard > to really know what's up, but I strongly suspect that it's this: > > rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu; > > because we have > > topology_logical_die_id(cpu) -> > (cpu_data(cpu).topo.logical_die_id) > > and we have > > c->topo.logical_die_id = topology_get_logical_id(apicid, TOPO_DIE_DOMAIN); > > and topology_get_logical_id() does this: > > if (lvlid >= MAX_LOCAL_APIC) > return -ERANGE; > if (!test_bit(lvlid, apic_maps[at_level].map)) > return -ENODEV; > > so that -ENODEV is not entirely unlikely for a UP run. > > This also explains why it *used* to work - that whole thing is new to > the current merge window and came in through commit ca7e91776912 > ("Merge tag 'x86-apic-2024-03-10' of ..."). > > Thomas, over to you. I wonder if maybe all those topology macros > should just return 0 on an UP build, but that > topology_get_logical_id() thing looks a bit wrong regardless. > > It really shouldn't depend on local apic data for configs that may not > *have* a local apic. Right. Let me look.
On Fri, Mar 15 2024 at 18:40, Thomas Gleixner wrote: > On Fri, Mar 15 2024 at 09:42, Linus Torvalds wrote: >> On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote: >> Thomas, over to you. I wonder if maybe all those topology macros >> should just return 0 on an UP build, but that >> topology_get_logical_id() thing looks a bit wrong regardless. >> >> It really shouldn't depend on local apic data for configs that may not >> *have* a local apic. > > Right. Let me look. Not really. The problem is that a SMP build can run on a UP machine w/o APIC or command line disables the APIC and will run into the exactly same problem. The only case where we know that it is impossible is when APIC support is disabled, which is silly but topic for a different discussion. So the proper thing to do is to check for num_possible_cpus() == 1 in that function. Sure you can argue that we could avoid it for SMP=n builds completely, but I think the right thing to do is to aim for removing CONFIG_SMP and make the UP build a subset of a generic SMP capable build which has CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why? Because it consolidates the code and makes UP use exactly the same mechanisms as SMP which pretty much avoids the problem we see today that UP lacks test coverage and becomes more esoteric and untested over time. The downside is a slightly larger kernel image for such a build. Though if we pretend that we seriously care about that 10% larger memory footprint or about the marginal performance benefit of SMP=n on dead hardware, then we are just taking the wrong pills. The point is that this very commit in question was heading deliberately into the direction of removing the by now silly differences of UP/SMP for correctness sake. It just happened to unearth the missing check in topology_get_logical_id(), but that check is required independent of SMP=y/n as I pointed out above. Thanks, tglx --- diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 3259b1d4fefe..118d9f7792ee 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -279,6 +279,15 @@ int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level) if (lvlid >= MAX_LOCAL_APIC) return -ERANGE; + /* + * Spare the exercise on UP as there is only one instance at any + * level and the map check below might fail because the CPU does + * not have a local APIC or local APIC has been disabled on the + * kernel command line. + */ + if (num_possible_cpus() == 1) + return 0; + if (!test_bit(lvlid, apic_maps[at_level].map)) return -ENODEV; /* Get the number of set bits before @lvlid. */
On Fri, 15 Mar 2024 at 15:55, Thomas Gleixner <tglx@linutronix.de> wrote: > > Not really. The problem is that a SMP build can run on a UP machine w/o > APIC or command line disables the APIC and will run into the exactly > same problem. The only case where we know that it is impossible is when > APIC support is disabled, which is silly but topic for a different > discussion. Oh, I agree - that was why I said that it shouldn't depend on a local APIC on machines that may not even have one. That "may not even have one" can still be a static option - we technically allow 32-bit UP kernel to not enable X86_UP_APIC, although it might be time to drop that option. > So the proper thing to do is to check for num_possible_cpus() == 1 in > that function. I think that's _one_ proper thing. I still think that the deeper problem is that it still looks at local apic rules even when those rules are completely nonsensical. For example, that MAX_LOCAL_APIC range test may not matter simply because it's testing a constant value, but it still smells entirely wrong to even check for that, when the system doesn't necessarily have one. So I think your patch may fix the immediate bug, but I think it's still just a band-aid. Either we should just make all machines look like they have the proper local apic mappings, or we shouldn't look at any local apic rules AT ALL. So I'd rather see those apic_maps[] just be properly filled in. > Sure you can argue that we could avoid it for SMP=n builds completely, > but I think the right thing to do is to aim for removing CONFIG_SMP and > make the UP build a subset of a generic SMP capable build which has > CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why? I wouldn't be entirely opposed to just doing that. UP has become fairly irrelevant. That said, UP is *not* entirely irrelevant on other architectures, and if we drop UP support on x86, we'll be effectively dropping a lot of coverage testing. The number of people who do cross-compilers is pretty small. End result: I'd *much* rather get rid of X86_UP_APIC and the "nolapic" kernel command line, and say "even UP has to have a local APIC". We already require a Pentium-class CPU, so in practice we already require that local APIC setup. And yes, machines existed where it could be turned off, but I don't think that is relevant any more. Put another way: I think "UP config for wider build testing" is a _lot_ more relevant than "no LAPIC support". Linus
On 3/15/24 15:55, Thomas Gleixner wrote: > On Fri, Mar 15 2024 at 18:40, Thomas Gleixner wrote: >> On Fri, Mar 15 2024 at 09:42, Linus Torvalds wrote: >>> On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote: >>> Thomas, over to you. I wonder if maybe all those topology macros >>> should just return 0 on an UP build, but that >>> topology_get_logical_id() thing looks a bit wrong regardless. >>> >>> It really shouldn't depend on local apic data for configs that may not >>> *have* a local apic. >> >> Right. Let me look. > > Not really. The problem is that a SMP build can run on a UP machine w/o > APIC or command line disables the APIC and will run into the exactly > same problem. The only case where we know that it is impossible is when > APIC support is disabled, which is silly but topic for a different > discussion. > > So the proper thing to do is to check for num_possible_cpus() == 1 in > that function. > > Sure you can argue that we could avoid it for SMP=n builds completely, > but I think the right thing to do is to aim for removing CONFIG_SMP and > make the UP build a subset of a generic SMP capable build which has > CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why? > > Because it consolidates the code and makes UP use exactly the same > mechanisms as SMP which pretty much avoids the problem we see today that > UP lacks test coverage and becomes more esoteric and untested over time. > > The downside is a slightly larger kernel image for such a build. > > Though if we pretend that we seriously care about that 10% larger memory > footprint or about the marginal performance benefit of SMP=n on dead > hardware, then we are just taking the wrong pills. > FWIW, I would very much prefer for SMP=n builds to go away for x86. I don't think anyone uses that in the real world nowadays, and I never know if I should report problems like this one or just stop testing it. > The point is that this very commit in question was heading deliberately > into the direction of removing the by now silly differences of UP/SMP > for correctness sake. It just happened to unearth the missing check in > topology_get_logical_id(), but that check is required independent of > SMP=y/n as I pointed out above. > > Thanks, > > tglx > > --- > diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c > index 3259b1d4fefe..118d9f7792ee 100644 > --- a/arch/x86/kernel/cpu/topology.c > +++ b/arch/x86/kernel/cpu/topology.c > @@ -279,6 +279,15 @@ int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level) > > if (lvlid >= MAX_LOCAL_APIC) > return -ERANGE; > + /* > + * Spare the exercise on UP as there is only one instance at any > + * level and the map check below might fail because the CPU does > + * not have a local APIC or local APIC has been disabled on the > + * kernel command line. > + */ > + if (num_possible_cpus() == 1) > + return 0; > + That works. Tested-by: Guenter Roeck <linux@roeck-us.net> > if (!test_bit(lvlid, apic_maps[at_level].map)) > return -ENODEV; > /* Get the number of set bits before @lvlid. */
On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote: > On Fri, 15 Mar 2024 at 15:55, Thomas Gleixner <tglx@linutronix.de> wrote: >> So the proper thing to do is to check for num_possible_cpus() == 1 in >> that function. > > I think that's _one_ proper thing. I still think that the deeper > problem is that it still looks at local apic rules even when those > rules are completely nonsensical. > > For example, that MAX_LOCAL_APIC range test may not matter simply > because it's testing a constant value, but it still smells entirely > wrong to even check for that, when the system doesn't necessarily have > one. cpu_info.apic_id defaults to 0, so unless the calling code is completely broken it will be correct. And I rather catch the case of calling code being broken in the !APIC case if we still want to support systems without a local APIC. > So I think your patch may fix the immediate bug, but I think it's > still just a band-aid. > > Either we should just make all machines look like they have the proper > local apic mappings, or we shouldn't look at any local apic rules AT > ALL. Sure. I can simply check if there was an APIC registered instead. >> Sure you can argue that we could avoid it for SMP=n builds completely, >> but I think the right thing to do is to aim for removing CONFIG_SMP and >> make the UP build a subset of a generic SMP capable build which has >> CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why? > > I wouldn't be entirely opposed to just doing that. UP has become > fairly irrelevant. > > That said, UP is *not* entirely irrelevant on other architectures, and > if we drop UP support on x86, we'll be effectively dropping a lot of > coverage testing. The number of people who do cross-compilers is > pretty small. > > End result: I'd *much* rather get rid of X86_UP_APIC and the "nolapic" > kernel command line, and say "even UP has to have a local APIC". > > We already require a Pentium-class CPU, so in practice we already > require that local APIC setup. And yes, machines existed where it > could be turned off, but I don't think that is relevant any more. You wish. We still support 486 and some of the still produced 486 clones do not have a local APIC. Not that I care and yes I'm all for getting rid of CONFIG_.*_APIC and of the related config/command line options. If we refuse to boot on hardware which does not enumerate an APIC then even better. But that is only a part of the overall problem. > Put another way: I think "UP config for wider build testing" is a > _lot_ more relevant than "no LAPIC support". I really have to disagree here. The concept of making UP a proper subset of SMP has absolutely nothing to do with x86 and UP test coverage. We want SMP as a general concept and overhaul the whole kernel to get rid of this ever increasing non-sensical UP burden. The real world UP small system use cases have moved over to other OSes like Zephyr & Co long ago. Just because some esoteric architectures (m68k comes to my mind) will have serious issues with that for the very wrong reasons does not mean that we should not go there. It's going to be quite some effort, but the overall benefit is worth it. OTOH, it's absolutely not rocket science to pretend to be SMP capable and if some architectures fail to accomodate on the way then we just should remove them as that's a clear sign of being unmaintained and irrelevant. The amount of untested SMP=n code in the kernel becomes frigthening and your argument that build coverage is making a difference is wishful thinking at best. Anything else than making the kernel SMP capable and making UP builds a well defined subset via CONFIG_NR_CPUS=1 is a complete waste of time and effort. If your intention is to indulge in the historical glory of Linux running on any (by now) irrelevant hardware on the planet, then I stop arguing right here. If not, can we please have a serious discussion about going SMP only and making UP the simple and obvious NR_CPUS=1 subset? The amount of subtle SMP=n fallout has been kinda exponentially increasing over the years and it's just putting burden on the wrong people. TBH, I'm tired of this nonsense. Thanks, tglx
On Fri, 15 Mar 2024 at 18:11, Thomas Gleixner <tglx@linutronix.de> wrote: > > You wish. We still support 486 and some of the still produced 486 clones > do not have a local APIC. Ouch. I was _sure_ we had dropped i486 support too due to cmpxchg8b. But apparently that was just a discussion, and my wishful thinking, and we never actually followed through. Linus
On Sat, Mar 16, 2024, at 02:23, Linus Torvalds wrote: > On Fri, 15 Mar 2024 at 18:11, Thomas Gleixner <tglx@linutronix.de> wrote: >> >> You wish. We still support 486 and some of the still produced 486 clones >> do not have a local APIC. > > Ouch. I was _sure_ we had dropped i486 support too due to cmpxchg8b. > > But apparently that was just a discussion, and my wishful thinking, > and we never actually followed through. Maciej Rozycki still cares about i486 type hardware, and he was asking for it to be kept around in the thread following [1] I think the best suggestion at the time was to make cmpxchg8b a compile-time feature and I had expected Maciej to follow up with a patch for that, but this never happend, and nobody sent a patch to remove support 486 and the early 586 clones either. I saw recently that there are still distros that advertise 486 support on modern kernels: Tiny Core Linux and Damn Small Linux. Both ship with a 486 SMP kernel but fail to boot on qemu unless an APIC is enabled (DSL also requires i686 or higher to run userspace). Arnd [1] https://lore.kernel.org/all/20220815071332.627393-9-yuzhao@google.com/
From: Thomas Gleixner > Sent: 16 March 2024 01:11 ... > We want SMP as a general concept and overhaul the whole kernel to get > rid of this ever increasing non-sensical UP burden. The real world UP > small system use cases have moved over to other OSes like Zephyr & Co > long ago. > > Just because some esoteric architectures (m68k comes to my mind) will > have serious issues with that for the very wrong reasons does not mean > that we should not go there. > > It's going to be quite some effort, but the overall benefit is worth it. > > OTOH, it's absolutely not rocket science to pretend to be SMP capable > and if some architectures fail to accomodate on the way then we just > should remove them as that's a clear sign of being unmaintained and > irrelevant. There are fpga soft-cpu (eg Nios & Risc-V) that can run linux. They are definitely memory constrained and really wouldn't want most of the SMP overhead. I'm not what it involves apart from simplified startup, compiling out IPI and spinlocks and optimising per-cpu data. But you wouldn't want to be running an SMP capable kernel on such systems. x86 is a different beast - except perhaps 486. It has to be said that I've never understood why anyone would run Linux on a Nios-II cpu. Far too slow for anything useful (you might get 100MHz if you are lucky), caches will be small and external memory accesses slow. I doubt soft RISC-V are any better (and I suspect they are worse). We do have 4 Nios-II in the fpga image for a PCIe card. They run very small programs (one has 2kB of code memory) to do things that would be impossible to write (sensibly) in VHDL. There are fpga with embedded ARM (and probably RISC-V) cores for running real OS. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sat, Mar 16 2024 at 02:11, Thomas Gleixner wrote: > On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote: >> Either we should just make all machines look like they have the proper >> local apic mappings, or we shouldn't look at any local apic rules AT >> ALL. > > Sure. I can simply check if there was an APIC registered instead. Like the below. I'm not entirely sure though whether the sanity checks should return an error code, which is what caused the crash Guenter observed, but I couldn't come up with something sensible either. Returning 0 might keep the machine alive, but does it make sense? Thanks, tglx --- arch/x86/kernel/cpu/topology.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -277,10 +277,21 @@ int topology_get_logical_id(u32 apicid, /* Remove the bits below @at_level to get the proper level ID of @apicid */ unsigned int lvlid = topo_apicid(apicid, at_level); - if (lvlid >= MAX_LOCAL_APIC) + if (WARN_ON_ONCE(lvlid >= MAX_LOCAL_APIC)) return -ERANGE; - if (!test_bit(lvlid, apic_maps[at_level].map)) + + /* + * If there was no APIC registered, then the map check below would + * fail. With no APIC this is guaranteed to be an UP system and + * therefore all topology levels have only one entry and their + * logical ID is obviously 0. + */ + if (topo_info.boot_cpu_apic_id == BAD_APICID) + return 0; + + if (WARN_ON_ONCE(!test_bit(lvlid, apic_maps[at_level].map))) return -ENODEV; + /* Get the number of set bits before @lvlid. */ return bitmap_weight(apic_maps[at_level].map, lvlid); }
On Sat, Mar 16 2024 at 02:11, Thomas Gleixner wrote: > On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote: > The amount of subtle SMP=n fallout has been kinda exponentially > increasing over the years and it's just putting burden on the wrong > people. TBH, I'm tired of this nonsense. And for the fun of it I hacked Kconfig to allow a SMP=y NR_CPUS=1 build and checked the size of vmlinux: 64-bit 32-bit SMP, NCPUS=1 38438400 22110177 UP 38393703 21682041 Delta 44697 428076 0.1% 2% The UP savings are not really impressive... Let me look what it actually takes to do that. Thanks, tglx
On Mon, Mar 18, 2024, at 18:27, Thomas Gleixner wrote: > On Sat, Mar 16 2024 at 02:11, Thomas Gleixner wrote: >> On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote: >> The amount of subtle SMP=n fallout has been kinda exponentially >> increasing over the years and it's just putting burden on the wrong >> people. TBH, I'm tired of this nonsense. > > And for the fun of it I hacked Kconfig to allow a SMP=y NR_CPUS=1 build > and checked the size of vmlinux: > > 64-bit 32-bit > SMP, NCPUS=1 38438400 22110177 > UP 38393703 21682041 > Delta 44697 428076 > 0.1% 2% > > The UP savings are not really impressive... > > Let me look what it actually takes to do that. FWIW, I did some experiments a few weeks ago on 32-bit ARM, using a fairly minimal kernel in a virtual machine, and checking the runtime memory consumption rather than compile-time. In a kvm guest with 32MiB RAM, I saw a difference of multiple megabytes in memory usage: Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #1 SMP PREEMPT armv7l root@testvm:~# free total used free shared buff/cache available Mem: 26932 14956 1732 52 12800 11976 Swap: 16360 3632 12728 Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #2 PREEMPT armv7l root@testvm:~# free total used free shared buff/cache available Mem: 26932 13744 5648 32 10092 13188 Swap: 16360 3880 12480 There is a little difference between runs, but this does seem significant enough to keep it. The SMP build was with CONFIG_NR_CPUS=2 (the smallest supported compile-time number), but running on a single-CPU qemu instance. Arnd
On Mon, Mar 18 2024 at 20:13, Arnd Bergmann wrote: > FWIW, I did some experiments a few weeks ago on 32-bit ARM, > using a fairly minimal kernel in a virtual machine, and > checking the runtime memory consumption rather than compile-time. > In a kvm guest with 32MiB RAM, I saw a difference of multiple > megabytes in memory usage: > > Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #1 SMP PREEMPT armv7l > root@testvm:~# free > total used free shared buff/cache available > Mem: 26932 14956 1732 52 12800 11976 > Swap: 16360 3632 12728 > > Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #2 PREEMPT armv7l > root@testvm:~# free > total used free shared buff/cache available > Mem: 26932 13744 5648 32 10092 13188 > Swap: 16360 3880 12480 > > There is a little difference between runs, but this does seem > significant enough to keep it. The SMP build was with > CONFIG_NR_CPUS=2 (the smallest supported compile-time number), > but running on a single-CPU qemu instance. With a SMP=y, NR_CPUS=1 build on x86 64bit I get: total used free shared buff/cache available Mem: 32882056 498068 32590580 4884 128884 32383988 Swap: 998396 0 998396 Same config just SMP=n: total used free shared buff/cache available Mem: 32885804 461704 32635284 4876 119480 32424100 Swap: 998396 0 998396 So the delta for available is ~40 MiB. But if I look at it with init=/bin/sh on the command line then the delta is significantly different: With a SMP=y, NR_CPUS=1 build on x86 64bit I get: total used free shared buff/cache available Mem: 32883680 324120 32822728 216 10864 32559560 Swap: 0 0 0 Same config just SMP=n: total used free shared buff/cache available Mem: 32885804 326876 32821972 216 11100 32558928 Swap: 0 0 0 Delta available = 632 KiB I haven't had the time to stare at that in detail, but comparing /proc/meminfo for the full boot case above does not immediately give me a hint. It's confusing at best... Thanks, tglx
On 3/19/24 09:21, Thomas Gleixner wrote: > On Mon, Mar 18 2024 at 20:13, Arnd Bergmann wrote: >> FWIW, I did some experiments a few weeks ago on 32-bit ARM, >> using a fairly minimal kernel in a virtual machine, and >> checking the runtime memory consumption rather than compile-time. >> In a kvm guest with 32MiB RAM, I saw a difference of multiple >> megabytes in memory usage: >> >> Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #1 SMP PREEMPT armv7l >> root@testvm:~# free >> total used free shared buff/cache available >> Mem: 26932 14956 1732 52 12800 11976 >> Swap: 16360 3632 12728 >> >> Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #2 PREEMPT armv7l >> root@testvm:~# free >> total used free shared buff/cache available >> Mem: 26932 13744 5648 32 10092 13188 >> Swap: 16360 3880 12480 >> >> There is a little difference between runs, but this does seem >> significant enough to keep it. The SMP build was with >> CONFIG_NR_CPUS=2 (the smallest supported compile-time number), >> but running on a single-CPU qemu instance. > > With a SMP=y, NR_CPUS=1 build on x86 64bit I get: > > total used free shared buff/cache available > Mem: 32882056 498068 32590580 4884 128884 32383988 > Swap: 998396 0 998396 > > Same config just SMP=n: > > total used free shared buff/cache available > Mem: 32885804 461704 32635284 4876 119480 32424100 > Swap: 998396 0 998396 > > So the delta for available is ~40 MiB. > > But if I look at it with init=/bin/sh on the command line then the delta > is significantly different: > > With a SMP=y, NR_CPUS=1 build on x86 64bit I get: > > total used free shared buff/cache available > Mem: 32883680 324120 32822728 216 10864 32559560 > Swap: 0 0 0 > > Same config just SMP=n: > > total used free shared buff/cache available > Mem: 32885804 326876 32821972 216 11100 32558928 > Swap: 0 0 0 > > Delta available = 632 KiB > > I haven't had the time to stare at that in detail, but comparing > /proc/meminfo for the full boot case above does not immediately give me > a hint. It's confusing at best... > That makes me wonder if the number is affected by the total memory size. How about a system with 1GB of memory or less ? Guenter
On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote: > I don't know the code well enough to determine what is wrong. > Please let me know what I can do to help debugging the problem. Could you provide me the config and the qemu command line? Thanks, tglx
On 3/20/24 01:58, Thomas Gleixner wrote: > On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote: >> I don't know the code well enough to determine what is wrong. >> Please let me know what I can do to help debugging the problem. > > Could you provide me the config and the qemu command line? > defconfig-CONFIG_SMP and qemu-system-x86_64 -kernel arch/x86/boot/bzImage -cpu Haswell \ --append "console=ttyS0" -nographic -monitor none The cpu doesn't really matter as long as it is an Intel CPU. A root file system isn't needed since the boot doesn't get that far. Guenter
On Wed, Mar 20 2024 at 08:46, Guenter Roeck wrote: > On 3/20/24 01:58, Thomas Gleixner wrote: >> On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote: >>> I don't know the code well enough to determine what is wrong. >>> Please let me know what I can do to help debugging the problem. >> >> Could you provide me the config and the qemu command line? >> > > defconfig-CONFIG_SMP and > > qemu-system-x86_64 -kernel arch/x86/boot/bzImage -cpu Haswell \ > --append "console=ttyS0" -nographic -monitor none > > The cpu doesn't really matter as long as it is an Intel CPU. > A root file system isn't needed since the boot doesn't get that far. Now it get's interesting because I can't reproduce it with that setup at all. What's weird is that I saw it exactly once on 64-bit in a VM with a UP config two days ago, but when I started to add instrumentation it never came back even after backing the instrumentation changes out. I have seriously no idea what's going on there. Is it fully reproducible on your side? If so can you please provide a full dmesg and then apply the patch below and provide the resulting full dmesg too? I found two other issues while trying to find a way to reproduce, but those are completely unrelated to the problem you are observing. Thanks, tglx --- arch/x86/kernel/cpu/topology.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -176,6 +176,8 @@ static __init void topo_register_apic(u3 { int cpu, dom; + pr_info("APIC: %x %d\n", apic_id, present); + if (present) { set_bit(apic_id, phys_cpu_present_map); @@ -277,10 +279,23 @@ int topology_get_logical_id(u32 apicid, /* Remove the bits below @at_level to get the proper level ID of @apicid */ unsigned int lvlid = topo_apicid(apicid, at_level); - if (lvlid >= MAX_LOCAL_APIC) + pr_info("APIC logical ID: %x %x %d\n", apicid, lvlid, at_level); + + if (WARN_ON_ONCE(lvlid >= MAX_LOCAL_APIC)) return -ERANGE; - if (!test_bit(lvlid, apic_maps[at_level].map)) + + /* + * If there was no APIC registered, then the map check below would + * fail. With no APIC this is guaranteed to be an UP system and + * therefore all topology levels have only one entry and their + * logical ID is obviously 0. + */ + if (topo_info.boot_cpu_apic_id == BAD_APICID) + return 0; + + if (WARN_ON_ONCE(!test_bit(lvlid, apic_maps[at_level].map))) return -ENODEV; + /* Get the number of set bits before @lvlid. */ return bitmap_weight(apic_maps[at_level].map, lvlid); }
On 3/21/24 04:14, Thomas Gleixner wrote: > On Wed, Mar 20 2024 at 08:46, Guenter Roeck wrote: >> On 3/20/24 01:58, Thomas Gleixner wrote: >>> On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote: >>>> I don't know the code well enough to determine what is wrong. >>>> Please let me know what I can do to help debugging the problem. >>> >>> Could you provide me the config and the qemu command line? >>> >> >> defconfig-CONFIG_SMP and >> >> qemu-system-x86_64 -kernel arch/x86/boot/bzImage -cpu Haswell \ >> --append "console=ttyS0" -nographic -monitor none >> >> The cpu doesn't really matter as long as it is an Intel CPU. >> A root file system isn't needed since the boot doesn't get that far. > > Now it get's interesting because I can't reproduce it with that setup at > all. > > What's weird is that I saw it exactly once on 64-bit in a VM with a UP > config two days ago, but when I started to add instrumentation it never > came back even after backing the instrumentation changes out. I have > seriously no idea what's going on there. > > Is it fully reproducible on your side? > Yes, always. > If so can you please provide a full dmesg and then apply the patch below > and provide the resulting full dmesg too? > You'll find everything at http://server.roeck-us.net/qemu/x86-nosmp/ The crash is gone after applying your patch. The difference is: + /* + * If there was no APIC registered, then the map check below would + * fail. With no APIC this is guaranteed to be an UP system and + * therefore all topology levels have only one entry and their + * logical ID is obviously 0. + */ + if (topo_info.boot_cpu_apic_id == BAD_APICID) { + pr_info("#### topo_info.boot_cpu_apic_id == BAD_APICID\n"); ^^^^ I added this + return 0; + } + I see the "#### topo_info.boot_cpu_apic_id == BAD_APICID" message twice in the log. See patched.log at the page pointed to above. Hope the helps, Guenter
On Thu, Mar 21 2024 at 07:06, Guenter Roeck wrote: > On 3/21/24 04:14, Thomas Gleixner wrote: >> If so can you please provide a full dmesg and then apply the patch below >> and provide the resulting full dmesg too? > > You'll find everything at http://server.roeck-us.net/qemu/x86-nosmp/ Thanks for providing this. > The crash is gone after applying your patch. The difference is: > > + /* > + * If there was no APIC registered, then the map check below would > + * fail. With no APIC this is guaranteed to be an UP system and > + * therefore all topology levels have only one entry and their > + * logical ID is obviously 0. > + */ > + if (topo_info.boot_cpu_apic_id == BAD_APICID) { > + pr_info("#### topo_info.boot_cpu_apic_id == BAD_APICID\n"); > ^^^^ I added this > + return 0; > + } > + > > I see the "#### topo_info.boot_cpu_apic_id == BAD_APICID" message > twice in the log. See patched.log at the page pointed to above. I can see why this is emitted. That happens on the initial CPUID evaluation of the boot CPU very early during boot. [ 0.000000] Command line: console=ttyS0 [ 0.000000] CPU topo: APIC logical ID: 0 0 6 [ 0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID [ 0.000000] CPU topo: APIC logical ID: 0 0 4 [ 0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID The later full CPUID evaluation happens after the ACPI enumeration and way before the affected RAPL driver is initialized: [ 0.088029] CPU topo: APIC logical ID: 0 0 6 [ 0.088084] CPU topo: APIC logical ID: 0 0 4 This invocation has the boot APIC registered as your extra print does not show up. ... [ 0.585850] RAPL PMU: API unit is 2^-32 Joules, 0 fixed counters, 10737418240 ms ovfl timer So even without that guard (which we need anyway for the non APIC case) topology_logical_die_id() == cpu_data(cpu).topo.logical_die_id must have the correct value in that RAPL initialization and CPU hotplug callback code. But our absolutely convoluted startup logic prevents that because: identify_cpu_early() operates on boot_cpu_data smp_prepare_boot_cpu() copies boot_cpu_data to per CPU cpu data identify_boot_cpu() operates on boot_cpu_data identify_boot_cpu() is the one which gets the correct logical die info, but that never gets copied over to the per CPU data instance on which the RAPL code and everything else works on. I'll cook up a patch later. Thanks for providing the info! tglx
--- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -185,13 +185,8 @@ extern struct cpuinfo_x86 new_cpu_data; extern __u32 cpu_caps_cleared[NCAPINTS + NBUGINTS]; extern __u32 cpu_caps_set[NCAPINTS + NBUGINTS]; -#ifdef CONFIG_SMP DECLARE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info); #define cpu_data(cpu) per_cpu(cpu_info, cpu) -#else -#define cpu_info boot_cpu_data -#define cpu_data(cpu) boot_cpu_data -#endif extern const struct seq_operations cpuinfo_op; --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -70,6 +70,9 @@ #include "cpu.h" +DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info); +EXPORT_PER_CPU_SYMBOL(cpu_info); + u32 elf_hwcap2 __read_mostly; /* Number of siblings per CPU package */ --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1211,6 +1211,16 @@ void __init i386_reserve_resources(void) #endif /* CONFIG_X86_32 */ +#ifndef CONFIG_SMP +void __init smp_prepare_boot_cpu(void) +{ + struct cpuinfo_x86 *c = &cpu_data(0); + + *c = boot_cpu_data; + c->initialized = true; +} +#endif + static struct notifier_block kernel_offset_notifier = { .notifier_call = dump_kernel_offset }; --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -101,10 +101,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_core_map); DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map); EXPORT_PER_CPU_SYMBOL(cpu_die_map); -/* Per CPU bogomips and other parameters */ -DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info); -EXPORT_PER_CPU_SYMBOL(cpu_info); - /* CPUs which are the primary SMT threads */ struct cpumask __cpu_primary_thread_mask __read_mostly;
On UP builds sparse complains rightfully about accesses to cpu_info with per CPU accessors: cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces) cacheinfo.c:282:30: sparse: expected void const [noderef] __percpu *__vpp_verify cacheinfo.c:282:30: sparse: got unsigned int * The reason is that on UP builds cpu_info which is a per CPU variable on SMP is mapped to boot_cpu_info which is a regular variable. There is a hideous accessor cpu_data() which tries to hide this, but it's not sufficient as some places require raw accessors and generates worse code than the regular per CPU accessors. Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU cpu_info unconditionally. This requires to update the CPU info on the boot CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function and implement exactly that. This allows to use regular per CPU accessors uncoditionally and paves the way to remove the cpu_data() hackery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/include/asm/processor.h | 5 ----- arch/x86/kernel/cpu/common.c | 3 +++ arch/x86/kernel/setup.c | 10 ++++++++++ arch/x86/kernel/smpboot.c | 4 ---- 4 files changed, 13 insertions(+), 9 deletions(-)