[5/9] x86: Cure per CPU madness on UP

Message ID	20240304005104.622511517@linutronix.de (mailing list archive)
State	New, archived
Headers	show Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BA8053A1B5; Mon, 4 Mar 2024 10:12:25 +0000 (UTC) Message-ID: <20240304005104.622511517@linutronix.de> From: Thomas Gleixner <tglx@linutronix.de> To: LKML <linux-kernel@vger.kernel.org> Cc: x86@kernel.org, Linus Torvalds <torvalds@linuxfoundation.org>, Uros Bizjak <ubizjak@gmail.com>, linux-sparse@vger.kernel.org, lkp@intel.com, oe-kbuild-all@lists.linux.dev Subject: [patch 5/9] x86: Cure per CPU madness on UP References: <20240303235029.555787150@linutronix.de> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Date: Mon, 4 Mar 2024 11:12:23 +0100 (CET)
Series	x86: Cure tons of sparse warnings (mostly __percpu) \| expand [0/9] x86: Cure tons of sparse warnings (mostly __percpu) [1/9] perf/x86/amd/uncore: Fix __percpu annotation [2/9] x86/msr: Prepare for including percpu.h [3/9] x86/msr: Add missing __percpu annotations [4/9] smp: Consolidate smp_prepare_boot_cpu() [5/9] x86: Cure per CPU madness on UP [6/9] x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() [7/9] x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current [8/9] x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation [9/9] x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables

Thomas Gleixner March 4, 2024, 10:12 a.m. UTC

On UP builds sparse complains rightfully about accesses to cpu_info with
per CPU accessors:

cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces)
cacheinfo.c:282:30: sparse:    expected void const [noderef] __percpu *__vpp_verify
cacheinfo.c:282:30: sparse:    got unsigned int *

The reason is that on UP builds cpu_info which is a per CPU variable on SMP
is mapped to boot_cpu_info which is a regular variable. There is a hideous
accessor cpu_data() which tries to hide this, but it's not sufficient as
some places require raw accessors and generates worse code than the regular
per CPU accessors.

Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU
cpu_info unconditionally. This requires to update the CPU info on the boot
CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function
and implement exactly that.

This allows to use regular per CPU accessors uncoditionally and paves the
way to remove the cpu_data() hackery.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/processor.h |    5 -----
 arch/x86/kernel/cpu/common.c     |    3 +++
 arch/x86/kernel/setup.c          |   10 ++++++++++
 arch/x86/kernel/smpboot.c        |    4 ----
 4 files changed, 13 insertions(+), 9 deletions(-)

Guenter Roeck March 15, 2024, 4:17 p.m. UTC | #1

Hi,

On Mon, Mar 04, 2024 at 11:12:23AM +0100, Thomas Gleixner wrote:
> On UP builds sparse complains rightfully about accesses to cpu_info with
> per CPU accessors:
> 
> cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces)
> cacheinfo.c:282:30: sparse:    expected void const [noderef] __percpu *__vpp_verify
> cacheinfo.c:282:30: sparse:    got unsigned int *
> 
> The reason is that on UP builds cpu_info which is a per CPU variable on SMP
> is mapped to boot_cpu_info which is a regular variable. There is a hideous
> accessor cpu_data() which tries to hide this, but it's not sufficient as
> some places require raw accessors and generates worse code than the regular
> per CPU accessors.
> 
> Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU
> cpu_info unconditionally. This requires to update the CPU info on the boot
> CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function
> and implement exactly that.
> 
> This allows to use regular per CPU accessors uncoditionally and paves the
> way to remove the cpu_data() hackery.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

This patch results in crashes when running the mainline kernel in qemu
with nosmp builds and Intel CPUs. The problem is _not_ seen on tag
x86-cleanups-2024-03-11; it is only seen in the mainline kernel. I didn't
check all of them, but it looks like AMD CPUs are not affected. The
initial bisect points to the merge of x86-cleanups-2024-03-11 into the
mainline kernel. I rebased x86-cleanups-2024-03-11 on top of the mainline
kernel; the second bisect uses that rebase as base. Reverting this patch
from the mainline kernel fixes the problem.

I don't know the code well enough to determine what is wrong.
Please let me know what I can do to help debugging the problem.

Thanks,
Guenter

----
crash log:

[    3.291087] BUG: unable to handle page fault for address: ffff9cd801f3f2a0
[    3.291087] #PF: supervisor write access in kernel mode
[    3.291087] #PF: error_code(0x0002) - not-present page
[    3.291087] PGD 60201067 P4D 60201067 PUD 0
[    3.291087] Oops: 0002 [#1] PREEMPT PTI
[    3.291087] CPU: 0 PID: 1 Comm: swapper Not tainted 6.8.0-06619-ge5e038b7ae9d #1
[    3.291087] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110
[    3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff
[    3.291087] RSP: 0018:ffffa3d54001fdd0 EFLAGS: 00000246
[    3.291087] RAX: ffff9cd001f3f200 RBX: ffff9cd001fb34a8 RCX: 0000000000000000
[    3.291087] RDX: 00000000ffffffed RSI: 0000000000000001 RDI: ffff9cd001fb3550
[    3.291087] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[    3.291087] R10: 0000000000000001 R11: 0000000000018001 R12: 0000000000000000
[    3.291087] R13: 000000000000009e R14: ffffffffb6808180 R15: ffffffffb86710e5
[    3.291087] FS:  0000000000000000(0000) GS:ffffffffb8ab0000(0000) knlGS:0000000000000000
[    3.291087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.291087] CR2: ffff9cd801f3f2a0 CR3: 000000005e6a2000 CR4: 00000000001506f0
[    3.291087] Call Trace:
[    3.291087]  <TASK>
[    3.291087]  ? __die+0x1f/0x60
[    3.291087]  ? page_fault_oops+0x148/0x460
[    3.291087]  ? search_exception_tables+0x37/0x50
[    3.291087]  ? fixup_exception+0x21/0x320
[    3.291087]  ? exc_page_fault+0xca/0x150
[    3.291087]  ? asm_exc_page_fault+0x26/0x30
[    3.291087]  ? __pfx_rapl_cpu_online+0x10/0x10
[    3.291087]  ? rapl_cpu_online+0xf2/0x110
[    3.291087]  cpuhp_invoke_callback.constprop.0+0x117/0x3e0
[    3.291087]  __cpuhp_setup_state_cpuslocked+0x1b7/0x280
[    3.291087]  ? __pfx_rapl_cpu_online+0x10/0x10
[    3.291087]  rapl_pmu_init+0x189/0x2e0
[    3.291087]  ? __pfx_rapl_pmu_init+0x10/0x10
[    3.291087]  do_one_initcall+0x4f/0x210
[    3.291087]  kernel_init_freeable+0x166/0x290
[    3.291087]  ? __pfx_kernel_init+0x10/0x10
[    3.291087]  kernel_init+0x15/0x1b0
[    3.291087]  ret_from_fork+0x2f/0x50
[    3.291087]  ? __pfx_kernel_init+0x10/0x10
[    3.291087]  ret_from_fork_asm+0x19/0x30
[    3.291087]  </TASK>
[    3.291087] Modules linked in:
[    3.291087] CR2: ffff9cd801f3f2a0
[    3.291087] ---[ end trace 0000000000000000 ]---
[    3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110
[    3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff
[    3.291087] RSP: 0018:ffffa3d54001fdd0 EFLAGS: 00000246
[    3.291087] RAX: ffff9cd001f3f200 RBX: ffff9cd001fb34a8 RCX: 0000000000000000
[    3.291087] RDX: 00000000ffffffed RSI: 0000000000000001 RDI: ffff9cd001fb3550
[    3.291087] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[    3.291087] R10: 0000000000000001 R11: 0000000000018001 R12: 0000000000000000
[    3.291087] R13: 000000000000009e R14: ffffffffb6808180 R15: ffffffffb86710e5
[    3.291087] FS:  0000000000000000(0000) GS:ffffffffb8ab0000(0000) knlGS:0000000000000000
[    3.291087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.291087] CR2: ffff9cd801f3f2a0 CR3: 000000005e6a2000 CR4: 00000000001506f0
[    3.291087] note: swapper[1] exited with irqs disabled
[    3.306047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

---
1st bisect:

# bad: [1bbeaf83dd7b5e3628b98bec66ff8fe2646e14aa] Merge tag 'perf-tools-for-v6.9-2024-03-13' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
# good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
git bisect start 'HEAD' 'v6.8'
# bad: [9187210eee7d87eea37b45ea93454a88681894a4] Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad 9187210eee7d87eea37b45ea93454a88681894a4
# bad: [a01c9fe32378636ae65bec8047b5de3fdb2ba5c8] Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect bad a01c9fe32378636ae65bec8047b5de3fdb2ba5c8
# bad: [691632f0e86973604678d193ccfa47b9614581aa] Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad 691632f0e86973604678d193ccfa47b9614581aa
# good: [8ede842f669b6f78812349bbef4d1efd0fbdafce] Merge tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux
git bisect good 8ede842f669b6f78812349bbef4d1efd0fbdafce
# good: [bfdb395a7cde12d83a623949ed029b0ab38d765b] Merge tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good bfdb395a7cde12d83a623949ed029b0ab38d765b
# bad: [685d98211273f60e38a6d361b62d7016c545297e] Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 685d98211273f60e38a6d361b62d7016c545297e
# good: [b0402403e54ae9eb94ce1cbb53c7def776e97426] Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
git bisect good b0402403e54ae9eb94ce1cbb53c7def776e97426
# good: [cb81deefb59de01325ab822f900c13941bfaf67f] x86/idle: Sanitize X86_BUG_AMD_E400 handling
git bisect good cb81deefb59de01325ab822f900c13941bfaf67f
# good: [73f0d1d7b4abb4a46bae1a0d8caf66e23d1138d0] Merge tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 73f0d1d7b4abb4a46bae1a0d8caf66e23d1138d0
# good: [65efc4dc12c5cc296374278673b89390eba79fe6] x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
git bisect good 65efc4dc12c5cc296374278673b89390eba79fe6
# good: [d69ad12c786f0a4593c48c0658043aa4a5116b09] Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good d69ad12c786f0a4593c48c0658043aa4a5116b09
# good: [35ce64922c8263448e58a2b9e8d15a64e11e9b2d] x86/idle: Select idle routine only once
git bisect good 35ce64922c8263448e58a2b9e8d15a64e11e9b2d
# good: [774a86f1c885460ade4334b901919fa1d8ae6ec6] x86/nmi: Drop unused declaration of proc_nmi_enabled()
git bisect good 774a86f1c885460ade4334b901919fa1d8ae6ec6
# bad: [fcc196579aa1fc167d6778948bff69fae6116737] Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad fcc196579aa1fc167d6778948bff69fae6116737
# first bad commit: [fcc196579aa1fc167d6778948bff69fae6116737] Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

---
2nd bisect:

# bad: [6d70929c7019e265425f7a89cf72163a639d462e] x86/nmi: Drop unused declaration of proc_nmi_enabled()
# good: [d69ad12c786f0a4593c48c0658043aa4a5116b09] Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect start 'HEAD' 'fcc196579aa1fc167d6778948bff69fae6116737~1'
# good: [5c157d25918ef15454c2a9a9262f7b763d9c9add] x86/msr: Add missing __percpu annotations
git bisect good 5c157d25918ef15454c2a9a9262f7b763d9c9add
# bad: [f5a6b1e2d92d825a0f73ca11e960795da336d99e] x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
git bisect bad f5a6b1e2d92d825a0f73ca11e960795da336d99e
# bad: [68907233f6d26a214bb79f892db8d999b15dda97] x86/percpu: Cure per CPU madness on UP
git bisect bad 68907233f6d26a214bb79f892db8d999b15dda97
# good: [053df18ceb928c8631042317a884b2842a457f1b] smp: Consolidate smp_prepare_boot_cpu()
git bisect good 053df18ceb928c8631042317a884b2842a457f1b
# first bad commit: [68907233f6d26a214bb79f892db8d999b15dda97] x86/percpu: Cure per CPU madness on UP

Linus Torvalds March 15, 2024, 4:42 p.m. UTC | #2

On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote:
>
> [    3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110
> [    3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff

The code is

  mov    %rax,0x60(%rbx)
  call   0x125f5f
  mov    0x26184b4(%rip),%edx
  mov    0x3078f01(%rip),%rax
  movq   $0xffffffffb68084e0,0x90(%rbx)
  mov    %rbx,0x138(%rax,%rdx,8)                <-- trapping instruction
  jmp    <backwards>

with %rdx being some index having the value 0xffffffed (-19).

That's ENODEV.

Without line numbers (if you have debug info for that kernel, it's
good to run "scripts/decode_stacktrace.sh" on stack traces) it's hard
to really know what's up, but I strongly suspect that it's this:

        rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu;

because we have

   topology_logical_die_id(cpu) ->
       (cpu_data(cpu).topo.logical_die_id)

and we have

    c->topo.logical_die_id = topology_get_logical_id(apicid, TOPO_DIE_DOMAIN);

and topology_get_logical_id() does this:

        if (lvlid >= MAX_LOCAL_APIC)
                return -ERANGE;
        if (!test_bit(lvlid, apic_maps[at_level].map))
                return -ENODEV;

so that -ENODEV is not entirely unlikely for a UP run.

This also explains why it *used* to work - that whole thing is new to
the current merge window and came in through commit ca7e91776912
("Merge tag 'x86-apic-2024-03-10' of ...").

Thomas, over to you. I wonder if maybe all those topology macros
should just return 0 on an UP build, but that
topology_get_logical_id() thing looks a bit wrong regardless.

It really shouldn't depend on local apic data for configs that may not
*have* a local apic.

                 Linus

Guenter Roeck March 15, 2024, 5:02 p.m. UTC | #3

On 3/15/24 09:42, Linus Torvalds wrote:
> On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> [    3.291087] RIP: 0010:rapl_cpu_online+0xf2/0x110
>> [    3.291087] Code: 05 ff 8e 07 03 40 42 0f 00 48 89 43 60 e8 56 5f 12 00 8b 15 b4 84 61 02 48 8b 05 01 8f 07 03 48 c7 83 90 00 00 00 e0 84 80 b6 <48> 89 9c d0 38 01 00 00 e9 2b ff ff ff b8 f4 ff ff ff e9 47 ff ff
> 
> The code is
> 
>    mov    %rax,0x60(%rbx)
>    call   0x125f5f
>    mov    0x26184b4(%rip),%edx
>    mov    0x3078f01(%rip),%rax
>    movq   $0xffffffffb68084e0,0x90(%rbx)
>    mov    %rbx,0x138(%rax,%rdx,8)                <-- trapping instruction
>    jmp    <backwards>
> 
> with %rdx being some index having the value 0xffffffed (-19).
> 
> That's ENODEV.
> 
> Without line numbers (if you have debug info for that kernel, it's
> good to run "scripts/decode_stacktrace.sh" on stack traces) it's hard

Sorry, I should have done that.

> to really know what's up, but I strongly suspect that it's this:
> 
>          rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu;

Correct:

[    2.632164] RIP: 0010:rapl_cpu_online (arch/x86/events/rapl.c:581)

which does point to that line. Here is a complete decoded backtrace:

[    2.632164] Call Trace:
[    2.632164]  <TASK>
[    2.632164] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[    2.632164] ? page_fault_oops (arch/x86/mm/fault.c:713)
[    2.632164] ? search_exception_tables (kernel/extable.c:59)
[    2.632164] ? fixup_exception (arch/x86/mm/extable.c:328)
[    2.632164] ? exc_page_fault (arch/x86/mm/fault.c:1503 arch/x86/mm/fault.c:1563)
[    2.632164] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
[    2.632164] ? __pfx_rapl_cpu_online (arch/x86/events/rapl.c:566)
[    2.632164] ? rapl_cpu_online (arch/x86/events/rapl.c:581)
[    2.632164] cpuhp_invoke_callback.constprop.0 (kernel/cpu.c:195)
[    2.632164] __cpuhp_setup_state_cpuslocked (kernel/cpu.c:2541)
[    2.632164] ? __pfx_rapl_cpu_online (arch/x86/events/rapl.c:566)
[    2.632164] rapl_pmu_init (./include/linux/cpuhotplug.h:274 arch/x86/events/rapl.c:843)
[    2.632164] ? __pfx_rapl_pmu_init (arch/x86/events/rapl.c:816)
[    2.632164] do_one_initcall (init/main.c:1241)
[    2.632164] kernel_init_freeable (init/main.c:1302 init/main.c:1319 init/main.c:1338 init/main.c:1551)
[    2.632164] ? __pfx_kernel_init (init/main.c:1432)
[    2.632164] kernel_init (init/main.c:1442)
[    2.632164] ret_from_fork (arch/x86/kernel/process.c:153)
[    2.632164] ? __pfx_kernel_init (init/main.c:1432)
[    2.632164] ret_from_fork_asm (arch/x86/entry/entry_64.S:256)

Guenter

Thomas Gleixner March 15, 2024, 5:40 p.m. UTC | #4

On Fri, Mar 15 2024 at 09:42, Linus Torvalds wrote:
> On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote:
> Without line numbers (if you have debug info for that kernel, it's
> good to run "scripts/decode_stacktrace.sh" on stack traces) it's hard
> to really know what's up, but I strongly suspect that it's this:
>
>         rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu;
>
> because we have
>
>    topology_logical_die_id(cpu) ->
>        (cpu_data(cpu).topo.logical_die_id)
>
> and we have
>
>     c->topo.logical_die_id = topology_get_logical_id(apicid, TOPO_DIE_DOMAIN);
>
> and topology_get_logical_id() does this:
>
>         if (lvlid >= MAX_LOCAL_APIC)
>                 return -ERANGE;
>         if (!test_bit(lvlid, apic_maps[at_level].map))
>                 return -ENODEV;
>
> so that -ENODEV is not entirely unlikely for a UP run.
>
> This also explains why it *used* to work - that whole thing is new to
> the current merge window and came in through commit ca7e91776912
> ("Merge tag 'x86-apic-2024-03-10' of ...").
>
> Thomas, over to you. I wonder if maybe all those topology macros
> should just return 0 on an UP build, but that
> topology_get_logical_id() thing looks a bit wrong regardless.
>
> It really shouldn't depend on local apic data for configs that may not
> *have* a local apic.

Right. Let me look.

Thomas Gleixner March 15, 2024, 10:55 p.m. UTC | #5

On Fri, Mar 15 2024 at 18:40, Thomas Gleixner wrote:
> On Fri, Mar 15 2024 at 09:42, Linus Torvalds wrote:
>> On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote:
>> Thomas, over to you. I wonder if maybe all those topology macros
>> should just return 0 on an UP build, but that
>> topology_get_logical_id() thing looks a bit wrong regardless.
>>
>> It really shouldn't depend on local apic data for configs that may not
>> *have* a local apic.
>
> Right. Let me look.

Not really. The problem is that a SMP build can run on a UP machine w/o
APIC or command line disables the APIC and will run into the exactly
same problem. The only case where we know that it is impossible is when
APIC support is disabled, which is silly but topic for a different
discussion.

So the proper thing to do is to check for num_possible_cpus() == 1 in
that function.

Sure you can argue that we could avoid it for SMP=n builds completely,
but I think the right thing to do is to aim for removing CONFIG_SMP and
make the UP build a subset of a generic SMP capable build which has
CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why?

Because it consolidates the code and makes UP use exactly the same
mechanisms as SMP which pretty much avoids the problem we see today that
UP lacks test coverage and becomes more esoteric and untested over time.

The downside is a slightly larger kernel image for such a build.

Though if we pretend that we seriously care about that 10% larger memory
footprint or about the marginal performance benefit of SMP=n on dead
hardware, then we are just taking the wrong pills.

The point is that this very commit in question was heading deliberately
into the direction of removing the by now silly differences of UP/SMP
for correctness sake. It just happened to unearth the missing check in
topology_get_logical_id(), but that check is required independent of
SMP=y/n as I pointed out above.

Thanks,

        tglx

---
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 3259b1d4fefe..118d9f7792ee 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -279,6 +279,15 @@ int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level)

 	if (lvlid >= MAX_LOCAL_APIC)
 		return -ERANGE;
+	/*
+	 * Spare the exercise on UP as there is only one instance at any
+	 * level and the map check below might fail because the CPU does
+	 * not have a local APIC or local APIC has been disabled on the
+	 * kernel command line.
+	 */
+	if (num_possible_cpus() == 1)
+		return 0;
+
 	if (!test_bit(lvlid, apic_maps[at_level].map))
 		return -ENODEV;
 	/* Get the number of set bits before @lvlid. */

Linus Torvalds March 15, 2024, 11:23 p.m. UTC | #6

On Fri, 15 Mar 2024 at 15:55, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Not really. The problem is that a SMP build can run on a UP machine w/o
> APIC or command line disables the APIC and will run into the exactly
> same problem. The only case where we know that it is impossible is when
> APIC support is disabled, which is silly but topic for a different
> discussion.

Oh, I agree - that was why I said that it shouldn't depend on a local
APIC on machines that may not even have one.

That "may not even have one" can still be a static option - we
technically allow 32-bit UP kernel to not enable X86_UP_APIC, although
it might be time to drop that option.

> So the proper thing to do is to check for num_possible_cpus() == 1 in
> that function.

I think that's _one_ proper thing. I still think that the deeper
problem is that it still looks at local apic rules even when those
rules are completely nonsensical.

For example, that MAX_LOCAL_APIC range test may not matter simply
because it's testing a constant value, but it still smells entirely
wrong to even check for that, when the system doesn't necessarily have
one.

So I think your patch may fix the immediate bug, but I think it's
still just a band-aid.

Either we should just make all machines look like they have the proper
local apic mappings, or we shouldn't look at any local apic rules AT
ALL.

So I'd rather see those apic_maps[] just be properly filled in.

> Sure you can argue that we could avoid it for SMP=n builds completely,
> but I think the right thing to do is to aim for removing CONFIG_SMP and
> make the UP build a subset of a generic SMP capable build which has
> CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why?

I wouldn't be entirely opposed to just doing that. UP has become
fairly irrelevant.

That said, UP is *not* entirely irrelevant on other architectures, and
if we drop UP support on x86, we'll be effectively dropping a lot of
coverage testing. The number of people who do cross-compilers is
pretty small.

End result: I'd *much* rather get rid of X86_UP_APIC and the "nolapic"
kernel command line, and say "even UP has to have a local APIC".

We already require a Pentium-class CPU, so in practice we already
require that local APIC setup. And yes, machines existed where it
could be turned off, but I don't think that is relevant any more.

Put another way: I think "UP config for wider build testing" is a
_lot_ more relevant than "no LAPIC support".

             Linus

Guenter Roeck March 16, 2024, 12:56 a.m. UTC | #7

On 3/15/24 15:55, Thomas Gleixner wrote:
> On Fri, Mar 15 2024 at 18:40, Thomas Gleixner wrote:
>> On Fri, Mar 15 2024 at 09:42, Linus Torvalds wrote:
>>> On Fri, 15 Mar 2024 at 09:17, Guenter Roeck <linux@roeck-us.net> wrote:
>>> Thomas, over to you. I wonder if maybe all those topology macros
>>> should just return 0 on an UP build, but that
>>> topology_get_logical_id() thing looks a bit wrong regardless.
>>>
>>> It really shouldn't depend on local apic data for configs that may not
>>> *have* a local apic.
>>
>> Right. Let me look.
> 
> Not really. The problem is that a SMP build can run on a UP machine w/o
> APIC or command line disables the APIC and will run into the exactly
> same problem. The only case where we know that it is impossible is when
> APIC support is disabled, which is silly but topic for a different
> discussion.
> 
> So the proper thing to do is to check for num_possible_cpus() == 1 in
> that function.
> 
> Sure you can argue that we could avoid it for SMP=n builds completely,
> but I think the right thing to do is to aim for removing CONFIG_SMP and
> make the UP build a subset of a generic SMP capable build which has
> CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why?
> 
> Because it consolidates the code and makes UP use exactly the same
> mechanisms as SMP which pretty much avoids the problem we see today that
> UP lacks test coverage and becomes more esoteric and untested over time.
> 
> The downside is a slightly larger kernel image for such a build.
> 
> Though if we pretend that we seriously care about that 10% larger memory
> footprint or about the marginal performance benefit of SMP=n on dead
> hardware, then we are just taking the wrong pills.
> 

FWIW, I would very much prefer for SMP=n builds to go away for x86.
I don't think anyone uses that in the real world nowadays, and I never
know if I should report problems like this one or just stop testing it.

> The point is that this very commit in question was heading deliberately
> into the direction of removing the by now silly differences of UP/SMP
> for correctness sake. It just happened to unearth the missing check in
> topology_get_logical_id(), but that check is required independent of
> SMP=y/n as I pointed out above.
> 
> Thanks,
> 
>          tglx
> 
> ---
> diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
> index 3259b1d4fefe..118d9f7792ee 100644
> --- a/arch/x86/kernel/cpu/topology.c
> +++ b/arch/x86/kernel/cpu/topology.c
> @@ -279,6 +279,15 @@ int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level)
>   
>   	if (lvlid >= MAX_LOCAL_APIC)
>   		return -ERANGE;
> +	/*
> +	 * Spare the exercise on UP as there is only one instance at any
> +	 * level and the map check below might fail because the CPU does
> +	 * not have a local APIC or local APIC has been disabled on the
> +	 * kernel command line.
> +	 */
> +	if (num_possible_cpus() == 1)
> +		return 0;
> +

That works.

Tested-by: Guenter Roeck <linux@roeck-us.net>

>   	if (!test_bit(lvlid, apic_maps[at_level].map))
>   		return -ENODEV;
>   	/* Get the number of set bits before @lvlid. */

Thomas Gleixner March 16, 2024, 1:11 a.m. UTC | #8

On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote:
> On Fri, 15 Mar 2024 at 15:55, Thomas Gleixner <tglx@linutronix.de> wrote:
>> So the proper thing to do is to check for num_possible_cpus() == 1 in
>> that function.
>
> I think that's _one_ proper thing. I still think that the deeper
> problem is that it still looks at local apic rules even when those
> rules are completely nonsensical.
>
> For example, that MAX_LOCAL_APIC range test may not matter simply
> because it's testing a constant value, but it still smells entirely
> wrong to even check for that, when the system doesn't necessarily have
> one.

cpu_info.apic_id defaults to 0, so unless the calling code is completely
broken it will be correct. And I rather catch the case of calling code
being broken in the !APIC case if we still want to support systems
without a local APIC.

> So I think your patch may fix the immediate bug, but I think it's
> still just a band-aid.
>
> Either we should just make all machines look like they have the proper
> local apic mappings, or we shouldn't look at any local apic rules AT
> ALL.

Sure. I can simply check if there was an APIC registered instead.

>> Sure you can argue that we could avoid it for SMP=n builds completely,
>> but I think the right thing to do is to aim for removing CONFIG_SMP and
>> make the UP build a subset of a generic SMP capable build which has
>> CONFIG_NR_CPUS=1, i.e. num_possible_cpus() = 1. Why?
>
> I wouldn't be entirely opposed to just doing that. UP has become
> fairly irrelevant.
>
> That said, UP is *not* entirely irrelevant on other architectures, and
> if we drop UP support on x86, we'll be effectively dropping a lot of
> coverage testing. The number of people who do cross-compilers is
> pretty small.
>
> End result: I'd *much* rather get rid of X86_UP_APIC and the "nolapic"
> kernel command line, and say "even UP has to have a local APIC".
>
> We already require a Pentium-class CPU, so in practice we already
> require that local APIC setup. And yes, machines existed where it
> could be turned off, but I don't think that is relevant any more.

You wish. We still support 486 and some of the still produced 486 clones
do not have a local APIC.

Not that I care and yes I'm all for getting rid of CONFIG_.*_APIC and of
the related config/command line options. If we refuse to boot on
hardware which does not enumerate an APIC then even better.

But that is only a part of the overall problem.

> Put another way: I think "UP config for wider build testing" is a
> _lot_ more relevant than "no LAPIC support".

I really have to disagree here.

The concept of making UP a proper subset of SMP has absolutely nothing
to do with x86 and UP test coverage.

We want SMP as a general concept and overhaul the whole kernel to get
rid of this ever increasing non-sensical UP burden. The real world UP
small system use cases have moved over to other OSes like Zephyr & Co
long ago.

Just because some esoteric architectures (m68k comes to my mind) will
have serious issues with that for the very wrong reasons does not mean
that we should not go there.

It's going to be quite some effort, but the overall benefit is worth it.

OTOH, it's absolutely not rocket science to pretend to be SMP capable
and if some architectures fail to accomodate on the way then we just
should remove them as that's a clear sign of being unmaintained and
irrelevant.

The amount of untested SMP=n code in the kernel becomes frigthening and
your argument that build coverage is making a difference is wishful
thinking at best.

Anything else than making the kernel SMP capable and making UP builds a
well defined subset via CONFIG_NR_CPUS=1 is a complete waste of time and
effort.

If your intention is to indulge in the historical glory of Linux running
on any (by now) irrelevant hardware on the planet, then I stop arguing
right here.

If not, can we please have a serious discussion about going SMP only and
making UP the simple and obvious NR_CPUS=1 subset?

The amount of subtle SMP=n fallout has been kinda exponentially
increasing over the years and it's just putting burden on the wrong
people. TBH, I'm tired of this nonsense.

Thanks,

        tglx

Linus Torvalds March 16, 2024, 1:23 a.m. UTC | #9

On Fri, 15 Mar 2024 at 18:11, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> You wish. We still support 486 and some of the still produced 486 clones
> do not have a local APIC.

Ouch. I was _sure_ we had dropped i486 support too due to cmpxchg8b.

But apparently that was just a discussion, and my wishful thinking,
and we never actually followed through.

         Linus

Arnd Bergmann March 16, 2024, 9:34 p.m. UTC | #10

On Sat, Mar 16, 2024, at 02:23, Linus Torvalds wrote:
> On Fri, 15 Mar 2024 at 18:11, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> You wish. We still support 486 and some of the still produced 486 clones
>> do not have a local APIC.
>
> Ouch. I was _sure_ we had dropped i486 support too due to cmpxchg8b.
>
> But apparently that was just a discussion, and my wishful thinking,
> and we never actually followed through.

Maciej Rozycki still cares about i486 type hardware, and he was
asking for it to be kept around in the thread following [1]

I think the best suggestion at the time was to make cmpxchg8b
a compile-time feature and I had expected Maciej to follow up with
a patch for that, but this never happend, and nobody sent a patch
to remove support 486 and the early 586 clones either.

I saw recently that there are still distros that advertise 486
support on modern kernels: Tiny Core Linux and Damn Small
Linux. Both ship with a 486 SMP kernel but fail to boot
on qemu unless an APIC is enabled (DSL also requires i686 or
higher to run userspace).

       Arnd

[1] https://lore.kernel.org/all/20220815071332.627393-9-yuzhao@google.com/

David Laight March 17, 2024, 9:03 p.m. UTC | #11

From: Thomas Gleixner
> Sent: 16 March 2024 01:11
...
> We want SMP as a general concept and overhaul the whole kernel to get
> rid of this ever increasing non-sensical UP burden. The real world UP
> small system use cases have moved over to other OSes like Zephyr & Co
> long ago.
> 
> Just because some esoteric architectures (m68k comes to my mind) will
> have serious issues with that for the very wrong reasons does not mean
> that we should not go there.
> 
> It's going to be quite some effort, but the overall benefit is worth it.
> 
> OTOH, it's absolutely not rocket science to pretend to be SMP capable
> and if some architectures fail to accomodate on the way then we just
> should remove them as that's a clear sign of being unmaintained and
> irrelevant.

There are fpga soft-cpu (eg Nios & Risc-V) that can run linux.
They are definitely memory constrained and really wouldn't want
most of the SMP overhead.

I'm not what it involves apart from simplified startup, compiling
out IPI and spinlocks and optimising per-cpu data.
But you wouldn't want to be running an SMP capable kernel on such systems.
x86 is a different beast - except perhaps 486.

It has to be said that I've never understood why anyone would run
Linux on a Nios-II cpu. Far too slow for anything useful (you might
get 100MHz if you are lucky), caches will be small and external memory
accesses slow.
I doubt soft RISC-V are any better (and I suspect they are worse).
We do have 4 Nios-II in the fpga image for a PCIe card.
They run very small programs (one has 2kB of code memory) to do things
that would be impossible to write (sensibly) in VHDL.
There are fpga with embedded ARM (and probably RISC-V) cores for
running real OS.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Thomas Gleixner March 18, 2024, 11:11 a.m. UTC | #12

On Sat, Mar 16 2024 at 02:11, Thomas Gleixner wrote:
> On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote:
>> Either we should just make all machines look like they have the proper
>> local apic mappings, or we shouldn't look at any local apic rules AT
>> ALL.
>
> Sure. I can simply check if there was an APIC registered instead.

Like the below. I'm not entirely sure though whether the sanity checks
should return an error code, which is what caused the crash Guenter
observed, but I couldn't come up with something sensible either.

Returning 0 might keep the machine alive, but does it make sense?

Thanks,

        tglx
---
 arch/x86/kernel/cpu/topology.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -277,10 +277,21 @@ int topology_get_logical_id(u32 apicid,
 	/* Remove the bits below @at_level to get the proper level ID of @apicid */
 	unsigned int lvlid = topo_apicid(apicid, at_level);
 
-	if (lvlid >= MAX_LOCAL_APIC)
+	if (WARN_ON_ONCE(lvlid >= MAX_LOCAL_APIC))
 		return -ERANGE;
-	if (!test_bit(lvlid, apic_maps[at_level].map))
+
+	/*
+	 * If there was no APIC registered, then the map check below would
+	 * fail. With no APIC this is guaranteed to be an UP system and
+	 * therefore all topology levels have only one entry and their
+	 * logical ID is obviously 0.
+	 */
+	if (topo_info.boot_cpu_apic_id == BAD_APICID)
+		return 0;
+
+	if (WARN_ON_ONCE(!test_bit(lvlid, apic_maps[at_level].map)))
 		return -ENODEV;
+
 	/* Get the number of set bits before @lvlid. */
 	return bitmap_weight(apic_maps[at_level].map, lvlid);
 }

Thomas Gleixner March 18, 2024, 5:27 p.m. UTC | #13

On Sat, Mar 16 2024 at 02:11, Thomas Gleixner wrote:
> On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote:
> The amount of subtle SMP=n fallout has been kinda exponentially
> increasing over the years and it's just putting burden on the wrong
> people. TBH, I'm tired of this nonsense.

And for the fun of it I hacked Kconfig to allow a SMP=y NR_CPUS=1 build
and checked the size of vmlinux:

                64-bit          32-bit
SMP, NCPUS=1    38438400        22110177
UP              38393703        21682041
Delta              44697          428076
                     0.1%              2%              

The UP savings are not really impressive...

Let me look what it actually takes to do that.

Thanks,

        tglx

Arnd Bergmann March 18, 2024, 7:13 p.m. UTC | #14

On Mon, Mar 18, 2024, at 18:27, Thomas Gleixner wrote:
> On Sat, Mar 16 2024 at 02:11, Thomas Gleixner wrote:
>> On Fri, Mar 15 2024 at 16:23, Linus Torvalds wrote:
>> The amount of subtle SMP=n fallout has been kinda exponentially
>> increasing over the years and it's just putting burden on the wrong
>> people. TBH, I'm tired of this nonsense.
>
> And for the fun of it I hacked Kconfig to allow a SMP=y NR_CPUS=1 build
> and checked the size of vmlinux:
>
>                 64-bit          32-bit
> SMP, NCPUS=1    38438400        22110177
> UP              38393703        21682041
> Delta              44697          428076
>                      0.1%              2%              
>
> The UP savings are not really impressive...
>
> Let me look what it actually takes to do that.

FWIW, I did some experiments a few weeks ago on 32-bit ARM,
using a fairly minimal kernel in a virtual machine, and
checking the runtime memory consumption rather than compile-time.
In a kvm guest with 32MiB RAM, I saw a difference of multiple
megabytes in memory usage:

Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #1 SMP PREEMPT armv7l
root@testvm:~# free
           	total   used    free  	shared  buff/cache   available
Mem:       	26932   14956   1732   	    52       12800   	11976
Swap:      	16360    3632   12728

Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #2 PREEMPT armv7l
root@testvm:~# free
           	total    used  	free  	shared  buff/cache   available
Mem:       	26932   13744  	5648        32       10092   	13188
Swap:      	16360    3880  	12480

There is a little difference between runs, but this does seem
significant enough to keep it. The SMP build was with
CONFIG_NR_CPUS=2 (the smallest supported compile-time number),
but running on a single-CPU qemu instance.

      Arnd

Thomas Gleixner March 19, 2024, 4:21 p.m. UTC | #15

On Mon, Mar 18 2024 at 20:13, Arnd Bergmann wrote:
> FWIW, I did some experiments a few weeks ago on 32-bit ARM,
> using a fairly minimal kernel in a virtual machine, and
> checking the runtime memory consumption rather than compile-time.
> In a kvm guest with 32MiB RAM, I saw a difference of multiple
> megabytes in memory usage:
>
> Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #1 SMP PREEMPT armv7l
> root@testvm:~# free
>            	total   used    free  	shared  buff/cache   available
> Mem:       	26932   14956   1732   	    52       12800   	11976
> Swap:      	16360    3632   12728
>
> Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #2 PREEMPT armv7l
> root@testvm:~# free
>            	total    used  	free  	shared  buff/cache   available
> Mem:       	26932   13744  	5648        32       10092   	13188
> Swap:      	16360    3880  	12480
>
> There is a little difference between runs, but this does seem
> significant enough to keep it. The SMP build was with
> CONFIG_NR_CPUS=2 (the smallest supported compile-time number),
> but running on a single-CPU qemu instance.

With a SMP=y, NR_CPUS=1 build on x86 64bit I get:

               total        used        free      shared  buff/cache   available
Mem:        32882056      498068    32590580        4884      128884    32383988
Swap:         998396           0      998396

Same config just SMP=n:

               total        used        free      shared  buff/cache   available
Mem:        32885804      461704    32635284        4876      119480    32424100
Swap:         998396           0      998396

So the delta for available is ~40 MiB.

But if I look at it with init=/bin/sh on the command line then the delta
is significantly different:

With a SMP=y, NR_CPUS=1 build on x86 64bit I get:

               total        used        free      shared  buff/cache   available
Mem:        32883680      324120    32822728         216       10864    32559560
Swap:              0           0           0

Same config just SMP=n:

               total        used        free      shared  buff/cache   available
Mem:        32885804      326876    32821972         216       11100    32558928
Swap:              0           0           0

Delta available = 632 KiB

I haven't had the time to stare at that in detail, but comparing
/proc/meminfo for the full boot case above does not immediately give me
a hint. It's confusing at best...

Thanks,

        tglx

Guenter Roeck March 19, 2024, 6:26 p.m. UTC | #16

On 3/19/24 09:21, Thomas Gleixner wrote:
> On Mon, Mar 18 2024 at 20:13, Arnd Bergmann wrote:
>> FWIW, I did some experiments a few weeks ago on 32-bit ARM,
>> using a fairly minimal kernel in a virtual machine, and
>> checking the runtime memory consumption rather than compile-time.
>> In a kvm guest with 32MiB RAM, I saw a difference of multiple
>> megabytes in memory usage:
>>
>> Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #1 SMP PREEMPT armv7l
>> root@testvm:~# free
>>             	total   used    free  	shared  buff/cache   available
>> Mem:       	26932   14956   1732   	    52       12800   	11976
>> Swap:      	16360    3632   12728
>>
>> Linux testvm 6.8.0-rc4-00410-gc02197fc9076-dirty #2 PREEMPT armv7l
>> root@testvm:~# free
>>             	total    used  	free  	shared  buff/cache   available
>> Mem:       	26932   13744  	5648        32       10092   	13188
>> Swap:      	16360    3880  	12480
>>
>> There is a little difference between runs, but this does seem
>> significant enough to keep it. The SMP build was with
>> CONFIG_NR_CPUS=2 (the smallest supported compile-time number),
>> but running on a single-CPU qemu instance.
> 
> With a SMP=y, NR_CPUS=1 build on x86 64bit I get:
> 
>                 total        used        free      shared  buff/cache   available
> Mem:        32882056      498068    32590580        4884      128884    32383988
> Swap:         998396           0      998396
> 
> Same config just SMP=n:
> 
>                 total        used        free      shared  buff/cache   available
> Mem:        32885804      461704    32635284        4876      119480    32424100
> Swap:         998396           0      998396
> 
> So the delta for available is ~40 MiB.
> 
> But if I look at it with init=/bin/sh on the command line then the delta
> is significantly different:
> 
> With a SMP=y, NR_CPUS=1 build on x86 64bit I get:
> 
>                 total        used        free      shared  buff/cache   available
> Mem:        32883680      324120    32822728         216       10864    32559560
> Swap:              0           0           0
> 
> Same config just SMP=n:
> 
>                 total        used        free      shared  buff/cache   available
> Mem:        32885804      326876    32821972         216       11100    32558928
> Swap:              0           0           0
> 
> Delta available = 632 KiB
> 
> I haven't had the time to stare at that in detail, but comparing
> /proc/meminfo for the full boot case above does not immediately give me
> a hint. It's confusing at best...
>

That makes me wonder if the number is affected by the total memory size.
How about a system with 1GB of memory or less ?

Guenter

Thomas Gleixner March 20, 2024, 8:58 a.m. UTC | #17

On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote:
> I don't know the code well enough to determine what is wrong.
> Please let me know what I can do to help debugging the problem.

Could you provide me the config and the qemu command line?

Thanks,

        tglx

Guenter Roeck March 20, 2024, 3:46 p.m. UTC | #18

On 3/20/24 01:58, Thomas Gleixner wrote:
> On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote:
>> I don't know the code well enough to determine what is wrong.
>> Please let me know what I can do to help debugging the problem.
> 
> Could you provide me the config and the qemu command line?
> 

defconfig-CONFIG_SMP and

qemu-system-x86_64 -kernel arch/x86/boot/bzImage -cpu Haswell \
      --append "console=ttyS0" -nographic -monitor none

The cpu doesn't really matter as long as it is an Intel CPU.
A root file system isn't needed since the boot doesn't get that far.

Guenter

Thomas Gleixner March 21, 2024, 11:14 a.m. UTC | #19

On Wed, Mar 20 2024 at 08:46, Guenter Roeck wrote:
> On 3/20/24 01:58, Thomas Gleixner wrote:
>> On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote:
>>> I don't know the code well enough to determine what is wrong.
>>> Please let me know what I can do to help debugging the problem.
>> 
>> Could you provide me the config and the qemu command line?
>> 
>
> defconfig-CONFIG_SMP and
>
> qemu-system-x86_64 -kernel arch/x86/boot/bzImage -cpu Haswell \
>       --append "console=ttyS0" -nographic -monitor none
>
> The cpu doesn't really matter as long as it is an Intel CPU.
> A root file system isn't needed since the boot doesn't get that far.

Now it get's interesting because I can't reproduce it with that setup at
all.

What's weird is that I saw it exactly once on 64-bit in a VM with a UP
config two days ago, but when I started to add instrumentation it never
came back even after backing the instrumentation changes out. I have
seriously no idea what's going on there.

Is it fully reproducible on your side?

If so can you please provide a full dmesg and then apply the patch below
and provide the resulting full dmesg too?

I found two other issues while trying to find a way to reproduce, but
those are completely unrelated to the problem you are observing.

Thanks,

        tglx
---
 arch/x86/kernel/cpu/topology.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -176,6 +176,8 @@ static __init void topo_register_apic(u3
 {
 	int cpu, dom;
 
+	pr_info("APIC: %x %d\n", apic_id, present);
+
 	if (present) {
 		set_bit(apic_id, phys_cpu_present_map);
 
@@ -277,10 +279,23 @@ int topology_get_logical_id(u32 apicid,
 	/* Remove the bits below @at_level to get the proper level ID of @apicid */
 	unsigned int lvlid = topo_apicid(apicid, at_level);
 
-	if (lvlid >= MAX_LOCAL_APIC)
+	pr_info("APIC logical ID: %x %x %d\n", apicid, lvlid, at_level);
+
+	if (WARN_ON_ONCE(lvlid >= MAX_LOCAL_APIC))
 		return -ERANGE;
-	if (!test_bit(lvlid, apic_maps[at_level].map))
+
+	/*
+	 * If there was no APIC registered, then the map check below would
+	 * fail. With no APIC this is guaranteed to be an UP system and
+	 * therefore all topology levels have only one entry and their
+	 * logical ID is obviously 0.
+	 */
+	if (topo_info.boot_cpu_apic_id == BAD_APICID)
+		return 0;
+
+	if (WARN_ON_ONCE(!test_bit(lvlid, apic_maps[at_level].map)))
 		return -ENODEV;
+
 	/* Get the number of set bits before @lvlid. */
 	return bitmap_weight(apic_maps[at_level].map, lvlid);
 }

Guenter Roeck March 21, 2024, 2:06 p.m. UTC | #20

On 3/21/24 04:14, Thomas Gleixner wrote:
> On Wed, Mar 20 2024 at 08:46, Guenter Roeck wrote:
>> On 3/20/24 01:58, Thomas Gleixner wrote:
>>> On Fri, Mar 15 2024 at 09:17, Guenter Roeck wrote:
>>>> I don't know the code well enough to determine what is wrong.
>>>> Please let me know what I can do to help debugging the problem.
>>>
>>> Could you provide me the config and the qemu command line?
>>>
>>
>> defconfig-CONFIG_SMP and
>>
>> qemu-system-x86_64 -kernel arch/x86/boot/bzImage -cpu Haswell \
>>        --append "console=ttyS0" -nographic -monitor none
>>
>> The cpu doesn't really matter as long as it is an Intel CPU.
>> A root file system isn't needed since the boot doesn't get that far.
> 
> Now it get's interesting because I can't reproduce it with that setup at
> all.
> 
> What's weird is that I saw it exactly once on 64-bit in a VM with a UP
> config two days ago, but when I started to add instrumentation it never
> came back even after backing the instrumentation changes out. I have
> seriously no idea what's going on there.
> 
> Is it fully reproducible on your side?
> 

Yes, always.

> If so can you please provide a full dmesg and then apply the patch below
> and provide the resulting full dmesg too?
> 

You'll find everything at http://server.roeck-us.net/qemu/x86-nosmp/

The crash is gone after applying your patch. The difference is:

+       /*
+        * If there was no APIC registered, then the map check below would
+        * fail. With no APIC this is guaranteed to be an UP system and
+        * therefore all topology levels have only one entry and their
+        * logical ID is obviously 0.
+        */
+       if (topo_info.boot_cpu_apic_id == BAD_APICID) {
+               pr_info("#### topo_info.boot_cpu_apic_id == BAD_APICID\n");
                 ^^^^ I added this
+               return 0;
+       }
+

I see the "#### topo_info.boot_cpu_apic_id == BAD_APICID" message
twice in the log. See patched.log at the page pointed to above.

Hope the helps,
Guenter

Thomas Gleixner March 21, 2024, 4:49 p.m. UTC | #21

On Thu, Mar 21 2024 at 07:06, Guenter Roeck wrote:
> On 3/21/24 04:14, Thomas Gleixner wrote:
>> If so can you please provide a full dmesg and then apply the patch below
>> and provide the resulting full dmesg too?
>
> You'll find everything at http://server.roeck-us.net/qemu/x86-nosmp/

Thanks for providing this.

> The crash is gone after applying your patch. The difference is:
>
> +       /*
> +        * If there was no APIC registered, then the map check below would
> +        * fail. With no APIC this is guaranteed to be an UP system and
> +        * therefore all topology levels have only one entry and their
> +        * logical ID is obviously 0.
> +        */
> +       if (topo_info.boot_cpu_apic_id == BAD_APICID) {
> +               pr_info("#### topo_info.boot_cpu_apic_id == BAD_APICID\n");
>                  ^^^^ I added this
> +               return 0;
> +       }
> +
>
> I see the "#### topo_info.boot_cpu_apic_id == BAD_APICID" message
> twice in the log. See patched.log at the page pointed to above.

I can see why this is emitted. That happens on the initial CPUID
evaluation of the boot CPU very early during boot.

[    0.000000] Command line: console=ttyS0
[    0.000000] CPU topo: APIC logical ID: 0 0 6
[    0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID
[    0.000000] CPU topo: APIC logical ID: 0 0 4
[    0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID

The later full CPUID evaluation happens after the ACPI enumeration and
way before the affected RAPL driver is initialized:

[    0.088029] CPU topo: APIC logical ID: 0 0 6
[    0.088084] CPU topo: APIC logical ID: 0 0 4

This invocation has the boot APIC registered as your extra print does
not show up.

...

[    0.585850] RAPL PMU: API unit is 2^-32 Joules, 0 fixed counters, 10737418240 ms ovfl timer

So even without that guard (which we need anyway for the non APIC case)
topology_logical_die_id() == cpu_data(cpu).topo.logical_die_id must have
the correct value in that RAPL initialization and CPU hotplug callback
code.

But our absolutely convoluted startup logic prevents that because:

    identify_cpu_early()      operates on boot_cpu_data
    smp_prepare_boot_cpu()    copies boot_cpu_data to per CPU cpu data
    identify_boot_cpu()       operates on boot_cpu_data

identify_boot_cpu() is the one which gets the correct logical die info,
but that never gets copied over to the per CPU data instance on which
the RAPL code and everything else works on.

I'll cook up a patch later.

Thanks for providing the info!

       tglx

[5/9] x86: Cure per CPU madness on UP

Commit Message

Comments

Patch