
RISC-V: Probe misaligned access speed in parallel

Message ID 20230915184904.1976183-1-evan@rivosinc.com (mailing list archive)
State Superseded
Series RISC-V: Probe misaligned access speed in parallel

Checks

Context Check Description
conchuod/cover_letter success Single patches do not need cover letters
conchuod/tree_selection success Guessed tree name to be for-next at HEAD 0bb80ecc33a8
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 5 and now 5
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 12 this patch: 12
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 13 this patch: 13
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 25 this patch: 25
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch success total: 0 errors, 0 warnings, 0 checks, 90 lines checked
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Evan Green Sept. 15, 2023, 6:49 p.m. UTC
Probing for misaligned access speed takes about 0.06 seconds. On a
system with 64 cores, doing this in smp_callin() means it's done
serially, extending boot time by 3.8 seconds. That's a lot of boot time.

Instead of measuring each CPU serially, let's do the measurements on
all CPUs in parallel. If we disable preemption on all CPUs, the
jiffies stop ticking, so we can do this in stages of 1) everybody
except core 0, then 2) core 0.

The measurement call in smp_callin() stays around, but is now
conditionalized to only run if a new CPU shows up after the round of
in-parallel measurements has run. The goal is to have the measurement
call not run during boot or suspend/resume, but only on a hotplug
addition.

Signed-off-by: Evan Green <evan@rivosinc.com>

---

Jisheng, I didn't add your Tested-by tag since the patch evolved from
the one you tested. Hopefully this one brings you the same result.

---
 arch/riscv/include/asm/cpufeature.h |  3 ++-
 arch/riscv/kernel/cpufeature.c      | 28 +++++++++++++++++++++++-----
 arch/riscv/kernel/smpboot.c         | 11 ++++++++++-
 3 files changed, 35 insertions(+), 7 deletions(-)

Comments

Conor Dooley Sept. 16, 2023, 12:16 a.m. UTC | #1
Yo Evan,

On Fri, Sep 15, 2023 at 11:49:03AM -0700, Evan Green wrote:
> Probing for misaligned access speed takes about 0.06 seconds. On a
> system with 64 cores, doing this in smp_callin() means it's done
> serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> 
> Instead of measuring each CPU serially, let's do the measurements on
> all CPUs in parallel. If we disable preemption on all CPUs, the
> jiffies stop ticking, so we can do this in stages of 1) everybody
> except core 0, then 2) core 0.
> 
> The measurement call in smp_callin() stays around, but is now
> conditionalized to only run if a new CPU shows up after the round of
> in-parallel measurements has run. The goal is to have the measurement
> call not run during boot or suspend/resume, but only on a hotplug
> addition.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> ---
> 
> Jisheng, I didn't add your Tested-by tag since the patch evolved from
> the one you tested. Hopefully this one brings you the same result.

Ya know, I think there's scope to add Reported-by:, Closes: and Fixes:
tags to this patch, mentioning explicitly that this has regressed boot
time for many-core systems, so that this can be fixes material. What do
you think?
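
For illustration, such a trailer block might look like the following,
where the Fixes: hash and the Closes: message-id are placeholders
(presumably pointing at the commit that introduced the per-CPU probe
and at Jisheng's report on lore, respectively):

    Reported-by: Jisheng Zhang <jszhang@kernel.org>
    Closes: https://lore.kernel.org/linux-riscv/<report-message-id>/
    Fixes: <hash> ("RISC-V: Probe for unaligned access speed")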

> ---
>  arch/riscv/include/asm/cpufeature.h |  3 ++-
>  arch/riscv/kernel/cpufeature.c      | 28 +++++++++++++++++++++++-----
>  arch/riscv/kernel/smpboot.c         | 11 ++++++++++-
>  3 files changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index d0345bd659c9..19e7817eba10 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,6 +30,7 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
>  /* Per-cpu ISA extensions. */
>  extern struct riscv_isainfo hart_isa[NR_CPUS];
>  
> -void check_unaligned_access(int cpu);
> +extern bool misaligned_speed_measured;
> +int check_unaligned_access(void *unused);
>  
>  #endif
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index 1cfbba65d11a..8eb36e1dfb95 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -42,6 +42,9 @@ struct riscv_isainfo hart_isa[NR_CPUS];
>  /* Performance information */
>  DEFINE_PER_CPU(long, misaligned_access_speed);
>  
> +/* Boot-time in-parallel unaligned access measurement has occurred. */
> +bool misaligned_speed_measured;

If you did something like s/measured/complete/ I think you could drop
the comment. Tis whatever though :)

Conor.
Andrew Jones Sept. 16, 2023, 6:45 a.m. UTC | #2
On Fri, Sep 15, 2023 at 11:49:03AM -0700, Evan Green wrote:
> Probing for misaligned access speed takes about 0.06 seconds. On a
> system with 64 cores, doing this in smp_callin() means it's done
> serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> 
> Instead of measuring each CPU serially, let's do the measurements on
> all CPUs in parallel. If we disable preemption on all CPUs, the
> jiffies stop ticking, so we can do this in stages of 1) everybody
> except core 0, then 2) core 0.
> 
> The measurement call in smp_callin() stays around, but is now
> conditionalized to only run if a new CPU shows up after the round of
> in-parallel measurements has run. The goal is to have the measurement
> call not run during boot or suspend/resume, but only on a hotplug
> addition.

Yay! I had just recently tested suspend/resume and wanted to report the
probe as an issue, but I hadn't gotten around to it. This patch resolves
the issue, so

Tested-by: Andrew Jones <ajones@ventanamicro.com>

> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> ---
> 
> Jisheng, I didn't add your Tested-by tag since the patch evolved from
> the one you tested. Hopefully this one brings you the same result.
> 
> ---
>  arch/riscv/include/asm/cpufeature.h |  3 ++-
>  arch/riscv/kernel/cpufeature.c      | 28 +++++++++++++++++++++++-----
>  arch/riscv/kernel/smpboot.c         | 11 ++++++++++-
>  3 files changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index d0345bd659c9..19e7817eba10 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,6 +30,7 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
>  /* Per-cpu ISA extensions. */
>  extern struct riscv_isainfo hart_isa[NR_CPUS];
>  
> -void check_unaligned_access(int cpu);
> +extern bool misaligned_speed_measured;

Do we need this new state or could we just always check the boot cpu's
state to get the same information?

 per_cpu(misaligned_access_speed, 0) != RISCV_HWPROBE_MISALIGNED_UNKNOWN
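
As a minimal sketch, the smp_callin() gate might then read as follows
(an illustration only, assuming the boot CPU's measurement has always
completed before any hotplug path can reach this code):

	/* Hypothetical: key off the boot CPU's state instead of a global flag. */
	if (per_cpu(misaligned_access_speed, 0) != RISCV_HWPROBE_MISALIGNED_UNKNOWN &&
	    per_cpu(misaligned_access_speed, curr_cpuid) == RISCV_HWPROBE_MISALIGNED_UNKNOWN)
		check_unaligned_access(NULL);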

> +int check_unaligned_access(void *unused);
>  
>  #endif
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index 1cfbba65d11a..8eb36e1dfb95 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -42,6 +42,9 @@ struct riscv_isainfo hart_isa[NR_CPUS];
>  /* Performance information */
>  DEFINE_PER_CPU(long, misaligned_access_speed);
>  
> +/* Boot-time in-parallel unaligned access measurement has occurred. */
> +bool misaligned_speed_measured;
> +
>  /**
>   * riscv_isa_extension_base() - Get base extension word
>   *
> @@ -556,8 +559,9 @@ unsigned long riscv_get_elf_hwcap(void)
>  	return hwcap;
>  }
>  
> -void check_unaligned_access(int cpu)
> +int check_unaligned_access(void *unused)
>  {
> +	int cpu = smp_processor_id();
>  	u64 start_cycles, end_cycles;
>  	u64 word_cycles;
>  	u64 byte_cycles;
> @@ -571,7 +575,7 @@ void check_unaligned_access(int cpu)
>  	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
>  	if (!page) {
>  		pr_warn("Can't alloc pages to measure memcpy performance");
> -		return;
> +		return 0;
>  	}
>  
>  	/* Make an unaligned destination buffer. */
> @@ -643,15 +647,29 @@ void check_unaligned_access(int cpu)
>  
>  out:
>  	__free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
> +	return 0;
> +}
> +
> +static void check_unaligned_access_nonboot_cpu(void *param)
> +{
> +	if (smp_processor_id() != 0)
> +		check_unaligned_access(param);
>  }
>  
> -static int check_unaligned_access_boot_cpu(void)
> +static int check_unaligned_access_all_cpus(void)
>  {
> -	check_unaligned_access(0);
> +	/* Check everybody except 0, who stays behind to tend jiffies. */
> +	on_each_cpu(check_unaligned_access_nonboot_cpu, NULL, 1);
> +
> +	/* Check core 0. */
> +	smp_call_on_cpu(0, check_unaligned_access, NULL, true);
> +
> +	/* Boot-time measurements are complete. */
> +	misaligned_speed_measured = true;
>  	return 0;
>  }
>  
> -arch_initcall(check_unaligned_access_boot_cpu);
> +arch_initcall(check_unaligned_access_all_cpus);
>  
>  #ifdef CONFIG_RISCV_ALTERNATIVE
>  /*
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index 1b8da4e40a4d..39322ae20a75 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -27,6 +27,7 @@
>  #include <linux/sched/mm.h>
>  #include <asm/cpu_ops.h>
>  #include <asm/cpufeature.h>
> +#include <asm/hwprobe.h>
>  #include <asm/irq.h>
>  #include <asm/mmu_context.h>
>  #include <asm/numa.h>
> @@ -246,7 +247,15 @@ asmlinkage __visible void smp_callin(void)
>  
>  	numa_add_cpu(curr_cpuid);
>  	set_cpu_online(curr_cpuid, 1);
> -	check_unaligned_access(curr_cpuid);
> +
> +	/*
> +	 * Boot-time misaligned access speed measurements are done in parallel
> +	 * in an initcall. Only measure here for hotplug.
> +	 */
> +	if (misaligned_speed_measured &&
> +	    (per_cpu(misaligned_access_speed, curr_cpuid) == RISCV_HWPROBE_MISALIGNED_UNKNOWN)) {
> +		check_unaligned_access(NULL);
> +	}
>  
>  	if (has_vector()) {
>  		if (riscv_v_setup_vsize())
> -- 
> 2.34.1
>

Besides my reluctance to add another global variable, this looks good to
me.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>

Thanks,
drew
Jisheng Zhang Sept. 16, 2023, 8:39 a.m. UTC | #3
On Fri, Sep 15, 2023 at 11:49:03AM -0700, Evan Green wrote:
> Probing for misaligned access speed takes about 0.06 seconds. On a
> system with 64 cores, doing this in smp_callin() means it's done
> serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> 
> Instead of measuring each CPU serially, let's do the measurements on
> all CPUs in parallel. If we disable preemption on all CPUs, the
> jiffies stop ticking, so we can do this in stages of 1) everybody
> except core 0, then 2) core 0.
> 
> The measurement call in smp_callin() stays around, but is now
> conditionalized to only run if a new CPU shows up after the round of
> in-parallel measurements has run. The goal is to have the measurement
> call not run during boot or suspend/resume, but only on a hotplug
> addition.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>

Reported-by: Jisheng Zhang <jszhang@kernel.org>
> 
> ---
> 
> Jisheng, I didn't add your Tested-by tag since the patch evolved from
> the one you tested. Hopefully this one brings you the same result.
> 
> ---
>  arch/riscv/include/asm/cpufeature.h |  3 ++-
>  arch/riscv/kernel/cpufeature.c      | 28 +++++++++++++++++++++++-----
>  arch/riscv/kernel/smpboot.c         | 11 ++++++++++-
>  3 files changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index d0345bd659c9..19e7817eba10 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,6 +30,7 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
>  /* Per-cpu ISA extensions. */
>  extern struct riscv_isainfo hart_isa[NR_CPUS];
>  
> -void check_unaligned_access(int cpu);
> +extern bool misaligned_speed_measured;
> +int check_unaligned_access(void *unused);
>  
>  #endif
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index 1cfbba65d11a..8eb36e1dfb95 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -42,6 +42,9 @@ struct riscv_isainfo hart_isa[NR_CPUS];
>  /* Performance information */
>  DEFINE_PER_CPU(long, misaligned_access_speed);
>  
> +/* Boot-time in-parallel unaligned access measurement has occurred. */
> +bool misaligned_speed_measured;

This var can be avoided, see below.

> +
>  /**
>   * riscv_isa_extension_base() - Get base extension word
>   *
> @@ -556,8 +559,9 @@ unsigned long riscv_get_elf_hwcap(void)
>  	return hwcap;
>  }
>  
> -void check_unaligned_access(int cpu)
> +int check_unaligned_access(void *unused)
>  {
> +	int cpu = smp_processor_id();
>  	u64 start_cycles, end_cycles;
>  	u64 word_cycles;
>  	u64 byte_cycles;
> @@ -571,7 +575,7 @@ void check_unaligned_access(int cpu)
>  	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
>  	if (!page) {
>  		pr_warn("Can't alloc pages to measure memcpy performance");
> -		return;
> +		return 0;
>  	}
>  
>  	/* Make an unaligned destination buffer. */
> @@ -643,15 +647,29 @@ void check_unaligned_access(int cpu)
>  
>  out:
>  	__free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
> +	return 0;
> +}
> +
> +static void check_unaligned_access_nonboot_cpu(void *param)
> +{
> +	if (smp_processor_id() != 0)
> +		check_unaligned_access(param);
>  }
>  
> -static int check_unaligned_access_boot_cpu(void)
> +static int check_unaligned_access_all_cpus(void)
>  {
> -	check_unaligned_access(0);
> +	/* Check everybody except 0, who stays behind to tend jiffies. */
> +	on_each_cpu(check_unaligned_access_nonboot_cpu, NULL, 1);
> +
> +	/* Check core 0. */
> +	smp_call_on_cpu(0, check_unaligned_access, NULL, true);
> +
> +	/* Boot-time measurements are complete. */
> +	misaligned_speed_measured = true;
>  	return 0;
>  }
>  
> -arch_initcall(check_unaligned_access_boot_cpu);
> +arch_initcall(check_unaligned_access_all_cpus);
>  
>  #ifdef CONFIG_RISCV_ALTERNATIVE
>  /*
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index 1b8da4e40a4d..39322ae20a75 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -27,6 +27,7 @@
>  #include <linux/sched/mm.h>
>  #include <asm/cpu_ops.h>
>  #include <asm/cpufeature.h>
> +#include <asm/hwprobe.h>
>  #include <asm/irq.h>
>  #include <asm/mmu_context.h>
>  #include <asm/numa.h>
> @@ -246,7 +247,15 @@ asmlinkage __visible void smp_callin(void)
>  
>  	numa_add_cpu(curr_cpuid);
>  	set_cpu_online(curr_cpuid, 1);
> -	check_unaligned_access(curr_cpuid);
> +
> +	/*
> +	 * Boot-time misaligned access speed measurements are done in parallel
> +	 * in an initcall. Only measure here for hotplug.
> +	 */
> +	if (misaligned_speed_measured &&
> +	    (per_cpu(misaligned_access_speed, curr_cpuid) == RISCV_HWPROBE_MISALIGNED_UNKNOWN)) {

I believe this check is for a cpu that was not booted during boot time
but hotplugged in after that. If so, I'm not sure whether
misaligned_speed_measured can be replaced with
(system_state == SYSTEM_RUNNING); then we wouldn't need
misaligned_speed_measured at all.
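
As a sketch, that alternative would make the smp_callin() gate look
like the following (hypothetical; it leans on system_state only
reaching SYSTEM_RUNNING after the initcalls, and thus after the
boot-time parallel probe, have finished):

	/* Hypothetical: rely on system_state instead of a dedicated flag. */
	if (system_state == SYSTEM_RUNNING &&
	    per_cpu(misaligned_access_speed, curr_cpuid) == RISCV_HWPROBE_MISALIGNED_UNKNOWN)
		check_unaligned_access(NULL);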

> +		check_unaligned_access(NULL);
> +	}
>  
>  	if (has_vector()) {
>  		if (riscv_v_setup_vsize())
> -- 
> 2.34.1
>
Jisheng Zhang Nov. 1, 2023, 11:31 a.m. UTC | #4
On Sat, Sep 16, 2023 at 04:39:54PM +0800, Jisheng Zhang wrote:
> On Fri, Sep 15, 2023 at 11:49:03AM -0700, Evan Green wrote:
> > Probing for misaligned access speed takes about 0.06 seconds. On a
> > system with 64 cores, doing this in smp_callin() means it's done
> > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> > 
> > Instead of measuring each CPU serially, let's do the measurements on
> > all CPUs in parallel. If we disable preemption on all CPUs, the
> > jiffies stop ticking, so we can do this in stages of 1) everybody
> > except core 0, then 2) core 0.
> > 
> > The measurement call in smp_callin() stays around, but is now
> > conditionalized to only run if a new CPU shows up after the round of
> > in-parallel measurements has run. The goal is to have the measurement
> > call not run during boot or suspend/resume, but only on a hotplug
> > addition.
> > 
> > Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> Reported-by: Jisheng Zhang <jszhang@kernel.org>

Hi Evan, Palmer,

This patch seems to be missing in v6.6, I dunno what happened.

And this patch doesn't fix the boot time regression but also fix a real
bug during cpu hotplug on and off.

Here is the reproduce script:

while true
do
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
done


Here is the BUG log on qemu:

[   20.950753] CPU1: failed to come online
[   20.951875] ------------[ cut here ]------------
[   20.952070] kernel BUG at kernel/time/hrtimer.c:2227!
[   20.952341] Kernel BUG [#1]
[   20.952366] Modules linked in:
[   20.952515] CPU: 0 PID: 46 Comm: sh Not tainted 6.6.0 #3
[   20.952607] Hardware name: riscv-virtio,qemu (DT)
[   20.952695] epc : hrtimers_dead_cpu+0x22e/0x230
[   20.952808]  ra : cpuhp_invoke_callback+0xe4/0x54e
[   20.952844] epc : ffffffff8007d6c0 ra : ffffffff8000f904 sp : ff600000011ebb30
[   20.952863]  gp : ffffffff80d081d0 tp : ff6000000134da00 t0 : 0000000000000040
[   20.952880]  t1 : 0000000000000000 t2 : 0000000000000000 s0 : ff600000011ebbb0
[   20.952895]  s1 : 0000000000000001 a0 : 0000000000000001 a1 : 000000000000002c
[   20.952911]  a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000
[   20.952926]  a5 : 0000000000000001 a6 : 0000000000000538 a7 : 0000000000000000
[   20.952941]  s2 : 000000000000002c s3 : 0000000000000000 s4 : ff6000003ffd4390
[   20.952957]  s5 : ffffffff80d0a1f8 s6 : 0000000000000000 s7 : ffffffff8007d492
[   20.952972]  s8 : 0000000000000001 s9 : fffffffffffffffb s10: 0000000000000000
[   20.952987]  s11: 00005555820dc708 t3 : 0000000000000002 t4 : 0000000000000402
[   20.953002]  t5 : ff600000010f0710 t6 : ff600000010f0718
[   20.953016] status: 0000000200000120 badaddr: 0000000000000000 cause: 0000000000000003
[   20.953124] [<ffffffff8007d6c0>] hrtimers_dead_cpu+0x22e/0x230
[   20.953226] [<ffffffff8000f904>] cpuhp_invoke_callback+0xe4/0x54e
[   20.953241] [<ffffffff80010fb8>] _cpu_up+0x200/0x2a2
[   20.953254] [<ffffffff800110ac>] cpu_up+0x52/0x8a
[   20.953266] [<ffffffff80011654>] cpu_device_up+0x14/0x1c
[   20.953279] [<ffffffff8029abb6>] cpu_subsys_online+0x1e/0x68
[   20.953296] [<ffffffff802957de>] device_online+0x3c/0x70
[   20.953306] [<ffffffff8029587a>] online_store+0x68/0x8c
[   20.953317] [<ffffffff802909ba>] dev_attr_store+0xe/0x1a
[   20.953330] [<ffffffff801df8aa>] sysfs_kf_write+0x2a/0x34
[   20.953346] [<ffffffff801def06>] kernfs_fop_write_iter+0xde/0x162
[   20.953360] [<ffffffff8018154a>] vfs_write+0x136/0x320
[   20.953372] [<ffffffff801818e4>] ksys_write+0x4a/0xb4
[   20.953383] [<ffffffff80181962>] __riscv_sys_write+0x14/0x1c
[   20.953394] [<ffffffff803dec7e>] do_trap_ecall_u+0x4a/0x110
[   20.953420] [<ffffffff80003666>] ret_from_exception+0x0/0x66
[   20.953648] Code: 7c42 7ca2 7d02 6de2 4501 6109 8082 c0ef 7463 bd1d (9002) 1141
[   20.953897] ---[ end trace 0000000000000000 ]---
[   20.954068] Kernel panic - not syncing: Fatal exception in interrupt
[   20.954128] SMP: stopping secondary CPUs
[   22.749953] SMP: failed to stop secondary CPUs 0-1
[   22.803768] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---


Evan Green Nov. 1, 2023, 5:28 p.m. UTC | #5
On Wed, Nov 1, 2023 at 4:44 AM Jisheng Zhang <jszhang@kernel.org> wrote:
>
> On Sat, Sep 16, 2023 at 04:39:54PM +0800, Jisheng Zhang wrote:
> > On Fri, Sep 15, 2023 at 11:49:03AM -0700, Evan Green wrote:
> > > Probing for misaligned access speed takes about 0.06 seconds. On a
> > > system with 64 cores, doing this in smp_callin() means it's done
> > > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> > >
> > > Instead of measuring each CPU serially, let's do the measurements on
> > > all CPUs in parallel. If we disable preemption on all CPUs, the
> > > jiffies stop ticking, so we can do this in stages of 1) everybody
> > > except core 0, then 2) core 0.
> > >
> > > The measurement call in smp_callin() stays around, but is now
> > > conditionalized to only run if a new CPU shows up after the round of
> > > in-parallel measurements has run. The goal is to have the measurement
> > > call not run during boot or suspend/resume, but only on a hotplug
> > > addition.
> > >
> > > Signed-off-by: Evan Green <evan@rivosinc.com>
> >
> > Reported-by: Jisheng Zhang <jszhang@kernel.org>
>
> Hi Evan, Palmer,
>
> This patch seems missing in v6.6, I dunno what happened.
>
> And this patch doesn't fix the boot time regression but also fix a real
> bug during cpu hotplug on and off.

Hi Jisheng,
Just to clarify, you're saying this both fixes the boot regression
and fixes a hotplug crash? I was slightly thrown off by the "doesn't
fix the boot time regression"; holler if there's still something wrong
with boot time.

The splat you pasted suggests the CPU isn't coming back online. Off
the top of my head I can't think of what that might be or why this
patch would fix it. I tried this on an old palmer/for-next and didn't
repro the issue:

# echo 0 > online
[   31.777280] CPU3: off
[   31.777740] CPU3 may not have stopped: 3
# echo 1 > online
[   36.236313] cpu3: Ratio of byte access time to unaligned word
access is 7.26, unaligned accesses are fast

FWIW, Palmer's for-next branch now has the v2 of this patch. I
verified that branch is booting, and hotplug seems to work as well.
-Evan



Jisheng Zhang Nov. 2, 2023, 5:07 p.m. UTC | #6
On Wed, Nov 01, 2023 at 10:28:53AM -0700, Evan Green wrote:
> On Wed, Nov 1, 2023 at 4:44 AM Jisheng Zhang <jszhang@kernel.org> wrote:
> >
> > On Sat, Sep 16, 2023 at 04:39:54PM +0800, Jisheng Zhang wrote:
> > > On Fri, Sep 15, 2023 at 11:49:03AM -0700, Evan Green wrote:
> > > > Probing for misaligned access speed takes about 0.06 seconds. On a
> > > > system with 64 cores, doing this in smp_callin() means it's done
> > > > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> > > >
> > > > Instead of measuring each CPU serially, let's do the measurements on
> > > > all CPUs in parallel. If we disable preemption on all CPUs, the
> > > > jiffies stop ticking, so we can do this in stages of 1) everybody
> > > > except core 0, then 2) core 0.
> > > >
> > > > The measurement call in smp_callin() stays around, but is now
> > > > conditionalized to only run if a new CPU shows up after the round of
> > > > in-parallel measurements has run. The goal is to have the measurement
> > > > call not run during boot or suspend/resume, but only on a hotplug
> > > > addition.
> > > >
> > > > Signed-off-by: Evan Green <evan@rivosinc.com>
> > >
> > > Reported-by: Jisheng Zhang <jszhang@kernel.org>
> >
> > Hi Evan, Palmer,
> >
> > This patch seems missing in v6.6, I dunno what happened.
> >
> > And this patch doesn't fix the boot time regression but also fix a real
> > bug during cpu hotplug on and off.
> 
> Hi Jisheng,
> Just to clarify, you're saying this both fixes the boot regression,
> and fixes a hotplug crash? I was slightly thrown off by the "doesn't
> fix the boot time regression", holler if there's still something wrong

typo: should be "not only fix the boot time regression but also ..."

> with boot time.
> 
> The splat you pasted suggests the CPU isn't coming back online. Off
> the top of my head I can't think of what that might be or why this
> patch would fix it. I tried this on an old palmer/for-next and didn't
> repro the issue:
> 
> # echo 0 > online
> [   31.777280] CPU3: off
> [   31.777740] CPU3 may not have stopped: 3
> # echo 1 > online
> [   36.236313] cpu3: Ratio of byte access time to unaligned word
> access is 7.26, unaligned accesses are fast

you need to run the script for some time, e.g. 3~5 minutes.

Hotplugging the cpu off and then on just once isn't enough.
> 
> FWIW, Palmer's for-next branch now has the v2 of this patch. I

I want the v2 patch to be merged
> verified that branch is booting, and hotplug seems to work as well.

Can you try stressing cpu hotplug without your patch? I.e. try on v6.6.

Thanks
Evan Green Nov. 2, 2023, 10:41 p.m. UTC | #7
On Fri, Sep 15, 2023 at 11:49 AM Evan Green <evan@rivosinc.com> wrote:
>
> Probing for misaligned access speed takes about 0.06 seconds. On a
> system with 64 cores, doing this in smp_callin() means it's done
> serially, extending boot time by 3.8 seconds. That's a lot of boot time.
>
> Instead of measuring each CPU serially, let's do the measurements on
> all CPUs in parallel. If we disable preemption on all CPUs, the
> jiffies stop ticking, so we can do this in stages of 1) everybody
> except core 0, then 2) core 0.
>
> The measurement call in smp_callin() stays around, but is now
> conditionalized to only run if a new CPU shows up after the round of
> in-parallel measurements has run. The goal is to have the measurement
> call not run during boot or suspend/resume, but only on a hotplug
> addition.
>
> Signed-off-by: Evan Green <evan@rivosinc.com>

Shoot, I saw the other thread [1] where it seems like my use of
alloc_pages() in this context is improper? I had thought I was
alright, as Documentation/core-api/memory-allocation.rst says:

 > If the allocation is performed from an atomic context, e.g interrupt
 > handler, use ``GFP_NOWAIT``.

Any tips for reproducing that splat? I have CONFIG_DEBUG_ATOMIC_SLEEP
on (it's in the defconfig), and lockdep, and I'm on Conor's
linux-6.6.y-rt, but so far I'm not seeing it.

-Evan

[1] https://lore.kernel.org/linux-riscv/ZUPWc7sY47l34lV+@xhacker/T/#t
Conor Dooley Nov. 3, 2023, 8:34 a.m. UTC | #8
On Thu, Nov 02, 2023 at 03:41:58PM -0700, Evan Green wrote:
> On Fri, Sep 15, 2023 at 11:49 AM Evan Green <evan@rivosinc.com> wrote:
> >
> > Probing for misaligned access speed takes about 0.06 seconds. On a
> > system with 64 cores, doing this in smp_callin() means it's done
> > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> >
> > Instead of measuring each CPU serially, let's do the measurements on
> > all CPUs in parallel. If we disable preemption on all CPUs, the
> > jiffies stop ticking, so we can do this in stages of 1) everybody
> > except core 0, then 2) core 0.
> >
> > The measurement call in smp_callin() stays around, but is now
> > conditionalized to only run if a new CPU shows up after the round of
> > in-parallel measurements has run. The goal is to have the measurement
> > call not run during boot or suspend/resume, but only on a hotplug
> > addition.
> >
> > Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> Shoot, I saw the other thread [1] where it seems like my use of
> alloc_pages() in this context is improper? I had thought I was
> alright, as Documentation/core-api/memory-allocation.rst says:
> 
>  > If the allocation is performed from an atomic context, e.g interrupt
>  > handler, use ``GFP_NOWAIT``.
> 
> Any tips for reproducing that splat? I have CONFIG_DEBUG_ATOMIC_SLEEP
> on (it's in the defconfig), and lockdep, and I'm on Conor's
> linux-6.6.y-rt, but so far I'm not seeing it.

It was originally produced in hardware, but I can also see these issues
in QEMU's emulation of my hardware (although as you may have seen, I get
them both with and without this patch). My qemu incantation was
something like:
	$(qemu) -M microchip-icicle-kit \
		-m 3G -smp 5 \
		-kernel vmlinux.bin \
		-dtb mpfs-icicle.dtb \
		-initrd initramfs \
		-display none -serial null \
		-serial stdio \
		-D qemu.log -d unimp

Where the kernel was built from the .config in that branch in my repo.

Cheers,
Conor.

Patch

diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
index d0345bd659c9..19e7817eba10 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -30,6 +30,7 @@  DECLARE_PER_CPU(long, misaligned_access_speed);
 /* Per-cpu ISA extensions. */
 extern struct riscv_isainfo hart_isa[NR_CPUS];
 
-void check_unaligned_access(int cpu);
+extern bool misaligned_speed_measured;
+int check_unaligned_access(void *unused);
 
 #endif
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 1cfbba65d11a..8eb36e1dfb95 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -42,6 +42,9 @@  struct riscv_isainfo hart_isa[NR_CPUS];
 /* Performance information */
 DEFINE_PER_CPU(long, misaligned_access_speed);
 
+/* Boot-time in-parallel unaligned access measurement has occurred. */
+bool misaligned_speed_measured;
+
 /**
  * riscv_isa_extension_base() - Get base extension word
  *
@@ -556,8 +559,9 @@  unsigned long riscv_get_elf_hwcap(void)
 	return hwcap;
 }
 
-void check_unaligned_access(int cpu)
+int check_unaligned_access(void *unused)
 {
+	int cpu = smp_processor_id();
 	u64 start_cycles, end_cycles;
 	u64 word_cycles;
 	u64 byte_cycles;
@@ -571,7 +575,7 @@  void check_unaligned_access(int cpu)
 	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
 	if (!page) {
 		pr_warn("Can't alloc pages to measure memcpy performance");
-		return;
+		return 0;
 	}
 
 	/* Make an unaligned destination buffer. */
@@ -643,15 +647,29 @@  void check_unaligned_access(int cpu)
 
 out:
 	__free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
+	return 0;
+}
+
+static void check_unaligned_access_nonboot_cpu(void *param)
+{
+	if (smp_processor_id() != 0)
+		check_unaligned_access(param);
 }
 
-static int check_unaligned_access_boot_cpu(void)
+static int check_unaligned_access_all_cpus(void)
 {
-	check_unaligned_access(0);
+	/* Check everybody except 0, who stays behind to tend jiffies. */
+	on_each_cpu(check_unaligned_access_nonboot_cpu, NULL, 1);
+
+	/* Check core 0. */
+	smp_call_on_cpu(0, check_unaligned_access, NULL, true);
+
+	/* Boot-time measurements are complete. */
+	misaligned_speed_measured = true;
 	return 0;
 }
 
-arch_initcall(check_unaligned_access_boot_cpu);
+arch_initcall(check_unaligned_access_all_cpus);
 
 #ifdef CONFIG_RISCV_ALTERNATIVE
 /*
diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index 1b8da4e40a4d..39322ae20a75 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -27,6 +27,7 @@ 
 #include <linux/sched/mm.h>
 #include <asm/cpu_ops.h>
 #include <asm/cpufeature.h>
+#include <asm/hwprobe.h>
 #include <asm/irq.h>
 #include <asm/mmu_context.h>
 #include <asm/numa.h>
@@ -246,7 +247,15 @@  asmlinkage __visible void smp_callin(void)
 
 	numa_add_cpu(curr_cpuid);
 	set_cpu_online(curr_cpuid, 1);
-	check_unaligned_access(curr_cpuid);
+
+	/*
+	 * Boot-time misaligned access speed measurements are done in parallel
+	 * in an initcall. Only measure here for hotplug.
+	 */
+	if (misaligned_speed_measured &&
+	    (per_cpu(misaligned_access_speed, curr_cpuid) == RISCV_HWPROBE_MISALIGNED_UNKNOWN)) {
+		check_unaligned_access(NULL);
+	}
 
 	if (has_vector()) {
 		if (riscv_v_setup_vsize())