
[RFC] arch_topology: Pre-allocate cacheinfo from primary CPU

Message ID 20230323224242.31142-1-rrendec@redhat.com (mailing list archive)
State New, archived
Series [RFC] arch_topology: Pre-allocate cacheinfo from primary CPU

Commit Message

Radu Rendec March 23, 2023, 10:42 p.m. UTC
Commit 5944ce092b97 ("arch_topology: Build cacheinfo from primary CPU")
tries to build the cacheinfo from the primary CPU prior to secondary
CPUs boot, if the DT/ACPI description contains cache information.
However, if such information is not present, it still reverts to the old
behavior, which allocates the cacheinfo memory on each secondary CPU. On
RT kernels, this triggers a "BUG: sleeping function called from invalid
context" because the allocation is done before preemption is first
enabled on the secondary CPU.

The solution is to add cache information to DT/ACPI, but at least on
aarch64 systems this can be avoided by leveraging automatic detection
(through the CLIDR_EL1 register), which is already implemented but
currently doesn't work on RT kernels for the reason described above.

This patch attempts to enable automatic detection for RT kernels when no
DT/ACPI cache information is available, by pre-allocating cacheinfo
memory on the primary CPU. The allocated memory size depends on the
number of cache leaves, which at that point is unknown without the
DT/ACPI information. What this patch does is guess the number of cache
leaves and pre-allocate memory on the primary CPU, then go back and
reallocate the memory if the guess turns out to be wrong when automatic
detection eventually runs on the secondary CPU. In that case, it will
basically revert to the original behavior and still trigger a splat on
RT kernels. The assumption is that most systems have identical CPUs, so
the number of cache leaves will be the same on the secondary CPUs as the
primary CPU. The "guess" uses the number of leaves of the primary CPU.

If the DT/ACPI cache information is present, the previous behavior of
pre-allocating memory through init_cpu_topology() is preserved.

With this patch applied, automatic detection should work on RT kernels
for all systems with identical CPUs, without requiring modification of the
DT/ACPI to include the cache information.

Signed-off-by: Radu Rendec <rrendec@redhat.com>
---
 arch/arm64/kernel/smp.c   |  3 +++
 drivers/base/cacheinfo.c  | 49 +++++++++++++++++++++++++++++++++++++--
 include/linux/cacheinfo.h |  1 +
 3 files changed, 51 insertions(+), 2 deletions(-)

Comments

Pierre Gondois March 27, 2023, 12:02 p.m. UTC | #1
Hello Radu,
About populating the cache info from the CLIDR_EL1 register, it seems
the information might be incorrect for DT based systems, and the masks
of L1 data/instruction caches could be advertised as private for
ACPI/DT based systems when no cache information is available.
There is a patch-set that should fix this at:
https://lore.kernel.org/all/20230327115953.788244-1-pierre.gondois@arm.com/


On 3/23/23 23:42, Radu Rendec wrote:
> Commit 5944ce092b97 ("arch_topology: Build cacheinfo from primary CPU")
> tries to build the cacheinfo from the primary CPU prior to secondary
> CPUs boot, if the DT/ACPI description contains cache information.
> However, if such information is not present, it still reverts to the old
> behavior, which allocates the cacheinfo memory on each secondary CPU. On
> RT kernels, this triggers a "BUG: sleeping function called from invalid
> context" because the allocation is done before preemption is first
> enabled on the secondary CPU.
> 
> The solution is to add cache information to DT/ACPI, but at least on
> aarch64 systems this can be avoided by leveraging automatic detection
> (through the CLIDR_EL1 register), which is already implemented but
> currently doesn't work on RT kernels for the reason described above.
> 
> This patch attempts to enable automatic detection for RT kernels when no
> DT/ACPI cache information is available, by pre-allocating cacheinfo
> memory on the primary CPU. The allocated memory size depends on the
> number of cache leaves, which at that point is unknown without the
> DT/ACPI information. What this patch does is guess the number of cache
> leaves and pre-allocate memory on the primary CPU, then go back and
> reallocate the memory if the guess turns out to be wrong when automatic
> detection eventually runs on the secondary CPU. In that case, it will
> basically revert to the original behavior and still trigger a splat on
> RT kernels. The assumption is that most systems have identical CPUs, so
> the number of cache leaves will be the same on the secondary CPUs as the
> primary CPU. The "guess" uses the number of leaves of the primary CPU.
> 
> If the DT/ACPI cache information is present, the previous behavior of
> pre-allocating memory through init_cpu_topology() is preserved.
> 
> With this patch applied, automatic detection should work on RT kernels
> for all systems with identical CPUs, without requiring modification of the
> DT/ACPI to include the cache information.
> 
> Signed-off-by: Radu Rendec <rrendec@redhat.com>
> ---
>   arch/arm64/kernel/smp.c   |  3 +++
>   drivers/base/cacheinfo.c  | 49 +++++++++++++++++++++++++++++++++++++--
>   include/linux/cacheinfo.h |  1 +
>   3 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 4e8327264255..7ee2c38185d4 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -33,6 +33,7 @@
>   #include <linux/kernel_stat.h>
>   #include <linux/kexec.h>
>   #include <linux/kvm_host.h>
> +#include <linux/cacheinfo.h>
>   
>   #include <asm/alternative.h>
>   #include <asm/atomic.h>
> @@ -730,6 +731,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   	numa_store_cpu_info(this_cpu);
>   	numa_add_cpu(this_cpu);
>   
> +	pre_alloc_cache_info();
> +

Would it work to do the pre-allocation in fetch_cache_info() instead of
returning '-ENOENT'? This would avoid adding a new step in
smp_prepare_cpus() and keep all the logic in cacheinfo.c.
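
A minimal sketch of that shape, for illustration only (hypothetical, and,
as the reply below explains, the boot CPU's own leaf count has not been
detected yet by the time fetch_cache_info() runs):

	/* In fetch_cache_info(), instead of returning -ENOENT: */
	if (!cache_leaves(cpu)) {
		/* Guess the leaf count from the boot CPU... */
		unsigned int guess = cache_leaves(smp_processor_id());

		/* ...but the boot CPU's cacheinfo has not been detected
		 * at this point either, so the guess would be 0. */
		if (!guess)
			return -ENOENT;
		get_cpu_cacheinfo(cpu)->num_leaves = guess;
	}
	return allocate_cache_info(cpu);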


>   	/*
>   	 * If UP is mandated by "nosmp" (which implies "maxcpus=0"), don't set
>   	 * secondary CPUs present.
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index f6573c335f4c..c7d691ef7839 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -28,6 +28,9 @@ static DEFINE_PER_CPU(struct cpu_cacheinfo, ci_cpu_cacheinfo);
>   #define per_cpu_cacheinfo_idx(cpu, idx)		\
>   				(per_cpu_cacheinfo(cpu) + (idx))
>   
> +static DEFINE_PER_CPU(struct cacheinfo *, pre_alloc_ci_list);
> +static unsigned int pre_alloc_ci_leaves;
> +
>   struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
>   {
>   	return ci_cacheinfo(cpu);
> @@ -408,9 +411,51 @@ int __weak populate_cache_leaves(unsigned int cpu)
>   	return -ENOENT;
>   }
>   
> -static inline
> -int allocate_cache_info(int cpu)
> +void pre_alloc_cache_info(void)
>   {
> +	unsigned int leaves = cache_leaves(smp_processor_id());
> +	unsigned int cpu;
> +	struct cacheinfo *ci;
> +
> +	if (!leaves)
> +		return;
> +
> +	for_each_possible_cpu(cpu) {
> +		if (per_cpu_cacheinfo(cpu))
> +			/*
> +			 * Early allocation through init_cpu_topology() was
> +			 * successful, so there is no point in pre-allocating.
> +			 */
> +			continue;
> +
> +		ci = kcalloc(leaves, sizeof(struct cacheinfo), GFP_ATOMIC);
> +		if (!ci) {
> +			for_each_possible_cpu(cpu)
> +				kfree(per_cpu(pre_alloc_ci_list, cpu));
> +			return;
> +		}
> +
> +		per_cpu(pre_alloc_ci_list, cpu) = ci;
> +	}
> +
> +	pre_alloc_ci_leaves = leaves;
> +}
> +
> +static int allocate_cache_info(int cpu)
> +{
> +	struct cacheinfo *ci = per_cpu(pre_alloc_ci_list, cpu);
> +
> +	if (ci) {
> +		per_cpu(pre_alloc_ci_list, cpu) = NULL;
> +
> +		if (cache_leaves(cpu) <= pre_alloc_ci_leaves) {
> +			per_cpu_cacheinfo(cpu) = ci;
> +			return 0;
> +		}
> +
> +		kfree(ci);
> +	}
> +
>   	per_cpu_cacheinfo(cpu) = kcalloc(cache_leaves(cpu),
>   					 sizeof(struct cacheinfo), GFP_ATOMIC);
>   	if (!per_cpu_cacheinfo(cpu)) {
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index 908e19d17f49..23f9dac61d67 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -85,6 +85,7 @@ int populate_cache_leaves(unsigned int cpu);
>   int cache_setup_acpi(unsigned int cpu);
>   bool last_level_cache_is_valid(unsigned int cpu);
>   bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y);
> +void pre_alloc_cache_info(void);
>   int fetch_cache_info(unsigned int cpu);
>   int detect_cache_attributes(unsigned int cpu);
>   #ifndef CONFIG_ACPI_PPTT


Regards,
Pierre
Radu Rendec March 27, 2023, 2:23 p.m. UTC | #2
On Mon, 2023-03-27 at 14:02 +0200, Pierre Gondois wrote:
> About populating the cache info from the CLIDR_EL1 register, it seems
> the information might be incorrect for DT based systems, and the masks
> of L1 data/instruction caches could be advertised as private for
> ACPI/DT based systems when no cache information is available.
> There is a patch-set that should fix this at:
> https://lore.kernel.org/all/20230327115953.788244-1-pierre.gondois@arm.com/

Hello Pierre,

Thanks for pointing out this issue and creating a patch-set to fix it.
I will keep an eye on it. This is definitely important, since we are
planning to rely on the CLIDR_EL1 based detection.

> On 3/23/23 23:42, Radu Rendec wrote:
> > This patch attempts to enable automatic detection for RT kernels when no
> > DT/ACPI cache information is available, by pre-allocating cacheinfo
> > memory on the primary CPU. The allocated memory size depends on the
> > number of cache leaves, which at that point is unknown without the
> > DT/ACPI information. What this patch does is guess the number of cache
> > leaves and pre-allocate memory on the primary CPU, then go back and
> > reallocate the memory if the guess turns out to be wrong when automatic
> > detection eventually runs on the secondary CPU. In that case, it will
> > basically revert to the original behavior and still trigger a splat on
> > RT kernels. The assumption is that most systems have identical CPUs, so
> > the number of cache leaves will be the same on the secondary CPUs as the
> > primary CPU. The "guess" uses the number of leaves of the primary CPU.
> 
> Would it work to do the pre-allocation in fetch_cache_info() instead of
> returning '-ENOENT'? This would avoid adding a new step in
> smp_prepare_cpus() and keep all the logic in cacheinfo.c.

That was my initial intention, and I agree it would be much cleaner.
Unfortunately, the number of cache leaves of the primary CPU hasn't
been detected/populated at the time when fetch_cache_info() is called.
That means we don't know how much memory to pre-allocate.

This is the call flow that I recorded (read from top to bottom):
smp_prepare_cpus
        init_cpu_topology
                parse_acpi_topology
                parse_dt_topology
                fetch_cache_info(cpu)
                        init_of_cache_level(cpu)
                        acpi_get_cache_info(cpu)
                        allocate_cache_info(cpu)
        store_cpu_topology(this_cpu)
                update_siblings_masks
                        detect_cache_attributes


The number of cache leaves of the primary CPU is populated inside that
last call to detect_cache_attributes(). If you can think of any other
way around it, I am definitely open to suggestions.

On a slightly different note, I think most of this code/mechanism is
shared with RISC-V, but smp_prepare_cpus() is ARM64 specific. Unless we
can think of a way to avoid that extra call to pre_alloc_cache_info(),
we may need one for RISC-V as well.

My main intention was to get some feedback and see if you are willing
to accept something along these lines upstream. I can improve it and
send subsequent versions of the patch if needed. Thanks again for
taking the time to review it.

Best regards,
Radu
Pierre Gondois March 29, 2023, 2:42 p.m. UTC | #3
Hello Radu,

On 3/27/23 16:23, Radu Rendec wrote:
> On Mon, 2023-03-27 at 14:02 +0200, Pierre Gondois wrote:
>> About populating the cache info from the CLIDR_EL1 register, it seems
>> the information might be incorrect for DT based systems, and the masks
>> of L1 data/instruction caches could be advertised as private for
>> ACPI/DT based systems when no cache information is available.
>> There is a patch-set that should fix this at:
>> https://lore.kernel.org/all/20230327115953.788244-1-pierre.gondois@arm.com/
> 
> Hello Pierre,
> 
> Thanks for pointing out this issue and creating a patch-set to fix it.
> I will keep an eye on it. This is definitely important, since we are
> planing to rely on the CLIDR_EL1 based detection.
> 
>> On 3/23/23 23:42, Radu Rendec wrote:
>>> This patch attempts to enable automatic detection for RT kernels when no
>>> DT/ACPI cache information is available, by pre-allocating cacheinfo
>>> memory on the primary CPU. The allocated memory size depends on the
>>> number of cache leaves, which at that point is unknown without the
>>> DT/ACPI information. What this patch does is guess the number of cache
>>> leaves and pre-allocate memory on the primary CPU, then go back and
>>> reallocate the memory if the guess turns out to be wrong when automatic
>>> detection eventually runs on the secondary CPU. In that case, it will
>>> basically revert to the original behavior and still trigger a splat on
>>> RT kernels. The assumption is that most systems have identical CPUs, so
>>> the number of cache leaves will be the same on the secondary CPUs as the
>>> primary CPU. The "guess" uses the number of leaves of the primary CPU.
>>
>> Would it work to do the pre-allocation in fetch_cache_info() instead of
>> returning '-ENOENT'? This would avoid adding a new step in
>> smp_prepare_cpus() and keep all the logic in cacheinfo.c.
> 
> That was my initial intention, and I agree it would be much cleaner.
> Unfortunately, the number of cache leaves of the primary CPU hasn't
> been detected/populated at the time when fetch_cache_info() is called.
> That means we don't know how much memory to pre-allocate.
> 
> This is the call flow that I recorded (read from top to bottom):
> smp_prepare_cpus
>          init_cpu_topology
>                  parse_acpi_topology
>                  parse_dt_topology
>                  fetch_cache_info(cpu)
>                          init_of_cache_level(cpu)
>                          acpi_get_cache_info(cpu)
>                          allocate_cache_info(cpu)
>          store_cpu_topology(this_cpu)
>                  update_siblings_masks
>                          detect_cache_attributes

It seems that the role of both fetch_cache_info() and init_cache_level()
is to set the num_levels/num_leaves variables:
- fetch_cache_info() tries to do it from DT/ACPI information.
- init_cache_level() allows arch specific implementations. In particular,
   for arm64, the implementation relies on the content of clidr_el1 and then
   tries to complete the cache info with ACPI/DT.

Maybe it would be worth:
- making fetch_cache_info() try to get the cacheinfo from DT/ACPI,
   and if it fails, fall back to init_cache_level().
- in arm64's init_cache_level() implementation, removing the code that
   tries to get the information from ACPI/DT as fetch_cache_info() would
   have already done it.
- In detect_cache_attributes(), replace init_cache_level() with
   fetch_cache_info().

This would mean that for all architectures, the cacheinfo would come from
ACPI/DT first and, only if that fails, from init_cache_level(). It would be
a big change, but maybe it would be clearer in the end.
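
For illustration, a rough sketch of how fetch_cache_info() might end up
looking under this proposal (hypothetical; error handling simplified, and
acpi_get_cache_info()/init_of_cache_level() used as in the current code):

int fetch_cache_info(unsigned int cpu)
{
	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
	unsigned int levels = 0, split_levels = 0;
	int ret;

	/* Cacheinfo from ACPI/DT first... */
	if (acpi_disabled) {
		ret = init_of_cache_level(cpu);
	} else {
		ret = acpi_get_cache_info(cpu, &levels, &split_levels);
		if (!ret) {
			this_cpu_ci->num_levels = levels;
			this_cpu_ci->num_leaves = levels + split_levels;
		}
	}

	/* ...and only if that fails, from the arch-specific fallback
	 * (CLIDR_EL1 on arm64). */
	if (ret || !cache_leaves(cpu))
		ret = init_cache_level(cpu);
	if (ret || !cache_leaves(cpu))
		return -ENOENT;

	return allocate_cache_info(cpu);
}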

What do you think?

> 
> 
> The number of cache leaves of the primary CPU is populated inside that
> last call to detect_cache_attributes(). If you can think of any other
> way around it, I am definitely open to suggestions.
> 
> On a slightly different note, I think most of this code/mechanism is
> shared with RISC-V, but smp_prepare_cpus() is ARM64 specific. Unless we
> can think of a way to avoid that extra call to pre_alloc_cache_info(),
> we may need one for RISC-V as well.
> 
> My main intention was to get some feedback and see if you are willing
> to accept something along these lines upstream. I can improve it and
> send subsequent versions of the patch if needed. Thanks again for
> taking the time to review it.

Please bear in mind I don't have the final say, and since what is suggested
would impact other architectures, it might be good to also have a
conversation on lkml.

Regards,
Pierre

> 
> Best regards,
> Radu
>
Sudeep Holla March 29, 2023, 3:03 p.m. UTC | #4
On Wed, Mar 29, 2023 at 04:42:07PM +0200, Pierre Gondois wrote:
>
> This would mean that for all architectures, the cacheinfo would come from
> ACPI/DT first.....

x86 doesn't fall into the above category. So we need to ensure it continues
to work with no errors.

--
Regards,
Sudeep
Pierre Gondois March 29, 2023, 3:39 p.m. UTC | #5
On 3/29/23 17:03, Sudeep Holla wrote:
> On Wed, Mar 29, 2023 at 04:42:07PM +0200, Pierre Gondois wrote:
>>
>> This would mean that for all architectures, the cacheinfo would come from
>> ACPI/DT first.....
> 
> x86 doesn't fall into the above category. So we need to ensure it continues
> to work with no errors.

Ok, then maybe having a second arch specific function like
init_cache_level() would work.

This function would be called in fetch_cache_info() after
init_of_cache_level()/acpi_get_cache_info() fail. It would fetch
cache info anywhere but in DT/ACPI.
Archs that don't want it would not implement it, and it would
allow the others to get the num_leaves/levels during early boot.

> 
> --
> Regards,
> Sudeep
Radu Rendec March 29, 2023, 9:35 p.m. UTC | #6
On Wed, 2023-03-29 at 17:39 +0200, Pierre Gondois wrote:
> On 3/29/23 17:03, Sudeep Holla wrote:
> > On Wed, Mar 29, 2023 at 04:42:07PM +0200, Pierre Gondois wrote:
> > > 
> > > This would mean that for all architectures, the cacheinfo would come from
> > > ACPI/DT first.....
> > 
> > x86 doesn't fall into the above category. So we need to ensure it continues
> > to work with no errors.
> 
> Ok, then maybe having a second arch specific function like
> init_cache_level() would work.
> 
> This function would be called in fetch_cache_info() after
> init_of_cache_level()/acpi_get_cache_info() fail. It would fetch
> cache info anywhere but in DT/ACPI.
> Archs that don't want it would not implement it, and it would
> allow the others to get the num_leaves/levels during early boot.

Hello Pierre,

If I understand correctly, in the case of arm64 this new function would
use CLIDR_EL1 to detect the number of leaves/levels, right? But since
init_cpu_topology() calls fetch_cache_info() for each CPU, doesn't this
mean we would end up doing CLIDR_EL1 based detection for the secondary
CPUs by running the (arch specific) detection code on the primary CPU?

My intimate knowledge of arm64 is very limited, but I *assumed* one of
the reasons why detect_cache_attributes() (and init_cache_level()) run
on the secondary CPU today is because not all CPUs are necessarily
identical. Another possible reason I can think of is because maybe on
some architectures auto-detection isn't possible altogether before the
secondary CPU is brought up.

In particular, for arm64 is it possible that CLIDR_EL1 may not look the
same depending on the CPU that reads it? What about SoCs with
asymmetrical CPU cores? (no concrete example here, just assuming this
is a real/possible thing)

Best regards,
Radu
Pierre Gondois March 30, 2023, 6:57 a.m. UTC | #7
On 3/29/23 23:35, Radu Rendec wrote:
> On Wed, 2023-03-29 at 17:39 +0200, Pierre Gondois wrote:
>> On 3/29/23 17:03, Sudeep Holla wrote:
>>> On Wed, Mar 29, 2023 at 04:42:07PM +0200, Pierre Gondois wrote:
>>>>
>>>> This would mean that for all architectures, the cacheinfo would come from
>>>> ACPI/DT first.....
>>>
>>> x86 doesn't fall into the above category. So we need to ensure it continues
>>> to work with no errors.
>>
>> Ok, then maybe having a second arch specific function like
>> init_cache_level() would work.
>>
>> This function would be called in fetch_cache_info() after
>> init_of_cache_level()/acpi_get_cache_info() fail. It would fetch
>> cache info anywhere but in DT/ACPI.
>> Archs that don't want it would not implement it, and it would
>> allow the others to get the num_leaves/levels during early boot.
> 
> Hello Pierre,
> 
> If I understand correctly, in the case of arm64 this new function would
> use CLIDR_EL1 to detect the number of leaves/levels, right? But since
> init_cpu_topology() calls fetch_cache_info() for each CPU, doesn't this
> mean we would end up doing CLIDR_EL1 based detection for the secondary
> CPUs by running the (arch specific) detection code on the primary CPU?

Yes indeed, this would rely on the assumption made in the RFC that
the platform is symmetrical (i.e. all CPUs have the same number/level
of caches).

> 
> My intimate knowledge of arm64 is very limited, but I *assumed* one of
> the reasons why detect_cache_attributes() (and init_cache_level()) run
> on the secondary CPU today is because not all CPUs are necessarily
> identical. Another possible reason I can think of is because maybe on
> some architectures auto-detection isn't possible altogether before the
> secondary CPU is brought up.

Yes I think you are right.

> 
> In particular, for arm64 is it possible that CLIDR_EL1 may not look the
> same depending on the CPU that reads it? What about SoCs with
> asymmetrical CPU cores? (no concrete example here, just assuming this
> is a real/possible thing)

This would indeed be an issue if all the CPUs don't have the same number/level
of caches. In case there is no DT/ACPI, it should be possible to:
- from the primary CPU using CLIDR_EL1, allocate the cacheinfo (making the
   assumption the platform is symmetrical)
- from the secondary CPUs, if we know a pre-allocation has been made,
   run init_cache_level() and check the pre-allocation was correct.
   If not, re-allocate the cacheinfo (and trigger a warning).

I think this is more or less what was done in the RFC, the only difference
being there is no call from smp_prepare_cpus(), or did I miss something?

Regards,
Pierre
Radu Rendec March 30, 2023, 11:32 p.m. UTC | #8
On Thu, 2023-03-30 at 08:57 +0200, Pierre Gondois wrote:
> On 3/29/23 23:35, Radu Rendec wrote:
> > On Wed, 2023-03-29 at 17:39 +0200, Pierre Gondois wrote:
> > > On 3/29/23 17:03, Sudeep Holla wrote:
> > > > On Wed, Mar 29, 2023 at 04:42:07PM +0200, Pierre Gondois wrote:
> > > > > 
> > > > > This would mean that for all architectures, the cacheinfo would come from
> > > > > ACPI/DT first.....
> > > > 
> > > > x86 doesn't fall into the above category. So we need to ensure it continues
> > > > to work with no errors.
> > > 
> > > Ok, then maybe having a second arch specific function like
> > > init_cache_level() would work.
> > > 
> > > This function would be called in fetch_cache_info() after
> > > init_of_cache_level()/acpi_get_cache_info() fail. It would fetch
> > > cache info anywhere but in DT/ACPI.
> > > Archs that don't want it would not implement it, and it would
> > > allow the others to get the num_leaves/levels during early boot.
> > 
> > In particular, for arm64 is it possible that CLIDR_EL1 may not look the
> > same depending on the CPU that reads it? What about SoCs with
> > asymmetrical CPU cores? (no concrete example here, just assuming this
> > is a real/possible thing)
> 
> This would indeed be an issue if all the CPUs don't have the same number/level
> of caches. In case there is no DT/ACPI, it should be possible to:
> - from the primary CPU using CLIDR_EL1, allocate the cacheinfo (making the
>    assumption the platform is symmetrical)
> - from the secondary CPUs, if we know a pre-allocation has been made,
>    run init_cache_level() and check the pre-allocation was correct.
>    If not, re-allocate the cacheinfo (and trigger a warning).

That makes sense. The question here is how do we know a pre-allocation
has been made? If we modified fetch_cache_info() the way you proposed,
then we would always pre-allocate memory, and using the pointer alone
it would be impossible to tell if the pre-allocation is based on a
"guessed" size.

Also I'm not sure about the "trigger a warning" part. If it's an RT
kernel, it will try to reallocate and that will be a "bug" splat
anyway. If not, the reallocation will fix the issue automatically.
Maybe a pr_warn() as opposed to a WARN_ON() (or perhaps that's what you
had in mind as well).

> I think this is more or less what was done in the RFC, the only difference
> being there is no call from smp_prepare_cpus(), or did I miss something?

Yes, it's more or less the same thing. The main differences are:
 * There is no window when the information in struct cpu_cacheinfo (the
   number of cache levels and leaves) is populated but potentially
   inaccurate (e.g. in the case of asymmetric CPUs with no DT/ACPI).
 * The logic in detect_cache_attributes() remains unchanged because the
   allocated memory is "published" to ci_cpu_cacheinfo only if the size
   is known to be correct (i.e. it comes from DT/ACPI).

Based on what you proposed, I hacked away at the code and came up with
the patch below. It works in my particular test case because the CPUs
are symmetrical, but doesn't solve the reallocation part.

I can't think of an elegant way to solve this. One thing that comes to
mind is to change the logic in detect_cache_attributes(). But I am
reluctant because this code is architecture independent, while the
"guess first/fix later" strategy is implementation specific (arm64 in
our case). The main problem is that the call to init_cache_level() is
skipped altogether if there is a valid pointer value in
ci_cpu_cacheinfo. In other words, if the memory has been already
allocated, it is assumed to be correct. This is why I used a separate
per-cpu variable in my original RFC patch.

Would it be horrible to add a bool field to struct cpu_cacheinfo that
tells us if the levels/leaves are based on early detection? The field
would be set by fetch_cache_info() when it calls early_cache_level() as
fallback. Then detect_cache_attributes() could avoid skipping the call
to init_cache_level() when the flag is set. The nice thing about this
is that it doesn't affect other architectures - if they don't implement
early_cache_level(), the flag is never set and then the behavior of
detect_cache_attributes() is basically unchanged.
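
A sketch of what that could look like (the field name is made up here, and
the reallocation path in detect_cache_attributes() is left out):

struct cpu_cacheinfo {
	struct cacheinfo *info_list;
	unsigned int num_levels;
	unsigned int num_leaves;
	bool cpu_map_populated;
	bool early_ci_levels;	/* levels/leaves came from an early guess */
};

	/* fetch_cache_info(), when falling back to early detection: */
	if (ret || !cache_leaves(cpu)) {
		ret = early_cache_level(cpu);
		if (ret < 0)
			return ret;
		this_cpu_ci->early_ci_levels = true;
	}

	/* detect_cache_attributes(), on the secondary CPU: re-run
	 * init_cache_level() whenever the levels were only guessed. */
	if (!per_cpu_cacheinfo(cpu) || get_cpu_cacheinfo(cpu)->early_ci_levels) {
		ret = init_cache_level(cpu);
		if (ret || !cache_leaves(cpu))
			return -ENOENT;
	}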

Best regards,
Radu

From a79542d43a0ece06e13bfd431abc98299a9747f2 Mon Sep 17 00:00:00 2001
From: Radu Rendec <rrendec@redhat.com>
Date: Thu, 30 Mar 2023 18:18:46 -0400
Subject: [PATCH] Allocate cacheinfo early - WIP #1

Signed-off-by: Radu Rendec <rrendec@redhat.com>
---
 arch/arm64/kernel/cacheinfo.c | 22 ++++++++++++++++++++++
 drivers/base/cacheinfo.c      | 12 ++++++++++--
 include/linux/cacheinfo.h     |  1 +
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/cacheinfo.c b/arch/arm64/kernel/cacheinfo.c
index c307f69e9b55..6f5ac5c385f0 100644
--- a/arch/arm64/kernel/cacheinfo.c
+++ b/arch/arm64/kernel/cacheinfo.c
@@ -38,6 +38,27 @@ static void ci_leaf_init(struct cacheinfo *this_leaf,
 	this_leaf->type = type;
 }
 
+int early_cache_level(unsigned int cpu)
+{
+	unsigned int ctype, level, leaves;
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+
+	for (level = 1, leaves = 0; level <= MAX_CACHE_LEVEL; level++) {
+		ctype = get_cache_type(level);
+		if (ctype == CACHE_TYPE_NOCACHE) {
+			level--;
+			break;
+		}
+		/* Separate instruction and data caches */
+		leaves += (ctype == CACHE_TYPE_SEPARATE) ? 2 : 1;
+	}
+
+	this_cpu_ci->num_levels = level;
+	this_cpu_ci->num_leaves = leaves;
+	return 0;
+}
+
+#if 0
 int init_cache_level(unsigned int cpu)
 {
 	unsigned int ctype, level, leaves;
@@ -76,6 +97,7 @@ int init_cache_level(unsigned int cpu)
 	this_cpu_ci->num_leaves = leaves;
 	return 0;
 }
+#endif
 
 int populate_cache_leaves(unsigned int cpu)
 {
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index f6573c335f4c..4640671b4547 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -398,6 +398,11 @@ static void free_cache_attributes(unsigned int cpu)
 	cache_shared_cpu_map_remove(cpu);
 }
 
+int __weak early_cache_level(unsigned int cpu)
+{
+	return -ENOENT;
+}
+
 int __weak init_cache_level(unsigned int cpu)
 {
 	return -ENOENT;
@@ -446,8 +451,11 @@ int fetch_cache_info(unsigned int cpu)
 		 */
 		this_cpu_ci->num_leaves = levels + split_levels;
 	}
-	if (!cache_leaves(cpu))
-		return -ENOENT;
+	if (!cache_leaves(cpu)) {
+		ret = early_cache_level(cpu);
+		if (ret < 0)
+			return ret;
+	}
 
 	return allocate_cache_info(cpu);
 }
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index 908e19d17f49..9788100a7d3c 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -79,6 +79,7 @@ struct cpu_cacheinfo {
 };
 
 struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu);
+int early_cache_level(unsigned int cpu);
 int init_cache_level(unsigned int cpu);
 int init_of_cache_level(unsigned int cpu);
 int populate_cache_leaves(unsigned int cpu);
Pierre Gondois March 31, 2023, 2:49 p.m. UTC | #9
Hello Radu,

On 3/31/23 01:32, Radu Rendec wrote:
> On Thu, 2023-03-30 at 08:57 +0200, Pierre Gondois wrote:
>> On 3/29/23 23:35, Radu Rendec wrote:
>>> On Wed, 2023-03-29 at 17:39 +0200, Pierre Gondois wrote:
>>>> On 3/29/23 17:03, Sudeep Holla wrote:
>>>>> On Wed, Mar 29, 2023 at 04:42:07PM +0200, Pierre Gondois wrote:
>>>>>>
>>>>>> This would mean that for all architectures, the cacheinfo would come from
>>>>>> ACPI/DT first.....
>>>>>
>>>>> x86 doesn't fall into the above category. So we need to ensure it continues
>>>>> to work with no errors.
>>>>
>>>> Ok, then maybe having a second arch specific function like
>>>> init_cache_level() would work.
>>>>
>>>> This function would be called in fetch_cache_info() after
>>>> init_of_cache_level()/acpi_get_cache_info() fail. It would fetch
>>>> cache info anywhere but in DT/ACPI.
>>>> Archs that don't want it would not implement it, and it would
>>>> allow the others to get the num_leaves/levels during early boot.
>>>
>>> In particular, for arm64 is it possible that CLIDR_EL1 may not look the
>>> same depending on the CPU that reads it? What about SoCs with
>>> asymmetrical CPU cores? (no concrete example here, just assuming this
>>> is a real/possible thing)
>>
>> This would indeed be an issue if all the CPUs don't have the same number/level
>> of caches. In case there is no DT/ACPI, it should be possible to:
>> - from the primary CPU using CLIDR_EL1, allocate the cacheinfo (making the
>>     assumption the platform is symmetrical)
>> - from the secondary CPUs, if we know a pre-allocation has been made,
>>     run init_cache_level() and check the pre-allocation was correct.
>>     If not, re-allocate the cacheinfo (and trigger a warning).
> 
> That makes sense. The question here is how do we know a pre-allocation
> has been made? If we modified fetch_cache_info() the way you proposed,
> then we would always pre-allocate memory, and using the pointer alone
> it would be impossible to tell if the pre-allocation is based on a
> "guessed" size.
> 
> Also I'm not sure about the "trigger a warning" part. If it's an RT
> kernel, it will try to reallocate and that will be a "bug" splat
> anyway. If not, the reallocation will fix the issue automatically.
> Maybe a pr_warn() as opposed to a WARN_ON() (or perhaps that's what you
> had in mind as well).

Sorry, I phrased that badly. I just meant that the "BUG: sleeping function
called from invalid context" splat would be triggered.

> 
>> I think this is more or less what was done in the RFC, the only difference
>> being there is no call from smp_prepare_cpus(), or did I miss something ?
> 
> Yes, it's more or less the same thing. The main differences are:
>   * There is no window when the information in struct cpu_cacheinfo (the
>     number of cache levels and leaves) is populated but potentially
>     inaccurate (e.g. in the case of asymmetric CPUs with no DT/ACPI).

In the current code, fetch_cache_info() allocates the memory and populates
num_levels/num_leaves from the primary CPU. The information is populated
again later, from the secondary CPUs. So there is already a time window
during which the information is inaccurate.

>   * The logic in detect_cache_attributes() remains unchanged because the
>     allocated memory is "published" to ci_cpu_cacheinfo only if the size
>     is known to be correct (i.e. it comes from DT/ACPI).

Yes right.

> 
> Based on what you proposed, I hacked away at the code and came up with
> the patch below. It works in my particular test case because the CPUs
> are symmetrical, but doesn't solve the reallocation part.
> 
> I can't think of an elegant way to solve this. One thing that comes to
> mind is to change the logic in detect_cache_attributes(). But I am
> reluctant because this code is architecture independent, while the
> "guess first/fix later" strategy is implementation specific (arm64 in
> our case). The main problem is that the call to init_cache_level() is
> skipped altogether if there is a valid pointer value in
> ci_cpu_cacheinfo. In other words, if the memory has been already
> allocated, it is assumed to be correct. This is why I used a separate
> per-cpu variable in my original RFC patch.
> 
> Would it be horrible to add a bool field to struct cpu_cacheinfo that
> tells us if the levels/leaves are based on early detection? The field
> would be set by fetch_cache_info() when it calls early_cache_level() as
> fallback. Then detect_cache_attributes() could avoid skipping the call
> to init_cache_level() when the flag is set. The nice thing about this
> is that it doesn't affect other architectures - if they don't implement
> early_cache_level(), the flag is never set and then the behavior of
> detect_cache_attributes() is basically unchanged.

FWIW I prefer the solution with the bool/flag over the per-cpu variable.
At:
https://lore.kernel.org/all/20230327115953.788244-4-pierre.gondois@arm.com/
I suggested adding a 'bool use_arch_info;' and nobody seemed to complain
about it (so far). I assume the field could be re-used for this purpose.

> 
> Best regards,
> Radu
> 
>  From a79542d43a0ece06e13bfd431abc98299a9747f2 Mon Sep 17 00:00:00 2001
> From: Radu Rendec <rrendec@redhat.com>
> Date: Thu, 30 Mar 2023 18:18:46 -0400
> Subject: [PATCH] Allocate cacheinfo early - WIP #1
> 
> Signed-off-by: Radu Rendec <rrendec@redhat.com>
> ---
>   arch/arm64/kernel/cacheinfo.c | 22 ++++++++++++++++++++++
>   drivers/base/cacheinfo.c      | 12 ++++++++++--
>   include/linux/cacheinfo.h     |  1 +
>   3 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/cacheinfo.c b/arch/arm64/kernel/cacheinfo.c
> index c307f69e9b55..6f5ac5c385f0 100644
> --- a/arch/arm64/kernel/cacheinfo.c
> +++ b/arch/arm64/kernel/cacheinfo.c
> @@ -38,6 +38,27 @@ static void ci_leaf_init(struct cacheinfo *this_leaf,
>  	this_leaf->type = type;
>  }
>  
> +int early_cache_level(unsigned int cpu)
> +{
> +	unsigned int ctype, level, leaves;
> +	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
> +
> +	for (level = 1, leaves = 0; level <= MAX_CACHE_LEVEL; level++) {
> +		ctype = get_cache_type(level);
> +		if (ctype == CACHE_TYPE_NOCACHE) {
> +			level--;
> +			break;
> +		}
> +		/* Separate instruction and data caches */
> +		leaves += (ctype == CACHE_TYPE_SEPARATE) ? 2 : 1;
> +	}
> +
> +	this_cpu_ci->num_levels = level;
> +	this_cpu_ci->num_leaves = leaves;
> +	return 0;
> +}
> +
> +#if 0
>  int init_cache_level(unsigned int cpu)
>  {
>  	unsigned int ctype, level, leaves;
> @@ -76,6 +97,7 @@ int init_cache_level(unsigned int cpu)
>  	this_cpu_ci->num_leaves = leaves;
>  	return 0;
>  }
> +#endif
>  
>  int populate_cache_leaves(unsigned int cpu)
>  {
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index f6573c335f4c..4640671b4547 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -398,6 +398,11 @@ static void free_cache_attributes(unsigned int cpu)
>  	cache_shared_cpu_map_remove(cpu);
>  }
>  
> +int __weak early_cache_level(unsigned int cpu)
> +{
> +	return -ENOENT;
> +}
> +
>  int __weak init_cache_level(unsigned int cpu)
>  {
>  	return -ENOENT;
> @@ -446,8 +451,11 @@ int fetch_cache_info(unsigned int cpu)
>  		 */
>  		this_cpu_ci->num_leaves = levels + split_levels;
>  	}
> -	if (!cache_leaves(cpu))
> -		return -ENOENT;
> +	if (!cache_leaves(cpu)) {
> +		ret = early_cache_level(cpu);
> +		if (ret < 0)
> +			return ret;
> +	}

If there is no PPTT, acpi_get_cache_info() should return an error code.
The patch-set at:
https://lore.kernel.org/all/20230327115953.788244-1-pierre.gondois@arm.com/
should make init_of_cache_level() return an error code as well.

I think these error codes should be caught instead of checking:
+	if (!cache_leaves(cpu))

>  
>  	return allocate_cache_info(cpu);
>  }
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index 908e19d17f49..9788100a7d3c 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -79,6 +79,7 @@ struct cpu_cacheinfo {
>  };
>  
>  struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu);
> +int early_cache_level(unsigned int cpu);
>  int init_cache_level(unsigned int cpu);
>  int init_of_cache_level(unsigned int cpu);
>  int populate_cache_leaves(unsigned int cpu);

Regards,
Pierre
Radu Rendec April 3, 2023, 9:25 p.m. UTC | #10
Hello Pierre,

On Fri, 2023-03-31 at 16:49 +0200, Pierre Gondois wrote:
> On 3/31/23 01:32, Radu Rendec wrote:
> 
> > Yes, it's more or less the same thing. The main differences are:
> >   * There is no window when the information in struct cpu_cacheinfo (the
> >     number of cache levels and leaves) is populated but potentially
> >     inaccurate (e.g. in the case of asymmetric CPUs with no DT/ACPI).
> 
> In the current code, fetch_cache_info() allocates the memory and populates
> num_levels/num_leaves from the primary CPU. The information is populated
> again later, from the secondary CPUs. So there is already a time window
> during which the information is inaccurate.

Fair enough. Then this is not something to worry about.

> > Based on what you proposed, I hacked away at the code and came up with
> > the patch below. It works in my particular test case because the CPUs
> > are symmetrical, but doesn't solve the reallocation part.
> > 
> > I can't think of an elegant way to solve this. One thing that comes to
> > mind is to change the logic in detect_cache_attributes(). But I am
> > reluctant because this code is architecture independent, while the
> > "guess first/fix later" strategy is implementation specific (arm64 in
> > our case). The main problem is that the call to init_cache_level() is
> > skipped altogether if there is a valid pointer value in
> > ci_cpu_cacheinfo. In other words, if the memory has been already
> > allocated, it is assumed to be correct. This is why I used a separate
> > per-cpu variable in my original RFC patch.
> > 
> > Would it be horrible to add a bool field to struct cpu_cacheinfo that
> > tells us if the levels/leaves are based on early detection? The field
> > would be set by fetch_cache_info() when it calls early_cache_level() as
> > fallback. Then detect_cache_attributes() could avoid skipping the call
> > to init_cache_level() when the flag is set. The nice thing about this
> > is that it doesn't affect other architectures - if they don't implement
> > early_cache_level(), the flag is never set and then the behavior of
> > detect_cache_attributes() is basically unchanged.
> 
> FWIW I prefer the solution with the bool/flag over the per-cpu variable.
> At:
> https://lore.kernel.org/all/20230327115953.788244-4-pierre.gondois@arm.com/
> I suggested adding a 'bool use_arch_info;' and nobody seemed to complain
> about it (so far). I assume the field could be re-used for this purpose.

In that patch, 'bool use_arch_info' is a local variable (not a field in
struct cpu_cacheinfo), so I'm not exactly sure how it can be reused.
But for now I'm going to assume it should be fine to add a field :)

I will prepare a v2 patch based on everything we have discussed so far
and send it, then I guess we can take it from there.

> > @@ -446,8 +451,11 @@ int fetch_cache_info(unsigned int cpu)
> >  		 */
> >  		this_cpu_ci->num_leaves = levels + split_levels;
> >  	}
> > -	if (!cache_leaves(cpu))
> > -		return -ENOENT;
> > +	if (!cache_leaves(cpu)) {
> > +		ret = early_cache_level(cpu);
> > +		if (ret < 0)
> > +			return ret;
> > +	}
> 
> If there are no PPTT, acpi_get_cache_info() should return an error code.

What if the PPTT tables are there and acpi_get_cache_info() returns 0
but the number of leaves is still 0 in the end? This *does* happen in
my test environment (qemu -M virt -cpu cortex-a57). I agree that then
the system is likely broken (and in my case it's probably a qemu bug),
but my point is that the kernel should still try its best to do the
right thing and fall back to auto-detection.

> The patch-set at:
> https://lore.kernel.org/all/20230327115953.788244-1-pierre.gondois@arm.com/
> should make init_of_cache_level() return error code aswell.
> 
> I think these error codes should be caught instead of checking:
> + if (!cache_leaves(cpu))


You have a valid point here. We *must* fall back to auto-detection if
there is an error reading the DT/ACPI. Currently the function returns
early in that case.

Maybe we should check both the error code and the number of leaves? But
in that case, the error code becomes irrelevant, since the number of
leaves is likely 0 anyway if there is an error.
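
Something along these lines, perhaps (a sketch of the tail of
fetch_cache_info(), falling back whenever DT/ACPI either returns an error
or ends up with zero leaves):

	if (ret || !cache_leaves(cpu)) {
		ret = early_cache_level(cpu);
		if (ret)
			return ret;
		if (!cache_leaves(cpu))
			return -ENOENT;
	}

	return allocate_cache_info(cpu);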

Best regards,
Radu

Patch

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 4e8327264255..7ee2c38185d4 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -33,6 +33,7 @@ 
 #include <linux/kernel_stat.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/cacheinfo.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -730,6 +731,8 @@  void __init smp_prepare_cpus(unsigned int max_cpus)
 	numa_store_cpu_info(this_cpu);
 	numa_add_cpu(this_cpu);
 
+	pre_alloc_cache_info();
+
 	/*
 	 * If UP is mandated by "nosmp" (which implies "maxcpus=0"), don't set
 	 * secondary CPUs present.
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index f6573c335f4c..c7d691ef7839 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -28,6 +28,9 @@  static DEFINE_PER_CPU(struct cpu_cacheinfo, ci_cpu_cacheinfo);
 #define per_cpu_cacheinfo_idx(cpu, idx)		\
 				(per_cpu_cacheinfo(cpu) + (idx))
 
+static DEFINE_PER_CPU(struct cacheinfo *, pre_alloc_ci_list);
+static unsigned int pre_alloc_ci_leaves;
+
 struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
 {
 	return ci_cacheinfo(cpu);
@@ -408,9 +411,51 @@  int __weak populate_cache_leaves(unsigned int cpu)
 	return -ENOENT;
 }
 
-static inline
-int allocate_cache_info(int cpu)
+void pre_alloc_cache_info(void)
 {
+	unsigned int leaves = cache_leaves(smp_processor_id());
+	unsigned int cpu;
+	struct cacheinfo *ci;
+
+	if (!leaves)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		if (per_cpu_cacheinfo(cpu))
+			/*
+			 * Early allocation through init_cpu_topology() was
+			 * successful, so there is no point in pre-allocating.
+			 */
+			continue;
+
+		ci = kcalloc(leaves, sizeof(struct cacheinfo), GFP_ATOMIC);
+		if (!ci) {
+			for_each_possible_cpu(cpu)
+				kfree(per_cpu(pre_alloc_ci_list, cpu));
+			return;
+		}
+
+		per_cpu(pre_alloc_ci_list, cpu) = ci;
+	}
+
+	pre_alloc_ci_leaves = leaves;
+}
+
+static int allocate_cache_info(int cpu)
+{
+	struct cacheinfo *ci = per_cpu(pre_alloc_ci_list, cpu);
+
+	if (ci) {
+		per_cpu(pre_alloc_ci_list, cpu) = NULL;
+
+		if (cache_leaves(cpu) <= pre_alloc_ci_leaves) {
+			per_cpu_cacheinfo(cpu) = ci;
+			return 0;
+		}
+
+		kfree(ci);
+	}
+
 	per_cpu_cacheinfo(cpu) = kcalloc(cache_leaves(cpu),
 					 sizeof(struct cacheinfo), GFP_ATOMIC);
 	if (!per_cpu_cacheinfo(cpu)) {
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index 908e19d17f49..23f9dac61d67 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -85,6 +85,7 @@  int populate_cache_leaves(unsigned int cpu);
 int cache_setup_acpi(unsigned int cpu);
 bool last_level_cache_is_valid(unsigned int cpu);
 bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y);
+void pre_alloc_cache_info(void);
 int fetch_cache_info(unsigned int cpu);
 int detect_cache_attributes(unsigned int cpu);
 #ifndef CONFIG_ACPI_PPTT