
arm64: topology: Stop using MPIDR for topology information

Message ID 20200829130016.26106-1-valentin.schneider@arm.com (mailing list archive)
State New, archived
Series arm64: topology: Stop using MPIDR for topology information

Commit Message

Valentin Schneider Aug. 29, 2020, 1 p.m. UTC
In the absence of ACPI or DT topology data, we fall back to haphazardly
decoding *something* out of MPIDR. Sadly, the contents of that register are
mostly unusable due to the implementation leniency and things like Aff0
having to be capped to 15 (despite being encoded in 8 bits).

Consider a simple system with a single package of 32 cores, all under the
same LLC. We ought to be shoving them in the same core_sibling mask, but
MPIDR is going to look like:

  | CPU  | 0 | ... | 15 | 16 | ... | 31 |
  |------+---+-----+----+----+-----+----+
  | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
  | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
  | Aff2 | 0 | ... |  0 |  0 | ... |  0 |

Which will eventually yield

  core_sibling(0-15)  == 0-15
  core_sibling(16-31) == 16-31
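
To make that split concrete, here is a small standalone sketch (illustrative
only, not kernel code; it assumes firmware caps Aff0 at 15 and uses the
pre-patch non-MT decoding, ignoring Aff3) of how the two package IDs, and
hence the two core_sibling masks, fall out:

  #include <stdio.h>

  /* Extract an 8-bit affinity level from an MPIDR-style value */
  #define AFF(mpidr, level)  (((mpidr) >> ((level) * 8)) & 0xff)

  int main(void)
  {
          for (unsigned int cpu = 0; cpu < 32; cpu++) {
                  /* Aff0 is capped at 15, so Aff1 increments every 16 CPUs */
                  unsigned long mpidr = (cpu % 16) | ((cpu / 16) << 8);

                  /* Pre-patch non-MT decoding: core = Aff0, package = Aff1 | Aff2 << 8 */
                  unsigned int core_id    = AFF(mpidr, 0);
                  unsigned int package_id = AFF(mpidr, 1) | AFF(mpidr, 2) << 8;

                  printf("CPU%2u: core_id=%2u package_id=%u\n", cpu, core_id, package_id);
          }

          /* CPUs 0-15 report package 0, CPUs 16-31 report package 1 */
          return 0;
  }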

NUMA woes
=========

If we try to play games with this and set up NUMA boundaries within those
groups of 16 cores via e.g. QEMU:

  # Node0: 0-9; Node1: 10-19
  $ qemu-system-aarch64 <blah> \
    -smp 20 -numa node,cpus=0-9,nodeid=0 -numa node,cpus=10-19,nodeid=1

The scheduler's MC domain (all CPUs with same LLC) is going to be built via

  arch_topology.c::cpu_coregroup_mask()

In there we try to figure out a sensible mask from the topology
information we have. In short, here we'll pick the smaller of the NUMA and
core sibling masks (a simplified sketch of that selection follows).
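
A rough paraphrase of that selection logic (simplified; the real
cpu_coregroup_mask() in drivers/base/arch_topology.c of this era also
considers LLC siblings reported by ACPI PPTT, and the field names below
follow struct cpu_topology):

  const struct cpumask *cpu_coregroup_mask(int cpu)
  {
          /* Start from the CPU's NUMA node span... */
          const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

          /* ...and shrink to the core siblings if they fit inside the node */
          if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask))
                  core_mask = &cpu_topology[cpu].core_sibling;

          return core_mask;
  }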

  node_mask(CPU9)    == 0-9
  core_sibling(CPU9) == 0-15

MC mask for CPU9 will thus be 0-9, not a problem.

  node_mask(CPU10)    == 10-19
  core_sibling(CPU10) == 0-15

MC mask for CPU10 will thus be 10-19, not a problem.

  node_mask(CPU16)    == 10-19
  core_sibling(CPU16) == 16-19

MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
different unique MC spans, and the scheduler has no idea what to make of
that. That triggers the WARN_ON() added by commit

  ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
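
Roughly, the check added by that commit (a paraphrase of
kernel/sched/topology.c::topology_span_sane(), not verbatim) requires that,
for every non-NUMA topology level, the masks of any two CPUs be either
identical or disjoint; a partial overlap like the one above is flagged:

  static bool topology_span_sane(struct sched_domain_topology_level *tl,
                                 const struct cpumask *cpu_map, int cpu)
  {
          int i;

          /* NUMA levels are allowed to overlap by construction */
          if (tl->flags & SDTL_OVERLAP)
                  return true;

          for_each_cpu(i, cpu_map) {
                  if (i == cpu)
                          continue;

                  /* Neither equal nor disjoint means a broken hierarchy */
                  if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
                      cpumask_intersects(tl->mask(cpu), tl->mask(i)))
                          return false;
          }

          return true;
  }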

Fixing MPIDR-derived topology
=============================

We could try to come up with some cleverer scheme to figure out which of
the available masks to pick, but really if one of those masks resulted from
MPIDR then it should be discarded because it's bound to be bogus.

I was hoping to give MPIDR a chance for SMT, to figure out which threads are
in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
to me that there are systems out there where *all* cores have non-zero
values in their higher affinity fields (e.g. RK3288 has "5" in all of its
cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.

Stop using MPIDR for topology information. When no other source of topology
information is available, mark each CPU as its own core and its NUMA node
as its LLC domain.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 arch/arm64/kernel/topology.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

Comments

Zengtao (B) Sept. 2, 2020, 3:24 a.m. UTC | #1
Hi Valentin:

> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Saturday, August 29, 2020 9:00 PM
> To: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> Cc: Catalin Marinas; Will Deacon; Sudeep Holla; Robin Murphy; Jeremy
> Linton; Dietmar Eggemann; Morten Rasmussen; Zengtao (B)
> Subject: [PATCH] arm64: topology: Stop using MPIDR for topology
> information
> 
> In the absence of ACPI or DT topology data, we fallback to haphazardly
> decoding *something* out of MPIDR. Sadly, the contents of that register
> are
> mostly unusable due to the implementation leniency and things like Aff0
> having to be capped to 15 (despite being encoded on 8 bits).
> 
> Consider a simple system with a single package of 32 cores, all under the
> same LLC. We ought to be shoving them in the same core_sibling mask,
> but
> MPIDR is going to look like:
> 
>   | CPU  | 0 | ... | 15 | 16 | ... | 31 |
>   |------+---+-----+----+----+-----+----+
>   | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
>   | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
>   | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
> 
> Which will eventually yield
> 
>   core_sibling(0-15)  == 0-15
>   core_sibling(16-31) == 16-31
> 
> NUMA woes
> =========
> 
> If we try to play games with this and set up NUMA boundaries within those
> groups of 16 cores via e.g. QEMU:
> 
>   # Node0: 0-9; Node1: 10-19
>   $ qemu-system-aarch64 <blah> \
>     -smp 20 -numa node,cpus=0-9,nodeid=0 -numa
> node,cpus=10-19,nodeid=1
> 
> The scheduler's MC domain (all CPUs with same LLC) is going to be built via
> 
>   arch_topology.c::cpu_coregroup_mask()
> 
> In there we try to figure out a sensible mask out of the topology
> information we have. In short, here we'll pick the smallest of NUMA or
> core sibling mask.
> 
>   node_mask(CPU9)    == 0-9
>   core_sibling(CPU9) == 0-15
> 
> MC mask for CPU9 will thus be 0-9, not a problem.
> 
>   node_mask(CPU10)    == 10-19
>   core_sibling(CPU10) == 0-15
> 
> MC mask for CPU10 will thus be 10-19, not a problem.
> 
>   node_mask(CPU16)    == 10-19
>   core_sibling(CPU16) == 16-19
> 
> MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
> different unique MC spans, and the scheduler has no idea what to make of
> that. That triggers the WARN_ON() added by commit
> 
>   ccf74128d66c ("sched/topology: Assert non-NUMA topology masks
> don't (partially) overlap")
> 
> Fixing MPIDR-derived topology
> =============================
> 
> We could try to come up with some cleverer scheme to figure out which of
> the available masks to pick, but really if one of those masks resulted from
> MPIDR then it should be discarded because it's bound to be bogus.
> 
> I was hoping to give MPIDR a chance for SMT, to figure out which threads
> are
> in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
> to me that there are systems out there where *all* cores have non-zero
> values in their higher affinity fields (e.g. RK3288 has "5" in all of its
> cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
> 
> Stop using MPIDR for topology information. When no other source of
> topology
> information is available, mark each CPU as its own core and its NUMA
> node
> as its LLC domain.

I agree with your idea to remove the topology functionality of MPIDR,
but I think we also need to consider ARM32 and GIC.

> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  arch/arm64/kernel/topology.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 0801a0f3c156..ff1dd1dbfe64 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -36,21 +36,23 @@ void store_cpu_topology(unsigned int cpuid)
>  	if (mpidr & MPIDR_UP_BITMASK)
>  		return;
> 
> -	/* Create cpu topology mapping based on MPIDR. */
> -	if (mpidr & MPIDR_MT_BITMASK) {
> -		/* Multiprocessor system : Multi-threads per core */
> -		cpuid_topo->thread_id  = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> -		cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> -		cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) |
> -					 MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8;
> -	} else {
> -		/* Multiprocessor system : Single-thread per core */
> -		cpuid_topo->thread_id  = -1;
> -		cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> -		cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1) |
> -					 MPIDR_AFFINITY_LEVEL(mpidr, 2) << 8 |
> -					 MPIDR_AFFINITY_LEVEL(mpidr, 3) << 16;
> -	}
> +	/*
> +	 * This would be the place to create cpu topology based on MPIDR.
> +	 *
> +	 * However, it cannot be trusted to depict the actual topology; some
> +	 * pieces of the architecture enforce an artificial cap on Aff0 values
> +	 * (e.g. GICv3's ICC_SGI1R_EL1 limits it to 15), leading to an
> +	 * artificial cycling of Aff1, Aff2 and Aff3 values. IOW, these end up
> +	 * having absolutely no relationship to the actual underlying system
> +	 * topology, and cannot be reasonably used as core / package ID.
> +	 *
> +	 * If the MT bit is set, Aff0 *could* be used to define a thread ID, but
> +	 * we still wouldn't be able to obtain a sane core ID. This means we
> +	 * need to entirely ignore MPIDR for any topology deduction.
> +	 */
> +	cpuid_topo->thread_id  = -1;
> +	cpuid_topo->core_id    = cpuid;
> +	cpuid_topo->package_id = cpu_to_node(cpuid);
> 
>  	pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
>  		 cpuid, cpuid_topo->package_id, cpuid_topo->core_id,
> --
> 2.27.0
Sudeep Holla Sept. 2, 2020, 10:04 a.m. UTC | #2
On Sat, Aug 29, 2020 at 02:00:16PM +0100, Valentin Schneider wrote:
> In the absence of ACPI or DT topology data, we fallback to haphazardly
> decoding *something* out of MPIDR. Sadly, the contents of that register are
> mostly unusable due to the implementation leniency and things like Aff0
> having to be capped to 15 (despite being encoded on 8 bits).
>
> Consider a simple system with a single package of 32 cores, all under the
> same LLC. We ought to be shoving them in the same core_sibling mask, but
> MPIDR is going to look like:
>
>   | CPU  | 0 | ... | 15 | 16 | ... | 31 |
>   |------+---+-----+----+----+-----+----+
>   | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
>   | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
>   | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
>
> Which will eventually yield
>
>   core_sibling(0-15)  == 0-15
>   core_sibling(16-31) == 16-31
>
> NUMA woes
> =========
>
> If we try to play games with this and set up NUMA boundaries within those
> groups of 16 cores via e.g. QEMU:
>
>   # Node0: 0-9; Node1: 10-19
>   $ qemu-system-aarch64 <blah> \
>     -smp 20 -numa node,cpus=0-9,nodeid=0 -numa node,cpus=10-19,nodeid=1
>
> The scheduler's MC domain (all CPUs with same LLC) is going to be built via
>
>   arch_topology.c::cpu_coregroup_mask()
>
> In there we try to figure out a sensible mask out of the topology
> information we have. In short, here we'll pick the smallest of NUMA or
> core sibling mask.
>
>   node_mask(CPU9)    == 0-9
>   core_sibling(CPU9) == 0-15
>
> MC mask for CPU9 will thus be 0-9, not a problem.
>
>   node_mask(CPU10)    == 10-19
>   core_sibling(CPU10) == 0-15
>
> MC mask for CPU10 will thus be 10-19, not a problem.
>
>   node_mask(CPU16)    == 10-19
>   core_sibling(CPU16) == 16-19
>
> MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
> different unique MC spans, and the scheduler has no idea what to make of
> that. That triggers the WARN_ON() added by commit
>
>   ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
>
> Fixing MPIDR-derived topology
> =============================
>
> We could try to come up with some cleverer scheme to figure out which of
> the available masks to pick, but really if one of those masks resulted from
> MPIDR then it should be discarded because it's bound to be bogus.
>
> I was hoping to give MPIDR a chance for SMT, to figure out which threads are
> in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
> to me that there are systems out there where *all* cores have non-zero
> values in their higher affinity fields (e.g. RK3288 has "5" in all of its
> cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
>
> Stop using MPIDR for topology information. When no other source of topology
> information is available, mark each CPU as its own core and its NUMA node
> as its LLC domain.
>

Looks good to me, so:

Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>

However, we need to get it tested on some systems that have *weird* MPIDR
values, don't have their topology described in DT cpu-maps, and somehow
(wrongly) relied on the logic below. Also, though this affects the user ABI
via sysfs topology, I expect systems w/o DT cpu-maps and weird MPIDR values
are broken either way.

Luckily I found only one such MPIDR in the arm64 DTS files:
arch/arm64/boot/dts/sprd/sc9860.dtsi

--
Regards,
Sudeep
Sudeep Holla Sept. 2, 2020, 10:50 a.m. UTC | #3
On Wed, Sep 02, 2020 at 03:24:17AM +0000, Zengtao (B) wrote:
> Hi Valentin:
>
> > -----Original Message-----
> > From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> > Sent: Saturday, August 29, 2020 9:00 PM
> > To: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> > Cc: Catalin Marinas; Will Deacon; Sudeep Holla; Robin Murphy; Jeremy
> > Linton; Dietmar Eggemann; Morten Rasmussen; Zengtao (B)
> > Subject: [PATCH] arm64: topology: Stop using MPIDR for topology
> > information
> >
> > In the absence of ACPI or DT topology data, we fallback to haphazardly
> > decoding *something* out of MPIDR. Sadly, the contents of that register
> > are
> > mostly unusable due to the implementation leniency and things like Aff0
> > having to be capped to 15 (despite being encoded on 8 bits).
> >
> > Consider a simple system with a single package of 32 cores, all under the
> > same LLC. We ought to be shoving them in the same core_sibling mask,
> > but
> > MPIDR is going to look like:
> >
> >   | CPU  | 0 | ... | 15 | 16 | ... | 31 |
> >   |------+---+-----+----+----+-----+----+
> >   | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
> >   | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
> >   | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
> >
> > Which will eventually yield
> >
> >   core_sibling(0-15)  == 0-15
> >   core_sibling(16-31) == 16-31
> >
> > NUMA woes
> > =========
> >
> > If we try to play games with this and set up NUMA boundaries within those
> > groups of 16 cores via e.g. QEMU:
> >
> >   # Node0: 0-9; Node1: 10-19
> >   $ qemu-system-aarch64 <blah> \
> >     -smp 20 -numa node,cpus=0-9,nodeid=0 -numa
> > node,cpus=10-19,nodeid=1
> >
> > The scheduler's MC domain (all CPUs with same LLC) is going to be built via
> >
> >   arch_topology.c::cpu_coregroup_mask()
> >
> > In there we try to figure out a sensible mask out of the topology
> > information we have. In short, here we'll pick the smallest of NUMA or
> > core sibling mask.
> >
> >   node_mask(CPU9)    == 0-9
> >   core_sibling(CPU9) == 0-15
> >
> > MC mask for CPU9 will thus be 0-9, not a problem.
> >
> >   node_mask(CPU10)    == 10-19
> >   core_sibling(CPU10) == 0-15
> >
> > MC mask for CPU10 will thus be 10-19, not a problem.
> >
> >   node_mask(CPU16)    == 10-19
> >   core_sibling(CPU16) == 16-19
> >
> > MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
> > different unique MC spans, and the scheduler has no idea what to make of
> > that. That triggers the WARN_ON() added by commit
> >
> >   ccf74128d66c ("sched/topology: Assert non-NUMA topology masks
> > don't (partially) overlap")
> >
> > Fixing MPIDR-derived topology
> > =============================
> >
> > We could try to come up with some cleverer scheme to figure out which of
> > the available masks to pick, but really if one of those masks resulted from
> > MPIDR then it should be discarded because it's bound to be bogus.
> >
> > I was hoping to give MPIDR a chance for SMT, to figure out which threads
> > are
> > in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
> > to me that there are systems out there where *all* cores have non-zero
> > values in their higher affinity fields (e.g. RK3288 has "5" in all of its
> > cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
> >
> > Stop using MPIDR for topology information. When no other source of
> > topology
> > information is available, mark each CPU as its own core and its NUMA
> > node
> > as its LLC domain.
>
> I agree with your idea to remove the topology functionality of MPIDR ,
> but I think we need also consider ARM32 and GIC.
>

This is changing only arm64 for now. For fun, I looked at some arm32 DTS files:
arch/arm/boot/dts/aspeed-g6.dtsi
arch/arm/boot/dts/bcm2836.dtsi
arch/arm/boot/dts/exynos4210.dtsi
arch/arm/boot/dts/exynos4412.dtsi
arch/arm/boot/dts/highbank.dts
arch/arm/boot/dts/imx7ulp.dtsi
arch/arm/boot/dts/ls1021a.dtsi
arch/arm/boot/dts/meson6.dtsi
arch/arm/boot/dts/meson8.dtsi
arch/arm/boot/dts/meson8b.dtsi
arch/arm/boot/dts/milbeaut-m10v.dtsi
arch/arm/boot/dts/rk3036.dtsi
arch/arm/boot/dts/rk322x.dtsi
arch/arm/boot/dts/rk3288.dtsi
arch/arm/boot/dts/rtd1195.dtsi
arch/arm/boot/dts/rv1108.dtsi
arch/arm/boot/dts/ste-dbx5x0.dtsi

These have random non-zero values in Aff1 or Aff2. I may have generated
some false positives with a simple search.

--
Regards,
Sudeep
Valentin Schneider Sept. 2, 2020, 10:52 a.m. UTC | #4
On 02/09/20 04:24, Zengtao (B) wrote:
> Hi Valentin:
>
>> -----Original Message-----
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
>> Sent: Saturday, August 29, 2020 9:00 PM
>> To: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org
>> Cc: Catalin Marinas; Will Deacon; Sudeep Holla; Robin Murphy; Jeremy
>> Linton; Dietmar Eggemann; Morten Rasmussen; Zengtao (B)
>> Subject: [PATCH] arm64: topology: Stop using MPIDR for topology
>> information
>>
>> In the absence of ACPI or DT topology data, we fallback to haphazardly
>> decoding *something* out of MPIDR. Sadly, the contents of that register
>> are
>> mostly unusable due to the implementation leniency and things like Aff0
>> having to be capped to 15 (despite being encoded on 8 bits).
>>
>> Consider a simple system with a single package of 32 cores, all under the
>> same LLC. We ought to be shoving them in the same core_sibling mask,
>> but
>> MPIDR is going to look like:
>>
>>   | CPU  | 0 | ... | 15 | 16 | ... | 31 |
>>   |------+---+-----+----+----+-----+----+
>>   | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
>>   | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
>>   | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
>>
>> Which will eventually yield
>>
>>   core_sibling(0-15)  == 0-15
>>   core_sibling(16-31) == 16-31
>>
>> NUMA woes
>> =========
>>
>> If we try to play games with this and set up NUMA boundaries within those
>> groups of 16 cores via e.g. QEMU:
>>
>>   # Node0: 0-9; Node1: 10-19
>>   $ qemu-system-aarch64 <blah> \
>>     -smp 20 -numa node,cpus=0-9,nodeid=0 -numa
>> node,cpus=10-19,nodeid=1
>>
>> The scheduler's MC domain (all CPUs with same LLC) is going to be built via
>>
>>   arch_topology.c::cpu_coregroup_mask()
>>
>> In there we try to figure out a sensible mask out of the topology
>> information we have. In short, here we'll pick the smallest of NUMA or
>> core sibling mask.
>>
>>   node_mask(CPU9)    == 0-9
>>   core_sibling(CPU9) == 0-15
>>
>> MC mask for CPU9 will thus be 0-9, not a problem.
>>
>>   node_mask(CPU10)    == 10-19
>>   core_sibling(CPU10) == 0-15
>>
>> MC mask for CPU10 will thus be 10-19, not a problem.
>>
>>   node_mask(CPU16)    == 10-19
>>   core_sibling(CPU16) == 16-19
>>
>> MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
>> different unique MC spans, and the scheduler has no idea what to make of
>> that. That triggers the WARN_ON() added by commit
>>
>>   ccf74128d66c ("sched/topology: Assert non-NUMA topology masks
>> don't (partially) overlap")
>>
>> Fixing MPIDR-derived topology
>> =============================
>>
>> We could try to come up with some cleverer scheme to figure out which of
>> the available masks to pick, but really if one of those masks resulted from
>> MPIDR then it should be discarded because it's bound to be bogus.
>>
>> I was hoping to give MPIDR a chance for SMT, to figure out which threads
>> are
>> in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
>> to me that there are systems out there where *all* cores have non-zero
>> values in their higher affinity fields (e.g. RK3288 has "5" in all of its
>> cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
>>
>> Stop using MPIDR for topology information. When no other source of
>> topology
>> information is available, mark each CPU as its own core and its NUMA
>> node
>> as its LLC domain.
>
> I agree with your idea to remove the topology functionality of MPIDR ,
> but I think we need also consider ARM32 and GIC.
>

Could you please elaborate? This change doesn't impact arch_topology, so
only arm64 is affected.
Valentin Schneider Sept. 2, 2020, 10:52 a.m. UTC | #5
On 02/09/20 11:04, Sudeep Holla wrote:
> On Sat, Aug 29, 2020 at 02:00:16PM +0100, Valentin Schneider wrote:
>> In the absence of ACPI or DT topology data, we fallback to haphazardly
>> decoding *something* out of MPIDR. Sadly, the contents of that register are
>> mostly unusable due to the implementation leniency and things like Aff0
>> having to be capped to 15 (despite being encoded on 8 bits).
>>
>> Consider a simple system with a single package of 32 cores, all under the
>> same LLC. We ought to be shoving them in the same core_sibling mask, but
>> MPIDR is going to look like:
>>
>>   | CPU  | 0 | ... | 15 | 16 | ... | 31 |
>>   |------+---+-----+----+----+-----+----+
>>   | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
>>   | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
>>   | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
>>
>> Which will eventually yield
>>
>>   core_sibling(0-15)  == 0-15
>>   core_sibling(16-31) == 16-31
>>
>> NUMA woes
>> =========
>>
>> If we try to play games with this and set up NUMA boundaries within those
>> groups of 16 cores via e.g. QEMU:
>>
>>   # Node0: 0-9; Node1: 10-19
>>   $ qemu-system-aarch64 <blah> \
>>     -smp 20 -numa node,cpus=0-9,nodeid=0 -numa node,cpus=10-19,nodeid=1
>>
>> The scheduler's MC domain (all CPUs with same LLC) is going to be built via
>>
>>   arch_topology.c::cpu_coregroup_mask()
>>
>> In there we try to figure out a sensible mask out of the topology
>> information we have. In short, here we'll pick the smallest of NUMA or
>> core sibling mask.
>>
>>   node_mask(CPU9)    == 0-9
>>   core_sibling(CPU9) == 0-15
>>
>> MC mask for CPU9 will thus be 0-9, not a problem.
>>
>>   node_mask(CPU10)    == 10-19
>>   core_sibling(CPU10) == 0-15
>>
>> MC mask for CPU10 will thus be 10-19, not a problem.
>>
>>   node_mask(CPU16)    == 10-19
>>   core_sibling(CPU16) == 16-19
>>
>> MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
>> different unique MC spans, and the scheduler has no idea what to make of
>> that. That triggers the WARN_ON() added by commit
>>
>>   ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
>>
>> Fixing MPIDR-derived topology
>> =============================
>>
>> We could try to come up with some cleverer scheme to figure out which of
>> the available masks to pick, but really if one of those masks resulted from
>> MPIDR then it should be discarded because it's bound to be bogus.
>>
>> I was hoping to give MPIDR a chance for SMT, to figure out which threads are
>> in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
>> to me that there are systems out there where *all* cores have non-zero
>> values in their higher affinity fields (e.g. RK3288 has "5" in all of its
>> cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
>>
>> Stop using MPIDR for topology information. When no other source of topology
>> information is available, mark each CPU as its own core and its NUMA node
>> as its LLC domain.
>>
>
> Looks good to me, so:
>
> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
>

Thanks!

> However, we need to get it tested on some systems with *weird* MPIDR
> values and don't have topology described in DT cpu-maps and somehow
> (wrongly) relied on below logic. Also though these affect user ABI via
> sysfs topology, I expect systems w/o DT cpu-maps and weird MPIDR are
> broken either way.
>

Agreed, it's the one bit that may be regarded as a regression, but what is
already out there is quite broken :(

> Luckily found only one such mpidr in arm64 DTS files:
> arch/arm64/boot/dts/sprd/sc9860.dtsi

So those have 0x53 for Aff2 for all cores, which is going to end up in the
package_id. AFAICT that means that

  /sys/devices/system/cpu/cpu*/topology/physical_package_id

is going to look pretty wild.
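
For illustration (hypothetical numbers, assuming Aff1 == 0 and the old
non-MT decoding):

  package_id = Aff1 | Aff2 << 8 | Aff3 << 16
             = 0x00 | 0x53 << 8 | 0
             = 0x5300 = 21248

i.e. every CPU would report physical_package_id 21248.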
Zengtao (B) Sept. 3, 2020, 1:44 a.m. UTC | #6
> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Wednesday, September 02, 2020 6:52 PM
> To: Zengtao (B)
> Cc: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> Catalin Marinas; Will Deacon; Sudeep Holla; Robin Murphy; Jeremy Linton;
> Dietmar Eggemann; Morten Rasmussen
> Subject: Re: [PATCH] arm64: topology: Stop using MPIDR for topology
> information
> 
> 
> On 02/09/20 04:24, B wrote:
> > Hi Valentin:
> >
> >> -----Original Message-----
> >> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> >> Sent: Saturday, August 29, 2020 9:00 PM
> >> To: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> >> Cc: Catalin Marinas; Will Deacon; Sudeep Holla; Robin Murphy; Jeremy
> >> Linton; Dietmar Eggemann; Morten Rasmussen; Zengtao (B)
> >> Subject: [PATCH] arm64: topology: Stop using MPIDR for topology
> >> information
> >>
> >> In the absence of ACPI or DT topology data, we fallback to haphazardly
> >> decoding *something* out of MPIDR. Sadly, the contents of that
> register
> >> are
> >> mostly unusable due to the implementation leniency and things like Aff0
> >> having to be capped to 15 (despite being encoded on 8 bits).
> >>
> >> Consider a simple system with a single package of 32 cores, all under
> the
> >> same LLC. We ought to be shoving them in the same core_sibling mask,
> >> but
> >> MPIDR is going to look like:
> >>
> >>   | CPU  | 0 | ... | 15 | 16 | ... | 31 |
> >>   |------+---+-----+----+----+-----+----+
> >>   | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
> >>   | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
> >>   | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
> >>
> >> Which will eventually yield
> >>
> >>   core_sibling(0-15)  == 0-15
> >>   core_sibling(16-31) == 16-31
> >>
> >> NUMA woes
> >> =========
> >>
> >> If we try to play games with this and set up NUMA boundaries within
> those
> >> groups of 16 cores via e.g. QEMU:
> >>
> >>   # Node0: 0-9; Node1: 10-19
> >>   $ qemu-system-aarch64 <blah> \
> >>     -smp 20 -numa node,cpus=0-9,nodeid=0 -numa
> >> node,cpus=10-19,nodeid=1
> >>
> >> The scheduler's MC domain (all CPUs with same LLC) is going to be built
> via
> >>
> >>   arch_topology.c::cpu_coregroup_mask()
> >>
> >> In there we try to figure out a sensible mask out of the topology
> >> information we have. In short, here we'll pick the smallest of NUMA or
> >> core sibling mask.
> >>
> >>   node_mask(CPU9)    == 0-9
> >>   core_sibling(CPU9) == 0-15
> >>
> >> MC mask for CPU9 will thus be 0-9, not a problem.
> >>
> >>   node_mask(CPU10)    == 10-19
> >>   core_sibling(CPU10) == 0-15
> >>
> >> MC mask for CPU10 will thus be 10-19, not a problem.
> >>
> >>   node_mask(CPU16)    == 10-19
> >>   core_sibling(CPU16) == 16-19
> >>
> >> MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
> >> different unique MC spans, and the scheduler has no idea what to make
> of
> >> that. That triggers the WARN_ON() added by commit
> >>
> >>   ccf74128d66c ("sched/topology: Assert non-NUMA topology masks
> >> don't (partially) overlap")
> >>
> >> Fixing MPIDR-derived topology
> >> =============================
> >>
> >> We could try to come up with some cleverer scheme to figure out which
> of
> >> the available masks to pick, but really if one of those masks resulted
> from
> >> MPIDR then it should be discarded because it's bound to be bogus.
> >>
> >> I was hoping to give MPIDR a chance for SMT, to figure out which
> threads
> >> are
> >> in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed
> out
> >> to me that there are systems out there where *all* cores have
> non-zero
> >> values in their higher affinity fields (e.g. RK3288 has "5" in all of its
> >> cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
> >>
> >> Stop using MPIDR for topology information. When no other source of
> >> topology
> >> information is available, mark each CPU as its own core and its NUMA
> >> node
> >> as its LLC domain.
> >
> > I agree with your idea to remove the topology functionality of MPIDR ,
> > but I think we need also consider ARM32 and GIC.
> >
> 
> Could you please elaborate? This change doesn't impact arch_topology, so
> only arm64 is affected.

Yes, this change only affects arm64; my question is whether we need to
apply the same change to arm32, since arm32 has the same issue.

And for the GIC we are also using MPIDR for the topology info, but I am not
sure whether it has the same issue or not, just a suggestion to have a look.
Will Deacon Sept. 7, 2020, 9:35 p.m. UTC | #7
On Sat, 29 Aug 2020 14:00:16 +0100, Valentin Schneider wrote:
> In the absence of ACPI or DT topology data, we fallback to haphazardly
> decoding *something* out of MPIDR. Sadly, the contents of that register are
> mostly unusable due to the implementation leniency and things like Aff0
> having to be capped to 15 (despite being encoded on 8 bits).
> 
> Consider a simple system with a single package of 32 cores, all under the
> same LLC. We ought to be shoving them in the same core_sibling mask, but
> MPIDR is going to look like:
> 
> [...]

Applied to arm64 (for-next/topology), thanks!

[1/1] arm64: topology: Stop using MPIDR for topology information
      https://git.kernel.org/arm64/c/3102bc0e6ac7

Cheers,
Valentin Schneider Sept. 9, 2020, 11:21 a.m. UTC | #8
On 03/09/20 02:44, Zengtao (B) wrote:
>> -----Original Message-----
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
>> On 02/09/20 04:24, B wrote:
>> > I agree with your idea to remove the topology functionality of MPIDR ,
>> > but I think we need also consider ARM32 and GIC.
>> >
>>
>> Could you please elaborate? This change doesn't impact arch_topology, so
>> only arm64 is affected.
>
> Yes, this change only affects arm64, my question is that do we need to
>  leverage it to arm32 since arm32 got the same issue.
>
> And for GIC we are also using MPIDR for the topology info, but I am not sure
> whether it has the same issue or not, just a suggestion to have a look.

So technically yes, we can be bothered by this on arm32 - Sudeep pointed
out a list of DT files that shows platforms with non-zero values in Aff1 or
above.

However, the bigger issue is the artificial separation into clusters of 16
CPUs due to extra limitations on Aff0 (mainly due to GICv3, AIUI). Given
that GICv2 can support at most 8 CPU interfaces, I don't think we have it
as bad on arm32.

Patch

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 0801a0f3c156..ff1dd1dbfe64 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -36,21 +36,23 @@  void store_cpu_topology(unsigned int cpuid)
 	if (mpidr & MPIDR_UP_BITMASK)
 		return;
 
-	/* Create cpu topology mapping based on MPIDR. */
-	if (mpidr & MPIDR_MT_BITMASK) {
-		/* Multiprocessor system : Multi-threads per core */
-		cpuid_topo->thread_id  = MPIDR_AFFINITY_LEVEL(mpidr, 0);
-		cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 1);
-		cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) |
-					 MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8;
-	} else {
-		/* Multiprocessor system : Single-thread per core */
-		cpuid_topo->thread_id  = -1;
-		cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 0);
-		cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1) |
-					 MPIDR_AFFINITY_LEVEL(mpidr, 2) << 8 |
-					 MPIDR_AFFINITY_LEVEL(mpidr, 3) << 16;
-	}
+	/*
+	 * This would be the place to create cpu topology based on MPIDR.
+	 *
+	 * However, it cannot be trusted to depict the actual topology; some
+	 * pieces of the architecture enforce an artificial cap on Aff0 values
+	 * (e.g. GICv3's ICC_SGI1R_EL1 limits it to 15), leading to an
+	 * artificial cycling of Aff1, Aff2 and Aff3 values. IOW, these end up
+	 * having absolutely no relationship to the actual underlying system
+	 * topology, and cannot be reasonably used as core / package ID.
+	 *
+	 * If the MT bit is set, Aff0 *could* be used to define a thread ID, but
+	 * we still wouldn't be able to obtain a sane core ID. This means we
+	 * need to entirely ignore MPIDR for any topology deduction.
+	 */
+	cpuid_topo->thread_id  = -1;
+	cpuid_topo->core_id    = cpuid;
+	cpuid_topo->package_id = cpu_to_node(cpuid);
 
 	pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
 		 cpuid, cpuid_topo->package_id, cpuid_topo->core_id,