
[v5] topology: make core_mask include at least cluster_siblings

Message ID c8fe9fce7c86ed56b4c455b8c902982dc2303868.1649696956.git.darren@os.amperecomputing.com (mailing list archive)
State New, archived
Series: [v5] topology: make core_mask include at least cluster_siblings

Commit Message

Darren Hart April 11, 2022, 8:53 p.m. UTC
Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.

cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.

As a result, build_sched_domain() will BUG() once per CPU with:

BUG: arch topology borken
the CLS domain not a subset of the MC domain

The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max systems.

Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER
is enabled to avoid also changing the topology (MC) when
CONFIG_SCHED_CLUSTER is disabled.

This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the CLUSTER topology is above a populated MC topology, but these are not
considered today and can be addressed separately if and when they
appear.

The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:

For CPU0:

CONFIG_SCHED_CLUSTER=y
CLS  [0-1]
DIE  [0-79]
NUMA [0-159]

CONFIG_SCHED_CLUSTER is not set
DIE  [0-79]
NUMA [0-159]

Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
---
v1: Drop MC level if coregroup weight == 1
v2: New sd topo in arch/arm64/kernel/smp.c
v3: No new topo, extend core_mask to cluster_siblings
v4: Rebase on 5.18-rc1 for GregKH to pull. Add IS_ENABLED(CONFIG_SCHED_CLUSTER).
v5: Rebase on 5.18-rc2 for GregKH to pull. Add collected tags. No other changes.

 drivers/base/arch_topology.c | 9 +++++++++
 1 file changed, 9 insertions(+)
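As a rough illustration of the core_mask extension described in the commit
message, here is a hedged sketch in Python: sets stand in for cpumasks and
`<=` for cpumask_subset(); the real cpu_coregroup_mask() in
drivers/base/arch_topology.c also handles core_sibling/llc_sibling selection,
which is elided here.

```python
# Simplified model of the patched cpu_coregroup_mask() for a CPU with no
# shared CPU-side LLC but a 2-CPU SCU cluster (CPU0 on Ampere Altra:
# llc_sibling = {0}, cluster_sibling = {0, 1}). Not kernel code.

def cpu_coregroup_mask(llc_sibling, cluster_sibling, sched_cluster=True):
    core_mask = llc_sibling
    # For systems with no shared cpu-side LLC but with clusters defined,
    # extend core_mask to cluster_siblings so CLS is never wider than MC.
    if sched_cluster and core_mask <= cluster_sibling:
        core_mask = cluster_sibling
    return core_mask

# CPU0: coregroup is extended to the cluster span, avoiding the BUG.
assert cpu_coregroup_mask({0}, {0, 1}) == {0, 1}
# With CONFIG_SCHED_CLUSTER disabled, the MC span is left unchanged.
assert cpu_coregroup_mask({0}, {0, 1}, sched_cluster=False) == {0}
```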

Comments

Yicong Yang Sept. 15, 2022, 12:01 p.m. UTC | #1
Hi Darren,

On 2022/4/12 4:53, Darren Hart wrote:
> Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
> Control Unit, but have no shared CPU-side last level cache.
> 
> cpu_coregroup_mask() will return a cpumask with weight 1, while
> cpu_clustergroup_mask() will return a cpumask with weight 2.
> 
> As a result, build_sched_domain() will BUG() once per CPU with:
> 
> BUG: arch topology borken
> the CLS domain not a subset of the MC domain
> 
> The MC level cpumask is then extended to that of the CLS child, and is
> later removed entirely as redundant. This sched domain topology is an
> improvement over previous topologies, or those built without
> SCHED_CLUSTER, particularly for certain latency sensitive workloads.
> With the current scheduler model and heuristics, this is a desirable
> default topology for Ampere Altra and Altra Max systems.
> 
> Rather than create a custom sched domains topology structure and
> introduce new logic in arch/arm64 to detect these systems, update the
> core_mask so coregroup is never a subset of clustergroup, extending it
> to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER
> is enabled to avoid also changing the topology (MC) when
> CONFIG_SCHED_CLUSTER is disabled.
> 
> This has the added benefit over a custom topology of working for both
> symmetric and asymmetric topologies. It does not address systems where
> the CLUSTER topology is above a populated MC topology, but these are not
> considered today and can be addressed separately if and when they
> appear.
> 
> The final sched domain topology for a 2 socket Ampere Altra system is
> unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> 
> For CPU0:
> 
> CONFIG_SCHED_CLUSTER=y
> CLS  [0-1]
> DIE  [0-79]
> NUMA [0-159]
> 
> CONFIG_SCHED_CLUSTER is not set
> DIE  [0-79]
> NUMA [0-159]
> 
> Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
> Acked-by: Sudeep Holla <sudeep.holla@arm.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Cc: <stable@vger.kernel.org> # 5.16.x
> ---
> v1: Drop MC level if coregroup weight == 1
> v2: New sd topo in arch/arm64/kernel/smp.c
> v3: No new topo, extend core_mask to cluster_siblings
> v4: Rebase on 5.18-rc1 for GregKH to pull. Add IS_ENABLED(CONFIG_SCHED_CLUSTER).
> v5: Rebase on 5.18-rc2 for GregKH to pull. Add collected tags. No other changes.
> 
>  drivers/base/arch_topology.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 1d6636ebaac5..5497c5ab7318 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  			core_mask = &cpu_topology[cpu].llc_sibling;
>  	}
>  
> +	/*
> +	 * For systems with no shared cpu-side LLC but with clusters defined,
> +	 * extend core_mask to cluster_siblings. The sched domain builder will
> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> +	 */
> +	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
> +	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> +		core_mask = &cpu_topology[cpu].cluster_sibling;
> +
>  	return core_mask;
>  }
>  

Is this patch still necessary for Ampere after Ionela's patch [1], which
limits the cluster's span to within the coregroup's span?

I found an issue that the NUMA domains are not built on qemu with:

qemu-system-aarch64 \
        -kernel ${Image} \
        -smp 8 \
        -cpu cortex-a72 \
        -m 32G \
        -object memory-backend-ram,id=node0,size=8G \
        -object memory-backend-ram,id=node1,size=8G \
        -object memory-backend-ram,id=node2,size=8G \
        -object memory-backend-ram,id=node3,size=8G \
        -numa node,memdev=node0,cpus=0-1,nodeid=0 \
        -numa node,memdev=node1,cpus=2-3,nodeid=1 \
        -numa node,memdev=node2,cpus=4-5,nodeid=2 \
        -numa node,memdev=node3,cpus=6-7,nodeid=3 \
        -numa dist,src=0,dst=1,val=12 \
        -numa dist,src=0,dst=2,val=20 \
        -numa dist,src=0,dst=3,val=22 \
        -numa dist,src=1,dst=2,val=22 \
        -numa dist,src=1,dst=3,val=24 \
        -numa dist,src=2,dst=3,val=12 \
        -machine virt,iommu=smmuv3 \
        -net none \
        -initrd ${Rootfs} \
        -nographic \
        -bios QEMU_EFI.fd \
        -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"

I can see the sched domain build stops at the MC level since it reaches all the
CPUs in the system:

[    2.141316] CPU0 attaching sched-domain(s):
[    2.142558]  domain-0: span=0-7 level=MC
[    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
[    2.158357] CPU1 attaching sched-domain(s):
[    2.158964]  domain-0: span=0-7 level=MC
[...]

Without this the NUMA domains are built correctly:

[    2.008885] CPU0 attaching sched-domain(s):
[    2.009764]  domain-0: span=0-1 level=MC
[    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
[    2.016532]   domain-1: span=0-3 level=NUMA
[    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
[    2.019354]    domain-2: span=0-5 level=NUMA
[    2.019983]     groups: 0:{ span=0-3 cap=3758 }, 4:{ span=4-5 cap=1935 }
[    2.021527]     domain-3: span=0-7 level=NUMA
[    2.022516]      groups: 0:{ span=0-5 mask=0-1 cap=5693 }, 6:{ span=4-7 mask=6-7 cap=3978 }
[...]

Hope to see your comments since I have no Ampere machine and I don't know
how to emulate its topology on qemu.

[1] bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")

Thanks,
Yicong
Darren Hart Sept. 15, 2022, 5:56 p.m. UTC | #2
On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote:
> Hi Darren,
> 

Hi Yicong,

...

> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 1d6636ebaac5..5497c5ab7318 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >  			core_mask = &cpu_topology[cpu].llc_sibling;
> >  	}
> >  
> > +	/*
> > +	 * For systems with no shared cpu-side LLC but with clusters defined,
> > +	 * extend core_mask to cluster_siblings. The sched domain builder will
> > +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
> > +	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> > +		core_mask = &cpu_topology[cpu].cluster_sibling;
> > +
> >  	return core_mask;
> >  }
> >  
> 
> Is this patch still necessary for Ampere after Ionela's patch [1], which
> limits the cluster's span to within the coregroup's span?

Yes, see:
https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/

Both patches work together to accomplish the desired sched domains for the
Ampere Altra family.

> 
> I found an issue that the NUMA domains are not built on qemu with:
> 
> qemu-system-aarch64 \
>         -kernel ${Image} \
>         -smp 8 \
>         -cpu cortex-a72 \
>         -m 32G \
>         -object memory-backend-ram,id=node0,size=8G \
>         -object memory-backend-ram,id=node1,size=8G \
>         -object memory-backend-ram,id=node2,size=8G \
>         -object memory-backend-ram,id=node3,size=8G \
>         -numa node,memdev=node0,cpus=0-1,nodeid=0 \
>         -numa node,memdev=node1,cpus=2-3,nodeid=1 \
>         -numa node,memdev=node2,cpus=4-5,nodeid=2 \
>         -numa node,memdev=node3,cpus=6-7,nodeid=3 \
>         -numa dist,src=0,dst=1,val=12 \
>         -numa dist,src=0,dst=2,val=20 \
>         -numa dist,src=0,dst=3,val=22 \
>         -numa dist,src=1,dst=2,val=22 \
>         -numa dist,src=1,dst=3,val=24 \
>         -numa dist,src=2,dst=3,val=12 \
>         -machine virt,iommu=smmuv3 \
>         -net none \
>         -initrd ${Rootfs} \
>         -nographic \
>         -bios QEMU_EFI.fd \
>         -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
> 
> I can see the sched domain build stops at the MC level since it reaches all the
> CPUs in the system:
> 
> [    2.141316] CPU0 attaching sched-domain(s):
> [    2.142558]  domain-0: span=0-7 level=MC
> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
> [    2.158357] CPU1 attaching sched-domain(s):
> [    2.158964]  domain-0: span=0-7 level=MC
> [...]
> 
> Without this the NUMA domains are built correctly:
> 

Without which? My patch, Ionela's patch, or both?

> [    2.008885] CPU0 attaching sched-domain(s):
> [    2.009764]  domain-0: span=0-1 level=MC
> [    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
> [    2.016532]   domain-1: span=0-3 level=NUMA
> [    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
> [    2.019354]    domain-2: span=0-5 level=NUMA

I'm not following this topology - what in the description above should result in
a domain with span=0-5?


> [    2.019983]     groups: 0:{ span=0-3 cap=3758 }, 4:{ span=4-5 cap=1935 }
> [    2.021527]     domain-3: span=0-7 level=NUMA
> [    2.022516]      groups: 0:{ span=0-5 mask=0-1 cap=5693 }, 6:{ span=4-7 mask=6-7 cap=3978 }
> [...]
> 
> Hope to see your comments since I have no Ampere machine and I don't know
> how to emulate its topology on qemu.
> 
> [1] bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")
> 
> Thanks,
> Yicong

Thanks,
Yicong Yang Sept. 16, 2022, 7:59 a.m. UTC | #3
On 2022/9/16 1:56, Darren Hart wrote:
> On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote:
>> Hi Darren,
>>
> 
> Hi Yicong,
> 
> ...
> 
>>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>>> index 1d6636ebaac5..5497c5ab7318 100644
>>> --- a/drivers/base/arch_topology.c
>>> +++ b/drivers/base/arch_topology.c
>>> @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>>  			core_mask = &cpu_topology[cpu].llc_sibling;
>>>  	}
>>>  
>>> +	/*
>>> +	 * For systems with no shared cpu-side LLC but with clusters defined,
>>> +	 * extend core_mask to cluster_siblings. The sched domain builder will
>>> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
>>> +	 */
>>> +	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
>>> +	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>>> +		core_mask = &cpu_topology[cpu].cluster_sibling;
>>> +
>>>  	return core_mask;
>>>  }
>>>  
>>
>> Is this patch still necessary for Ampere after Ionela's patch [1], which
>> limits the cluster's span to within the coregroup's span?
> 
> Yes, see:
> https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/
> 
> Both patches work together to accomplish the desired sched domains for the
> Ampere Altra family.
> 

Thanks for the link. From my understanding, on the Altra machine we'll get
the following results:

with your patch alone:
The scheduler will get a weight of 2 at both the CLS and MC levels, and finally
the MC domain will be squashed. The lowest domain will be CLS.

with both your patch and Ionela's:
CLS will have a weight of 1 and MC will have a weight of 2. CLS won't be
built and the lowest domain will be MC.

with Ionela's patch alone:
Both CLS and MC will have a weight of 1, which is incorrect.
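A rough way to sanity-check the three cases above (illustrative Python, not
kernel code; sets stand in for cpumasks, and CPU0 on Altra is assumed to have
llc_sibling {0} and cluster_sibling {0, 1}):

```python
LLC = {0}          # no shared CPU-side LLC: coregroup weight 1
CLUSTER = {0, 1}   # SCU-based cluster from the PPTT

def coregroup(darren_patch):
    # Darren's patch: extend core_mask to cluster_siblings when it is a
    # subset of them (assuming CONFIG_SCHED_CLUSTER is enabled).
    return CLUSTER if darren_patch and LLC <= CLUSTER else LLC

def clustergroup(ionela_patch, core):
    # Ionela's patch: forbid CLS from spanning the same or more CPUs than
    # MC; fall back to the SMT sibling mask (just the CPU itself here).
    if ionela_patch and core <= CLUSTER:
        return {0}
    return CLUSTER

def weights(darren_patch, ionela_patch):
    mc = coregroup(darren_patch)
    cls = clustergroup(ionela_patch, mc)
    return len(cls), len(mc)

assert weights(True, False) == (2, 2)   # CLS == MC; MC squashed, CLS lowest
assert weights(True, True) == (1, 2)    # CLS degenerates, MC is lowest
assert weights(False, True) == (1, 1)   # both weight 1, which is incorrect
```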

So your patch is still necessary for Ampere Altra. Then, for the issue below,
we need to limit the MC span to the DIE/NODE span, according to the scheduler's
definition of the topology levels. Maybe something like this:

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 46cbe4471e78..8ebaba576836 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -713,6 +713,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
            cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
                core_mask = &cpu_topology[cpu].cluster_sibling;

+       if (cpumask_subset(cpu_cpu_mask(cpu), core_mask))
+               core_mask = cpu_cpu_mask(cpu);
+
        return core_mask;
 }

>>
>> I found an issue that the NUMA domains are not built on qemu with:
>>
>> qemu-system-aarch64 \
>>         -kernel ${Image} \
>>         -smp 8 \
>>         -cpu cortex-a72 \
>>         -m 32G \
>>         -object memory-backend-ram,id=node0,size=8G \
>>         -object memory-backend-ram,id=node1,size=8G \
>>         -object memory-backend-ram,id=node2,size=8G \
>>         -object memory-backend-ram,id=node3,size=8G \
>>         -numa node,memdev=node0,cpus=0-1,nodeid=0 \
>>         -numa node,memdev=node1,cpus=2-3,nodeid=1 \
>>         -numa node,memdev=node2,cpus=4-5,nodeid=2 \
>>         -numa node,memdev=node3,cpus=6-7,nodeid=3 \
>>         -numa dist,src=0,dst=1,val=12 \
>>         -numa dist,src=0,dst=2,val=20 \
>>         -numa dist,src=0,dst=3,val=22 \
>>         -numa dist,src=1,dst=2,val=22 \
>>         -numa dist,src=1,dst=3,val=24 \
>>         -numa dist,src=2,dst=3,val=12 \
>>         -machine virt,iommu=smmuv3 \
>>         -net none \
>>         -initrd ${Rootfs} \
>>         -nographic \
>>         -bios QEMU_EFI.fd \
>>         -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
>>
>> I can see the sched domain build stops at the MC level since it reaches all the
>> CPUs in the system:
>>
>> [    2.141316] CPU0 attaching sched-domain(s):
>> [    2.142558]  domain-0: span=0-7 level=MC
>> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
>> [    2.158357] CPU1 attaching sched-domain(s):
>> [    2.158964]  domain-0: span=0-7 level=MC
>> [...]
>>
>> Without this the NUMA domains are built correctly:
>>
> > Without which? My patch, Ionela's patch, or both?
> 

Reverting only your patch gives the result below; sorry for the ambiguity.
For CPU 0, MC should span 0-1, but with your patch it's extended to 0-7 and the
sched domain build stops at the MC level because it has reached all the CPUs.

>> [    2.008885] CPU0 attaching sched-domain(s):
>> [    2.009764]  domain-0: span=0-1 level=MC
>> [    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
>> [    2.016532]   domain-1: span=0-3 level=NUMA
>> [    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
>> [    2.019354]    domain-2: span=0-5 level=NUMA
> 
> I'm not following this topology - what in the description above should result in
> a domain with span=0-5?
> 

It emulates a 3-hop NUMA machine and the NUMA domains will be built according to the
NUMA distances:

node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

So for CPU 0 the NUMA domains will look like:
NUMA domain 0 for local nodes (squashed to MC domain), CPU 0-1
NUMA domain 1 for nodes within distance 12, CPU 0-3
NUMA domain 2 for nodes within distance 20, CPU 0-5
NUMA domain 3 for all the nodes, CPU 0-7
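The spans above follow mechanically from the distance matrix; a rough sketch
(plain Python, not the scheduler's actual sched_init_numa() implementation)
of grouping nodes within increasing distance thresholds:

```python
# Derive CPU0's NUMA domain spans from the emulated node distances: each
# unique distance from the source node defines one domain level, spanning
# every node (and its CPUs) within that distance.
dist = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]
node_cpus = [[0, 1], [2, 3], [4, 5], [6, 7]]

def numa_spans(src):
    spans = []
    for threshold in sorted(set(dist[src])):
        cpus = sorted(c for node, d in enumerate(dist[src]) if d <= threshold
                      for c in node_cpus[node])
        spans.append(cpus)
    return spans

# [[0, 1], [0, 1, 2, 3], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5, 6, 7]]
print(numa_spans(0))
```

This reproduces the spans 0-1 (local, squashed into MC), 0-3, 0-5, and 0-7
from the sched_verbose output quoted earlier.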

Thanks.

> 
>> [    2.019983]     groups: 0:{ span=0-3 cap=3758 }, 4:{ span=4-5 cap=1935 }
>> [    2.021527]     domain-3: span=0-7 level=NUMA
>> [    2.022516]      groups: 0:{ span=0-5 mask=0-1 cap=5693 }, 6:{ span=4-7 mask=6-7 cap=3978 }
>> [...]
>>
>> Hope to see your comments since I have no Ampere machine and I don't know
>> how to emulate its topology on qemu.
>>
>> [1] bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")
>>
>> Thanks,
>> Yicong
> 
> Thanks,
>
Ionela Voinescu Sept. 16, 2022, 4:14 p.m. UTC | #4
Hi,

On Friday 16 Sep 2022 at 15:59:34 (+0800), Yicong Yang wrote:
> On 2022/9/16 1:56, Darren Hart wrote:
> > On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote:
> >> Hi Darren,
> >>
> > 
> > Hi Yicong,
> > 
> > ...
> > 
> >>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> >>> index 1d6636ebaac5..5497c5ab7318 100644
> >>> --- a/drivers/base/arch_topology.c
> >>> +++ b/drivers/base/arch_topology.c
> >>> @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >>>  			core_mask = &cpu_topology[cpu].llc_sibling;
> >>>  	}
> >>>  
> >>> +	/*
> >>> +	 * For systems with no shared cpu-side LLC but with clusters defined,
> >>> +	 * extend core_mask to cluster_siblings. The sched domain builder will
> >>> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> >>> +	 */
> >>> +	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
> >>> +	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> >>> +		core_mask = &cpu_topology[cpu].cluster_sibling;
> >>> +
> >>>  	return core_mask;
> >>>  }
> >>>  
> >>
> >> Is this patch still necessary for Ampere after Ionela's patch [1], which
> >> limits the cluster's span to within the coregroup's span?
> > 
> > Yes, see:
> > https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/
> > 
> > Both patches work together to accomplish the desired sched domains for the
> > Ampere Altra family.
> > 
> 
> Thanks for the link. From my understanding, on the Altra machine we'll get
> the following results:
> 
> with your patch alone:
> Scheduler will get a weight of 2 for both CLS and MC level and finally the
> MC domain will be squashed. The lowest domain will be CLS.
> 
> with both your patch and Ionela's:
> CLS will have a weight of 1 and MC will have a weight of 2. CLS won't be
> built and the lowest domain will be MC.
> 
> with Ionela's patch alone:
> Both CLS and MC will have a weight of 1, which is incorrect.
> 

This would happen with or without my patch. My patch only breaks the tie
between CLS and MC.

And the above outcome is "incorrect" for Ampere Altra where there's no
cache spanning multiple cores, but ACPI presents clusters. With Darren's
patch this information on clusters is used instead to build the MC domain.


> So your patch is still necessary for Amphere Altra. Then we need to limit
> MC span to DIE/NODE span, according to the scheduler's definition for
> topology level, for the issue below. Maybe something like this:
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 46cbe4471e78..8ebaba576836 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -713,6 +713,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>             cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>                 core_mask = &cpu_topology[cpu].cluster_sibling;
> 
> +       if (cpumask_subset(cpu_cpu_mask(cpu), core_mask))
> +               core_mask = cpu_cpu_mask(cpu);
> +
>         return core_mask;
>  }
> 

I agree cluster_sibling should not span more CPUs than package/node.
I thought that restriction was imposed by find_acpi_cpu_topology_cluster().
I'll take a further look at that, as I think it's a better location to
restrict the span of the cluster.


> >>
> >> I found an issue that the NUMA domains are not built on qemu with:
> >>
> >> qemu-system-aarch64 \
> >>         -kernel ${Image} \
> >>         -smp 8 \
> >>         -cpu cortex-a72 \
> >>         -m 32G \
> >>         -object memory-backend-ram,id=node0,size=8G \
> >>         -object memory-backend-ram,id=node1,size=8G \
> >>         -object memory-backend-ram,id=node2,size=8G \
> >>         -object memory-backend-ram,id=node3,size=8G \
> >>         -numa node,memdev=node0,cpus=0-1,nodeid=0 \
> >>         -numa node,memdev=node1,cpus=2-3,nodeid=1 \
> >>         -numa node,memdev=node2,cpus=4-5,nodeid=2 \
> >>         -numa node,memdev=node3,cpus=6-7,nodeid=3 \
> >>         -numa dist,src=0,dst=1,val=12 \
> >>         -numa dist,src=0,dst=2,val=20 \
> >>         -numa dist,src=0,dst=3,val=22 \
> >>         -numa dist,src=1,dst=2,val=22 \
> >>         -numa dist,src=1,dst=3,val=24 \
> >>         -numa dist,src=2,dst=3,val=12 \
> >>         -machine virt,iommu=smmuv3 \
> >>         -net none \
> >>         -initrd ${Rootfs} \
> >>         -nographic \
> >>         -bios QEMU_EFI.fd \
> >>         -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
> >>
> >> I can see the sched domain build stops at the MC level since it reaches all the
> >> CPUs in the system:
> >>
> >> [    2.141316] CPU0 attaching sched-domain(s):
> >> [    2.142558]  domain-0: span=0-7 level=MC
> >> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
> >> [    2.158357] CPU1 attaching sched-domain(s):
> >> [    2.158964]  domain-0: span=0-7 level=MC
> >> [...]
> >>

It took me a bit to reproduce this, as it requires QEMU emulator version
7.1.0; otherwise there won't be a PPTT table.

With this, the cache hierarchy is not really "healthy", so it's not a
topology I'd expect to see in practice. But I suppose we should try to
fix it.

root@debian-arm64-buster:/sys/devices/system/cpu/cpu0/cache# grep . */*
index0/level:1
index0/shared_cpu_list:0-7
index0/shared_cpu_map:ff
index0/type:Data
index1/level:1
index1/shared_cpu_list:0-7
index1/shared_cpu_map:ff
index1/type:Instruction
index2/level:2
index2/shared_cpu_list:0-7
index2/shared_cpu_map:ff
index2/type:Unified

Thanks,
Ionela.

> >> Without this the NUMA domains are built correctly:
> >>
> > > Without which? My patch, Ionela's patch, or both?
> > 
> 
> Reverting only your patch gives the result below; sorry for the ambiguity.
> For CPU 0, MC should span 0-1, but with your patch it's extended to 0-7 and the
> sched domain build stops at the MC level because it has reached all the CPUs.
> 
> >> [    2.008885] CPU0 attaching sched-domain(s):
> >> [    2.009764]  domain-0: span=0-1 level=MC
> >> [    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
> >> [    2.016532]   domain-1: span=0-3 level=NUMA
> >> [    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
> >> [    2.019354]    domain-2: span=0-5 level=NUMA
> > 
> > I'm not following this topology - what in the description above should result in
> > a domain with span=0-5?
> > 
> 
> It emulates a 3-hop NUMA machine and the NUMA domains will be built according to the
> NUMA distances:
> 
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
> 
> So for CPU 0 the NUMA domains will look like:
> NUMA domain 0 for local nodes (squashed to MC domain), CPU 0-1
> NUMA domain 1 for nodes within distance 12, CPU 0-3
> NUMA domain 2 for nodes within distance 20, CPU 0-5
> NUMA domain 3 for all the nodes, CPU 0-7
> 
> Thanks.
> 
> > 
> >> [    2.019983]     groups: 0:{ span=0-3 cap=3758 }, 4:{ span=4-5 cap=1935 }
> >> [    2.021527]     domain-3: span=0-7 level=NUMA
> >> [    2.022516]      groups: 0:{ span=0-5 mask=0-1 cap=5693 }, 6:{ span=4-7 mask=6-7 cap=3978 }
> >> [...]
> >>
> >> Hope to see your comments since I have no Ampere machine and I don't know
> >> how to emulate its topology on qemu.
> >>
> >> [1] bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")
> >>
> >> Thanks,
> >> Yicong
> > 
> > Thanks,
> >
Darren Hart Sept. 16, 2022, 5:41 p.m. UTC | #5
On Fri, Sep 16, 2022 at 03:59:34PM +0800, Yicong Yang wrote:
> On 2022/9/16 1:56, Darren Hart wrote:
> > On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote:
> >> Hi Darren,
> >>
> > 
> > Hi Yicong,
> > 
> > ...
> > 
> >>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> >>> index 1d6636ebaac5..5497c5ab7318 100644
> >>> --- a/drivers/base/arch_topology.c
> >>> +++ b/drivers/base/arch_topology.c
> >>> @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >>>  			core_mask = &cpu_topology[cpu].llc_sibling;
> >>>  	}
> >>>  
> >>> +	/*
> >>> +	 * For systems with no shared cpu-side LLC but with clusters defined,
> >>> +	 * extend core_mask to cluster_siblings. The sched domain builder will
> >>> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> >>> +	 */
> >>> +	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
> >>> +	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> >>> +		core_mask = &cpu_topology[cpu].cluster_sibling;
> >>> +
> >>>  	return core_mask;
> >>>  }
> >>>  
> >>
> >> Is this patch still necessary for Ampere after Ionela's patch [1], which
> >> limits the cluster's span to within the coregroup's span?
> > 
> > Yes, see:
> > https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/
> > 
> > Both patches work together to accomplish the desired sched domains for the
> > Ampere Altra family.
> > 
> 
> Thanks for the link. From my understanding, on the Altra machine we'll get
> the following results:
> 
> with your patch alone:
> Scheduler will get a weight of 2 for both CLS and MC level and finally the
> MC domain will be squashed. The lowest domain will be CLS.
> 
> with both your patch and Ionela's:
> CLS will have a weight of 1 and MC will have a weight of 2. CLS won't be
> built and the lowest domain will be MC.
> 
> with Ionela's patch alone:
> Both CLS and MC will have a weight of 1, which is incorrect.
> 
> So your patch is still necessary for Ampere Altra. Then, for the issue below,
> we need to limit the MC span to the DIE/NODE span, according to the scheduler's
> definition of the topology levels. Maybe something like this:

That seems reasonable.

What isn't clear to me is why qemu is creating a cluster layer with the
description you provide. Why is cluster_siblings being populated?

> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 46cbe4471e78..8ebaba576836 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -713,6 +713,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>             cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>                 core_mask = &cpu_topology[cpu].cluster_sibling;
> 
> +       if (cpumask_subset(cpu_cpu_mask(cpu), core_mask))
> +               core_mask = cpu_cpu_mask(cpu);
> +
>         return core_mask;
>  }
> 
> >>
> >> I found an issue that the NUMA domains are not built on qemu with:
> >>
> >> qemu-system-aarch64 \
> >>         -kernel ${Image} \
> >>         -smp 8 \
> >>         -cpu cortex-a72 \
> >>         -m 32G \
> >>         -object memory-backend-ram,id=node0,size=8G \
> >>         -object memory-backend-ram,id=node1,size=8G \
> >>         -object memory-backend-ram,id=node2,size=8G \
> >>         -object memory-backend-ram,id=node3,size=8G \
> >>         -numa node,memdev=node0,cpus=0-1,nodeid=0 \
> >>         -numa node,memdev=node1,cpus=2-3,nodeid=1 \
> >>         -numa node,memdev=node2,cpus=4-5,nodeid=2 \
> >>         -numa node,memdev=node3,cpus=6-7,nodeid=3 \
> >>         -numa dist,src=0,dst=1,val=12 \
> >>         -numa dist,src=0,dst=2,val=20 \
> >>         -numa dist,src=0,dst=3,val=22 \
> >>         -numa dist,src=1,dst=2,val=22 \
> >>         -numa dist,src=1,dst=3,val=24 \
> >>         -numa dist,src=2,dst=3,val=12 \
> >>         -machine virt,iommu=smmuv3 \
> >>         -net none \
> >>         -initrd ${Rootfs} \
> >>         -nographic \
> >>         -bios QEMU_EFI.fd \
> >>         -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
> >>
> >> I can see the sched domain build stops at the MC level since it reaches all the
> >> CPUs in the system:
> >>
> >> [    2.141316] CPU0 attaching sched-domain(s):
> >> [    2.142558]  domain-0: span=0-7 level=MC
> >> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
> >> [    2.158357] CPU1 attaching sched-domain(s):
> >> [    2.158964]  domain-0: span=0-7 level=MC
> >> [...]
> >>
> >> Without this the NUMA domains are built correctly:
> >>
> > > Without which? My patch, Ionela's patch, or both?
> > 
> 
> Reverting only your patch gives the result below; sorry for the ambiguity.
> For CPU 0, MC should span 0-1, but with your patch it's extended to 0-7 and the
> sched domain build stops at the MC level because it has reached all the CPUs.
> 
> >> [    2.008885] CPU0 attaching sched-domain(s):
> >> [    2.009764]  domain-0: span=0-1 level=MC
> >> [    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
> >> [    2.016532]   domain-1: span=0-3 level=NUMA
> >> [    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
> >> [    2.019354]    domain-2: span=0-5 level=NUMA
> > 
> > I'm not following this topology - what in the description above should result in
> > a domain with span=0-5?
> > 
> 
> It emulates a 3-hop NUMA machine and the NUMA domains will be built according to the
> NUMA distances:
> 
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
> 
> So for CPU 0 the NUMA domains will look like:
> NUMA domain 0 for local nodes (squashed to MC domain), CPU 0-1
> NUMA domain 1 for nodes within distance 12, CPU 0-3
> NUMA domain 2 for nodes within distance 20, CPU 0-5
> NUMA domain 3 for all the nodes, CPU 0-7
> 

Right, thanks for the explanation.

So the bit that remains unclear to me is why cluster_siblings is being
populated. Which part of your qemu topology description becomes the CLS layer
during sched domain construction?
Darren Hart Sept. 16, 2022, 5:46 p.m. UTC | #6
On Fri, Sep 16, 2022 at 05:14:41PM +0100, Ionela Voinescu wrote:
> > >>
> > >> I found an issue that the NUMA domains are not built on qemu with:
> > >>
> > >> qemu-system-aarch64 \
> > >>         -kernel ${Image} \
> > >>         -smp 8 \
> > >>         -cpu cortex-a72 \
> > >>         -m 32G \
> > >>         -object memory-backend-ram,id=node0,size=8G \
> > >>         -object memory-backend-ram,id=node1,size=8G \
> > >>         -object memory-backend-ram,id=node2,size=8G \
> > >>         -object memory-backend-ram,id=node3,size=8G \
> > >>         -numa node,memdev=node0,cpus=0-1,nodeid=0 \
> > >>         -numa node,memdev=node1,cpus=2-3,nodeid=1 \
> > >>         -numa node,memdev=node2,cpus=4-5,nodeid=2 \
> > >>         -numa node,memdev=node3,cpus=6-7,nodeid=3 \
> > >>         -numa dist,src=0,dst=1,val=12 \
> > >>         -numa dist,src=0,dst=2,val=20 \
> > >>         -numa dist,src=0,dst=3,val=22 \
> > >>         -numa dist,src=1,dst=2,val=22 \
> > >>         -numa dist,src=1,dst=3,val=24 \
> > >>         -numa dist,src=2,dst=3,val=12 \
> > >>         -machine virt,iommu=smmuv3 \
> > >>         -net none \
> > >>         -initrd ${Rootfs} \
> > >>         -nographic \
> > >>         -bios QEMU_EFI.fd \
> > >>         -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
> > >>
> > >> I can see the sched domain build stops at the MC level since we reach all the
> > >> CPUs in the system:
> > >>
> > >> [    2.141316] CPU0 attaching sched-domain(s):
> > >> [    2.142558]  domain-0: span=0-7 level=MC
> > >> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
> > >> [    2.158357] CPU1 attaching sched-domain(s):
> > >> [    2.158964]  domain-0: span=0-7 level=MC
> > >> [...]
> > >>
> 
> It took me a bit to reproduce this, as it requires "QEMU emulator version
> 7.1.0"; otherwise there won't be a PPTT table.
> 

Is this new PPTT presenting what we'd expect from the qemu topology? e.g. if
it's presenting a cluster layer in the PPTT - should it be? Or should that be
limited to the SRAT table only?
Yicong Yang Sept. 19, 2022, 1:22 p.m. UTC | #7
On 2022/9/17 1:41, Darren Hart wrote:
> On Fri, Sep 16, 2022 at 03:59:34PM +0800, Yicong Yang wrote:
>> On 2022/9/16 1:56, Darren Hart wrote:
>>> On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote:
>>>> Hi Darren,
>>>>
>>>
>>> Hi Yicong,
>>>
>>> ...
>>>
>>>>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>>>>> index 1d6636ebaac5..5497c5ab7318 100644
>>>>> --- a/drivers/base/arch_topology.c
>>>>> +++ b/drivers/base/arch_topology.c
>>>>> @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>>>>  			core_mask = &cpu_topology[cpu].llc_sibling;
>>>>>  	}
>>>>>  
>>>>> +	/*
>>>>> +	 * For systems with no shared cpu-side LLC but with clusters defined,
>>>>> +	 * extend core_mask to cluster_siblings. The sched domain builder will
>>>>> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
>>>>> +	 */
>>>>> +	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
>>>>> +	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>>>>> +		core_mask = &cpu_topology[cpu].cluster_sibling;
>>>>> +
>>>>>  	return core_mask;
>>>>>  }
>>>>>  
>>>>
>>>> Is this patch still necessary for Ampere after Ionela's patch [1], which
>>>> will limit the cluster's span to within the coregroup's span?
>>>
>>> Yes, see:
>>> https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/
>>>
>>> Both patches work together to accomplish the desired sched domains for the
>>> Ampere Altra family.
>>>
>>
>> Thanks for the link. From my understanding, on the Altra machine we'll get
>> the following results:
>>
>> with your patch alone:
>> Scheduler will get a weight of 2 for both CLS and MC level and finally the
>> MC domain will be squashed. The lowest domain will be CLS.
>>
>> with both your patch and Ionela's:
>> CLS will have a weight of 1 and MC will have a weight of 2. CLS won't be
>> built and the lowest domain will be MC.
>>
>> with Ionela's patch alone:
>> Both CLS and MC will have a weight of 1, which is incorrect.
>>
>> So your patch is still necessary for Ampere Altra. Then, for the issue below,
>> we need to limit the MC span to the DIE/NODE span, in line with the scheduler's
>> definition of topology levels. Maybe something like this:
> 
> That seems reasonable.
> 
> What isn't clear to me is why qemu is creating a cluster layer with the
> description you provide. Why is cluster_siblings being populated?
> 
>>
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 46cbe4471e78..8ebaba576836 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -713,6 +713,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>             cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>>                 core_mask = &cpu_topology[cpu].cluster_sibling;
>>
>> +       if (cpumask_subset(cpu_cpu_mask(cpu), core_mask))
>> +               core_mask = cpu_cpu_mask(cpu);
>> +
>>         return core_mask;
>>  }
>>
>>>>
>>>> I found an issue that the NUMA domains are not built on qemu with:
>>>>
>>>> qemu-system-aarch64 \
>>>>         -kernel ${Image} \
>>>>         -smp 8 \
>>>>         -cpu cortex-a72 \
>>>>         -m 32G \
>>>>         -object memory-backend-ram,id=node0,size=8G \
>>>>         -object memory-backend-ram,id=node1,size=8G \
>>>>         -object memory-backend-ram,id=node2,size=8G \
>>>>         -object memory-backend-ram,id=node3,size=8G \
>>>>         -numa node,memdev=node0,cpus=0-1,nodeid=0 \
>>>>         -numa node,memdev=node1,cpus=2-3,nodeid=1 \
>>>>         -numa node,memdev=node2,cpus=4-5,nodeid=2 \
>>>>         -numa node,memdev=node3,cpus=6-7,nodeid=3 \
>>>>         -numa dist,src=0,dst=1,val=12 \
>>>>         -numa dist,src=0,dst=2,val=20 \
>>>>         -numa dist,src=0,dst=3,val=22 \
>>>>         -numa dist,src=1,dst=2,val=22 \
>>>>         -numa dist,src=1,dst=3,val=24 \
>>>>         -numa dist,src=2,dst=3,val=12 \
>>>>         -machine virt,iommu=smmuv3 \
>>>>         -net none \
>>>>         -initrd ${Rootfs} \
>>>>         -nographic \
>>>>         -bios QEMU_EFI.fd \
>>>>         -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
>>>>
>>>> I can see the sched domain build stops at the MC level since we reach all the
>>>> CPUs in the system:
>>>>
>>>> [    2.141316] CPU0 attaching sched-domain(s):
>>>> [    2.142558]  domain-0: span=0-7 level=MC
>>>> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
>>>> [    2.158357] CPU1 attaching sched-domain(s):
>>>> [    2.158964]  domain-0: span=0-7 level=MC
>>>> [...]
>>>>
>>>> Without this the NUMA domains are built correctly:
>>>>
>>>> Without which? My patch, Ionela's patch, or both?
>>>
>>
>> Reverting only your patch gives the result below, sorry for the ambiguity. Before
>> reverting, for CPU 0, MC should span 0-1, but with your patch it's extended to 0-7 and
>> the sched domain build stops at the MC level because it has already reached all the CPUs.
>>
>>>> [    2.008885] CPU0 attaching sched-domain(s):
>>>> [    2.009764]  domain-0: span=0-1 level=MC
>>>> [    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
>>>> [    2.016532]   domain-1: span=0-3 level=NUMA
>>>> [    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
>>>> [    2.019354]    domain-2: span=0-5 level=NUMA
>>>
>>> I'm not following this topology - what in the description above should result in
>>> a domain with span=0-5?
>>>
>>
>> It emulates a 3-hop NUMA machine and the NUMA domains will be built according to the
>> NUMA distances:
>>
>> node   0   1   2   3
>>   0:  10  12  20  22
>>   1:  12  10  22  24
>>   2:  20  22  10  12
>>   3:  22  24  12  10
>>
>> So for CPU 0 the NUMA domains will look like:
>> NUMA domain 0 for local nodes (squashed to MC domain), CPU 0-1
>> NUMA domain 1 for nodes within distance 12, CPU 0-3
>> NUMA domain 2 for nodes within distance 20, CPU 0-5
>> NUMA domain 3 for all the nodes, CPU 0-7
>>
> 
> Right, thanks for the explanation.
> 
> So the bit that remains unclear to me is why cluster_siblings is being
> populated. Which part of your qemu topology description becomes the CLS layer
> during sched domain construction?

I think your concern is right: qemu does indeed populate a cluster in the system. I
checked, and cluster_id looks like:

estuary:/$ cat /sys/devices/system/cpu/cpu*/topology/cluster_id
56
56
56
56
56
56
56
56

I checked the qemu code: it will always populate a cluster for aarch64's virt machine
even if the user doesn't specify one through '-smp clusters=N', and the cluster's range
will be equal to the package's.

I tried the attached changes to qemu (based on v7.1.0-rc4) and the kernel (based on
6.0.0-rc1). With them the cluster isn't built and the NUMA domains are built correctly:

estuary:/$ cat /sys/devices/system/cpu/cpu*/topology/cluster_id
-2
-2
-2
-2
-2
-2
-2
-2

Ionela's reply suggested checking the PPTT code. Maybe we should restrict this on both
the Qemu and kernel sides.

Thanks,
Yicong

Kernel changes below. Make sure the package node is not recognized as a cluster node.

diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index c91342dcbcd6..6cec3cf52921 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -750,7 +750,7 @@ int find_acpi_cpu_topology_cluster(unsigned int cpu)

        is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
        cluster_node = fetch_pptt_node(table, cpu_node->parent);
-       if (!cluster_node)
+       if (!cluster_node || cluster_node->flags & ACPI_PPTT_PHYSICAL_PACKAGE)
                return -ENOENT;

        if (is_thread) {



Qemu changes below. Don't build a cluster node in the PPTT if the user didn't specify
"-smp clusters=N".

yangyicong@ubuntu:~/Community/qemu/build$ git diff | cat
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index e6bfac95c7..1a0f708250 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -2030,7 +2030,7 @@ void build_pptt(GArray *table_data, BIOSLinker *linker, MachineState *ms,
                 0, socket_id, NULL, 0);
         }

-        if (mc->smp_props.clusters_supported) {
+        if (mc->smp_props.clusters_supported && ms->smp.has_cluster) {
             if (cpus->cpus[n].props.cluster_id != cluster_id) {
                 assert(cpus->cpus[n].props.cluster_id > cluster_id);
                 cluster_id = cpus->cpus[n].props.cluster_id;
diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c
index b39ed21e65..97c830660b 100644
--- a/hw/core/machine-smp.c
+++ b/hw/core/machine-smp.c
@@ -158,6 +158,9 @@ void machine_parse_smp_config(MachineState *ms,
     ms->smp.threads = threads;
     ms->smp.max_cpus = maxcpus;

+    if (config->has_clusters)
+        ms->smp.has_cluster = true;
+
     /* sanity-check of the computed topology */
     if (sockets * dies * clusters * cores * threads != maxcpus) {
         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 7b416c9787..6f4473e80a 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -314,6 +314,7 @@ typedef struct CpuTopology {
     unsigned int cores;
     unsigned int threads;
     unsigned int max_cpus;
+    bool has_cluster;
 } CpuTopology;

 /**
diff mbox series

Patch

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..5497c5ab7318 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -667,6 +667,15 @@  const struct cpumask *cpu_coregroup_mask(int cpu)
 			core_mask = &cpu_topology[cpu].llc_sibling;
 	}
 
+	/*
+	 * For systems with no shared cpu-side LLC but with clusters defined,
+	 * extend core_mask to cluster_siblings. The sched domain builder will
+	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
+	 */
+	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
+	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
+		core_mask = &cpu_topology[cpu].cluster_sibling;
+
 	return core_mask;
 }