
[v3] topology: make core_mask include at least cluster_siblings

Message ID f1deaeabfd31fdf512ff6502f38186ef842c2b1f.1646413117.git.darren@os.amperecomputing.com (mailing list archive)
State New, archived
Series [v3] topology: make core_mask include at least cluster_siblings

Commit Message

Darren Hart March 4, 2022, 5:01 p.m. UTC
Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.

cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.

As a result, build_sched_domain() will BUG() once per CPU with:

BUG: arch topology borken
the CLS domain not a subset of the MC domain
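
That message comes from the child/parent span sanity check in
build_sched_domain(), abridged here from kernel/sched/topology.c for
reference (not part of this patch):

	if (!cpumask_subset(sched_domain_span(child),
			    sched_domain_span(sd))) {
		pr_err("BUG: arch topology borken\n");
#ifdef CONFIG_SCHED_DEBUG
		pr_err("     the %s domain not a subset of the %s domain\n",
				child->name, sd->name);
#endif
		/* Fixup, ensure @sd has at least @child CPUs. */
		cpumask_or(sched_domain_span(sd),
			   sched_domain_span(sd),
			   sched_domain_span(child));
	}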

The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max systems.

Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary.

This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the cluster topology is above a populated mc topology, but these are not
considered today and can be addressed separately if and when they
appear.

The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:

For CPU0:

CONFIG_SCHED_CLUSTER=y
CLS  [0-1]
DIE  [0-79]
NUMA [0-159]

CONFIG_SCHED_CLUSTER is not set
DIE  [0-79]
NUMA [0-159]

Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
---
v1: Drop MC level if coregroup weight == 1
v2: New sd topo in arch/arm64/kernel/smp.c
v3: No new topo, extend core_mask to cluster_siblings

 drivers/base/arch_topology.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Will Deacon March 8, 2022, 10:30 a.m. UTC | #1
On Fri, Mar 04, 2022 at 09:01:36AM -0800, Darren Hart wrote:
> Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
> Control Unit, but have no shared CPU-side last level cache.
> 
> cpu_coregroup_mask() will return a cpumask with weight 1, while
> cpu_clustergroup_mask() will return a cpumask with weight 2.
> 
> As a result, build_sched_domain() will BUG() once per CPU with:
> 
> BUG: arch topology borken
> the CLS domain not a subset of the MC domain
> 
> The MC level cpumask is then extended to that of the CLS child, and is
> later removed entirely as redundant. This sched domain topology is an
> improvement over previous topologies, or those built without
> SCHED_CLUSTER, particularly for certain latency sensitive workloads.
> With the current scheduler model and heuristics, this is a desirable
> default topology for Ampere Altra and Altra Max systems.
> 
> Rather than create a custom sched domains topology structure and
> introduce new logic in arch/arm64 to detect these systems, update the
> core_mask so coregroup is never a subset of clustergroup, extending it
> to cluster_siblings if necessary.
> 
> This has the added benefit over a custom topology of working for both
> symmetric and asymmetric topologies. It does not address systems where
> the cluster topology is above a populated mc topology, but these are not
> considered today and can be addressed separately if and when they
> appear.
> 
> The final sched domain topology for a 2 socket Ampere Altra system is
> unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> 
> For CPU0:
> 
> CONFIG_SCHED_CLUSTER=y
> CLS  [0-1]
> DIE  [0-79]
> NUMA [0-159]
> 
> CONFIG_SCHED_CLUSTER is not set
> DIE  [0-79]
> NUMA [0-159]
> 
> Cc: Sudeep Holla <sudeep.holla@arm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Barry Song <song.bao.hua@hisilicon.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Cc: <stable@vger.kernel.org> # 5.16.x
> Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> ---
> v1: Drop MC level if coregroup weight == 1
> v2: New sd topo in arch/arm64/kernel/smp.c
> v3: No new topo, extend core_mask to cluster_siblings
> 
>  drivers/base/arch_topology.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 976154140f0b..a96f45db928b 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  			core_mask = &cpu_topology[cpu].llc_sibling;
>  	}
>  
> +	/*
> +	 * For systems with no shared cpu-side LLC but with clusters defined,
> +	 * extend core_mask to cluster_siblings. The sched domain builder will
> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> +	 */
> +	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> +		core_mask = &cpu_topology[cpu].cluster_sibling;
> +

Sudeep, Vincent, are you happy with this now?

Will
Sudeep Holla March 8, 2022, 10:45 a.m. UTC | #2
On Tue, Mar 08, 2022 at 10:30:12AM +0000, Will Deacon wrote:
> On Fri, Mar 04, 2022 at 09:01:36AM -0800, Darren Hart wrote:
> > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
> > Control Unit, but have no shared CPU-side last level cache.
> > 
> > cpu_coregroup_mask() will return a cpumask with weight 1, while
> > cpu_clustergroup_mask() will return a cpumask with weight 2.
> > 
> > As a result, build_sched_domain() will BUG() once per CPU with:
> > 
> > BUG: arch topology borken
> > the CLS domain not a subset of the MC domain
> > 
> > The MC level cpumask is then extended to that of the CLS child, and is
> > later removed entirely as redundant. This sched domain topology is an
> > improvement over previous topologies, or those built without
> > SCHED_CLUSTER, particularly for certain latency sensitive workloads.
> > With the current scheduler model and heuristics, this is a desirable
> > default topology for Ampere Altra and Altra Max systems.
> > 
> > Rather than create a custom sched domains topology structure and
> > introduce new logic in arch/arm64 to detect these systems, update the
> > core_mask so coregroup is never a subset of clustergroup, extending it
> > to cluster_siblings if necessary.
> > 
> > This has the added benefit over a custom topology of working for both
> > symmetric and asymmetric topologies. It does not address systems where
> > the cluster topology is above a populated mc topology, but these are not
> > considered today and can be addressed separately if and when they
> > appear.
> > 
> > The final sched domain topology for a 2 socket Ampere Altra system is
> > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > 
> > For CPU0:
> > 
> > CONFIG_SCHED_CLUSTER=y
> > CLS  [0-1]
> > DIE  [0-79]
> > NUMA [0-159]
> > 
> > CONFIG_SCHED_CLUSTER is not set
> > DIE  [0-79]
> > NUMA [0-159]
> > 
> > Cc: Sudeep Holla <sudeep.holla@arm.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Cc: <stable@vger.kernel.org> # 5.16.x
> > Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
> > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > ---
> > v1: Drop MC level if coregroup weight == 1
> > v2: New sd topo in arch/arm64/kernel/smp.c
> > v3: No new topo, extend core_mask to cluster_siblings
> > 
> >  drivers/base/arch_topology.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 976154140f0b..a96f45db928b 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >  			core_mask = &cpu_topology[cpu].llc_sibling;
> >  	}
> >  
> > +	/*
> > +	 * For systems with no shared cpu-side LLC but with clusters defined,
> > +	 * extend core_mask to cluster_siblings. The sched domain builder will
> > +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> > +	 */
> > +	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> > +		core_mask = &cpu_topology[cpu].cluster_sibling;
> > +
> 
> Sudeep, Vincent, are you happy with this now?
> 

It looks good to me. Since I don't have much knowledge of the scheduler, I asked
Dietmar to have a look as well, just in case I am missing any other impact.
For now,

Acked-by: Sudeep Holla <sudeep.holla@arm.com>

--
Regards,
Sudeep
Vincent Guittot March 8, 2022, 11:04 a.m. UTC | #3
On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
>
> On Fri, Mar 04, 2022 at 09:01:36AM -0800, Darren Hart wrote:
> > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
> > Control Unit, but have no shared CPU-side last level cache.
> >
> > cpu_coregroup_mask() will return a cpumask with weight 1, while
> > cpu_clustergroup_mask() will return a cpumask with weight 2.
> >
> > As a result, build_sched_domain() will BUG() once per CPU with:
> >
> > BUG: arch topology borken
> > the CLS domain not a subset of the MC domain
> >
> > The MC level cpumask is then extended to that of the CLS child, and is
> > later removed entirely as redundant. This sched domain topology is an
> > improvement over previous topologies, or those built without
> > SCHED_CLUSTER, particularly for certain latency sensitive workloads.
> > With the current scheduler model and heuristics, this is a desirable
> > default topology for Ampere Altra and Altra Max systems.
> >
> > Rather than create a custom sched domains topology structure and
> > introduce new logic in arch/arm64 to detect these systems, update the
> > core_mask so coregroup is never a subset of clustergroup, extending it
> > to cluster_siblings if necessary.
> >
> > This has the added benefit over a custom topology of working for both
> > symmetric and asymmetric topologies. It does not address systems where
> > the cluster topology is above a populated mc topology, but these are not
> > considered today and can be addressed separately if and when they
> > appear.
> >
> > The final sched domain topology for a 2 socket Ampere Altra system is
> > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> >
> > For CPU0:
> >
> > CONFIG_SCHED_CLUSTER=y
> > CLS  [0-1]
> > DIE  [0-79]
> > NUMA [0-159]
> >
> > CONFIG_SCHED_CLUSTER is not set
> > DIE  [0-79]
> > NUMA [0-159]
> >
> > Cc: Sudeep Holla <sudeep.holla@arm.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Cc: <stable@vger.kernel.org> # 5.16.x
> > Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
> > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > ---
> > v1: Drop MC level if coregroup weight == 1
> > v2: New sd topo in arch/arm64/kernel/smp.c
> > v3: No new topo, extend core_mask to cluster_siblings
> >
> >  drivers/base/arch_topology.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 976154140f0b..a96f45db928b 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >                       core_mask = &cpu_topology[cpu].llc_sibling;
> >       }
> >
> > +     /*
> > +      * For systems with no shared cpu-side LLC but with clusters defined,
> > +      * extend core_mask to cluster_siblings. The sched domain builder will
> > +      * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> > +      */
> > +     if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> > +             core_mask = &cpu_topology[cpu].cluster_sibling;
> > +
>
> Sudeep, Vincent, are you happy with this now?

I would not say that I'm happy, because this solution skews the core
cpu mask in order to abuse the scheduler so that it will remove a
wrong but useless level when it builds its domains.
But this works, so as long as the maintainers are happy, I'm fine.

>
> Will
Dietmar Eggemann March 8, 2022, 4:03 p.m. UTC | #4
On 08/03/2022 12:04, Vincent Guittot wrote:
> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>>> ---
>>> v1: Drop MC level if coregroup weight == 1
>>> v2: New sd topo in arch/arm64/kernel/smp.c
>>> v3: No new topo, extend core_mask to cluster_siblings
>>>
>>>  drivers/base/arch_topology.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>>> index 976154140f0b..a96f45db928b 100644
>>> --- a/drivers/base/arch_topology.c
>>> +++ b/drivers/base/arch_topology.c
>>> @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>>                       core_mask = &cpu_topology[cpu].llc_sibling;
>>>       }
>>>
>>> +     /*
>>> +      * For systems with no shared cpu-side LLC but with clusters defined,
>>> +      * extend core_mask to cluster_siblings. The sched domain builder will
>>> +      * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.

IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
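
For reference, the degenerate path in kernel/sched/topology.c starts by
dropping any domain whose span is a single CPU; roughly:

	static int sd_degenerate(struct sched_domain *sd)
	{
		/* A single-CPU span leaves nothing to balance. */
		if (cpumask_weight(sched_domain_span(sd)) == 1)
			return 1;
		...
	}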

This is what I get on my Ampere Altra (I guess I don't have the ACPI
changes which would lead to a CLS sched domain):

# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
DIE
NUMA
root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
CONFIG_SCHED_CLUSTER=y

>>> +      */
>>> +     if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>>> +             core_mask = &cpu_topology[cpu].cluster_sibling;
>>> +
>>
>> Sudeep, Vincent, are you happy with this now?
> 
> I would not say that I'm happy, because this solution skews the core
> cpu mask in order to abuse the scheduler so that it will remove a
> wrong but useless level when it builds its domains.
> But this works, so as long as the maintainers are happy, I'm fine.

I do not have any better idea than this tweak here either in case the
platform can't provide a cleaner setup.

Maybe the following is easier to read but then we use
'&cpu_topology[cpu].llc_sibling' in cpu_coregroup_mask() already ...

@@ -617,6 +617,7 @@ EXPORT_SYMBOL_GPL(cpu_topology);
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
        const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
+       const cpumask_t *cluster_mask = cpu_clustergroup_mask(cpu);

        /* Find the smaller of NUMA, core or LLC siblings */
        if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
@@ -628,6 +629,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
                        core_mask = &cpu_topology[cpu].llc_sibling;
        }

+       if (cpumask_subset(core_mask, cluster_mask))
+               core_mask = cluster_mask;
+
        return core_mask;
 }

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Darren Hart March 8, 2022, 5:49 p.m. UTC | #5
On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
> On 08/03/2022 12:04, Vincent Guittot wrote:
> > On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
> 
> [...]
> 
> >>> ---
> >>> v1: Drop MC level if coregroup weight == 1
> >>> v2: New sd topo in arch/arm64/kernel/smp.c
> >>> v3: No new topo, extend core_mask to cluster_siblings
> >>>
> >>>  drivers/base/arch_topology.c | 8 ++++++++
> >>>  1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> >>> index 976154140f0b..a96f45db928b 100644
> >>> --- a/drivers/base/arch_topology.c
> >>> +++ b/drivers/base/arch_topology.c
> >>> @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >>>                       core_mask = &cpu_topology[cpu].llc_sibling;
> >>>       }
> >>>
> >>> +     /*
> >>> +      * For systems with no shared cpu-side LLC but with clusters defined,
> >>> +      * extend core_mask to cluster_siblings. The sched domain builder will
> >>> +      * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> 
> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
> 
> This is what I get on my Ampere Altra (I guess I don't have the ACPI
> changes which would lead to a CLS sched domain):
> 
> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> DIE
> NUMA
> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
> CONFIG_SCHED_CLUSTER=y

I'd like to follow up on this. Would you share your dmidecode BIOS
Information section?

Which kernel version?

> >>> +      */
> >>> +     if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> >>> +             core_mask = &cpu_topology[cpu].cluster_sibling;
> >>> +
> >>
> >> Sudeep, Vincent, are you happy with this now?
> > 
> > I would not say that I'm happy, because this solution skews the core
> > cpu mask in order to abuse the scheduler so that it will remove a
> > wrong but useless level when it builds its domains.
> > But this works, so as long as the maintainers are happy, I'm fine.

I did explore the other options and they added considerably more
complexity without much benefit in my view. I prefer this option which
maintains the cpu_topology as described by the platform, and maps it
into something that suits the current scheduler abstraction. I agree
there is more work to be done here and intend to continue with it.

> I do not have any better idea than this tweak here either in case the
> platform can't provide a cleaner setup.

I'd argue the platform is describing itself accurately in ACPI PPTT
terms. The topology doesn't fit nicely within the kernel abstractions
today. This is an area where I hope to continue to improve things going
forward.

> Maybe the following is easier to read but then we use
> '&cpu_topology[cpu].llc_sibling' in cpu_coregroup_mask() already ...
> 
> @@ -617,6 +617,7 @@ EXPORT_SYMBOL_GPL(cpu_topology);
>  const struct cpumask *cpu_coregroup_mask(int cpu)
>  {
>         const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
> +       const cpumask_t *cluster_mask = cpu_clustergroup_mask(cpu);
> 
>         /* Find the smaller of NUMA, core or LLC siblings */
>         if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
> @@ -628,6 +629,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>                         core_mask = &cpu_topology[cpu].llc_sibling;
>         }
> 
> +       if (cpumask_subset(core_mask, cluster_mask))
> +               core_mask = cluster_mask;
> +

Either works for me. I felt the version I sent was parallel to the
existing implementation, but have no preference either way.

>         return core_mask;
>  }
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 

Thanks for the review Dietmar.
Dietmar Eggemann March 9, 2022, 12:50 p.m. UTC | #6
On 08/03/2022 18:49, Darren Hart wrote:
> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
>>
>> This is what I get on my Ampere Altra (I guess I don't have the ACPI
>> changes which would lead to a CLS sched domain):
>>
>> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> DIE
>> NUMA
>> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
>> CONFIG_SCHED_CLUSTER=y
> 
> I'd like to follow up on this. Would you share your dmidecode BIOS
> Information section?

# dmidecode -t 0
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
	Vendor: Ampere(TM)
	Version: 0.9.20200724
	Release Date: 2020/07/24
	ROM Size: 7680 kB
	Characteristics:
		PCI is supported
		BIOS is upgradeable
		Boot from CD is supported
		Selectable boot is supported
		ACPI is supported
		UEFI is supported
	BIOS Revision: 5.15
	Firmware Revision: 0.6

> Which kernel version?

v5.17-rc5

[...]

>>> I would not say that I'm happy, because this solution skews the core
>>> cpu mask in order to abuse the scheduler so that it will remove a
>>> wrong but useless level when it builds its domains.
>>> But this works, so as long as the maintainers are happy, I'm fine.
> 
> I did explore the other options and they added considerably more
> complexity without much benefit in my view. I prefer this option which
> maintains the cpu_topology as described by the platform, and maps it
> into something that suits the current scheduler abstraction. I agree
> there is more work to be done here and intend to continue with it.
> 
>> I do not have any better idea than this tweak here either in case the
>> platform can't provide a cleaner setup.
> 
> I'd argue the platform is describing itself accurately in ACPI PPTT
> terms. The topology doesn't fit nicely within the kernel abstractions
> today. This is an area where I hope to continue to improve things going
> forward.

I see. And I assume lying about SCU/LLC boundaries in ACPI is not an
option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/.

[...]
Darren Hart March 9, 2022, 6:26 p.m. UTC | #7
On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
> On 08/03/2022 18:49, Darren Hart wrote:
> > On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
> >> On 08/03/2022 12:04, Vincent Guittot wrote:
> >>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
> 
> [...]
> 
> >> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
> >>
> >> This is what I get on my Ampere Altra (I guess I don't have the ACPI
> >> changes which would lead to a CLS sched domain):
> >>
> >> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> >> DIE
> >> NUMA
> >> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
> >> CONFIG_SCHED_CLUSTER=y
> > 
> > I'd like to follow up on this. Would you share your dmidecode BIOS
> > Information section?
> 
> # dmidecode -t 0
> # dmidecode 3.2
> Getting SMBIOS data from sysfs.
> SMBIOS 3.2.0 present.
> 
> Handle 0x0000, DMI type 0, 26 bytes
> BIOS Information
> 	Vendor: Ampere(TM)
> 	Version: 0.9.20200724
> 	Release Date: 2020/07/24
> 	ROM Size: 7680 kB
> 	Characteristics:
> 		PCI is supported
> 		BIOS is upgradeable
> 		Boot from CD is supported
> 		Selectable boot is supported
> 		ACPI is supported
> 		UEFI is supported
> 	BIOS Revision: 5.15
> 	Firmware Revision: 0.6
> 

Thank you, I'm following up internally and will get with you.

> > Which kernel version?
> 
> v5.17-rc5
> 
> [...]
> 
> >>> I would not say that I'm happy, because this solution skews the core
> >>> cpu mask in order to abuse the scheduler so that it will remove a
> >>> wrong but useless level when it builds its domains.
> >>> But this works, so as long as the maintainers are happy, I'm fine.
> > 
> > I did explore the other options and they added considerably more
> > complexity without much benefit in my view. I prefer this option which
> > maintains the cpu_topology as described by the platform, and maps it
> > into something that suits the current scheduler abstraction. I agree
> > there is more work to be done here and intend to continue with it.
> > 
> >> I do not have any better idea than this tweak here either in case the
> >> platform can't provide a cleaner setup.
> > 
> > I'd argue the platform is describing itself accurately in ACPI PPTT
> > terms. The topology doesn't fit nicely within the kernel abstractions
> > today. This is an area where I hope to continue to improve things going
> > forward.
> 
> I see. And I assume lying about SCU/LLC boundaries in ACPI is not an
> option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/.
> 
> [...]

I'm not aware of a way to accurately describe the SCU topology in the PPTT, and
the risk we run with lying about LLC topology is that the lie has to be comprehended
by all OSes and not conflict with other lies people may ask for. In general, I
think it is preferable and more maintainable to describe the topology as
accurately and honestly as we can within the existing platform mechanisms (PPTT,
HMAT, etc) and work on the higher level abstractions to accommodate a broader
set of topologies as they emerge (as well as working to more fully describe the
topology with new platform level mechanisms as needed).

As I mentioned, I intend to continue looking in to how to improve the current
abstractions. For now, it sounds like we have agreement that this patch can be
merged to address the BUG?

Thanks all,
Dietmar Eggemann March 14, 2022, 9:37 a.m. UTC | #8
On 09/03/2022 19:26, Darren Hart wrote:
> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>> On 08/03/2022 18:49, Darren Hart wrote:
>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
>>
>> [...]
>>
>>>> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
>>>>
>>>> This is what I get on my Ampere Altra (I guess I don't have the ACPI
>>>> changes which would lead to a CLS sched domain):
>>>>
>>>> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>>>> DIE
>>>> NUMA
>>>> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
>>>> CONFIG_SCHED_CLUSTER=y
>>>
>>> I'd like to follow up on this. Would you share your dmidecode BIOS
>>> Information section?
>>
>> # dmidecode -t 0
>> # dmidecode 3.2
>> Getting SMBIOS data from sysfs.
>> SMBIOS 3.2.0 present.
>>
>> Handle 0x0000, DMI type 0, 26 bytes
>> BIOS Information
>> 	Vendor: Ampere(TM)
>> 	Version: 0.9.20200724
>> 	Release Date: 2020/07/24
>> 	ROM Size: 7680 kB
>> 	Characteristics:
>> 		PCI is supported
>> 		BIOS is upgradeable
>> 		Boot from CD is supported
>> 		Selectable boot is supported
>> 		ACPI is supported
>> 		UEFI is supported
>> 	BIOS Revision: 5.15
>> 	Firmware Revision: 0.6
>>
> 
> Thank you, I'm following up internally and will get with you.

Looks like in my PPTT, the `Processor Hierarchy Nodes` which represent
cluster nodes have no valid `ACPI Processor ID`.

Example for CPU0:

cpu_node:

[1B9Ch 7068   1]           Subtable Type : 00 [Processor Hierarchy Node]
[1B9Dh 7069   1]                       Length : 1C
[1B9Eh 7070   2]                     Reserved : 0000
[1BA0h 7072   4]        Flags (decoded below) : 0000001A
                            Physical package : 0
                     ACPI Processor ID valid : 1           <-- valid !!!
                       Processor is a thread : 0
                              Node is a leaf : 1
                    Identical Implementation : 1
[1BA4h 7076   4]                       Parent : 00001B88  <-- parent !!!
[1BA8h 7080   4]            ACPI Processor ID : 00001200
[1BACh 7084   4]      Private Resource Number : 00000002
[1BB0h 7088   4]             Private Resource : 00001B58
[1BB4h 7092   4]             Private Resource : 00001B70

cluster_node (cpu_node->parent):

[1B88h 7048   1]           Subtable Type : 00 [Processor Hierarchy Node]
[1B89h 7049   1]                       Length : 14
[1B8Ah 7050   2]                     Reserved : 0000
[1B8Ch 7052   4]        Flags (decoded below) : 00000010
                            Physical package : 0
                     ACPI Processor ID valid : 0       <-- not valid !!!
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                       Processor is a thread : 0
                              Node is a leaf : 0
                    Identical Implementation : 1
[1B90h 7056   4]                       Parent : 000001C8
[1B94h 7060   4]            ACPI Processor ID : 00000000
[1B98h 7064   4]      Private Resource Number : 00000000

The code which checks this is:

int find_acpi_cpu_topology_cluster(unsigned int cpu)
{
    ....
    if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
        retval = cluster_node->acpi_processor_id;
    else
        retval = ACPI_PTR_DIFF(cluster_node, table);

The else path just returns distinct values for each CPU, so there is no
sub-grouping of CPUs which can lead to a CLS SD.
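
For reference, update_siblings_masks() in drivers/base/arch_topology.c
only pairs CPUs whose cluster_id matches; roughly:

	if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
	    cpuid_topo->cluster_id != -1) {
		cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
		cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
	}

With a distinct ACPI_PTR_DIFF() value per CPU, no two CPUs ever match,
so every cluster_sibling mask stays at weight 1.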
Dietmar Eggemann March 14, 2022, 4:35 p.m. UTC | #9
On 09/03/2022 19:26, Darren Hart wrote:
> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>> On 08/03/2022 18:49, Darren Hart wrote:
>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>>>> I do not have any better idea than this tweak here either in case the
>>>> platform can't provide a cleaner setup.
>>>
>>> I'd argue the platform is describing itself accurately in ACPI PPTT
>>> terms. The topology doesn't fit nicely within the kernel abstractions
>>> today. This is an area where I hope to continue to improve things going
>>> forward.
>>
>> I see. And I assume lying about SCU/LLC boundaries in ACPI is not an
>> option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/.
>>
>> [...]
> 
> I'm not aware of a way to accurately describe the SCU topology in the PPTT, and
> the risk we run with lying about LLC topology is that the lie has to be comprehended
> by all OSes and not conflict with other lies people may ask for. In general, I
> think it is preferable and more maintainable to describe the topology as
> accurately and honestly as we can within the existing platform mechanisms (PPTT,
> HMAT, etc) and work on the higher level abstractions to accommodate a broader
> set of topologies as they emerge (as well as working to more fully describe the
> topology with new platform level mechanisms as needed).
> 
> As I mentioned, I intend to continue looking in to how to improve the current
> abstractions. For now, it sounds like we have agreement that this patch can be
> merged to address the BUG?

What about swapping the CLS and MC cpumasks for such a machine? This
would avoid the task scheduler having to deal with a system which has
CLS but no MC. We essentially promote the CLS cpumask up to MC in this
case.

cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
^^
DIE
NUMA

cat /sys/kernel/debug/sched/domains/cpu0# cat domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
                                                                  ^^^^^^^^^^^^^^^^^^^^^^ 
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING 
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

Only very lightly tested on Altra and Juno-r0 (DT).

--->8---

From 54bef59e7f50fa41b7ae39190fd71af57209c27d Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 14 Mar 2022 15:08:23 +0000
Subject: [PATCH] arch_topology: Swap MC & CLS SD mask if MC weight==1 &
 subset(MC,CLS)

This avoids the issue of having a system with a CLS SD but no MC SD.
CLS should be sub-SD of MC.

The cpumask under /sys/devices/system/cpu/cpu*/cache/index* and
/sys/devices/system/cpu/cpu*/topology are not changed by this.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 drivers/base/arch_topology.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 976154140f0b..9af90a5625c7 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -614,7 +614,7 @@ static int __init parse_dt_topology(void)
 struct cpu_topology cpu_topology[NR_CPUS];
 EXPORT_SYMBOL_GPL(cpu_topology);
 
-const struct cpumask *cpu_coregroup_mask(int cpu)
+const struct cpumask *_cpu_coregroup_mask(int cpu)
 {
 	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
 
@@ -631,11 +631,37 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
-const struct cpumask *cpu_clustergroup_mask(int cpu)
+const struct cpumask *_cpu_clustergroup_mask(int cpu)
 {
 	return &cpu_topology[cpu].cluster_sibling;
 }
 
+static int
+swap_masks(const cpumask_t *core_mask, const cpumask_t *cluster_mask)
+{
+	if (cpumask_weight(core_mask) == 1 &&
+	    cpumask_subset(core_mask, cluster_mask))
+		return 1;
+
+	return 0;
+}	
+
+const struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu);
+	const cpumask_t *core_mask = _cpu_coregroup_mask(cpu);
+	
+	return swap_masks(core_mask, cluster_mask) ? cluster_mask : core_mask;
+}
+
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu);
+	const cpumask_t *core_mask = _cpu_coregroup_mask(cpu);
+
+	return swap_masks(core_mask, cluster_mask) ? core_mask : cluster_mask;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
Darren Hart March 14, 2022, 4:54 p.m. UTC | #10
On Mon, Mar 14, 2022 at 05:35:05PM +0100, Dietmar Eggemann wrote:
> On 09/03/2022 19:26, Darren Hart wrote:
> > On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
> >> On 08/03/2022 18:49, Darren Hart wrote:
> >>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
> >>>> On 08/03/2022 12:04, Vincent Guittot wrote:
> >>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
> 
> [...]
> 
> >>>> I do not have any better idea than this tweak here either in case the
> >>>> platform can't provide a cleaner setup.
> >>>
> >>> I'd argue the platform is describing itself accurately in ACPI PPTT
> >>> terms. The topology doesn't fit nicely within the kernel abstractions
> >>> today. This is an area where I hope to continue to improve things going
> >>> forward.
> >>
> >> I see. And I assume lying about SCU/LLC boundaries in ACPI is not an
> >> option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/.
> >>
> >> [...]
> > 
> > I'm not aware of a way to accurately describe the SCU topology in the PPTT, and
> > the risk we run with lying about LLC topology is that the lie has to be comprehended
> > by all OSes and not conflict with other lies people may ask for. In general, I
> > think it is preferable and more maintainable to describe the topology as
> > accurately and honestly as we can within the existing platform mechanisms (PPTT,
> > HMAT, etc) and work on the higher level abstractions to accommodate a broader
> > set of topologies as they emerge (as well as working to more fully describe the
> > topology with new platform level mechanisms as needed).
> > 
> > As I mentioned, I intend to continue looking in to how to improve the current
> > abstractions. For now, it sounds like we have agreement that this patch can be
> > merged to address the BUG?
> 
> What about swapping the CLS and MC cpumasks for such a machine? This
> would avoid the task scheduler having to deal with a system which has
> CLS but no MC. We essentially promote the CLS cpumask up to MC in this
> case.
> 
> cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> MC
> ^^
> DIE
> NUMA
> 
> cat /sys/kernel/debug/sched/domains/cpu0# cat domain*/flags
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
>                                                                   ^^^^^^^^^^^^^^^^^^^^^^ 
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING 
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
> 
> Only very lightly tested on Altra and Juno-r0 (DT).
> 
> --->8---
> 
> From 54bef59e7f50fa41b7ae39190fd71af57209c27d Mon Sep 17 00:00:00 2001
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Mon, 14 Mar 2022 15:08:23 +0000
> Subject: [PATCH] arch_topology: Swap MC & CLS SD mask if MC weight==1 &
>  subset(MC,CLS)
> 
> This avoids the issue of having a system with a CLS SD but no MC SD.
> CLS should be sub-SD of MC.

Hi Dietmar,

Ultimately, this delivers the same result. I do think it imposes more complexity
for everyone to address what, as far as I'm aware, only affects the one system.

I don't think the term "Cluster" has a clear and universally understood
definition, so I don't think it's a given that "CLS should be sub-SD of MC". I
think this has been assumed, and that assumption has mostly held up, but this is
an abstraction, and the abstraction should follow the physical topologies rather
than the other way around in my opinion. If that's the primary motivation for
this approach, I don't think it justifies the additional complexity.

All told, I prefer the 2 line change contained within cpu_coregroup_mask() which
handles the one known exception with minimal impact. It's easy enough to come
back to this to address more cases with a more complex solution if needed in the
future - but I prefer to introduce the least amount of complexity possible to
address the known issues, especially if the end result is the same and the cost
is paid by the affected systems.

Thanks,

> 
> The cpumask under /sys/devices/system/cpu/cpu*/cache/index* and
> /sys/devices/system/cpu/cpu*/topology are not changed by this.
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  drivers/base/arch_topology.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 976154140f0b..9af90a5625c7 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -614,7 +614,7 @@ static int __init parse_dt_topology(void)
>  struct cpu_topology cpu_topology[NR_CPUS];
>  EXPORT_SYMBOL_GPL(cpu_topology);
>  
> -const struct cpumask *cpu_coregroup_mask(int cpu)
> +const struct cpumask *_cpu_coregroup_mask(int cpu)
>  {
>  	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
>  
> @@ -631,11 +631,37 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  	return core_mask;
>  }
>  
> -const struct cpumask *cpu_clustergroup_mask(int cpu)
> +const struct cpumask *_cpu_clustergroup_mask(int cpu)
>  {
>  	return &cpu_topology[cpu].cluster_sibling;
>  }
>  
> +static int
> +swap_masks(const cpumask_t *core_mask, const cpumask_t *cluster_mask)
> +{
> +	if (cpumask_weight(core_mask) == 1 &&
> +	    cpumask_subset(core_mask, cluster_mask))
> +		return 1;
> +
> +	return 0;
> +}	
> +
> +const struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> +	const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu);
> +	const cpumask_t *core_mask = _cpu_coregroup_mask(cpu);
> +	
> +	return swap_masks(core_mask, cluster_mask) ? cluster_mask : core_mask;
> +}
> +
> +const struct cpumask *cpu_clustergroup_mask(int cpu)
> +{
> +	const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu);
> +	const cpumask_t *core_mask = _cpu_coregroup_mask(cpu);
> +
> +	return swap_masks(core_mask, cluster_mask) ? core_mask : cluster_mask;
> +}
> +
>  void update_siblings_masks(unsigned int cpuid)
>  {
>  	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> -- 
> 2.25.1
Darren Hart March 14, 2022, 4:56 p.m. UTC | #11
On Mon, Mar 14, 2022 at 10:37:21AM +0100, Dietmar Eggemann wrote:
> On 09/03/2022 19:26, Darren Hart wrote:
> > On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
> >> On 08/03/2022 18:49, Darren Hart wrote:
> >>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
> >>>> On 08/03/2022 12:04, Vincent Guittot wrote:
> >>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
> >>
> >> [...]
> >>
> >>>> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
> >>>>
> >>>> This is what I get on my Ampere Altra (I guess I don't have the ACPI
> >>>> changes which would lead to a CLS sched domain):
> >>>>
> >>>> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> >>>> DIE
> >>>> NUMA
> >>>> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
> >>>> CONFIG_SCHED_CLUSTER=y
> >>>
> >>> I'd like to follow up on this. Would you share your dmidecode BIOS
> >>> Information section?
> >>
> >> # dmidecode -t 0
> >> # dmidecode 3.2
> >> Getting SMBIOS data from sysfs.
> >> SMBIOS 3.2.0 present.
> >>
> >> Handle 0x0000, DMI type 0, 26 bytes
> >> BIOS Information
> >> 	Vendor: Ampere(TM)
> >> 	Version: 0.9.20200724
> >> 	Release Date: 2020/07/24
> >> 	ROM Size: 7680 kB
> >> 	Characteristics:
> >> 		PCI is supported
> >> 		BIOS is upgradeable
> >> 		Boot from CD is supported
> >> 		Selectable boot is supported
> >> 		ACPI is supported
> >> 		UEFI is supported
> >> 	BIOS Revision: 5.15
> >> 	Firmware Revision: 0.6
> >>
> > 
> > Thank you, I'm following up internally and will get with you.
> 
> Looks like in my PPTT, the `Processor Hierarchy Nodes` which represent
> cluster nodes have no valid `ACPI Processor ID`.

Thanks, I'm looking for the right person to get us the latest information
available on this. Will follow up once I have it.
kernel test robot March 14, 2022, 9:29 p.m. UTC | #12
Hi Dietmar,

I love your patch! Perhaps something to improve:

[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v5.17-rc8 next-20220310]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4a248f85b3dd8e010ff8335755c927130e9b0764
config: arm-defconfig (https://download.01.org/0day-ci/archive/20220315/202203150553.QRvgHFHm-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7528fb2ea1e30038ee1dcc48df9d413502977895
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
        git checkout 7528fb2ea1e30038ee1dcc48df9d413502977895
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm SHELL=/bin/bash drivers/base/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/base/arch_topology.c:617:23: warning: no previous prototype for '_cpu_coregroup_mask' [-Wmissing-prototypes]
     617 | const struct cpumask *_cpu_coregroup_mask(int cpu)
         |                       ^~~~~~~~~~~~~~~~~~~
>> drivers/base/arch_topology.c:634:23: warning: no previous prototype for '_cpu_clustergroup_mask' [-Wmissing-prototypes]
     634 | const struct cpumask *_cpu_clustergroup_mask(int cpu)
         |                       ^~~~~~~~~~~~~~~~~~~~~~


vim +/_cpu_coregroup_mask +617 drivers/base/arch_topology.c

   616	
 > 617	const struct cpumask *_cpu_coregroup_mask(int cpu)
   618	{
   619		const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
   620	
   621		/* Find the smaller of NUMA, core or LLC siblings */
   622		if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
   623			/* not numa in package, lets use the package siblings */
   624			core_mask = &cpu_topology[cpu].core_sibling;
   625		}
   626		if (cpu_topology[cpu].llc_id != -1) {
   627			if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
   628				core_mask = &cpu_topology[cpu].llc_sibling;
   629		}
   630	
   631		return core_mask;
   632	}
   633	
 > 634	const struct cpumask *_cpu_clustergroup_mask(int cpu)
   635	{
   636		return &cpu_topology[cpu].cluster_sibling;
   637	}
   638	

---
0-DAY CI Kernel Test Service
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot March 14, 2022, 11:02 p.m. UTC | #13
Hi Dietmar,

I love your patch! Perhaps something to improve:

[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v5.17-rc8 next-20220310]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4a248f85b3dd8e010ff8335755c927130e9b0764
config: riscv-randconfig-r042-20220314 (https://download.01.org/0day-ci/archive/20220315/202203150659.T51bnDgz-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 3e4950d7fa78ac83f33bbf1658e2f49a73719236)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/0day-ci/linux/commit/7528fb2ea1e30038ee1dcc48df9d413502977895
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
        git checkout 7528fb2ea1e30038ee1dcc48df9d413502977895
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash drivers/base/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/base/arch_topology.c:617:23: warning: no previous prototype for function '_cpu_coregroup_mask' [-Wmissing-prototypes]
   const struct cpumask *_cpu_coregroup_mask(int cpu)
                         ^
   drivers/base/arch_topology.c:617:7: note: declare 'static' if the function is not intended to be used outside of this translation unit
   const struct cpumask *_cpu_coregroup_mask(int cpu)
         ^
   static 
>> drivers/base/arch_topology.c:634:23: warning: no previous prototype for function '_cpu_clustergroup_mask' [-Wmissing-prototypes]
   const struct cpumask *_cpu_clustergroup_mask(int cpu)
                         ^
   drivers/base/arch_topology.c:634:7: note: declare 'static' if the function is not intended to be used outside of this translation unit
   const struct cpumask *_cpu_clustergroup_mask(int cpu)
         ^
   static 
   2 warnings generated.


vim +/_cpu_coregroup_mask +617 drivers/base/arch_topology.c

   616	
 > 617	const struct cpumask *_cpu_coregroup_mask(int cpu)
   618	{
   619		const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
   620	
   621		/* Find the smaller of NUMA, core or LLC siblings */
   622		if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
   623			/* not numa in package, lets use the package siblings */
   624			core_mask = &cpu_topology[cpu].core_sibling;
   625		}
   626		if (cpu_topology[cpu].llc_id != -1) {
   627			if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
   628				core_mask = &cpu_topology[cpu].llc_sibling;
   629		}
   630	
   631		return core_mask;
   632	}
   633	
 > 634	const struct cpumask *_cpu_clustergroup_mask(int cpu)
   635	{
   636		return &cpu_topology[cpu].cluster_sibling;
   637	}
   638	

---
0-DAY CI Kernel Test Service
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Dietmar Eggemann March 16, 2022, 2:42 p.m. UTC | #14
On 14/03/2022 17:56, Darren Hart wrote:
> On Mon, Mar 14, 2022 at 10:37:21AM +0100, Dietmar Eggemann wrote:
>> On 09/03/2022 19:26, Darren Hart wrote:
>>> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 18:49, Darren Hart wrote:
>>>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>>> Thank you, I'm following up internally and will get with you.
>>
>> Looks like in my PPTT, the `Processor Hierarchy Nodes` which represent
>> cluster nodes have no valid `ACPI Processor ID`.
> 
> Thanks, I'm looking for the right person to get us the latest information
> available on this. Will follow up once I have it.

We're installing the new FW on our machine right now. Should see the CLS
SD level w/o hacking find_acpi_cpu_topology_cluster() then ;-)
Dietmar Eggemann March 16, 2022, 2:48 p.m. UTC | #15
- Barry Song <song.bao.hua@hisilicon.com> (always get undelivered mail
  returned to sender)
+ Barry Song <21cnbao@gmail.com>

On 14/03/2022 17:54, Darren Hart wrote:
> On Mon, Mar 14, 2022 at 05:35:05PM +0100, Dietmar Eggemann wrote:
>> On 09/03/2022 19:26, Darren Hart wrote:
>>> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 18:49, Darren Hart wrote:
>>>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

> Ultimately, this delivers the same result. I do think it imposes more complexity
> for everyone to address what, as far as I'm aware, only affects the one system.
> 
> I don't think the term "Cluster" has a clear and universally understood
> definition, so I don't think it's a given that "CLS should be sub-SD of MC". I

I agree, the term 'cluster' is overloaded, but default_topology[] clearly
says (with direction up meaning smaller SD spans):

  #ifdef CONFIG_SCHED_CLUSTER
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
  #endif

  #ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
  #endif

In ACPI code we have `cluster_node = fetch_pptt_node(...,
cpu_node->parent)` but then the cache information (via
llc_id/llc_sibling) can change things, which makes this less easy to grasp.

> think this has been assumed, and that assumption has mostly held up, but this is
> an abstraction, and the abstraction should follow the physical topologies rather
> than the other way around in my opinion. If that's the primary motivation for
> this approach, I don't think it justifies the additional complexity.
> 
> All told, I prefer the 2 line change contained within cpu_coregroup_mask() which
> handles the one known exception with minimal impact. It's easy enough to come
> back to this to address more cases with a more complex solution if needed in the
> future - but I prefer to introduce the least amount of complexity possible to
> address the known issues, especially if the end result is the same and the cost
> is paid by the affected systems.
> 
> Thanks,

Yeah, I can see your point. It's the smaller hack. My solution just
prevents us from manipulating the coregroup mask only to get the MC layer
degenerated by the core topology code. But people might say that's a
clever thing to do here. So I'm fine with your original solution as well.

[...]
Darren Hart March 16, 2022, 3:20 p.m. UTC | #16
On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote:
> - Barry Song <song.bao.hua@hisilicon.com> (always get undelivered mail
>   returned to sender)
> + Barry Song <21cnbao@gmail.com>
> 
> On 14/03/2022 17:54, Darren Hart wrote:
> > On Mon, Mar 14, 2022 at 05:35:05PM +0100, Dietmar Eggemann wrote:
> >> On 09/03/2022 19:26, Darren Hart wrote:
> >>> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
> >>>> On 08/03/2022 18:49, Darren Hart wrote:
> >>>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
> >>>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
> >>>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
> 
> [...]
> 
> > Ultimately, this delivers the same result. I do think it imposes more complexity
> > for everyone to address what, as far as I'm aware, only affects the one system.
> > 
> > I don't think the term "Cluster" has a clear and universally understood
> > definition, so I don't think it's a given that "CLS should be sub-SD of MC". I
> 
> I agree, the term 'cluster' is overloaded, but default_topology[] clearly
> says (with direction up meaning smaller SD spans):
> 
>   #ifdef CONFIG_SCHED_CLUSTER
>         { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
>   #endif
> 
>   #ifdef CONFIG_SCHED_MC
>         { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>   #endif
> 

Right, understood. It is a clear expectation of the current Sched Domain
topology abstraction.

> In ACPI code we have `cluster_node = fetch_pptt_node(...,
> cpu_node->parent)` but then the cache information (via
> llc_id/llc_sibling) can change things, which makes this less easy to grasp.
> 
> > think this has been assumed, and that assumption has mostly held up, but this is
> > an abstraction, and the abstraction should follow the physical topologies rather
> > than the other way around in my opinion. If that's the primary motivation for
> > this approach, I don't think it justifies the additional complexity.
> > 
> > All told, I prefer the 2 line change contained within cpu_coregroup_mask() which
> > handles the one known exception with minimal impact. It's easy enough to come
> > back to this to address more cases with a more complex solution if needed in the
> > future - but I prefer to introduce the least amount of complexity possible to
> > address the known issues, especially if the end result is the same and the cost
> > is paid by the affected systems.
> > 
> > Thanks,
> 
> Yeah, I can see your point. It's the smaller hack. My solution just
> prevents us from manipulating the coregroup mask only to get the MC layer
> degenerated by the core topology code. But people might say that's a
> clever thing to do here. So I'm fine with your original solution as well.
> 
> [...]

Thanks Dietmar,

Sudeep, do we have sufficient consensus to pull in this patch?

Thanks,
Sudeep Holla March 16, 2022, 3:55 p.m. UTC | #17
On Wed, Mar 16, 2022 at 08:20:32AM -0700, Darren Hart wrote:
> On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote:

[...]

> >
> > Yeah, I can see your point. It's the smaller hack. My solution just
> > prevents us from manipulating the coregroup mask only to get the MC layer
> > degenerated by the core topology code. But people might say that's a
> > clever thing to do here. So I'm fine with your original solution as well.
> >
> > [...]
>
> Thanks Dietmar,
>
> Sudeep, do we have sufficient consensus to pull in this patch?

Indeed. I have already Acked it, and after all these discussions we have
concluded that this is the best we can do, though it doesn't match everyone's taste.

Greg or Will (not sure why he had asked since v3 doesn't touch arm64),
can one of you pick this patch?

--
Regards,
Sudeep
Barry Song March 17, 2022, 6:10 a.m. UTC | #18
On Sat, Mar 5, 2022 at 8:54 AM Darren Hart
<darren@os.amperecomputing.com> wrote:
>
> Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
> Control Unit, but have no shared CPU-side last level cache.
>
> cpu_coregroup_mask() will return a cpumask with weight 1, while
> cpu_clustergroup_mask() will return a cpumask with weight 2.
>
> As a result, build_sched_domain() will BUG() once per CPU with:
>
> BUG: arch topology borken
> the CLS domain not a subset of the MC domain
>
> The MC level cpumask is then extended to that of the CLS child, and is
> later removed entirely as redundant. This sched domain topology is an
> improvement over previous topologies, or those built without
> SCHED_CLUSTER, particularly for certain latency sensitive workloads.
> With the current scheduler model and heuristics, this is a desirable
> default topology for Ampere Altra and Altra Max systems.
>
> Rather than create a custom sched domains topology structure and
> introduce new logic in arch/arm64 to detect these systems, update the
> core_mask so coregroup is never a subset of clustergroup, extending it
> to cluster_siblings if necessary.
>
> This has the added benefit over a custom topology of working for both
> symmetric and asymmetric topologies. It does not address systems where
> the cluster topology is above a populated mc topology, but these are not
> considered today and can be addressed separately if and when they
> appear.
>
> The final sched domain topology for a 2 socket Ampere Altra system is
> unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
>
> For CPU0:
>
> CONFIG_SCHED_CLUSTER=y
> CLS  [0-1]
> DIE  [0-79]
> NUMA [0-159]
>
> CONFIG_SCHED_CLUSTER is not set
> DIE  [0-79]
> NUMA [0-159]
>
> Cc: Sudeep Holla <sudeep.holla@arm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Barry Song <song.bao.hua@hisilicon.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Cc: <stable@vger.kernel.org> # 5.16.x
> Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Darren Hart <darren@os.amperecomputing.com>

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
> v1: Drop MC level if coregroup weight == 1
> v2: New sd topo in arch/arm64/kernel/smp.c
> v3: No new topo, extend core_mask to cluster_siblings
>
>  drivers/base/arch_topology.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 976154140f0b..a96f45db928b 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>                         core_mask = &cpu_topology[cpu].llc_sibling;
>         }
>
> +       /*
> +        * For systems with no shared cpu-side LLC but with clusters defined,
> +        * extend core_mask to cluster_siblings. The sched domain builder will
> +        * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
> +        */
> +       if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> +               core_mask = &cpu_topology[cpu].cluster_sibling;
> +
>         return core_mask;
>  }
>
> --
> 2.31.1
>

Thanks
Barry
Will Deacon March 21, 2022, 2:30 p.m. UTC | #19
On Wed, Mar 16, 2022 at 03:55:41PM +0000, Sudeep Holla wrote:
> On Wed, Mar 16, 2022 at 08:20:32AM -0700, Darren Hart wrote:
> > On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote:
> 
> [...]
> 
> > >
> > > Yeah, I can see your point. It's the smaller hack. My solution just
> > > prevents us from manipulating the coregroup mask only to get the MC layer
> > > degenerated by the core topology code. But people might say that's a
> > > clever thing to do here. So I'm fine with your original solution as well.
> > >
> > > [...]
> >
> > Thanks Dietmar,
> >
> > Sudeep, do we have sufficient consensus to pull in this patch?
> 
> > Indeed. I have already Acked it, and after all these discussions we have
> > concluded that this is the best we can do, though it doesn't match everyone's taste.
> 
> > Greg or Will (not sure why he had asked since v3 doesn't touch arm64),
> > can one of you pick this patch?

Right, this doesn't touch arm64 any more so I don't think I'm the right
person to queue it.

Will
Greg Kroah-Hartman March 21, 2022, 3:56 p.m. UTC | #20
On Mon, Mar 21, 2022 at 02:30:22PM +0000, Will Deacon wrote:
> On Wed, Mar 16, 2022 at 03:55:41PM +0000, Sudeep Holla wrote:
> > On Wed, Mar 16, 2022 at 08:20:32AM -0700, Darren Hart wrote:
> > > On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote:
> > 
> > [...]
> > 
> > > >
> > > > Yeah, I can see your point. It's the smaller hack. My solution just
> > > > prevents us from manipulating the coregroup mask only to get the MC layer
> > > > degenerated by the core topology code. But people might say that's a
> > > > clever thing to do here. So I'm fine with your original solution as well.
> > > >
> > > > [...]
> > >
> > > Thanks Dietmar,
> > >
> > > Sudeep, do we have sufficient consensus to pull in this patch?
> > 
> > > Indeed. I have already Acked it, and after all these discussions we have
> > > concluded that this is the best we can do, though it doesn't match everyone's taste.
> > 
> > > Greg or Will (not sure why he had asked since v3 doesn't touch arm64),
> > > can one of you pick this patch?
> 
> Right, this doesn't touch arm64 any more so I don't think I'm the right
> person to queue it.

It's too late for 5.18-rc1.  Please rebase and resend it after that is
out and I will be glad to queue it up.

thanks,

greg k-h

Patch

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 976154140f0b..a96f45db928b 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 			core_mask = &cpu_topology[cpu].llc_sibling;
 	}
 
+	/*
+	 * For systems with no shared cpu-side LLC but with clusters defined,
+	 * extend core_mask to cluster_siblings. The sched domain builder will
+	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
+	 */
+	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
+		core_mask = &cpu_topology[cpu].cluster_sibling;
+
 	return core_mask;
 }