
arm64: smp: Skip MC domain for SoCs without shared cache

Message ID 8c4a69eca4d0591f30c112df59c5098c24923bd3.1644543449.git.darren@os.amperecomputing.com (mailing list archive)
State New, archived
Series arm64: smp: Skip MC domain for SoCs without shared cache

Commit Message

Darren Hart Feb. 11, 2022, 1:42 a.m. UTC
SoCs such as the Ampere Altra define clusters but have no shared
processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
CONFIG_SCHED_MC, build_sched_domain() will BUG() with:

BUG: arch topology borken
     the CLS domain not a subset of the MC domain

for each CPU (160 times for a 2 socket 80 core Altra system). The MC
level cpu mask is then extended to that of the CLS child, and is later
removed entirely as redundant.

This change detects when all cpu_coregroup_mask weights=1 and uses an
alternative sched_domain_topology equivalent to the default if
CONFIG_SCHED_MC were disabled.

The final resulting sched domain topology is unchanged with or without
CONFIG_SCHED_CLUSTER, and the BUG is avoided:

For CPU0:

With CLS:
CLS  [0-1]
DIE  [0-79]
NUMA [0-159]

Without CLS:
DIE  [0-79]
NUMA [0-159]

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
---
 arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Comments

Song Bao Hua (Barry Song) Feb. 11, 2022, 3:20 a.m. UTC | #1
> -----Original Message-----
> From: Darren Hart [mailto:darren@os.amperecomputing.com]
> Sent: Friday, February 11, 2022 2:43 PM
> To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> <linux-arm-kernel@lists.infradead.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; Valentin Schneider
> <valentin.schneider@arm.com>; D . Scott Phillips
> <scott@os.amperecomputing.com>; Ilkka Koskinen
> <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> 
> SoCs such as the Ampere Altra define clusters but have no shared
> processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> 
> BUG: arch topology borken
>      the CLS domain not a subset of the MC domain
> 
> for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> level cpu mask is then extended to that of the CLS child, and is later
> removed entirely as redundant.
> 
> This change detects when all cpu_coregroup_mask weights=1 and uses an
> alternative sched_domain_topology equivalent to the default if
> CONFIG_SCHED_MC were disabled.
> 
> The final resulting sched domain topology is unchanged with or without
> CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> 
> For CPU0:
> 
> With CLS:
> CLS  [0-1]
> DIE  [0-79]
> NUMA [0-159]
> 
> Without CLS:
> DIE  [0-79]
> NUMA [0-159]
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Barry Song <song.bao.hua@hisilicon.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Cc: <stable@vger.kernel.org> # 5.16.x
> Signed-off-by: Darren Hart <darren@os.amperecomputing.com>

Hi Darren,
What kind of resources are clusters sharing on Ampere Altra?
So on Altra, cpus are not sharing LLC? Each LLC is separate
for each cpu?

> ---
>  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 27df5c1e6baa..0a78ac5c8830 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
>  	}
>  }
> 
> +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> +#endif
> +
> +#ifdef CONFIG_SCHED_CLUSTER
> +	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> +#endif
> +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +	{ NULL, },
> +};
> +
>  void __init smp_prepare_cpus(unsigned int max_cpus)
>  {
>  	const struct cpu_operations *ops;
> +	bool use_no_mc_topology = true;
>  	int err;
>  	unsigned int cpu;
>  	unsigned int this_cpu;
> @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> 
>  		set_cpu_present(cpu, true);
>  		numa_store_cpu_info(cpu);
> +
> +		/*
> +		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
> +		 */
> +		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> +			use_no_mc_topology = false;

This seems to be wrong? If you have 5 cpus, and
cpu0 has a cpu_coregroup_mask(cpu) weight of 1 while cpu1-4
have a cpu_coregroup_mask(cpu) weight of 4, then for cpu0 you still
need to remove MC, but for cpu1-4 you will need
both CLS and MC?

This flag shouldn't be global.

> +	}
> +
> +	/*
> +	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
> +	 * weights=1. If they also define clusters with cpu_clustergroup_mask
> +	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
> +	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
> +	 * to match CLS, and later discard the MC level. Avoid the bug by using
> +	 * a topology without the MC if the cpu_coregroup_mask weights=1.
> +	 */
> +	if (use_no_mc_topology) {
> +		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
> +		set_sched_topology(arm64_no_mc_topology);
>  	}
>  }
> 
> --
> 2.31.1


Thanks
Barry
Darren Hart Feb. 11, 2022, 5:54 p.m. UTC | #2
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> 
> 
> > -----Original Message-----
> > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > Sent: Friday, February 11, 2022 2:43 PM
> > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > <linux-arm-kernel@lists.infradead.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > <valentin.schneider@arm.com>; D . Scott Phillips
> > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > 
> > SoCs such as the Ampere Altra define clusters but have no shared
> > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > 
> > BUG: arch topology borken
> >      the CLS domain not a subset of the MC domain
> > 
> > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > level cpu mask is then extended to that of the CLS child, and is later
> > removed entirely as redundant.
> > 
> > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > alternative sched_domain_topology equivalent to the default if
> > CONFIG_SCHED_MC were disabled.
> > 
> > The final resulting sched domain topology is unchanged with or without
> > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > 
> > For CPU0:
> > 
> > With CLS:
> > CLS  [0-1]
> > DIE  [0-79]
> > NUMA [0-159]
> > 
> > Without CLS:
> > DIE  [0-79]
> > NUMA [0-159]
> > 
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Cc: <stable@vger.kernel.org> # 5.16.x
> > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> 
> Hi Darrent,

Hi Barry, thanks for the review.

> What kind of resources are clusters sharing on Ampere Altra?

The cluster pairs are DSU pairs (ARM DynamIQ Shared Unit). While there
is no shared L3 cache, they do share an SCU (snoop control unit) and
have a cache coherency latency advantage relative to non-DSU pairs.

The Anandtech Altra review illustrates this advantage:
https://www.anandtech.com/show/16315/the-ampere-altra-review/3

Notably, the SCHED_CLUSTER change did result in marked improvements for
some interactive workloads.

> So on Altra, cpus are not sharing LLC? Each LLC is separate
> for each cpu?

Correct. On the processor side the last level cache is at each cpu, and
then there is a memory-side SLC (system level cache) that is shared by
all cpus.

> 
> > ---
> >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> >  1 file changed, 32 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 27df5c1e6baa..0a78ac5c8830 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> >  	}
> >  }
> > 
> > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > +#ifdef CONFIG_SCHED_SMT
> > +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > +#endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > +#endif
> > +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > +	{ NULL, },
> > +};
> > +
> >  void __init smp_prepare_cpus(unsigned int max_cpus)
> >  {
> >  	const struct cpu_operations *ops;
> > +	bool use_no_mc_topology = true;
> >  	int err;
> >  	unsigned int cpu;
> >  	unsigned int this_cpu;
> > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > 
> >  		set_cpu_present(cpu, true);
> >  		numa_store_cpu_info(cpu);
> > +
> > +		/*
> > +		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > +		 */
> > +		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > +			use_no_mc_topology = false;
> 
> This seems to be wrong? If you have 5 cpus,
> Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> need to remove MC, but for cpu1-4, you will need
> CLS and MC both?
> 
> This flag shouldn't be global.

Please note that this patch is intended to maintain an identical final
sched domain construction for a symmetric topology with no shared
processor-side cache and with cache advantaged clusters and avoid the
BUG messages since this topology is correct for this architecture.

By using a sched topology without the MC layer, this more accurately
describes this architecture and does not require changes to
build_sched_domain(), in particular changes to the assumptions about
what a valid topology is.
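
For reference, the arm64_no_mc_topology[] table above is just the default
topology table from kernel/sched/topology.c with the MC entry dropped. A
sketch of the default, as I remember the v5.16 sources (check the tree for
the exact definition):

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif

#ifdef CONFIG_SCHED_CLUSTER
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif

#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};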

The test above tests every cpu coregroup weight in order to limit the
impact of this change to this specific kind of topology. It
intentionally does not address, nor change existing behavior for, the
asymmetrical topology you describe.

I felt this was the less invasive approach and consistent with how other
architectures handled "non-default" topologies.

If there is interest in working toward a more generic topology builder,
I'd be interested in working on that too, but I think this change makes
sense in the near term.

Thanks,

> 
> > +	}
> > +
> > +	/*
> > +	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
> > +	 * weights=1. If they also define clusters with cpu_clustergroup_mask
> > +	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
> > +	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
> > +	 * to match CLS, and later discard the MC level. Avoid the bug by using
> > +	 * a topology without the MC if the cpu_coregroup_mask weights=1.
> > +	 */
> > +	if (use_no_mc_topology) {
> > +		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
> > +		set_sched_topology(arm64_no_mc_topology);
> >  	}
> >  }
> > 
> > --
> > 2.31.1
> 
> 
> Thanks
> Barry
>
Barry Song Feb. 15, 2022, 10:25 a.m. UTC | #3
On Tue, Feb 15, 2022 at 12:17 AM Darren Hart
<darren@os.amperecomputing.com> wrote:
>
> On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > -----Original Message-----
> > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > Sent: Friday, February 11, 2022 2:43 PM
> > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > <linux-arm-kernel@lists.infradead.org>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > >
> > > SoCs such as the Ampere Altra define clusters but have no shared
> > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > >
> > > BUG: arch topology borken
> > >      the CLS domain not a subset of the MC domain
> > >
> > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > level cpu mask is then extended to that of the CLS child, and is later
> > > removed entirely as redundant.
> > >
> > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > alternative sched_domain_topology equivalent to the default if
> > > CONFIG_SCHED_MC were disabled.
> > >
> > > The final resulting sched domain topology is unchanged with or without
> > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > >
> > > For CPU0:
> > >
> > > With CLS:
> > > CLS  [0-1]
> > > DIE  [0-79]
> > > NUMA [0-159]
> > >
> > > Without CLS:
> > > DIE  [0-79]
> > > NUMA [0-159]
> > >
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> >
> > Hi Darrent,
>
> Hi Barry, thanks for the review.
>
> > What kind of resources are clusters sharing on Ampere Altra?
>
> The cluster pairs are DSU pairs (ARM DynamIQ Shared Unit). While there
> is no shared L3 cache, they do share an SCU (snoop control unit) and
> have a cache coherency latency advantage relative to non-DSU pairs.
>
> The Anandtech Altra review illustrates this advantage:
> https://www.anandtech.com/show/16315/the-ampere-altra-review/3
>
> Notably, the SCHED_CLUSTER change did result in marked improvements for
> some interactive workloads.

Thanks. There is a wake_affine patchset; I also wonder if your device can also
benefit from it:
https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/

>
> > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > for each cpu?
>
> Correct. On the processor side the last level is at each cpu, and then
> there is a memory side SLC (system level cache) that is shared by all
> cpus.

Thanks.

>
> >
> > > ---
> > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > >  1 file changed, 32 insertions(+)
> > >
> > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > --- a/arch/arm64/kernel/smp.c
> > > +++ b/arch/arm64/kernel/smp.c
> > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > >     }
> > >  }
> > >
> > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > +#ifdef CONFIG_SCHED_SMT
> > > +   { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > +#endif
> > > +
> > > +#ifdef CONFIG_SCHED_CLUSTER
> > > +   { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > +#endif
> > > +   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > +   { NULL, },
> > > +};
> > > +
> > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > >  {
> > >     const struct cpu_operations *ops;
> > > +   bool use_no_mc_topology = true;
> > >     int err;
> > >     unsigned int cpu;
> > >     unsigned int this_cpu;
> > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > >
> > >             set_cpu_present(cpu, true);
> > >             numa_store_cpu_info(cpu);
> > > +
> > > +           /*
> > > +            * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > +            */
> > > +           if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > +                   use_no_mc_topology = false;
> >
> > This seems to be wrong? If you have 5 cpus,
> > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > need to remove MC, but for cpu1-4, you will need
> > CLS and MC both?
> >
> > This flag shouldn't be global.
>
> Please note that this patch is intended to maintain an identical final
> sched domain construction for a symmetric topology with no shared
> processor-side cache and with cache advantaged clusters and avoid the
> BUG messages since this topology is correct for this architecture.
>
> By using a sched topology without the MC layer, this more accurately
> describes this architecture and does not require changes to
> build_sched_domain(), in particular changes to the assumptions about
> what a valid topology is.
>
> The test above tests every cpu coregroup weight in order to limit the
> impact of this change to this specific kind of topology. It
> intentionally does not address, nor change existing behavior for, the
> assymetrical topology you describe.
>
> I felt this was the less invasive approach and consistent with how other
> architectures handled "non-default" topologies.
>
> If there is interest on working toward a more generic topology builder,
> I'd be interested in working on that too, but I think this change makes
> sense in the near term.

I do agree this patch makes sense for a symmetric topology, but an
asymmetrical topology is still broken. It might be better to be more
generic. We had a similar fix here for SMT before:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.16&id=55409ac5c3

>
> Thanks,
>
> >
> > > +   }
> > > +
> > > +   /*
> > > +    * SoCs with no shared processor-side cache will have cpu_coregroup_mask
> > > +    * weights=1. If they also define clusters with cpu_clustergroup_mask
> > > +    * weights > 1, build_sched_domain() will trigger a BUG as the CLS
> > > +    * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
> > > +    * to match CLS, and later discard the MC level. Avoid the bug by using
> > > +    * a topology without the MC if the cpu_coregroup_mask weights=1.
> > > +    */
> > > +   if (use_no_mc_topology) {
> > > +           pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
> > > +           set_sched_topology(arm64_no_mc_topology);
> > >     }
> > >  }
> > >
> > > --
> > > 2.31.1
> >
> >
> > Thanks
> > Barry
> >
>
> --
> Darren Hart
> Ampere Computing / OS and Kernel

Thanks
Barry
Will Deacon Feb. 15, 2022, 4:38 p.m. UTC | #4
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> 
> 
> > -----Original Message-----
> > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > Sent: Friday, February 11, 2022 2:43 PM
> > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > <linux-arm-kernel@lists.infradead.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > <valentin.schneider@arm.com>; D . Scott Phillips
> > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > 
> > SoCs such as the Ampere Altra define clusters but have no shared
> > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > 
> > BUG: arch topology borken
> >      the CLS domain not a subset of the MC domain
> > 
> > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > level cpu mask is then extended to that of the CLS child, and is later
> > removed entirely as redundant.
> > 
> > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > alternative sched_domain_topology equivalent to the default if
> > CONFIG_SCHED_MC were disabled.
> > 
> > The final resulting sched domain topology is unchanged with or without
> > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > 
> > For CPU0:
> > 
> > With CLS:
> > CLS  [0-1]
> > DIE  [0-79]
> > NUMA [0-159]
> > 
> > Without CLS:
> > DIE  [0-79]
> > NUMA [0-159]
> > 
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Cc: <stable@vger.kernel.org> # 5.16.x
> > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> 
> Hi Darrent,
> What kind of resources are clusters sharing on Ampere Altra?
> So on Altra, cpus are not sharing LLC? Each LLC is separate
> for each cpu?
> 
> > ---
> >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> >  1 file changed, 32 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 27df5c1e6baa..0a78ac5c8830 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> >  	}
> >  }
> > 
> > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > +#ifdef CONFIG_SCHED_SMT
> > +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > +#endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > +#endif
> > +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > +	{ NULL, },
> > +};
> > +
> >  void __init smp_prepare_cpus(unsigned int max_cpus)
> >  {
> >  	const struct cpu_operations *ops;
> > +	bool use_no_mc_topology = true;
> >  	int err;
> >  	unsigned int cpu;
> >  	unsigned int this_cpu;
> > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > 
> >  		set_cpu_present(cpu, true);
> >  		numa_store_cpu_info(cpu);
> > +
> > +		/*
> > +		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > +		 */
> > +		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > +			use_no_mc_topology = false;
> 
> This seems to be wrong? If you have 5 cpus,
> Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> need to remove MC, but for cpu1-4, you will need
> CLS and MC both?

What is the *current* behaviour on such a system?

Will
Darren Hart Feb. 15, 2022, 4:44 p.m. UTC | #5
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
> On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > Sent: Friday, February 11, 2022 2:43 PM
> > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > <linux-arm-kernel@lists.infradead.org>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > 
> > > SoCs such as the Ampere Altra define clusters but have no shared
> > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > 
> > > BUG: arch topology borken
> > >      the CLS domain not a subset of the MC domain
> > > 
> > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > level cpu mask is then extended to that of the CLS child, and is later
> > > removed entirely as redundant.
> > > 
> > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > alternative sched_domain_topology equivalent to the default if
> > > CONFIG_SCHED_MC were disabled.
> > > 
> > > The final resulting sched domain topology is unchanged with or without
> > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > 
> > > For CPU0:
> > > 
> > > With CLS:
> > > CLS  [0-1]
> > > DIE  [0-79]
> > > NUMA [0-159]
> > > 
> > > Without CLS:
> > > DIE  [0-79]
> > > NUMA [0-159]
> > > 
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > 
> > Hi Darrent,
> > What kind of resources are clusters sharing on Ampere Altra?
> > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > for each cpu?
> > 
> > > ---
> > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > >  1 file changed, 32 insertions(+)
> > > 
> > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > --- a/arch/arm64/kernel/smp.c
> > > +++ b/arch/arm64/kernel/smp.c
> > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > >  	}
> > >  }
> > > 
> > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > +#ifdef CONFIG_SCHED_SMT
> > > +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > +#endif
> > > +
> > > +#ifdef CONFIG_SCHED_CLUSTER
> > > +	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > +#endif
> > > +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > +	{ NULL, },
> > > +};
> > > +
> > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > >  {
> > >  	const struct cpu_operations *ops;
> > > +	bool use_no_mc_topology = true;
> > >  	int err;
> > >  	unsigned int cpu;
> > >  	unsigned int this_cpu;
> > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > 
> > >  		set_cpu_present(cpu, true);
> > >  		numa_store_cpu_info(cpu);
> > > +
> > > +		/*
> > > +		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > +		 */
> > > +		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > +			use_no_mc_topology = false;
> > 
> > This seems to be wrong? If you have 5 cpus,
> > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > need to remove MC, but for cpu1-4, you will need
> > CLS and MC both?
> 
> What is the *current* behaviour on such a system?
> 

As I understand it, any system that uses the default topology and has
a cpu_coregroup_mask weight of 1 and a child (cluster, smt, ...) weight > 1
will behave as described above by printing the following for each CPU
matching this criterion:

  BUG: arch topology borken
        the [CLS,SMT,...] domain not a subset of the MC domain

And then extend the MC domain cpumask to match that of the child and continue
on.
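
The fixup lives in build_sched_domain() in kernel/sched/topology.c;
roughly (paraphrased from the v5.16 sources and trimmed to the relevant
check, so the exact wording may differ slightly):

        if (child) {
                ...
                if (!cpumask_subset(sched_domain_span(child),
                                    sched_domain_span(sd))) {
                        pr_err("BUG: arch topology borken\n");
#ifdef CONFIG_SCHED_DEBUG
                        pr_err("     the %s domain not a subset of the %s domain\n",
                               child->name, sd->name);
#endif
                        /* Fixup, ensure @sd has at least @child CPUs. */
                        cpumask_or(sched_domain_span(sd),
                                   sched_domain_span(sd),
                                   sched_domain_span(child));
                }
        }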

That would still be the behavior for this type of system after this
patch is applied.

Thanks,
Will Deacon Feb. 15, 2022, 4:46 p.m. UTC | #6
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > <linux-arm-kernel@lists.infradead.org>
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > 
> > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > 
> > > > BUG: arch topology borken
> > > >      the CLS domain not a subset of the MC domain
> > > > 
> > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > removed entirely as redundant.
> > > > 
> > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > alternative sched_domain_topology equivalent to the default if
> > > > CONFIG_SCHED_MC were disabled.
> > > > 
> > > > The final resulting sched domain topology is unchanged with or without
> > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > 
> > > > For CPU0:
> > > > 
> > > > With CLS:
> > > > CLS  [0-1]
> > > > DIE  [0-79]
> > > > NUMA [0-159]
> > > > 
> > > > Without CLS:
> > > > DIE  [0-79]
> > > > NUMA [0-159]
> > > > 
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > 
> > > Hi Darrent,
> > > What kind of resources are clusters sharing on Ampere Altra?
> > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > for each cpu?
> > > 
> > > > ---
> > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > >  1 file changed, 32 insertions(+)
> > > > 
> > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > --- a/arch/arm64/kernel/smp.c
> > > > +++ b/arch/arm64/kernel/smp.c
> > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > >  	}
> > > >  }
> > > > 
> > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > +#endif
> > > > +
> > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > +	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > +#endif
> > > > +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > +	{ NULL, },
> > > > +};
> > > > +
> > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > >  {
> > > >  	const struct cpu_operations *ops;
> > > > +	bool use_no_mc_topology = true;
> > > >  	int err;
> > > >  	unsigned int cpu;
> > > >  	unsigned int this_cpu;
> > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > 
> > > >  		set_cpu_present(cpu, true);
> > > >  		numa_store_cpu_info(cpu);
> > > > +
> > > > +		/*
> > > > +		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > +		 */
> > > > +		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > +			use_no_mc_topology = false;
> > > 
> > > This seems to be wrong? If you have 5 cpus,
> > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > need to remove MC, but for cpu1-4, you will need
> > > CLS and MC both?
> > 
> > What is the *current* behaviour on such a system?
> > 
> 
> As I understand it, any system that uses the default topology which has
> a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> will behave as described above by printing the following for each CPU
> matching this criteria:
> 
>   BUG: arch topology borken
>         the [CLS,SMT,...] domain not a subset of the MC domain
> 
> And then extend the MC domain cpumask to match that of the child and continue
> on.
> 
> That would still be the behavior for this type of system after this
> patch is applied.

That's what I thought, but in that case applying your patch is a net
improvement: systems either get current or better behaviour.

Barry -- why shouldn't we take this as-is?

Will
Vincent Guittot Feb. 15, 2022, 5:09 p.m. UTC | #7
On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
>
> On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > >
> > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > >
> > > > > BUG: arch topology borken
> > > > >      the CLS domain not a subset of the MC domain
> > > > >
> > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > removed entirely as redundant.
> > > > >
> > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > alternative sched_domain_topology equivalent to the default if
> > > > > CONFIG_SCHED_MC were disabled.
> > > > >
> > > > > The final resulting sched domain topology is unchanged with or without
> > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > >
> > > > > For CPU0:
> > > > >
> > > > > With CLS:
> > > > > CLS  [0-1]
> > > > > DIE  [0-79]
> > > > > NUMA [0-159]
> > > > >
> > > > > Without CLS:
> > > > > DIE  [0-79]
> > > > > NUMA [0-159]
> > > > >
> > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > Cc: Will Deacon <will@kernel.org>
> > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > >
> > > > Hi Darrent,
> > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > for each cpu?
> > > >
> > > > > ---
> > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 32 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > --- a/arch/arm64/kernel/smp.c
> > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > >         }
> > > > >  }
> > > > >
> > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > +#endif
> > > > > +
> > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > +#endif
> > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > +       { NULL, },
> > > > > +};
> > > > > +
> > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > >  {
> > > > >         const struct cpu_operations *ops;
> > > > > +       bool use_no_mc_topology = true;
> > > > >         int err;
> > > > >         unsigned int cpu;
> > > > >         unsigned int this_cpu;
> > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > >
> > > > >                 set_cpu_present(cpu, true);
> > > > >                 numa_store_cpu_info(cpu);
> > > > > +
> > > > > +               /*
> > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > +                */
> > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > +                       use_no_mc_topology = false;
> > > >
> > > > This seems to be wrong? If you have 5 cpus,
> > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > need to remove MC, but for cpu1-4, you will need
> > > > CLS and MC both?
> > >
> > > What is the *current* behaviour on such a system?
> > >
> >
> > As I understand it, any system that uses the default topology which has
> > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > will behave as described above by printing the following for each CPU
> > matching this criteria:
> >
> >   BUG: arch topology borken
> >         the [CLS,SMT,...] domain not a subset of the MC domain
> >
> > And then extend the MC domain cpumask to match that of the child and continue
> > on.
> >
> > That would still be the behavior for this type of system after this
> > patch is applied.
>
> That's what I thought, but in that case applying your patch is a net
> improvement: systems either get current or better behaviour.

CLUSTER level is normally defined as an intermediate group within the MC
level, and both levels carry the scheduler flag SD_SHARE_PKG_RESOURCES.

In the case of Ampere Altra, they consider that CPUA has a CLUSTER
level which shares SD_SHARE_PKG_RESOURCES with another CPUB, but the next and
larger MC level then says that CPUA doesn't share SD_SHARE_PKG_RESOURCES
with CPUB, which seems odd because the SD_SHARE_PKG_RESOURCES has
not disappeared.
Looks like there is a mismatch in the topology description.
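
For context, both the CLS and MC levels attach SD_SHARE_PKG_RESOURCES via
their flag callbacks; a sketch of the helpers in
include/linux/sched/topology.h as I remember the v5.16 sources (check the
tree for the exact definitions):

#ifdef CONFIG_SCHED_CLUSTER
static inline int cpu_cluster_flags(void)
{
        return SD_SHARE_PKG_RESOURCES;
}
#endif

#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
        return SD_SHARE_PKG_RESOURCES;
}
#endif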


>
> Barry -- why shouldn't we take this as-is?
>
> Will
Darren Hart Feb. 15, 2022, 5:32 p.m. UTC | #8
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > >
> > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > >
> > > > > > BUG: arch topology borken
> > > > > >      the CLS domain not a subset of the MC domain
> > > > > >
> > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > removed entirely as redundant.
> > > > > >
> > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > CONFIG_SCHED_MC were disabled.
> > > > > >
> > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > >
> > > > > > For CPU0:
> > > > > >
> > > > > > With CLS:
> > > > > > CLS  [0-1]
> > > > > > DIE  [0-79]
> > > > > > NUMA [0-159]
> > > > > >
> > > > > > Without CLS:
> > > > > > DIE  [0-79]
> > > > > > NUMA [0-159]
> > > > > >
> > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > >
> > > > > Hi Darrent,
> > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > for each cpu?
> > > > >
> > > > > > ---
> > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 32 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > >         }
> > > > > >  }
> > > > > >
> > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > +#endif
> > > > > > +
> > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > +#endif
> > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > +       { NULL, },
> > > > > > +};
> > > > > > +
> > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > >  {
> > > > > >         const struct cpu_operations *ops;
> > > > > > +       bool use_no_mc_topology = true;
> > > > > >         int err;
> > > > > >         unsigned int cpu;
> > > > > >         unsigned int this_cpu;
> > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > >
> > > > > >                 set_cpu_present(cpu, true);
> > > > > >                 numa_store_cpu_info(cpu);
> > > > > > +
> > > > > > +               /*
> > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > +                */
> > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > +                       use_no_mc_topology = false;
> > > > >
> > > > > This seems to be wrong? If you have 5 cpus,
> > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > need to remove MC, but for cpu1-4, you will need
> > > > > CLS and MC both?
> > > >
> > > > What is the *current* behaviour on such a system?
> > > >
> > >
> > > As I understand it, any system that uses the default topology which has
> > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > will behave as described above by printing the following for each CPU
> > > matching this criteria:
> > >
> > >   BUG: arch topology borken
> > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > >
> > > And then extend the MC domain cpumask to match that of the child and continue
> > > on.
> > >
> > > That would still be the behavior for this type of system after this
> > > patch is applied.
> >
> > That's what I thought, but in that case applying your patch is a net
> > improvement: systems either get current or better behaviour.
> 
> CLUSTER level is normally defined as a intermediate group of the MC
> level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> flag
> 
> In the case of Ampere altra, they consider that CPUA have a CLUSTER
> level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> not disappeared Looks like there is a mismatch in topology description

Hi Vincent,

Agree. Where do you think this mismatch exists?

I'd describe this as a mismatch between the default assumptions of
the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies
a shared cpu-side cache) and SoCs without a shared cpu-side cache. This
is encoded in properties of the MC level and the requirement that child
domains be a subset of the parent domain cpumask.

The MC-less topology addresses this in a consistent way with other
architectures using the provided API for non-default topologies without
changing these fundamental assumptions and without changing the final
resulting topology which is correct and matches the topology supplied in
the ACPI PPTT.
Vincent Guittot Feb. 15, 2022, 6:19 p.m. UTC | #9
On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
>
> On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > >
> > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > >
> > > > > > > BUG: arch topology borken
> > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > >
> > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > removed entirely as redundant.
> > > > > > >
> > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > >
> > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > >
> > > > > > > For CPU0:
> > > > > > >
> > > > > > > With CLS:
> > > > > > > CLS  [0-1]
> > > > > > > DIE  [0-79]
> > > > > > > NUMA [0-159]
> > > > > > >
> > > > > > > Without CLS:
> > > > > > > DIE  [0-79]
> > > > > > > NUMA [0-159]
> > > > > > >
> > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > >
> > > > > > Hi Darrent,
> > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > for each cpu?
> > > > > >
> > > > > > > ---
> > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > >  1 file changed, 32 insertions(+)
> > > > > > >
> > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > >         }
> > > > > > >  }
> > > > > > >
> > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > +#endif
> > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > +       { NULL, },
> > > > > > > +};
> > > > > > > +
> > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > >  {
> > > > > > >         const struct cpu_operations *ops;
> > > > > > > +       bool use_no_mc_topology = true;
> > > > > > >         int err;
> > > > > > >         unsigned int cpu;
> > > > > > >         unsigned int this_cpu;
> > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > >
> > > > > > >                 set_cpu_present(cpu, true);
> > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > +
> > > > > > > +               /*
> > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > +                */
> > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > +                       use_no_mc_topology = false;
> > > > > >
> > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > CLS and MC both?
> > > > >
> > > > > What is the *current* behaviour on such a system?
> > > > >
> > > >
> > > > As I understand it, any system that uses the default topology which has
> > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > will behave as described above by printing the following for each CPU
> > > > matching this criteria:
> > > >
> > > >   BUG: arch topology borken
> > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > >
> > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > on.
> > > >
> > > > That would still be the behavior for this type of system after this
> > > > patch is applied.
> > >
> > > That's what I thought, but in that case applying your patch is a net
> > > improvement: systems either get current or better behaviour.
> >
> > CLUSTER level is normally defined as a intermediate group of the MC
> > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > flag
> >
> > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > not disappeared Looks like there is a mismatch in topology description
>
> Hi Vincent,
>
> Agree. Where do you think this mismatch exists?

I think that the problem comes from the fact that the default topology order is
assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA

but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA

IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in
the PPTT table, whereas the MC level is defined by the number of cache
levels. So I would say that you should compare the levels to know the
ordering.

Then, there is another point:
In your case, the CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES,
which is used to set some scheduler internal variables like
sd_llc (sched domain last level of cache), which allows fast task
migration between the cpus at this level at wakeup. In your case the
sd_llc should not be the cluster but the MC with only one CPU. But I
would not be surprised if most of the perf improvement comes from this
sd_llc being wrongly set to the cluster instead of the single CPU.
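
For reference, sd_llc is derived from the highest domain that carries
SD_SHARE_PKG_RESOURCES; a rough, trimmed sketch of update_top_cache_domain()
in kernel/sched/topology.c, from memory of the v5.16 sources:

static void update_top_cache_domain(int cpu)
{
        struct sched_domain_shared *sds = NULL;
        struct sched_domain *sd;
        int id = cpu;
        int size = 1;

        /* The highest domain with SD_SHARE_PKG_RESOURCES becomes the "LLC" domain. */
        sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
        if (sd) {
                id = cpumask_first(sched_domain_span(sd));
                size = cpumask_weight(sched_domain_span(sd));
                sds = sd->shared;
        }

        rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
        per_cpu(sd_llc_size, cpu) = size;
        per_cpu(sd_llc_id, cpu) = id;
        rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

        /* ... sd_numa and the sd_asym_* pointers are updated similarly ... */
}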


>
> I'd describe this as a mismatch between the default assumptions of
> the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies
> a shared cpu-side cache) and SoCs without a shared cpu-side cache. This
> is encoded in properties of the MC level and the requirement that child
> domains be a subset of the parent domain cpumask.
>
> The MC-less topology addresses this in a consistent way with other
> architectures using the provided API for non-default topologies without
> changing these fundamental assumptions and without changing the final
> resulting topology which is correct and matches the topology supplied in
> the ACPI PPTT.


>
> --
> Darren Hart
> Ampere Computing / OS and Kernel
Darren Hart Feb. 15, 2022, 8:05 p.m. UTC | #10
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
> On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > >
> > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > >
> > > > > > > > BUG: arch topology borken
> > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > >
> > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > removed entirely as redundant.
> > > > > > > >
> > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > >
> > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > >
> > > > > > > > For CPU0:
> > > > > > > >
> > > > > > > > With CLS:
> > > > > > > > CLS  [0-1]
> > > > > > > > DIE  [0-79]
> > > > > > > > NUMA [0-159]
> > > > > > > >
> > > > > > > > Without CLS:
> > > > > > > > DIE  [0-79]
> > > > > > > > NUMA [0-159]
> > > > > > > >
> > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > >
> > > > > > > Hi Darrent,
> > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > for each cpu?
> > > > > > >
> > > > > > > > ---
> > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > >         }
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > +#endif
> > > > > > > > +
> > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > +#endif
> > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > +       { NULL, },
> > > > > > > > +};
> > > > > > > > +
> > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > >  {
> > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > >         int err;
> > > > > > > >         unsigned int cpu;
> > > > > > > >         unsigned int this_cpu;
> > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > >
> > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > +
> > > > > > > > +               /*
> > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > +                */
> > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > +                       use_no_mc_topology = false;
> > > > > > >
> > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > CLS and MC both?
> > > > > >
> > > > > > What is the *current* behaviour on such a system?
> > > > > >
> > > > >
> > > > > As I understand it, any system that uses the default topology which has
> > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > will behave as described above by printing the following for each CPU
> > > > > matching this criteria:
> > > > >
> > > > >   BUG: arch topology borken
> > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > >
> > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > on.
> > > > >
> > > > > That would still be the behavior for this type of system after this
> > > > > patch is applied.
> > > >
> > > > That's what I thought, but in that case applying your patch is a net
> > > > improvement: systems either get current or better behaviour.
> > >
> > > CLUSTER level is normally defined as a intermediate group of the MC
> > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > flag
> > >
> > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > not disappeared Looks like there is a mismatch in topology description
> >
> > Hi Vincent,
> >
> > Agree. Where do you think this mismatch exists?
> 
> I think that the problem comes from that the default topology order is
> assumed to be :
> SMT
> CLUSTER shares pkg resources i.e. cache
> MC
> DIE
> NUMA
> 
> but in your case, you want a topology order like :
> SMT
> MC
> CLUSTER shares SCU
> DIE
> NUMA

Given the fairly loose definition of some of these domains and the
freedom to adjust flags with custom topologies, I think it's difficult
to say it needs to be this or that. As you point out, this stems from an
assumption in the default topology, so eliding the MC level within the
current set of abstractions for a very targeted topology still seems
reasonable to address the BUG in the very near term in a contained way.

> 
> IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in
> the PPTT table whereas the MC level is defined as the number of cache
> levels. So i would say that you should compare the level to know the
> ordering
> 
> Then, there is another point:
> In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES
> which is used to define some scheduler internal variable like
> sd_llc(sched domain last level of cache) which allows fast task
> migration between this cpus in this level at wakeup. In your case the
> sd_llc should not be the cluster but the MC with only one CPU. But I
> would not be surprised that most of perf improvement comes from this
> sd_llc wrongly set to cluster instead of the single CPU

Agree that we currently have an imperfect representation of SoCs without
a shared CPU-side cache. Within this imperfect abstraction, it seems
valid to consider the SCU as a valid shared resource to be described by
SD_SHARE_PKG_RESOURCES, and as you say, that does result in an overall
performance improvement.

Also agree that it is worth working toward a better abstraction. On the
way there, I think it makes sense to avoid BUG'ing on the current
topology, which is a reasonable one given the existing abstraction.

Thanks,

> 
> 
> >
> > I'd describe this as a mismatch between the default assumptions of
> > the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies
> > a shared cpu-side cache) and SoCs without a shared cpu-side cache. This
> > is encoded in properties of the MC level and the requirement that child
> > domains be a subset of the parent domain cpumask.
> >
> > The MC-less topology addresses this in a consistent way with other
> > architectures using the provided API for non-default topologies without
> > changing these fundamental assumptions and without changing the final
> > resulting topology which is correct and matches the topology supplied in
> > the ACPI PPTT.
> 
> 
> >
> > --
> > Darren Hart
> > Ampere Computing / OS and Kernel
Barry Song Feb. 16, 2022, 7:03 a.m. UTC | #11
On Wed, Feb 16, 2022 at 7:30 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > >
> > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > >
> > > > > > > > BUG: arch topology borken
> > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > >
> > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > removed entirely as redundant.
> > > > > > > >
> > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > >
> > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > >
> > > > > > > > For CPU0:
> > > > > > > >
> > > > > > > > With CLS:
> > > > > > > > CLS  [0-1]
> > > > > > > > DIE  [0-79]
> > > > > > > > NUMA [0-159]
> > > > > > > >
> > > > > > > > Without CLS:
> > > > > > > > DIE  [0-79]
> > > > > > > > NUMA [0-159]
> > > > > > > >
> > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > >
> > > > > > > Hi Darrent,
> > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > for each cpu?
> > > > > > >
> > > > > > > > ---
> > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > >         }
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > +#endif
> > > > > > > > +
> > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > +#endif
> > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > +       { NULL, },
> > > > > > > > +};
> > > > > > > > +
> > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > >  {
> > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > >         int err;
> > > > > > > >         unsigned int cpu;
> > > > > > > >         unsigned int this_cpu;
> > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > >
> > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > +
> > > > > > > > +               /*
> > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > +                */
> > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > +                       use_no_mc_topology = false;
> > > > > > >
> > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > CLS and MC both?
> > > > > >
> > > > > > What is the *current* behaviour on such a system?
> > > > > >
> > > > >
> > > > > As I understand it, any system that uses the default topology which has
> > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > will behave as described above by printing the following for each CPU
> > > > > matching this criteria:
> > > > >
> > > > >   BUG: arch topology borken
> > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > >
> > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > on.
> > > > >
> > > > > That would still be the behavior for this type of system after this
> > > > > patch is applied.
> > > >
> > > > That's what I thought, but in that case applying your patch is a net
> > > > improvement: systems either get current or better behaviour.
> > >
> > > CLUSTER level is normally defined as a intermediate group of the MC
> > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > flag
> > >
> > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > not disappeared Looks like there is a mismatch in topology description
> >
> > Hi Vincent,
> >
> > Agree. Where do you think this mismatch exists?
>
> I think that the problem comes from that the default topology order is
> assumed to be :
> SMT
> CLUSTER shares pkg resources i.e. cache
> MC
> DIE
> NUMA
>
> but in your case, you want a topology order like :
> SMT
> MC
> CLUSTER shares SCU
> DIE
> NUMA
>
> IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in
> the PPTT table whereas the MC level is defined as the number of cache
> levels. So i would say that you should compare the level to know the
> ordering
>
> Then, there is another point:
> In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES
> which is used to define some scheduler internal variable like
> sd_llc(sched domain last level of cache) which allows fast task
> migration between this cpus in this level at wakeup. In your case the
> sd_llc should not be the cluster but the MC with only one CPU. But I
> would not be surprised that most of perf improvement comes from this
> sd_llc wrongly set to cluster instead of the single CPU

I assume this "mistake" is actually what the Ampere Altra needs: it is
technically wrong but gets the right result? Ampere Altra has already
got both:
1. Load balancing between clusters
2. wake_affine selecting a sibling CPU which shares the SCU

I am not sure how much 1 and 2 each help Darren's workloads.

>
>
> >
> > I'd describe this as a mismatch between the default assumptions of
> > the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies
> > a shared cpu-side cache) and SoCs without a shared cpu-side cache. This
> > is encoded in properties of the MC level and the requirement that child
> > domains be a subset of the parent domain cpumask.
> >
> > The MC-less topology addresses this in a consistent way with other
> > architectures using the provided API for non-default topologies without
> > changing these fundamental assumptions and without changing the final
> > resulting topology which is correct and matches the topology supplied in
> > the ACPI PPTT.
>
>
> >
> > --
> > Darren Hart
> > Ampere Computing / OS and Kernel

Thanks
Barry
Vincent Guittot Feb. 16, 2022, 8:30 a.m. UTC | #12
On Tue, 15 Feb 2022 at 21:05, Darren Hart <darren@os.amperecomputing.com> wrote:
>
> On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
> > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > >
> > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > >
> > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > >
> > > > > > > > > BUG: arch topology borken
> > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > >
> > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > removed entirely as redundant.
> > > > > > > > >
> > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > >
> > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > >
> > > > > > > > > For CPU0:
> > > > > > > > >
> > > > > > > > > With CLS:
> > > > > > > > > CLS  [0-1]
> > > > > > > > > DIE  [0-79]
> > > > > > > > > NUMA [0-159]
> > > > > > > > >
> > > > > > > > > Without CLS:
> > > > > > > > > DIE  [0-79]
> > > > > > > > > NUMA [0-159]
> > > > > > > > >
> > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > >
> > > > > > > > Hi Darrent,
> > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > for each cpu?
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > >         }
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > +#endif
> > > > > > > > > +
> > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > +#endif
> > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > +       { NULL, },
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > >  {
> > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > >         int err;
> > > > > > > > >         unsigned int cpu;
> > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > >
> > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > +
> > > > > > > > > +               /*
> > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > +                */
> > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > >
> > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > CLS and MC both?
> > > > > > >
> > > > > > > What is the *current* behaviour on such a system?
> > > > > > >
> > > > > >
> > > > > > As I understand it, any system that uses the default topology which has
> > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > will behave as described above by printing the following for each CPU
> > > > > > matching this criteria:
> > > > > >
> > > > > >   BUG: arch topology borken
> > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > >
> > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > on.
> > > > > >
> > > > > > That would still be the behavior for this type of system after this
> > > > > > patch is applied.
> > > > >
> > > > > That's what I thought, but in that case applying your patch is a net
> > > > > improvement: systems either get current or better behaviour.
> > > >
> > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > flag
> > > >
> > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > not disappeared Looks like there is a mismatch in topology description
> > >
> > > Hi Vincent,
> > >
> > > Agree. Where do you think this mismatch exists?
> >
> > I think that the problem comes from that the default topology order is
> > assumed to be :
> > SMT
> > CLUSTER shares pkg resources i.e. cache
> > MC
> > DIE
> > NUMA
> >
> > but in your case, you want a topology order like :
> > SMT
> > MC
> > CLUSTER shares SCU
> > DIE
> > NUMA
>
> Given the fairly loose definition of some of these domains and the
> freedom to adjust flags with custom topologies, I think it's difficult
> to say it needs to be this or that. As you point out, this stems from an
> assumption in the default topology, so eliding the MC level within the
> current set of abstractions for a very targeted topology still seems
> reasonable to address the BUG in the very near term in a contained way.

But if another SoC comes with a valid MC and then a CLUSTER, this
proposal will not work.

Keep in mind that the MC level will be removed/degenerated when
building the domains because it is useless in your case, so the
scheduler topology will still be the same at the end, but it will
support more cases. That is why I think you should keep the MC level.
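
To be explicit about the degeneration I mean: roughly, a level whose
span has a single CPU (or which adds nothing over its child) is dropped
in cpu_attach_domain()/sd_degenerate(). A paraphrased sketch, not the
exact kernel code, and the helper name below is made up:

	/* Paraphrased: a single-CPU span has nothing to balance at this level */
	static bool sd_is_useless(struct sched_domain *sd)
	{
		return cpumask_weight(sched_domain_span(sd)) == 1;
	}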

>
> >
> > IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in
> > the PPTT table whereas the MC level is defined as the number of cache
> > levels. So i would say that you should compare the level to know the
> > ordering
> >
> > Then, there is another point:
> > In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES
> > which is used to define some scheduler internal variable like
> > sd_llc(sched domain last level of cache) which allows fast task
> > migration between this cpus in this level at wakeup. In your case the
> > sd_llc should not be the cluster but the MC with only one CPU. But I
> > would not be surprised that most of perf improvement comes from this
> > sd_llc wrongly set to cluster instead of the single CPU
>
> Agree that we currently have an imperfect representation of SoCs without
> a shared CPU-side cache. Within this imperfect abstraction, it seems
> valid to consider the SCU as a valid shared resource to be described by
> SD_SHARE_PKG_RESOURCES, and as you say, that does result in an overall
> performance improvement.

My concern is that there are ongoing discussions about making more use
of the CLUSTER level than is currently done, and they assume that we
have a valid LLC level after the CLUSTER one, which is not your case.
I'm afraid it will be suboptimal for you because CLUSTER and LLC are
wrongly the same in your case, and then you will come back to add more
exceptions in the generic scheduler code to cover this first exception.

>
> Also agree that it is worth working toward a better abstraction. On the
> way to that, I think it makes sense to avoid BUG'ing on the current
> topology which is a reasonable one given the current abstraction.
>
> Thanks,
>
> >
> >
> > >
> > > I'd describe this as a mismatch between the default assumptions of
> > > the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies
> > > a shared cpu-side cache) and SoCs without a shared cpu-side cache. This
> > > is encoded in properties of the MC level and the requirement that child
> > > domains be a subset of the parent domain cpumask.
> > >
> > > The MC-less topology addresses this in a consistent way with other
> > > architectures using the provided API for non-default topologies without
> > > changing these fundamental assumptions and without changing the final
> > > resulting topology which is correct and matches the topology supplied in
> > > the ACPI PPTT.
> >
> >
> > >
> > > --
> > > Darren Hart
> > > Ampere Computing / OS and Kernel
>
> --
> Darren Hart
> Ampere Computing / OS and Kernel
Darren Hart Feb. 16, 2022, 3:43 p.m. UTC | #13
On Wed, Feb 16, 2022 at 08:03:40PM +1300, Barry Song wrote:
> On Wed, Feb 16, 2022 at 7:30 PM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > >
> > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > >
> > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > >
> > > > > > > > > BUG: arch topology borken
> > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > >
> > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > removed entirely as redundant.
> > > > > > > > >
> > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > >
> > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > >
> > > > > > > > > For CPU0:
> > > > > > > > >
> > > > > > > > > With CLS:
> > > > > > > > > CLS  [0-1]
> > > > > > > > > DIE  [0-79]
> > > > > > > > > NUMA [0-159]
> > > > > > > > >
> > > > > > > > > Without CLS:
> > > > > > > > > DIE  [0-79]
> > > > > > > > > NUMA [0-159]
> > > > > > > > >
> > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > >
> > > > > > > > Hi Darrent,
> > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > for each cpu?
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > >         }
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > +#endif
> > > > > > > > > +
> > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > +#endif
> > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > +       { NULL, },
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > >  {
> > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > >         int err;
> > > > > > > > >         unsigned int cpu;
> > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > >
> > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > +
> > > > > > > > > +               /*
> > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > +                */
> > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > >
> > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > CLS and MC both?
> > > > > > >
> > > > > > > What is the *current* behaviour on such a system?
> > > > > > >
> > > > > >
> > > > > > As I understand it, any system that uses the default topology which has
> > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > will behave as described above by printing the following for each CPU
> > > > > > matching this criteria:
> > > > > >
> > > > > >   BUG: arch topology borken
> > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > >
> > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > on.
> > > > > >
> > > > > > That would still be the behavior for this type of system after this
> > > > > > patch is applied.
> > > > >
> > > > > That's what I thought, but in that case applying your patch is a net
> > > > > improvement: systems either get current or better behaviour.
> > > >
> > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > flag
> > > >
> > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > not disappeared Looks like there is a mismatch in topology description
> > >
> > > Hi Vincent,
> > >
> > > Agree. Where do you think this mismatch exists?
> >
> > I think that the problem comes from that the default topology order is
> > assumed to be :
> > SMT
> > CLUSTER shares pkg resources i.e. cache
> > MC
> > DIE
> > NUMA
> >
> > but in your case, you want a topology order like :
> > SMT
> > MC
> > CLUSTER shares SCU
> > DIE
> > NUMA
> >
> > IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in
> > the PPTT table whereas the MC level is defined as the number of cache
> > levels. So i would say that you should compare the level to know the
> > ordering
> >
> > Then, there is another point:
> > In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES
> > which is used to define some scheduler internal variable like
> > sd_llc(sched domain last level of cache) which allows fast task
> > migration between this cpus in this level at wakeup. In your case the
> > sd_llc should not be the cluster but the MC with only one CPU. But I
> > would not be surprised that most of perf improvement comes from this
> > sd_llc wrongly set to cluster instead of the single CPU
> 
> I assume this "mistake" is actually what Ampere altra needs while it
> is wrong but getting
> right result? Ampere altra has already got both:

Hi Barry,

Generally yes - although I do think we're placing too much emphasis on
the "right" or "wrong" of a heuristic, which is more fluid in
definition over time. (e.g. I expect this will look different in a year
based on what we learn from this and other topologies beyond the
current default).

> 1. Load Balance between clusters
> 2. wake_affine by select sibling cpu which is sharing SCU
> 
> I am not sure how much 1 and 2 are helping Darren's workloads respectively.

We definitely see improvements with load balancing between clusters.
We're running some tests with the wake_affine patchset you pointed me to
(thanks for that). My initial tbench runs resulted in higher reported
average and max latencies. I need to collect more results and see the
impact on other benchmarks of interest before I have more to share on
that.

Thanks,
Darren Hart Feb. 16, 2022, 4:24 p.m. UTC | #14
On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
> On Tue, 15 Feb 2022 at 21:05, Darren Hart <darren@os.amperecomputing.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
> > > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > > >
> > > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > > >
> > > > > > > > > > BUG: arch topology borken
> > > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > > >
> > > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > > removed entirely as redundant.
> > > > > > > > > >
> > > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > > >
> > > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > > >
> > > > > > > > > > For CPU0:
> > > > > > > > > >
> > > > > > > > > > With CLS:
> > > > > > > > > > CLS  [0-1]
> > > > > > > > > > DIE  [0-79]
> > > > > > > > > > NUMA [0-159]
> > > > > > > > > >
> > > > > > > > > > Without CLS:
> > > > > > > > > > DIE  [0-79]
> > > > > > > > > > NUMA [0-159]
> > > > > > > > > >
> > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > > >
> > > > > > > > > Hi Darrent,
> > > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > > for each cpu?
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > > >         }
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > > +#endif
> > > > > > > > > > +
> > > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > > +#endif
> > > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > > +       { NULL, },
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > >  {
> > > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > > >         int err;
> > > > > > > > > >         unsigned int cpu;
> > > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > >
> > > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > > +
> > > > > > > > > > +               /*
> > > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > > +                */
> > > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > > >
> > > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > > CLS and MC both?
> > > > > > > >
> > > > > > > > What is the *current* behaviour on such a system?
> > > > > > > >
> > > > > > >
> > > > > > > As I understand it, any system that uses the default topology which has
> > > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > > will behave as described above by printing the following for each CPU
> > > > > > > matching this criteria:
> > > > > > >
> > > > > > >   BUG: arch topology borken
> > > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > > >
> > > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > > on.
> > > > > > >
> > > > > > > That would still be the behavior for this type of system after this
> > > > > > > patch is applied.
> > > > > >
> > > > > > That's what I thought, but in that case applying your patch is a net
> > > > > > improvement: systems either get current or better behaviour.
> > > > >
> > > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > > flag
> > > > >
> > > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > > not disappeared Looks like there is a mismatch in topology description
> > > >
> > > > Hi Vincent,
> > > >
> > > > Agree. Where do you think this mismatch exists?
> > >
> > > I think that the problem comes from that the default topology order is
> > > assumed to be :
> > > SMT
> > > CLUSTER shares pkg resources i.e. cache
> > > MC
> > > DIE
> > > NUMA
> > >
> > > but in your case, you want a topology order like :
> > > SMT
> > > MC
> > > CLUSTER shares SCU
> > > DIE
> > > NUMA
> >
> > Given the fairly loose definition of some of these domains and the
> > freedom to adjust flags with custom topologies, I think it's difficult
> > to say it needs to be this or that. As you point out, this stems from an
> > assumption in the default topology, so eliding the MC level within the
> > current set of abstractions for a very targeted topology still seems
> > reasonable to address the BUG in the very near term in a contained way.
> 
> But if another SoC comes with a valid MC then a CLUSTER, this proposal
> will not work.
> 
> Keep in mind that the MC level will be removed/degenerate when
> building because it is useless in your case so the scheduler topology
> will still be the same at the end but it will support more case. That
> why I think you should keep MC level

Hi Vincent,

Thanks for reiterating; I don't think I quite understood what you were
suggesting before. Is the following in line with what you were thinking?

I am testing a version of this patch which uses a topology like this instead:

MC
CLS
DIE
NUMA

(I tested without an SMT domain since the trigger is still MC weight==1, so
there is no valid topology that includes an SMT level under these conditions).

Which results in no BUG output and a final topology on Altra of:

CLS
DIE
NUMA

Which so far seems right (I still need to do some testing and review the sched
debug data).
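
For clarity, the table I'm testing looks roughly like the sketch below
(illustrative only; the name is made up and the actual patch may
differ). The MC entry stays and simply degenerates on Altra since it
spans a single CPU:

static struct sched_domain_topology_level arm64_topology[] = {
#ifdef CONFIG_SCHED_MC
	/* On Altra this spans one CPU and degenerates at build time */
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};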

If we take this approach, I think that to address your concern about other
systems with valid MCs, we would need a different trigger than MC weight == 1
to use this alternate topology. Do you have a suggestion on what to trigger
this on?
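
Purely as a strawman: one possible alternative trigger would be to
compare the masks directly rather than the MC weight, e.g. switch to
the reordered table only when some CPU's cluster mask is not contained
in its coregroup mask. The use_alt_topology name below is made up:

	bool use_alt_topology = false;
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		if (!cpumask_subset(cpu_clustergroup_mask(cpu),
				    cpu_coregroup_mask(cpu))) {
			use_alt_topology = true;
			break;
		}
	}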

Thanks,
Vincent Guittot Feb. 16, 2022, 5:52 p.m. UTC | #15
On Wed, 16 Feb 2022 at 17:24, Darren Hart <darren@os.amperecomputing.com> wrote:
>
> On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
> > On Tue, 15 Feb 2022 at 21:05, Darren Hart <darren@os.amperecomputing.com> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
> > > > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > > >
> > > > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > > > >
> > > > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > > > >
> > > > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > > > >
> > > > > > > > > > > BUG: arch topology borken
> > > > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > > > >
> > > > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > > > removed entirely as redundant.
> > > > > > > > > > >
> > > > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > > > >
> > > > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > > > >
> > > > > > > > > > > For CPU0:
> > > > > > > > > > >
> > > > > > > > > > > With CLS:
> > > > > > > > > > > CLS  [0-1]
> > > > > > > > > > > DIE  [0-79]
> > > > > > > > > > > NUMA [0-159]
> > > > > > > > > > >
> > > > > > > > > > > Without CLS:
> > > > > > > > > > > DIE  [0-79]
> > > > > > > > > > > NUMA [0-159]
> > > > > > > > > > >
> > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > > > >
> > > > > > > > > > Hi Darrent,
> > > > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > > > for each cpu?
> > > > > > > > > >
> > > > > > > > > > > ---
> > > > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > > > >         }
> > > > > > > > > > >  }
> > > > > > > > > > >
> > > > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > > > +#endif
> > > > > > > > > > > +
> > > > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > > > +#endif
> > > > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > > > +       { NULL, },
> > > > > > > > > > > +};
> > > > > > > > > > > +
> > > > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > > >  {
> > > > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > > > >         int err;
> > > > > > > > > > >         unsigned int cpu;
> > > > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > > >
> > > > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > > > +
> > > > > > > > > > > +               /*
> > > > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > > > +                */
> > > > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > > > >
> > > > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > > > CLS and MC both?
> > > > > > > > >
> > > > > > > > > What is the *current* behaviour on such a system?
> > > > > > > > >
> > > > > > > >
> > > > > > > > As I understand it, any system that uses the default topology which has
> > > > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > > > will behave as described above by printing the following for each CPU
> > > > > > > > matching this criteria:
> > > > > > > >
> > > > > > > >   BUG: arch topology borken
> > > > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > > > >
> > > > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > > > on.
> > > > > > > >
> > > > > > > > That would still be the behavior for this type of system after this
> > > > > > > > patch is applied.
> > > > > > >
> > > > > > > That's what I thought, but in that case applying your patch is a net
> > > > > > > improvement: systems either get current or better behaviour.
> > > > > >
> > > > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > > > flag
> > > > > >
> > > > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > > > not disappeared Looks like there is a mismatch in topology description
> > > > >
> > > > > Hi Vincent,
> > > > >
> > > > > Agree. Where do you think this mismatch exists?
> > > >
> > > > I think that the problem comes from that the default topology order is
> > > > assumed to be :
> > > > SMT
> > > > CLUSTER shares pkg resources i.e. cache
> > > > MC
> > > > DIE
> > > > NUMA
> > > >
> > > > but in your case, you want a topology order like :
> > > > SMT
> > > > MC
> > > > CLUSTER shares SCU
> > > > DIE
> > > > NUMA
> > >
> > > Given the fairly loose definition of some of these domains and the
> > > freedom to adjust flags with custom topologies, I think it's difficult
> > > to say it needs to be this or that. As you point out, this stems from an
> > > assumption in the default topology, so eliding the MC level within the
> > > current set of abstractions for a very targeted topology still seems
> > > reasonable to address the BUG in the very near term in a contained way.
> >
> > But if another SoC comes with a valid MC then a CLUSTER, this proposal
> > will not work.
> >
> > Keep in mind that the MC level will be removed/degenerate when
> > building because it is useless in your case so the scheduler topology
> > will still be the same at the end but it will support more case. That
> > why I think you should keep MC level
>
> Hi Vincent,
>
> Thanks for reiterating, I don't think I quite understood what you were
> suggesting before. Is the following in line with what you were thinking?
>
> I am testing a version of this patch which uses a topology like this instead:
>
> MC
> CLS
> DIE
> NUMA
>
> (I tested without an SMT domain since the trigger is still MC weight==1, so
> there is no valid topology that includes an SMT level under these conditions).
>
> Which results in no BUG output and a final topology on Altra of:
>
> CLS
> DIE
> NUMA
>
> Which so far seems right (I still need to do some testing and review the sched
> debug data).
>
> If we take this approach, I think to address your concern about other systems
> with valid MCs, we would need a different trigger that MC weight == 1 to use
> this alternate topology. Do you have a suggestion on what to trigger this on?

AFAICT, this CLUSTER level is only supported by ACPI. In
parse_acpi_topology() you should be able to know whether the cluster
level is above or below the level returned by acpi_find_last_cache_level()
and set the correct topology table accordingly.

>
> Thanks,
>
> --
> Darren Hart
> Ampere Computing / OS and Kernel
Barry Song Feb. 16, 2022, 6:56 p.m. UTC | #16
On Thu, Feb 17, 2022 at 4:44 AM Darren Hart
<darren@os.amperecomputing.com> wrote:
>
> On Wed, Feb 16, 2022 at 08:03:40PM +1300, Barry Song wrote:
> > On Wed, Feb 16, 2022 at 7:30 PM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > > >
> > > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > > >
> > > > > > > > > > BUG: arch topology borken
> > > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > > >
> > > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > > removed entirely as redundant.
> > > > > > > > > >
> > > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > > >
> > > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > > >
> > > > > > > > > > For CPU0:
> > > > > > > > > >
> > > > > > > > > > With CLS:
> > > > > > > > > > CLS  [0-1]
> > > > > > > > > > DIE  [0-79]
> > > > > > > > > > NUMA [0-159]
> > > > > > > > > >
> > > > > > > > > > Without CLS:
> > > > > > > > > > DIE  [0-79]
> > > > > > > > > > NUMA [0-159]
> > > > > > > > > >
> > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > > >
> > > > > > > > > Hi Darrent,
> > > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > > for each cpu?
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > > >         }
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > > +#endif
> > > > > > > > > > +
> > > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > > +#endif
> > > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > > +       { NULL, },
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > >  {
> > > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > > >         int err;
> > > > > > > > > >         unsigned int cpu;
> > > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > >
> > > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > > +
> > > > > > > > > > +               /*
> > > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > > +                */
> > > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > > >
> > > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > > CLS and MC both?
> > > > > > > >
> > > > > > > > What is the *current* behaviour on such a system?
> > > > > > > >
> > > > > > >
> > > > > > > As I understand it, any system that uses the default topology which has
> > > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > > will behave as described above by printing the following for each CPU
> > > > > > > matching this criteria:
> > > > > > >
> > > > > > >   BUG: arch topology borken
> > > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > > >
> > > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > > on.
> > > > > > >
> > > > > > > That would still be the behavior for this type of system after this
> > > > > > > patch is applied.
> > > > > >
> > > > > > That's what I thought, but in that case applying your patch is a net
> > > > > > improvement: systems either get current or better behaviour.
> > > > >
> > > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > > flag
> > > > >
> > > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > > not disappeared Looks like there is a mismatch in topology description
> > > >
> > > > Hi Vincent,
> > > >
> > > > Agree. Where do you think this mismatch exists?
> > >
> > > I think that the problem comes from that the default topology order is
> > > assumed to be :
> > > SMT
> > > CLUSTER shares pkg resources i.e. cache
> > > MC
> > > DIE
> > > NUMA
> > >
> > > but in your case, you want a topology order like :
> > > SMT
> > > MC
> > > CLUSTER shares SCU
> > > DIE
> > > NUMA
> > >
> > > IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in
> > > the PPTT table whereas the MC level is defined as the number of cache
> > > levels. So i would say that you should compare the level to know the
> > > ordering
> > >
> > > Then, there is another point:
> > > In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES
> > > which is used to define some scheduler internal variable like
> > > sd_llc(sched domain last level of cache) which allows fast task
> > > migration between this cpus in this level at wakeup. In your case the
> > > sd_llc should not be the cluster but the MC with only one CPU. But I
> > > would not be surprised that most of perf improvement comes from this
> > > sd_llc wrongly set to cluster instead of the single CPU
> >
> > I assume this "mistake" is actually what Ampere altra needs while it
> > is wrong but getting
> > right result? Ampere altra has already got both:
>
> Hi Barry,
>
> Generally yes - although I do think we're placing too much emphasis on
> the "right" or "wrong" of a heuristic which are more fluid in
> definition over time. (e.g. I expect this will look different in a year
> based on what we learn from this and other non current default topologies).
>
> > 1. Load Balance between clusters
> > 2. wake_affine by select sibling cpu which is sharing SCU
> >
> > I am not sure how much 1 and 2 are helping Darren's workloads respectively.
>
> We definitely see improvements with load balancing between clusters.
> We're running some tests with the wake_affine patchset you pointed me to
> (thanks for that). My initial tbench runs resulted in higher average and
> max latencies reported. I need to collect more results and see the
> impact to other benchmarks of interest before I have more to share on
> that.

Hi Darren,
If you read Vincent's comments carefully, you will find it is pointless
for you to test the wake_affine patchset, as you have already got its
effect. In your case, sd_llc_id is set to the sd_cluster level due to
PKG_RESOURCES sharing, so my new wake_affine patchset is completely
redundant for your machine: it works on the assumption cluster -> llc,
but in your case llc = cluster, so it ends up working cluster -> cluster.
Please read:

static void update_top_cache_domain(int cpu)
{
	struct sched_domain_shared *sds = NULL;
	struct sched_domain *sd;
	int id = cpu;
	int size = 1;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}

	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;
	per_cpu(sd_llc_id, cpu) = id;
	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

	...
}

This is also the content of Vincent's last comment:
"
My concern is that there are some ongoing discussion to make more
usage of the CLUSTER level than what is currently done and it assumes
that we have a valid LLC level after the CLUSTER one which is not your
case and I'm afraid that it will be suboptimal for you because CLUSTER
and LLC are wrongly the same for your case and then you will come back
to add more exception in the generic scheduler code to cover this
first exception"

So it is incorrect for you to say you gain performance only from load balancing :-)

>
> Thanks,
>
> --
> Darren Hart
> Ampere Computing / OS and Kernel

Thanks
Barry
Darren Hart Feb. 23, 2022, 3:07 a.m. UTC | #17
On Wed, Feb 16, 2022 at 06:52:29PM +0100, Vincent Guittot wrote:
> On Wed, 16 Feb 2022 at 17:24, Darren Hart <darren@os.amperecomputing.com> wrote:
> >
> > On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
> > > On Tue, 15 Feb 2022 at 21:05, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
> > > > > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > > > > >
> > > > > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > > > > >
> > > > > > > > > > > > BUG: arch topology borken
> > > > > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > > > > >
> > > > > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > > > > removed entirely as redundant.
> > > > > > > > > > > >
> > > > > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > > > > >
> > > > > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > > > > >
> > > > > > > > > > > > For CPU0:
> > > > > > > > > > > >
> > > > > > > > > > > > With CLS:
> > > > > > > > > > > > CLS  [0-1]
> > > > > > > > > > > > DIE  [0-79]
> > > > > > > > > > > > NUMA [0-159]
> > > > > > > > > > > >
> > > > > > > > > > > > Without CLS:
> > > > > > > > > > > > DIE  [0-79]
> > > > > > > > > > > > NUMA [0-159]
> > > > > > > > > > > >
> > > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > > > > >
> > > > > > > > > > > Hi Darrent,
> > > > > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > > > > for each cpu?
> > > > > > > > > > >
> > > > > > > > > > > > ---
> > > > > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > > > > >         }
> > > > > > > > > > > >  }
> > > > > > > > > > > >
> > > > > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > > > > +#endif
> > > > > > > > > > > > +
> > > > > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > > > > +#endif
> > > > > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > > > > +       { NULL, },
> > > > > > > > > > > > +};
> > > > > > > > > > > > +
> > > > > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > > > >  {
> > > > > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > > > > >         int err;
> > > > > > > > > > > >         unsigned int cpu;
> > > > > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > > > >
> > > > > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > > > > +
> > > > > > > > > > > > +               /*
> > > > > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > > > > +                */
> > > > > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > > > > >
> > > > > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > > > > CLS and MC both?
> > > > > > > > > >
> > > > > > > > > > What is the *current* behaviour on such a system?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > As I understand it, any system that uses the default topology which has
> > > > > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > > > > will behave as described above by printing the following for each CPU
> > > > > > > > > matching this criteria:
> > > > > > > > >
> > > > > > > > >   BUG: arch topology borken
> > > > > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > > > > >
> > > > > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > > > > on.
> > > > > > > > >
> > > > > > > > > That would still be the behavior for this type of system after this
> > > > > > > > > patch is applied.
> > > > > > > >
> > > > > > > > That's what I thought, but in that case applying your patch is a net
> > > > > > > > improvement: systems either get current or better behaviour.
> > > > > > >
> > > > > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > > > > flag
> > > > > > >
> > > > > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > > > > not disappeared Looks like there is a mismatch in topology description
> > > > > >
> > > > > > Hi Vincent,
> > > > > >
> > > > > > Agree. Where do you think this mismatch exists?
> > > > >
> > > > > I think that the problem comes from that the default topology order is
> > > > > assumed to be :
> > > > > SMT
> > > > > CLUSTER shares pkg resources i.e. cache
> > > > > MC
> > > > > DIE
> > > > > NUMA
> > > > >
> > > > > but in your case, you want a topology order like :
> > > > > SMT
> > > > > MC
> > > > > CLUSTER shares SCU
> > > > > DIE
> > > > > NUMA
> > > >
> > > > Given the fairly loose definition of some of these domains and the
> > > > freedom to adjust flags with custom topologies, I think it's difficult
> > > > to say it needs to be this or that. As you point out, this stems from an
> > > > assumption in the default topology, so eliding the MC level within the
> > > > current set of abstractions for a very targeted topology still seems
> > > > reasonable to address the BUG in the very near term in a contained way.
> > >
> > > But if another SoC comes with a valid MC then a CLUSTER, this proposal
> > > will not work.
> > >
> > > Keep in mind that the MC level will be removed/degenerate when
> > > building because it is useless in your case so the scheduler topology
> > > will still be the same at the end but it will support more case. That
> > > why I think you should keep MC level
> >
> > Hi Vincent,
> >
> > Thanks for reiterating, I don't think I quite understood what you were
> > suggesting before. Is the following in line with what you were thinking?
> >
> > I am testing a version of this patch which uses a topology like this instead:
> >
> > MC
> > CLS
> > DIE
> > NUMA
> >
> > (I tested without an SMT domain since the trigger is still MC weight==1, so
> > there is no valid topology that includes an SMT level under these conditions).
> >
> > Which results in no BUG output and a final topology on Altra of:
> >
> > CLS
> > DIE
> > NUMA
> >
> > Which so far seems right (I still need to do some testing and review the sched
> > debug data).
> >
> > If we take this approach, I think to address your concern about other systems
> > with valid MCs, we would need a different trigger that MC weight == 1 to use
> > this alternate topology. Do you have a suggestion on what to trigger this on?
> 
> AFAICT, this CLUSTER level is only supported by ACPI. In
> parse_acpi_topology() you should be able to know if cluster level is
> above or below the level returned by acpi_find_last_cache_level() and
> set the correct topology table accordingly
> 

Thanks Vincent,

This made sense as a place to start to me. The more I dug into the ACPI PPTT
code, the more I ran into conflicts with the API which would require extending
it in ways that seem contrary to its intent. For example, the exposed API uses
kernel logical CPUs rather than the topology ids (needed for working with
processor containers), the cpu_topology masks haven't been populated yet at
that point, and acpi_find_last_cache_level() is decoupled from the CPU topology
level. So what we're really testing for is whether the cluster cpumask is a
subset of the coregroup cpumask, and it made the most sense to me to keep that
check in smp.c, after the cpumasks have been updated and stored.
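
For reference, a rough sketch of that subset test (illustrative only; the
helper name and where it would live are assumptions on my part, not
necessarily what v2 will look like):

static bool __init cluster_wider_than_coregroup(void)
{
	unsigned int cpu;

	/*
	 * Once the per-CPU masks have been populated, detect a CLS mask
	 * that is wider than the MC (coregroup) mask.
	 */
	for_each_possible_cpu(cpu) {
		if (!cpumask_subset(cpu_clustergroup_mask(cpu),
				    cpu_coregroup_mask(cpu)))
			return true;
	}

	return false;
}

If that returns true, smp_prepare_cpus() could then install the alternate
topology table with set_sched_topology().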

I'll send v2 out for review shortly using this approach.
Darren Hart Feb. 23, 2022, 3:20 a.m. UTC | #18
On Thu, Feb 17, 2022 at 07:56:00AM +1300, Barry Song wrote:
...
> > > > Then, there is another point:
> > > > In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES
> > > > which is used to define some scheduler internal variable like
> > > > sd_llc(sched domain last level of cache) which allows fast task
> > > > migration between this cpus in this level at wakeup. In your case the
> > > > sd_llc should not be the cluster but the MC with only one CPU. But I
> > > > would not be surprised that most of perf improvement comes from this
> > > > sd_llc wrongly set to cluster instead of the single CPU
> > >
> > > I assume this "mistake" is actually what Ampere altra needs while it
> > > is wrong but getting
> > > right result? Ampere altra has already got both:
> >
> > Hi Barry,
> >
> > Generally yes - although I do think we're placing too much emphasis on
> > the "right" or "wrong" of a heuristic which are more fluid in
> > definition over time. (e.g. I expect this will look different in a year
> > based on what we learn from this and other non current default topologies).
> >
> > > 1. Load Balance between clusters
> > > 2. wake_affine by select sibling cpu which is sharing SCU
> > >
> > > I am not sure how much 1 and 2 are helping Darren's workloads respectively.
> >
> > We definitely see improvements with load balancing between clusters.
> > We're running some tests with the wake_affine patchset you pointed me to
> > (thanks for that). My initial tbench runs resulted in higher average and
> > max latencies reported. I need to collect more results and see the
> > impact to other benchmarks of interest before I have more to share on
> > that.
> 
> Hi Darren,
> if you read Vincent's comments carefully, you will find it is
> pointless for you to
> test the wake_affine patchset as you have already got it. in your
> case, sd_llc_id
> is set to sd_cluster level due to PKG_RESOURCES sharing. So with my new
> patchset for wake_affine, it is completely redundant for your machine
> as it works
> with the assumption cluster-> llc. but for your case, llc=cluster, so
> it works in
> cluster->cluster.

Thanks Barry,

Makes sense as described. I did see degradation in the tests we ran with this
patch applied to 5.17-rc3. I'll have to follow up with you on that when I can
dig into it more. I'd be interested in the specifics of your testing to run
something similar. I think you said you were reporting on tbench?
Vincent Guittot Feb. 23, 2022, 8:19 a.m. UTC | #19
On Wed, 23 Feb 2022 at 04:08, Darren Hart <darren@os.amperecomputing.com> wrote:
>
> On Wed, Feb 16, 2022 at 06:52:29PM +0100, Vincent Guittot wrote:
> > On Wed, 16 Feb 2022 at 17:24, Darren Hart <darren@os.amperecomputing.com> wrote:
> > >
> > > On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
> > > > On Tue, 15 Feb 2022 at 21:05, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > > >
> > > > > On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
> > > > > > On Tue, 15 Feb 2022 at 18:32, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > > > > >
> > > > > > > On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
> > > > > > > > On Tue, 15 Feb 2022 at 17:46, Will Deacon <will@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
> > > > > > > > > > On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
> > > > > > > > > > > On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Darren Hart [mailto:darren@os.amperecomputing.com]
> > > > > > > > > > > > > Sent: Friday, February 11, 2022 2:43 PM
> > > > > > > > > > > > > To: LKML <linux-kernel@vger.kernel.org>; Linux Arm
> > > > > > > > > > > > > <linux-arm-kernel@lists.infradead.org>
> > > > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>;
> > > > > > > > > > > > > Peter Zijlstra <peterz@infradead.org>; Vincent Guittot
> > > > > > > > > > > > > <vincent.guittot@linaro.org>; Song Bao Hua (Barry Song)
> > > > > > > > > > > > > <song.bao.hua@hisilicon.com>; Valentin Schneider
> > > > > > > > > > > > > <valentin.schneider@arm.com>; D . Scott Phillips
> > > > > > > > > > > > > <scott@os.amperecomputing.com>; Ilkka Koskinen
> > > > > > > > > > > > > <ilkka@os.amperecomputing.com>; stable@vger.kernel.org
> > > > > > > > > > > > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> > > > > > > > > > > > >
> > > > > > > > > > > > > SoCs such as the Ampere Altra define clusters but have no shared
> > > > > > > > > > > > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > > > > > > > > > > > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> > > > > > > > > > > > >
> > > > > > > > > > > > > BUG: arch topology borken
> > > > > > > > > > > > >      the CLS domain not a subset of the MC domain
> > > > > > > > > > > > >
> > > > > > > > > > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > > > > > > > > > > > > level cpu mask is then extended to that of the CLS child, and is later
> > > > > > > > > > > > > removed entirely as redundant.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > > > > > > > > > > > > alternative sched_domain_topology equivalent to the default if
> > > > > > > > > > > > > CONFIG_SCHED_MC were disabled.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The final resulting sched domain topology is unchanged with or without
> > > > > > > > > > > > > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > > > > > > > > > > >
> > > > > > > > > > > > > For CPU0:
> > > > > > > > > > > > >
> > > > > > > > > > > > > With CLS:
> > > > > > > > > > > > > CLS  [0-1]
> > > > > > > > > > > > > DIE  [0-79]
> > > > > > > > > > > > > NUMA [0-159]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Without CLS:
> > > > > > > > > > > > > DIE  [0-79]
> > > > > > > > > > > > > NUMA [0-159]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > > > > > > > > > Cc: Will Deacon <will@kernel.org>
> > > > > > > > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > > > > > > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > > > > > > > > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > > > > > > > > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > > > > > > > > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > > > > > > > > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Darrent,
> > > > > > > > > > > > What kind of resources are clusters sharing on Ampere Altra?
> > > > > > > > > > > > So on Altra, cpus are not sharing LLC? Each LLC is separate
> > > > > > > > > > > > for each cpu?
> > > > > > > > > > > >
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > > > > > >  1 file changed, 32 insertions(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > > > > > > > > index 27df5c1e6baa..0a78ac5c8830 100644
> > > > > > > > > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > > > > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > > > > > > > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > > > > > > > > > > > >         }
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >
> > > > > > > > > > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > > > > > > > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > > > > > > > > > +       { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > > > > > > > > > +#endif
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > > > > > > > > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > > > > > > > > > > +#endif
> > > > > > > > > > > > > +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > > > > > > > > > +       { NULL, },
> > > > > > > > > > > > > +};
> > > > > > > > > > > > > +
> > > > > > > > > > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > >         const struct cpu_operations *ops;
> > > > > > > > > > > > > +       bool use_no_mc_topology = true;
> > > > > > > > > > > > >         int err;
> > > > > > > > > > > > >         unsigned int cpu;
> > > > > > > > > > > > >         unsigned int this_cpu;
> > > > > > > > > > > > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > > > > > > > > > >
> > > > > > > > > > > > >                 set_cpu_present(cpu, true);
> > > > > > > > > > > > >                 numa_store_cpu_info(cpu);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +               /*
> > > > > > > > > > > > > +                * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > > > > > > > > > > > > +                */
> > > > > > > > > > > > > +               if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > > > > > > > > > > > > +                       use_no_mc_topology = false;
> > > > > > > > > > > >
> > > > > > > > > > > > This seems to be wrong? If you have 5 cpus,
> > > > > > > > > > > > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> > > > > > > > > > > > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> > > > > > > > > > > > need to remove MC, but for cpu1-4, you will need
> > > > > > > > > > > > CLS and MC both?
> > > > > > > > > > >
> > > > > > > > > > > What is the *current* behaviour on such a system?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As I understand it, any system that uses the default topology which has
> > > > > > > > > > a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1
> > > > > > > > > > will behave as described above by printing the following for each CPU
> > > > > > > > > > matching this criteria:
> > > > > > > > > >
> > > > > > > > > >   BUG: arch topology borken
> > > > > > > > > >         the [CLS,SMT,...] domain not a subset of the MC domain
> > > > > > > > > >
> > > > > > > > > > And then extend the MC domain cpumask to match that of the child and continue
> > > > > > > > > > on.
> > > > > > > > > >
> > > > > > > > > > That would still be the behavior for this type of system after this
> > > > > > > > > > patch is applied.
> > > > > > > > >
> > > > > > > > > That's what I thought, but in that case applying your patch is a net
> > > > > > > > > improvement: systems either get current or better behaviour.
> > > > > > > >
> > > > > > > > CLUSTER level is normally defined as a intermediate group of the MC
> > > > > > > > level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES
> > > > > > > > flag
> > > > > > > >
> > > > > > > > In the case of Ampere altra, they consider that CPUA have a CLUSTER
> > > > > > > > level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and
> > > > > > > > larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES
> > > > > > > > with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has
> > > > > > > > not disappeared Looks like there is a mismatch in topology description
> > > > > > >
> > > > > > > Hi Vincent,
> > > > > > >
> > > > > > > Agree. Where do you think this mismatch exists?
> > > > > >
> > > > > > I think that the problem comes from that the default topology order is
> > > > > > assumed to be :
> > > > > > SMT
> > > > > > CLUSTER shares pkg resources i.e. cache
> > > > > > MC
> > > > > > DIE
> > > > > > NUMA
> > > > > >
> > > > > > but in your case, you want a topology order like :
> > > > > > SMT
> > > > > > MC
> > > > > > CLUSTER shares SCU
> > > > > > DIE
> > > > > > NUMA
> > > > >
> > > > > Given the fairly loose definition of some of these domains and the
> > > > > freedom to adjust flags with custom topologies, I think it's difficult
> > > > > to say it needs to be this or that. As you point out, this stems from an
> > > > > assumption in the default topology, so eliding the MC level within the
> > > > > current set of abstractions for a very targeted topology still seems
> > > > > reasonable to address the BUG in the very near term in a contained way.
> > > >
> > > > But if another SoC comes with a valid MC then a CLUSTER, this proposal
> > > > will not work.
> > > >
> > > > Keep in mind that the MC level will be removed/degenerate when
> > > > building because it is useless in your case so the scheduler topology
> > > > will still be the same at the end but it will support more case. That
> > > > why I think you should keep MC level
> > >
> > > Hi Vincent,
> > >
> > > Thanks for reiterating, I don't think I quite understood what you were
> > > suggesting before. Is the following in line with what you were thinking?
> > >
> > > I am testing a version of this patch which uses a topology like this instead:
> > >
> > > MC
> > > CLS
> > > DIE
> > > NUMA
> > >
> > > (I tested without an SMT domain since the trigger is still MC weight==1, so
> > > there is no valid topology that includes an SMT level under these conditions).
> > >
> > > Which results in no BUG output and a final topology on Altra of:
> > >
> > > CLS
> > > DIE
> > > NUMA
> > >
> > > Which so far seems right (I still need to do some testing and review the sched
> > > debug data).
> > >
> > > If we take this approach, I think to address your concern about other systems
> > > with valid MCs, we would need a different trigger that MC weight == 1 to use
> > > this alternate topology. Do you have a suggestion on what to trigger this on?
> >
> > AFAICT, this CLUSTER level is only supported by ACPI. In
> > parse_acpi_topology() you should be able to know if cluster level is
> > above or below the level returned by acpi_find_last_cache_level() and
> > set the correct topology table accordingly
> >
>
> Thanks Vincent,
>
> This made sense as a place to start to me. The more I dug into the ACPI PPTT
> code, I kept running into conflicts with the API which would require extending
> it in ways that seems contrary to its intent. e.g. the exposed API uses Kernel
> logical CPUs rather than the topology ids (needed for working with processor
> containers). The cpu_topology masks haven't been populated yet, and
> acpi_find_last_cache_level is decoupled from the CPU topology level. So what
> we're really testing for is if the cluster cpumask is a subset of the coregroup
> cpumask or not, and it made the most sense to me to keep that in smp.c after the
> cpumasks have been updated and stored.

I'm not sure why you want to compare cpumasks when you can directly
compare topology levels, which is exactly what we are looking for in
order to correctly order the scheduler topology. I was expecting
something like the below to be enough. acpi_find_cluster_level() needs
to be created and should be similar to find_acpi_cpu_topology_cluster()
but return the level instead of the id. The main advantage is that
everything is contained in topology.c, which makes sense as we are
playing with topology.

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 9ab78ad826e2..4dac0491b7e3 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -84,6 +84,7 @@ static bool __init acpi_cpu_is_threaded(int cpu)
 int __init parse_acpi_topology(void)
 {
        int cpu, topology_id;
+       bool default_cluster_topology = true;

        if (acpi_disabled)
                return 0;
@@ -119,8 +120,16 @@ int __init parse_acpi_topology(void)
                        if (cache_id > 0)
                                cpu_topology[cpu].llc_id = cache_id;
                }
+
+               if (default_cluster_topology &&
+                   (i < acpi_find_cluster_level(cpu))) {
+                       default_cluster_topology = false;
+               }
        }

+       if (!default_cluster_topology)
+               set_sched_topology(arm64_no_mc_topology);
+
        return 0;
 }
 #endif

>
> I'll send v2 out for review shortly using this approach.

Ok.

>
> --
> Darren Hart
> Ampere Computing / OS and Kernel
Darren Hart Feb. 23, 2022, 4:39 p.m. UTC | #20
On Wed, Feb 23, 2022 at 09:19:17AM +0100, Vincent Guittot wrote:

...

> > > AFAICT, this CLUSTER level is only supported by ACPI. In
> > > parse_acpi_topology() you should be able to know if cluster level is
> > > above or below the level returned by acpi_find_last_cache_level() and
> > > set the correct topology table accordingly
> > >
> >
> > Thanks Vincent,
> >
> > This made sense as a place to start to me. The more I dug into the ACPI PPTT
> > code, I kept running into conflicts with the API which would require extending
> > it in ways that seems contrary to its intent. e.g. the exposed API uses Kernel
> > logical CPUs rather than the topology ids (needed for working with processor
> > containers). The cpu_topology masks haven't been populated yet, and
> > acpi_find_last_cache_level is decoupled from the CPU topology level. So what
> > we're really testing for is if the cluster cpumask is a subset of the coregroup
> > cpumask or not, and it made the most sense to me to keep that in smp.c after the
> > cpumasks have been updated and stored.
> 
> I'm not sure why you want to compare cpumask when you can directly
> compare topology level which is exactly what we are looking for in
> order to correctly order the scheduler topology. I was expecting
> something like the below to be enough. acpi_find_cluster_level() needs
> to be created and should be similar to
> find_acpi_cpu_topology_cluster() but return level instead of id. The
> main advantage is that everything is contained in topology.c which
> makes sense as we are playing with topology

Hi Vincent,

This was my thinking as well before I dug into the ACPI PPTT interfaces.

The CPU topology levels and the cache levels are independent, and, assuming I've
not misunderstood the implementation, acpi_find_cache_level() returns the
highest *cache* level described in the PPTT for a given CPU.

For the Ampere Altra, for example:

CPU Topo  1       System
 Level    2         Package
   |      3           Cluster
   |      4             Processor --- L1 I Cache \____ L2 U Cache -x
  \/                              --- L1 D Cache /

          4             Processor --- L1 I Cache \____ L2 U Cache -x
                                  --- L1 D Cache /

                  Cache Level ---->    1                2

acpi_find_cache_level() returns "2" for any logical CPU here, but that doesn't
tell us anything about the CPU topology level across which this cache is shared.

This is what drove me out of topology.c and up into smp.c, where the various
topology masks have already been populated and can be compared. I'll spend a bit
more time before sending a cpumask implementation to see if there is a better
point to do this, where the CPU topology level that shares the cache is more
readily available.

> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 9ab78ad826e2..4dac0491b7e3 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -84,6 +84,7 @@ static bool __init acpi_cpu_is_threaded(int cpu)
>  int __init parse_acpi_topology(void)
>  {
>         int cpu, topology_id;
> +       bool default_cluster_topology = true;
> 
>         if (acpi_disabled)
>                 return 0;
> @@ -119,8 +120,16 @@ int __init parse_acpi_topology(void)
>                         if (cache_id > 0)
>                                 cpu_topology[cpu].llc_id = cache_id;
>                 }
> +
> +               if (default_cluster_topology &&
> +                   (i < acpi_find_cluster_level(cpu))) {

Per the above, from what I understand, this is comparing CPU topology levels
with cache levels, which are independent of each other.

> +                       default_cluster_topology = false;
> +               }
>         }
> 
> +       if (!default_cluster_topology)
> +               set_sched_topology(arm64_no_mc_topology);
> +
>         return 0;
>  }

Thanks,
Vincent Guittot Feb. 23, 2022, 5:23 p.m. UTC | #21
On Wed, 23 Feb 2022 at 17:39, Darren Hart <darren@os.amperecomputing.com> wrote:
>
> On Wed, Feb 23, 2022 at 09:19:17AM +0100, Vincent Guittot wrote:
>
> ...
>
> > > > AFAICT, this CLUSTER level is only supported by ACPI. In
> > > > parse_acpi_topology() you should be able to know if cluster level is
> > > > above or below the level returned by acpi_find_last_cache_level() and
> > > > set the correct topology table accordingly
> > > >
> > >
> > > Thanks Vincent,
> > >
> > > This made sense as a place to start to me. The more I dug into the ACPI PPTT
> > > code, I kept running into conflicts with the API which would require extending
> > > it in ways that seems contrary to its intent. e.g. the exposed API uses Kernel
> > > logical CPUs rather than the topology ids (needed for working with processor
> > > containers). The cpu_topology masks haven't been populated yet, and
> > > acpi_find_last_cache_level is decoupled from the CPU topology level. So what
> > > we're really testing for is if the cluster cpumask is a subset of the coregroup
> > > cpumask or not, and it made the most sense to me to keep that in smp.c after the
> > > cpumasks have been updated and stored.
> >
> > I'm not sure why you want to compare cpumask when you can directly
> > compare topology level which is exactly what we are looking for in
> > order to correctly order the scheduler topology. I was expecting
> > something like the below to be enough. acpi_find_cluster_level() needs
> > to be created and should be similar to
> > find_acpi_cpu_topology_cluster() but return level instead of id. The
> > main advantage is that everything is contained in topology.c which
> > makes sense as we are playing with topology
>
> Hi Vincent,
>
> This was my thinking as well before I dug into the acpi pptt interfaces.
>
> The cpu topology levels and the cache levels are independent and assuming I've
> not misunderstood the implementation, acpi_find_cache_level() returns the
> highest *cache* level described in the PPTT for a given CPU.
>
> For the Ampere Altra, for example:
>
> CPU Topo  1       System
>  Level    2         Package
>    |      3           Cluster
>    |      4             Processor --- L1 I Cache \____ L2 U Cache -x
>   \/                              --- L1 D Cache /
>
>           4             Processor --- L1 I Cache \____ L2 U Cache -x
>                                   --- L1 D Cache /
>
>                   Cache Level ---->    1                2
>
> acpi_find_cache_level() returns "2" for any logical cpu, but this doesn't tell
> us anything about the CPU topology level across which this cache is shared.

Yeah, I had wrongly assumed that we can directly compare the cache level
with the cluster level because find_acpi_cpu_cache_topology() returns a
cpu node id, but find_acpi_cpu_cache_topology() walks the cpu topology to
find the cpu node which maps the cache level. This means that we would
have to walk the cpu topology until we find either the cluster node id or
the cache node id, to know which one belongs to the other.
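
As a rough illustration of that walk, here is a stand-alone toy model of the
ordering check only (the struct and helper are made up for illustration; this
is not the PPTT parsing code):

#include <stdbool.h>
#include <stdio.h>

/* each node only needs to know its parent for this walk */
struct topo_node {
	struct topo_node *parent;
};

/* true if @a is reached before @b when walking up from @leaf */
static bool hit_first(struct topo_node *leaf, struct topo_node *a,
		      struct topo_node *b)
{
	struct topo_node *n;

	for (n = leaf; n; n = n->parent) {
		if (n == a)
			return true;
		if (n == b)
			return false;
	}
	return false;
}

int main(void)
{
	/* Altra-like shape: the last cache level (L2) is private to the core */
	struct topo_node package = { NULL };
	struct topo_node cluster = { &package };
	struct topo_node core    = { &cluster };	/* the core node owns the L2 */

	printf("cluster above the LLC: %s\n",
	       hit_first(&core, &core, &cluster) ? "yes" : "no");	/* yes */

	/* alternative shape: a last level cache shared at the die, above the cluster */
	struct topo_node die   = { NULL };		/* the die node owns the L3 */
	struct topo_node cls2  = { &die };
	struct topo_node core2 = { &cls2 };

	printf("cluster above the LLC: %s\n",
	       hit_first(&core2, &die, &cls2) ? "yes" : "no");	/* no */

	return 0;
}

Whichever of the two node ids is reached first on the way up tells us whether
the cluster sits above or below the last cache level, and therefore which
topology table to pick.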

>
> This is what drove me out of topology.c and up into smp.c after the various
> topologies are populated and comparing masks. I'll spend a bit more time before
> sending a cpumask implementation to see if there is a better point to do this
> where the cpu topology level with shared cache is more readily available.
>
> >
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 9ab78ad826e2..4dac0491b7e3 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -84,6 +84,7 @@ static bool __init acpi_cpu_is_threaded(int cpu)
> >  int __init parse_acpi_topology(void)
> >  {
> >         int cpu, topology_id;
> > +       bool default_cluster_topology = true;
> >
> >         if (acpi_disabled)
> >                 return 0;
> > @@ -119,8 +120,16 @@ int __init parse_acpi_topology(void)
> >                         if (cache_id > 0)
> >                                 cpu_topology[cpu].llc_id = cache_id;
> >                 }
> > +
> > +               if (default_cluster_topology &&
> > +                   (i < acpi_find_cluster_level(cpu))) {
>
> Per above, from what I understand, this is comparing cpu topology levels with
> cache levels, which are independent from each other.
>
> > +                       default_cluster_topology = false;
> > +               }
> >         }
> >
> > +       if (!default_cluster_topology)
> > +               set_sched_topology(arm64_no_mc_topology);
> > +
> >         return 0;
> >  }
>
> Thanks,
>
> --
> Darren Hart
> Ampere Computing / OS and Kernel
diff mbox series

Patch

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 27df5c1e6baa..0a78ac5c8830 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -715,9 +715,22 @@  void __init smp_init_cpus(void)
 	}
 }
 
+static struct sched_domain_topology_level arm64_no_mc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
 	const struct cpu_operations *ops;
+	bool use_no_mc_topology = true;
 	int err;
 	unsigned int cpu;
 	unsigned int this_cpu;
@@ -758,6 +771,25 @@  void __init smp_prepare_cpus(unsigned int max_cpus)
 
 		set_cpu_present(cpu, true);
 		numa_store_cpu_info(cpu);
+
+		/*
+		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
+		 */
+		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
+			use_no_mc_topology = false;
+	}
+
+	/*
+	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
+	 * weights=1. If they also define clusters with cpu_clustergroup_mask
+	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
+	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
+	 * to match CLS, and later discard the MC level. Avoid the bug by using
+	 * a topology without the MC if the cpu_coregroup_mask weights=1.
+	 */
+	if (use_no_mc_topology) {
+		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
+		set_sched_topology(arm64_no_mc_topology);
 	}
 }