
[v6,17/21] arch_topology: Limit span of cpu_clustergroup_mask()

Message ID 20220704101605.1318280-18-sudeep.holla@arm.com (mailing list archive)
State New, archived
Series arch_topology: Updates to add socket support and fix cluster ids

Commit Message

Sudeep Holla July 4, 2022, 10:16 a.m. UTC
From: Ionela Voinescu <ionela.voinescu@arm.com>

Currently the cluster identifier is not set on DT based platforms.
The reset or default value is -1 for all the CPUs. Once we assign the
cluster identifier values correctly, the cluster_sibling mask will be
populated and returned by cpu_clustergroup_mask() to contribute to the
creation of the CLS scheduling domain level, if SCHED_CLUSTER is
enabled.

To avoid topologies that will result in questionable or incorrect
scheduling domains, impose restrictions regarding the span of clusters,
as presented to the scheduling-domain building code: cluster_sibling
should not span more CPUs than, or the same CPUs as, cpu_coregroup_mask().

This is needed in order to obtain a strict separation between the MC and
CLS levels, and maintain the same domains for existing platforms in
the presence of CONFIG_SCHED_CLUSTER, where the new cluster information
is redundant and irrelevant for the scheduler.

While previously the scheduling domain builder code would have removed MC
as redundant and kept CLS if SCHED_CLUSTER was enabled and the
cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
now CLS will be removed and MC kept.

Cc: Darren Hart <darren@os.amperecomputing.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/base/arch_topology.c | 8 ++++++++
 1 file changed, 8 insertions(+)
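
The check itself is small (see the diff below), but its effect is easier
to see with concrete values. The following is a minimal, self-contained
userspace sketch of the logic; this is not kernel code, and the masks and
CPU numbers are made up purely for illustration:

#include <assert.h>

typedef unsigned long mask_t;   /* one bit per CPU, CPU0 = bit 0 */

/* Mirrors cpumask_subset(a, b): true if every CPU in 'a' is also in 'b'. */
static int mask_subset(mask_t a, mask_t b)
{
        return (a & ~b) == 0;
}

/* Simplified model of the patched cpu_clustergroup_mask(). */
static mask_t clustergroup_mask(int cpu, mask_t coregroup, mask_t cluster)
{
        /*
         * If the cluster spans the same CPUs as the coregroup (or more),
         * hand the scheduler a single-CPU mask so the CLS level
         * degenerates and MC is kept.
         */
        if (mask_subset(coregroup, cluster))
                return 1UL << cpu;
        return cluster;
}

int main(void)
{
        /* Altra-like: coregroup == cluster == {0,1} -> CLS collapses to {0}. */
        assert(clustergroup_mask(0, 0x3, 0x3) == 0x1);
        /* Typical: cluster {0,1} strictly inside coregroup {0-3} -> kept. */
        assert(clustergroup_mask(0, 0xf, 0x3) == 0x3);
        return 0;
}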

Comments

Darren Hart July 8, 2022, 12:10 a.m. UTC | #1
On Mon, Jul 04, 2022 at 11:16:01AM +0100, Sudeep Holla wrote:
> From: Ionela Voinescu <ionela.voinescu@arm.com>

Hi Sudeep and Ionela,

> 
> Currently the cluster identifier is not set on DT based platforms.
> The reset or default value is -1 for all the CPUs. Once we assign the
> cluster identifier values correctly, the cluster_sibling mask will be
> populated and returned by cpu_clustergroup_mask() to contribute to the
> creation of the CLS scheduling domain level, if SCHED_CLUSTER is
> enabled.
> 
> To avoid topologies that will result in questionable or incorrect
> scheduling domains, impose restrictions regarding the span of clusters,

Can you provide a specific example of a valid topology that results in
the wrong thing currently?

> as presented to the scheduling-domain building code: cluster_sibling
> should not span more CPUs than, or the same CPUs as, cpu_coregroup_mask().
> 
> This is needed in order to obtain a strict separation between the MC and
> CLS levels, and maintain the same domains for existing platforms in
> the presence of CONFIG_SCHED_CLUSTER, where the new cluster information
> is redundant and irrelevant for the scheduler.

Unfortunately, I believe this changes the behavior for the existing
Ampere Altra systems, resulting in degraded performance, particularly
for latency-sensitive workloads, by effectively reverting:

  db1e59483d topology: make core_mask include at least cluster_siblings

and ensuring the clustergroup_mask will return just one CPU for the
condition the above commit addresses.

> 
> While previously the scheduling domain builder code would have removed MC
> as redundant and kept CLS if SCHED_CLUSTER was enabled and the
> cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
> now CLS will be removed and MC kept.
> 

This is not desirable for all systems, particularly those which don't
have an L3 but do share other resources - such as the snoop filter in
the case of the Ampere Altra.

While not universally supported, we agreed in the discussion on the
above patch to allow systems to define clusters independently from the
L3 as an LLC since this is also independently defined in PPTT.

Going back to my first comment - does this fix an existing system with a
valid topology? It's not clear to me what that would look like. The
Ampere Altra presents a cluster level in PPTT because that is the
desirable topology for the system. If it's not desirable for another
system to have the cluster topology - shouldn't it not present that
layer to the kernel in the first place?

Thanks,

> Cc: Darren Hart <darren@os.amperecomputing.com>
> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>  drivers/base/arch_topology.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index e384afb6cac7..591c1f8e15e2 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -686,6 +686,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  
>  const struct cpumask *cpu_clustergroup_mask(int cpu)
>  {
> +	/*
> +	 * Forbid cpu_clustergroup_mask() to span more or the same CPUs as
> +	 * cpu_coregroup_mask().
> +	 */
> +	if (cpumask_subset(cpu_coregroup_mask(cpu),
> +			   &cpu_topology[cpu].cluster_sibling))
> +		return get_cpu_mask(cpu);
> +
>  	return &cpu_topology[cpu].cluster_sibling;
>  }
>  
> -- 
> 2.37.0
>
Sudeep Holla July 8, 2022, 8:04 a.m. UTC | #2
Hi Darren,

I will let Ionela or Dietmar cover some of the scheduler aspects as
I don't have much knowledge in that area.

On Thu, Jul 07, 2022 at 05:10:19PM -0700, Darren Hart wrote:
> On Mon, Jul 04, 2022 at 11:16:01AM +0100, Sudeep Holla wrote:
> > From: Ionela Voinescu <ionela.voinescu@arm.com>
> 
> Hi Sudeep and Ionela,
> 
> > 
> > Currently the cluster identifier is not set on DT based platforms.
> > The reset or default value is -1 for all the CPUs. Once we assign the
> > cluster identifier values correctly, the cluster_sibling mask will be
> > populated and returned by cpu_clustergroup_mask() to contribute to the
> > creation of the CLS scheduling domain level, if SCHED_CLUSTER is
> > enabled.
> > 
> > To avoid topologies that will result in questionable or incorrect
> > scheduling domains, impose restrictions regarding the span of clusters,
> 
> Can you provide a specific example of a valid topology that results in
> the wrong thing currently?
>

As a simple example, take Juno, with 2 clusters and an L2 for each
cluster. IIUC, MC is preferred over CLS, and the MC and CLS domains are
an exact match.

> > 
> > While previously the scheduling domain builder code would have removed MC
> > as redundant and kept CLS if SCHED_CLUSTER was enabled and the
> > cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
> > now CLS will be removed and MC kept.
> > 
> 
> This is not desirable for all systems, particularly those which don't
> have an L3 but do share other resources - such as the snoop filter in
> the case of the Ampere Altra.
> 
> While not universally supported, we agreed in the discussion on the
> above patch to allow systems to define clusters independently from the
> L3 as an LLC since this is also independently defined in PPTT.
>
> Going back to my first comment - does this fix an existing system with a
> valid topology? 

Yes as mentioned above Juno.

> It's not clear to me what that would look like. The Ampere Altra presents
> a cluster level in PPTT because that is the desirable topology for the
> system.

Absolutely the wrong reason. It should be presented because the hardware
is so, not because some OSPM desires something in some way. Sorry, that's
not what DT/ACPI is designed for. If two different OSPMs desire different
things, then a single ACPI description will not be sufficient.

> If it's not desirable for another system to have the cluster topology -
> shouldn't it not present that layer to the kernel in the first place?

Absolutely 100% yes, it must be presented if the hardware is designed so.
No ifs or buts.
Ionela Voinescu July 8, 2022, 9:05 a.m. UTC | #3
Hi Darren,

On Thursday 07 Jul 2022 at 17:10:19 (-0700), Darren Hart wrote:
> On Mon, Jul 04, 2022 at 11:16:01AM +0100, Sudeep Holla wrote:
> > From: Ionela Voinescu <ionela.voinescu@arm.com>
> 
> Hi Sudeep and Ionela,
> 
> > 
> > Currently the cluster identifier is not set on DT based platforms.
> > The reset or default value is -1 for all the CPUs. Once we assign the
> > cluster identifier values correctly, the cluster_sibling mask will be
> > populated and returned by cpu_clustergroup_mask() to contribute to the
> > creation of the CLS scheduling domain level, if SCHED_CLUSTER is
> > enabled.
> > 
> > To avoid topologies that will result in questionable or incorrect
> > scheduling domains, impose restrictions regarding the span of clusters,
> 
> Can you provide a specific example of a valid topology that results in
> the wrong thing currently?
> 

When CONFIG_SCHED_CLUSTER=y, all typical big.LITTLE platforms will end up
having a CLS level instead of MC, with an extra flag for the CLS level:
SD_PREFER_SIBLING. In addition to this, potentially broken cluster
descriptions in DT (let's say clusters spanning more CPUs than the LLC
domain) will result in broken scheduler topologies.

This drew our attention to the fact that the span of clusters should be
restricted to ensure they always span fewer CPUs than the LLC, if LLC
information exists and the LLC spans more than 1 core. But the Ampere
Altra functionality you introduced is maintained. I'll detail this below.

> > as presented to the scheduling-domain building code: cluster_sibling
> > should not span more CPUs than, or the same CPUs as, cpu_coregroup_mask().
> > 
> > This is needed in order to obtain a strict separation between the MC and
> > CLS levels, and maintain the same domains for existing platforms in
> > the presence of CONFIG_SCHED_CLUSTER, where the new cluster information
> > is redundant and irrelevant for the scheduler.
> 
> Unfortunately, I believe this changes the behavior for the existing
> Ampere Altra systems, resulting in degraded performance, particularly
> for latency-sensitive workloads, by effectively reverting:
> 
>   db1e59483d topology: make core_mask include at least cluster_siblings
> 
> and ensuring the clustergroup_mask will return just one CPU for the
> condition the above commit addresses.
> 

It does not change the functionality on Ampere Altra. cpu_coregroup_mask
will still return 2 CPUs (cluster span). The difference is that
cpu_clustergroup_mask will see that cpu_coregroup_mask spans the same
CPUs and it will return a single CPU. This results in the CLS level
being invalidated, and the MC level maintained. But MC will span 2 CPUs,
instead of 1, which was the case before your fix. This is alright as
MC and CLS have the same flags so the existing functionality is fully
maintained.
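
To make the mechanism concrete, here is a rough userspace model of the
two degeneration rules at play, loosely based on the scheduler's
sd_degenerate() and sd_parent_degenerate(); it is not the scheduler's
actual code, and the spans and flags below are made up:

#include <stdio.h>

struct sd_level {
        const char *name;
        unsigned long span;     /* bitmask of CPUs */
        unsigned int flags;
};

/* A domain spanning a single CPU is dropped (cf. sd_degenerate()). */
static int degenerate(const struct sd_level *sd)
{
        return __builtin_popcountl(sd->span) <= 1;
}

/* A parent spanning the same CPUs as its child, adding no extra flags,
 * is dropped (cf. sd_parent_degenerate()). */
static int parent_degenerates(const struct sd_level *parent,
                              const struct sd_level *child)
{
        return parent->span == child->span &&
               !(parent->flags & ~child->flags);
}

int main(void)
{
        /* Altra DSU pair {0,1}: before this patch CLS == MC == {0,1}, so
         * the parent (MC) was removed; with it, CLS == {0} and is dropped
         * as a single-CPU domain, leaving MC == {0,1}. */
        struct sd_level cls_before = { "CLS", 0x3, 0 };
        struct sd_level mc_before  = { "MC",  0x3, 0 };
        struct sd_level cls_after  = { "CLS", 0x1, 0 };

        printf("before: %s dropped? %d\n", mc_before.name,
               parent_degenerates(&mc_before, &cls_before));   /* 1 */
        printf("after: %s dropped? %d\n", cls_after.name,
               degenerate(&cls_after));                        /* 1 */
        return 0;
}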

I've tested on Ampere Altra:

without patch:

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/min_interval:2
cpu0/domain0/name:CLS

cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
cpu0/domain1/min_interval:80
cpu0/domain1/name:DIE

cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain2/min_interval:160
cpu0/domain2/name:NUMA

with this patch:

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/min_interval:2
cpu0/domain0/name:MC

cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
cpu0/domain1/min_interval:80
cpu0/domain1/name:DIE

cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain2/min_interval:160
cpu0/domain2/name:NUMA

> > 
> > While previously the scheduling domain builder code would have removed MC
> > as redundant and kept CLS if SCHED_CLUSTER was enabled and the
> > cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
> > now CLS will be removed and MC kept.
> > 
> 
> This is not desirable for all systems, particularly those which don't
> have an L3 but do share other resources - such as the snoop filter in
> the case of the Ampere Altra.
> 
> While not universally supported, we agreed in the discussion on the
> above patch to allow systems to define clusters independently from the
> L3 as an LLC since this is also independently defined in PPTT.
> 
> Going back to my first comment - does this fix an existing system with a
> valid topology? It's not clear to me what that would look like. The
> Ampere Altra presents a cluster level in PPTT because that is the
> desirable topology for the system. If it's not desirable for another
> system to have the cluster topology - shouldn't it not present that
> layer to the kernel in the first place?
> 

Hopefully my comments above have clarified these.

Thanks,
Ionela.

> Thanks,
> 
> > Cc: Darren Hart <darren@os.amperecomputing.com>
> > Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
> > Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
> > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > ---
> >  drivers/base/arch_topology.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index e384afb6cac7..591c1f8e15e2 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -686,6 +686,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >  
> >  const struct cpumask *cpu_clustergroup_mask(int cpu)
> >  {
> > +	/*
> > +	 * Forbid cpu_clustergroup_mask() to span more or the same CPUs as
> > +	 * cpu_coregroup_mask().
> > +	 */
> > +	if (cpumask_subset(cpu_coregroup_mask(cpu),
> > +			   &cpu_topology[cpu].cluster_sibling))
> > +		return get_cpu_mask(cpu);
> > +
> >  	return &cpu_topology[cpu].cluster_sibling;
> >  }
> >  
> > -- 
> > 2.37.0
> > 
> 
> -- 
> Darren Hart
> Ampere Computing / OS and Kernel
Darren Hart July 8, 2022, 4:14 p.m. UTC | #4
On Fri, Jul 08, 2022 at 10:05:32AM +0100, Ionela Voinescu wrote:
> Hi Darren,
> 
> On Thursday 07 Jul 2022 at 17:10:19 (-0700), Darren Hart wrote:
> > On Mon, Jul 04, 2022 at 11:16:01AM +0100, Sudeep Holla wrote:
> > > From: Ionela Voinescu <ionela.voinescu@arm.com>
> > 
> > Hi Sudeep and Ionela,
> > 
> > > 
> > > Currently the cluster identifier is not set on DT based platforms.
> > > The reset or default value is -1 for all the CPUs. Once we assign the
> > > cluster identifier values correctly, the cluster_sibling mask will be
> > > populated and returned by cpu_clustergroup_mask() to contribute to the
> > > creation of the CLS scheduling domain level, if SCHED_CLUSTER is
> > > enabled.
> > > 
> > > To avoid topologies that will result in questionable or incorrect
> > > scheduling domains, impose restrictions regarding the span of clusters,
> > 
> > Can you provide a specific example of a valid topology that results in
> > the wrong thing currently?
> > 
> 
> When CONFIG_SCHED_CLUSTER=y, all typical big.LITTLE platforms will end up
> having a CLS level instead of MC, with an extra flag for the CLS level:
> SD_PREFER_SIBLING. In addition to this, potentially broken cluster
> descriptions in DT (let's say clusters spanning more CPUs than the LLC
> domain) will result in broken scheduler topologies.

You addressed my primary concern below, thank you. Re this point, I was
concerned that we were prioritizing correcting "broken cluster
descriptions" over "correct, but unusual cluster descriptions". Your
solution seems to elegantly address both.

> 
> This drew our attention to the fact that the span of clusters should be
> restricted to ensure they always span fewer CPUs than the LLC, if LLC
> information exists and the LLC spans more than 1 core. But the Ampere
> Altra functionality you introduced is maintained. I'll detail this below.
> 
> > > as presented to the scheduling-domain building code: cluster_sibling
> > > should not span more CPUs than, or the same CPUs as, cpu_coregroup_mask().
> > > 
> > > This is needed in order to obtain a strict separation between the MC and
> > > CLS levels, and maintain the same domains for existing platforms in
> > > the presence of CONFIG_SCHED_CLUSTER, where the new cluster information
> > > is redundant and irrelevant for the scheduler.
> > 
> > Unfortunately, I believe this changes the behavior for the existing
> > Ampere Altra systems, resulting in degraded performance, particularly
> > for latency-sensitive workloads, by effectively reverting:
> > 
> >   db1e59483d topology: make core_mask include at least cluster_siblings
> > 
> > and ensuring the clustergroup_mask will return just one CPU for the
> > condition the above commit addresses.
> > 
> 
> It does not change the functionality on Ampere Altra. cpu_coregroup_mask
> will still return 2 CPUs (cluster span). The difference is that
> cpu_clustergroup_mask will see that cpu_coregroup_mask spans the same
> CPUs and it will return a single CPU. This results in the CLS level
> being invalidated, and the MC level maintained. But MC will span 2 CPUs,
> instead of 1, which was the case before your fix. This is alright as
> MC and CLS have the same flags so the existing functionality is fully
> maintained.

Ah, of course. I missed the combined impact of my earlier change plus
yours, which is to first expand MC and then to collapse CLS. It's a
little roundabout for the Altra, but that seems reasonable as it's a
bit of a corner case in terms of topologies.

Thank you for the explanation.
Darren Hart July 8, 2022, 4:27 p.m. UTC | #5
On Fri, Jul 08, 2022 at 09:04:24AM +0100, Sudeep Holla wrote:
> Hi Darren,
> 
> I will let Ionela or Dietmar cover some of the scheduler aspects as
> I don't have much knowledge in that area.
> 
> On Thu, Jul 07, 2022 at 05:10:19PM -0700, Darren Hart wrote:
> > On Mon, Jul 04, 2022 at 11:16:01AM +0100, Sudeep Holla wrote:
> > > From: Ionela Voinescu <ionela.voinescu@arm.com>
> > 
> > Hi Sudeep and Ionela,
> > 
> > > 
> > > Currently the cluster identifier is not set on DT based platforms.
> > > The reset or default value is -1 for all the CPUs. Once we assign the
> > > cluster identifier values correctly, the cluster_sibling mask will be
> > > populated and returned by cpu_clustergroup_mask() to contribute to the
> > > creation of the CLS scheduling domain level, if SCHED_CLUSTER is
> > > enabled.
> > > 
> > > To avoid topologies that will result in questionable or incorrect
> > > scheduling domains, impose restrictions regarding the span of clusters,
> > 
> > Can you provide a specific example of a valid topology that results in
> > the wrong thing currently?
> >
> 
> As a simple example, take Juno, with 2 clusters and an L2 for each
> cluster. IIUC, MC is preferred over CLS, and the MC and CLS domains are
> an exact match.
> 
> > > 
> > > While previously the scheduling domain builder code would have removed MC
> > > as redundant and kept CLS if SCHED_CLUSTER was enabled and the
> > > cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
> > > now CLS will be removed and MC kept.
> > > 
> > 
> > This is not desirable for all systems, particularly those which don't
> > have an L3 but do share other resources - such as the snoop filter in
> > the case of the Ampere Altra.

I was wrong here. This patch also modifies the coregroup; the MC after
this patch is equivalent to the CLS before the patch. The Altra is not
negatively impacted here.

> > 
> > While not universally supported, we agreed in the discussion on the
> > above patch to allow systems to define clusters independently from the
> > L3 as an LLC since this is also independently defined in PPTT.
> >
> > Going back to my first comment - does this fix an existing system with a
> > valid topology? 
> 
> Yes as mentioned above Juno.
> 
> > It's not clear to me what that would look like. The Ampere Altra presents
> > a cluster level in PPTT because that is the desirable topology for the
> > system.
> 
> Absolutely the wrong reason. It should be presented because the hardware
> is so, not because some OSPM desires something in some way. Sorry, that's
> not what DT/ACPI is designed for. If two different OSPMs desire different
> things, then a single ACPI description will not be sufficient.

Agree. I worded that badly. I should have said the Altra presents a PPTT
topology that accurately reflects the hardware. There is no shared
cpu-side LLC, and there is an affinity between the DSU pairs which share
a snoop filter.

I do think the general assumption that MC shares a cpu-side LLC will
continue to present challenges to the Altra topology in terms of ongoing
changes to the code. I don't have a good solution to that at the
moment, something I'll continue to think on.

> 
> > If it's not desirable for another system to have the cluster topology -
> > shouldn't it not present that layer to the kernel in the first place?
> 
> Absolutely 100% yes, it must be presented if the hardware is designed so.
> No ifs or buts.
> 
> -- 
> Regards,
> Sudeep

Thanks Sudeep,

Patch

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e384afb6cac7..591c1f8e15e2 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -686,6 +686,14 @@  const struct cpumask *cpu_coregroup_mask(int cpu)
 
 const struct cpumask *cpu_clustergroup_mask(int cpu)
 {
+	/*
+	 * Forbid cpu_clustergroup_mask() to span more or the same CPUs as
+	 * cpu_coregroup_mask().
+	 */
+	if (cpumask_subset(cpu_coregroup_mask(cpu),
+			   &cpu_topology[cpu].cluster_sibling))
+		return get_cpu_mask(cpu);
+
 	return &cpu_topology[cpu].cluster_sibling;
 }