Message ID: f1deaeabfd31fdf512ff6502f38186ef842c2b1f.1646413117.git.darren@os.amperecomputing.com (mailing list archive)
State: New, archived
Series: [v3] topology: make core_mask include at least cluster_siblings
On Fri, Mar 04, 2022 at 09:01:36AM -0800, Darren Hart wrote: > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop > Control Unit, but have no shared CPU-side last level cache. > > cpu_coregroup_mask() will return a cpumask with weight 1, while > cpu_clustergroup_mask() will return a cpumask with weight 2. > > As a result, build_sched_domain() will BUG() once per CPU with: > > BUG: arch topology borken > the CLS domain not a subset of the MC domain > > The MC level cpumask is then extended to that of the CLS child, and is > later removed entirely as redundant. This sched domain topology is an > improvement over previous topologies, or those built without > SCHED_CLUSTER, particularly for certain latency sensitive workloads. > With the current scheduler model and heuristics, this is a desirable > default topology for Ampere Altra and Altra Max system. > > Rather than create a custom sched domains topology structure and > introduce new logic in arch/arm64 to detect these systems, update the > core_mask so coregroup is never a subset of clustergroup, extending it > to cluster_siblings if necessary. > > This has the added benefit over a custom topology of working for both > symmetric and asymmetric topologies. It does not address systems where > the cluster topology is above a populated mc topology, but these are not > considered today and can be addressed separately if and when they > appear. > > The final sched domain topology for a 2 socket Ampere Altra system is > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided: > > For CPU0: > > CONFIG_SCHED_CLUSTER=y > CLS [0-1] > DIE [0-79] > NUMA [0-159] > > CONFIG_SCHED_CLUSTER is not set > DIE [0-79] > NUMA [0-159] > > Cc: Sudeep Holla <sudeep.holla@arm.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Barry Song <song.bao.hua@hisilicon.com> > Cc: Valentin Schneider <valentin.schneider@arm.com> > Cc: D. Scott Phillips <scott@os.amperecomputing.com> > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com> > Cc: <stable@vger.kernel.org> # 5.16.x > Suggested-by: Barry Song <song.bao.hua@hisilicon.com> > Signed-off-by: Darren Hart <darren@os.amperecomputing.com> > --- > v1: Drop MC level if coregroup weight == 1 > v2: New sd topo in arch/arm64/kernel/smp.c > v3: No new topo, extend core_mask to cluster_siblings > > drivers/base/arch_topology.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > index 976154140f0b..a96f45db928b 100644 > --- a/drivers/base/arch_topology.c > +++ b/drivers/base/arch_topology.c > @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > core_mask = &cpu_topology[cpu].llc_sibling; > } > > + /* > + * For systems with no shared cpu-side LLC but with clusters defined, > + * extend core_mask to cluster_siblings. The sched domain builder will > + * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled. > + */ > + if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling)) > + core_mask = &cpu_topology[cpu].cluster_sibling; > + Sudeep, Vincent, are you happy with this now? Will
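The interaction the commit message describes can be reproduced in isolation. The following standalone C sketch is an illustration only, not the kernel's cpumask API: it models an Altra-like topology (weight-1 core masks, 2-CPU cluster masks, no shared CPU-side LLC) as plain 64-bit words, performs the subset check the sched domain builder effectively enforces, and then applies the same extension the patch adds to cpu_coregroup_mask():

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	for (int cpu = 0; cpu < 8; cpu++) {
		uint64_t mc  = 1ULL << cpu;        /* core mask, weight 1: no LLC siblings */
		uint64_t cls = 3ULL << (cpu & ~1); /* cluster mask, weight 2: SCU pair */

		/* The sched domain builder expects CLS to be a subset of MC. */
		if ((cls & mc) != cls)
			printf("cpu%d: BUG: arch topology borken (CLS not a subset of MC)\n", cpu);

		/* The fix: extend the core mask to the cluster siblings. */
		if ((mc & cls) == mc) /* cpumask_subset(core_mask, cluster_sibling) */
			mc = cls;

		printf("cpu%d: MC mask after fix: %#llx\n", cpu, (unsigned long long)mc);
	}
	return 0;
}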
On Tue, Mar 08, 2022 at 10:30:12AM +0000, Will Deacon wrote: > On Fri, Mar 04, 2022 at 09:01:36AM -0800, Darren Hart wrote: > > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop > > Control Unit, but have no shared CPU-side last level cache. > > > > cpu_coregroup_mask() will return a cpumask with weight 1, while > > cpu_clustergroup_mask() will return a cpumask with weight 2. > > > > As a result, build_sched_domain() will BUG() once per CPU with: > > > > BUG: arch topology borken > > the CLS domain not a subset of the MC domain > > > > The MC level cpumask is then extended to that of the CLS child, and is > > later removed entirely as redundant. This sched domain topology is an > > improvement over previous topologies, or those built without > > SCHED_CLUSTER, particularly for certain latency sensitive workloads. > > With the current scheduler model and heuristics, this is a desirable > > default topology for Ampere Altra and Altra Max system. > > > > Rather than create a custom sched domains topology structure and > > introduce new logic in arch/arm64 to detect these systems, update the > > core_mask so coregroup is never a subset of clustergroup, extending it > > to cluster_siblings if necessary. > > > > This has the added benefit over a custom topology of working for both > > symmetric and asymmetric topologies. It does not address systems where > > the cluster topology is above a populated mc topology, but these are not > > considered today and can be addressed separately if and when they > > appear. > > > > The final sched domain topology for a 2 socket Ampere Altra system is > > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided: > > > > For CPU0: > > > > CONFIG_SCHED_CLUSTER=y > > CLS [0-1] > > DIE [0-79] > > NUMA [0-159] > > > > CONFIG_SCHED_CLUSTER is not set > > DIE [0-79] > > NUMA [0-159] > > > > Cc: Sudeep Holla <sudeep.holla@arm.com> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Will Deacon <will@kernel.org> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Vincent Guittot <vincent.guittot@linaro.org> > > Cc: Barry Song <song.bao.hua@hisilicon.com> > > Cc: Valentin Schneider <valentin.schneider@arm.com> > > Cc: D. Scott Phillips <scott@os.amperecomputing.com> > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com> > > Cc: <stable@vger.kernel.org> # 5.16.x > > Suggested-by: Barry Song <song.bao.hua@hisilicon.com> > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com> > > --- > > v1: Drop MC level if coregroup weight == 1 > > v2: New sd topo in arch/arm64/kernel/smp.c > > v3: No new topo, extend core_mask to cluster_siblings > > > > drivers/base/arch_topology.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > > index 976154140f0b..a96f45db928b 100644 > > --- a/drivers/base/arch_topology.c > > +++ b/drivers/base/arch_topology.c > > @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > > core_mask = &cpu_topology[cpu].llc_sibling; > > } > > > > + /* > > + * For systems with no shared cpu-side LLC but with clusters defined, > > + * extend core_mask to cluster_siblings. The sched domain builder will > > + * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled. 
> > +	 */
> > +	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> > +		core_mask = &cpu_topology[cpu].cluster_sibling;
> > +
> >
> > Sudeep, Vincent, are you happy with this now?
>

It looks good to me. Since I don't have much knowledge of the scheduler, I asked Dietmar to have a look as well, in case I am missing any other impact. For now,

Acked-by: Sudeep Holla <sudeep.holla@arm.com>

--
Regards,
Sudeep
On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote: > > On Fri, Mar 04, 2022 at 09:01:36AM -0800, Darren Hart wrote: > > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop > > Control Unit, but have no shared CPU-side last level cache. > > > > cpu_coregroup_mask() will return a cpumask with weight 1, while > > cpu_clustergroup_mask() will return a cpumask with weight 2. > > > > As a result, build_sched_domain() will BUG() once per CPU with: > > > > BUG: arch topology borken > > the CLS domain not a subset of the MC domain > > > > The MC level cpumask is then extended to that of the CLS child, and is > > later removed entirely as redundant. This sched domain topology is an > > improvement over previous topologies, or those built without > > SCHED_CLUSTER, particularly for certain latency sensitive workloads. > > With the current scheduler model and heuristics, this is a desirable > > default topology for Ampere Altra and Altra Max system. > > > > Rather than create a custom sched domains topology structure and > > introduce new logic in arch/arm64 to detect these systems, update the > > core_mask so coregroup is never a subset of clustergroup, extending it > > to cluster_siblings if necessary. > > > > This has the added benefit over a custom topology of working for both > > symmetric and asymmetric topologies. It does not address systems where > > the cluster topology is above a populated mc topology, but these are not > > considered today and can be addressed separately if and when they > > appear. > > > > The final sched domain topology for a 2 socket Ampere Altra system is > > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided: > > > > For CPU0: > > > > CONFIG_SCHED_CLUSTER=y > > CLS [0-1] > > DIE [0-79] > > NUMA [0-159] > > > > CONFIG_SCHED_CLUSTER is not set > > DIE [0-79] > > NUMA [0-159] > > > > Cc: Sudeep Holla <sudeep.holla@arm.com> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Will Deacon <will@kernel.org> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Vincent Guittot <vincent.guittot@linaro.org> > > Cc: Barry Song <song.bao.hua@hisilicon.com> > > Cc: Valentin Schneider <valentin.schneider@arm.com> > > Cc: D. Scott Phillips <scott@os.amperecomputing.com> > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com> > > Cc: <stable@vger.kernel.org> # 5.16.x > > Suggested-by: Barry Song <song.bao.hua@hisilicon.com> > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com> > > --- > > v1: Drop MC level if coregroup weight == 1 > > v2: New sd topo in arch/arm64/kernel/smp.c > > v3: No new topo, extend core_mask to cluster_siblings > > > > drivers/base/arch_topology.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > > index 976154140f0b..a96f45db928b 100644 > > --- a/drivers/base/arch_topology.c > > +++ b/drivers/base/arch_topology.c > > @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > > core_mask = &cpu_topology[cpu].llc_sibling; > > } > > > > + /* > > + * For systems with no shared cpu-side LLC but with clusters defined, > > + * extend core_mask to cluster_siblings. The sched domain builder will > > + * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled. 
> > +	 */
> > +	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> > +		core_mask = &cpu_topology[cpu].cluster_sibling;
> > +
> >
> > Sudeep, Vincent, are you happy with this now?

I would not say that I'm happy, because this solution skews the core cpu mask in order to abuse the scheduler so that it will remove a wrong but useless level when it builds its domains. But this works, so as long as the maintainers are happy, I'm fine.

> >
> > Will
On 08/03/2022 12:04, Vincent Guittot wrote:
> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>>> ---
>>> v1: Drop MC level if coregroup weight == 1
>>> v2: New sd topo in arch/arm64/kernel/smp.c
>>> v3: No new topo, extend core_mask to cluster_siblings
>>>
>>>  drivers/base/arch_topology.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>>> index 976154140f0b..a96f45db928b 100644
>>> --- a/drivers/base/arch_topology.c
>>> +++ b/drivers/base/arch_topology.c
>>> @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>>  		core_mask = &cpu_topology[cpu].llc_sibling;
>>>  	}
>>>
>>> +	/*
>>> +	 * For systems with no shared cpu-side LLC but with clusters defined,
>>> +	 * extend core_mask to cluster_siblings. The sched domain builder will
>>> +	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.

IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.

This is what I get on my Ampere Altra (I guess I don't have the ACPI changes which would lead to a CLS sched domain):

# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
DIE
NUMA
root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
CONFIG_SCHED_CLUSTER=y

>>> +	 */
>>> +	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
>>> +		core_mask = &cpu_topology[cpu].cluster_sibling;
>>> +
>>
>> Sudeep, Vincent, are you happy with this now?
>
> I would not say that I'm happy because this solution skews the core
> cpu mask in order to abuse the scheduler so that it will remove a
> wrong but useless level when it will build its domains.
> But this works so as long as the maintainer are happy, I'm fine

I do not have any better idea than this tweak here either, in case the platform can't provide a cleaner setup.

Maybe the following is easier to read, but then we use '&cpu_topology[cpu].llc_sibling' in cpu_coregroup_mask() already ...

@@ -617,6 +617,7 @@ EXPORT_SYMBOL_GPL(cpu_topology);
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
+	const cpumask_t *cluster_mask = cpu_clustergroup_mask(cpu);
 
 	/* Find the smaller of NUMA, core or LLC siblings */
 	if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
@@ -628,6 +629,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 		core_mask = &cpu_topology[cpu].llc_sibling;
 	}
 
+	if (cpumask_subset(core_mask, cluster_mask))
+		core_mask = cluster_mask;
+
 	return core_mask;
 }

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote: > On 08/03/2022 12:04, Vincent Guittot wrote: > > On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote: > > [...] > > >>> --- > >>> v1: Drop MC level if coregroup weight == 1 > >>> v2: New sd topo in arch/arm64/kernel/smp.c > >>> v3: No new topo, extend core_mask to cluster_siblings > >>> > >>> drivers/base/arch_topology.c | 8 ++++++++ > >>> 1 file changed, 8 insertions(+) > >>> > >>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > >>> index 976154140f0b..a96f45db928b 100644 > >>> --- a/drivers/base/arch_topology.c > >>> +++ b/drivers/base/arch_topology.c > >>> @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > >>> core_mask = &cpu_topology[cpu].llc_sibling; > >>> } > >>> > >>> + /* > >>> + * For systems with no shared cpu-side LLC but with clusters defined, > >>> + * extend core_mask to cluster_siblings. The sched domain builder will > >>> + * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled. > > IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway. > > This is what I get on my Ampere Altra (I guess I don't have the ACPI > changes which would let to a CLS sched domain): > > # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name > DIE > NUMA > root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER > CONFIG_SCHED_CLUSTER=y I'd like to follow up on this. Would you share your dmidecode BIOS Information section? Which kernel version? > >>> + */ > >>> + if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling)) > >>> + core_mask = &cpu_topology[cpu].cluster_sibling; > >>> + > >> > >> Sudeep, Vincent, are you happy with this now? > > > > I would not say that I'm happy because this solution skews the core > > cpu mask in order to abuse the scheduler so that it will remove a > > wrong but useless level when it will build its domains. > > But this works so as long as the maintainer are happy, I'm fine I did explore the other options and they added considerably more complexity without much benefit in my view. I prefer this option which maintains the cpu_topology as described by the platform, and maps it into something that suits the current scheduler abstraction. I agree there is more work to be done here and intend to continue with it. > I do not have any better idea than this tweak here either in case the > platform can't provide a cleaner setup. I'd argue The platform is describing itself accurately in ACPI PPTT terms. The topology doesn't fit nicely within the kernel abstractions today. This is an area where I hope to continue to improve things going forward. > Maybe the following is easier to read but then we use > '&cpu_topology[cpu].llc_sibling' in cpu_coregroup_mask() already ... > > @@ -617,6 +617,7 @@ EXPORT_SYMBOL_GPL(cpu_topology); > const struct cpumask *cpu_coregroup_mask(int cpu) > { > const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu)); > + const cpumask_t *cluster_mask = cpu_clustergroup_mask(cpu); > > /* Find the smaller of NUMA, core or LLC siblings */ > if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) { > @@ -628,6 +629,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > core_mask = &cpu_topology[cpu].llc_sibling; > } > > + if (cpumask_subset(core_mask, cluster_mask)) > + core_mask = cluster_mask; > + Either works for me. I felt the version I sent was parallel to the existing implementation, but have no preference either way. 
> 	return core_mask;
> }
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>

Thanks for the review Dietmar.
On 08/03/2022 18:49, Darren Hart wrote:
> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
>>
>> This is what I get on my Ampere Altra (I guess I don't have the ACPI
>> changes which would let to a CLS sched domain):
>>
>> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> DIE
>> NUMA
>> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
>> CONFIG_SCHED_CLUSTER=y
>
> I'd like to follow up on this. Would you share your dmidecode BIOS
> Information section?

# dmidecode -t 0
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
	Vendor: Ampere(TM)
	Version: 0.9.20200724
	Release Date: 2020/07/24
	ROM Size: 7680 kB
	Characteristics:
		PCI is supported
		BIOS is upgradeable
		Boot from CD is supported
		Selectable boot is supported
		ACPI is supported
		UEFI is supported
	BIOS Revision: 5.15
	Firmware Revision: 0.6

> Which kernel version?

v5.17-rc5

[...]

>>> I would not say that I'm happy because this solution skews the core
>>> cpu mask in order to abuse the scheduler so that it will remove a
>>> wrong but useless level when it will build its domains.
>>> But this works so as long as the maintainer are happy, I'm fine
>
> I did explore the other options and they added considerably more
> complexity without much benefit in my view. I prefer this option which
> maintains the cpu_topology as described by the platform, and maps it
> into something that suits the current scheduler abstraction. I agree
> there is more work to be done here and intend to continue with it.
>
>> I do not have any better idea than this tweak here either in case the
>> platform can't provide a cleaner setup.
>
> I'd argue The platform is describing itself accurately in ACPI PPTT
> terms. The topology doesn't fit nicely within the kernel abstractions
> today. This is an area where I hope to continue to improve things going
> forward.

I see. And I assume lying about SCU/LLC boundaries in ACPI is not an option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/.

[...]
On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote: > On 08/03/2022 18:49, Darren Hart wrote: > > On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote: > >> On 08/03/2022 12:04, Vincent Guittot wrote: > >>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote: > > [...] > > >> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway. > >> > >> This is what I get on my Ampere Altra (I guess I don't have the ACPI > >> changes which would let to a CLS sched domain): > >> > >> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name > >> DIE > >> NUMA > >> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER > >> CONFIG_SCHED_CLUSTER=y > > > > I'd like to follow up on this. Would you share your dmidecode BIOS > > Information section? > > # dmidecode -t 0 > # dmidecode 3.2 > Getting SMBIOS data from sysfs. > SMBIOS 3.2.0 present. > > Handle 0x0000, DMI type 0, 26 bytes > BIOS Information > Vendor: Ampere(TM) > Version: 0.9.20200724 > Release Date: 2020/07/24 > ROM Size: 7680 kB > Characteristics: > PCI is supported > BIOS is upgradeable > Boot from CD is supported > Selectable boot is supported > ACPI is supported > UEFI is supported > BIOS Revision: 5.15 > Firmware Revision: 0.6 > Thank you, I'm following internally and will get with you. > > Which kernel version? > > v5.17-rc5 > > [...] > > >>> I would not say that I'm happy because this solution skews the core > >>> cpu mask in order to abuse the scheduler so that it will remove a > >>> wrong but useless level when it will build its domains. > >>> But this works so as long as the maintainer are happy, I'm fine > > > > I did explore the other options and they added considerably more > > complexity without much benefit in my view. I prefer this option which > > maintains the cpu_topology as described by the platform, and maps it > > into something that suits the current scheduler abstraction. I agree > > there is more work to be done here and intend to continue with it. > > > >> I do not have any better idea than this tweak here either in case the > >> platform can't provide a cleaner setup. > > > > I'd argue The platform is describing itself accurately in ACPI PPTT > > terms. The topology doesn't fit nicely within the kernel abstractions > > today. This is an area where I hope to continue to improve things going > > forward. > > I see. And I assume lying about SCU/LLC boundaries in ACPI is not an > option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/. > > [...] I'm not aware of a way to accurately describe the SCU topology in the PPTT, and the risk we run with lying about LLC topology is that lie has to be comprehended by all OSes and not conflict with other lies people may ask for. In general, I think it is preferable and more maintainable to describe the topology as accurately and honestly as we can within the existing platform mechanisms (PPTT, HMAT, etc) and work on the higher level abstractions to accommodate a broader set of topologies as they emerge (as well as working to more fully describe the topology with new platform level mechanisms as needed). As I mentioned, I intend to continue looking in to how to improve the current abstractions. For now, it sounds like we have agreement that this patch can be merged to address the BUG? Thanks all,
On 09/03/2022 19:26, Darren Hart wrote:
> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>> On 08/03/2022 18:49, Darren Hart wrote:
>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:
>>
>> [...]
>>
>>>> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway.
>>>>
>>>> This is what I get on my Ampere Altra (I guess I don't have the ACPI
>>>> changes which would let to a CLS sched domain):
>>>>
>>>> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>>>> DIE
>>>> NUMA
>>>> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER
>>>> CONFIG_SCHED_CLUSTER=y
>>>
>>> I'd like to follow up on this. Would you share your dmidecode BIOS
>>> Information section?
>>
>> # dmidecode -t 0
>> # dmidecode 3.2
>> Getting SMBIOS data from sysfs.
>> SMBIOS 3.2.0 present.
>>
>> Handle 0x0000, DMI type 0, 26 bytes
>> BIOS Information
>> 	Vendor: Ampere(TM)
>> 	Version: 0.9.20200724
>> 	Release Date: 2020/07/24
>> 	ROM Size: 7680 kB
>> 	Characteristics:
>> 		PCI is supported
>> 		BIOS is upgradeable
>> 		Boot from CD is supported
>> 		Selectable boot is supported
>> 		ACPI is supported
>> 		UEFI is supported
>> 	BIOS Revision: 5.15
>> 	Firmware Revision: 0.6
>>
>
> Thank you, I'm following internally and will get with you.

Looks like in my PPTT, the `Processor Hierarchy Nodes` which represent cluster nodes have no valid `ACPI Processor ID`.

Example for CPU0:

cpu_node:

[1B9Ch 7068   1]                Subtable Type : 00 [Processor Hierarchy Node]
[1B9Dh 7069   1]                       Length : 1C
[1B9Eh 7070   2]                     Reserved : 0000
[1BA0h 7072   4]        Flags (decoded below) : 0000001A
                             Physical package : 0
                      ACPI Processor ID valid : 1 <-- valid !!!
                        Processor is a thread : 0
                               Node is a leaf : 1
                    Identical Implementation : 1
[1BA4h 7076   4]                       Parent : 00001B88 <-- parent !!!
[1BA8h 7080   4]            ACPI Processor ID : 00001200
[1BACh 7084   4]      Private Resource Number : 00000002
[1BB0h 7088   4]             Private Resource : 00001B58
[1BB4h 7092   4]             Private Resource : 00001B70

cluster_node (cpu_node->parent):

[1B88h 7048   1]                Subtable Type : 00 [Processor Hierarchy Node]
[1B89h 7049   1]                       Length : 14
[1B8Ah 7050   2]                     Reserved : 0000
[1B8Ch 7052   4]        Flags (decoded below) : 00000010
                             Physical package : 0
                      ACPI Processor ID valid : 0 <-- not valid !!!
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        Processor is a thread : 0
                               Node is a leaf : 0
                    Identical Implementation : 1
[1B90h 7056   4]                       Parent : 000001C8
[1B94h 7060   4]            ACPI Processor ID : 00000000
[1B98h 7064   4]      Private Resource Number : 00000000

The code which checks this is:

int find_acpi_cpu_topology_cluster(unsigned int cpu)
{
	....

	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
		retval = cluster_node->acpi_processor_id;
	else
		retval = ACPI_PTR_DIFF(cluster_node, table);

The else path just returns distinct values for each CPU, so there is no sub-grouping of CPUs which can lead to a CLS SD.
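For reference, the two flag words in the dump above decode with the bit positions used by the kernel's ACPI_PPTT_* defines (bit 1 is ACPI_PPTT_ACPI_PROCESSOR_ID_VALID). A minimal standalone decoder, for illustration only:

#include <stdio.h>

#define PPTT_PHYSICAL_PACKAGE    (1u << 0)
#define PPTT_PROCESSOR_ID_VALID  (1u << 1)	/* ACPI_PPTT_ACPI_PROCESSOR_ID_VALID */
#define PPTT_IS_THREAD           (1u << 2)
#define PPTT_LEAF_NODE           (1u << 3)
#define PPTT_IDENTICAL           (1u << 4)

static void decode(const char *name, unsigned int flags)
{
	printf("%s (flags %#010x):\n", name, flags);
	printf("  ACPI Processor ID valid : %u\n", !!(flags & PPTT_PROCESSOR_ID_VALID));
	printf("  Node is a leaf          : %u\n", !!(flags & PPTT_LEAF_NODE));
}

int main(void)
{
	decode("cpu_node", 0x1a);	/* valid = 1, leaf = 1 */
	decode("cluster_node", 0x10);	/* valid = 0 -> ACPI_PTR_DIFF() fallback */
	return 0;
}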
On 09/03/2022 19:26, Darren Hart wrote:
> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>> On 08/03/2022 18:49, Darren Hart wrote:
>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>>>> I do not have any better idea than this tweak here either in case the
>>>> platform can't provide a cleaner setup.
>>>
>>> I'd argue The platform is describing itself accurately in ACPI PPTT
>>> terms. The topology doesn't fit nicely within the kernel abstractions
>>> today. This is an area where I hope to continue to improve things going
>>> forward.
>>
>> I see. And I assume lying about SCU/LLC boundaries in ACPI is not an
>> option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/.
>>
>> [...]
>
> I'm not aware of a way to accurately describe the SCU topology in the PPTT, and
> the risk we run with lying about LLC topology is that lie has to be comprehended
> by all OSes and not conflict with other lies people may ask for. In general, I
> think it is preferable and more maintainable to describe the topology as
> accurately and honestly as we can within the existing platform mechanisms (PPTT,
> HMAT, etc) and work on the higher level abstractions to accommodate a broader
> set of topologies as they emerge (as well as working to more fully describe the
> topology with new platform level mechanisms as needed).
>
> As I mentioned, I intend to continue looking in to how to improve the current
> abstractions. For now, it sounds like we have agreement that this patch can be
> merged to address the BUG?

What about swapping the CLS and MC cpumasks for such a machine? This would avoid the task scheduler having to deal with a system which has CLS but no MC. We essentially promote the CLS cpumask up to MC in this case.

cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
^^
DIE
NUMA

cat /sys/kernel/debug/sched/domains/cpu0# cat domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
                                                   ^^^^^^^^^^^^^^^^^^^^^^
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

Only very lightly tested on Altra and Juno-r0 (DT).

--->8---

From 54bef59e7f50fa41b7ae39190fd71af57209c27d Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 14 Mar 2022 15:08:23 +0000
Subject: [PATCH] arch_topology: Swap MC & CLS SD mask if MC weight==1 &
 subset(MC,CLS)

This avoids the issue of having a system with a CLS SD but no MC SD.
CLS should be sub-SD of MC.

The cpumasks under /sys/devices/system/cpu/cpu*/cache/index* and
/sys/devices/system/cpu/cpu*/topology are not changed by this.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 drivers/base/arch_topology.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 976154140f0b..9af90a5625c7 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -614,7 +614,7 @@ static int __init parse_dt_topology(void)
 struct cpu_topology cpu_topology[NR_CPUS];
 EXPORT_SYMBOL_GPL(cpu_topology);
 
-const struct cpumask *cpu_coregroup_mask(int cpu)
+const struct cpumask *_cpu_coregroup_mask(int cpu)
 {
 	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
 
@@ -631,11 +631,37 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
-const struct cpumask *cpu_clustergroup_mask(int cpu)
+const struct cpumask *_cpu_clustergroup_mask(int cpu)
 {
 	return &cpu_topology[cpu].cluster_sibling;
 }
 
+static int
+swap_masks(const cpumask_t *core_mask, const cpumask_t *cluster_mask)
+{
+	if (cpumask_weight(core_mask) == 1 &&
+	    cpumask_subset(core_mask, cluster_mask))
+		return 1;
+
+	return 0;
+}
+
+const struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu);
+	const cpumask_t *core_mask = _cpu_coregroup_mask(cpu);
+
+	return swap_masks(core_mask, cluster_mask) ? cluster_mask : core_mask;
+}
+
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu);
+	const cpumask_t *core_mask = _cpu_coregroup_mask(cpu);
+
+	return swap_masks(core_mask, cluster_mask) ? core_mask : cluster_mask;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
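A standalone model may make the effect of the proposed swap easier to see. The sketch below is an illustration only: it uses plain 64-bit words instead of cpumask_t (and GCC/Clang popcount builtins), with Altra-like weights where a weight-1 core mask is contained in a weight-2 cluster mask, so the two getters trade their return values and the scheduler would see an MC level where it previously saw CLS:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t mask_t;	/* stand-in for cpumask_t */

/* mirrors the swap condition in the patch above */
static int swap_masks(mask_t core, mask_t cluster)
{
	return __builtin_popcountll(core) == 1 && (core & cluster) == core;
}

static mask_t coregroup_mask(mask_t core, mask_t cluster)
{
	return swap_masks(core, cluster) ? cluster : core;
}

static mask_t clustergroup_mask(mask_t core, mask_t cluster)
{
	return swap_masks(core, cluster) ? core : cluster;
}

int main(void)
{
	mask_t core = 0x1, cluster = 0x3;	/* CPU0 on an Altra-like system */

	/* prints MC = 0x3 (the old CLS span) and CLS = 0x1 */
	printf("MC  mask: %#llx\n", (unsigned long long)coregroup_mask(core, cluster));
	printf("CLS mask: %#llx\n", (unsigned long long)clustergroup_mask(core, cluster));
	return 0;
}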
On Mon, Mar 14, 2022 at 05:35:05PM +0100, Dietmar Eggemann wrote: > On 09/03/2022 19:26, Darren Hart wrote: > > On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote: > >> On 08/03/2022 18:49, Darren Hart wrote: > >>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote: > >>>> On 08/03/2022 12:04, Vincent Guittot wrote: > >>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote: > > [...] > > >>>> I do not have any better idea than this tweak here either in case the > >>>> platform can't provide a cleaner setup. > >>> > >>> I'd argue The platform is describing itself accurately in ACPI PPTT > >>> terms. The topology doesn't fit nicely within the kernel abstractions > >>> today. This is an area where I hope to continue to improve things going > >>> forward. > >> > >> I see. And I assume lying about SCU/LLC boundaries in ACPI is not an > >> option since it messes up /sys/devices/system/cpu/cpu0/cache/index*/. > >> > >> [...] > > > > I'm not aware of a way to accurately describe the SCU topology in the PPTT, and > > the risk we run with lying about LLC topology is that lie has to be comprehended > > by all OSes and not conflict with other lies people may ask for. In general, I > > think it is preferable and more maintainable to describe the topology as > > accurately and honestly as we can within the existing platform mechanisms (PPTT, > > HMAT, etc) and work on the higher level abstractions to accommodate a broader > > set of topologies as they emerge (as well as working to more fully describe the > > topology with new platform level mechanisms as needed). > > > > As I mentioned, I intend to continue looking in to how to improve the current > > abstractions. For now, it sounds like we have agreement that this patch can be > > merged to address the BUG? > > What about swapping the CLS and MC cpumasks for such a machine? This > would avoid that the task scheduler has to deal with a system which has > CLS but no MC. We essentially promote the CLS cpumask up to MC in this > case. > > cat /sys/kernel/debug/sched/domains/cpu0/domain*/name > MC > ^^ > DIE > NUMA > > cat /sys/kernel/debug/sched/domains/cpu0# cat domain*/flags > SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING > ^^^^^^^^^^^^^^^^^^^^^^ > SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING > SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA > > Only very lightly tested on Altra and Juno-r0 (DT). > > --->8--- > > From 54bef59e7f50fa41b7ae39190fd71af57209c27d Mon Sep 17 00:00:00 2001 > From: Dietmar Eggemann <dietmar.eggemann@arm.com> > Date: Mon, 14 Mar 2022 15:08:23 +0000 > Subject: [PATCH] arch_topology: Swap MC & CLS SD mask if MC weight==1 & > subset(MC,CLS) > > This avoids the issue of having a system with a CLS SD but no MC SD. > CLS should be sub-SD of MC. Hi Dietmar, Ultimately, this delivers the same result. I do think it imposes more complexity for everyone to address what as far as I'm aware only affect the one system. I don't think the term "Cluster" has a clear and universally understood definition, so I don't think it's a given that "CLS should be sub-SD of MC". I think this has been assumed, and that assumption has mostly held up, but this is an abstraction, and the abstraction should follow the physical topologies rather than the other way around in my opinion. 
If that's the primary motivation for this approach, I don't think it justifies the additional complexity. All told, I prefer the 2 line change contained within cpu_coregroup_mask() which handles the one known exception with minimal impact. It's easy enough to come back to this to address more cases with a more complex solution if needed in the future - but I prefer to introduce the least amount of complexity as possible to address the known issues, especially if the end result is the same and the cost is paid by the affected systems. Thanks, > > The cpumask under /sys/devices/system/cpu/cpu*/cache/index* and > /sys/devices/system/cpu/cpu*/topology are not changed by this. > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > --- > drivers/base/arch_topology.c | 30 ++++++++++++++++++++++++++++-- > 1 file changed, 28 insertions(+), 2 deletions(-) > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > index 976154140f0b..9af90a5625c7 100644 > --- a/drivers/base/arch_topology.c > +++ b/drivers/base/arch_topology.c > @@ -614,7 +614,7 @@ static int __init parse_dt_topology(void) > struct cpu_topology cpu_topology[NR_CPUS]; > EXPORT_SYMBOL_GPL(cpu_topology); > > -const struct cpumask *cpu_coregroup_mask(int cpu) > +const struct cpumask *_cpu_coregroup_mask(int cpu) > { > const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu)); > > @@ -631,11 +631,37 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > return core_mask; > } > > -const struct cpumask *cpu_clustergroup_mask(int cpu) > +const struct cpumask *_cpu_clustergroup_mask(int cpu) > { > return &cpu_topology[cpu].cluster_sibling; > } > > +static int > +swap_masks(const cpumask_t *core_mask, const cpumask_t *cluster_mask) > +{ > + if (cpumask_weight(core_mask) == 1 && > + cpumask_subset(core_mask, cluster_mask)) > + return 1; > + > + return 0; > +} > + > +const struct cpumask *cpu_coregroup_mask(int cpu) > +{ > + const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu); > + const cpumask_t *core_mask = _cpu_coregroup_mask(cpu); > + > + return swap_masks(core_mask, cluster_mask) ? cluster_mask : core_mask; > +} > + > +const struct cpumask *cpu_clustergroup_mask(int cpu) > +{ > + const cpumask_t *cluster_mask = _cpu_clustergroup_mask(cpu); > + const cpumask_t *core_mask = _cpu_coregroup_mask(cpu); > + > + return swap_masks(core_mask, cluster_mask) ? core_mask : cluster_mask; > +} > + > void update_siblings_masks(unsigned int cpuid) > { > struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; > -- > 2.25.1
On Mon, Mar 14, 2022 at 10:37:21AM +0100, Dietmar Eggemann wrote: > On 09/03/2022 19:26, Darren Hart wrote: > > On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote: > >> On 08/03/2022 18:49, Darren Hart wrote: > >>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote: > >>>> On 08/03/2022 12:04, Vincent Guittot wrote: > >>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote: > >> > >> [...] > >> > >>>> IMHO, if core_mask weight is 1, MC will be removed/degenerated anyway. > >>>> > >>>> This is what I get on my Ampere Altra (I guess I don't have the ACPI > >>>> changes which would let to a CLS sched domain): > >>>> > >>>> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name > >>>> DIE > >>>> NUMA > >>>> root@oss-altra01:~# zcat /proc/config.gz | grep SCHED_CLUSTER > >>>> CONFIG_SCHED_CLUSTER=y > >>> > >>> I'd like to follow up on this. Would you share your dmidecode BIOS > >>> Information section? > >> > >> # dmidecode -t 0 > >> # dmidecode 3.2 > >> Getting SMBIOS data from sysfs. > >> SMBIOS 3.2.0 present. > >> > >> Handle 0x0000, DMI type 0, 26 bytes > >> BIOS Information > >> Vendor: Ampere(TM) > >> Version: 0.9.20200724 > >> Release Date: 2020/07/24 > >> ROM Size: 7680 kB > >> Characteristics: > >> PCI is supported > >> BIOS is upgradeable > >> Boot from CD is supported > >> Selectable boot is supported > >> ACPI is supported > >> UEFI is supported > >> BIOS Revision: 5.15 > >> Firmware Revision: 0.6 > >> > > > > Thank you, I'm following internally and will get with you. > > Looks like in my PPTT, the `Processor Hierarchy Nodes` which represents > cluster nodes have no valid `ACPI Processor ID`. Thanks, I'm looking for the right person to get us the latest information available this. Will follow up once I have.
Hi Dietmar,

I love your patch! Perhaps something to improve:

[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v5.17-rc8 next-20220310]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4a248f85b3dd8e010ff8335755c927130e9b0764
config: arm-defconfig (https://download.01.org/0day-ci/archive/20220315/202203150553.QRvgHFHm-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7528fb2ea1e30038ee1dcc48df9d413502977895
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
        git checkout 7528fb2ea1e30038ee1dcc48df9d413502977895
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm SHELL=/bin/bash drivers/base/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/base/arch_topology.c:617:23: warning: no previous prototype for '_cpu_coregroup_mask' [-Wmissing-prototypes]
     617 | const struct cpumask *_cpu_coregroup_mask(int cpu)
         |                       ^~~~~~~~~~~~~~~~~~~
>> drivers/base/arch_topology.c:634:23: warning: no previous prototype for '_cpu_clustergroup_mask' [-Wmissing-prototypes]
     634 | const struct cpumask *_cpu_clustergroup_mask(int cpu)
         |                       ^~~~~~~~~~~~~~~~~~~~~~

vim +/_cpu_coregroup_mask +617 drivers/base/arch_topology.c

   616	
 > 617	const struct cpumask *_cpu_coregroup_mask(int cpu)
   618	{
   619		const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
   620	
   621		/* Find the smaller of NUMA, core or LLC siblings */
   622		if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
   623			/* not numa in package, lets use the package siblings */
   624			core_mask = &cpu_topology[cpu].core_sibling;
   625		}
   626		if (cpu_topology[cpu].llc_id != -1) {
   627			if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
   628				core_mask = &cpu_topology[cpu].llc_sibling;
   629		}
   630	
   631		return core_mask;
   632	}
   633	
 > 634	const struct cpumask *_cpu_clustergroup_mask(int cpu)
   635	{
   636		return &cpu_topology[cpu].cluster_sibling;
   637	}
   638	

---
0-DAY CI Kernel Test Service
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi Dietmar,

I love your patch! Perhaps something to improve:

[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v5.17-rc8 next-20220310]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4a248f85b3dd8e010ff8335755c927130e9b0764
config: riscv-randconfig-r042-20220314 (https://download.01.org/0day-ci/archive/20220315/202203150659.T51bnDgz-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 3e4950d7fa78ac83f33bbf1658e2f49a73719236)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/0day-ci/linux/commit/7528fb2ea1e30038ee1dcc48df9d413502977895
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Dietmar-Eggemann/arch_topology-Swap-MC-CLS-SD-mask-if-MC-weight-1/20220315-004742
        git checkout 7528fb2ea1e30038ee1dcc48df9d413502977895
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash drivers/base/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/base/arch_topology.c:617:23: warning: no previous prototype for function '_cpu_coregroup_mask' [-Wmissing-prototypes]
   const struct cpumask *_cpu_coregroup_mask(int cpu)
                         ^
   drivers/base/arch_topology.c:617:7: note: declare 'static' if the function is not intended to be used outside of this translation unit
   const struct cpumask *_cpu_coregroup_mask(int cpu)
   ^
   static
>> drivers/base/arch_topology.c:634:23: warning: no previous prototype for function '_cpu_clustergroup_mask' [-Wmissing-prototypes]
   const struct cpumask *_cpu_clustergroup_mask(int cpu)
                         ^
   drivers/base/arch_topology.c:634:7: note: declare 'static' if the function is not intended to be used outside of this translation unit
   const struct cpumask *_cpu_clustergroup_mask(int cpu)
   ^
   static
   2 warnings generated.

vim +/_cpu_coregroup_mask +617 drivers/base/arch_topology.c

   616	
 > 617	const struct cpumask *_cpu_coregroup_mask(int cpu)
   618	{
   619		const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
   620	
   621		/* Find the smaller of NUMA, core or LLC siblings */
   622		if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
   623			/* not numa in package, lets use the package siblings */
   624			core_mask = &cpu_topology[cpu].core_sibling;
   625		}
   626		if (cpu_topology[cpu].llc_id != -1) {
   627			if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
   628				core_mask = &cpu_topology[cpu].llc_sibling;
   629		}
   630	
   631		return core_mask;
   632	}
   633	
 > 634	const struct cpumask *_cpu_clustergroup_mask(int cpu)
   635	{
   636		return &cpu_topology[cpu].cluster_sibling;
   637	}
   638	

---
0-DAY CI Kernel Test Service
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On 14/03/2022 17:56, Darren Hart wrote:
> On Mon, Mar 14, 2022 at 10:37:21AM +0100, Dietmar Eggemann wrote:
>> On 09/03/2022 19:26, Darren Hart wrote:
>>> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 18:49, Darren Hart wrote:
>>>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

>>> Thank you, I'm following internally and will get with you.
>>
>> Looks like in my PPTT, the `Processor Hierarchy Nodes` which represents
>> cluster nodes have no valid `ACPI Processor ID`.
>
> Thanks, I'm looking for the right person to get us the latest information
> available this. Will follow up once I have.

We're installing the new FW on our machine right now. Should see the CLS SD level w/o hacking find_acpi_cpu_topology_cluster() then ;-)
- Barry Song <song.bao.hua@hisilicon.com> (always get undelivered mail returned to sender)
+ Barry Song <21cnbao@gmail.com>

On 14/03/2022 17:54, Darren Hart wrote:
> On Mon, Mar 14, 2022 at 05:35:05PM +0100, Dietmar Eggemann wrote:
>> On 09/03/2022 19:26, Darren Hart wrote:
>>> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote:
>>>> On 08/03/2022 18:49, Darren Hart wrote:
>>>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote:
>>>>>> On 08/03/2022 12:04, Vincent Guittot wrote:
>>>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote:

[...]

> Ultimately, this delivers the same result. I do think it imposes more complexity
> for everyone to address what as far as I'm aware only affect the one system.
>
> I don't think the term "Cluster" has a clear and universally understood
> definition, so I don't think it's a given that "CLS should be sub-SD of MC". I

I agree, the term 'cluster' is overloaded, but default_topology[] clearly says so (with direction up meaning smaller SD spans):

#ifdef CONFIG_SCHED_CLUSTER
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif

#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif

In ACPI code we have `cluster_node = fetch_pptt_node(... , cpu_node->parent)`, but then the cache information (via llc_id/llc_sibling) can change things, which makes this less easy to grasp.

> think this has been assumed, and that assumption has mostly held up, but this is
> an abstraction, and the abstraction should follow the physical topologies rather
> than the other way around in my opinion. If that's the primary motivation for
> this approach, I don't think it justifies the additional complexity.
>
> All told, I prefer the 2 line change contained within cpu_coregroup_mask() which
> handles the one known exception with minimal impact. It's easy enough to come
> back to this to address more cases with a more complex solution if needed in the
> future - but I prefer to introduce the least amount of complexity as possible to
> address the known issues, especially if the end result is the same and the cost
> is paid by the affected systems.
>
> Thanks,

Yeah, I can see your point. It's the smaller hack. My solution just prevents us from manipulating the coregroup mask only to get the MC layer degenerated by the core topology code. But people might say that's a clever thing to do here. So I'm fine with your original solution as well.

[...]
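The ordering contract in default_topology[] and the degeneration mentioned earlier in the thread can be illustrated with a toy walk over such a level table. This is an illustration with made-up masks, not the scheduler's actual builder: a level whose span duplicates the surviving child level is dropped, which is how the extended MC level ends up removed as redundant with CLS:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t mask_t;

static mask_t cls_mask(int cpu) { return (mask_t)3 << (cpu & ~1); }
static mask_t mc_mask(int cpu)  { return cls_mask(cpu); } /* after the fix */
static mask_t die_mask(int cpu) { (void)cpu; return 0xffff; } /* 16-CPU toy socket */

struct topo_level {
	mask_t (*mask)(int cpu);
	const char *name;
};

static const struct topo_level topology[] = {
	{ cls_mask, "CLS" },
	{ mc_mask,  "MC"  },
	{ die_mask, "DIE" },
};

int main(void)
{
	mask_t prev = 0;

	for (unsigned int i = 0; i < sizeof(topology) / sizeof(topology[0]); i++) {
		mask_t span = topology[i].mask(0);

		if (span == prev) {	/* same span as the child: degenerate */
			printf("%s: degenerated\n", topology[i].name);
			continue;
		}
		printf("%s: %#llx\n", topology[i].name, (unsigned long long)span);
		prev = span;
	}
	return 0;	/* prints: CLS: 0x3, MC: degenerated, DIE: 0xffff */
}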
On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote: > - Barry Song <song.bao.hua@hisilicon.com> (always get undelivered mail > returned to sender) > + Barry Song <21cnbao@gmail.com> > > On 14/03/2022 17:54, Darren Hart wrote: > > On Mon, Mar 14, 2022 at 05:35:05PM +0100, Dietmar Eggemann wrote: > >> On 09/03/2022 19:26, Darren Hart wrote: > >>> On Wed, Mar 09, 2022 at 01:50:07PM +0100, Dietmar Eggemann wrote: > >>>> On 08/03/2022 18:49, Darren Hart wrote: > >>>>> On Tue, Mar 08, 2022 at 05:03:07PM +0100, Dietmar Eggemann wrote: > >>>>>> On 08/03/2022 12:04, Vincent Guittot wrote: > >>>>>>> On Tue, 8 Mar 2022 at 11:30, Will Deacon <will@kernel.org> wrote: > > [...] > > > Ultimately, this delivers the same result. I do think it imposes more complexity > > for everyone to address what as far as I'm aware only affect the one system. > > > > I don't think the term "Cluster" has a clear and universally understood > > definition, so I don't think it's a given that "CLS should be sub-SD of MC". I > > I agree, the term 'cluster' is overloaded but default_topology[] clearly > says (with direction up means smaller SD spans). > > #ifdef CONFIG_SCHED_CLUSTER > { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) }, > #endif > > #ifdef CONFIG_SCHED_MC > { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, > #endif > Right, understood. It is a clear expectation of the current Sched Domain topology abstraction. > In ACPI code we have `cluster_node = fetch_pptt_node(... , > cpu_node->parent) but then the cache information (via > llc_id/llc_sibling) can change things which make this less easy to grasp. > > > think this has been assumed, and that assumption has mostly held up, but this is > > an abstraction, and the abstraction should follow the physical topologies rather > > than the other way around in my opinion. If that's the primary motivation for > > this approach, I don't think it justifies the additional complexity. > > > > All told, I prefer the 2 line change contained within cpu_coregroup_mask() which > > handles the one known exception with minimal impact. It's easy enough to come > > back to this to address more cases with a more complex solution if needed in the > > future - but I prefer to introduce the least amount of complexity as possible to > > address the known issues, especially if the end result is the same and the cost > > is paid by the affected systems. > > > > Thanks, > > Yeah, I can see your point. It's the smaller hack. My solution just > prevents us to manipulate the coregroup mask only to get the MC layer > degenerated by the core topology code. But people might say that's a > clever thing to do here. So I'm fine with your original solution as well. > > [...] Thanks Dietmar, Sudeep, do we have sufficient consensus to pull in this patch? Thanks,
On Wed, Mar 16, 2022 at 08:20:32AM -0700, Darren Hart wrote: > On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote: [...] > > > > Yeah, I can see your point. It's the smaller hack. My solution just > > prevents us to manipulate the coregroup mask only to get the MC layer > > degenerated by the core topology code. But people might say that's a > > clever thing to do here. So I'm fine with your original solution as well. > > > > [...] > > Thanks Dietmar, > > Sudeep, do we have sufficient consensus to pull in this patch? Indeed. I have already Acked, and sure after all these discussions we have concluded that this is the best we can do though not matches everyone's taste. Greg or Will(not sure why he had asked since v3 doesn't touch arm64), Can one of you pick this patch ? -- Regards, Sudeep
On Sat, Mar 5, 2022 at 8:54 AM Darren Hart <darren@os.amperecomputing.com> wrote: > > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop > Control Unit, but have no shared CPU-side last level cache. > > cpu_coregroup_mask() will return a cpumask with weight 1, while > cpu_clustergroup_mask() will return a cpumask with weight 2. > > As a result, build_sched_domain() will BUG() once per CPU with: > > BUG: arch topology borken > the CLS domain not a subset of the MC domain > > The MC level cpumask is then extended to that of the CLS child, and is > later removed entirely as redundant. This sched domain topology is an > improvement over previous topologies, or those built without > SCHED_CLUSTER, particularly for certain latency sensitive workloads. > With the current scheduler model and heuristics, this is a desirable > default topology for Ampere Altra and Altra Max system. > > Rather than create a custom sched domains topology structure and > introduce new logic in arch/arm64 to detect these systems, update the > core_mask so coregroup is never a subset of clustergroup, extending it > to cluster_siblings if necessary. > > This has the added benefit over a custom topology of working for both > symmetric and asymmetric topologies. It does not address systems where > the cluster topology is above a populated mc topology, but these are not > considered today and can be addressed separately if and when they > appear. > > The final sched domain topology for a 2 socket Ampere Altra system is > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided: > > For CPU0: > > CONFIG_SCHED_CLUSTER=y > CLS [0-1] > DIE [0-79] > NUMA [0-159] > > CONFIG_SCHED_CLUSTER is not set > DIE [0-79] > NUMA [0-159] > > Cc: Sudeep Holla <sudeep.holla@arm.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Barry Song <song.bao.hua@hisilicon.com> > Cc: Valentin Schneider <valentin.schneider@arm.com> > Cc: D. Scott Phillips <scott@os.amperecomputing.com> > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com> > Cc: <stable@vger.kernel.org> # 5.16.x > Suggested-by: Barry Song <song.bao.hua@hisilicon.com> > Signed-off-by: Darren Hart <darren@os.amperecomputing.com> Reviewed-by: Barry Song <baohua@kernel.org> > --- > v1: Drop MC level if coregroup weight == 1 > v2: New sd topo in arch/arm64/kernel/smp.c > v3: No new topo, extend core_mask to cluster_siblings > > drivers/base/arch_topology.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > index 976154140f0b..a96f45db928b 100644 > --- a/drivers/base/arch_topology.c > +++ b/drivers/base/arch_topology.c > @@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > core_mask = &cpu_topology[cpu].llc_sibling; > } > > + /* > + * For systems with no shared cpu-side LLC but with clusters defined, > + * extend core_mask to cluster_siblings. The sched domain builder will > + * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled. > + */ > + if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling)) > + core_mask = &cpu_topology[cpu].cluster_sibling; > + > return core_mask; > } > > -- > 2.31.1 > Thanks Barry
On Wed, Mar 16, 2022 at 03:55:41PM +0000, Sudeep Holla wrote: > On Wed, Mar 16, 2022 at 08:20:32AM -0700, Darren Hart wrote: > > On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote: > > [...] > > > > > > > Yeah, I can see your point. It's the smaller hack. My solution just > > > prevents us to manipulate the coregroup mask only to get the MC layer > > > degenerated by the core topology code. But people might say that's a > > > clever thing to do here. So I'm fine with your original solution as well. > > > > > > [...] > > > > Thanks Dietmar, > > > > Sudeep, do we have sufficient consensus to pull in this patch? > > Indeed. I have already Acked, and sure after all these discussions we have > concluded that this is the best we can do though not matches everyone's taste. > > Greg or Will(not sure why he had asked since v3 doesn't touch arm64), > Can one of you pick this patch ? Right, this doesn't touch arm64 any more so I don't think I'm the right person to queue it. Will
On Mon, Mar 21, 2022 at 02:30:22PM +0000, Will Deacon wrote: > On Wed, Mar 16, 2022 at 03:55:41PM +0000, Sudeep Holla wrote: > > On Wed, Mar 16, 2022 at 08:20:32AM -0700, Darren Hart wrote: > > > On Wed, Mar 16, 2022 at 03:48:50PM +0100, Dietmar Eggemann wrote: > > > > [...] > > > > > > > > > > Yeah, I can see your point. It's the smaller hack. My solution just > > > > prevents us to manipulate the coregroup mask only to get the MC layer > > > > degenerated by the core topology code. But people might say that's a > > > > clever thing to do here. So I'm fine with your original solution as well. > > > > > > > > [...] > > > > > > Thanks Dietmar, > > > > > > Sudeep, do we have sufficient consensus to pull in this patch? > > > > Indeed. I have already Acked, and sure after all these discussions we have > > concluded that this is the best we can do though not matches everyone's taste. > > > > Greg or Will(not sure why he had asked since v3 doesn't touch arm64), > > Can one of you pick this patch ? > > Right, this doesn't touch arm64 any more so I don't think I'm the right > person to queue it. It's too late for 5.18-rc1. Please rebase and resend it after that is out and I will be glad to queue it up. thanks, greg k-h
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 976154140f0b..a96f45db928b 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -628,6 +628,14 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 		core_mask = &cpu_topology[cpu].llc_sibling;
 	}
 
+	/*
+	 * For systems with no shared cpu-side LLC but with clusters defined,
+	 * extend core_mask to cluster_siblings. The sched domain builder will
+	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
+	 */
+	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
+		core_mask = &cpu_topology[cpu].cluster_sibling;
+
 	return core_mask;
 }
 
Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.

cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.

As a result, build_sched_domain() will BUG() once per CPU with:

BUG: arch topology borken
the CLS domain not a subset of the MC domain

The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max system.

Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary.

This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the cluster topology is above a populated mc topology, but these are not
considered today and can be addressed separately if and when they
appear.

The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:

For CPU0:

CONFIG_SCHED_CLUSTER=y
CLS  [0-1]
DIE  [0-79]
NUMA [0-159]

CONFIG_SCHED_CLUSTER is not set
DIE  [0-79]
NUMA [0-159]

Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
---
v1: Drop MC level if coregroup weight == 1
v2: New sd topo in arch/arm64/kernel/smp.c
v3: No new topo, extend core_mask to cluster_siblings

 drivers/base/arch_topology.c | 8 ++++++++
 1 file changed, 8 insertions(+)