diff mbox series

[4/4] selftests/resctrl: Adjust SNC support messages

Message ID 8a158ada92f06b97a4679721e84e787e94b94647.1709721159.git.maciej.wieczor-retman@intel.com (mailing list archive)
State New
Headers show
Series SNC support for resctrl selftests | expand

Commit Message

Maciej Wieczor-Retman March 6, 2024, 10:39 a.m. UTC
Resctrl selftest prints a message on test failure that Sub-Numa
Clustering (SNC) could be enabled and points the user to check theirs BIOS
settings. No actual check is performed before printing that message so
it is not very accurate in pinpointing a problem.

Figuring out if SNC is enabled is only one part of the problem, the
other being whether the kernel supports it. As there is no easy
interface that simply states SNC support in the kernel one can find that
information by comparing L3 cache sizes from different sources. Cache
size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size
will always show the full cache size even if it's split by enabled SNC.
On the other hand /sys/fs/resctrl/size has information about L3 size,
that with kernel support is adjusted for enabled SNC.

Add a function to find a cache size from /sys/fs/resctrl/size since
finding that information from the other source is already implemented.

Add a function that compares the two cache sizes and use it to make the
SNC support message more meaningful.

Add the SNC support message just after MBA's check_results() since MBA
shares code with MBM and also can suffer from enabled SNC if there is no
support in the kernel.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
 tools/testing/selftests/resctrl/cat_test.c  |  2 +-
 tools/testing/selftests/resctrl/cmt_test.c  |  6 +-
 tools/testing/selftests/resctrl/mba_test.c  |  2 +
 tools/testing/selftests/resctrl/mbm_test.c  |  4 +-
 tools/testing/selftests/resctrl/resctrl.h   |  5 +-
 tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++++++++++++++++-
 6 files changed, 79 insertions(+), 9 deletions(-)

Comments

Luck, Tony March 6, 2024, 9:54 p.m. UTC | #1
> Figuring out if SNC is enabled is only one part of the problem, the
> other being whether the kernel supports it. As there is no easy
> interface that simply states SNC support in the kernel one can find that
> information by comparing L3 cache sizes from different sources. Cache
> size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size
> will always show the full cache size even if it's split by enabled SNC.
> On the other hand /sys/fs/resctrl/size has information about L3 size,
> that with kernel support is adjusted for enabled SNC.

Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways
file to provide this information. I was talked out of it then:

https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/

But that discussion didn't consider the question you discuss here: "Does this
instance of the kernel support SNC?"

So you have a clever solution. But it seems like a roundabout way for
an application to discover whether the kernel has configured resctrl for
SNC mode.

Should the kernel provide an info/ file listing the SNC configuration?

If so, what should it be named? "snc_ways" as a kernel variable was
later replaced by "snc_nodes_per_l3_cache". Is that a good filename?

-Tony
Maciej Wieczor-Retman March 7, 2024, 9:25 a.m. UTC | #2
On 2024-03-06 at 21:54:02 +0000, Luck, Tony wrote:
>> Figuring out if SNC is enabled is only one part of the problem, the
>> other being whether the kernel supports it. As there is no easy
>> interface that simply states SNC support in the kernel one can find that
>> information by comparing L3 cache sizes from different sources. Cache
>> size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size
>> will always show the full cache size even if it's split by enabled SNC.
>> On the other hand /sys/fs/resctrl/size has information about L3 size,
>> that with kernel support is adjusted for enabled SNC.
>
>Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways
>file to provide this information. I was talked out of it then:
>
>https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/
>
>But that discussion didn't consider the question you discuss here: "Does this
>instance of the kernel support SNC?"
>
>So you have a clever solution. But it seems like a roundabout way for
>an application to discover whether the kernel has configured resctrl for
>SNC mode.
>
>Should the kernel provide an info/ file listing the SNC configuration?

I suppose it would be a benefit for other numa aware applications to have an
easy access to this kernel support information.

>
>If so, what should it be named? "snc_ways" as a kernel variable was
>later replaced by "snc_nodes_per_l3_cache". Is that a good filename?

"snc_nodes_per_l3_cache" seems okay to me.

And I understand that the file content would show SNC mode and the presence or
absence of this file would tell if kernel supports SNC?

>
>-Tony
Luck, Tony March 7, 2024, 5:18 p.m. UTC | #3
>>If so, what should it be named? "snc_ways" as a kernel variable was
>>later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
>
> "snc_nodes_per_l3_cache" seems okay to me.
>
> And I understand that the file content would show SNC mode and the presence or
> absence of this file would tell if kernel supports SNC?

Yes. The existence of the file indicates the kernel is SNC aware.

The value read from the file would give the number of nodes per L3 (obviously :-) )

SNC not supported by this platform or not enabled:

$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
1

SNC2 enabled:

$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
2

etc.

-Tony
Reinette Chatre March 7, 2024, 5:41 p.m. UTC | #4
Hi Tony,

On 3/7/2024 9:18 AM, Luck, Tony wrote:
>>> If so, what should it be named? "snc_ways" as a kernel variable was
>>> later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
>>
>> "snc_nodes_per_l3_cache" seems okay to me.
>>
>> And I understand that the file content would show SNC mode and the presence or
>> absence of this file would tell if kernel supports SNC?
> 
> Yes. The existence of the file indicates the kernel is SNC aware.
> 
> The value read from the file would give the number of nodes per L3 (obviously :-) )
> 
> SNC not supported by this platform or not enabled:
> 
> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
> 1
> 
> SNC2 enabled:
> 
> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
> 2
> 

This would be useful. I believe "SNC" is architecture specific?
What if the file always exists and is named "nodes_per_l3_cache"?

I assume that the internals of handling more nodes per L3 cache should
be hidden from user space and it should not be necessary for user space
to know if this is because of SNC or potentially some other mechanism on
another platform?

I think that may reduce fragmentation of resctrl .... not having
resctrl look so different on different architectures but maintains
the promise of a generic interface.

I am not sure if this is specific to monitoring though,
why not host file in /sys/fs/resctrl/info/L3 ? 

Reinette
Luck, Tony March 7, 2024, 5:57 p.m. UTC | #5
> > SNC2 enabled:
> >
> > $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
> > 2
> >
>
> This would be useful. I believe "SNC" is architecture specific?
> What if the file always exists and is named "nodes_per_l3_cache"?
>
> I assume that the internals of handling more nodes per L3 cache should
> be hidden from user space and it should not be necessary for user space
> to know if this is because of SNC or potentially some other mechanism on
> another platform?
>
> I think that may reduce fragmentation of resctrl .... not having
> resctrl look so different on different architectures but maintains
> the promise of a generic interface.
>
> I am not sure if this is specific to monitoring though,
> why not host file in /sys/fs/resctrl/info/L3 ?

Reinette,

On the name change - sure. It doesn't need the "snc_" prefix.

The Intel implementation of SNC has far more effect on monitoring
than on control. The user can read separate cache occupancy and
memory bandwidth values for each SNC node. But cache allocation
bitmasks and memory throttling still have a single control point for
each L3 cache instance, not for each node. There are still some
impacts on control, e.g. each bit in a CAT bitmask represents
less actual space in the L3 cache.

Maybe move it to the top level of the info/ directory:

$ cat /sys/fs/resctrl/info/nodes_per_l3_cache
3

-Tony
Reinette Chatre March 7, 2024, 7:52 p.m. UTC | #6
Hi Tony,

On 3/7/2024 9:57 AM, Luck, Tony wrote:
>>> SNC2 enabled:
>>>
>>> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
>>> 2
>>>
>>
>> This would be useful. I believe "SNC" is architecture specific?
>> What if the file always exists and is named "nodes_per_l3_cache"?
>>
>> I assume that the internals of handling more nodes per L3 cache should
>> be hidden from user space and it should not be necessary for user space
>> to know if this is because of SNC or potentially some other mechanism on
>> another platform?
>>
>> I think that may reduce fragmentation of resctrl .... not having
>> resctrl look so different on different architectures but maintains
>> the promise of a generic interface.
>>
>> I am not sure if this is specific to monitoring though,
>> why not host file in /sys/fs/resctrl/info/L3 ?
> 
> Reinette,
> 
> On the name change - sure. It doesn't need the "snc_" prefix.
> 
> The Intel implementation of SNC has far more effect on monitoring
> than on control. The user can read separate cache occupancy and
> memory bandwidth values for each SNC node. But cache allocation
> bitmasks and memory throttling still have a single control point for
> each L3 cache instance, not for each node. There are still some
> impacts on control, e.g. each bit in a CAT bitmask represents
> less actual space in the L3 cache.

I understand the impact but I am trying to view this conceptually.
The info directory exists to "contain information about the enabled
resources" (as per documentation) and it seems appropriate to make
this a property of the L3 resource.

> 
> Maybe move it to the top level of the info/ directory:
> 
> $ cat /sys/fs/resctrl/info/nodes_per_l3_cache
> 3

Thinking about it even differently. The goal is to give information
to userspace so we need to think about what would help user space?
For example, what if there is a file in info that shows 
which CPUs are associated with each domain?

Reinette
Luck, Tony March 7, 2024, 9:14 p.m. UTC | #7
> Thinking about it even differently. The goal is to give information
> to userspace so we need to think about what would help user space?
> For example, what if there is a file in info that shows 
> which CPUs are associated with each domain?

Reinette,

Interesting idea. That would save users from having to chase through
/sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain
numbers in schemata files and the mon_data/mon_L3_XX values mean.

May be extra useful for ARM which seems to have big random-looking numbers
for domains that came out of ACPI tables.

For SNC it would get the user directly to what they probably care about
(which CPUs are in which domain).

So something like this for an SNC 2 system:

$ cat /sys/fs/resctrl/info/L3/cpus
0: 0-35,72-107
1: 36-71,108-143

$ cat /sys/fs/resctrl/info/L3_MON/cpus
0: 0-17,72-89
1: 18-35,90-107
2: 36-53,108-125
3: 54-71,126-143

[maybe there is a better name than "cpus" for this file?]

-Tony
Reinette Chatre March 7, 2024, 10:39 p.m. UTC | #8
Hi Tony,

On 3/7/2024 1:14 PM, Luck, Tony wrote:
>> Thinking about it even differently. The goal is to give information
>> to userspace so we need to think about what would help user space?
>> For example, what if there is a file in info that shows 
>> which CPUs are associated with each domain?
> 
> Reinette,
> 
> Interesting idea. That would save users from having to chase through
> /sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain
> numbers in schemata files and the mon_data/mon_L3_XX values mean.
> 
> May be extra useful for ARM which seems to have big random-looking numbers
> for domains that came out of ACPI tables.
> 
> For SNC it would get the user directly to what they probably care about
> (which CPUs are in which domain).

I agree.

> 
> So something like this for an SNC 2 system:
> 
> $ cat /sys/fs/resctrl/info/L3/cpus
> 0: 0-35,72-107
> 1: 36-71,108-143
> 
> $ cat /sys/fs/resctrl/info/L3_MON/cpus
> 0: 0-17,72-89
> 1: 18-35,90-107
> 2: 36-53,108-125
> 3: 54-71,126-143
> 
> [maybe there is a better name than "cpus" for this file?]

Thank you for the example. I find that significantly easier to
understand than a single number in a generic "nodes_per_l3_cache".
Especially with potential confusion surrounding inconsistent "nodes"
between allocation and monitoring. 

How about domain_cpu_list and domain_cpu_map ?

Reinette
Luck, Tony March 7, 2024, 11:16 p.m. UTC | #9
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
> Thank you for the example. I find that significantly easier to
> understand than a single number in a generic "nodes_per_l3_cache".
> Especially with potential confusion surrounding inconsistent "nodes"
> between allocation and monitoring. 
> 
> How about domain_cpu_list and domain_cpu_map ?

Reinette,

Like this (my test system doesn't have SNC, so all domains are the same):

$ cd /sys/fs/resctrl/info/
$ grep . */domain*
L3/domain_cpu_list:0: 0-35,72-107
L3/domain_cpu_list:1: 36-71,108-143
L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
L3_MON/domain_cpu_list:0: 0-35,72-107
L3_MON/domain_cpu_list:1: 36-71,108-143
L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
MB/domain_cpu_list:0: 0-35,72-107
MB/domain_cpu_list:1: 36-71,108-143
MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000


The patch to do this is pretty straightforward.

-Tony

---

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ae80170a0d1b..c180b80640e3 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -957,6 +957,20 @@ static int rdt_num_closids_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdt_ctrl_cpus_show(struct kernfs_open_file *of,
+			      struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	struct rdt_resource *r = s->res;
+	struct rdt_ctrl_domain *d;
+
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list)
+		seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n",
+			   d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask));
+
+	return 0;
+}
+
 static int rdt_default_ctrl_show(struct kernfs_open_file *of,
 			     struct seq_file *seq, void *v)
 {
@@ -1103,6 +1117,19 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
 	return 0;
 }
 
+static int rdt_mon_cpus_show(struct kernfs_open_file *of,
+			     struct seq_file *seq, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct rdt_mon_domain *d;
+
+	list_for_each_entry(d, &r->mon_domains, hdr.list)
+		seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n",
+			   d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask));
+
+	return 0;
+}
+
 static int rdt_mon_features_show(struct kernfs_open_file *of,
 				 struct seq_file *seq, void *v)
 {
@@ -1810,6 +1837,21 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdt_num_closids_show,
 		.fflags		= RFTYPE_CTRL_INFO,
 	},
+	{
+		.name		= "domain_cpu_list",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_ctrl_cpus_show,
+		.flags		= RFTYPE_FLAGS_CPUS_LIST,
+		.fflags		= RFTYPE_CTRL_INFO,
+	},
+	{
+		.name		= "domain_cpu_map",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_ctrl_cpus_show,
+		.fflags		= RFTYPE_CTRL_INFO,
+	},
 	{
 		.name		= "mon_features",
 		.mode		= 0444,
@@ -1824,6 +1866,21 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdt_num_rmids_show,
 		.fflags		= RFTYPE_MON_INFO,
 	},
+	{
+		.name		= "domain_cpu_list",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_mon_cpus_show,
+		.flags		= RFTYPE_FLAGS_CPUS_LIST,
+		.fflags		= RFTYPE_MON_INFO,
+	},
+	{
+		.name		= "domain_cpu_map",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_mon_cpus_show,
+		.fflags		= RFTYPE_MON_INFO,
+	},
 	{
 		.name		= "cbm_mask",
 		.mode		= 0444,
Reinette Chatre March 7, 2024, 11:41 p.m. UTC | #10
Hi Tony,

On 3/7/2024 3:16 PM, Tony Luck wrote:
> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>> Thank you for the example. I find that significantly easier to
>> understand than a single number in a generic "nodes_per_l3_cache".
>> Especially with potential confusion surrounding inconsistent "nodes"
>> between allocation and monitoring. 
>>
>> How about domain_cpu_list and domain_cpu_map ?
> 
> Reinette,
> 
> Like this (my test system doesn't have SNC, so all domains are the same):
> 
> $ cd /sys/fs/resctrl/info/
> $ grep . */domain*
> L3/domain_cpu_list:0: 0-35,72-107
> L3/domain_cpu_list:1: 36-71,108-143
> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> L3_MON/domain_cpu_list:0: 0-35,72-107
> L3_MON/domain_cpu_list:1: 36-71,108-143
> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> MB/domain_cpu_list:0: 0-35,72-107
> MB/domain_cpu_list:1: 36-71,108-143
> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> 
> 
> The patch to do this is pretty straightforward.
> 

Thank you for looking into this. This looks like valuable information
for user space. Back to what started this discussion ... I expect user
space can compare CPUs associated with control and monitoring domains
to learn if SNC is enabled? (And now existence of domain_cpu_list and/or
domain_cpu_map can also be used to determine if kernel supports SNC).

Reinette
James Morse March 8, 2024, 6:06 p.m. UTC | #11
Hi guys,

On 07/03/2024 23:16, Tony Luck wrote:
> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>> Thank you for the example. I find that significantly easier to
>> understand than a single number in a generic "nodes_per_l3_cache".
>> Especially with potential confusion surrounding inconsistent "nodes"
>> between allocation and monitoring. 
>>
>> How about domain_cpu_list and domain_cpu_map ?

> Like this (my test system doesn't have SNC, so all domains are the same):
> 
> $ cd /sys/fs/resctrl/info/
> $ grep . */domain*
> L3/domain_cpu_list:0: 0-35,72-107
> L3/domain_cpu_list:1: 36-71,108-143
> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> L3_MON/domain_cpu_list:0: 0-35,72-107
> L3_MON/domain_cpu_list:1: 36-71,108-143
> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> MB/domain_cpu_list:0: 0-35,72-107
> MB/domain_cpu_list:1: 36-71,108-143
> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000

This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
really because that information is, er, wrong on SNC systems. Is it possible to fix that?

From Tony's earlier description of how SNC changes things, the MB controls remain
per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
scoped.
This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
views of 'the' cache hierarchy.

(I also think that this be over the threshold on 'funny machines look funny' - but I bet
someone builds an arm machine that looks like this too!)


Thanks,

James
Luck, Tony March 8, 2024, 6:42 p.m. UTC | #12
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
> Hi guys,
> 
> On 07/03/2024 23:16, Tony Luck wrote:
> > On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
> >> Thank you for the example. I find that significantly easier to
> >> understand than a single number in a generic "nodes_per_l3_cache".
> >> Especially with potential confusion surrounding inconsistent "nodes"
> >> between allocation and monitoring. 
> >>
> >> How about domain_cpu_list and domain_cpu_map ?
> 
> > Like this (my test system doesn't have SNC, so all domains are the same):
> > 
> > $ cd /sys/fs/resctrl/info/
> > $ grep . */domain*
> > L3/domain_cpu_list:0: 0-35,72-107
> > L3/domain_cpu_list:1: 36-71,108-143
> > L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> > L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> > L3_MON/domain_cpu_list:0: 0-35,72-107
> > L3_MON/domain_cpu_list:1: 36-71,108-143
> > L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> > L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> > MB/domain_cpu_list:0: 0-35,72-107
> > MB/domain_cpu_list:1: 36-71,108-143
> > MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> > MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> 
> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
> really because that information is, er, wrong on SNC systems. Is it possible to fix that?

On an SNC system the resctrl domain for L3_MON becomes the SNC node
instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.

Even without the SNC issue this duplication may be a useful
convienience. On Intel to get from a resctrl domain is a multi-step
process to first find which of the indexY directories has level=3
and then look for the "id" that matches the domain.

> >From Tony's earlier description of how SNC changes things, the MB controls remain
> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
> scoped.
> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
> views of 'the' cache hierarchy.

I almost went partly in that direction when I started this epic voyage.
The "almost" part was to change the names of the monitoring directories
under mon_data from (legacy non-SNC system):

$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_L3_01

to (2 socket, SNC=2 system):

$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_03

While that is in some ways a more accurate view, it breaks a lot of
legacy monitoring applications that expect the "L3" names.

> (I also think that this be over the threshold on 'funny machines look funny' - but I bet
> someone builds an arm machine that looks like this too!)

-Tony
James Morse March 15, 2024, 6:02 p.m. UTC | #13
Hi Tony,

On 08/03/2024 18:42, Tony Luck wrote:
> On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
>> Hi guys,
>>
>> On 07/03/2024 23:16, Tony Luck wrote:
>>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>>>> Thank you for the example. I find that significantly easier to
>>>> understand than a single number in a generic "nodes_per_l3_cache".
>>>> Especially with potential confusion surrounding inconsistent "nodes"
>>>> between allocation and monitoring. 
>>>>
>>>> How about domain_cpu_list and domain_cpu_map ?
>>
>>> Like this (my test system doesn't have SNC, so all domains are the same):
>>>
>>> $ cd /sys/fs/resctrl/info/
>>> $ grep . */domain*
>>> L3/domain_cpu_list:0: 0-35,72-107
>>> L3/domain_cpu_list:1: 36-71,108-143
>>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>> L3_MON/domain_cpu_list:0: 0-35,72-107
>>> L3_MON/domain_cpu_list:1: 36-71,108-143
>>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>> MB/domain_cpu_list:0: 0-35,72-107
>>> MB/domain_cpu_list:1: 36-71,108-143
>>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>
>> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
>> really because that information is, er, wrong on SNC systems. Is it possible to fix that?
> 
> On an SNC system the resctrl domain for L3_MON becomes the SNC node
> instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
> 
> Even without the SNC issue this duplication may be a useful
> convienience. On Intel to get from a resctrl domain is a multi-step
> process to first find which of the indexY directories has level=3
> and then look for the "id" that matches the domain.
> 
>> >From Tony's earlier description of how SNC changes things, the MB controls remain
>> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
>> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
>> scoped.
>> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
>> views of 'the' cache hierarchy.
> 
> I almost went partly in that direction when I started this epic voyage.
> The "almost" part was to change the names of the monitoring directories
> under mon_data from (legacy non-SNC system):
> 
> $ ls -l mon_data
> total 0
> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_L3_00
> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_L3_01
> 
> to (2 socket, SNC=2 system):
> 
> $ ls -l mon_data
> total 0
> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_00
> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_01
> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_02
> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_03

This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM
controls on the 'L3', but describe it as a single global L3. The MPAM driver currently
hides this by summing the NUMA node counters and reporting it as the global L3's value.


> While that is in some ways a more accurate view, it breaks a lot of
> legacy monitoring applications that expect the "L3" names.

True - but the behaviour is different from a non SNC system, if this software can read the
file - but goes wrong because the contents of the file represent something different, its
still broken.




Thanks,

James
Reinette Chatre March 18, 2024, 7:15 p.m. UTC | #14
On 3/15/2024 11:02 AM, James Morse wrote:
> On 08/03/2024 18:42, Tony Luck wrote:
>> On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
>>> Hi guys,
>>>
>>> On 07/03/2024 23:16, Tony Luck wrote:
>>>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>>>>> Thank you for the example. I find that significantly easier to
>>>>> understand than a single number in a generic "nodes_per_l3_cache".
>>>>> Especially with potential confusion surrounding inconsistent "nodes"
>>>>> between allocation and monitoring. 
>>>>>
>>>>> How about domain_cpu_list and domain_cpu_map ?
>>>
>>>> Like this (my test system doesn't have SNC, so all domains are the same):
>>>>
>>>> $ cd /sys/fs/resctrl/info/
>>>> $ grep . */domain*
>>>> L3/domain_cpu_list:0: 0-35,72-107
>>>> L3/domain_cpu_list:1: 36-71,108-143
>>>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>> L3_MON/domain_cpu_list:0: 0-35,72-107
>>>> L3_MON/domain_cpu_list:1: 36-71,108-143
>>>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>> MB/domain_cpu_list:0: 0-35,72-107
>>>> MB/domain_cpu_list:1: 36-71,108-143
>>>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>
>>> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
>>> really because that information is, er, wrong on SNC systems. Is it possible to fix that?
>>
>> On an SNC system the resctrl domain for L3_MON becomes the SNC node
>> instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
>>
>> Even without the SNC issue this duplication may be a useful
>> convienience. On Intel to get from a resctrl domain is a multi-step
>> process to first find which of the indexY directories has level=3
>> and then look for the "id" that matches the domain.
>>
>>> >From Tony's earlier description of how SNC changes things, the MB controls remain
>>> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
>>> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
>>> scoped.
>>> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
>>> views of 'the' cache hierarchy.
>>
>> I almost went partly in that direction when I started this epic voyage.
>> The "almost" part was to change the names of the monitoring directories
>> under mon_data from (legacy non-SNC system):
>>
>> $ ls -l mon_data
>> total 0
>> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_L3_00
>> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_L3_01
>>
>> to (2 socket, SNC=2 system):
>>
>> $ ls -l mon_data
>> total 0
>> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_00
>> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_01
>> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_02
>> dr-xr-xr-x. 2 root root 0 Mar  8 10:31 mon_NODE_03
> 
> This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM
> controls on the 'L3', but describe it as a single global L3. The MPAM driver currently
> hides this by summing the NUMA node counters and reporting it as the global L3's value.
> 
> 
>> While that is in some ways a more accurate view, it breaks a lot of
>> legacy monitoring applications that expect the "L3" names.
> 
> True - but the behaviour is different from a non SNC system, if this software can read the
> file - but goes wrong because the contents of the file represent something different, its
> still broken.

This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
what to do about that makes me go in circles about when user space may expect resctrl to indicate
the resource and when user space may expect resctrl to indicate the scope. For example, 
/sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
"L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
switches the meaning of the middle term to be "scope" while it still contains the monitoring
data of the "L3" resource. So does that mean user space would need to rely on
/sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files 
(/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
which resource it applies to and learn from the directory name what scope measurement is at?

Reinette
Luck, Tony March 18, 2024, 7:34 p.m. UTC | #15
> >> While that is in some ways a more accurate view, it breaks a lot of
> >> legacy monitoring applications that expect the "L3" names.
> >
> > True - but the behaviour is different from a non SNC system, if this software can read the
> > file - but goes wrong because the contents of the file represent something different, its
> > still broken.
>
> This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
> what to do about that makes me go in circles about when user space may expect resctrl to indicate
> the resource and when user space may expect resctrl to indicate the scope. For example,
> /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
> "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
> switches the meaning of the middle term to be "scope" while it still contains the monitoring
> data of the "L3" resource. So does that mean user space would need to rely on
> /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files
> (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
> match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
> which resource it applies to and learn from the directory name what scope measurement is at?

Reinette,

It's both a wave and a particle, depending on the observer.

In SNC systems resources on each socket are divided into 2, 3, 4 nodes. But the
division is complicated. Memory and CPU cores are easy. They are each assigned
to an SNC node. The cache is more complicated. The hash function for memory
address to cache index is the part that is SNC aware. So memory on SNC node1
will allocate in the cache indices assigned to SNC node1. But that function has to
be independent of which CPU is doing the access. That's why I keep mentioning
"well behaved NUMA applications when talking about SNC.

So the resctrl monitoring operations still work on the L3 cache, but in SNC mode
they work on a portion of the L3 cache. As long as all accesses are NUMA local you
can think of the cache as partitioned between the SNC nodes.

But not everything is well behaved from a NUMA perspective. It would be misleading
to describe the occupancy and bandwidth as belonging to an SNC node.

It's also a bit misleading to describe in terms of an L3 cache instance. But doing
so doesn't require application changes.

-Tony
Reinette Chatre March 18, 2024, 8:32 p.m. UTC | #16
Hi Tony,

On 3/18/2024 12:34 PM, Luck, Tony wrote:
>>>> While that is in some ways a more accurate view, it breaks a lot of
>>>> legacy monitoring applications that expect the "L3" names.
>>>
>>> True - but the behaviour is different from a non SNC system, if this software can read the
>>> file - but goes wrong because the contents of the file represent something different, its
>>> still broken.
>>
>> This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
>> what to do about that makes me go in circles about when user space may expect resctrl to indicate
>> the resource and when user space may expect resctrl to indicate the scope. For example,
>> /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
>> "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
>> switches the meaning of the middle term to be "scope" while it still contains the monitoring
>> data of the "L3" resource. So does that mean user space would need to rely on
>> /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files
>> (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
>> match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
>> which resource it applies to and learn from the directory name what scope measurement is at?
> 
> Reinette,
> 
> It's both a wave and a particle, depending on the observer.
> 
> In SNC systems resources on each socket are divided into 2, 3, 4 nodes. But the
> division is complicated. Memory and CPU cores are easy. They are each assigned
> to an SNC node. The cache is more complicated. The hash function for memory
> address to cache index is the part that is SNC aware. So memory on SNC node1
> will allocate in the cache indices assigned to SNC node1. But that function has to
> be independent of which CPU is doing the access. That's why I keep mentioning
> "well behaved NUMA applications when talking about SNC.
> 
> So the resctrl monitoring operations still work on the L3 cache, but in SNC mode
> they work on a portion of the L3 cache. As long as all accesses are NUMA local you
> can think of the cache as partitioned between the SNC nodes.
> 
> But not everything is well behaved from a NUMA perspective. It would be misleading
> to describe the occupancy and bandwidth as belonging to an SNC node.
> 
> It's also a bit misleading to describe in terms of an L3 cache instance. But doing
> so doesn't require application changes.

What is the use case for needing to expose the individual cluster counts? What if
resctrl just summed the cluster counts and presented the data as before - per L3
cache instance? I doubt that resctrl would be what applications would use to verify
whether they are "well behaved" wrt NUMA.

Reinette
Luck, Tony March 18, 2024, 8:47 p.m. UTC | #17
> What is the use case for needing to expose the individual cluster counts? What if
> resctrl just summed the cluster counts and presented the data as before - per L3
> cache instance? I doubt that resctrl would be what applications would use to verify
> whether they are "well behaved" wrt NUMA.

Reinette,

My (perhaps naïve) belief is that in a cloud server environment there are many
well behaved NUMA applications. Only presenting the sum would lose the detailed
information from each SNC node.

-Tony
Luck, Tony March 18, 2024, 9:04 p.m. UTC | #18
> > What is the use case for needing to expose the individual cluster counts? What if
> > resctrl just summed the cluster counts and presented the data as before - per L3
> > cache instance? I doubt that resctrl would be what applications would use to verify
> > whether they are "well behaved" wrt NUMA.
>
> Reinette,
>
> My (perhaps naïve) belief is that in a cloud server environment there are many
> well behaved NUMA applications. Only presenting the sum would lose the detailed
> information from each SNC node.

Is the answer to "A" or "B" ... why not provide both:

$ ls -l /sys/fs/resctrl/mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03

The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones
give the broken out counts.

-Tony
Reinette Chatre March 18, 2024, 9:23 p.m. UTC | #19
Hi Tony,

On 3/18/2024 1:47 PM, Luck, Tony wrote:
>> What is the use case for needing to expose the individual cluster counts? What if
>> resctrl just summed the cluster counts and presented the data as before - per L3
>> cache instance? I doubt that resctrl would be what applications would use to verify
>> whether they are "well behaved" wrt NUMA.
> 
> Reinette,
> 
> My (perhaps naïve) belief is that in a cloud server environment there are many
> well behaved NUMA applications. Only presenting the sum would lose the detailed
> information from each SNC node.

Yes ...  I understand by providing a sum the values that contributed to the sum
are lost.

Could you please help me understand the details by answering my first
question: What is the use case for needing to expose the individual cluster
counts? 

This is a model specific feature so if this is something needed for just a
couple of systems I think we should be less inclined to make changes to
resctrl interface. I am starting to be concerned about something similar
becoming architectural later and then we need to wrangle this model specific
resctrl support (which has then become ABI) again to support whatever that
may look like.

Reinette
Reinette Chatre March 18, 2024, 9:26 p.m. UTC | #20
On 3/18/2024 2:04 PM, Luck, Tony wrote:
>>> What is the use case for needing to expose the individual cluster counts? What if
>>> resctrl just summed the cluster counts and presented the data as before - per L3
>>> cache instance? I doubt that resctrl would be what applications would use to verify
>>> whether they are "well behaved" wrt NUMA.
>>
>> Reinette,
>>
>> My (perhaps naïve) belief is that in a cloud server environment there are many
>> well behaved NUMA applications. Only presenting the sum would lose the detailed
>> information from each SNC node.
> 
> Is the answer to "A" or "B" ... why not provide both:
> 
> $ ls -l /sys/fs/resctrl/mon_data
> total 0
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03
> 
> The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones
> give the broken out counts.

Perhaps ... in this case it may make things easier to understand if
those "mon_NODE_*" directories are sub-directories of the appropriate
"mon_L3_*" directories. 

Reinette
Luck, Tony March 18, 2024, 10 p.m. UTC | #21
> Perhaps ... in this case it may make things easier to understand if
> those "mon_NODE_*" directories are sub-directories of the appropriate
> "mon_L3_*" directories. 

Reinette,

Like this?

$ tree mon_data/
mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   ├── mbm_total_bytes
│   ├── mon_NODE_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_NODE_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
└── mon_L3_01
    ├── llc_occupancy
    ├── mbm_local_bytes
    ├── mbm_total_bytes
    ├── mon_NODE_02
    │   ├── llc_occupancy
    │   ├── mbm_local_bytes
    │   └── mbm_total_bytes
    └── mon_NODE_03
        ├── llc_occupancy
        ├── mbm_local_bytes
        └── mbm_total_bytes

-Tony
Luck, Tony March 18, 2024, 10:04 p.m. UTC | #22
> Could you please help me understand the details by answering my first
> question: What is the use case for needing to expose the individual cluster
> counts? 
>
> This is a model specific feature so if this is something needed for just a
> couple of systems I think we should be less inclined to make changes to
> resctrl interface. I am starting to be concerned about something similar
> becoming architectural later and then we need to wrangle this model specific
> resctrl support (which has then become ABI) again to support whatever that
> may look like.

Reinette,

Model specific. But present in multiple consecutive generations (Sapphire Rapids,
Emerald Rapids, Granite Rapids, Sierra Forest).

Adding Peter Newman for a resctrl user perspective on SNC, rather than me
continue to speculate on possible ways this might be used.

Peter: You will need to dig back a few messages on lore.kernel.org to
get context.

-Tony
Reinette Chatre March 18, 2024, 10:43 p.m. UTC | #23
Hi Tony,

On 3/18/2024 3:00 PM, Luck, Tony wrote:
>> Perhaps ... in this case it may make things easier to understand if
>> those "mon_NODE_*" directories are sub-directories of the appropriate
>> "mon_L3_*" directories. 
> 
> Reinette,
> 
> Like this?
> 
> $ tree mon_data/
> mon_data/
> ├── mon_L3_00
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   ├── mbm_total_bytes
> │   ├── mon_NODE_00
> │   │   ├── llc_occupancy
> │   │   ├── mbm_local_bytes
> │   │   └── mbm_total_bytes
> │   └── mon_NODE_01
> │       ├── llc_occupancy
> │       ├── mbm_local_bytes
> │       └── mbm_total_bytes
> └── mon_L3_01
>     ├── llc_occupancy
>     ├── mbm_local_bytes
>     ├── mbm_total_bytes
>     ├── mon_NODE_02
>     │   ├── llc_occupancy
>     │   ├── mbm_local_bytes
>     │   └── mbm_total_bytes
>     └── mon_NODE_03
>         ├── llc_occupancy
>         ├── mbm_local_bytes
>         └── mbm_total_bytes
> 

Yes.

Pro:
* This is familiar to users when compared to existing
  CTRL_MON group counts that are the sum of the MON groups within it.

* Users continue to see the clear connection between files listed in
  /sys/fs/resctrl/info/L3_MON/mon_features found in "mon_L3*" directories.

* If I understand correctly it also would continue to be useful to
  Arm [1] while not breaking existing user space since the
  mon_L3* counts continue to reflect the entire L3 resource.

* This addresses your comment of maintaining the detailed information
  from each SNC node.

* I do assume that if there is only one SNC node per L3 cache then those
  mon_NODE_* directories will not be present and, to address the issue
  that triggered this thread, user space can use presence of
  multiple "mon_NODE_*" directories per cache instance to know if
  SNC is enabled.

Con:
* Unclear how this may need to change if/when SNC becomes architectural.

* Continues to "muddy" the naming of the directories: mon_<resource>_<id>
  vs mon_<scope>_<id>. This cannot be turned into agreement with user space
  where first directory is mon_<resource>_<id> and second directory is
  mon_<scope>_<id> because then we would need to have, for example,
  mon_L3_00/mon_L3_00 where the first directory is for the resource and the
  second is for the scope, which appears redundant.

* Things may get confusing if there is ever a "node" resource.

This is starting to look like an interface that "checks" all the
requirements I've seen so far. Looking at the "con" it is difficult for me
to see how doing something like this now may cause frustrations later.

Reinette

[1] https://lore.kernel.org/lkml/88430722-67b3-4f7d-8db2-95ee52b6f0b0@arm.com/
Peter Newman March 19, 2024, 9:01 p.m. UTC | #24
Hi Tony,

On Mon, Mar 18, 2024 at 3:05 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> > Could you please help me understand the details by answering my first
> > question: What is the use case for needing to expose the individual cluster
> > counts?
> >
> > This is a model specific feature so if this is something needed for just a
> > couple of systems I think we should be less inclined to make changes to
> > resctrl interface. I am starting to be concerned about something similar
> > becoming architectural later and then we need to wrangle this model specific
> > resctrl support (which has then become ABI) again to support whatever that
> > may look like.
>
> Reinette,
>
> Model specific. But present in multiple consecutive generations (Sapphire Rapids,
> Emerald Rapids, Granite Rapids, Sierra Forest).
>
> Adding Peter Newman for a resctrl user perspective on SNC, rather than me
> continue to speculate on possible ways this might be used.
>
> Peter: You will need to dig back a few messages on lore.kernel.org to
> get context.

Our main concern with supporting SNC in resctrl is all of the
monitoring groups successfully recording memory bandwidth from all
CPUs, regardless of the RMIDs they're assigned.

I would prefer that we don't complicate the model of resctrl
monitoring domains for this feature. On ARM SoCs there will be a
plethora of technologies influencing the layout of resources, so we
shouldn't start cluttering the model with special cases for each.

I think it's valid for the number of domains in the L3 resource to
increase or stay the same when the system is configured for SNC. I
don't think the details of how the domains came about is relevant at
the resctrl interface level so long as the user has enough information
to deduce what the domain is referring to based on knowledge of their
system configuration.

I would prefer per-cluster as more information could prove useful in
some future investigation, but if you feel the data is misleading,
providing the clusters combined is also fine. I would prefer that the
choice remains consistent from this point forward on any particular
implementation to avoid breaking existing controller software
developed for that implementation.

In our main use case, we sum mon_data/*/mbm_total_bytes to determine a
group's total bandwidth, so please don't cause this logic to produce
the wrong answer.

Thanks!
-Peter
diff mbox series

Patch

diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
index 4cb991be8e31..1cdaadf35f03 100644
--- a/tools/testing/selftests/resctrl/cat_test.c
+++ b/tools/testing/selftests/resctrl/cat_test.c
@@ -253,7 +253,7 @@  static int cat_run_test(const struct resctrl_test *test, const struct user_param
 		return ret;
 
 	/* Get L3/L2 cache size */
-	ret = get_cache_size(uparams->cpu, test->resource, &cache_total_size);
+	ret = get_sys_cache_size(uparams->cpu, test->resource, &cache_total_size);
 	if (ret)
 		return ret;
 	ksft_print_msg("Cache size :%lu\n", cache_total_size);
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
index a81f91222a89..b7cada602484 100644
--- a/tools/testing/selftests/resctrl/cmt_test.c
+++ b/tools/testing/selftests/resctrl/cmt_test.c
@@ -112,7 +112,7 @@  static int cmt_run_test(const struct resctrl_test *test, const struct user_param
 	if (ret)
 		return ret;
 
-	ret = get_cache_size(uparams->cpu, "L3", &cache_total_size);
+	ret = get_sys_cache_size(uparams->cpu, "L3", &cache_total_size);
 	if (ret)
 		return ret;
 	ksft_print_msg("Cache size :%lu\n", cache_total_size);
@@ -157,8 +157,8 @@  static int cmt_run_test(const struct resctrl_test *test, const struct user_param
 		goto out;
 
 	ret = check_results(&param, span, n);
-	if (ret && (get_vendor() == ARCH_INTEL))
-		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+	if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
 
 out:
 	cmt_test_cleanup();
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index fc31a61dab0c..89fe3ecbf497 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -160,6 +160,8 @@  static int mba_run_test(const struct resctrl_test *test, const struct user_param
 		goto out;
 
 	ret = check_results();
+	if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
 
 out:
 	mba_test_cleanup();
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index d67ffa3ec63a..e12b4b06f6d5 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -129,8 +129,8 @@  static int mbm_run_test(const struct resctrl_test *test, const struct user_param
 		goto out;
 
 	ret = check_results(DEFAULT_SPAN);
-	if (ret && (get_vendor() == ARCH_INTEL))
-		ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+	if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support())
+		ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
 
 out:
 	mbm_test_cleanup();
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 178fb2eab13a..038e1269a3fc 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -28,6 +28,7 @@ 
 #define RESCTRL_PATH		"/sys/fs/resctrl"
 #define PHYS_ID_PATH		"/sys/devices/system/cpu/cpu"
 #define INFO_PATH		"/sys/fs/resctrl/info"
+#define SIZE_PATH		"/sys/fs/resctrl/size"
 
 /*
  * CPU vendor IDs
@@ -168,14 +169,16 @@  unsigned long create_bit_mask(unsigned int start, unsigned int len);
 unsigned int count_contiguous_bits(unsigned long val, unsigned int *start);
 int get_full_cbm(const char *cache_type, unsigned long *mask);
 int get_mask_no_shareable(const char *cache_type, unsigned long *mask);
-int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size);
 int resource_info_unsigned_get(const char *resource, const char *filename, unsigned int *val);
+int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size);
+int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size);
 void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
 int signal_handler_register(void);
 void signal_handler_unregister(void);
 void cat_test_cleanup(void);
 unsigned int count_bits(unsigned long n);
 void cmt_test_cleanup(void);
+int snc_kernel_support(void);
 
 void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
 void perf_event_initialize_read_format(struct perf_event_read *pe_read);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index e4d3624a8817..dbd10cb7abf5 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -214,14 +214,14 @@  int snc_ways(void)
 }
 
 /*
- * get_cache_size - Get cache size for a specified CPU
+ * get_sys_cache_size - Get cache size for a specified CPU
  * @cpu_no:	CPU number
  * @cache_type:	Cache level L2/L3
  * @cache_size:	pointer to cache_size
  *
  * Return: = 0 on success, < 0 on failure.
  */
-int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size)
+int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size)
 {
 	char cache_path[1024], cache_str[64];
 	int length, i, cache_num;
@@ -273,6 +273,44 @@  int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
 	return 0;
 }
 
+/*
+ * get_resctrl_cache_size - Get cache size as reported by resctrl
+ * @cache_type:	Cache level L2/L3
+ * @cache_size:	pointer to cache_size
+ *
+ * Return: = 0 on success, < 0 on failure.
+ */
+int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size)
+{
+	char line[256], cache_prefix[16], *stripped_line, *token;
+	size_t len;
+	FILE *fp;
+
+	strcpy(cache_prefix, cache_type);
+	strncat(cache_prefix, ":", 1);
+
+	fp = fopen(SIZE_PATH, "r");
+	if (!fp) {
+		ksft_print_msg("Failed to open %s : '%s'\n",
+			       SIZE_PATH, strerror(errno));
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), fp)) {
+		stripped_line = strstr(line, cache_prefix);
+
+		if (stripped_line) {
+			len = strlen(cache_prefix);
+			stripped_line += len;
+			token = strtok(stripped_line, ";");
+			if (sscanf(token, "0=%lu", cache_size) <= 0)
+				return -1;
+		}
+	}
+	fclose(fp);
+	return 0;
+}
+
 #define CORE_SIBLINGS_PATH	"/sys/bus/cpu/devices/cpu"
 
 /*
@@ -935,3 +973,30 @@  unsigned int count_bits(unsigned long n)
 
 	return count;
 }
+
+/**
+ * snc_kernel_support - Compare system reported cache size and resctrl
+ * reported cache size to get an idea if SNC is supported on the kernel side.
+ * If SNC is enabled and the kernel does support it the value should be equal.
+ * If the kernel doesn't support SNC the.
+ *
+ * Return: 0 if not supported, 1 if supported, < 0 on failure.
+ */
+int snc_kernel_support(void)
+{
+	unsigned long resctrl_cache_size, node_cache_size;
+	int ret;
+
+	ret = get_sys_cache_size(0, "L3", &node_cache_size);
+	if (ret < 0)
+		return ret;
+
+	ret = get_resctrl_cache_size("L3", &resctrl_cache_size);
+	if (ret < 0)
+		return ret;
+
+	if (resctrl_cache_size == node_cache_size)
+		return 1;
+
+	return 0;
+}