Message ID | 8a158ada92f06b97a4679721e84e787e94b94647.1709721159.git.maciej.wieczor-retman@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | SNC support for resctrl selftests | expand |
> Figuring out if SNC is enabled is only one part of the problem, the > other being whether the kernel supports it. As there is no easy > interface that simply states SNC support in the kernel one can find that > information by comparing L3 cache sizes from different sources. Cache > size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size > will always show the full cache size even if it's split by enabled SNC. > On the other hand /sys/fs/resctrl/size has information about L3 size, > that with kernel support is adjusted for enabled SNC. Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways file to provide this information. I was talked out of it then: https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/ But that discussion didn't consider the question you discuss here: "Does this instance of the kernel support SNC?" So you have a clever solution. But it seems like a roundabout way for an application to discover whether the kernel has configured resctrl for SNC mode. Should the kernel provide an info/ file listing the SNC configuration? If so, what should it be named? "snc_ways" as a kernel variable was later replaced by "snc_nodes_per_l3_cache". Is that a good filename? -Tony
On 2024-03-06 at 21:54:02 +0000, Luck, Tony wrote: >> Figuring out if SNC is enabled is only one part of the problem, the >> other being whether the kernel supports it. As there is no easy >> interface that simply states SNC support in the kernel one can find that >> information by comparing L3 cache sizes from different sources. Cache >> size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size >> will always show the full cache size even if it's split by enabled SNC. >> On the other hand /sys/fs/resctrl/size has information about L3 size, >> that with kernel support is adjusted for enabled SNC. > >Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways >file to provide this information. I was talked out of it then: > >https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/ > >But that discussion didn't consider the question you discuss here: "Does this >instance of the kernel support SNC?" > >So you have a clever solution. But it seems like a roundabout way for >an application to discover whether the kernel has configured resctrl for >SNC mode. > >Should the kernel provide an info/ file listing the SNC configuration? I suppose it would be a benefit for other numa aware applications to have an easy access to this kernel support information. > >If so, what should it be named? "snc_ways" as a kernel variable was >later replaced by "snc_nodes_per_l3_cache". Is that a good filename? "snc_nodes_per_l3_cache" seems okay to me. And I understand that the file content would show SNC mode and the presence or absence of this file would tell if kernel supports SNC? > >-Tony
>>If so, what should it be named? "snc_ways" as a kernel variable was >>later replaced by "snc_nodes_per_l3_cache". Is that a good filename? > > "snc_nodes_per_l3_cache" seems okay to me. > > And I understand that the file content would show SNC mode and the presence or > absence of this file would tell if kernel supports SNC? Yes. The existence of the file indicates the kernel is SNC aware. The value read from the file would give the number of nodes per L3 (obviously :-) ) SNC not supported by this platform or not enabled: $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache 1 SNC2 enabled: $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache 2 etc. -Tony
Hi Tony, On 3/7/2024 9:18 AM, Luck, Tony wrote: >>> If so, what should it be named? "snc_ways" as a kernel variable was >>> later replaced by "snc_nodes_per_l3_cache". Is that a good filename? >> >> "snc_nodes_per_l3_cache" seems okay to me. >> >> And I understand that the file content would show SNC mode and the presence or >> absence of this file would tell if kernel supports SNC? > > Yes. The existence of the file indicates the kernel is SNC aware. > > The value read from the file would give the number of nodes per L3 (obviously :-) ) > > SNC not supported by this platform or not enabled: > > $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache > 1 > > SNC2 enabled: > > $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache > 2 > This would be useful. I believe "SNC" is architecture specific? What if the file always exists and is named "nodes_per_l3_cache"? I assume that the internals of handling more nodes per L3 cache should be hidden from user space and it should not be necessary for user space to know if this is because of SNC or potentially some other mechanism on another platform? I think that may reduce fragmentation of resctrl .... not having resctrl look so different on different architectures but maintains the promise of a generic interface. I am not sure if this is specific to monitoring though, why not host file in /sys/fs/resctrl/info/L3 ? Reinette
> > SNC2 enabled: > > > > $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache > > 2 > > > > This would be useful. I believe "SNC" is architecture specific? > What if the file always exists and is named "nodes_per_l3_cache"? > > I assume that the internals of handling more nodes per L3 cache should > be hidden from user space and it should not be necessary for user space > to know if this is because of SNC or potentially some other mechanism on > another platform? > > I think that may reduce fragmentation of resctrl .... not having > resctrl look so different on different architectures but maintains > the promise of a generic interface. > > I am not sure if this is specific to monitoring though, > why not host file in /sys/fs/resctrl/info/L3 ? Reinette, On the name change - sure. It doesn't need the "snc_" prefix. The Intel implementation of SNC has far more effect on monitoring than on control. The user can read separate cache occupancy and memory bandwidth values for each SNC node. But cache allocation bitmasks and memory throttling still have a single control point for each L3 cache instance, not for each node. There are still some impacts on control, e.g. each bit in a CAT bitmask represents less actual space in the L3 cache. Maybe move it to the top level of the info/ directory: $ cat /sys/fs/resctrl/info/nodes_per_l3_cache 3 -Tony
Hi Tony, On 3/7/2024 9:57 AM, Luck, Tony wrote: >>> SNC2 enabled: >>> >>> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache >>> 2 >>> >> >> This would be useful. I believe "SNC" is architecture specific? >> What if the file always exists and is named "nodes_per_l3_cache"? >> >> I assume that the internals of handling more nodes per L3 cache should >> be hidden from user space and it should not be necessary for user space >> to know if this is because of SNC or potentially some other mechanism on >> another platform? >> >> I think that may reduce fragmentation of resctrl .... not having >> resctrl look so different on different architectures but maintains >> the promise of a generic interface. >> >> I am not sure if this is specific to monitoring though, >> why not host file in /sys/fs/resctrl/info/L3 ? > > Reinette, > > On the name change - sure. It doesn't need the "snc_" prefix. > > The Intel implementation of SNC has far more effect on monitoring > than on control. The user can read separate cache occupancy and > memory bandwidth values for each SNC node. But cache allocation > bitmasks and memory throttling still have a single control point for > each L3 cache instance, not for each node. There are still some > impacts on control, e.g. each bit in a CAT bitmask represents > less actual space in the L3 cache. I understand the impact but I am trying to view this conceptually. The info directory exists to "contain information about the enabled resources" (as per documentation) and it seems appropriate to make this a property of the L3 resource. > > Maybe move it to the top level of the info/ directory: > > $ cat /sys/fs/resctrl/info/nodes_per_l3_cache > 3 Thinking about it even differently. The goal is to give information to userspace so we need to think about what would help user space? For example, what if there is a file in info that shows which CPUs are associated with each domain? Reinette
> Thinking about it even differently. The goal is to give information > to userspace so we need to think about what would help user space? > For example, what if there is a file in info that shows > which CPUs are associated with each domain? Reinette, Interesting idea. That would save users from having to chase through /sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain numbers in schemata files and the mon_data/mon_L3_XX values mean. May be extra useful for ARM which seems to have big random-looking numbers for domains that came out of ACPI tables. For SNC it would get the user directly to what they probably care about (which CPUs are in which domain). So something like this for an SNC 2 system: $ cat /sys/fs/resctrl/info/L3/cpus 0: 0-35,72-107 1: 36-71,108-143 $ cat /sys/fs/resctrl/info/L3_MON/cpus 0: 0-17,72-89 1: 18-35,90-107 2: 36-53,108-125 3: 54-71,126-143 [maybe there is a better name than "cpus" for this file?] -Tony
Hi Tony, On 3/7/2024 1:14 PM, Luck, Tony wrote: >> Thinking about it even differently. The goal is to give information >> to userspace so we need to think about what would help user space? >> For example, what if there is a file in info that shows >> which CPUs are associated with each domain? > > Reinette, > > Interesting idea. That would save users from having to chase through > /sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain > numbers in schemata files and the mon_data/mon_L3_XX values mean. > > May be extra useful for ARM which seems to have big random-looking numbers > for domains that came out of ACPI tables. > > For SNC it would get the user directly to what they probably care about > (which CPUs are in which domain). I agree. > > So something like this for an SNC 2 system: > > $ cat /sys/fs/resctrl/info/L3/cpus > 0: 0-35,72-107 > 1: 36-71,108-143 > > $ cat /sys/fs/resctrl/info/L3_MON/cpus > 0: 0-17,72-89 > 1: 18-35,90-107 > 2: 36-53,108-125 > 3: 54-71,126-143 > > [maybe there is a better name than "cpus" for this file?] Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring. How about domain_cpu_list and domain_cpu_map ? Reinette
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote: > Thank you for the example. I find that significantly easier to > understand than a single number in a generic "nodes_per_l3_cache". > Especially with potential confusion surrounding inconsistent "nodes" > between allocation and monitoring. > > How about domain_cpu_list and domain_cpu_map ? Reinette, Like this (my test system doesn't have SNC, so all domains are the same): $ cd /sys/fs/resctrl/info/ $ grep . */domain* L3/domain_cpu_list:0: 0-35,72-107 L3/domain_cpu_list:1: 36-71,108-143 L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 L3_MON/domain_cpu_list:0: 0-35,72-107 L3_MON/domain_cpu_list:1: 36-71,108-143 L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 MB/domain_cpu_list:0: 0-35,72-107 MB/domain_cpu_list:1: 36-71,108-143 MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 The patch to do this is pretty straightforward. -Tony --- diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index ae80170a0d1b..c180b80640e3 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -957,6 +957,20 @@ static int rdt_num_closids_show(struct kernfs_open_file *of, return 0; } +static int rdt_ctrl_cpus_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + struct rdt_ctrl_domain *d; + + list_for_each_entry(d, &r->ctrl_domains, hdr.list) + seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n", + d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask)); + + return 0; +} + static int rdt_default_ctrl_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { @@ -1103,6 +1117,19 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of, return 0; } +static int rdt_mon_cpus_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + struct rdt_mon_domain *d; + + list_for_each_entry(d, &r->mon_domains, hdr.list) + seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n", + d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask)); + + return 0; +} + static int rdt_mon_features_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { @@ -1810,6 +1837,21 @@ static struct rftype res_common_files[] = { .seq_show = rdt_num_closids_show, .fflags = RFTYPE_CTRL_INFO, }, + { + .name = "domain_cpu_list", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_ctrl_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_CTRL_INFO, + }, + { + .name = "domain_cpu_map", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_ctrl_cpus_show, + .fflags = RFTYPE_CTRL_INFO, + }, { .name = "mon_features", .mode = 0444, @@ -1824,6 +1866,21 @@ static struct rftype res_common_files[] = { .seq_show = rdt_num_rmids_show, .fflags = RFTYPE_MON_INFO, }, + { + .name = "domain_cpu_list", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "domain_cpu_map", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_cpus_show, + .fflags = RFTYPE_MON_INFO, + }, { .name = "cbm_mask", .mode = 0444,
Hi Tony, On 3/7/2024 3:16 PM, Tony Luck wrote: > On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote: >> Thank you for the example. I find that significantly easier to >> understand than a single number in a generic "nodes_per_l3_cache". >> Especially with potential confusion surrounding inconsistent "nodes" >> between allocation and monitoring. >> >> How about domain_cpu_list and domain_cpu_map ? > > Reinette, > > Like this (my test system doesn't have SNC, so all domains are the same): > > $ cd /sys/fs/resctrl/info/ > $ grep . */domain* > L3/domain_cpu_list:0: 0-35,72-107 > L3/domain_cpu_list:1: 36-71,108-143 > L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > L3_MON/domain_cpu_list:0: 0-35,72-107 > L3_MON/domain_cpu_list:1: 36-71,108-143 > L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > MB/domain_cpu_list:0: 0-35,72-107 > MB/domain_cpu_list:1: 36-71,108-143 > MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > > > The patch to do this is pretty straightforward. > Thank you for looking into this. This looks like valuable information for user space. Back to what started this discussion ... I expect user space can compare CPUs associated with control and monitoring domains to learn if SNC is enabled? (And now existence of domain_cpu_list and/or domain_cpu_map can also be used to determine if kernel supports SNC). Reinette
Hi guys, On 07/03/2024 23:16, Tony Luck wrote: > On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote: >> Thank you for the example. I find that significantly easier to >> understand than a single number in a generic "nodes_per_l3_cache". >> Especially with potential confusion surrounding inconsistent "nodes" >> between allocation and monitoring. >> >> How about domain_cpu_list and domain_cpu_map ? > Like this (my test system doesn't have SNC, so all domains are the same): > > $ cd /sys/fs/resctrl/info/ > $ grep . */domain* > L3/domain_cpu_list:0: 0-35,72-107 > L3/domain_cpu_list:1: 36-71,108-143 > L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > L3_MON/domain_cpu_list:0: 0-35,72-107 > L3_MON/domain_cpu_list:1: 36-71,108-143 > L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > MB/domain_cpu_list:0: 0-35,72-107 > MB/domain_cpu_list:1: 36-71,108-143 > MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this really because that information is, er, wrong on SNC systems. Is it possible to fix that? From Tony's earlier description of how SNC changes things, the MB controls remain per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA scoped. This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different views of 'the' cache hierarchy. (I also think that this be over the threshold on 'funny machines look funny' - but I bet someone builds an arm machine that looks like this too!) Thanks, James
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote: > Hi guys, > > On 07/03/2024 23:16, Tony Luck wrote: > > On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote: > >> Thank you for the example. I find that significantly easier to > >> understand than a single number in a generic "nodes_per_l3_cache". > >> Especially with potential confusion surrounding inconsistent "nodes" > >> between allocation and monitoring. > >> > >> How about domain_cpu_list and domain_cpu_map ? > > > Like this (my test system doesn't have SNC, so all domains are the same): > > > > $ cd /sys/fs/resctrl/info/ > > $ grep . */domain* > > L3/domain_cpu_list:0: 0-35,72-107 > > L3/domain_cpu_list:1: 36-71,108-143 > > L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > > L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > > L3_MON/domain_cpu_list:0: 0-35,72-107 > > L3_MON/domain_cpu_list:1: 36-71,108-143 > > L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > > L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > > MB/domain_cpu_list:0: 0-35,72-107 > > MB/domain_cpu_list:1: 36-71,108-143 > > MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff > > MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 > > This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this > really because that information is, er, wrong on SNC systems. Is it possible to fix that? On an SNC system the resctrl domain for L3_MON becomes the SNC node instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3. Even without the SNC issue this duplication may be a useful convienience. On Intel to get from a resctrl domain is a multi-step process to first find which of the indexY directories has level=3 and then look for the "id" that matches the domain. > >From Tony's earlier description of how SNC changes things, the MB controls remain > per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to > describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA > scoped. > This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different > views of 'the' cache hierarchy. I almost went partly in that direction when I started this epic voyage. The "almost" part was to change the names of the monitoring directories under mon_data from (legacy non-SNC system): $ ls -l mon_data total 0 dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00 dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01 to (2 socket, SNC=2 system): $ ls -l mon_data total 0 dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00 dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01 dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02 dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03 While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names. > (I also think that this be over the threshold on 'funny machines look funny' - but I bet > someone builds an arm machine that looks like this too!) -Tony
Hi Tony, On 08/03/2024 18:42, Tony Luck wrote: > On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote: >> Hi guys, >> >> On 07/03/2024 23:16, Tony Luck wrote: >>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote: >>>> Thank you for the example. I find that significantly easier to >>>> understand than a single number in a generic "nodes_per_l3_cache". >>>> Especially with potential confusion surrounding inconsistent "nodes" >>>> between allocation and monitoring. >>>> >>>> How about domain_cpu_list and domain_cpu_map ? >> >>> Like this (my test system doesn't have SNC, so all domains are the same): >>> >>> $ cd /sys/fs/resctrl/info/ >>> $ grep . */domain* >>> L3/domain_cpu_list:0: 0-35,72-107 >>> L3/domain_cpu_list:1: 36-71,108-143 >>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff >>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 >>> L3_MON/domain_cpu_list:0: 0-35,72-107 >>> L3_MON/domain_cpu_list:1: 36-71,108-143 >>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff >>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 >>> MB/domain_cpu_list:0: 0-35,72-107 >>> MB/domain_cpu_list:1: 36-71,108-143 >>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff >>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 >> >> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this >> really because that information is, er, wrong on SNC systems. Is it possible to fix that? > > On an SNC system the resctrl domain for L3_MON becomes the SNC node > instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3. > > Even without the SNC issue this duplication may be a useful > convienience. On Intel to get from a resctrl domain is a multi-step > process to first find which of the indexY directories has level=3 > and then look for the "id" that matches the domain. > >> >From Tony's earlier description of how SNC changes things, the MB controls remain >> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to >> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA >> scoped. >> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different >> views of 'the' cache hierarchy. > > I almost went partly in that direction when I started this epic voyage. > The "almost" part was to change the names of the monitoring directories > under mon_data from (legacy non-SNC system): > > $ ls -l mon_data > total 0 > dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00 > dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01 > > to (2 socket, SNC=2 system): > > $ ls -l mon_data > total 0 > dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00 > dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01 > dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02 > dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03 This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM controls on the 'L3', but describe it as a single global L3. The MPAM driver currently hides this by summing the NUMA node counters and reporting it as the global L3's value. > While that is in some ways a more accurate view, it breaks a lot of > legacy monitoring applications that expect the "L3" names. True - but the behaviour is different from a non SNC system, if this software can read the file - but goes wrong because the contents of the file represent something different, its still broken. Thanks, James
On 3/15/2024 11:02 AM, James Morse wrote: > On 08/03/2024 18:42, Tony Luck wrote: >> On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote: >>> Hi guys, >>> >>> On 07/03/2024 23:16, Tony Luck wrote: >>>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote: >>>>> Thank you for the example. I find that significantly easier to >>>>> understand than a single number in a generic "nodes_per_l3_cache". >>>>> Especially with potential confusion surrounding inconsistent "nodes" >>>>> between allocation and monitoring. >>>>> >>>>> How about domain_cpu_list and domain_cpu_map ? >>> >>>> Like this (my test system doesn't have SNC, so all domains are the same): >>>> >>>> $ cd /sys/fs/resctrl/info/ >>>> $ grep . */domain* >>>> L3/domain_cpu_list:0: 0-35,72-107 >>>> L3/domain_cpu_list:1: 36-71,108-143 >>>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff >>>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 >>>> L3_MON/domain_cpu_list:0: 0-35,72-107 >>>> L3_MON/domain_cpu_list:1: 36-71,108-143 >>>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff >>>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 >>>> MB/domain_cpu_list:0: 0-35,72-107 >>>> MB/domain_cpu_list:1: 36-71,108-143 >>>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff >>>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 >>> >>> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this >>> really because that information is, er, wrong on SNC systems. Is it possible to fix that? >> >> On an SNC system the resctrl domain for L3_MON becomes the SNC node >> instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3. >> >> Even without the SNC issue this duplication may be a useful >> convienience. On Intel to get from a resctrl domain is a multi-step >> process to first find which of the indexY directories has level=3 >> and then look for the "id" that matches the domain. >> >>> >From Tony's earlier description of how SNC changes things, the MB controls remain >>> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to >>> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA >>> scoped. >>> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different >>> views of 'the' cache hierarchy. >> >> I almost went partly in that direction when I started this epic voyage. >> The "almost" part was to change the names of the monitoring directories >> under mon_data from (legacy non-SNC system): >> >> $ ls -l mon_data >> total 0 >> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00 >> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01 >> >> to (2 socket, SNC=2 system): >> >> $ ls -l mon_data >> total 0 >> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00 >> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01 >> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02 >> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03 > > This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM > controls on the 'L3', but describe it as a single global L3. The MPAM driver currently > hides this by summing the NUMA node counters and reporting it as the global L3's value. > > >> While that is in some ways a more accurate view, it breaks a lot of >> legacy monitoring applications that expect the "L3" names. > > True - but the behaviour is different from a non SNC system, if this software can read the > file - but goes wrong because the contents of the file represent something different, its > still broken. This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think what to do about that makes me go in circles about when user space may expect resctrl to indicate the resource and when user space may expect resctrl to indicate the scope. For example, /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it switches the meaning of the middle term to be "scope" while it still contains the monitoring data of the "L3" resource. So does that mean user space would need to rely on /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know which resource it applies to and learn from the directory name what scope measurement is at? Reinette
> >> While that is in some ways a more accurate view, it breaks a lot of > >> legacy monitoring applications that expect the "L3" names. > > > > True - but the behaviour is different from a non SNC system, if this software can read the > > file - but goes wrong because the contents of the file represent something different, its > > still broken. > > This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think > what to do about that makes me go in circles about when user space may expect resctrl to indicate > the resource and when user space may expect resctrl to indicate the scope. For example, > /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the > "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it > switches the meaning of the middle term to be "scope" while it still contains the monitoring > data of the "L3" resource. So does that mean user space would need to rely on > /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files > (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then > match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know > which resource it applies to and learn from the directory name what scope measurement is at? Reinette, It's both a wave and a particle, depending on the observer. In SNC systems resources on each socket are divided into 2, 3, 4 nodes. But the division is complicated. Memory and CPU cores are easy. They are each assigned to an SNC node. The cache is more complicated. The hash function for memory address to cache index is the part that is SNC aware. So memory on SNC node1 will allocate in the cache indices assigned to SNC node1. But that function has to be independent of which CPU is doing the access. That's why I keep mentioning "well behaved NUMA applications when talking about SNC. So the resctrl monitoring operations still work on the L3 cache, but in SNC mode they work on a portion of the L3 cache. As long as all accesses are NUMA local you can think of the cache as partitioned between the SNC nodes. But not everything is well behaved from a NUMA perspective. It would be misleading to describe the occupancy and bandwidth as belonging to an SNC node. It's also a bit misleading to describe in terms of an L3 cache instance. But doing so doesn't require application changes. -Tony
Hi Tony, On 3/18/2024 12:34 PM, Luck, Tony wrote: >>>> While that is in some ways a more accurate view, it breaks a lot of >>>> legacy monitoring applications that expect the "L3" names. >>> >>> True - but the behaviour is different from a non SNC system, if this software can read the >>> file - but goes wrong because the contents of the file represent something different, its >>> still broken. >> >> This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think >> what to do about that makes me go in circles about when user space may expect resctrl to indicate >> the resource and when user space may expect resctrl to indicate the scope. For example, >> /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the >> "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it >> switches the meaning of the middle term to be "scope" while it still contains the monitoring >> data of the "L3" resource. So does that mean user space would need to rely on >> /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files >> (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then >> match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know >> which resource it applies to and learn from the directory name what scope measurement is at? > > Reinette, > > It's both a wave and a particle, depending on the observer. > > In SNC systems resources on each socket are divided into 2, 3, 4 nodes. But the > division is complicated. Memory and CPU cores are easy. They are each assigned > to an SNC node. The cache is more complicated. The hash function for memory > address to cache index is the part that is SNC aware. So memory on SNC node1 > will allocate in the cache indices assigned to SNC node1. But that function has to > be independent of which CPU is doing the access. That's why I keep mentioning > "well behaved NUMA applications when talking about SNC. > > So the resctrl monitoring operations still work on the L3 cache, but in SNC mode > they work on a portion of the L3 cache. As long as all accesses are NUMA local you > can think of the cache as partitioned between the SNC nodes. > > But not everything is well behaved from a NUMA perspective. It would be misleading > to describe the occupancy and bandwidth as belonging to an SNC node. > > It's also a bit misleading to describe in terms of an L3 cache instance. But doing > so doesn't require application changes. What is the use case for needing to expose the individual cluster counts? What if resctrl just summed the cluster counts and presented the data as before - per L3 cache instance? I doubt that resctrl would be what applications would use to verify whether they are "well behaved" wrt NUMA. Reinette
> What is the use case for needing to expose the individual cluster counts? What if > resctrl just summed the cluster counts and presented the data as before - per L3 > cache instance? I doubt that resctrl would be what applications would use to verify > whether they are "well behaved" wrt NUMA. Reinette, My (perhaps naïve) belief is that in a cloud server environment there are many well behaved NUMA applications. Only presenting the sum would lose the detailed information from each SNC node. -Tony
> > What is the use case for needing to expose the individual cluster counts? What if > > resctrl just summed the cluster counts and presented the data as before - per L3 > > cache instance? I doubt that resctrl would be what applications would use to verify > > whether they are "well behaved" wrt NUMA. > > Reinette, > > My (perhaps naïve) belief is that in a cloud server environment there are many > well behaved NUMA applications. Only presenting the sum would lose the detailed > information from each SNC node. Is the answer to "A" or "B" ... why not provide both: $ ls -l /sys/fs/resctrl/mon_data total 0 dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00 dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01 dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00 dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01 dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02 dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03 The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones give the broken out counts. -Tony
Hi Tony, On 3/18/2024 1:47 PM, Luck, Tony wrote: >> What is the use case for needing to expose the individual cluster counts? What if >> resctrl just summed the cluster counts and presented the data as before - per L3 >> cache instance? I doubt that resctrl would be what applications would use to verify >> whether they are "well behaved" wrt NUMA. > > Reinette, > > My (perhaps naïve) belief is that in a cloud server environment there are many > well behaved NUMA applications. Only presenting the sum would lose the detailed > information from each SNC node. Yes ... I understand by providing a sum the values that contributed to the sum are lost. Could you please help me understand the details by answering my first question: What is the use case for needing to expose the individual cluster counts? This is a model specific feature so if this is something needed for just a couple of systems I think we should be less inclined to make changes to resctrl interface. I am starting to be concerned about something similar becoming architectural later and then we need to wrangle this model specific resctrl support (which has then become ABI) again to support whatever that may look like. Reinette
On 3/18/2024 2:04 PM, Luck, Tony wrote: >>> What is the use case for needing to expose the individual cluster counts? What if >>> resctrl just summed the cluster counts and presented the data as before - per L3 >>> cache instance? I doubt that resctrl would be what applications would use to verify >>> whether they are "well behaved" wrt NUMA. >> >> Reinette, >> >> My (perhaps naïve) belief is that in a cloud server environment there are many >> well behaved NUMA applications. Only presenting the sum would lose the detailed >> information from each SNC node. > > Is the answer to "A" or "B" ... why not provide both: > > $ ls -l /sys/fs/resctrl/mon_data > total 0 > dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00 > dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01 > dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00 > dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01 > dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02 > dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03 > > The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones > give the broken out counts. Perhaps ... in this case it may make things easier to understand if those "mon_NODE_*" directories are sub-directories of the appropriate "mon_L3_*" directories. Reinette
> Perhaps ... in this case it may make things easier to understand if > those "mon_NODE_*" directories are sub-directories of the appropriate > "mon_L3_*" directories. Reinette, Like this? $ tree mon_data/ mon_data/ ├── mon_L3_00 │ ├── llc_occupancy │ ├── mbm_local_bytes │ ├── mbm_total_bytes │ ├── mon_NODE_00 │ │ ├── llc_occupancy │ │ ├── mbm_local_bytes │ │ └── mbm_total_bytes │ └── mon_NODE_01 │ ├── llc_occupancy │ ├── mbm_local_bytes │ └── mbm_total_bytes └── mon_L3_01 ├── llc_occupancy ├── mbm_local_bytes ├── mbm_total_bytes ├── mon_NODE_02 │ ├── llc_occupancy │ ├── mbm_local_bytes │ └── mbm_total_bytes └── mon_NODE_03 ├── llc_occupancy ├── mbm_local_bytes └── mbm_total_bytes -Tony
> Could you please help me understand the details by answering my first > question: What is the use case for needing to expose the individual cluster > counts? > > This is a model specific feature so if this is something needed for just a > couple of systems I think we should be less inclined to make changes to > resctrl interface. I am starting to be concerned about something similar > becoming architectural later and then we need to wrangle this model specific > resctrl support (which has then become ABI) again to support whatever that > may look like. Reinette, Model specific. But present in multiple consecutive generations (Sapphire Rapids, Emerald Rapids, Granite Rapids, Sierra Forest). Adding Peter Newman for a resctrl user perspective on SNC, rather than me continue to speculate on possible ways this might be used. Peter: You will need to dig back a few messages on lore.kernel.org to get context. -Tony
Hi Tony, On 3/18/2024 3:00 PM, Luck, Tony wrote: >> Perhaps ... in this case it may make things easier to understand if >> those "mon_NODE_*" directories are sub-directories of the appropriate >> "mon_L3_*" directories. > > Reinette, > > Like this? > > $ tree mon_data/ > mon_data/ > ├── mon_L3_00 > │ ├── llc_occupancy > │ ├── mbm_local_bytes > │ ├── mbm_total_bytes > │ ├── mon_NODE_00 > │ │ ├── llc_occupancy > │ │ ├── mbm_local_bytes > │ │ └── mbm_total_bytes > │ └── mon_NODE_01 > │ ├── llc_occupancy > │ ├── mbm_local_bytes > │ └── mbm_total_bytes > └── mon_L3_01 > ├── llc_occupancy > ├── mbm_local_bytes > ├── mbm_total_bytes > ├── mon_NODE_02 > │ ├── llc_occupancy > │ ├── mbm_local_bytes > │ └── mbm_total_bytes > └── mon_NODE_03 > ├── llc_occupancy > ├── mbm_local_bytes > └── mbm_total_bytes > Yes. Pro: * This is familiar to users when compared to existing CTRL_MON group counts that are the sum of the MON groups within it. * Users continue to see the clear connection between files listed in /sys/fs/resctrl/info/L3_MON/mon_features found in "mon_L3*" directories. * If I understand correctly it also would continue to be useful to Arm [1] while not breaking existing user space since the mon_L3* counts continue to reflect the entire L3 resource. * This addresses your comment of maintaining the detailed information from each SNC node. * I do assume that if there is only one SNC node per L3 cache then those mon_NODE_* directories will not be present and, to address the issue that triggered this thread, user space can use presence of multiple "mon_NODE_*" directories per cache instance to know if SNC is enabled. Con: * Unclear how this may need to change if/when SNC becomes architectural. * Continues to "muddy" the naming of the directories: mon_<resource>_<id> vs mon_<scope>_<id>. This cannot be turned into agreement with user space where first directory is mon_<resource>_<id> and second directory is mon_<scope>_<id> because then we would need to have, for example, mon_L3_00/mon_L3_00 where the first directory is for the resource and the second is for the scope, which appears redundant. * Things may get confusing if there is ever a "node" resource. This is starting to look like an interface that "checks" all the requirements I've seen so far. Looking at the "con" it is difficult for me to see how doing something like this now may cause frustrations later. Reinette [1] https://lore.kernel.org/lkml/88430722-67b3-4f7d-8db2-95ee52b6f0b0@arm.com/
Hi Tony, On Mon, Mar 18, 2024 at 3:05 PM Luck, Tony <tony.luck@intel.com> wrote: > > > Could you please help me understand the details by answering my first > > question: What is the use case for needing to expose the individual cluster > > counts? > > > > This is a model specific feature so if this is something needed for just a > > couple of systems I think we should be less inclined to make changes to > > resctrl interface. I am starting to be concerned about something similar > > becoming architectural later and then we need to wrangle this model specific > > resctrl support (which has then become ABI) again to support whatever that > > may look like. > > Reinette, > > Model specific. But present in multiple consecutive generations (Sapphire Rapids, > Emerald Rapids, Granite Rapids, Sierra Forest). > > Adding Peter Newman for a resctrl user perspective on SNC, rather than me > continue to speculate on possible ways this might be used. > > Peter: You will need to dig back a few messages on lore.kernel.org to > get context. Our main concern with supporting SNC in resctrl is all of the monitoring groups successfully recording memory bandwidth from all CPUs, regardless of the RMIDs they're assigned. I would prefer that we don't complicate the model of resctrl monitoring domains for this feature. On ARM SoCs there will be a plethora of technologies influencing the layout of resources, so we shouldn't start cluttering the model with special cases for each. I think it's valid for the number of domains in the L3 resource to increase or stay the same when the system is configured for SNC. I don't think the details of how the domains came about is relevant at the resctrl interface level so long as the user has enough information to deduce what the domain is referring to based on knowledge of their system configuration. I would prefer per-cluster as more information could prove useful in some future investigation, but if you feel the data is misleading, providing the clusters combined is also fine. I would prefer that the choice remains consistent from this point forward on any particular implementation to avoid breaking existing controller software developed for that implementation. In our main use case, we sum mon_data/*/mbm_total_bytes to determine a group's total bandwidth, so please don't cause this logic to produce the wrong answer. Thanks! -Peter
diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c index 4cb991be8e31..1cdaadf35f03 100644 --- a/tools/testing/selftests/resctrl/cat_test.c +++ b/tools/testing/selftests/resctrl/cat_test.c @@ -253,7 +253,7 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param return ret; /* Get L3/L2 cache size */ - ret = get_cache_size(uparams->cpu, test->resource, &cache_total_size); + ret = get_sys_cache_size(uparams->cpu, test->resource, &cache_total_size); if (ret) return ret; ksft_print_msg("Cache size :%lu\n", cache_total_size); diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index a81f91222a89..b7cada602484 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -112,7 +112,7 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param if (ret) return ret; - ret = get_cache_size(uparams->cpu, "L3", &cache_total_size); + ret = get_sys_cache_size(uparams->cpu, "L3", &cache_total_size); if (ret) return ret; ksft_print_msg("Cache size :%lu\n", cache_total_size); @@ -157,8 +157,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param goto out; ret = check_results(¶m, span, n); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); + if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n"); out: cmt_test_cleanup(); diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index fc31a61dab0c..89fe3ecbf497 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -160,6 +160,8 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param goto out; ret = check_results(); + if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n"); out: mba_test_cleanup(); diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c index d67ffa3ec63a..e12b4b06f6d5 100644 --- a/tools/testing/selftests/resctrl/mbm_test.c +++ b/tools/testing/selftests/resctrl/mbm_test.c @@ -129,8 +129,8 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param goto out; ret = check_results(DEFAULT_SPAN); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); + if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n"); out: mbm_test_cleanup(); diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 178fb2eab13a..038e1269a3fc 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -28,6 +28,7 @@ #define RESCTRL_PATH "/sys/fs/resctrl" #define PHYS_ID_PATH "/sys/devices/system/cpu/cpu" #define INFO_PATH "/sys/fs/resctrl/info" +#define SIZE_PATH "/sys/fs/resctrl/size" /* * CPU vendor IDs @@ -168,14 +169,16 @@ unsigned long create_bit_mask(unsigned int start, unsigned int len); unsigned int count_contiguous_bits(unsigned long val, unsigned int *start); int get_full_cbm(const char *cache_type, unsigned long *mask); int get_mask_no_shareable(const char *cache_type, unsigned long *mask); -int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size); int resource_info_unsigned_get(const char *resource, const char *filename, unsigned int *val); +int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size); +int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size); void ctrlc_handler(int signum, siginfo_t *info, void *ptr); int signal_handler_register(void); void signal_handler_unregister(void); void cat_test_cleanup(void); unsigned int count_bits(unsigned long n); void cmt_test_cleanup(void); +int snc_kernel_support(void); void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config); void perf_event_initialize_read_format(struct perf_event_read *pe_read); diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c index e4d3624a8817..dbd10cb7abf5 100644 --- a/tools/testing/selftests/resctrl/resctrlfs.c +++ b/tools/testing/selftests/resctrl/resctrlfs.c @@ -214,14 +214,14 @@ int snc_ways(void) } /* - * get_cache_size - Get cache size for a specified CPU + * get_sys_cache_size - Get cache size for a specified CPU * @cpu_no: CPU number * @cache_type: Cache level L2/L3 * @cache_size: pointer to cache_size * * Return: = 0 on success, < 0 on failure. */ -int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size) +int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size) { char cache_path[1024], cache_str[64]; int length, i, cache_num; @@ -273,6 +273,44 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size return 0; } +/* + * get_resctrl_cache_size - Get cache size as reported by resctrl + * @cache_type: Cache level L2/L3 + * @cache_size: pointer to cache_size + * + * Return: = 0 on success, < 0 on failure. + */ +int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size) +{ + char line[256], cache_prefix[16], *stripped_line, *token; + size_t len; + FILE *fp; + + strcpy(cache_prefix, cache_type); + strncat(cache_prefix, ":", 1); + + fp = fopen(SIZE_PATH, "r"); + if (!fp) { + ksft_print_msg("Failed to open %s : '%s'\n", + SIZE_PATH, strerror(errno)); + return -1; + } + + while (fgets(line, sizeof(line), fp)) { + stripped_line = strstr(line, cache_prefix); + + if (stripped_line) { + len = strlen(cache_prefix); + stripped_line += len; + token = strtok(stripped_line, ";"); + if (sscanf(token, "0=%lu", cache_size) <= 0) + return -1; + } + } + fclose(fp); + return 0; +} + #define CORE_SIBLINGS_PATH "/sys/bus/cpu/devices/cpu" /* @@ -935,3 +973,30 @@ unsigned int count_bits(unsigned long n) return count; } + +/** + * snc_kernel_support - Compare system reported cache size and resctrl + * reported cache size to get an idea if SNC is supported on the kernel side. + * If SNC is enabled and the kernel does support it the value should be equal. + * If the kernel doesn't support SNC the. + * + * Return: 0 if not supported, 1 if supported, < 0 on failure. + */ +int snc_kernel_support(void) +{ + unsigned long resctrl_cache_size, node_cache_size; + int ret; + + ret = get_sys_cache_size(0, "L3", &node_cache_size); + if (ret < 0) + return ret; + + ret = get_resctrl_cache_size("L3", &resctrl_cache_size); + if (ret < 0) + return ret; + + if (resctrl_cache_size == node_cache_size) + return 1; + + return 0; +}
Resctrl selftest prints a message on test failure that Sub-Numa Clustering (SNC) could be enabled and points the user to check theirs BIOS settings. No actual check is performed before printing that message so it is not very accurate in pinpointing a problem. Figuring out if SNC is enabled is only one part of the problem, the other being whether the kernel supports it. As there is no easy interface that simply states SNC support in the kernel one can find that information by comparing L3 cache sizes from different sources. Cache size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size will always show the full cache size even if it's split by enabled SNC. On the other hand /sys/fs/resctrl/size has information about L3 size, that with kernel support is adjusted for enabled SNC. Add a function to find a cache size from /sys/fs/resctrl/size since finding that information from the other source is already implemented. Add a function that compares the two cache sizes and use it to make the SNC support message more meaningful. Add the SNC support message just after MBA's check_results() since MBA shares code with MBM and also can suffer from enabled SNC if there is no support in the kernel. Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> --- tools/testing/selftests/resctrl/cat_test.c | 2 +- tools/testing/selftests/resctrl/cmt_test.c | 6 +- tools/testing/selftests/resctrl/mba_test.c | 2 + tools/testing/selftests/resctrl/mbm_test.c | 4 +- tools/testing/selftests/resctrl/resctrl.h | 5 +- tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++++++++++++++++- 6 files changed, 79 insertions(+), 9 deletions(-)