[4/7] node: Add memory caching attributes

Message ID: 20181114224921.12123-5-keith.busch@intel.com
State: New, archived
Series: ACPI HMAT memory sysfs representation

Commit Message

Keith Busch Nov. 14, 2018, 10:49 p.m. UTC
System memory may have side caches to help improve access speed. While
the system provided cache is transparent to the software accessing
these memory ranges, applications can optimize their own access based
on cache attributes.

In preparation for such systems, provide a new API for the kernel to
register these memory side caches under the memory node that provides it.

The kernel's sysfs representation is modeled from the cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
cacheinfo, though, a higher node's memory cache level is nearer to the
CPU, while lower levels are closer to the backing memory. Also unlike
CPU cache, the system handles flushing any dirty cached memory to the
last level of memory on a power failure if the range is persistent.

The exported attributes are the cache size, the line size, associativity,
and write back policy.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h |  23 ++++++++++
 2 files changed, 140 insertions(+)
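
For illustration, here is a minimal sketch of how a platform driver (an ACPI
HMAT parser, for instance) might register a memory-side cache through the API
this patch adds. The node id, the attribute values, and the function name are
invented for the example; the resulting sysfs path is inferred from the diff
below and assumes CONFIG_HMEM is enabled:

#include <linux/node.h>

/* Hypothetical caller, e.g. ACPI HMAT parsing code, node nid already online. */
static void example_register_memory_side_cache(unsigned int nid)
{
	struct node_cache_attrs cache = {
		.level		= 1,			/* level nearest to the CPU */
		.size		= 8ULL << 30,		/* 8 GiB cache capacity */
		.line_size	= 64,			/* bytes per cache line */
		.associativity	= NODE_CACHE_DIRECT_MAP,
		.write_policy	= NODE_CACHE_WRITE_BACK,
	};

	/*
	 * Creates /sys/devices/system/node/node<nid>/cache/index1/ with
	 * size, level, line_size, associativity and write_policy files.
	 */
	node_add_cache(nid, &cache);
}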

Comments

Dave Hansen Nov. 15, 2018, 12:40 a.m. UTC | #1
On 11/14/18 2:49 PM, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.
> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.
> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level the memory on a power failure if the range is persistent.
> 
> The exported attributes are the cache size, the line size, associativity,
> and write back policy.

Could you also include an example of the layout?
Anshuman Khandual Nov. 19, 2018, 4:14 a.m. UTC | #2
On 11/15/2018 04:19 AM, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.

Cache is not a separate memory attribute. It impacts the real attributes
like bandwidth and latency, which are already captured in the previous patch.
What is the purpose of adding this as a separate attribute? Can you explain
how this is going to help user space apart from the hints it has already
received with the bandwidth, latency etc. properties.

> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.

Under target memory node interface /sys/devices/system/node/nodeY/target* ?

> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level the memory on a power failure if the range is persistent.

Let's assume that a CPU has four levels of caches L1, L2, L3, L4 before
reaching memory. L4 is the backing cache for the memory and L1-L3 sit between
the CPU and the system bus. Hence some of them will be represented as CPU
caches and some of them will be represented as memory caches?

/sys/devices/system/cpu/cpuX/cache/ --> L1, L2, L3
/sys/devices/system/node/nodeY/target --> L4 

L4 will be listed even if the node is memory only ?
Keith Busch Nov. 19, 2018, 11:06 p.m. UTC | #3
On Mon, Nov 19, 2018 at 09:44:00AM +0530, Anshuman Khandual wrote:
> On 11/15/2018 04:19 AM, Keith Busch wrote:
> > System memory may have side caches to help improve access speed. While
> > the system provided cache is transparent to the software accessing
> > these memory ranges, applications can optimize their own access based
> > on cache attributes.
> 
> Cache is not a separate memory attribute. It impacts how the real attributes
> like bandwidth, latency e.g which are already captured in the previous patch.
> What is the purpose of adding this as a separate attribute ? Can you explain
> how this is going to help the user space apart from the hints it has already
> received with bandwidth, latency etc properties.

I am not sure I understand the question here. Access bandwidth and latency
are entirely different attributes from what this patch provides. If the
system side-caches memory, the associativity, line size, and total size
can optionally be used by software to improve performance.
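
As a rough illustration of that kind of optimization, a user-space program
could read the exported line size and size/align its hot allocations to whole
cache lines so that neighbouring objects do not share a memory-side cache
line. The sysfs path and the fallback value below are assumptions for the
sketch, not something this patch guarantees:

#include <stdio.h>
#include <stdlib.h>

/* Path assumed from this patch's layout; index1 is the level 1 cache. */
#define CACHE_ATTR_PATH "/sys/devices/system/node/node0/cache/index1/%s"

static long read_cache_attr(const char *name)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), CACHE_ATTR_PATH, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	long line = read_cache_attr("line_size");
	size_t sz = 1000;		/* requested object size */

	if (line <= 0)
		line = 64;		/* fall back to a common line size */

	/* Round up so neighbouring objects never share a cache line. */
	sz = (sz + line - 1) / line * line;

	void *obj = aligned_alloc(line, sz);
	printf("line_size=%ld, padded allocation of %zu bytes at %p\n",
	       line, sz, obj);
	free(obj);
	return 0;
}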
 
> > In preparation for such systems, provide a new API for the kernel to
> > register these memory side caches under the memory node that provides it.
> 
> Under target memory node interface /sys/devices/system/node/nodeY/target* ?

Yes.
 
> > 
> > The kernel's sysfs representation is modeled from the cpu cacheinfo
> > attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> > cacheinfo, though, a higher node's memory cache level is nearer to the
> > CPU, while lower levels are closer to the backing memory. Also unlike
> > CPU cache, the system handles flushing any dirty cached memory to the
> > last level the memory on a power failure if the range is persistent.
> 
> Lets assume that a CPU has got four levels of caches L1, L2, L3, L4 before
> reaching memory. L4 is the backing cache for the memory 

I don't quite understand this question either. The cache doesn't back
the memory; the system side-caches access to memory.

> and L1-L3 is from
> CPU till the system bus. Hence some of them will be represented as CPU
> caches and some of them will be represented as memory caches ?
>
> /sys/devices/system/cpu/cpuX/cache/ --> L1, L2, L3
> /sys/devices/system/node/nodeY/target --> L4 
> 
> L4 will be listed even if the node is memory only ?

The system provided memory side caches are independent of the
CPU. I'm just providing the CPU caches as a more familiar example to
compare/contrast system memory cache attributes.
Anshuman Khandual Nov. 22, 2018, 1:29 p.m. UTC | #4
On 11/20/2018 04:36 AM, Keith Busch wrote:
> On Mon, Nov 19, 2018 at 09:44:00AM +0530, Anshuman Khandual wrote:
>> On 11/15/2018 04:19 AM, Keith Busch wrote:
>>> System memory may have side caches to help improve access speed. While
>>> the system provided cache is transparent to the software accessing
>>> these memory ranges, applications can optimize their own access based
>>> on cache attributes.
>>
>> Cache is not a separate memory attribute. It impacts how the real attributes
>> like bandwidth, latency e.g which are already captured in the previous patch.
>> What is the purpose of adding this as a separate attribute ? Can you explain
>> how this is going to help the user space apart from the hints it has already
>> received with bandwidth, latency etc properties.
> 
> I am not sure I understand the question here. Access bandwidth and latency
> are entirely attributes different than what this patch provides. If the
> system side-caches memory, the associativity, line size, and total size
> can optionally be used by software to improve performance.

Okay, but then does this belong in this series, which is about memory attributes?
Keith Busch Nov. 26, 2018, 3:14 p.m. UTC | #5
On Thu, Nov 22, 2018 at 06:59:21PM +0530, Anshuman Khandual wrote:
> 
> 
> On 11/20/2018 04:36 AM, Keith Busch wrote:
> > On Mon, Nov 19, 2018 at 09:44:00AM +0530, Anshuman Khandual wrote:
> >> On 11/15/2018 04:19 AM, Keith Busch wrote:
> >>> System memory may have side caches to help improve access speed. While
> >>> the system provided cache is transparent to the software accessing
> >>> these memory ranges, applications can optimize their own access based
> >>> on cache attributes.
> >>
> >> Cache is not a separate memory attribute. It impacts how the real attributes
> >> like bandwidth, latency e.g which are already captured in the previous patch.
> >> What is the purpose of adding this as a separate attribute ? Can you explain
> >> how this is going to help the user space apart from the hints it has already
> >> received with bandwidth, latency etc properties.
> > 
> > I am not sure I understand the question here. Access bandwidth and latency
> > are entirely attributes different than what this patch provides. If the
> > system side-caches memory, the associativity, line size, and total size
> > can optionally be used by software to improve performance.
> 
> Okay but then does this belong to this series which about memory attributes ?

This patch series is about exporting memory attributes, and this system
memory caching is one such attribute, so yes, I think it belongs.
Greg Kroah-Hartman Nov. 26, 2018, 7:06 p.m. UTC | #6
On Wed, Nov 14, 2018 at 03:49:17PM -0700, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.
> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.
> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level the memory on a power failure if the range is persistent.
> 
> The exported attributes are the cache size, the line size, associativity,
> and write back policy.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/node.h |  23 ++++++++++
>  2 files changed, 140 insertions(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 232535761998..bb94f1d18115 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -60,6 +60,12 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>  
>  #ifdef CONFIG_HMEM
> +struct node_cache_obj {
> +	struct kobject kobj;
> +	struct list_head node;
> +	struct node_cache_attrs cache_attrs;
> +};

I know you all are off in the weeds designing some new crazy api for
this instead of this current proposal (sorry, I lost the thread, I'll
wait for the patches before commenting on it), but I do want to say one
thing here.

NEVER use a raw kobject as a child for a 'struct device' unless you
REALLY REALLY REALLY REALLY know what you are doing and have a VERY good
reason to do so.

Just use a 'struct device', otherwise you end up having to reinvent all
of the core logic that struct device provides you, like attribute
callbacks (which you had to create), and other good stuff like telling
userspace that a device has shown up so it knows to look at it.

That last one is key, a kobject is suddenly a "black hole" in sysfs as
far as userspace knows because it does not see them for the most part
(unless you are mucking around in the filesystem on your own, and
really, don't do that, use a library like the rest of the world unless
you really like reinventing everything, which, from your patchset it
feels like...)

Anyway, use 'struct device'.  That's all.

greg k-h
Greg Kroah-Hartman Nov. 26, 2018, 7:06 p.m. UTC | #7
On Wed, Nov 14, 2018 at 03:49:17PM -0700, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.
> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.
> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level the memory on a power failure if the range is persistent.
> 
> The exported attributes are the cache size, the line size, associativity,
> and write back policy.

You also didn't document your new sysfs attributes/layout in a
Documentation/ABI/ entry which is required for any sysfs change...

thanks,

greg k-h
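
For reference, such an entry (under Documentation/ABI/, with the exact file
name, date and wording being guesses rather than part of the posted series)
might look roughly like this, using the layout the diff below creates:

What:		/sys/devices/system/node/nodeX/cache/indexY/
Date:		November 2018
Contact:	Keith Busch <keith.busch@intel.com>
Description:
		Attributes of the level Y memory-side cache in front of the
		memory provided by node X: size and line_size in bytes, the
		cache level, and numeric associativity and write_policy
		values as defined in include/linux/node.h.
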
Keith Busch Nov. 26, 2018, 7:53 p.m. UTC | #8
On Mon, Nov 26, 2018 at 11:06:19AM -0800, Greg Kroah-Hartman wrote:
> On Wed, Nov 14, 2018 at 03:49:17PM -0700, Keith Busch wrote:
> > System memory may have side caches to help improve access speed. While
> > the system provided cache is transparent to the software accessing
> > these memory ranges, applications can optimize their own access based
> > on cache attributes.
> > 
> > In preparation for such systems, provide a new API for the kernel to
> > register these memory side caches under the memory node that provides it.
> > 
> > The kernel's sysfs representation is modeled from the cpu cacheinfo
> > attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> > cacheinfo, though, a higher node's memory cache level is nearer to the
> > CPU, while lower levels are closer to the backing memory. Also unlike
> > CPU cache, the system handles flushing any dirty cached memory to the
> > last level the memory on a power failure if the range is persistent.
> > 
> > The exported attributes are the cache size, the line size, associativity,
> > and write back policy.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/base/node.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/node.h |  23 ++++++++++
> >  2 files changed, 140 insertions(+)
> > 
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 232535761998..bb94f1d18115 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -60,6 +60,12 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
> >  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
> >  
> >  #ifdef CONFIG_HMEM
> > +struct node_cache_obj {
> > +	struct kobject kobj;
> > +	struct list_head node;
> > +	struct node_cache_attrs cache_attrs;
> > +};
> 
> I know you all are off in the weeds designing some new crazy api for
> this instead of this current proposal (sorry, I lost the thread, I'll
> wait for the patches before commenting on it), but I do want to say one
> thing here.
> 
> NEVER use a raw kobject as a child for a 'struct device' unless you
> REALLY REALLY REALLY REALLY know what you are doing and have a VERY good
> reason to do so.
> 
> Just use a 'struct device', otherwise you end up having to reinvent all
> of the core logic that struct device provides you, like attribute
> callbacks (which you had to create), and other good stuff like telling
> userspace that a device has shown up so it knows to look at it.
> 
> That last one is key, a kobject is suddenly a "black hole" in sysfs as
> far as userspace knows because it does not see them for the most part
> (unless you are mucking around in the filesystem on your own, and
> really, don't do that, use a library like the rest of the world unless
> you really like reinventing everything, which, from your patchset it
> feels like...)
> 
> Anyway, use 'struct device'.  That's all.
> 
> greg k-h

Okay, thank you for the advice. I prefer to reuse over reinvent. :)

I only used kobject because the power/ directory was automatically
created with 'struct device', but now I see there are better ways to
suppress that.
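
For context, a rough sketch of the struct-device shape Greg is pointing at:
one child device per cache entry, with attributes exposed through
DEVICE_ATTR_RO() so attribute callbacks and uevents come from the driver
core. This only illustrates the pattern; the names are made up and it is not
the reworked patch:

#include <linux/device.h>
#include <linux/node.h>
#include <linux/slab.h>

struct node_cache_info {
	struct device dev;
	struct list_head node;
	struct node_cache_attrs cache_attrs;
};
#define to_cache_info(d) container_of(d, struct node_cache_info, dev)

static ssize_t line_size_show(struct device *dev, struct device_attribute *attr,
			      char *buf)
{
	return sprintf(buf, "%u\n", to_cache_info(dev)->cache_attrs.line_size);
}
static DEVICE_ATTR_RO(line_size);
/* size, level, associativity and write_policy would follow the same pattern. */

static struct attribute *cache_attrs[] = {
	&dev_attr_line_size.attr,
	NULL,
};
ATTRIBUTE_GROUPS(cache);

static void node_cache_release(struct device *dev)
{
	kfree(to_cache_info(dev));
}

static int example_add_cache_dev(struct device *parent,
				 struct node_cache_attrs *attrs)
{
	struct node_cache_info *info;
	int rc;

	info = kzalloc(sizeof(*info), GFP_KERNEL);
	if (!info)
		return -ENOMEM;

	info->cache_attrs = *attrs;
	device_initialize(&info->dev);
	info->dev.parent = parent;		/* e.g. the node device */
	info->dev.release = node_cache_release;
	info->dev.groups = cache_groups;	/* generated by ATTRIBUTE_GROUPS() */

	rc = dev_set_name(&info->dev, "index%d", attrs->level);
	if (!rc)
		rc = device_add(&info->dev);	/* emits KOBJ_ADD for userspace */
	if (rc)
		put_device(&info->dev);		/* ->release() frees info */
	return rc;
}
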
Patch

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 232535761998..bb94f1d18115 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -60,6 +60,12 @@  static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #ifdef CONFIG_HMEM
+struct node_cache_obj {
+	struct kobject kobj;
+	struct list_head node;
+	struct node_cache_attrs cache_attrs;
+};
+
 const struct attribute_group node_access_attrs_group;
 
 #define ACCESS_ATTR(name) 						\
@@ -101,6 +107,115 @@  void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs)
 		pr_info("failed to add performance attribute group to node %d\n",
 			nid);
 }
+
+struct cache_attribute_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct node_cache_attrs *, char *);
+};
+
+#define CACHE_ATTR(name, fmt) 						\
+static ssize_t name##_show(struct node_cache_attrs *cache,		\
+			   char *buf)					\
+{									\
+	return sprintf(buf, fmt "\n", cache->name);			\
+}									\
+static struct cache_attribute_entry cache_attr_##name = __ATTR_RO(name);
+
+CACHE_ATTR(size, "%lld")
+CACHE_ATTR(level, "%d")
+CACHE_ATTR(line_size, "%d")
+CACHE_ATTR(associativity, "%d")
+CACHE_ATTR(write_policy, "%d")
+
+static struct attribute *cache_attrs[] = {
+	&cache_attr_level.attr,
+	&cache_attr_associativity.attr,
+	&cache_attr_size.attr,
+	&cache_attr_line_size.attr,
+	&cache_attr_write_policy.attr,
+	NULL,
+};
+
+static ssize_t cache_attr_show(struct kobject *kobj, struct attribute *attr,
+			       char *page)
+{
+	struct cache_attribute_entry *entry =
+			container_of(attr, struct cache_attribute_entry, attr);
+	struct node_cache_obj *cache_obj =
+			container_of(kobj, struct node_cache_obj, kobj);
+	return entry->show(&cache_obj->cache_attrs, page);
+}
+
+static const struct sysfs_ops cache_ops = {
+	.show	= &cache_attr_show,
+};
+
+static struct kobj_type cache_ktype = {
+	.default_attrs	= cache_attrs,
+	.sysfs_ops	= &cache_ops,
+};
+
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
+{
+	struct node_cache_obj *cache_obj;
+	struct node *node;
+
+	if (!node_online(nid) || !node_devices[nid])
+		return;
+
+	node = node_devices[nid];
+	list_for_each_entry(cache_obj, &node->cache_attrs, node) {
+		if (cache_obj->cache_attrs.level == cache_attrs->level) {
+			dev_warn(&node->dev,
+				"attempt to add duplicate cache level:%d\n",
+				cache_attrs->level);
+			return;
+		}
+	}
+
+	if (!node->cache_kobj)
+		node->cache_kobj = kobject_create_and_add("cache",
+							  &node->dev.kobj);
+	if (!node->cache_kobj)
+		return;
+
+	cache_obj = kzalloc(sizeof(*cache_obj), GFP_KERNEL);
+	if (!cache_obj)
+		return;
+
+	cache_obj->cache_attrs = *cache_attrs;
+	if (kobject_init_and_add(&cache_obj->kobj, &cache_ktype, node->cache_kobj,
+				 "index%d", cache_attrs->level)) {
+		dev_warn(&node->dev, "failed to add cache level:%d\n",
+			 cache_attrs->level);
+		kfree(cache_obj);
+		return;
+	}
+	list_add_tail(&cache_obj->node, &node->cache_attrs);
+}
+
+static void node_remove_caches(struct node *node)
+{
+	struct node_cache_obj *obj, *next;
+
+	if (!node->cache_kobj)
+		return;
+
+	list_for_each_entry_safe(obj, next, &node->cache_attrs, node) {
+		list_del(&obj->node);
+		kobject_put(&obj->kobj);
+		kfree(obj);
+	}
+	kobject_put(node->cache_kobj);
+}
+
+static void node_init_caches(unsigned int nid)
+{
+	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
+}
+#else
+static void node_init_caches(unsigned int nid) { }
+static void node_remove_caches(struct node *node) { }
 #endif
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
@@ -345,6 +460,7 @@  static void node_device_release(struct device *dev)
 	 */
 	flush_work(&node->node_work);
 #endif
+	node_remove_caches(node);
 	kfree(node);
 }
 
@@ -658,6 +774,7 @@  int __register_one_node(int nid)
 
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
+	node_init_caches(nid);
 
 	return error;
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index 6a1aa6a153f8..f499a17f84bc 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -36,6 +36,27 @@  struct node_hmem_attrs {
 	unsigned int write_latency;
 };
 void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs);
+
+enum cache_associativity {
+	NODE_CACHE_DIRECT_MAP,
+	NODE_CACHE_INDEXED,
+	NODE_CACHE_OTHER,
+};
+
+enum cache_write_policy {
+	NODE_CACHE_WRITE_BACK,
+	NODE_CACHE_WRITE_THROUGH,
+	NODE_CACHE_WRITE_OTHER,
+};
+
+struct node_cache_attrs {
+	enum cache_associativity associativity;
+	enum cache_write_policy write_policy;
+	u64 size;
+	u16 line_size;
+	u8  level;
+};
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
 #endif
 
 struct node {
@@ -46,6 +67,8 @@  struct node {
 #endif
 #ifdef CONFIG_HMEM
 	struct node_hmem_attrs hmem_attrs;
+	struct list_head cache_attrs;
+	struct kobject *cache_kobj;
 #endif
 };