Message ID | 20190116175804.30196-11-keith.busch@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Heterogeneous memory node attributes |
On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> System memory may have side caches to help improve access speed to
> frequently requested address ranges. While the system provided cache is
> transparent to the software accessing these memory ranges, applications
> can optimize their own access based on cache attributes.

[...]

> +static void node_init_cache_dev(struct node *node)
> +{
> +        struct device *dev;
> +
> +        dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +        if (!dev)
> +                return;
> +
> +        dev->parent = &node->dev;
> +        dev->release = node_cache_release;
> +        if (dev_set_name(dev, "side_cache"))
> +                goto free_dev;
> +
> +        if (device_register(dev))
> +                goto free_name;
> +
> +        pm_runtime_no_callbacks(dev);
> +        node->cache_dev = dev;
> +        return;

I would add an empty line here.

> +free_name:
> +        kfree_const(dev->kobj.name);
> +free_dev:
> +        kfree(dev);
> +}
> +
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
> +{

[...]

> +        node = node_devices[nid];
> +        list_for_each_entry(info, &node->cache_attrs, node) {
> +                if (info->cache_attrs.level == cache_attrs->level) {
> +                        dev_warn(&node->dev,
> +                                "attempt to add duplicate cache level:%d\n",
> +                                cache_attrs->level);

I'd suggest using dev_dbg() for this and I'm not even sure if printing
the message is worth the effort.  Firmware will probably give you
duplicates and users cannot do much about fixing that anyway.

> +                        return;
> +                }
> +        }

[...]

> +        pm_runtime_no_callbacks(dev);
> +        list_add_tail(&info->node, &node->cache_attrs);
> +        return;

Again, I'd add an empty line here.

> +free_name:
> +        kfree_const(dev->kobj.name);
> +free_cache:
> +        kfree(info);
> +}

[...]

> +static inline void node_add_cache(unsigned int nid,
> +                                  struct node_cache_attrs *cache_attrs)
> +{
> +}

And does this really build with CONFIG_HMEM_REPORTING unset?

> #endif

[...]
Hello Keith,

Could we ever have a single side cache in front of two NUMA nodes? I
don't see a way to find that out in the current implementation. Would we
have an "id" and/or "nodemap" bitmask in the sidecache structure?

Thanks

Brice


On 16/01/2019 at 18:58, Keith Busch wrote:
> System memory may have side caches to help improve access speed to
> frequently requested address ranges. While the system provided cache is
> transparent to the software accessing these memory ranges, applications
> can optimize their own access based on cache attributes.

[...]
On Sat, 9 Feb 2019 09:20:53 +0100
Brice Goglin <Brice.Goglin@inria.fr> wrote:

> Hello Keith
>
> Could we ever have a single side cache in front of two NUMA nodes? I
> don't see a way to find that out in the current implementation. Would we
> have an "id" and/or "nodemap" bitmask in the sidecache structure?

This is certainly a possible thing for hardware to do.

ACPI IIRC doesn't provide any means of representing that - your best
option is to represent it as two different entries, one for each of the
memory nodes.  Interesting question of whether you would then claim
they were half as big each, or the full size.  Of course, there are
other possible ways to get this info beyond HMAT, so perhaps the
interface should allow it to be exposed if available?

Also, don't know if it's just me, but calling these sidecaches is
downright confusing.  In ACPI at least they are always specifically
referred to as Memory Side Caches.  I'd argue there should even be a
hyphen: Memory-Side Caches, the point being that they are on the memory
side of the interconnect rather than the processor side.  Of course an
implementation choice might be to put them off to the side (as implied
by sidecaches) in some sense, but it's not the only one.

</terminology rant> :)

Jonathan

>
> Thanks
>
> Brice
>
> On 16/01/2019 at 18:58, Keith Busch wrote:
> > System memory may have side caches to help improve access speed to
> > frequently requested address ranges. While the system provided cache is
> > transparent to the software accessing these memory ranges, applications
> > can optimize their own access based on cache attributes.

[...]
On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
> On Sat, 9 Feb 2019 09:20:53 +0100
> Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> > Hello Keith
> >
> > Could we ever have a single side cache in front of two NUMA nodes? I
> > don't see a way to find that out in the current implementation. Would we
> > have an "id" and/or "nodemap" bitmask in the sidecache structure?
>
> This is certainly a possible thing for hardware to do.
>
> ACPI IIRC doesn't provide any means of representing that - your best
> option is to represent it as two different entries, one for each of the
> memory nodes.  Interesting question of whether you would then claim
> they were half as big each, or the full size.  Of course, there are
> other possible ways to get this info beyond HMAT, so perhaps the
> interface should allow it to be exposed if available?

HMAT doesn't do this, but I want this interface abstracted enough from
HMAT to express whatever is necessary.

The CPU cache is the closest existing exported attributes to this,
and they provide "shared_cpu_list". To that end, I can export a
"shared_node_list", though previous reviews strongly disliked multi-value
sysfs entries. :(

Would shared-node symlinks capture the need, and more acceptable?

> Also, don't know if it's just me, but calling these sidecaches is
> downright confusing.  In ACPI at least they are always specifically
> referred to as Memory Side Caches.  I'd argue there should even be a
> hyphen: Memory-Side Caches, the point being that they are on the memory
> side of the interconnect rather than the processor side.  Of course an
> implementation choice might be to put them off to the side (as implied
> by sidecaches) in some sense, but it's not the only one.
>
> </terminology rant> :)

Now that you mention it, I agree "side" is ambiguous. Maybe call it
"numa_cache" or "node_cache"?
On 11/02/2019 at 16:23, Keith Busch wrote:
> On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
>> On Sat, 9 Feb 2019 09:20:53 +0100
>> Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>
>>> Hello Keith
>>>
>>> Could we ever have a single side cache in front of two NUMA nodes? I
>>> don't see a way to find that out in the current implementation. Would we
>>> have an "id" and/or "nodemap" bitmask in the sidecache structure?
>> This is certainly a possible thing for hardware to do.
>>
>> ACPI IIRC doesn't provide any means of representing that - your best
>> option is to represent it as two different entries, one for each of the
>> memory nodes.  Interesting question of whether you would then claim
>> they were half as big each, or the full size.  Of course, there are
>> other possible ways to get this info beyond HMAT, so perhaps the
>> interface should allow it to be exposed if available?
> HMAT doesn't do this, but I want this interface abstracted enough from
> HMAT to express whatever is necessary.
>
> The CPU cache is the closest existing exported attributes to this,
> and they provide "shared_cpu_list". To that end, I can export a
> "shared_node_list", though previous reviews strongly disliked multi-value
> sysfs entries. :(
>
> Would shared-node symlinks capture the need, and more acceptable?

As a user-space guy reading these files/symlinks, I would prefer reading
a bitmask just like we do for CPU cache "cpumap" or CPU "siblings" files
(or sibling_list). Reading a directory and looking for dentries matching
"foo%d" is far less convenient in C.

If all these files are inside a dedicated subdirectory, it's better but
still not as easy.

Brice
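To make the trade-off above concrete, a rough user-space sketch follows.
Neither attribute exists in this patch: the list-style "shared_nodes" file
and the per-node "nodeN" entries are hypothetical stand-ins for the two
proposals (a shared_cpu_list-style file vs. scanning dentries), and the
side_cache/indexY path is only what the patch would create for a node's
own caches.

/*
 * Illustrative sketch only -- "shared_nodes" and "node%u" entries are
 * hypothetical; this simply contrasts one file read against a dentry scan.
 */
#include <dirent.h>
#include <stdio.h>

/* Option A: one read of a list-style file, e.g. "0-1" or "0,2". */
static void read_shared_nodes_list(const char *cache_dir)
{
        char path[256], buf[64];
        FILE *f;

        snprintf(path, sizeof(path), "%s/shared_nodes", cache_dir);
        f = fopen(path, "r");
        if (!f)
                return;
        if (fgets(buf, sizeof(buf), f))
                printf("shared with nodes: %s", buf);
        fclose(f);
}

/* Option B: scan the directory for per-node entries or symlinks. */
static void scan_shared_node_links(const char *cache_dir)
{
        struct dirent *d;
        unsigned int nid;
        DIR *dir;

        dir = opendir(cache_dir);
        if (!dir)
                return;
        while ((d = readdir(dir)) != NULL) {
                if (sscanf(d->d_name, "node%u", &nid) == 1)
                        printf("shared with node %u\n", nid);
        }
        closedir(dir);
}

int main(void)
{
        /* Hypothetical path; the patch only defines nodeX/side_cache/indexY. */
        const char *dir = "/sys/devices/system/node/node0/side_cache/index1";

        read_shared_nodes_list(dir);
        scan_shared_node_links(dir);
        return 0;
}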
On Mon, 11 Feb 2019 08:23:04 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
> > On Sat, 9 Feb 2019 09:20:53 +0100
> > Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >
> > > Hello Keith
> > >
> > > Could we ever have a single side cache in front of two NUMA nodes? I
> > > don't see a way to find that out in the current implementation. Would we
> > > have an "id" and/or "nodemap" bitmask in the sidecache structure?
> >
> > This is certainly a possible thing for hardware to do.
> >
> > ACPI IIRC doesn't provide any means of representing that - your best
> > option is to represent it as two different entries, one for each of the
> > memory nodes.  Interesting question of whether you would then claim
> > they were half as big each, or the full size.  Of course, there are
> > other possible ways to get this info beyond HMAT, so perhaps the
> > interface should allow it to be exposed if available?
>
> HMAT doesn't do this, but I want this interface abstracted enough from
> HMAT to express whatever is necessary.
>
> The CPU cache is the closest existing exported attributes to this,
> and they provide "shared_cpu_list". To that end, I can export a
> "shared_node_list", though previous reviews strongly disliked multi-value
> sysfs entries. :(
>
> Would shared-node symlinks capture the need, and more acceptable?

My inclination is that it's better to follow an existing pattern than
invent a new one that breaks people's expectations.

However, don't feel that strongly about it as long as the interface
is functional and intuitive.

> > Also, don't know if it's just me, but calling these sidecaches is
> > downright confusing.  In ACPI at least they are always specifically
> > referred to as Memory Side Caches.  I'd argue there should even be a
> > hyphen: Memory-Side Caches, the point being that they are on the memory
> > side of the interconnect rather than the processor side.  Of course an
> > implementation choice might be to put them off to the side (as implied
> > by sidecaches) in some sense, but it's not the only one.
> >
> > </terminology rant> :)
>
> Now that you mention it, I agree "side" is ambiguous. Maybe call it
> "numa_cache" or "node_cache"?

I'm not sure any of the options work well.  My inclination would be to
use the full name and keep the somewhat redundant "memory" there.  The
other two feel like they could just as easily be coherent caches at
accelerators for example... memory_side_cache?

The fun of naming ;)

Jonathan
On Tue, Feb 12, 2019 at 08:49:03AM +0000, Jonathan Cameron wrote:
> On Mon, 11 Feb 2019 08:23:04 -0700
> Keith Busch <keith.busch@intel.com> wrote:
>
> > On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
> > > On Sat, 9 Feb 2019 09:20:53 +0100
> > > Brice Goglin <Brice.Goglin@inria.fr> wrote:
> > >
> > > > Hello Keith
> > > >
> > > > Could we ever have a single side cache in front of two NUMA nodes? I
> > > > don't see a way to find that out in the current implementation. Would we
> > > > have an "id" and/or "nodemap" bitmask in the sidecache structure?
> > >
> > > This is certainly a possible thing for hardware to do.
> > >
> > > ACPI IIRC doesn't provide any means of representing that - your best
> > > option is to represent it as two different entries, one for each of the
> > > memory nodes.  Interesting question of whether you would then claim
> > > they were half as big each, or the full size.  Of course, there are
> > > other possible ways to get this info beyond HMAT, so perhaps the
> > > interface should allow it to be exposed if available?
> >
> > HMAT doesn't do this, but I want this interface abstracted enough from
> > HMAT to express whatever is necessary.
> >
> > The CPU cache is the closest existing exported attributes to this,
> > and they provide "shared_cpu_list". To that end, I can export a
> > "shared_node_list", though previous reviews strongly disliked multi-value
> > sysfs entries. :(
> >
> > Would shared-node symlinks capture the need, and more acceptable?
>
> My inclination is that it's better to follow an existing pattern than
> invent a new one that breaks people's expectations.
>
> However, don't feel that strongly about it as long as the interface
> is functional and intuitive.

Okay, considering I'd have a difficult time testing such an interface
since it doesn't apply to HMAT, and I've received only conflicting
feedback on list attributes, I would prefer to leave this feature out
of this series for now. I'm certainly not against adding it later.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1e909f61e8b1..7ff3ed566d7d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -191,6 +191,146 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
 		pr_info("failed to add performance attribute group to node %d\n",
 			nid);
 }
+
+struct node_cache_info {
+	struct device dev;
+	struct list_head node;
+	struct node_cache_attrs cache_attrs;
+};
+#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
+
+#define CACHE_ATTR(name, fmt) \
+static ssize_t name##_show(struct device *dev, \
+			   struct device_attribute *attr, \
+			   char *buf) \
+{ \
+	return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
+} \
+DEVICE_ATTR_RO(name);
+
+CACHE_ATTR(size, "%llu")
+CACHE_ATTR(level, "%u")
+CACHE_ATTR(line_size, "%u")
+CACHE_ATTR(associativity, "%u")
+CACHE_ATTR(write_policy, "%u")
+
+static struct attribute *cache_attrs[] = {
+	&dev_attr_level.attr,
+	&dev_attr_associativity.attr,
+	&dev_attr_size.attr,
+	&dev_attr_line_size.attr,
+	&dev_attr_write_policy.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(cache);
+
+static void node_cache_release(struct device *dev)
+{
+	kfree(dev);
+}
+
+static void node_cacheinfo_release(struct device *dev)
+{
+	struct node_cache_info *info = to_cache_info(dev);
+	kfree(info);
+}
+
+static void node_init_cache_dev(struct node *node)
+{
+	struct device *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return;
+
+	dev->parent = &node->dev;
+	dev->release = node_cache_release;
+	if (dev_set_name(dev, "side_cache"))
+		goto free_dev;
+
+	if (device_register(dev))
+		goto free_name;
+
+	pm_runtime_no_callbacks(dev);
+	node->cache_dev = dev;
+	return;
+free_name:
+	kfree_const(dev->kobj.name);
+free_dev:
+	kfree(dev);
+}
+
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
+{
+	struct node_cache_info *info;
+	struct device *dev;
+	struct node *node;
+
+	if (!node_online(nid) || !node_devices[nid])
+		return;
+
+	node = node_devices[nid];
+	list_for_each_entry(info, &node->cache_attrs, node) {
+		if (info->cache_attrs.level == cache_attrs->level) {
+			dev_warn(&node->dev,
+				"attempt to add duplicate cache level:%d\n",
+				cache_attrs->level);
+			return;
+		}
+	}
+
+	if (!node->cache_dev)
+		node_init_cache_dev(node);
+	if (!node->cache_dev)
+		return;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return;
+
+	dev = &info->dev;
+	dev->parent = node->cache_dev;
+	dev->release = node_cacheinfo_release;
+	dev->groups = cache_groups;
+	if (dev_set_name(dev, "index%d", cache_attrs->level))
+		goto free_cache;
+
+	info->cache_attrs = *cache_attrs;
+	if (device_register(dev)) {
+		dev_warn(&node->dev, "failed to add cache level:%d\n",
+			 cache_attrs->level);
+		goto free_name;
+	}
+	pm_runtime_no_callbacks(dev);
+	list_add_tail(&info->node, &node->cache_attrs);
+	return;
+free_name:
+	kfree_const(dev->kobj.name);
+free_cache:
+	kfree(info);
+}
+
+static void node_remove_caches(struct node *node)
+{
+	struct node_cache_info *info, *next;
+
+	if (!node->cache_dev)
+		return;
+
+	list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
+		list_del(&info->node);
+		device_unregister(&info->dev);
+	}
+	device_unregister(node->cache_dev);
+}
+
+static void node_init_caches(unsigned int nid)
+{
+	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
+}
+#else
+static void node_init_caches(unsigned int nid) { }
+static void node_remove_caches(struct node *node) { }
 #endif
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
@@ -475,6 +615,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node); /* no-op, if memoryless node */
 	node_remove_classes(node);
+	node_remove_caches(node);
 	device_unregister(&node->dev);
 }
 
@@ -755,6 +896,7 @@ int __register_one_node(int nid)
 	INIT_LIST_HEAD(&node_devices[nid]->class_list);
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
+	node_init_caches(nid);
 
 	return error;
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index e22940a593c2..8cdf2b2808e4 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -37,12 +37,47 @@ struct node_hmem_attrs {
 };
 void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
 			 unsigned class);
+
+enum cache_associativity {
+	NODE_CACHE_DIRECT_MAP,
+	NODE_CACHE_INDEXED,
+	NODE_CACHE_OTHER,
+};
+
+enum cache_write_policy {
+	NODE_CACHE_WRITE_BACK,
+	NODE_CACHE_WRITE_THROUGH,
+	NODE_CACHE_WRITE_OTHER,
+};
+
+/**
+ * struct node_cache_attrs - system memory caching attributes
+ *
+ * @associativity: The ways memory blocks may be placed in cache
+ * @write_policy: Write back or write through policy
+ * @size: Total size of cache in bytes
+ * @line_size: Number of bytes fetched on a cache miss
+ * @level: Represents the cache hierarchy level
+ */
+struct node_cache_attrs {
+	enum cache_associativity associativity;
+	enum cache_write_policy write_policy;
+	u64 size;
+	u16 line_size;
+	u8 level;
+};
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
 #else
 static inline void node_set_perf_attrs(unsigned int nid,
 				       struct node_hmem_attrs *hmem_attrs,
 				       unsigned class)
 {
 }
+
+static inline void node_add_cache(unsigned int nid,
+				  struct node_cache_attrs *cache_attrs)
+{
+}
 #endif
 
 struct node {
@@ -51,6 +86,10 @@ struct node {
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct node_work;
 #endif
+#ifdef CONFIG_HMEM_REPORTING
+	struct list_head cache_attrs;
+	struct device *cache_dev;
+#endif
 };
 
 struct memory_block;
System memory may have side caches to help improve access speed to
frequently requested address ranges. While the system provided cache is
transparent to the software accessing these memory ranges, applications
can optimize their own access based on cache attributes.

Provide a new API for the kernel to register these memory side caches
under the memory node that provides it.

The new sysfs representation is modeled on the existing CPU cacheinfo
attributes (/sys/devices/system/cpu/cpuX/cache/), with a node's caches
exported under /sys/devices/system/node/nodeX/side_cache/.
Unlike CPU cacheinfo, though, the node cache level is reported from
the view of the memory. A higher number is nearer to the CPU, while
lower levels are closer to the backing memory. Also unlike CPU cache,
it is assumed the system will handle flushing any dirty cached memory
to the last level on a power failure if the range is persistent memory.

The attributes we export are the cache size, the line size, associativity,
and write back policy.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h |  39 ++++++++++++++
 2 files changed, 181 insertions(+)
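For reference, the registration code in this patch creates one "side_cache"
device under each memory node's device, with an "index%d" child per reported
cache level. Assuming node 0 with a single level-1 memory side cache (the
node number and level here are illustrative), the resulting layout would
look roughly like:

/sys/devices/system/node/node0/side_cache/index1/associativity
/sys/devices/system/node/node0/side_cache/index1/level
/sys/devices/system/node/node0/side_cache/index1/line_size
/sys/devices/system/node/node0/side_cache/index1/size
/sys/devices/system/node/node0/side_cache/index1/write_policy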