[PATCHv3,07/13] node: Add heterogeneous memory access attributes

Message ID 20190109174341.19818-8-keith.busch@intel.com
State New
Series
  • Heterogeneous memory node attributes

Commit Message

Keith Busch Jan. 9, 2019, 5:43 p.m. UTC
Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Provide a new kernel interface for
subsystems to register the attributes under the memory target node's
initiator access class. If the system provides this information, applications
may query these attributes when deciding which node to request memory from.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree -P "read*|write*" /sys/devices/system/node/nodeY/classZ/
  /sys/devices/system/node/nodeY/classZ/
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

The bandwidth is exported as MB/s and latency is reported in nanoseconds.
Memory accesses from an initiator node that is not one of the memory's
class "Z" initiator nodes may encounter different performance than
reported here. When a subsystem makes use of this interface, initiators
of a lower class number, "Z", have better performance relative to higher
class numbers. When provided, class 0 is the highest performing access
class.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/Kconfig |  8 ++++++++
 drivers/base/node.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h | 25 +++++++++++++++++++++++++
 3 files changed, 81 insertions(+)
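
The sysfs layout described in the commit message could be consumed from userspace roughly as follows. This is a minimal sketch, assuming the nodeY/classZ paths and the "%u\n" value format shown in this patch; the helper names (hmem_attr_path, hmem_parse_attr, hmem_read_attr) are hypothetical and not part of the series.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Base directory taken from the commit message. */
#define NODE_SYSFS_BASE "/sys/devices/system/node"

/* The four attribute files this patch exports per access class. */
static const char *const hmem_attr_names[] = {
	"read_bandwidth", "read_latency", "write_bandwidth", "write_latency",
};

/* Return 1 if attr is one of the four exported names, else 0. */
static int hmem_attr_known(const char *attr)
{
	size_t i;

	for (i = 0; i < sizeof(hmem_attr_names) / sizeof(hmem_attr_names[0]); i++)
		if (strcmp(attr, hmem_attr_names[i]) == 0)
			return 1;
	return 0;
}

/* Build "<base>/nodeY/classZ/<attr>"; 0 on success, -1 on error. */
static int hmem_attr_path(char *buf, size_t len, const char *base,
			  unsigned int node, unsigned int access_class,
			  const char *attr)
{
	int n;

	assert(buf && base && attr);
	if (!hmem_attr_known(attr))
		return -1;
	n = snprintf(buf, len, "%s/node%u/class%u/%s",
		     base, node, access_class, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Parse one attribute value (decimal, optional newline); 0 or -1. */
static int hmem_parse_attr(const char *text, unsigned int *out)
{
	char *end;
	unsigned long v;

	assert(text && out);
	if (*text < '0' || *text > '9')
		return -1;
	errno = 0;
	v = strtoul(text, &end, 10);
	if (errno || (*end != '\n' && *end != '\0') || v > UINT_MAX)
		return -1;
	*out = (unsigned int)v;
	return 0;
}

/* Read one attribute for (node, class); -1 if absent or malformed. */
static int hmem_read_attr(const char *base, unsigned int node,
			  unsigned int access_class, const char *attr,
			  unsigned int *out)
{
	char path[256], text[32];
	FILE *f;
	int ret = -1;

	if (hmem_attr_path(path, sizeof(path), base, node, access_class, attr))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;	/* platform did not report this attribute */
	if (fgets(text, sizeof(text), f))
		ret = hmem_parse_attr(text, out);
	fclose(f);
	return ret;
}
```

A caller would invoke, for example, hmem_read_attr(NODE_SYSFS_BASE, 1, 0, "read_latency", &ns) and treat a -1 return as "not reported", since the attributes only exist when a subsystem has registered them.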

Comments

Aneesh Kumar K.V Jan. 10, 2019, 12:37 p.m. UTC | #1
Keith Busch <keith.busch@intel.com> writes:

> The bandwidth is exported as MB/s and latency is reported in nanoseconds.
> Memory accesses from an initiator node that is not one of the memory's
> class "Z" initiator nodes may encounter different performance than
> reported here. When a subsystem makes use of this interface, initiators
> of a lower class number, "Z", have better performance relative to higher
> class numbers. When provided, class 0 is the highest performing access
> class.

How does the definition of performance relate to bandwidth and latency here? Does
the initiator in this class have the lowest latency and the highest bandwidth? Can
there be a scenario where both are not best for the same node? I.e., for a
target Node Y, initiator Node A gives the highest bandwidth but initiator
Node B gets the lowest latency. How can such a config be represented? Or is
that not possible?

-aneesh
Keith Busch Jan. 10, 2019, 5:30 p.m. UTC | #2
On Thu, Jan 10, 2019 at 06:07:02PM +0530, Aneesh Kumar K.V wrote:
> How does the definition of performance relate to bandwidth and latency here? Does
> the initiator in this class have the lowest latency and the highest bandwidth? Can
> there be a scenario where both are not best for the same node? I.e., for a
> target Node Y, initiator Node A gives the highest bandwidth but initiator
> Node B gets the lowest latency. How can such a config be represented? Or is
> that not possible?

I am not aware of a real platform that has an initiator-target pair with
better latency but worse bandwidth than any different initiator paired to
the same target. If such a thing exists and a subsystem wants to report
that, you can register any arbitrary number of groups or classes and
rank them according to how you want them presented.
Jonathan Cameron Jan. 11, 2019, 11:32 a.m. UTC | #3
On Thu, 10 Jan 2019 10:30:17 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Thu, Jan 10, 2019 at 06:07:02PM +0530, Aneesh Kumar K.V wrote:
> > How does the definition of performance relate to bandwidth and latency here? Does
> > the initiator in this class have the lowest latency and the highest bandwidth? Can
> > there be a scenario where both are not best for the same node? I.e., for a
> > target Node Y, initiator Node A gives the highest bandwidth but initiator
> > Node B gets the lowest latency. How can such a config be represented? Or is
> > that not possible?
> 
> I am not aware of a real platform that has an initiator-target pair with
> better latency but worse bandwidth than any different initiator paired to
> the same target. If such a thing exists and a subsystem wants to report
> that, you can register any arbitrary number of groups or classes and
> rank them according to how you want them presented.
> 

It's certainly possible if you are trading off against pin count by going
out of the SoC on a serial bus for some large SCM pool and also have a local
SCM pool on a DDR-like bus, or just DDR on a fairly small number of channels
(because someone didn't put memory on all of them).
We will see this fairly soon in production parts.

So we need an 'ordering' choice for this circumstance that is predictable.

Jonathan
Keith Busch Jan. 11, 2019, 3:58 p.m. UTC | #4
On Fri, Jan 11, 2019 at 11:32:38AM +0000, Jonathan Cameron wrote:
> On Thu, 10 Jan 2019 10:30:17 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> > I am not aware of a real platform that has an initiator-target pair with
> > better latency but worse bandwidth than any different initiator paired to
> > the same target. If such a thing exists and a subsystem wants to report
> > that, you can register any arbitrary number of groups or classes and
> > rank them according to how you want them presented.
> > 
> 
> It's certainly possible if you are trading off against pin count by going
> out of the SoC on a serial bus for some large SCM pool and also have a local
> SCM pool on a DDR-like bus, or just DDR on a fairly small number of channels
> (because someone didn't put memory on all of them).
> We will see this fairly soon in production parts.
> 
> So we need an 'ordering' choice for this circumstance that is predictable.

As long as the reported memory target access attributes are accurate for
the initiator nodes listed under an access class, I'm not sure that it
matters what order you use. All the information needed to make a choice
on which pair to use is available, and the order is just an implementation
specific decision.
Dan Williams Jan. 11, 2019, 4:25 p.m. UTC | #5
On Fri, Jan 11, 2019 at 7:59 AM Keith Busch <keith.busch@intel.com> wrote:
>
> On Fri, Jan 11, 2019 at 11:32:38AM +0000, Jonathan Cameron wrote:
> > On Thu, 10 Jan 2019 10:30:17 -0700
> > Keith Busch <keith.busch@intel.com> wrote:
> > > I am not aware of a real platform that has an initiator-target pair with
> > > better latency but worse bandwidth than any different initiator paired to
> > > the same target. If such a thing exists and a subsystem wants to report
> > > that, you can register any arbitrary number of groups or classes and
> > > rank them according to how you want them presented.
> > >
> >
> > It's certainly possible if you are trading off against pin count by going
> > out of the SoC on a serial bus for some large SCM pool and also have a local
> > SCM pool on a DDR-like bus, or just DDR on a fairly small number of channels
> > (because someone didn't put memory on all of them).
> > We will see this fairly soon in production parts.
> >
> > So we need an 'ordering' choice for this circumstance that is predictable.
>
> As long as the reported memory target access attributes are accurate for
> the initiator nodes listed under an access class, I'm not sure that it
> matters what order you use. All the information needed to make a choice
> on which pair to use is available, and the order is just an implementation
> specific decision.

Agree with Keith. If the performance is differentiated it will be in a
separate class. A hierarchy of classes is not enforced by the
interface, but it tries to advertise some semblance of the "best"
initiator pairing for a given target by default with the flexibility
to go more complex if the situation arises.

As was seen in the SCSI specification efforts to advertise all manner
of cache hinting, the kernel community discovered that only a small
fraction of what hardware vendors thought mattered actually
demonstrated value in practice. That experience suggests that
kernel interfaces for hardware performance hints should prioritize
what makes sense for the kernel and applications generally, not
necessarily every conceivable performance detail that a hardware
platform chooses to expose or that niche applications might consume.
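
The case raised in this thread, where one access class has lower latency and another has higher bandwidth, can be illustrated with a small userspace sketch. It assumes an application has already read the per-class attributes and applies its own preference, which is in line with the point that the reported values, not the class order, carry the information. The struct mirrors the fields of struct node_hmem_attrs from the patch; the enum, the function names, and any numbers fed to them are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the fields of struct node_hmem_attrs in the patch. */
struct hmem_attrs {
	unsigned int read_bandwidth;	/* MB/s */
	unsigned int write_bandwidth;	/* MB/s */
	unsigned int read_latency;	/* nanoseconds */
	unsigned int write_latency;	/* nanoseconds */
};

/* Hypothetical application-side policy knob. */
enum hmem_pref { PREFER_LATENCY, PREFER_BANDWIDTH };

/*
 * Return 1 if a is at least as good as b on every attribute and strictly
 * better on at least one; 0 otherwise. When neither class dominates the
 * other, no single "best" ordering exists without a policy.
 */
static int hmem_dominates(const struct hmem_attrs *a,
			  const struct hmem_attrs *b)
{
	int no_worse = a->read_bandwidth >= b->read_bandwidth &&
		       a->write_bandwidth >= b->write_bandwidth &&
		       a->read_latency <= b->read_latency &&
		       a->write_latency <= b->write_latency;
	int better = a->read_bandwidth > b->read_bandwidth ||
		     a->write_bandwidth > b->write_bandwidth ||
		     a->read_latency < b->read_latency ||
		     a->write_latency < b->write_latency;

	return no_worse && better;
}

/*
 * Pick the index of the class that best matches pref, judged on the read
 * attributes only. Ties keep the lowest index. Returns -1 if n is 0.
 */
static int hmem_pick_class(const struct hmem_attrs *c, size_t n,
			   enum hmem_pref pref)
{
	size_t i, best = 0;

	assert(c != NULL);
	if (n == 0)
		return -1;
	for (i = 1; i < n; i++) {
		int wins = (pref == PREFER_LATENCY)
			? c[i].read_latency < c[best].read_latency
			: c[i].read_bandwidth > c[best].read_bandwidth;
		if (wins)
			best = i;
	}
	return (int)best;
}
```

With two classes where neither dominates, hmem_pick_class() returns a different index for PREFER_LATENCY than for PREFER_BANDWIDTH, so the class numbering alone does not have to encode the trade-off.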

Patch

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..6014980238e8 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -149,6 +149,14 @@  config DEBUG_TEST_DRIVER_REMOVE
 	  unusable. You should say N here unless you are explicitly looking to
 	  test this functionality.
 
+config HMEM_REPORTING
+	bool
+	default y
+	depends on NUMA
+	help
+	  Enable reporting for heterogeneous memory access attributes under
+	  their non-uniform memory nodes.
+
 source "drivers/base/test/Kconfig"
 
 config SYS_HYPERVISOR
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1da5072116ab..1e909f61e8b1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -66,6 +66,9 @@  struct node_class_nodes {
 	unsigned		class;
 	nodemask_t		initiator_nodes;
 	nodemask_t		target_nodes;
+#ifdef CONFIG_HMEM_REPORTING
+	struct node_hmem_attrs	hmem_attrs;
+#endif
 };
 #define to_class_nodes(dev) container_of(dev, struct node_class_nodes, dev)
 
@@ -145,6 +148,51 @@  static struct node_class_nodes *node_init_node_class(struct device *parent,
 	return NULL;
 }
 
+#ifdef CONFIG_HMEM_REPORTING
+#define ACCESS_ATTR(name) 						   \
+static ssize_t name##_show(struct device *dev,				   \
+			   struct device_attribute *attr,		   \
+			   char *buf)					   \
+{									   \
+	return sprintf(buf, "%u\n", to_class_nodes(dev)->hmem_attrs.name); \
+}									   \
+static DEVICE_ATTR_RO(name);
+
+ACCESS_ATTR(read_bandwidth)
+ACCESS_ATTR(read_latency)
+ACCESS_ATTR(write_bandwidth)
+ACCESS_ATTR(write_latency)
+
+static struct attribute *access_attrs[] = {
+	&dev_attr_read_bandwidth.attr,
+	&dev_attr_read_latency.attr,
+	&dev_attr_write_bandwidth.attr,
+	&dev_attr_write_latency.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(access);
+
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
+			 unsigned class)
+{
+	struct node_class_nodes *c;
+	struct node *node;
+
+	if (WARN_ON_ONCE(!node_online(nid)))
+		return;
+
+	node = node_devices[nid];
+	c = node_init_node_class(&node->dev, &node->class_list, class);
+	if (!c)
+		return;
+
+	c->hmem_attrs = *hmem_attrs;
+	if (sysfs_create_groups(&c->dev.kobj, access_groups))
+		pr_info("failed to add performance attribute group to node %d\n",
+			nid);
+}
+#endif
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
diff --git a/include/linux/node.h b/include/linux/node.h
index 8e3666c12ef2..e22940a593c2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -20,6 +20,31 @@ 
 #include <linux/list.h>
 #include <linux/workqueue.h>
 
+/**
+ * struct node_hmem_attrs - heterogeneous memory performance attributes
+ *
+ * @read_bandwidth:	Read bandwidth in MB/s
+ * @write_bandwidth:	Write bandwidth in MB/s
+ * @read_latency:	Read latency in nanoseconds
+ * @write_latency:	Write latency in nanoseconds
+ */
+struct node_hmem_attrs {
+	unsigned int read_bandwidth;
+	unsigned int write_bandwidth;
+	unsigned int read_latency;
+	unsigned int write_latency;
+};
+#ifdef CONFIG_HMEM_REPORTING
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
+			 unsigned class);
+#else
+static inline void node_set_perf_attrs(unsigned int nid,
+				       struct node_hmem_attrs *hmem_attrs,
+				       unsigned class)
+{
+}
+#endif
+
 struct node {
 	struct device	dev;
 	struct list_head class_list;
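
The registration flow added by this patch can be modeled in userspace as below. This is only a sketch under stated assumptions: node_set_perf_attrs() here is a stub with the same parameters as the kernel function (the class argument is renamed access_class), the mock_classes table stands in for the per-node class devices, the show functions take a mock pointer instead of a struct device, and the caller example_report_node1() with its values is hypothetical. The ACCESS_ATTR macro follows the same token-pasting pattern as the one in drivers/base/node.c.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_NODES   4	/* arbitrary limits for this mock-up */
#define MAX_CLASSES 2

/* Same fields as struct node_hmem_attrs in include/linux/node.h. */
struct node_hmem_attrs {
	unsigned int read_bandwidth;	/* MB/s */
	unsigned int write_bandwidth;	/* MB/s */
	unsigned int read_latency;	/* nanoseconds */
	unsigned int write_latency;	/* nanoseconds */
};

/* Stand-in for the class device that holds the attributes. */
struct mock_class {
	int registered;
	struct node_hmem_attrs hmem_attrs;
};

static struct mock_class mock_classes[MAX_NODES][MAX_CLASSES];

/* Userspace stub: copy the attributes into the (node, class) slot. */
static void node_set_perf_attrs(unsigned int nid,
				struct node_hmem_attrs *hmem_attrs,
				unsigned int access_class)
{
	assert(hmem_attrs != NULL);
	if (nid >= MAX_NODES || access_class >= MAX_CLASSES)
		return;	/* the kernel version bails out on an offline node */
	mock_classes[nid][access_class].hmem_attrs = *hmem_attrs;
	mock_classes[nid][access_class].registered = 1;
}

/* Generate one <name>_show() per field, printing it as "%u\n". */
#define ACCESS_ATTR(name)						\
static int name##_show(const struct mock_class *c, char *buf,		\
		       size_t len)					\
{									\
	return snprintf(buf, len, "%u\n", c->hmem_attrs.name);		\
}

ACCESS_ATTR(read_bandwidth)
ACCESS_ATTR(read_latency)
ACCESS_ATTR(write_bandwidth)
ACCESS_ATTR(write_latency)

/* Hypothetical subsystem code reporting node 1, access class 0. */
static void example_report_node1(void)
{
	struct node_hmem_attrs attrs = {
		.read_bandwidth = 20000,
		.write_bandwidth = 15000,
		.read_latency = 150,
		.write_latency = 200,
	};

	node_set_perf_attrs(1, &attrs, 0);
}
```

After example_report_node1() runs, read_latency_show(&mock_classes[1][0], buf, sizeof(buf)) fills buf with the same text a read of node1/class0/read_latency would return under the patch's format.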