[PATCHv4,00/13] Heterogeneuos memory node attributes

Message ID	20190116175804.30196-1-keith.busch@intel.com (mailing list archive)
From: Keith Busch <keith.busch@intel.com>
To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Rafael Wysocki <rafael@kernel.org>, Dave Hansen <dave.hansen@intel.com>, Dan Williams <dan.j.williams@intel.com>, Keith Busch <keith.busch@intel.com>
Subject: [PATCHv4 00/13] Heterogeneuos memory node attributes
Date: Wed, 16 Jan 2019 10:57:51 -0700

Keith Busch Jan. 16, 2019, 5:57 p.m. UTC

The series seems quite calm now. I've received some approvals on the
proposal, and heard no objections to the new core interfaces.

Please let me know if there is any person or group from whom I should
request, and wait for, a review. If anyone reading this would like
additional time before I post a potential subsequent version, please let
me know as well.

I also wanted to inquire on upstream strategy if/when all desired
reviews are received. The series is spanning a few subsystems, so I'm
not sure whose tree is the best candidate. I could see an argument for
driver-core, acpi, or mm as possible paths. Please let me know if there's
a more appropriate option or any other gating concerns.

== Changes from v3 ==

  I've fixed the documentation issues that have been raised for v3 

  Moved the hmat files according to Rafael's recommendation

  Added received Reviewed-by's

Otherwise this v4 is much the same as v3.

== Background ==

Platforms may provide multiple types of cpu attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering what node they want
their memory allocated from. 

It had previously been difficult to describe these setups as memory
ranges were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created and are in use today that describe
the more complex memory hierarchies that can be created.

This series' objective is to provide the attributes from such systems
that are useful for applications to know about, and readily usable with
existing tools and libraries.

Keith Busch (13):
  acpi: Create subtable parsing infrastructure
  acpi: Add HMAT to generic parsing tables
  acpi/hmat: Parse and report heterogeneous memory
  node: Link memory nodes to their compute nodes
  Documentation/ABI: Add new node sysfs attributes
  acpi/hmat: Register processor domain to its memory
  node: Add heterogenous memory access attributes
  Documentation/ABI: Add node performance attributes
  acpi/hmat: Register performance attributes
  node: Add memory caching attributes
  Documentation/ABI: Add node cache attributes
  acpi/hmat: Register memory side cache attributes
  doc/mm: New documentation for memory performance

 Documentation/ABI/stable/sysfs-devices-node   |  87 +++++-
 Documentation/admin-guide/mm/numaperf.rst     | 184 +++++++++++++
 arch/arm64/kernel/acpi_numa.c                 |   2 +-
 arch/arm64/kernel/smp.c                       |   4 +-
 arch/ia64/kernel/acpi.c                       |  12 +-
 arch/x86/kernel/acpi/boot.c                   |  36 +--
 drivers/acpi/Kconfig                          |   1 +
 drivers/acpi/Makefile                         |   1 +
 drivers/acpi/hmat/Kconfig                     |   9 +
 drivers/acpi/hmat/Makefile                    |   1 +
 drivers/acpi/hmat/hmat.c                      | 375 ++++++++++++++++++++++++++
 drivers/acpi/numa.c                           |  16 +-
 drivers/acpi/scan.c                           |   4 +-
 drivers/acpi/tables.c                         |  76 +++++-
 drivers/base/Kconfig                          |   8 +
 drivers/base/node.c                           | 317 +++++++++++++++++++++-
 drivers/irqchip/irq-gic-v2m.c                 |   2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
 drivers/irqchip/irq-gic-v3-its.c              |   6 +-
 drivers/irqchip/irq-gic-v3.c                  |  10 +-
 drivers/irqchip/irq-gic.c                     |   4 +-
 drivers/mailbox/pcc.c                         |   2 +-
 include/linux/acpi.h                          |   6 +-
 include/linux/node.h                          |  70 ++++-
 25 files changed, 1172 insertions(+), 65 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

Balbir Singh Jan. 17, 2019, 12:58 p.m. UTC | #1

On Wed, Jan 16, 2019 at 10:57:51AM -0700, Keith Busch wrote:
> The series seems quite calm now. I've received some approvals on the
> proposal, and heard no objections to the new core interfaces.
> 
> Please let me know if there is any person or group from whom I should
> request, and wait for, a review. If anyone reading this would like
> additional time before I post a potential subsequent version, please let
> me know as well.
> 
> I also wanted to inquire on upstream strategy if/when all desired
> reviews are received. The series is spanning a few subsystems, so I'm
> not sure whose tree is the best candidate. I could see an argument for
> driver-core, acpi, or mm as possible paths. Please let me know if there's
> a more appropriate option or any other gating concerns.
> 
> == Changes from v3 ==
> 
>   I've fixed the documentation issues that have been raised for v3 
> 
>   Moved the hmat files according to Rafael's recommendation
> 
>   Added received Reviewed-by's
> 
> Otherwise this v4 is much the same as v3.
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today that describe
> the more complex memory hierarchies that can be created.
> 

Could you please expand on this text -- how are these attributes
exposed/consumed by both the kernel and user space?

> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

I presume these tools and libraries are numactl and mbind()?

Balbir Singh.

Keith Busch Jan. 17, 2019, 3:44 p.m. UTC | #2

On Thu, Jan 17, 2019 at 11:58:21PM +1100, Balbir Singh wrote:
> On Wed, Jan 16, 2019 at 10:57:51AM -0700, Keith Busch wrote:
> > It had previously been difficult to describe these setups as memory
> > ranges were generally lumped into the NUMA node of the CPUs. New
> > platform attributes have been created and are in use today that describe
> > the more complex memory hierarchies that can be created.
> > 
> 
> Could you please expand on this text -- how are these attributes
> exposed/consumed by both the kernel and user space?
> 
> > This series' objective is to provide the attributes from such systems
> > that are useful for applications to know about, and readily usable with
> > existing tools and libraries.
> 
> I presume these tools and libraries are numactl and mbind()?

Yes, and numactl is used in the examples provided in both changelogs and
documentation in this series. Do you want to see those in the cover
letter as well?

Jonathan Cameron Jan. 17, 2019, 6:18 p.m. UTC | #3

On Wed, 16 Jan 2019 10:57:51 -0700
Keith Busch <keith.busch@intel.com> wrote:

> The series seems quite calm now. I've received some approvals on the
> proposal, and heard no objections to the new core interfaces.
> 
> Please let me know if there is any person or group from whom I should
> request, and wait for, a review. If anyone reading this would like
> additional time before I post a potential subsequent version, please let
> me know as well.
> 
> I also wanted to inquire on upstream strategy if/when all desired
> reviews are received. The series is spanning a few subsystems, so I'm
> not sure whose tree is the best candidate. I could see an argument for
> driver-core, acpi, or mm as possible paths. Please let me know if there's
> a more appropriate option or any other gating concerns.
> 
> == Changes from v3 ==
> 
>   I've fixed the documentation issues that have been raised for v3 
> 
>   Moved the hmat files according to Rafael's recommendation
> 
>   Added received Reviewed-by's
> 
> Otherwise this v4 is much the same as v3.
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.
> 

Hi Keith, 

I've been having a play with various hand constructed HMAT tables to allow
me to try breaking them in all sorts of ways.

Mostly working as expected.

Two places I am so far unsure on...

1. Concept of 'best' is not implemented in a consistent fashion.

I don't agree with the logic to match on 'best' because it can give some
counterintuitive sets of target nodes.

For my simple test case we have both the latency and bandwidth specified (using
access as I'm lazy and it saves typing).

Rather than matching when both are the best value, we match when _any_ of the
measurements is the 'best' for the type of measurement.

A simple system with a high bandwidth interconnect between two SoCs
might well have identical bandwidths to memory connected to each node, but
much worse latency to the remote one.  Another simple case would be DDR and
SCM on roughly the same memory controller.  Bandwidths likely to be equal,
latencies very different.

Right now we get both nodes in the list of 'best' ones because the bandwidths
are equal which is far from ideal.  It also means we are presenting one value
for both latency and bandwidth, misrepresenting the ones where it doesn't apply.

If we aren't going to specify that both must be "best", then I think we should
separate the bandwidth and latency classes, requiring userspace to check
both if they want the best combination of latency and bandwidth. I'm also
happy enough (having not thought about it much) to have one class where the 'best'
is the value sorted first on best latency and then on best bandwidth.
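
The three matching rules discussed above can be compared side by side. This is a minimal sketch in Python rather than the series' kernel code: the node numbers and performance values are made-up sample data, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data for one initiator:
# target node -> (read_latency, read_bandwidth).
# Lower latency is better, higher bandwidth is better.
PERF: Dict[int, Tuple[int, int]] = {
    0: (100, 15000),  # local memory
    1: (300, 15000),  # remote memory: same bandwidth, worse latency
}


def best_any(perf: Dict[int, Tuple[int, int]]) -> List[int]:
    """Targets that are best in at least one measurement."""
    best_lat = min(lat for lat, _ in perf.values())
    best_bw = max(bw for _, bw in perf.values())
    return sorted(n for n, (lat, bw) in perf.items()
                  if lat == best_lat or bw == best_bw)


def best_all(perf: Dict[int, Tuple[int, int]]) -> List[int]:
    """Targets that are best in every measurement."""
    best_lat = min(lat for lat, _ in perf.values())
    best_bw = max(bw for _, bw in perf.values())
    return sorted(n for n, (lat, bw) in perf.items()
                  if lat == best_lat and bw == best_bw)


def best_sorted(perf: Dict[int, Tuple[int, int]]) -> List[int]:
    """Targets tied for first when ranked by latency, then by bandwidth."""
    def key(node: int) -> Tuple[int, int]:
        lat, bw = perf[node]
        return (lat, -bw)  # negate bandwidth so that smaller is better

    top = min(key(n) for n in perf)
    return sorted(n for n in perf if key(n) == top)
```

With this sample data the "any" rule keeps both nodes because their bandwidths tie, while the "all" rule and the latency-then-bandwidth ordering keep only node 0.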

2. Handling of memory only nodes - that might have a device attached - _PXM

This is a common situation in CCIX for example where you have an accelerator
with coherent memory homed at it. Looks like a pci device in a domain with
the memory.   Right now you can't actually do this as _PXM is processed
for pci devices, but we'll get that fixed (broken threadripper firmwares
meant it got reverted last cycle).

In my case I have 4 nodes with cpu and memory (0,1,2,3) and 2 memory only (4,5)
Memory only are longer latency and lower bandwidth.

Now
ls /sys/bus/nodes/devices/node0/class0/
...

initiator0
target0
target4
target5

read_bandwidth = 15000
read_latency = 10000

These two values (and their paired write values) are correct for initiator0 to target0
but completely wrong for initiator0 to target4 or target5.

This occurs because we loop over the targets looking for the best values and
set the relevant bit in t->p_nodes based on that.  These memory only nodes have
a best value that happens to be equal from all the initiators.  The issue is it
isn't the one reported in the node0/class0.

Also if we look in
/sys/bus/nodes/devices/node4/class0 there are no targets listed (there are the expected
4 initiators 0-3).

I'm not sure what the intended behavior would be in this case.

I'll run some more tests tomorrow.

Jonathan

> Keith Busch (13):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   Documentation/ABI: Add new node sysfs attributes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   Documentation/ABI: Add node performance attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   Documentation/ABI: Add node cache attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 +++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 184 +++++++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +--
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 375 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 317 +++++++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  70 ++++-
>  25 files changed, 1172 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
>

Keith Busch Jan. 17, 2019, 7:47 p.m. UTC | #4

On Thu, Jan 17, 2019 at 10:18:35AM -0800, Jonathan Cameron wrote:
> I've been having a play with various hand constructed HMAT tables to allow
> me to try breaking them in all sorts of ways.
> 
> Mostly working as expected.
> 
> Two places I am so far unsure on...
> 
> 1. Concept of 'best' is not implemented in a consistent fashion.
> 
> I don't agree with the logic to match on 'best' because it can give some
> counterintuitive sets of target nodes.
> 
> For my simple test case we have both the latency and bandwidth specified (using
> access as I'm lazy and it saves typing).
> 
> Rather than matching when both are the best value, we match when _any_ of the
> measurements is the 'best' for the type of measurement.
> 
> A simple system with a high bandwidth interconnect between two SoCs
> might well have identical bandwidths to memory connected to each node, but
> much worse latency to the remote one.  Another simple case would be DDR and
> SCM on roughly the same memory controller.  Bandwidths likely to be equal,
> latencies very different.
> 
> Right now we get both nodes in the list of 'best' ones because the bandwidths
> are equal which is far from ideal.  It also means we are presenting one value
> for both latency and bandwidth, misrepresenting the ones where it doesn't apply.
> 
> If we aren't going to specify that both must be "best", then I think we should
> separate the bandwidth and latency classes, requiring userspace to check
> both if they want the best combination of latency and bandwidth. I'm also
> happy enough (having not thought about it much) to have one class where the 'best'
> is the value sorted first on best latency and then on best bandwidth.

Okay, I see what you mean. I must admit my test environment doesn't have
nodes with the same bandwidth but different latency, so we may get the
wrong information with the HMAT parsing in this series. I'll look into
fixing that and consider your suggestions.
 
> 2. Handling of memory only nodes - that might have a device attached - _PXM
> 
> This is a common situation in CCIX for example where you have an accelerator
> with coherent memory homed at it. Looks like a pci device in a domain with
> the memory.   Right now you can't actually do this as _PXM is processed
> for pci devices, but we'll get that fixed (broken threadripper firmwares
> meant it got reverted last cycle).
> 
> In my case I have 4 nodes with cpu and memory (0,1,2,3) and 2 memory only (4,5)
> Memory only are longer latency and lower bandwidth.
> 
> Now
> ls /sys/bus/nodes/devices/node0/class0/
> ...
> 
> initiator0
> target0
> target4
> target5
> 
> read_bandwidth = 15000
> read_latency = 10000
> 
> These two values (and their paired write values) are correct for initiator0 to target0
> but completely wrong for initiator0 to target4 or target5.

Hm, this wasn't intended to tell us performance for the initiator's
targets. The performance data here is when you access node0's memory
target from a node in its initiator_list, or one of the symlinked
initiatorX's.

If you want to see the performance attributes for accessing
initiator0->target4, you can check:

  /sys/devices/system/node/node0/class0/target4/class0/read_bandwidth
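
Reading that attribute from user space could look like the sketch below. It assumes the sysfs layout from the path above, which is specific to this v4 proposal and may differ in other kernels; the helper name `read_perf` and its `root` parameter are hypothetical and exist only to make the example testable.

```python
from pathlib import Path

# Root of the node devices, taken from the path quoted above.
SYSFS_NODE_ROOT = Path("/sys/devices/system/node")


def read_perf(initiator: int, target: int, attr: str = "read_bandwidth",
              root: Path = SYSFS_NODE_ROOT) -> int:
    """Read one class0 attribute of `target`, reached via the initiator's targetN link."""
    path = (root / f"node{initiator}" / "class0"
            / f"target{target}" / "class0" / attr)
    return int(path.read_text().strip())
```

For example, `read_perf(0, 4)` reads the `read_bandwidth` file named in the path above.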

> This occurs because we loop over the targets looking for the best values and
> set the relevant bit in t->p_nodes based on that.  These memory only nodes have
> a best value that happens to be equal from all the initiators.  The issue is it
> isn't the one reported in the node0/class0.
>
> Also if we look in
> /sys/bus/nodes/devices/node4/class0 there are no targets listed (there are the expected
> 4 initiators 0-3).
> 
> I'm not sure what the intended behavior would be in this case.

You mentioned that node 4 is a memory-only node, so it can't have any
targets, right?

Jonathan Cameron Jan. 18, 2019, 11:12 a.m. UTC | #5

On Thu, 17 Jan 2019 12:47:51 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Thu, Jan 17, 2019 at 10:18:35AM -0800, Jonathan Cameron wrote:
> > I've been having a play with various hand constructed HMAT tables to allow
> > me to try breaking them in all sorts of ways.
> > 
> > Mostly working as expected.
> > 
> > Two places I am so far unsure on...
> > 
> > 1. Concept of 'best' is not implemented in a consistent fashion.
> > 
> > I don't agree with the logic to match on 'best' because it can give some
> > counterintuitive sets of target nodes.
> > 
> > For my simple test case we have both the latency and bandwidth specified (using
> > access as I'm lazy and it saves typing).
> > 
> > Rather than matching when both are the best value, we match when _any_ of the
> > measurements is the 'best' for the type of measurement.
> > 
> > A simple system with a high bandwidth interconnect between two SoCs
> > might well have identical bandwidths to memory connected to each node, but
> > much worse latency to the remote one.  Another simple case would be DDR and
> > SCM on roughly the same memory controller.  Bandwidths likely to be equal,
> > latencies very different.
> > 
> > Right now we get both nodes in the list of 'best' ones because the bandwidths
> > are equal which is far from ideal.  It also means we are presenting one value
> > for both latency and bandwidth, misrepresenting the ones where it doesn't apply.
> > 
> > If we aren't going to specify that both must be "best", then I think we should
> > separate the bandwidth and latency classes, requiring userspace to check
> > both if they want the best combination of latency and bandwidth. I'm also
> > happy enough (having not thought about it much) to have one class where the 'best'
> > is the value sorted first on best latency and then on best bandwidth.  
> 
> Okay, I see what you mean. I must admit my test environment doesn't have
> nodes with the same bandwidth but different latency, so we may get the
> wrong information with the HMAT parsing in this series. I'll look into
> fixing that and consider your suggestions.

Great.

>  
> > 2. Handling of memory only nodes - that might have a device attached - _PXM
> > 
> > This is a common situation in CCIX for example where you have an accelerator
> > with coherent memory homed at it. Looks like a pci device in a domain with
> > the memory.   Right now you can't actually do this as _PXM is processed
> > for pci devices, but we'll get that fixed (broken threadripper firmwares
> > meant it got reverted last cycle).
> > 
> > In my case I have 4 nodes with cpu and memory (0,1,2,3) and 2 memory only (4,5)
> > Memory only are longer latency and lower bandwidth.
> > 
> > Now
> > ls /sys/bus/nodes/devices/node0/class0/
> > ...
> > 
> > initiator0
> > target0
> > target4
> > target5
> > 
> > read_bandwidth = 15000
> > read_latency = 10000
> > 
> > These two values (and their paired write values) are correct for initiator0 to target0
> > but completely wrong for initiator0 to target4 or target5.  
> 
> Hm, this wasn't intended to tell us performance for the initiator's
> targets. The performance data here is when you access node0's memory
> target from a node in its initiator_list, or one of the symlinked
> initiatorX's.

> 
> If you want to see the performance attributes for accessing
> initiator0->target4, you can check:
> 
>   /sys/devices/system/node/node0/class0/target4/class0/read_bandwidth

Ah.  That makes sense, but it does raise the question of whether this interface
is rather unintuitive, and it means the example given in the docs for the PCI device
doesn't always work.  Perhaps it is that documentation that needs refining.

Having values that don't apply to particular combinations of entries
in the initiator_list and target_list based on which directory we are
in doesn't seem great to me.

So the presence of a target directory in a node indicates the memory
in the target is 'best' accessed from this node, but is unrelated to the
values provided in this node.

One thought is we are trying to combine two unrelated questions and that
is what is leading to the confusion.

1) I have a process (or similar) in this node: which is the 'best' memory
   to use, and what are its characteristics?

2) I have data in this memory: which processor node should I schedule my
   processing on?

Ideally we want to avoid searching all the nodes.
To test this I disabled the memory on one of my nodes to make it even more
pathological (this is valid on the real hardware as the cpus don't have to
have memory attached to them).  So same as before but now node 3 is initiator only.

Using bandwidth as a proxy for all the other measurements..

For question 1 (what memory)

Initiator in node 0

Need to check
node0/class0/target0/class0/read_bandwidth (can shortcut this one obviously)
node0/class0/target4/class0/read_bandwidth
node0/class0/target5/class0/read_bandwidth

and discover that it's smaller for node0/class0/target0
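
The search described above can be sketched as a loop over the targetN links of the initiator's class directory. This is a minimal illustration that assumes the v4 layout used in this thread; `best_target` is a hypothetical helper. It picks the maximum, which suits a bandwidth attribute; a latency attribute would need the minimum instead.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def best_target(class_dir: Path,
                attr: str = "read_bandwidth") -> Optional[Tuple[int, int]]:
    """Return (node, value) for the linked target with the highest `attr`.

    Returns None when the directory has no targetN entries.
    """
    best = None
    for link in class_dir.glob("target*"):
        match = re.fullmatch(r"target(\d+)", link.name)
        if match is None:
            continue  # not a targetN link, e.g. a nodelist file
        value = int((link / "class0" / attr).read_text())
        if best is None or value > best[1]:
            best = (int(match.group(1)), value)
    return best
```

Every target has to be visited, which is the cost referred to in the remark about avoiding a search of all the nodes.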

Initiator in node 3 (initiator only)
node3/class0/initiator_nodelist is empty so no useful information available.

Initiator in node 4 (pci card for example) can assume node 4, as there is no
other information and it has memory (which incidentally might have long
latencies compared to memory over the interconnect).

For question 2 (what processor)

We are on better grounds

node0/class0/initiator0
node1/class0/initiator1
node2/class0/initiator2
node3 doesn't make sense (no memory)
node4/class0/initiator[0-3]
node5/class0/initiator[0-3]
All the memory / bandwidth numbers are as expected.

So my conclusion is this works fine for suggesting which processor to use for
a given memory (or accelerator or whatever, though only if they are closely
coupled with the processors). It doesn't work for the question of what memory
to use for a given processor / pci card etc.  Sometimes there is no answer,
sometimes you have to search to find it.

Does it make more sense to just have two classes?

1) Which memory is nearest to me?
2) Which processor is nearest to me?
(A third question, which processor of type X is nearest to me, is harder to
answer but useful.)

Note that case 2 is clearly covered well by the existing class, but I can't actually
see what benefit the target links have for that use case.

To illustrate how I think that would work.

Class 0, existing but with target links dropped.

What processor for each memory
node0/class0/initiator_nodelist 0
node1/class0/initiator_nodelist 1
node2/class0/initiator_nodelist 2
node3/class0/initiator_nodelist ""
node4/class0/initiator_nodelist 0-3
node5/class0/initiator_nodelist 0-3
All the memory stats reflect the values for the given initiators accessing this memory.

Class 1, a new one for the 'what memory is nearest to me' question - no initiator files as that's 'me'.
node0/class1/target_nodelist 0
node1/class1/target_nodelist 1
node2/class1/target_nodelist 2
node3/class1/target_nodelist 0-2
node4/class1/target_nodelist "" as not specified as an initiator in hmat. 
				Ideally class1 wouldn't even be there.
node5/class1/target_nodelist "" same
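
A consumer of these nodelist files needs to expand values such as "0-3" and cope with an empty list. A minimal sketch, assuming the files use the usual comma-separated range format shown in the listings above; `parse_nodelist` is a hypothetical helper name.

```python
from typing import List


def parse_nodelist(text: str) -> List[int]:
    """Expand a list such as "0-3,5" into [0, 1, 2, 3, 5].

    An empty string yields an empty list.
    """
    nodes: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file, or stray comma
        first, _, last = part.partition("-")
        # A single node such as "4" has no "-", so `last` is empty.
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes
```

An empty result corresponds to the "" entries above: the node reports no nearest memory or no nearest processor.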

For now we would have no information for accelerators in nodes 4 and 5, but
ACPI doesn't really provide us the info for them anyway.  I suppose you could
have HMAT describing 'initiators' in those nodes without there being any processors
in SRAT.  Let's park that one to solve another day.

Your pci example then becomes

+  # NODE=$(cat /sys/devices/pci:0000:00/.../numa_node)
+  # numactl --membind=$(cat /sys/devices/node/node${NODE}/class1/target_nodelist) \
+      --cpunodebind=$(cat /sys/devices/node/node${NODE}/class0/initiator_nodelist) \
+      -- <some-program-to-execute>
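
The same command line can be assembled programmatically, with a guard for the empty-nodelist case that the shell form would pass to numactl unchecked. This sketch assumes the proposed class0/class1 layout above, which is only a suggestion in this thread and not part of the posted series; `numactl_argv` is a hypothetical helper, and the caller supplies the node directory.

```python
from pathlib import Path
from typing import List


def numactl_argv(node_dir: Path, program: List[str]) -> List[str]:
    """Build a numactl command from a node's proposed nodelist files."""
    mem = (node_dir / "class1" / "target_nodelist").read_text().strip()
    cpu = (node_dir / "class0" / "initiator_nodelist").read_text().strip()
    if not mem or not cpu:
        # An empty list means the node reports no nearest memory or processor.
        raise ValueError(f"no nearest memory or processor for {node_dir.name}")
    return ["numactl", f"--membind={mem}", f"--cpunodebind={cpu}",
            "--", *program]
```

The returned list can be handed to `subprocess.run()` unchanged.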

What do you think?

Jonathan

> 
> > This occurs because we loop over the targets looking for the best values and
> > set the relevant bit in t->p_nodes based on that.  These memory only nodes have
> > a best value that happens to be equal from all the initiators.  The issue is it
> > isn't the one reported in the node0/class0.
> >
> > Also if we look in
> > /sys/bus/nodes/devices/node4/class0 there are no targets listed (there are the expected
> > 4 initiators 0-3).
> > 
> > I'm not sure what the intended behavior would be in this case.  
> 
> You mentioned that node 4 is a memory-only node, so it can't have any
> targets, right?
Depends on what is there that ACPI 6.2 doesn't describe :)  Can have
PCI cards for example.

Jonathan

Balbir Singh Jan. 18, 2019, 1:16 p.m. UTC | #6

On Thu, Jan 17, 2019 at 08:44:37AM -0700, Keith Busch wrote:
> On Thu, Jan 17, 2019 at 11:58:21PM +1100, Balbir Singh wrote:
> > On Wed, Jan 16, 2019 at 10:57:51AM -0700, Keith Busch wrote:
> > > It had previously been difficult to describe these setups as memory
> > > ranges were generally lumped into the NUMA node of the CPUs. New
> > > platform attributes have been created and are in use today that describe
> > > the more complex memory hierarchies that can be created.
> > > 
> > 
> > Could you please expand on this text -- how are these attributes
> > exposed/consumed by both the kernel and user space?
> > 
> > > This series' objective is to provide the attributes from such systems
> > > that are useful for applications to know about, and readily usable with
> > > existing tools and libraries.
> > 
> > I presume these tools and libraries are numactl and mbind()?
> 
> Yes, and numactl is used in the examples provided in both changelogs and
> documentation in this series. Do you want to see those in the cover
> letter as well?

Not really, I was just reading through the cover letter and was curious.

Balbir Singh.
