[PATCHv6 00/10] Heterogeneous memory node attributes

Message ID 20190214171017.9362-1-keith.busch@intel.com (mailing list archive)

Message

Keith Busch Feb. 14, 2019, 5:10 p.m. UTC
== Changes since v5 ==

  Updated HMAT parsing to account for the recently released ACPI 6.3
  changes.

  HMAT attribute calculation overflow checks.

  Fixed memory leak if HMAT parse fails.

  Minor change to the patch order. All the base node attributes occur
  before HMAT usage for these new node attributes to resolve a
  dependency on a new struct.

  Failures to parse the HMAT or to allocate structures are now reported
  at NOTICE level instead of DEBUG. Any failure results in just one
  print so that it is obvious something may need to be investigated,
  rather than failing silently, while not being too alarming either.

  Determining the CPU and memory node local relationships is quite
  different this time (PATCH 7/10). The local relationship to a memory
  target will be either *only* the node from the Initiator Proximity
  Domain if provided, or, if it is not provided, all the nodes that have
  the same highest performance. Latency was chosen to take priority over
  bandwidth when ranking performance.
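
  To illustrate the fallback case, here is a rough standalone sketch of
  that rule with made-up numbers (this is not the code from PATCH 7/10):
  latency ranks first, and bandwidth only breaks ties among the
  lowest-latency initiators.

    /* Hedged sketch, not the patch code: pick the "local" initiators for
     * one memory target when no Initiator Proximity Domain is given. */
    #include <stdio.h>

    struct perf {
            int node;               /* initiator node id */
            unsigned int latency;   /* lower is better */
            unsigned int bandwidth; /* higher is better */
    };

    static void print_local_initiators(const struct perf *p, int n)
    {
            unsigned int best_lat = ~0u, best_bw = 0;
            int i;

            /* Latency takes priority: find the lowest latency first... */
            for (i = 0; i < n; i++)
                    if (p[i].latency < best_lat)
                            best_lat = p[i].latency;
            /* ...then the best bandwidth among the lowest-latency nodes. */
            for (i = 0; i < n; i++)
                    if (p[i].latency == best_lat && p[i].bandwidth > best_bw)
                            best_bw = p[i].bandwidth;

            /* Every initiator matching that best performance is "local". */
            for (i = 0; i < n; i++)
                    if (p[i].latency == best_lat && p[i].bandwidth == best_bw)
                            printf("node%d is local to the target\n", p[i].node);
    }

    int main(void)
    {
            const struct perf example[] = {
                    { 0, 100, 200 },        /* ties with node1: both local */
                    { 1, 100, 200 },
                    { 2, 300, 400 },        /* higher latency: not local */
            };

            print_local_initiators(example, 3);
            return 0;
    }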

  Renamed "side_cache" to "memory_side_cache". The previous name was
  ambiguous.

  Removed "level" as an exported cache attribute. It was redundant with
  the directory name anyway.

  Minor changelog updates, added received reviews, and documentation
  fixes.

Just want to point out that I am sticking with struct device
instead of using struct kobject embedded in the attribute tracking
structures. Previous feedback was leaning either way on this point.

== Background ==

Platforms may provide multiple types of CPU-attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering which node they
want their memory allocated from.

It had previously been difficult to describe these setups as memory
ranges were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created and are in use today to describe
the more complex memory hierarchies that can be built.

This series' objective is to provide the attributes from such systems
that are useful for applications to know about, and readily usable with
existing tools and libraries. Those applications may query performance
attributes relative to a particular CPU they're running on in order to
make more informed choices for where they want to allocate hot and cold
data. This works with mbind() or the numactl library.
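
As a hedged illustration of that last point, this is one way an
application might steer a hot buffer onto a node it picked after reading
the new attributes, using the numactl library (link with -lnuma; the
node number and size here are made up):

  #include <stdio.h>
  #include <numa.h>

  int main(void)
  {
          size_t len = 1 << 20;
          void *hot;

          if (numa_available() < 0) {
                  fprintf(stderr, "no NUMA support on this system\n");
                  return 1;
          }

          /* Back 1MB of hot data with pages from node 1 only. */
          hot = numa_alloc_onnode(len, 1);
          if (!hot)
                  return 1;

          /* ... touch and use the buffer ... */

          numa_free(hot, len);
          return 0;
  }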

Keith Busch (10):
  acpi: Create subtable parsing infrastructure
  acpi: Add HMAT to generic parsing tables
  acpi/hmat: Parse and report heterogeneous memory
  node: Link memory nodes to their compute nodes
  node: Add heterogenous memory access attributes
  node: Add memory-side caching attributes
  acpi/hmat: Register processor domain to its memory
  acpi/hmat: Register performance attributes
  acpi/hmat: Register memory side cache attributes
  doc/mm: New documentation for memory performance

 Documentation/ABI/stable/sysfs-devices-node   |  89 +++-
 Documentation/admin-guide/mm/numaperf.rst     | 164 +++++++
 arch/arm64/kernel/acpi_numa.c                 |   2 +-
 arch/arm64/kernel/smp.c                       |   4 +-
 arch/ia64/kernel/acpi.c                       |  12 +-
 arch/x86/kernel/acpi/boot.c                   |  36 +-
 drivers/acpi/Kconfig                          |   1 +
 drivers/acpi/Makefile                         |   1 +
 drivers/acpi/hmat/Kconfig                     |   9 +
 drivers/acpi/hmat/Makefile                    |   1 +
 drivers/acpi/hmat/hmat.c                      | 677 ++++++++++++++++++++++++++
 drivers/acpi/numa.c                           |  16 +-
 drivers/acpi/scan.c                           |   4 +-
 drivers/acpi/tables.c                         |  76 ++-
 drivers/base/Kconfig                          |   8 +
 drivers/base/node.c                           | 351 ++++++++++++-
 drivers/irqchip/irq-gic-v2m.c                 |   2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
 drivers/irqchip/irq-gic-v3-its.c              |   6 +-
 drivers/irqchip/irq-gic-v3.c                  |  10 +-
 drivers/irqchip/irq-gic.c                     |   4 +-
 drivers/mailbox/pcc.c                         |   2 +-
 include/linux/acpi.h                          |   6 +-
 include/linux/node.h                          |  60 ++-
 25 files changed, 1480 insertions(+), 65 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

Comments

Brice Goglin Feb. 18, 2019, 2:25 p.m. UTC | #1
On 14/02/2019 at 18:10, Keith Busch wrote:
> == Changes since v5 ==
>
>   Updated HMAT parsing to account for the recently released ACPI 6.3
>   changes.
>
>   HMAT attribute calculation overflow checks.
>
>   Fixed memory leak if HMAT parse fails.
>
>   Minor change to the patch order. All the base node attributes occur
>   before HMAT usage for these new node attributes to resolve a
>   dependency on a new struct.
>
>   Failures to parse the HMAT or to allocate structures are now reported
>   at NOTICE level instead of DEBUG. Any failure results in just one
>   print so that it is obvious something may need to be investigated,
>   rather than failing silently, while not being too alarming either.
>
>   Determining the CPU and memory node local relationships is quite
>   different this time (PATCH 7/10). The local relationship to a memory
>   target will be either *only* the node from the Initiator Proximity
>   Domain if provided, or, if it is not provided, all the nodes that have
>   the same highest performance. Latency was chosen to take priority over
>   bandwidth when ranking performance.


Hello Keith

I am trying to understand what this last paragraph means.

Let's say I have a machine with DDR and NVDIMM both attached to the same
socket, and I use Dave Hansen's kmem patches to make the NVDIMM appear as
"normal memory" in an additional NUMA node. Let's call node0 the DDR node
and node1 the NVDIMM kmem node.

Now user-space wants to find out which CPUs are actually close to the
NVDIMMs. My understanding is that SRAT says that CPUs are local to the
DDR only. Hence /sys/devices/system/node/node1/cpumap says there are no
CPUs local to the NVDIMM. And HMAT won't change this, right?

Will node1 contain access0/initiators/node0 to clarify that the CPUs
local to the NVDIMM are those of node0? Even if the latency from node0 to
node1 is higher than from node0 to node0?

Another way to ask this: Is the latency/performance only used for
distinguishing the local initiator CPUs among multiple CPU nodes
accessing the same memory node? Or is it also used to distinguish the
local memory target among multiple memories accessed by a single CPU node?

The Intel machine I am currently testing patches on doesn't have an HMAT
in 1-level-memory, unfortunately.

Thanks

Brice
Keith Busch Feb. 19, 2019, 5:20 p.m. UTC | #2
On Mon, Feb 18, 2019 at 03:25:31PM +0100, Brice Goglin wrote:
> On 14/02/2019 at 18:10, Keith Busch wrote:
> >   Determining the CPU and memory node local relationships is quite
> >   different this time (PATCH 7/10). The local relationship to a memory
> >   target will be either *only* the node from the Initiator Proximity
> >   Domain if provided, or, if it is not provided, all the nodes that have
> >   the same highest performance. Latency was chosen to take priority over
> >   bandwidth when ranking performance.
> 
> 
> Hello Keith
> 
> I am trying to understand what this last paragraph means.
> 
> Let's say I have a machine with DDR and NVDIMM both attached to the same
> socket, and I use Dave Hansen's kmem patches to make the NVDIMM appear as
> "normal memory" in an additional NUMA node. Let's call node0 the DDR node
> and node1 the NVDIMM kmem node.
> 
> Now user-space wants to find out which CPUs are actually close to the
> NVDIMMs. My understanding is that SRAT says that CPUs are local to the
> DDR only. Hence /sys/devices/system/node/node1/cpumap says there are no
> CPUs local to the NVDIMM. And HMAT won't change this, right?

HMAT actually does change this. The relationship is in 6.2's HMAT
Address Range or 6.3's Proximity Domain Attributes, and that's
something SRAT wasn't providing.

The problem with these HMAT structures is that the CPU node is
optional. The last paragraph is saying that if that optional information
is provided, we will use it. If it is not provided, we will fall back
to the performance attributes to determine the "local" CPU domain.
 
> Will node1 contain access0/initiators/node0 to clarify that the CPUs
> local to the NVDIMM are those of node0? Even if the latency from node0 to
> node1 is higher than from node0 to node0?

Exactly, yes. To expand on this, what you'd see from sysfs:

  /sys/devices/system/node/node0/access0/targets/node1 -> ../../../node1

And

  /sys/devices/system/node/node1/access0/initiators/node0 -> ../../../node0
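
For what it's worth, a minimal user-space sketch of discovering those
links (assuming the layout above; error handling kept short):

  #include <stdio.h>
  #include <string.h>
  #include <dirent.h>

  int main(void)
  {
          const char *path = "/sys/devices/system/node/node1/access0/initiators";
          struct dirent *d;
          DIR *dir = opendir(path);

          if (!dir) {
                  perror(path);
                  return 1;
          }

          while ((d = readdir(dir)) != NULL) {
                  /* The local initiators show up as nodeN symlinks. */
                  if (!strncmp(d->d_name, "node", 4))
                          printf("local initiator: %s\n", d->d_name);
          }

          closedir(dir);
          return 0;
  }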

> Another way to ask this: Is the latency/performance only used for
> distinguishing the local initiator CPUs among multiple CPU nodes
> accessing the same memory node? Or is it also used to distinguish the
> local memory target among multiple memories accessed by a single CPU node?

It's the first one. A single CPU domain may have multiple local targets,
but each of those targets may have different performance.

For example, you could have something like this with "normal" DDR
memory, high-bandwidth memory, and slower nvdimm:

 +------------------+    +------------------+
 | CPU Node 0       +----+ CPU Node 1       |
 | Node0 DDR Mem    |    | Node1 DDR Mem    |
 +--------+---------+    +--------+---------+
          |                       |
 +--------+---------+    +--------+---------+
 | Node2 HBMem      |    | Node3 HBMem      |
 +--------+---------+    +--------+---------+
          |                       |
 +--------+---------+    +--------+---------+
 | Node4 Slow NVMem |    | Node5 Slow NVMem |
 +------------------+    +------------------+

In the above, initiator node0 is "local" to targets 0, 2, and 4, and
would show up in node0's access0/targets/. Each memory target node,
though, has different performance than the others that are local to the
same initiator domain.
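
To make that concrete, user space could compare its local targets by
reading the per-access performance files. A rough sketch, assuming a
read_latency attribute under access0/initiators/ as I read the ABI
documentation in this series, and using the example node numbers above:

  #include <stdio.h>

  static long read_latency(int target)
  {
          char path[128];
          long val = -1;
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/node/node%d/access0/initiators/read_latency",
                   target);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          if (fscanf(f, "%ld", &val) != 1)
                  val = -1;
          fclose(f);
          return val;
  }

  int main(void)
  {
          const int targets[] = { 0, 2, 4 };      /* local to initiator node0 */
          int i;

          for (i = 0; i < 3; i++)
                  printf("node%d read latency: %ld\n",
                         targets[i], read_latency(targets[i]));
          return 0;
  }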

> The Intel machine I am currently testing patches on doesn't have an HMAT
> in 1-level-memory, unfortunately.

Platforms providing HMAT tables are still rare at the moment, but we
expect they will become more common.
Keith Busch Feb. 20, 2019, 6:25 p.m. UTC | #3
On Thu, Feb 14, 2019 at 10:10:07AM -0700, Keith Busch wrote:
> Platforms may provide multiple types of CPU-attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering which node they
> want their memory allocated from.
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today to describe
> the more complex memory hierarchies that can be built.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries. Those applications may query performance
> attributes relative to a particular CPU they're running on in order to
> make more informed choices for where they want to allocate hot and cold
> data. This works with mbind() or the numactl library.

Hi all,

So this seems very calm at this point. Unless there are any late concerns
or suggestions, could this be considered for queueing in a staging tree
for a future merge window?

Thanks,
Keith

 
> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   node: Add heterogenous memory access attributes
>   node: Add memory-side caching attributes
>   acpi/hmat: Register processor domain to its memory
>   acpi/hmat: Register performance attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  89 +++-
>  Documentation/admin-guide/mm/numaperf.rst     | 164 +++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 677 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 ++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 351 ++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1480 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c