[PATCHv5,00/10] Heterogeneous memory node attributes

Message ID 20190124230724.10022-1-keith.busch@intel.com (mailing list archive)

Message

Keith Busch Jan. 24, 2019, 11:07 p.m. UTC
== Changes since v4 ==

  All public interfaces have kernel docs.

  Renamed "class" to "access", docs and changed logs updated
  accordingly. (Rafael)

  The sysfs hierarchy is altered to put initiators and targets in their
  own attribute group directories (Rafael).

  The node lists are removed. This feedback is in conflict with v1
  feedback, but consensus wants to remove multi-value sysfs attributes,
  which includes lists. We only have symlinks now, just like v1 provided.

  Documentation and code patches are combined such that the code
  introducing new attributes and its documentation are in the same
  patch. (Rafael and Dan).

  The performance attributes, bandwidth and latency, are moved into the
  initiators directory. This should make it obvious to which node access
  the attributes apply, which was previously ambiguous; an illustrative
  sketch follows this change list. (Jonathan Cameron).

  The HMAT code selecting "local" initiators is substantially changed.
  Only PXMs that have identical performance to the HMAT's processor PXM
  in the Address Range Structure are registered. This is to avoid
  considering nodes identical when only one of several perf attributes
  is the same. (Jonathan Cameron).

  Verbose variable naming. Examples include "initiator" and "target"
  instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
  "p". (Rafael)

  Compile fixes for when HMEM_REPORTING is not set. This is not a
  user-selectable config option, it defaults to 'n', and it will have to
  be selected by other config options that require it (Greg KH and Rafael).
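
  For illustration, a minimal userspace sketch of how an application
  might read one of the new attributes; the node number, access class
  and attribute name below are only examples of the layout described
  above, and error handling is trimmed:

    #include <stdio.h>

    int main(void)
    {
            /* Path shape: /sys/devices/system/node/nodeN/accessM/initiators/<attr> */
            const char *path =
                    "/sys/devices/system/node/node4/access0/initiators/read_latency";
            unsigned long value;
            FILE *f = fopen(path, "r");

            if (!f)
                    return 1;
            if (fscanf(f, "%lu", &value) == 1)
                    printf("read_latency: %lu\n", value);
            fclose(f);
            return 0;
    }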

== Background ==

Platforms may provide multiple types of CPU-attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering which node they want
their memory allocated from.

It had previously been difficult to describe these setups, as memory
ranges were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created and are in use today to describe
the more complex memory hierarchies that can be created.

This series' objective is to provide the attributes from such systems
that are useful for applications to know about, and readily usable with
existing tools and libraries.

Keith Busch (10):
  acpi: Create subtable parsing infrastructure
  acpi: Add HMAT to generic parsing tables
  acpi/hmat: Parse and report heterogeneous memory
  node: Link memory nodes to their compute nodes
  acpi/hmat: Register processor domain to its memory
  node: Add heterogenous memory access attributes
  acpi/hmat: Register performance attributes
  node: Add memory caching attributes
  acpi/hmat: Register memory side cache attributes
  doc/mm: New documentation for memory performance

 Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
 Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
 arch/arm64/kernel/acpi_numa.c                 |   2 +-
 arch/arm64/kernel/smp.c                       |   4 +-
 arch/ia64/kernel/acpi.c                       |  12 +-
 arch/x86/kernel/acpi/boot.c                   |  36 +-
 drivers/acpi/Kconfig                          |   1 +
 drivers/acpi/Makefile                         |   1 +
 drivers/acpi/hmat/Kconfig                     |   9 +
 drivers/acpi/hmat/Makefile                    |   1 +
 drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
 drivers/acpi/numa.c                           |  16 +-
 drivers/acpi/scan.c                           |   4 +-
 drivers/acpi/tables.c                         |  76 +++-
 drivers/base/Kconfig                          |   8 +
 drivers/base/node.c                           | 354 ++++++++++++++++-
 drivers/irqchip/irq-gic-v2m.c                 |   2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
 drivers/irqchip/irq-gic-v3-its.c              |   6 +-
 drivers/irqchip/irq-gic-v3.c                  |  10 +-
 drivers/irqchip/irq-gic.c                     |   4 +-
 drivers/mailbox/pcc.c                         |   2 +-
 include/linux/acpi.h                          |   6 +-
 include/linux/node.h                          |  60 ++-
 25 files changed, 1344 insertions(+), 65 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

Comments

Michal Hocko Jan. 28, 2019, 2 p.m. UTC | #1
Is there any reason why this is not CCing linux-api (CCed now)?

On Thu 24-01-19 16:07:14, Keith Busch wrote:
> == Changes since v4 ==
> 
>   All public interfaces have kernel docs.
> 
>   Renamed "class" to "access", docs and changed logs updated
>   accordingly. (Rafael)
> 
>   The sysfs hierarchy is altered to put initiators and targets in their
>   own attribute group directories (Rafael).
> 
>   The node lists are removed. This feedback is in conflict with v1
>   feedback, but consensus wants to remove multi-value sysfs attributes,
>   which includes lists. We only have symlinks now, just like v1 provided.
> 
>   Documentation and code patches are combined such that the code
>   introducing new attributes and its documentation are in the same
>   patch. (Rafael and Dan).
> 
>   The performance attributes, bandwidth and latency, are moved into the
>   initiators directory. This should make it obvious for which node
>   access the attributes apply, which was previously ambiguous.
>   (Jonathan Cameron).
> 
>   The HMAT code selecting "local" initiators is substantially changed.
>   Only PXM's that have identical performance to the HMAT's processor PXM
>   in Address Range Structure are registered. This is to avoid considering
>   nodes identical when only one of several perf attributes are the same.
>   (Jonathan Cameron).
> 
>   Verbose variable naming. Examples include "initiator" and "target"
>   instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
>   "p". (Rafael)
> 
>   Compile fixes for when HMEM_REPORTING is not set. This is not a user
>   selectable config option, default 'n', and will have to be selected
>   by other config options that require it (Greg KH and Rafael).
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

Can you provide a high-level description of these new attributes and how
they are supposed to be used?

Mentioning use cases is also due, considering the amount of code this
adds.

> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 354 ++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1344 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 
> -- 
> 2.14.4
Jonathan Cameron Feb. 6, 2019, 12:31 p.m. UTC | #2
On Thu, 24 Jan 2019 16:07:14 -0700
Keith Busch <keith.busch@intel.com> wrote:

> == Changes since v4 ==
> 
>   All public interfaces have kernel docs.
> 
>   Renamed "class" to "access", docs and changed logs updated
>   accordingly. (Rafael)
> 
>   The sysfs hierarchy is altered to put initiators and targets in their
>   own attribute group directories (Rafael).
> 
>   The node lists are removed. This feedback is in conflict with v1
>   feedback, but consensus wants to remove multi-value sysfs attributes,
>   which includes lists. We only have symlinks now, just like v1 provided.
> 
>   Documentation and code patches are combined such that the code
>   introducing new attributes and its documentation are in the same
>   patch. (Rafael and Dan).
> 
>   The performance attributes, bandwidth and latency, are moved into the
>   initiators directory. This should make it obvious for which node
>   access the attributes apply, which was previously ambiguous.
>   (Jonathan Cameron).
> 
>   The HMAT code selecting "local" initiators is substantially changed.
>   Only PXM's that have identical performance to the HMAT's processor PXM
>   in Address Range Structure are registered. This is to avoid considering
>   nodes identical when only one of several perf attributes are the same.
>   (Jonathan Cameron).
> 
>   Verbose variable naming. Examples include "initiator" and "target"
>   instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
>   "p". (Rafael)
> 
>   Compile fixes for when HMEM_REPORTING is not set. This is not a user
>   selectable config option, default 'n', and will have to be selected
>   by other config options that require it (Greg KH and Rafael).
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

Hi Keith,

Seems to be heading in the right direction to me... (though I personally
want to see the whole of HMAT exposed, but meh, that seems unpopular :)

I've fired up a new test rig (someone pinched the fan on the previous one)
that I can make present pretty much anything to this code.

First up is a system with 4 nodes with CPU and local DDR [0-3] + 1 remote node
with just memory [4]. All the figures are as you might expect between the nodes
with CPUs. The remote node has equal numbers from all the CPUs.

First some general comments on places this doesn't work as my gut feeling
said it would...

I'm going to keep this somewhat vague on certain points as ACPI 6.3 should
be public any day now and I think it is fair to say we should take into
account any changes in there...
There is definitely one place the current patches won't work with 6.3, but
I'll point it out in a few days.  There may be others.

1) It seems this version added a hard dependence on having the memory node
   listed in the Memory Proximity Domain attribute structures.  I'm not 100%
   sure there is actually any requirement to have those structures. If you aren't
   using the hint bit, they don't convey any information.  It could be argued
   that they provide info on what is found in the other HMAT entries, but there
   is little purpose as those entries are explicit in what they provide.
   (Given I didn't have any of these structures and things worked fine with
    v4, it seems this is a new check.)

   This is also somewhat inconsistent.
   a) If a given entry isn't there, we still get for example
      node4/access0/initiators/[read|write]_* but all values are 0.
      If we want to do the check you have, it needs to not create the files in
      this case.  Whilst they have no meaning as there are no initiators, it
      is inconsistent to my mind.

   b) Having one "Memory Proximity Domain attribute structure" for node 4 linking
      it to node0 is sufficient to allow
      node4/access0/initiators/node0
      node4/access0/initiators/node1
      node4/access0/initiators/node2
      node4/access0/initiators/node3
      I think if we are going to enforce the presence of that structure then only
      the node0 link should exist.

2) Error handling could perhaps do with spitting out some nasty warnings.
   If we have an entry for nodes that don't exist we shouldn't just fail
   silently; that's just one example I managed to trigger with minor table
   tweaking.
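
   Something along these lines is roughly what I have in mind; the
   function name is just a placeholder, though pxm_to_node() and
   NUMA_NO_NODE are the existing helpers I'd expect such a check to use:

     #include <linux/acpi.h>     /* pxm_to_node() */
     #include <linux/numa.h>     /* NUMA_NO_NODE */
     #include <linux/printk.h>

     /* Sketch only: warn rather than silently ignoring a bad entry. */
     static void hmat_warn_unknown_pxm(int pxm)
     {
             if (pxm_to_node(pxm) == NUMA_NO_NODE)
                     pr_warn("HMAT: entry references unknown proximity domain %d\n",
                             pxm);
     }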

Personally I would just get rid of enforcing anything based on the presence of
that structure.

I'll send more focused comments on some of the individual patches.

Thanks,

Jonathan
   

> 
> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 354 ++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1344 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
>
Keith Busch Feb. 6, 2019, 5:19 p.m. UTC | #3
On Wed, Feb 06, 2019 at 12:31:00PM +0000, Jonathan Cameron wrote:
> On Thu, 24 Jan 2019 16:07:14 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> 
> 1) It seems this version added a hard dependence on having the memory node
>    listed in the Memory Proximity Domain attribute structures.  I'm not 100%
>    sure there is actually any requirement to have those structures. If you aren't
>    using the hint bit, they don't convey any information.  It could be argued
>    that they provide info on what is found in the other hmat entries, but there
>    is little purpose as those entries are explicit in what the provide.
>    (Given I didn't have any of these structures and things  worked fine with
>     v4 it seems this is a new check).

Right, v4 just used the node(s) with the highest performance. You mentioned
systems having nodes with different performance, but no winner across all
attributes, so there's no clear way to rank these for access class linkage.
Requiring an initiator PXM to be present clears that up.

Maybe we can fall back to performance if the initiator PXM isn't provided,
but the ranking is going to require an arbitrary decision, like prioritizing
latency over bandwidth.
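
A rough sketch of what that arbitrary ordering could look like; the
struct and names below are placeholders rather than anything the series
actually uses:

  #include <linux/types.h>

  /* Hypothetical stand-in for whatever the HMAT code tracks per initiator. */
  struct initiator_perf {
          u32 read_latency;       /* lower is better  */
          u32 read_bandwidth;     /* higher is better */
  };

  /* One possible (arbitrary) policy: latency first, bandwidth breaks ties. */
  static bool initiator_preferred(const struct initiator_perf *a,
                                  const struct initiator_perf *b)
  {
          if (a->read_latency != b->read_latency)
                  return a->read_latency < b->read_latency;
          return a->read_bandwidth > b->read_bandwidth;
  }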
 
>    This is also somewhat inconsistent.
>    a) If a given entry isn't there, we still get for example
>       node4/access0/initiators/[read|write]_* but all values are 0.
>       If we want to do the check you have it needs to not create the files in
>       this case.  Whilst they have no meaning as there are no initiators, it
>       is inconsistent to my mind.
> 
>    b) Having one "Memory Proximity Domain attribute structure" for node 4 linking
>       it to node0 is sufficient to allow
>       node4/access0/initiators/node0
>       node4/access0/initiators/node1
>       node4/access0/initiators/node2
>       node4/access0/initiators/node3
>       I think if we are going to enforce the presence of that structure then only
>       the node0 link should exist.

We'd link the initiator PXM from the Address Range Structure, and also any
other nodes with identical access performance. I think that makes sense.
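
Roughly, the comparison would be along these lines; the names are
placeholders for whatever the HMAT code records per initiator:

  #include <linux/types.h>

  /* Hypothetical per-initiator record; field names are placeholders. */
  struct access_perf {
          u32 read_latency;
          u32 write_latency;
          u32 read_bandwidth;
          u32 write_bandwidth;
  };

  /* Link an additional initiator only if every attribute matches. */
  static bool perf_identical(const struct access_perf *a,
                             const struct access_perf *b)
  {
          return a->read_latency    == b->read_latency &&
                 a->write_latency   == b->write_latency &&
                 a->read_bandwidth  == b->read_bandwidth &&
                 a->write_bandwidth == b->write_bandwidth;
  }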
 
> 2) Error handling could perhaps do to spit out some nasty warnings.
>    If we have an entry for nodes that don't exist we shouldn't just fail silently,
>    that's just one example I managed to trigger with minor table tweaking.
> 
> Personally I would just get rid of enforcing anything based on the presence of
> that structure.
Jonathan Cameron Feb. 6, 2019, 5:30 p.m. UTC | #4
On Wed, 6 Feb 2019 10:19:37 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, Feb 06, 2019 at 12:31:00PM +0000, Jonathan Cameron wrote:
> > On Thu, 24 Jan 2019 16:07:14 -0700
> > Keith Busch <keith.busch@intel.com> wrote:
> > 
> > 1) It seems this version added a hard dependence on having the memory node
> >    listed in the Memory Proximity Domain attribute structures.  I'm not 100%
> >    sure there is actually any requirement to have those structures. If you aren't
> >    using the hint bit, they don't convey any information.  It could be argued
> >    that they provide info on what is found in the other hmat entries, but there
> >    is little purpose as those entries are explicit in what the provide.
> >    (Given I didn't have any of these structures and things  worked fine with
> >     v4 it seems this is a new check).  
> 
> Right, v4 just used the node(s) with the highest performance. You mentioned
> systems having nodes with different performance, but no winner across all
> attributes, so there's no clear way to rank these for access class linkage.
> Requiring an initiator PXM present clears that up.
> 
> Maybe we can fallback to performance if the initiator pxm isn't provided,
> but the ranking is going to require an arbitrary decision, like prioritize
> latency over bandwidth.

I'd certainly prefer to see that fallback and would argue it is
the only valid route.  What is 'best' if we don't put a preference on
one parameter over the other?

Perfectly fine to have another access class that prefers bandwidth
if that is of sufficient use to people.

>  
> >    This is also somewhat inconsistent.
> >    a) If a given entry isn't there, we still get for example
> >       node4/access0/initiators/[read|write]_* but all values are 0.
> >       If we want to do the check you have it needs to not create the files in
> >       this case.  Whilst they have no meaning as there are no initiators, it
> >       is inconsistent to my mind.
> > 
> >    b) Having one "Memory Proximity Domain attribute structure" for node 4 linking
> >       it to node0 is sufficient to allow
> >       node4/access0/initiators/node0
> >       node4/access0/initiators/node1
> >       node4/access0/initiators/node2
> >       node4/access0/initiators/node3
> >       I think if we are going to enforce the presence of that structure then only
> >       the node0 link should exist.  
> 
> We'd link the initiator pxm in the Address Range Structure, and also any
> other nodes with identical performance access. I think that makes sense.

I disagree on this. It is either/or; it seems really illogical to build
all of them if only one initiator is specified for the target.

If someone deliberately only specified one initiator for this target then they
meant to do that (hopefully).  Probably because they wanted to set one
of the flags.

>  
> > 2) Error handling could perhaps do to spit out some nasty warnings.
> >    If we have an entry for nodes that don't exist we shouldn't just fail silently,
> >    that's just one example I managed to trigger with minor table tweaking.
> > 
> > Personally I would just get rid of enforcing anything based on the presence of
> > that structure.  

Thanks,

Jonathan
Jonathan Cameron Feb. 7, 2019, 9:53 a.m. UTC | #5
On Thu, 24 Jan 2019 16:07:14 -0700
Keith Busch <keith.busch@intel.com> wrote:

> == Changes since v4 ==
> 
>   All public interfaces have kernel docs.
> 
>   Renamed "class" to "access", docs and changed logs updated
>   accordingly. (Rafael)
> 
>   The sysfs hierarchy is altered to put initiators and targets in their
>   own attribute group directories (Rafael).
> 
>   The node lists are removed. This feedback is in conflict with v1
>   feedback, but consensus wants to remove multi-value sysfs attributes,
>   which includes lists. We only have symlinks now, just like v1 provided.
> 
>   Documentation and code patches are combined such that the code
>   introducing new attributes and its documentation are in the same
>   patch. (Rafael and Dan).
> 
>   The performance attributes, bandwidth and latency, are moved into the
>   initiators directory. This should make it obvious for which node
>   access the attributes apply, which was previously ambiguous.
>   (Jonathan Cameron).
> 
>   The HMAT code selecting "local" initiators is substantially changed.
>   Only PXM's that have identical performance to the HMAT's processor PXM
>   in Address Range Structure are registered. This is to avoid considering
>   nodes identical when only one of several perf attributes are the same.
>   (Jonathan Cameron).
> 
>   Verbose variable naming. Examples include "initiator" and "target"
>   instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
>   "p". (Rafael)
> 
>   Compile fixes for when HMEM_REPORTING is not set. This is not a user
>   selectable config option, default 'n', and will have to be selected
>   by other config options that require it (Greg KH and Rafael).
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

As a general heads up, ACPI 6.3 is out and makes some changes.
Discussions I've had in the past suggested there were few systems
shipping with 6.2 HMAT and that many firmwares would start at 6.3.
Of course, that might not be true, but there was fairly wide participation
in the meeting so fingers crossed it's accurate.

https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

Particular points to note:
1. Most of the Memory Proximity Domain Attributes Structure was deprecated.
   This includes the reservation hint which has been replaced
   with a new mechanism (not used in this patch set)

2. Base units for latency changed to picoseconds.  There is a lot more
   explanatory text around how those work; a conversion sketch follows
   this list.

3. The measurements of latency and bandwidth no longer have an
   'aggregate performance' version.  Given the workload was not described
   this never made any sense.  Better for a knowledgeable bit of software
   to work out its own estimate.

4. There are now Generic Initiator Domains that have neither memory nor
   processors.  I'll come back with proposals on handling those soon if
   no one beats me to it. (I think it's really easy but may be wrong ;)
   I've not really thought out how this series applies to GI only domains
   yet.  Probably not useful to know you have an accelerator near to
   particular memory if you are deciding where to pin your host processor
   task ;)
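
On point 2, my reading of the 6.3 text is that a matrix entry is simply
scaled by the structure's Entry Base Unit, with latency now expressed in
picoseconds; the helper below is only a placeholder sketch of that:

  #include <linux/types.h>

  /* Sketch: scale a raw HMAT matrix entry by the structure's base unit. */
  static u64 hmat_entry_to_picoseconds(u16 entry, u64 entry_base_unit)
  {
          return (u64)entry * entry_base_unit;
  }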

Jonathan

> 
> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 354 ++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1344 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
>
Keith Busch Feb. 7, 2019, 3:08 p.m. UTC | #6
On Thu, Feb 07, 2019 at 01:53:36AM -0800, Jonathan Cameron wrote:
> As a general heads up, ACPI 6.3 is out and makes some changes.
> Discussions I've had in the past suggested there were few systems
> shipping with 6.2 HMAT and that many firmwares would start at 6.3.
> Of course, that might not be true, but there was fairly wide participation
> in the meeting so fingers crossed it's accurate.
> 
> https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
> 
> Particular points to note:
> 1. Most of the Memory Proximity Domain Attributes Structure was deprecated.
>    This includes the reservation hint which has been replaced
>    with a new mechanism (not used in this patch set)

Yes, and duplicating all the address ranges with SRAT never made any
sense. No need to define the same thing in multiple places; that's just
another opportunity to get it wrong.
 
> 2. Base units for latency changed to picoseconds.  There is a lot more
>    explanatory text around how those work.
>
> 3. The measurements of latency and bandwidth no longer have an
>    'aggregate performance' version.  Given the work load was not described
>    this never made any sense.  Better for a knowledgeable bit of software
>    to work out it's own estimate.

Nice. Though they shifted 1st level cache to occupy the same value that
the aggregate used. They could have just deprecated the old value so we
could maintain compatibility, but that's okay!
 
> 4. There are now Generic Initiator Domains that have neither memory nor
>    processors.  I'll come back with proposals on handling those soon if
>    no one beats me to it. (I think it's really easy but may be wrong ;)
>    I've not really thought out how this series applies to GI only domains
>    yet.  Probably not useful to know you have an accelerator near to
>    particular memory if you are deciding where to pin your host processor
>    task ;)

I haven't any particular use for these at the moment either, though it
shouldn't change what this is going to export.

Thanks for the heads up! I'll incorporate 6.3 into v6.