
[V6,0/7] ACPI: Support Generic Initiator proximity domains

Message ID: 20191216153809.105463-1-Jonathan.Cameron@huawei.com

Message

Jonathan Cameron Dec. 16, 2019, 3:38 p.m. UTC
Introduces a new type of NUMA node for cases where we want to represent
the access characteristics of a non-CPU initiator of memory requests,
as these differ from those of existing nodes containing CPUs and/or
memory.

These Generic Initiators are presented by the node access0 class in
sysfs in the same way as a CPU.   It seems likely that there will be
use cases in which the best 'CPU' is desired and Generic Initiators
should be ignored.  The final few patches in this series introduce
access1, a new performance class in the sysfs node description
which presents only CPU-to-memory relationships.  Test cases for this
are described below.
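
To give a feel for how the new class might be consumed, here is a minimal
userspace sketch (not part of the series) that lists the access1
initiators of a memory node, assuming the sysfs layout shown in the
test cases below:

/* Sketch only: list the access1 (CPU-only) initiators of a memory node.
 * Paths follow the sysfs layout shown in the test cases below.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void print_access1_initiators(int mem_node)
{
	char path[128];
	struct dirent *d;
	DIR *dir;

	snprintf(path, sizeof(path),
		 "/sys/bus/node/devices/node%d/access1/initiators", mem_node);
	dir = opendir(path);
	if (!dir)
		return;
	while ((d = readdir(dir)) != NULL) {
		/* initiator nodes appear as nodeN entries in this directory */
		if (!strncmp(d->d_name, "node", 4))
			printf("%s\n", d->d_name);
	}
	closedir(dir);
}

int main(void)
{
	print_access1_initiators(4);	/* e.g. node4 in the test cases below */
	return 0;
}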

Thanks to Dan for suggestions on V5.  Most of the changes are
an attempt to implement what was discussed in that thread.

The new patch makes it clear that some of the existing naming is perhaps
more specific than it should be. It may be worth a follow up patch
to rename from *cpu* to *initiator* in a few places where this might
cause confusion.

One outstanding question to highlight in this series is whether
we should assume all ACPI supporting architectures support Generic
Initiator domains, or whether to introduce an
ARCH_HAS_GENERIC_INITIATOR_DOMAINS entry in Kconfig.

Changes since V5:

3 new patches:
* A fix for a subtlety in how ACPI 6.3 changed part of the HMAT table.
* Introduction of access1 class to represent characteristics between CPU
  and memory, ignoring GIs, unlike access0 which includes them.
* Docs to describe the new access1 class.

Note that I ran a number of test cases for the new class which are
described at the end of this email.

Changes since V4:

At Rafael's suggestion:

Rebase on top of Dan Williams' Specific Purpose Memory series as that
moves srat.c.  The original patches cherry-picked fine onto mmotm with
Dan's patches applied.

Applies to mmotm-2019-09-25 +
https://lore.kernel.org/linux-acpi/156140036490.2951909.1837804994781523185.stgit@dwillia2-desk3.amr.corp.intel.com/
[PATCH v4 00/10] EFI Specific Purpose Memory Support
(note there are some trivial conflicts to deal with when applying
the SPM series).

Changes since V3:
* Rebase.

Changes since RFC V2:
* RFC dropped as we now have x86 support, so the lack of guards in the
  ACPI code etc. should now be fine.
* Added x86 support.  Note this has only been tested on QEMU as I don't have
  a convenient x86 NUMA machine to play with.  Note that this fitted together
  rather differently from arm64, so I'm particularly interested in feedback
  on the two solutions.

Changes since RFC V1:
* Fix incorrect interpretation of the ACPI entry noted by Keith Busch
* Use the ACPICA header definitions that are now in mmotm.

It's worth noting that safely putting a given device in a GI node may
require changes to existing drivers, as it is not unusual for them to
assume a local memory or processor core is present.  There may be further
constraints not yet covered by this patch set.
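
For example, a driver allocating memory 'near' its device needs to cope
with that device's node being memoryless; a rough sketch of the kind of
pattern involved (illustrative only, not from this series):

#include <linux/device.h>
#include <linux/numa.h>
#include <linux/slab.h>
#include <linux/topology.h>

/*
 * Sketch only: allocate memory near a device whose node may be a
 * GI-only (memoryless, CPU-less) node.  local_memory_node() maps a
 * memoryless node to the nearest node that does have memory.
 */
static void *alloc_near_device(struct device *dev, size_t size)
{
	int nid = dev_to_node(dev);

	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return kzalloc_node(size, GFP_KERNEL, local_memory_node(nid));
}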

Original cover letter...

ACPI 6.3 introduced a new entity that can be part of a NUMA proximity domain.
It may share such a domain with the existing options (memory, CPU etc) but it
may also exist on its own.

The intent is to allow the description of the NUMA properties (particularly
via HMAT) of accelerators and other initiators of memory activity that are not
the host processor running the operating system.

This patch set introduces 'just enough' to make them work for arm64 and x86.
It should be trivial to support other architectures; I just don't have
suitable NUMA systems readily available to test.
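
For reference, the heart of the SRAT side is just a handler for the new
affinity structure, roughly along the lines below (a simplified sketch
with approximate names; patch 1 has the real implementation, including
the per-architecture hook):

/* drivers/acpi/numa/srat.c (simplified sketch, see patch 1) */
static int __init
acpi_parse_gi_affinity(union acpi_subtable_headers *header,
		       const unsigned long end)
{
	struct acpi_srat_generic_affinity *gi =
		(struct acpi_srat_generic_affinity *)header;
	int node;

	if (!gi || !(gi->flags & ACPI_SRAT_GENERIC_AFFINITY_ENABLED))
		return -EINVAL;

	/* Map the GI proximity domain to a node, just as for CPU/memory. */
	node = acpi_map_pxm_to_node(gi->proximity_domain);
	if (node == NUMA_NO_NODE || node >= MAX_NUMNODES) {
		pr_err("SRAT: Too many proximity domains.\n");
		return -EINVAL;
	}
	node_set(node, numa_nodes_parsed);

	return 0;
}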

There are a few quirks that need to be considered.

1. Fall back nodes
******************

As operating systems that predate ACPI 6.3 support do not know about
Generic Initiator Proximity Domains, it is possible for firmware to
specify, via _PXM in DSDT, that another device is part of such a GI-only
node.  This currently blows up spectacularly.

Whilst we can obviously 'now' protect against such a situation (see the
related thread on PCI _PXM support and the Threadripper board identified
there as also hitting the problem of using non-existent nodes,
https://patchwork.kernel.org/patch/10723311/ ), there is no way to be sure
we will never have legacy OSes that are not protected against this.  It
would also be non-ideal to fall back to a default node, as there may be a
better (non-GI) node to pick if GI nodes aren't available.

The workaround is a new system-wide _OSC bit that allows an operating
system to announce that it supports Generic Initiators.  This allows
the firmware to use DSDT magic to 'move' devices between nodes
depending on whether the new GI nodes are present or not.
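
In code terms the announcement is one extra bit in the platform-wide
_OSC support word; a sketch of the two one-line changes involved (the
exact macro name and value are defined by the spec and by patch 4, so
treat these as illustrative):

/* include/linux/acpi.h (sketch): bit value per the ACPI 6.3 definition */
#define OSC_SB_GENERIC_INITIATOR_SUPPORT	0x00002000

/* drivers/acpi/bus.c (sketch): tell firmware we understand GI domains */
capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_GENERIC_INITIATOR_SUPPORT;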

2. New ways of assigning a proximity domain for devices
*******************************************************

Until now, the only way firmware could indicate that a particular device
(outside the 'special' set of CPUs etc.) was to be found in a particular
Proximity Domain was via _PXM in DSDT.

That is equally valid with GI domains, but we have new options. The SRAT
affinity structure includes a handle (ACPI or PCI) to identify devices
within the system and specify their proximity domain that way.  If both _PXM
and this are provided, they should give the same answer.
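
For reference, the ACPICA encoding of that structure (from the headers
already in mmotm) looks roughly like the following; field names are
approximate:

/* SRAT Generic Initiator Affinity Structure (ACPI 6.3, subtable type 5) */
struct acpi_srat_generic_affinity {
	struct acpi_subtable_header header;
	u8 reserved;
	u8 device_handle_type;		/* 0: ACPI device handle, 1: PCI */
	u32 proximity_domain;
	u8 device_handle[16];		/* _HID + _UID, or PCI segment/BDF */
	u32 flags;
	u32 reserved1;
};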

For now this patch set completely ignores that feature as we don't need
it to start the discussion.  It will form a follow up set at some point
(if no one else fancies doing it).

Test cases for the access1 class
********************************

Test cases for Generic Initiator additions to HMAT.

Setup

PXM0 (node 0) - CPU0 CPU1, 2G memory
PXM1 (node 1) - CPU2 CPU3, 2G memory
PXM2 (node 2) - CPU4 CPU5, 2G memory
PXM3 (node 4) - 2G memory (GI in one case below)
PXM4 (node 3) - GI only.

Config 1:  GI in PXM4 nearer to memory in PXM 3 than CPUs, not direct attached

[    2.384064] acpi/hmat: HMAT: Locality: Flags:00 Type:Access Latency Initiator Domains:4 Target Domains:4 Base:256
[    2.384913] acpi/hmat:   Initiator-Target[0-0]:1 nsec
[    2.385190] acpi/hmat:   Initiator-Target[0-1]:9 nsec
[    2.385736] acpi/hmat:   Initiator-Target[0-2]:9 nsec
[    2.385984] acpi/hmat:   Initiator-Target[0-3]:9 nsec
[    2.386447] acpi/hmat:   Initiator-Target[1-0]:9 nsec
[    2.386740] acpi/hmat:   Initiator-Target[1-1]:1 nsec
[    2.386964] acpi/hmat:   Initiator-Target[1-2]:9 nsec
[    2.387174] acpi/hmat:   Initiator-Target[1-3]:9 nsec
[    2.387624] acpi/hmat:   Initiator-Target[2-0]:9 nsec
[    2.387953] acpi/hmat:   Initiator-Target[2-1]:9 nsec
[    2.388155] acpi/hmat:   Initiator-Target[2-2]:1 nsec
[    2.388607] acpi/hmat:   Initiator-Target[2-3]:9 nsec
[    2.388861] acpi/hmat:   Initiator-Target[4-0]:13 nsec
[    2.389126] acpi/hmat:   Initiator-Target[4-1]:13 nsec
[    2.389574] acpi/hmat:   Initiator-Target[4-2]:13 nsec
[    2.389805] acpi/hmat:   Initiator-Target[4-3]:5 nsec

# Sysfs reads the same for nodes 0-2 for access0 and access1 as no GI involved.

/sys/bus/node/devices/...
    node0 #1 and 2 similar.
        access0
            initiators
                node0
                read_bandwidth  0 #not specified in hmat
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
            uevent
        access1
            initiators
                node0
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
            uevent
        compact
        cpu0
        cpu1
        ...
    node3 # Note PXM 4, contains GI only
        access0
            initiators
                *empty*
            power
            targets
                node4
            uevent
        compact
        ...
    node4
        access0
            initiators
                node3
                read_bandwidth  0
                read_latency    5
                write_bandwidth 0
                write_latency   5
            power
            targets
                *empty*
            uevent
        access1
            initiators
                node0
                node1
                node2
                read_bandwidth  0
                read_latency    9
                write_bandwidth 0
                write_latency   9
            power
            targets
                *empty*
            uevent
        compact
        ...

Config 2:  GI in PXM4 further from memory in PXM 3 than the CPUs, not direct attached

[    4.073493] acpi/hmat: HMAT: Locality: Flags:00 Type:Access Latency Initiator Domains:4 Target Domains:4 Base:256
[    4.074785] acpi/hmat:   Initiator-Target[0-0]:1 nsec
[    4.075150] acpi/hmat:   Initiator-Target[0-1]:9 nsec
[    4.075423] acpi/hmat:   Initiator-Target[0-2]:9 nsec
[    4.076184] acpi/hmat:   Initiator-Target[0-3]:9 nsec
[    4.077116] acpi/hmat:   Initiator-Target[1-0]:9 nsec
[    4.077366] acpi/hmat:   Initiator-Target[1-1]:1 nsec
[    4.077640] acpi/hmat:   Initiator-Target[1-2]:9 nsec
[    4.078156] acpi/hmat:   Initiator-Target[1-3]:9 nsec
[    4.078471] acpi/hmat:   Initiator-Target[2-0]:9 nsec
[    4.078994] acpi/hmat:   Initiator-Target[2-1]:9 nsec
[    4.079277] acpi/hmat:   Initiator-Target[2-2]:1 nsec
[    4.079505] acpi/hmat:   Initiator-Target[2-3]:9 nsec
[    4.080126] acpi/hmat:   Initiator-Target[4-0]:13 nsec
[    4.080995] acpi/hmat:   Initiator-Target[4-1]:13 nsec
[    4.081351] acpi/hmat:   Initiator-Target[4-2]:13 nsec
[    4.082125] acpi/hmat:   Initiator-Target[4-3]:13 nsec

/sys/bus/node/devices/...
    node0 #1 and 2 similar.
        access0
            initiators
                node0
                read_bandwidth  0 #not specified in hmat
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
                node4
            uevent
        access1
            initiators
                node0
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
                node4
            uevent
        compact
        cpu0
        cpu1
        ...
    node3 # Note PXM 4, contains GI only
        #No accessX directories.
        compact
        ...
    node4
        access0
            initiators
                node0
                node1
                node2
                read_bandwidth  0
                read_latency    9
                write_bandwidth 0
                write_latency   9
            power
            targets
                *empty*
            uevent
        access1
            initiators
                node0
                node1
                node2
                read_bandwidth  0
                read_latency    9
                write_bandwidth 0
                write_latency   9
            power
            targets
                *empty*
            uevent
        compact
        ...


Case 3 - as per Case 2, but now the memory (PXM 3) is direct attached to the
GI whilst being nearer the main nodes (not physically sensible :))

/sys/bus/node/devices/...
    node0 #1 and 2 similar.
        access0
            initiators
                node0
                read_bandwidth  0 #not specified in hmat
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
                node4
            uevent
        access1
            initiators
                node0
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
                node4
            uevent
        compact
        cpu0
        cpu1
        ...
    node3 # Note PXM 4, contains GI only
        access0
            initiators
                *empty*
            power
            targets
                node4
            uevent
        compact
        ...
    node4
        access0
            initiators
                node3
                read_bandwidth  0
                read_latency    13
                write_bandwidth 0
                write_latency   13
            power
            targets
                *empty*
            uevent
        access1
            initiators
                node0
                node1
                node2
                read_bandwidth  0
                read_latency    9
                write_bandwidth 0
                write_latency   9
            power
            targets
                *empty*
            uevent
        compact
        ...

Case 4 - memory nearer the GI, but direct attached to one of the CPUs.
# Another bonkers one.

/sys/bus/node/devices/...
    node0 #1 similar.
        access0
            initiators
                node0
                read_bandwidth  0 #not specified in hmat
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
                node4
            uevent
        access1
            initiators
                node0
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
            uevent
        compact
        cpu0
        cpu1
        ...
    node2 # Direct attached to memory in node 3
        access0
            initiators
                node2
                read_bandwidth  0 #not specified in hmat
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node2
                node4 #direct attached
            uevent
        access1
            initiators
                node2
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node2
                node4 #direct attached
            uevent
        compact
        cpu0
        cpu1
        ...

    node3 # Note PXM 4, contains GI only
        #No accessX directories.
        compact
        ...
    node4
        access0
            initiators
                node3
                read_bandwidth  0
                read_latency    13
                write_bandwidth 0
                write_latency   13
            power
            targets
                *empty*
            uevent
        access1
            initiators
                node0
                node1
                node2
                read_bandwidth  0
                read_latency    9
                write_bandwidth 0
                write_latency   9
            power
            targets
                *empty*
            uevent
        compact
        ...

Case 5 - memory and GI together in node 3 (added an extra GI to node 3).
Note the HMAT should also reflect this extra initiator domain.

/sys/bus/node/devices/...
    node0 #1 and 2 similar.
        access0
            initiators
                node0
                read_bandwidth  0 #not specified in hmat
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
                node4
            uevent
        access1
            initiators
                node0
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node0
            uevent
        compact
        cpu0
        cpu1
        ...
    node3 # Note PXM 3, contains GI only
        #No accessX directories.
        compact
        ...
    node4 # Now memory and GI.
        access0
            initiators
                node4
                read_bandwidth  0
                read_latency    1
                write_bandwidth 0
                write_latency   1
            power
            targets
                node4
            uevent
        access1
            initiators
                node0
                node1
                node2
                read_bandwidth  0
                read_latency    9
                write_bandwidth 0
                write_latency   9
            power
            targets
                *empty* # as expected, the GI doesn't participate in access1.
            uevent
        compact
        ...

Jonathan Cameron (7):
  ACPI: Support Generic Initiator only domains
  arm64: Support Generic Initiator only domains
  x86: Support Generic Initiator only proximity domains
  ACPI: Let ACPI know we support Generic Initiator Affinity Structures
  ACPI: HMAT: Fix handling of changes from ACPI 6.2 to ACPI 6.3
  node: Add access1 class to represent CPU to memory characteristics
  docs: mm: numaperf.rst Add brief description for access class 1.

 Documentation/admin-guide/mm/numaperf.rst |  8 ++
 arch/arm64/kernel/smp.c                   |  8 ++
 arch/x86/include/asm/numa.h               |  2 +
 arch/x86/kernel/setup.c                   |  1 +
 arch/x86/mm/numa.c                        | 14 ++++
 drivers/acpi/bus.c                        |  1 +
 drivers/acpi/numa/hmat.c                  | 89 ++++++++++++++++++-----
 drivers/acpi/numa/srat.c                  | 62 +++++++++++++++-
 drivers/base/node.c                       |  3 +
 include/asm-generic/topology.h            |  3 +
 include/linux/acpi.h                      |  1 +
 include/linux/nodemask.h                  |  1 +
 include/linux/topology.h                  |  7 ++
 13 files changed, 179 insertions(+), 21 deletions(-)

Comments

Brice Goglin Dec. 18, 2019, 11:32 a.m. UTC | #1
On 16/12/2019 at 16:38, Jonathan Cameron wrote:
> Introduces a new type of NUMA node for cases where we want to represent
> the access characteristics of a non CPU initiator of memory requests,
> as these differ from all those for existing nodes containing CPUs and/or
> memory.
>
> These Generic Initiators are presented by the node access0 class in
> sysfs in the same way as a CPU.   It seems likely that there will be
> usecases in which the best 'CPU' is desired and Generic Initiators
> should be ignored.  The final few patches in this series introduced
> access1 which is a new performance class in the sysfs node description
> which presents only CPU to memory relationships.  Test cases for this
> are described below.


Hello Jonathan

If I want to test this with a fake GI, what are the minimal set of
changes I should put in my ACPI tables? Can I just specify a dummy GI in
SRAT? What handle should I use there?

Thanks

Brice
Jonathan Cameron Dec. 18, 2019, 2:50 p.m. UTC | #2
On Wed, 18 Dec 2019 12:32:06 +0100
Brice Goglin <brice.goglin@gmail.com> wrote:

> On 16/12/2019 at 16:38, Jonathan Cameron wrote:
> > Introduces a new type of NUMA node for cases where we want to represent
> > the access characteristics of a non CPU initiator of memory requests,
> > as these differ from all those for existing nodes containing CPUs and/or
> > memory.
> >
> > These Generic Initiators are presented by the node access0 class in
> > sysfs in the same way as a CPU.   It seems likely that there will be
> > usecases in which the best 'CPU' is desired and Generic Initiators
> > should be ignored.  The final few patches in this series introduced
> > access1 which is a new performance class in the sysfs node description
> > which presents only CPU to memory relationships.  Test cases for this
> > are described below.  
> 
> 
> Hello Jonathan
> 
> If I want to test this with a fake GI, what are the minimal set of
> changes I should put in my ACPI tables? Can I just specify a dummy GI in
> SRAT? What handle should I use there?

Exactly that for a dummy GI.  Also extend HMAT and SLIT for the extra
proximity domain / initiator.

For the handle, anything is fine.  This patch set doesn't currently use it.
That handle was a bit controversial when this spec feature was being
discussed because it can 'disagree' with information from _PXM.

The ACPI spec ended up effectively relying on them agreeing.  So any handle
must identify a device that either doesn't have a _PXM entry or that
has one that refers to the same proximity domain.
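
Expressed as the ACPICA C structure, just to show which fields matter, a
dummy entry only really needs the type, proximity domain and enable flag
(illustrative values only):

static struct acpi_srat_generic_affinity dummy_gi = {
	.header = {
		.type	= ACPI_SRAT_TYPE_GENERIC_AFFINITY,
		.length	= sizeof(struct acpi_srat_generic_affinity),
	},
	.device_handle_type	= 0,	/* ACPI device handle */
	.proximity_domain	= 4,	/* the GI-only PXM */
	/* .device_handle left zeroed - nothing consumes it yet */
	.flags			= ACPI_SRAT_GENERIC_AFFINITY_ENABLED,
};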

Also note there is a fiddly corner case which is covered by an _OSC.
If you have a device that you want to use _PXM to put in a GI-only
domain, then older kernels will not know about the GI domain.  Hence
ACPI goes through a dance to ensure that a kernel that hasn't
announced it is GI aware doesn't get told anything is in a GI-only domain.
For testing this series, though, you can just ignore that.

The logic to actually pass that handle based specification through to the
devices is complex, so this set relies on _PXM in DSDT to actually associate
any device with the Generic Initiator domain.  If doing this for a PCI
device, note that you need the fix mentioned in the cover letter to actually
have _PXM apply to PCI EPs.  Note that the _PXM case needs to work anyway
as you might have a GI node with multiple GIs and there is no obligation
for them all to be specified in SRAT.

Once this initial set is in place we can work out how to use the SRAT
handle to associate it with a device.  To be honest, I haven't really
thought about how we'd do that yet.

Thanks,

Jonathan


> 
> Thanks
> 
> Brice
> 
>
Brice Goglin Dec. 20, 2019, 9:40 p.m. UTC | #3
On 18/12/2019 at 15:50, Jonathan Cameron wrote:
> On Wed, 18 Dec 2019 12:32:06 +0100
> Brice Goglin <brice.goglin@gmail.com> wrote:
>
>> On 16/12/2019 at 16:38, Jonathan Cameron wrote:
>>> Introduces a new type of NUMA node for cases where we want to represent
>>> the access characteristics of a non CPU initiator of memory requests,
>>> as these differ from all those for existing nodes containing CPUs and/or
>>> memory.
>>>
>>> These Generic Initiators are presented by the node access0 class in
>>> sysfs in the same way as a CPU.   It seems likely that there will be
>>> usecases in which the best 'CPU' is desired and Generic Initiators
>>> should be ignored.  The final few patches in this series introduced
>>> access1 which is a new performance class in the sysfs node description
>>> which presents only CPU to memory relationships.  Test cases for this
>>> are described below.  
>>
>> Hello Jonathan
>>
>> If I want to test this with a fake GI, what are the minimal set of
>> changes I should put in my ACPI tables? Can I just specify a dummy GI in
>> SRAT? What handle should I use there?
> Exactly that for a dummy GI.  Also extend HMAT and SLIT for the extra
> proximity domain / initiator.


I couldn't get this to work (your patches on top of 5.5-rc2). I added
the GI in SRAT, and extended HMAT and SLIT accordingly.

I don't know if that's expected but I get an additional node in sysfs,
with 0kB memory.

However the HMAT table gets ignored because find_mem_target() fails in
hmat_parse_proximity_domain(). The target should have been allocated in
alloc_memory_target() which is called in srat_parse_mem_affinity(), but
it seems to me that this function isn't called for GI nodes. Or should
SRAT also contain a normal Memory node with the same PXM as the GI?

Brice
Jonathan Cameron Jan. 2, 2020, 3:27 p.m. UTC | #4
On Fri, 20 Dec 2019 22:40:18 +0100
Brice Goglin <brice.goglin@gmail.com> wrote:

> On 18/12/2019 at 15:50, Jonathan Cameron wrote:
> > On Wed, 18 Dec 2019 12:32:06 +0100
> > Brice Goglin <brice.goglin@gmail.com> wrote:
> >  
> >> On 16/12/2019 at 16:38, Jonathan Cameron wrote:
> >>> Introduces a new type of NUMA node for cases where we want to represent
> >>> the access characteristics of a non CPU initiator of memory requests,
> >>> as these differ from all those for existing nodes containing CPUs and/or
> >>> memory.
> >>>
> >>> These Generic Initiators are presented by the node access0 class in
> >>> sysfs in the same way as a CPU.   It seems likely that there will be
> >>> usecases in which the best 'CPU' is desired and Generic Initiators
> >>> should be ignored.  The final few patches in this series introduced
> >>> access1 which is a new performance class in the sysfs node description
> >>> which presents only CPU to memory relationships.  Test cases for this
> >>> are described below.    
> >>
> >> Hello Jonathan
> >>
> >> If I want to test this with a fake GI, what are the minimal set of
> >> changes I should put in my ACPI tables? Can I just specify a dummy GI in
> >> SRAT? What handle should I use there?  
> > Exactly that for a dummy GI.  Also extend HMAT and SLIT for the extra
> > proximity domain / initiator.  
> 
> 
> I couldn't get this to work (your patches on top of 5.5-rc2). I added
> the GI in SRAT, and extended HMAT and SLIT accordingly.
> 
> I don't know if that's expected but I get an additional node in sysfs,
> with 0kB memory.
> 
> However the HMAT table gets ignored because find_mem_target() fails in
> hmat_parse_proximity_domain(). The target should have been allocated in
> alloc_memory_target() which is called in srat_parse_mem_affinity(), but
> it seems to me that this function isn't called for GI nodes. Or should
> SRAT also contain a normal Memory node with same PM as the GI?
> 
Hi Brice,

Yes you should see a node with 0kB memory.  Same as you get for a processor
only node I believe.

srat_parse_mem_affinity shouldn't call alloc_memory_target for the GI nodes
as they don't have any memory.   The hmat table should only refer to
GI domains as initiators.  Just to check, do you have them listed as
a target node?  Or perhaps in some hmat proximity entry as memory_PD?

To answer your question, SRAT should not contain a normal memory node
with the same PXM, as that would defeat the whole purpose: we would have
been able to have such a domain without Generic Initiators.

Also, just to check, x86 or arm64?

Thanks for testing this.

Jonathan


> Brice
> 
>
Brice Goglin Jan. 2, 2020, 9:37 p.m. UTC | #5
On 02/01/2020 at 16:27, Jonathan Cameron wrote:
>
>> However the HMAT table gets ignored because find_mem_target() fails in
>> hmat_parse_proximity_domain(). The target should have been allocated in
>> alloc_memory_target() which is called in srat_parse_mem_affinity(), but
>> it seems to me that this function isn't called for GI nodes. Or should
>> SRAT also contain a normal Memory node with same PM as the GI?
>>
> Hi Brice,
>
> Yes you should see a node with 0kB memory.  Same as you get for a processor
> only node I believe.
>
> srat_parse_mem_affinity shouldn't call alloc_memory_target for the GI nodes
> as they don't have any memory.   The hmat table should only refer to
> GI domains as initiators.  Just to check, do you have them listed as
> a target node?  Or perhaps in some hmat proximity entry as memory_PD?
>

Thanks, I finally got things to work. I am on x86. It's a dual-socket
machine with SubNUMA clusters (2 nodes per socket) and NVDIMMs (one
dax-kmem node per socket). Before adding a GI, initiators look like this:

node0 -> node0 and node4

node1 -> node1 and node5

node2 -> node2 and node4

node3 -> node3 and node5

I added a GI with faster access to node0, node2, node4 (first socket).

The GI node becomes an access0 initiator for node4, and node0 and node2
remain access1 initiators.

The GI node doesn't become access0 initiator for node0 and node2, likely
because of this test :

        /*
         * If the Address Range Structure provides a local processor pxm, link
         * only that one. Otherwise, find the best performance attributes and
         * register all initiators that match.
         */
        if (target->processor_pxm != PXM_INVAL) {

I guess I should split node0-3 into separate CPU nodes and memory nodes
in SRAT?

Brice
Jonathan Cameron Jan. 3, 2020, 10:09 a.m. UTC | #6
On Thu, 2 Jan 2020 22:37:04 +0100
Brice Goglin <brice.goglin@gmail.com> wrote:

> On 02/01/2020 at 16:27, Jonathan Cameron wrote:
> >  
> >> However the HMAT table gets ignored because find_mem_target() fails in
> >> hmat_parse_proximity_domain(). The target should have been allocated in
> >> alloc_memory_target() which is called in srat_parse_mem_affinity(), but
> >> it seems to me that this function isn't called for GI nodes. Or should
> >> SRAT also contain a normal Memory node with same PM as the GI?
> >>  
> > Hi Brice,
> >
> > Yes you should see a node with 0kB memory.  Same as you get for a processor
> > only node I believe.
> >
> > srat_parse_mem_affinity shouldn't call alloc_memory_target for the GI nodes
> > as they don't have any memory.   The hmat table should only refer to
> > GI domains as initiators.  Just to check, do you have them listed as
> > a target node?  Or perhaps in some hmat proximity entry as memory_PD?
> >  
> 
> Thanks, I finally got things to work. I am on x86. It's a dual-socket
> machine with SubNUMA clusters (2 nodes per socket) and NVDIMMs (one
> dax-kmem node per socket). Before adding a GI, initiators look like this:
> 
> node0 -> node0 and node4
> 
> node1 -> node1 and node5
> 
> node2 -> node2 and node4
> 
> node3 -> node3 and node5
> 
> I added a GI with faster access to node0, node2, node4 (first socket).
> 
> The GI node becomes an access0 initiator for node4, and node0 and node2
> remain access1 initiators.
> 
> The GI node doesn't become access0 initiator for node0 and node2, likely
> because of this test :
> 
>         /*
>          * If the Address Range Structure provides a local processor pxm, link
>          * only that one. Otherwise, find the best performance attributes and
>          * register all initiators that match.
>          */
>         if (target->processor_pxm != PXM_INVAL) {
> 
> I guess I should split node0-3 into separate CPU nodes and memory nodes
> in SRAT?

It sounds like it's working as expected.  There are a few assumptions made about
'sensible' hmat configurations.

1) If the memory and processor are in the same domain, that should mean the
access characteristics within that domain are the best in the system.
It is possible to have a setup with very low latency access
from a particular processor but also low bandwidth.  Another domain may have
high bandwidth but long latency.   Such systems may occur, but they are
probably not going to be for 'normal memory the OS can just use'.

2) If we have a relevant "Memory Proximity Domain Attributes Structure"
(these are entirely optional, btw; note the structure was renamed in
ACPI 6.3 from "Address Range Structure" as it no longer has any address
ranges) that indicates that the memory controller for a given memory
range lies in the proximity domain of the initiator specified, then we
ignore cases where hmat says somewhere else is nearer via bandwidth and
latency.

For case 1) I'm not sure we actually enforce it.
I think you've hit case 2).  

Removing the address range structures should work, or as you say you can
move that memory into separate memory nodes.  It will be a bit of a strange
setup though.  Assuming node4 is an NVDIMM then that would be closer to a
potential real system.  With a suitable coherent bus (CCIX is most familiar
to me and can do this) you might have:

 _______       ________    _______
|       |     |        |   |       |
| Node0 |     | Node4  |---| Node6 |
| CPU   |-----| Mem +  |---| GI    |
| Mem   |     | CCHome |---|       |
|_______|     |________|   |_______|
   |                          |
   |__________________________|

CCHome: the Cache Coherency directory location, to avoid the need for more
esoteric cache coherency short-cut methods etc.

Idea being the GI node is some big fat DB accelerator or similar doing
offloaded queries.  It has a fat pipe to the NVDIMMs.  

Let's ignore that; to actually justify the use of a GI-only node,
you need some more elements, as this situation could be represented
by fusing node4 and node6 and having an asymmetric HMAT between Node0
and the fused Node4.

So in conclusion, with your setup, only the NVDIMM nodes look like the
sort of memory that might be in a node nearer to a GI than the host.
> 
> Brice

Thanks again for looking at this!

Jonathan
> 
> 
> 
>
Brice Goglin Jan. 3, 2020, 12:18 p.m. UTC | #7
On 03/01/2020 at 11:09, Jonathan Cameron wrote:
>
> 1) If the memory and processor are in the same domain, that should mean the
> access characteristics within that domain are the best in the system.
> It is possible to have a setup with very low latency access
> from a particular processor but also low bandwidth.  Another domain may have
> high bandwidth but long latency.   Such systems may occur, but they are probably
> going to not be for 'normal memory the OS can just use'.
>
> 2) If we have a relevant "Memory Proximity Domain Attributes Structure"
> Note this was renamed in acpi 6.3 from "Address Range Structure" as
> it no longer has any address ranges.
> (which are entirely optional btw) that indicates that the memory controller
> for a given memory lies in the proximity domain of the Initiator specified.
> If that happens we ignore cases where hmat says somewhere else is nearer
> via bandwidth and latency.
>
> For case 1) I'm not sure we actually enforce it.
> I think you've hit case 2).  
>
> Removing the address range structures should work, or as you say you can
> move that memory into separate memory nodes.


I removed the "processor proximity domain valid" flag from the address
range structure of node2, and the GI is now its access0 initiator
instead of node2 itself. Looks like it confirms I was in case 2)

Thanks

Brice
Jonathan Cameron Jan. 3, 2020, 1:08 p.m. UTC | #8
On Fri, 3 Jan 2020 13:18:59 +0100
Brice Goglin <brice.goglin@gmail.com> wrote:

> On 03/01/2020 at 11:09, Jonathan Cameron wrote:
> >
> > 1) If the memory and processor are in the same domain, that should mean the
> > access characteristics within that domain are the best in the system.
> > It is possible to have a setup with very low latency access
> > from a particular processor but also low bandwidth.  Another domain may have
> > high bandwidth but long latency.   Such systems may occur, but they are probably
> > going to not be for 'normal memory the OS can just use'.
> >
> > 2) If we have a relevant "Memory Proximity Domain Attributes Structure"
> > Note this was renamed in acpi 6.3 from "Address Range Structure" as
> > it no longer has any address ranges.
> > (which are entirely optional btw) that indicates that the memory controller
> > for a given memory lies in the proximity domain of the Initiator specified.
> > If that happens we ignore cases where hmat says somewhere else is nearer
> > via bandwidth and latency.
> >
> > For case 1) I'm not sure we actually enforce it.
> > I think you've hit case 2).  
> >
> > Removing the address range structures should work, or as you say you can
> > move that memory into separate memory nodes.  
> 
> 
> I removed the "processor proximity domain valid" flag from the address
> range structure of node2, and the GI is now its access0 initiator
> instead of node2 itself. Looks like it confirms I was in case 2)
> 
> Thanks
> 
> Brice

Cool. I was wondering if that change would work fine.
It is a somewhat crazy setup so I didn't have an equivalent in my test set.

Sounds like all is working as expected.

Thanks,

Jonathan