
[1/7] node: Link memory nodes to their compute nodes

Message ID 20181114224921.12123-2-keith.busch@intel.com (mailing list archive)
State New, archived
Series ACPI HMAT memory sysfs representation

Commit Message

Keith Busch Nov. 14, 2018, 10:49 p.m. UTC
Memory-only nodes will often have affinity to a compute node, and
platforms have ways to express that locality relationship.

A node containing CPUs or other DMA devices that can initiate memory
access is referred to as a "memory initiator". A "memory target" is a
node that provides at least one physical address range accessible to a
memory initiator.

In preparation for these systems, provide a new kernel API to link
the target memory node to its initiator compute node with symlinks to
each other.

The following example shows the new sysfs hierarchy setup for memory node
'Y' local to compute node 'X':

  # ls -l /sys/devices/system/node/nodeX/target*
  /sys/devices/system/node/nodeX/targetY -> ../nodeY

  # ls -l /sys/devices/system/node/nodeY/initiator*
  /sys/devices/system/node/nodeY/initiatorX -> ../nodeX

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 32 ++++++++++++++++++++++++++++++++
 include/linux/node.h |  2 ++
 2 files changed, 34 insertions(+)

Comments

Matthew Wilcox Nov. 15, 2018, 1:57 p.m. UTC | #1
On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> Memory-only nodes will often have affinity to a compute node, and
> platforms have ways to express that locality relationship.
> 
> A node containing CPUs or other DMA devices that can initiate memory
> access is referred to as a "memory initiator". A "memory target" is a
> node that provides at least one physical address range accessible to a
> memory initiator.

I think I may be confused here.  If there is _no_ link from node X to
node Y, does that mean that node X's CPUs cannot access the memory on
node Y?  In my mind, all nodes can access all memory in the system,
just not with uniform bandwidth/latency.
Keith Busch Nov. 15, 2018, 2:59 p.m. UTC | #2
On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > Memory-only nodes will often have affinity to a compute node, and
> > platforms have ways to express that locality relationship.
> > 
> > A node containing CPUs or other DMA devices that can initiate memory
> > access is referred to as a "memory initiator". A "memory target" is a
> > node that provides at least one physical address range accessible to a
> > memory initiator.
> 
> I think I may be confused here.  If there is _no_ link from node X to
> node Y, does that mean that node X's CPUs cannot access the memory on
> node Y?  In my mind, all nodes can access all memory in the system,
> just not with uniform bandwidth/latency.

The link is just about which nodes are "local". It's like how nodes have
a cpulist. Other CPUs not in the node's list can access that node's memory,
but the ones in the mask are local, and provide useful optimization hints.

Would a node mask be preferred to symlinks?
Dan Williams Nov. 15, 2018, 5:50 p.m. UTC | #3
On Thu, Nov 15, 2018 at 7:02 AM Keith Busch <keith.busch@intel.com> wrote:
>
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > Memory-only nodes will often have affinity to a compute node, and
> > > platforms have ways to express that locality relationship.
> > >
> > > A node containing CPUs or other DMA devices that can initiate memory
> > > access is referred to as a "memory initiator". A "memory target" is a
> > > node that provides at least one physical address range accessible to a
> > > memory initiator.
> >
> > I think I may be confused here.  If there is _no_ link from node X to
> > node Y, does that mean that node X's CPUs cannot access the memory on
> > node Y?  In my mind, all nodes can access all memory in the system,
> > just not with uniform bandwidth/latency.
>
> The link is just about which nodes are "local". It's like how nodes have
> a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.
>
> Would a node mask be preferred to symlinks?

I think that would be more flexible, because the set of initiators
that may have "best" or "local" access to a target may be more than 1.
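
For illustration only, a nodemask-based alternative could be exposed as a
per-node sysfs attribute along these lines -- a minimal sketch assuming the
drivers/base/node.c context; the attribute name "local_initiators" and the
per-node mask storage are hypothetical and not part of this patch:

  /* Hypothetical sketch: publish local initiators as a nodemask attribute. */
  static nodemask_t local_initiators[MAX_NUMNODES];	/* assumed storage */

  static ssize_t local_initiators_show(struct device *dev,
  				       struct device_attribute *attr, char *buf)
  {
  	/* Node devices store their node id in dev->id. */
  	return scnprintf(buf, PAGE_SIZE, "%*pbl\n",
  			 nodemask_pr_args(&local_initiators[dev->id]));
  }
  static DEVICE_ATTR_RO(local_initiators);

Userspace would then read a node id list (e.g. "0,2") instead of walking
symlinks, which naturally covers the more-than-one-initiator case.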
Matthew Wilcox Nov. 15, 2018, 8:36 p.m. UTC | #4
On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > Memory-only nodes will often have affinity to a compute node, and
> > > platforms have ways to express that locality relationship.
> > > 
> > > A node containing CPUs or other DMA devices that can initiate memory
> > > access is referred to as a "memory initiator". A "memory target" is a
> > > node that provides at least one physical address range accessible to a
> > > memory initiator.
> > 
> > I think I may be confused here.  If there is _no_ link from node X to
> > node Y, does that mean that node X's CPUs cannot access the memory on
> > node Y?  In my mind, all nodes can access all memory in the system,
> > just not with uniform bandwidth/latency.
> 
> The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.

So ... let's imagine a hypothetical system (I've never seen one built like
this, but it doesn't seem too implausible).  Connect four CPU sockets in
a square, each of which has some regular DIMMs attached to it.  CPU A is
0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
Memory D (each CPU only has two "QPI" links).  Then maybe there's some
special memory extender device attached on the PCIe bus.  Now there's
Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
not as local as Memory B is ... and we'd probably _prefer_ to allocate
memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
this seems hard.

I understand you're trying to reflect what the HMAT table is telling you,
I'm just really fuzzy on who's ultimately consuming this information
and what decisions they're trying to drive from it.

> Would a node mask be preferred to symlinks?

I don't have a strong opinion here, but what Dan says makes sense.
Keith Busch Nov. 16, 2018, 6:32 p.m. UTC | #5
On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > > Memory-only nodes will often have affinity to a compute node, and
> > > > platforms have ways to express that locality relationship.
> > > > 
> > > > A node containing CPUs or other DMA devices that can initiate memory
> > > > access is referred to as a "memory initiator". A "memory target" is a
> > > > node that provides at least one physical address range accessible to a
> > > > memory initiator.
> > > 
> > > I think I may be confused here.  If there is _no_ link from node X to
> > > node Y, does that mean that node X's CPUs cannot access the memory on
> > > node Y?  In my mind, all nodes can access all memory in the system,
> > > just not with uniform bandwidth/latency.
> > 
> > The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> > but the ones in the mask are local, and provide useful optimization hints.
> 
> So ... let's imagine a hypothetical system (I've never seen one built like
> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> a square, each of which has some regular DIMMs attached to it.  CPU A is
> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> special memory extender device attached on the PCIe bus.  Now there's
> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> this seems hard.

Indeed, that particular example is out of scope for this series. The
first objective is to aid a process running on node B's CPUs to allocate
memory in B1. Anything that crosses QPI is on its own.

> I understand you're trying to reflect what the HMAT table is telling you,
> I'm just really fuzzy on who's ultimately consuming this information
> and what decisions they're trying to drive from it.

Intended consumers include processes using numa_alloc_onnode() and mbind().

Consider a system with faster DRAM and slower persistent memory. Such
a system may group the DRAM in a different proximity domain than the
persistent memory, and both are local to yet another proximity domain
that contains the CPUs. HMAT provides a way to express that relationship,
and this patch provides a user-facing abstraction for that information.
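
As a rough sketch of that consumer flow (node layout and error handling are
simplified, the code simply takes the first linked target, and it needs
-lnuma to build), a process could resolve its CPU node's target link and
then allocate from it:

  /* Sketch: allocate from a memory node linked as a target of the current
   * CPU's node, falling back to the CPU node itself if no link exists.
   */
  #define _GNU_SOURCE
  #include <glob.h>
  #include <numa.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static int local_target_node(int cpu_node)
  {
  	char pattern[128];
  	glob_t g;
  	int target = cpu_node;

  	snprintf(pattern, sizeof(pattern),
  		 "/sys/devices/system/node/node%d/target*", cpu_node);
  	if (!glob(pattern, 0, NULL, &g) && g.gl_pathc) {
  		/* Link names are "targetN"; take the first match. */
  		const char *name = strrchr(g.gl_pathv[0], '/') + 1;

  		target = atoi(name + strlen("target"));
  		globfree(&g);
  	}
  	return target;
  }

  int main(void)
  {
  	int cpu_node, target;
  	void *buf;

  	if (numa_available() < 0)
  		return 1;
  	cpu_node = numa_node_of_cpu(sched_getcpu());
  	target = local_target_node(cpu_node);

  	buf = numa_alloc_onnode(1 << 20, target);	/* 1 MiB on that node */
  	if (buf) {
  		printf("allocated on node %d (cpu node %d)\n", target, cpu_node);
  		numa_free(buf, 1 << 20);
  	}
  	return 0;
  }

An mbind()-based variant would bind an existing mapping to the same target
node instead of allocating a fresh one.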
Dan Williams Nov. 16, 2018, 10:55 p.m. UTC | #6
On Thu, Nov 15, 2018 at 12:37 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > > Memory-only nodes will often have affinity to a compute node, and
> > > > platforms have ways to express that locality relationship.
> > > >
> > > > A node containing CPUs or other DMA devices that can initiate memory
> > > > access is referred to as a "memory initiator". A "memory target" is a
> > > > node that provides at least one physical address range accessible to a
> > > > memory initiator.
> > >
> > > I think I may be confused here.  If there is _no_ link from node X to
> > > node Y, does that mean that node X's CPUs cannot access the memory on
> > > node Y?  In my mind, all nodes can access all memory in the system,
> > > just not with uniform bandwidth/latency.
> >
> > The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> > but the ones in the mask are local, and provide useful optimization hints.
>
> So ... let's imagine a hypothetical system (I've never seen one built like
> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> a square, each of which has some regular DIMMs attached to it.  CPU A is
> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> special memory extender device attached on the PCIe bus.  Now there's
> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> this seems hard.
>
> I understand you're trying to reflect what the HMAT table is telling you,
> I'm just really fuzzy on who's ultimately consuming this information
> and what decisions they're trying to drive from it.

The singular "local" is a limitation of the HMAT, but I would expect
the Linux translation of "local" would allow for multiple initiators
that can achieve some semblance of the "best" performance. Anything
less than best is going to have a wide range of variance and will
likely devolve to looking at the platform firmware data table
directly. The expected 80% case is that software wants to be able to ask
"which CPUs should I run on to get the best access to this memory?"
Anshuman Khandual Nov. 19, 2018, 2:46 a.m. UTC | #7
On 11/15/2018 04:19 AM, Keith Busch wrote:
> Memory-only nodes will often have affinity to a compute node, and
> platforms have ways to express that locality relationship.

It may not have local affinity to any compute node, but it might still have a
valid NUMA distance from all available compute nodes. This is particularly
true for coherent device memory, which is accessible from all available
compute nodes without having local affinity to any of them other than the
device's own compute elements, which may or may not be represented as NUMA
nodes themselves.

But even for normal system memory, a memory-only node might be far from
other CPU nodes and may not have CPUs of its own. In that case there is no
local affinity anyway.

> 
> A node containing CPUs or other DMA devices that can initiate memory
> access is referred to as a "memory initiator". A "memory target" is a

Memory initiators should also include heterogeneous compute elements like
GPU cores and FPGA elements, apart from CPUs and DMA engines.

> node that provides at least one physical address range accessible to a
> memory initiator.

This definition of "memory target" makes sense: coherent accesses within a
PA range from all possible "memory initiators", which should also include
heterogeneous compute elements as mentioned before.

> 
> In preparation for these systems, provide a new kernel API to link
> the target memory node to its initiator compute node with symlinks to
> each other.

Makes sense, but how would we really define NUMA placement for various
heterogeneous compute elements which are connected to the system bus
differently than the CPUs and DMA engines?

> 
> The following example shows the new sysfs hierarchy setup for memory node
> 'Y' local to compute node 'X':
> 
>   # ls -l /sys/devices/system/node/nodeX/target*
>   /sys/devices/system/node/nodeX/targetY -> ../nodeY
> 
>   # ls -l /sys/devices/system/node/nodeY/initiator*
>   /sys/devices/system/node/nodeY/initiatorX -> ../nodeX

This inter-linking makes sense, but only once we are able to define all
possible memory initiators and memory targets as NUMA nodes (which might not
be trivial), taking heterogeneous compute environments into account. Still,
this linking at least establishes the coherency relationship between memory
initiators and memory targets.

> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 32 ++++++++++++++++++++++++++++++++
>  include/linux/node.h |  2 ++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..a9b7512a9502 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -372,6 +372,38 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
>  				 kobject_name(&node_devices[nid]->dev.kobj));
>  }
>  
> +int register_memory_node_under_compute_node(unsigned int m, unsigned int p)
> +{
> +	int ret;
> +	char initiator[20], target[17];

The sizes 20 and 17 seem arbitrary here.

> +
> +	if (!node_online(p) || !node_online(m))
> +		return -ENODEV;

Just wondering what a NUMA node for a group of GPU compute elements will
look like, where the elements are not managed by the kernel but are still
memory initiators having access to a number of memory targets.

> +	if (m == p)
> +		return 0;

Why skip? Should we not link a memory target to its own node, which can be
its memory initiator as well? The caller of this linking function might
decide whether the memory target is accessible from the same NUMA node as a
memory initiator or not.

> +
> +	snprintf(initiator, sizeof(initiator), "initiator%d", p);
> +	snprintf(target, sizeof(target), "target%d", m);
> +
> +	ret = sysfs_create_link(&node_devices[p]->dev.kobj,
> +				&node_devices[m]->dev.kobj,
> +				target);
> +	if (ret)
> +		return ret;
> +
> +	ret = sysfs_create_link(&node_devices[m]->dev.kobj,
> +				&node_devices[p]->dev.kobj,
> +				initiator);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> + err:
> +	sysfs_remove_link(&node_devices[p]->dev.kobj,
> +			  kobject_name(&node_devices[m]->dev.kobj));
> +	return ret;
> +}
> +
>  int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
>  {
>  	struct device *obj;
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 257bb3d6d014..1fd734a3fb3f 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -75,6 +75,8 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  					   unsigned long phys_index);
>  
> +extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p);
> +
>  #ifdef CONFIG_HUGETLBFS
>  extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
>  					 node_registration_func_t unregister);
>
The code is all good, but as mentioned before, the primary concern is whether
these semantics will be able to correctly represent all possible present and
future heterogeneous compute environments with multi-attribute memory. This
is going to be a kernel API, so apart from the various NUMA representations
for all possible kinds of nodes, the interface has to be abstract, with
generic elements and room for future extension.
Anshuman Khandual Nov. 19, 2018, 2:52 a.m. UTC | #8
On 11/15/2018 08:29 PM, Keith Busch wrote:
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>>> Memory-only nodes will often have affinity to a compute node, and
>>> platforms have ways to express that locality relationship.
>>>
>>> A node containing CPUs or other DMA devices that can initiate memory
>>> access is referred to as a "memory initiator". A "memory target" is a
>>> node that provides at least one physical address range accessible to a
>>> memory initiator.
>>
>> I think I may be confused here.  If there is _no_ link from node X to
>> node Y, does that mean that node X's CPUs cannot access the memory on
>> node Y?  In my mind, all nodes can access all memory in the system,
>> just not with uniform bandwidth/latency.
> 
> The link is just about which nodes are "local". It's like how nodes have
> a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.
> 
> Would a node mask be preferred to symlinks?

Having a hint for local affinity is definitely a plus, but this must also
provide the coherency matrix to the user, preferably in the form of a
nodemask for each memory target.
Anshuman Khandual Nov. 19, 2018, 3:04 a.m. UTC | #9
On 11/15/2018 11:20 PM, Dan Williams wrote:
> On Thu, Nov 15, 2018 at 7:02 AM Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>>> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>>>> Memory-only nodes will often have affinity to a compute node, and
>>>> platforms have ways to express that locality relationship.
>>>>
>>>> A node containing CPUs or other DMA devices that can initiate memory
>>>> access is referred to as a "memory initiator". A "memory target" is a
>>>> node that provides at least one physical address range accessible to a
>>>> memory initiator.
>>>
>>> I think I may be confused here.  If there is _no_ link from node X to
>>> node Y, does that mean that node X's CPUs cannot access the memory on
>>> node Y?  In my mind, all nodes can access all memory in the system,
>>> just not with uniform bandwidth/latency.
>>
>> The link is just about which nodes are "local". It's like how nodes have
>> a cpulist. Other CPUs not in the node's list can access that node's memory,
>> but the ones in the mask are local, and provide useful optimization hints.
>>
>> Would a node mask be preferred to symlinks?
> 
> I think that would be more flexible, because the set of initiators
> that may have "best" or "local" access to a target may be more than 1.

Right. The memory target should have two nodemasks (for now at least): one
enumerating which initiator nodes can access the memory coherently, and the
other enumerating which ones are nearer and can benefit from local allocation.
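
Purely to illustrate that suggestion (the names below are hypothetical and
not part of this series), the per-target bookkeeping might look like:

  #include <linux/nodemask.h>
  #include <linux/types.h>

  /* Hypothetical per-target state carrying both masks. */
  struct memory_target_info {
  	int		node;			/* memory target node id */
  	nodemask_t	coherent_initiators;	/* every initiator that can access it */
  	nodemask_t	local_initiators;	/* the nearest/"best" initiators */
  };

  static void memory_target_add_initiator(struct memory_target_info *t,
  					 int initiator, bool is_local)
  {
  	node_set(initiator, t->coherent_initiators);
  	if (is_local)
  		node_set(initiator, t->local_initiators);
  }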
Anshuman Khandual Nov. 19, 2018, 3:15 a.m. UTC | #10
On 11/17/2018 12:02 AM, Keith Busch wrote:
> On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
>> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
>>> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>>>> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>>>>> Memory-only nodes will often have affinity to a compute node, and
>>>>> platforms have ways to express that locality relationship.
>>>>>
>>>>> A node containing CPUs or other DMA devices that can initiate memory
>>>>> access is referred to as a "memory initiator". A "memory target" is a
>>>>> node that provides at least one physical address range accessible to a
>>>>> memory initiator.
>>>>
>>>> I think I may be confused here.  If there is _no_ link from node X to
>>>> node Y, does that mean that node X's CPUs cannot access the memory on
>>>> node Y?  In my mind, all nodes can access all memory in the system,
>>>> just not with uniform bandwidth/latency.
>>>
>>> The link is just about which nodes are "local". It's like how nodes have
>>> a cpulist. Other CPUs not in the node's list can access that node's memory,
>>> but the ones in the mask are local, and provide useful optimization hints.
>>
>> So ... let's imagine a hypothetical system (I've never seen one built like
>> this, but it doesn't seem too implausible).  Connect four CPU sockets in
>> a square, each of which has some regular DIMMs attached to it.  CPU A is
>> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
>> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
>> special memory extender device attached on the PCIe bus.  Now there's
>> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
>> not as local as Memory B is ... and we'd probably _prefer_ to allocate
>> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
>> this seems hard.
> 
> Indeed, that particular example is out of scope for this series. The
> first objective is to aid a process running on node B's CPUs to allocate
> memory in B1. Anything that crosses QPI is on its own.

This is problematic. Any new kernel API should also accommodate B2-type
memory from the above example, which sits on a PCIe bus, because
eventually it would be represented as some sort of NUMA node and
applications would then have to depend on this sysfs interface for their
memory placement requirements. Unless this interface is thought through
for B2-type memory, it might not be extensible in the future.
Keith Busch Nov. 19, 2018, 3:49 p.m. UTC | #11
On Mon, Nov 19, 2018 at 08:45:25AM +0530, Anshuman Khandual wrote:
> On 11/17/2018 12:02 AM, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
> >> So ... let's imagine a hypothetical system (I've never seen one built like
> >> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> >> a square, each of which has some regular DIMMs attached to it.  CPU A is
> >> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> >> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> >> special memory extender device attached on the PCIe bus.  Now there's
> >> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> >> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> >> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> >> this seems hard.
> > 
> > Indeed, that particular example is out of scope for this series. The
> > first objective is to aid a process running on node B's CPUs to allocate
> > memory in B1. Anything that crosses QPI is on its own.
> 
> This is problematic. Any new kernel API should also accommodate B2-type
> memory from the above example, which sits on a PCIe bus, because
> eventually it would be represented as some sort of NUMA node and
> applications would then have to depend on this sysfs interface for their
> memory placement requirements. Unless this interface is thought through
> for B2-type memory, it might not be extensible in the future.

I'm not sure I understand the concern. The proposal allows linking B
to B2 memory.
Aneesh Kumar K.V Dec. 4, 2018, 3:43 p.m. UTC | #12
Keith Busch <keith.busch@intel.com> writes:

> On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
>> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
>> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>> > > > Memory-only nodes will often have affinity to a compute node, and
>> > > > platforms have ways to express that locality relationship.
>> > > > 
>> > > > A node containing CPUs or other DMA devices that can initiate memory
>> > > > access is referred to as a "memory initiator". A "memory target" is a
>> > > > node that provides at least one physical address range accessible to a
>> > > > memory initiator.
>> > > 
>> > > I think I may be confused here.  If there is _no_ link from node X to
>> > > node Y, does that mean that node X's CPUs cannot access the memory on
>> > > node Y?  In my mind, all nodes can access all memory in the system,
>> > > just not with uniform bandwidth/latency.
>> > 
>> > The link is just about which nodes are "local". It's like how nodes have
>> > a cpulist. Other CPUs not in the node's list can access that node's memory,
>> > but the ones in the mask are local, and provide useful optimization hints.
>> 
>> So ... let's imagine a hypothetical system (I've never seen one built like
>> this, but it doesn't seem too implausible).  Connect four CPU sockets in
>> a square, each of which has some regular DIMMs attached to it.  CPU A is
>> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
>> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
>> special memory extender device attached on the PCIe bus.  Now there's
>> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
>> not as local as Memory B is ... and we'd probably _prefer_ to allocate
>> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
>> this seems hard.
>
> Indeed, that particular example is out of scope for this series. The
> first objective is to aid a process running on node B's CPUs to allocate
> memory in B1. Anything that crosses QPI is on its own.

But if you can extrapolate how such a system could be expressed
using what is proposed here, it would help in reviewing this. Also, how
do we intend to express the locality of memory w.r.t. other computing
units like GPUs/FPGAs?

I understand that this is meant as an ACPI HMAT representation in sysfs format.
But as mentioned by others in this thread, if we don't do this in a
platform- and device-independent way, won't we have application
portability issues going forward?

-aneesh
Keith Busch Dec. 4, 2018, 4:54 p.m. UTC | #13
On Tue, Dec 04, 2018 at 09:13:33PM +0530, Aneesh Kumar K.V wrote:
> Keith Busch <keith.busch@intel.com> writes:
> >
> > Indeed, that particular example is out of scope for this series. The
> > first objective is to aid a process running on node B's CPUs to allocate
> > memory in B1. Anything that crosses QPI is on its own.
> 
> But if you can extrapolate how such a system could be expressed
> using what is proposed here, it would help in reviewing this.

Expressed to what end? This proposal is not trying to express anything
other than the best possible pairings because that is the most common
information applications will want to know.

> Also, how
> do we intend to express the locality of memory w.r.t. other computing
> units like GPUs/FPGAs?

The HMAT parsing at the end of the series provides an example of how
others may use the proposed interfaces.

> I understand that this is meant as an ACPI HMAT representation in sysfs format.
> But as mentioned by others in this thread, if we don't do this in a
> platform- and device-independent way, won't we have application
> portability issues going forward?

Only the last patch is specific to HMAT. If there are other ways to get
the same attributes, then those drivers or subsystems may also register
them with these new kernel interfaces.
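
For example, a driver or subsystem that learns the pairing from some other
source could register it through the same interface. A minimal sketch (the
wrapper function name and the node ids passed to it are hypothetical; the
registration call itself is the one added by this patch):

  #include <linux/node.h>
  #include <linux/printk.h>

  /* Sketch: register one memory target under its local compute node. */
  static int example_register_locality(unsigned int mem_nid, unsigned int cpu_nid)
  {
  	/* Creates node<cpu_nid>/target<mem_nid> and node<mem_nid>/initiator<cpu_nid>. */
  	int ret = register_memory_node_under_compute_node(mem_nid, cpu_nid);

  	if (ret)
  		pr_warn("failed to link node%u under node%u: %d\n",
  			mem_nid, cpu_nid, ret);
  	return ret;
  }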

Patch

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..a9b7512a9502 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -372,6 +372,38 @@  int register_cpu_under_node(unsigned int cpu, unsigned int nid)
 				 kobject_name(&node_devices[nid]->dev.kobj));
 }
 
+int register_memory_node_under_compute_node(unsigned int m, unsigned int p)
+{
+	int ret;
+	char initiator[20], target[17];
+
+	if (!node_online(p) || !node_online(m))
+		return -ENODEV;
+	if (m == p)
+		return 0;
+
+	snprintf(initiator, sizeof(initiator), "initiator%d", p);
+	snprintf(target, sizeof(target), "target%d", m);
+
+	ret = sysfs_create_link(&node_devices[p]->dev.kobj,
+				&node_devices[m]->dev.kobj,
+				target);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_link(&node_devices[m]->dev.kobj,
+				&node_devices[p]->dev.kobj,
+				initiator);
+	if (ret)
+		goto err;
+
+	return 0;
+ err:
+	sysfs_remove_link(&node_devices[p]->dev.kobj,
+			  kobject_name(&node_devices[m]->dev.kobj));
+	return ret;
+}
+
 int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 {
 	struct device *obj;
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..1fd734a3fb3f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -75,6 +75,8 @@  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
+extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p);
+
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
 					 node_registration_func_t unregister);