
[v1,0/4] vfio: report NUMA nodes for device memory

Message ID 20230915024559.6565-1-ankita@nvidia.com (mailing list archive)

Message

Ankit Agrawal Sept. 15, 2023, 2:45 a.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

For devices that allow the CPU to cache-coherently access their memory,
it is sensible to expose such memory as NUMA nodes separate from
the sysmem node. QEMU currently does not provide a mechanism to create
NUMA nodes associated with a vfio-pci device.

Implement a mechanism to create and associate a set of unique NUMA nodes
with a vfio-pci device.

The NUMA nodes are created by inserting a series of unique proximity
domains (PXMs) into the VM's SRAT ACPI table. The ACPI tables are read once
by the kernel at boot to determine the NUMA configuration, which is
inflexible after that point; hence this feature is incompatible with
device hotplug. The node range associated with the device is
communicated through an ACPI _DSD property and can be fetched by the VM
kernel or kernel modules. QEMU's SRAT and DSDT builder code is modified
accordingly.
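
For illustration only, emitting such a _DSD with QEMU's AML helpers could look
roughly like the sketch below; dev, pxm_start and pxm_count are placeholder
identifiers, and the dev-mem-pxm-* property names follow the discussion later
in this thread (see patch 4 for the exact code):

    /* Sketch only -- placeholder identifiers, not the exact patch contents. */
    Aml *dsd = aml_package(2);
    Aml *props = aml_package(2);
    Aml *prop;

    aml_append(dsd, aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301"));

    prop = aml_package(2);
    aml_append(prop, aml_string("dev-mem-pxm-start"));
    aml_append(prop, aml_int(pxm_start));
    aml_append(props, prop);

    prop = aml_package(2);
    aml_append(prop, aml_string("dev-mem-pxm-count"));
    aml_append(prop, aml_int(pxm_count));
    aml_append(props, prop);

    aml_append(dsd, props);
    aml_append(dev, aml_name_decl("_DSD", dsd));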

New command line parameters are introduced so that the admin can control
the NUMA node assignment.
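
For example (an illustrative sketch; the exact parameter names are defined in
patch 1 and used in the discussion below, and the host BDF is illustrative):

   -device vfio-pci-nohotplug,host=0009:01:00.0,numa-node-start=2,numa-node-count=8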

A vfio-pci variant driver is expected to expose this feature through
sysfs; its presence is checked before enabling these code changes.

Applied over v8.1.0-rc4.

Ankit Agrawal (4):
  vfio: new command line params for device memory NUMA nodes
  vfio: assign default values to node params
  hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  acpi/gpex: patch guest DSDT for dev mem information

 hw/arm/virt-acpi-build.c    |  54 +++++++++++++
 hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
 hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h               |   2 +
 include/hw/pci/pci_device.h |   3 +
 5 files changed, 274 insertions(+)

Comments

Cédric Le Goater Sept. 15, 2023, 2:19 p.m. UTC | #1
Hello Ankit,

On 9/15/23 04:45, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> For devices which allow CPU to cache coherently access their memory,
> it is sensible to expose such memory as NUMA nodes separate from
> the sysmem node. Qemu currently do not provide a mechanism for creation
> of NUMA nodes associated with a vfio-pci device.
> 
> Implement a mechanism to create and associate a set of unique NUMA nodes
> with a vfio-pci device.>
> NUMA node is created by inserting a series of the unique proximity
> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> at the time of bootup by the kernel to determine the NUMA configuration
> and is inflexible post that. Hence this feature is incompatible with
> device hotplug. The added node range associated with the device is
> communicated through ACPI DSD and can be fetched by the VM kernel or
> kernel modules. QEMU's VM SRAT and DSD builder code is modified
> accordingly.
> 
> New command line params are introduced for admin to have a control on
> the NUMA node assignment.

This approach seems to bypass the NUMA framework in place in QEMU and
will be a challenge for the upper layers. QEMU is generally used from
libvirt when dealing with KVM guests.

Typically, a command line for a virt machine with NUMA nodes would look
like:

   -object memory-backend-ram,id=ram-node0,size=1G \
   -numa node,nodeid=0,memdev=ram-node0 \
   -object memory-backend-ram,id=ram-node1,size=1G \
   -numa node,nodeid=1,cpus=0-3,memdev=ram-node1

which defines 2 nodes, one with memory and all CPUs and a second with
only memory.

   # numactl -H
   available: 2 nodes (0-1)
   node 0 cpus: 0 1 2 3
   node 0 size: 1003 MB
   node 0 free: 734 MB
   node 1 cpus:
   node 1 size: 975 MB
   node 1 free: 968 MB
   node distances:
   node   0   1
     0:  10  20
     1:  20  10

   
Could it be a new type of host memory backend?  Have you considered
this approach?

Thanks,

C.

> 
> It is expected for a vfio-pci driver to expose this feature through
> sysfs. Presence of the feature is checked to enable these code changes.
> 
> Applied over v8.1.0-rc4.
> 
> Ankit Agrawal (4):
>    vfio: new command line params for device memory NUMA nodes
>    vfio: assign default values to node params
>    hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
>    acpi/gpex: patch guest DSDT for dev mem information
> 
>   hw/arm/virt-acpi-build.c    |  54 +++++++++++++
>   hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
>   hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.h               |   2 +
>   include/hw/pci/pci_device.h |   3 +
>   5 files changed, 274 insertions(+)
>
Alex Williamson Sept. 15, 2023, 2:47 p.m. UTC | #2
On Fri, 15 Sep 2023 16:19:29 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> Hello Ankit,
> 
> On 9/15/23 04:45, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > For devices which allow CPU to cache coherently access their memory,
> > it is sensible to expose such memory as NUMA nodes separate from
> > the sysmem node. Qemu currently do not provide a mechanism for creation
> > of NUMA nodes associated with a vfio-pci device.
> > 
> > Implement a mechanism to create and associate a set of unique NUMA nodes
> > with a vfio-pci device.>
> > NUMA node is created by inserting a series of the unique proximity
> > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> > at the time of bootup by the kernel to determine the NUMA configuration
> > and is inflexible post that. Hence this feature is incompatible with
> > device hotplug. The added node range associated with the device is
> > communicated through ACPI DSD and can be fetched by the VM kernel or
> > kernel modules. QEMU's VM SRAT and DSD builder code is modified
> > accordingly.
> > 
> > New command line params are introduced for admin to have a control on
> > the NUMA node assignment.  
> 
> This approach seems to bypass the NUMA framework in place in QEMU and
> will be a challenge for the upper layers. QEMU is generally used from
> libvirt when dealing with KVM guests.
> 
> Typically, a command line for a virt machine with NUMA nodes would look
> like :
> 
>    -object memory-backend-ram,id=ram-node0,size=1G \
>    -numa node,nodeid=0,memdev=ram-node0 \
>    -object memory-backend-ram,id=ram-node1,size=1G \
>    -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
> 
> which defines 2 nodes, one with memory and all CPUs and a second with
> only memory.
> 
>    # numactl -H
>    available: 2 nodes (0-1)
>    node 0 cpus: 0 1 2 3
>    node 0 size: 1003 MB
>    node 0 free: 734 MB
>    node 1 cpus:
>    node 1 size: 975 MB
>    node 1 free: 968 MB
>    node distances:
>    node   0   1
>      0:  10  20
>      1:  20  10
> 
>    
> Could it be a new type of host memory backend ?  Have you considered
> this approach ?

Good idea.  Fundamentally the device should not be creating NUMA nodes,
the VM should be configured with NUMA nodes and the device memory
associated with those nodes.

I think we're also dealing with a lot of very, very device-specific
behavior, so I question whether we shouldn't create a separate device
for this beyond vfio-pci or vfio-pci-nohotplug.

In particular, a PCI device typically only has an association to a single
proximity domain, so what sense does it make to describe the coherent
memory as a PCI BAR only to then create a confusing mapping where the
device has a proximity domain separate from the resources associated with
the device?

It's seeming like this device should create memory objects that can be
associated as memory backing for command line specified NUMA nodes.
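
Purely as a hypothetical illustration (no memory-backend-vfio object type
exists today; the syntax below is made up to show the idea of device-provided
memory objects backing command line specified nodes):

        -device vfio-pci-nohotplug,host=0009:01:00.0,id=nvgrace0 \
        -object memory-backend-vfio,id=devmem0,vfio-device=nvgrace0 \
        -numa node,nodeid=2,memdev=devmem0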
Thanks,

Alex
David Hildenbrand Sept. 15, 2023, 6:34 p.m. UTC | #3
On 15.09.23 16:47, Alex Williamson wrote:
> On Fri, 15 Sep 2023 16:19:29 +0200
> Cédric Le Goater <clg@redhat.com> wrote:
> 
>> Hello Ankit,
>>
>> On 9/15/23 04:45, ankita@nvidia.com wrote:
>>> From: Ankit Agrawal <ankita@nvidia.com>
>>>
>>> For devices which allow CPU to cache coherently access their memory,
>>> it is sensible to expose such memory as NUMA nodes separate from
>>> the sysmem node. Qemu currently do not provide a mechanism for creation
>>> of NUMA nodes associated with a vfio-pci device.
>>>
>>> Implement a mechanism to create and associate a set of unique NUMA nodes
>>> with a vfio-pci device.>
>>> NUMA node is created by inserting a series of the unique proximity
>>> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
>>> at the time of bootup by the kernel to determine the NUMA configuration
>>> and is inflexible post that. Hence this feature is incompatible with
>>> device hotplug. The added node range associated with the device is
>>> communicated through ACPI DSD and can be fetched by the VM kernel or
>>> kernel modules. QEMU's VM SRAT and DSD builder code is modified
>>> accordingly.
>>>
>>> New command line params are introduced for admin to have a control on
>>> the NUMA node assignment.
>>
>> This approach seems to bypass the NUMA framework in place in QEMU and
>> will be a challenge for the upper layers. QEMU is generally used from
>> libvirt when dealing with KVM guests.
>>
>> Typically, a command line for a virt machine with NUMA nodes would look
>> like :
>>
>>     -object memory-backend-ram,id=ram-node0,size=1G \
>>     -numa node,nodeid=0,memdev=ram-node0 \
>>     -object memory-backend-ram,id=ram-node1,size=1G \
>>     -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
>>
>> which defines 2 nodes, one with memory and all CPUs and a second with
>> only memory.
>>
>>     # numactl -H
>>     available: 2 nodes (0-1)
>>     node 0 cpus: 0 1 2 3
>>     node 0 size: 1003 MB
>>     node 0 free: 734 MB
>>     node 1 cpus:
>>     node 1 size: 975 MB
>>     node 1 free: 968 MB
>>     node distances:
>>     node   0   1
>>       0:  10  20
>>       1:  20  10
>>
>>     
>> Could it be a new type of host memory backend ?  Have you considered
>> this approach ?
> 
> Good idea.  Fundamentally the device should not be creating NUMA nodes,
> the VM should be configured with NUMA nodes and the device memory
> associated with those nodes.

+1. That would also make it fly with DIMMs and virtio-mem, where you 
would want memory-less NUMA nodes as well (imagine passing CXL memory to a VM 
using virtio-mem).
Ankit Agrawal Sept. 22, 2023, 8:11 a.m. UTC | #4
> >>
> >> Typically, a command line for a virt machine with NUMA nodes would
> >> look like :
> >>
> >>     -object memory-backend-ram,id=ram-node0,size=1G \
> >>     -numa node,nodeid=0,memdev=ram-node0 \
> >>     -object memory-backend-ram,id=ram-node1,size=1G \
> >>     -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
> >>
> >> which defines 2 nodes, one with memory and all CPUs and a second with
> >> only memory.
> >>
> >>     # numactl -H
> >>     available: 2 nodes (0-1)
> >>     node 0 cpus: 0 1 2 3
> >>     node 0 size: 1003 MB
> >>     node 0 free: 734 MB
> >>     node 1 cpus:
> >>     node 1 size: 975 MB
> >>     node 1 free: 968 MB
> >>     node distances:
> >>     node   0   1
> >>       0:  10  20
> >>       1:  20  10
> >>
> >>
> >> Could it be a new type of host memory backend ?  Have you considered
> >> this approach ?
> >
> > Good idea.  Fundamentally the device should not be creating NUMA
> > nodes, the VM should be configured with NUMA nodes and the device
> > memory associated with those nodes.
> 
> +1. That would also make it fly with DIMMs and virtio-mem, where you
> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
> using virtio-mem).
> 

We actually do not add the device memory on the host; instead, we
map it into the QEMU VMA using remap_pfn_range(). Please check out the
mmap function in the vfio-pci variant driver code managing the device:
https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
And I think a host memory backend would need memory that is added on the
host.

Moreover, since we want to pass through the entire device memory,
-object memory-backend-ram would have to be given a size equal
to the device memory. I wonder if that would be too much trouble
for an admin (or libvirt) launching the QEMU process.

Both of these issues are avoided by exposing the device memory as a BAR, as in
the current implementation (referenced above), since that lets QEMU naturally
discover the device memory region and mmap it.
David Hildenbrand Sept. 22, 2023, 8:15 a.m. UTC | #5
On 22.09.23 10:11, Ankit Agrawal wrote:
>>>>
>>>> Typically, a command line for a virt machine with NUMA nodes would
>>>> look like :
>>>>
>>>>      -object memory-backend-ram,id=ram-node0,size=1G \
>>>>      -numa node,nodeid=0,memdev=ram-node0 \
>>>>      -object memory-backend-ram,id=ram-node1,size=1G \
>>>>      -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
>>>>
>>>> which defines 2 nodes, one with memory and all CPUs and a second with
>>>> only memory.
>>>>
>>>>      # numactl -H
>>>>      available: 2 nodes (0-1)
>>>>      node 0 cpus: 0 1 2 3
>>>>      node 0 size: 1003 MB
>>>>      node 0 free: 734 MB
>>>>      node 1 cpus:
>>>>      node 1 size: 975 MB
>>>>      node 1 free: 968 MB
>>>>      node distances:
>>>>      node   0   1
>>>>        0:  10  20
>>>>        1:  20  10
>>>>
>>>>
>>>> Could it be a new type of host memory backend ?  Have you considered
>>>> this approach ?
>>>
>>> Good idea.  Fundamentally the device should not be creating NUMA
>>> nodes, the VM should be configured with NUMA nodes and the device
>>> memory associated with those nodes.
>>
>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>> using virtio-mem).
>>
> 
> We actually do not add the device memory on the host, instead
> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
> mmap function in vfio-pci variant driver code managing the device.
> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> And I think host memory backend would need memory that is added on the
> host.
> 
> Moreover since we want to passthrough the entire device memory, the
> -object memory-backend-ram would have to be passed a size that is equal
> to the device memory. I wonder if that would be too much of a trouble
> for an admin (or libvirt) triggering the Qemu process.
> 
> Both these items are avoided by exposing the device memory as BAR as in the
> current  implementation (referenced above) since it lets Qemu to naturally
> discover the device memory region and do mmap.
> 

Just to clarify: NUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured 
on the device, not on the memory backend.

e.g., -device pc-dimm,node=3,memdev=mem1,...
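
Spelled out a bit more (a sketch; sizes and IDs are illustrative, the machine
needs hotplug headroom via slots/maxmem, and node 3 must also be defined via
-numa node,nodeid=3):

   -m 4G,slots=4,maxmem=16G \
   -object memory-backend-ram,id=mem1,size=1G \
   -device pc-dimm,node=3,memdev=mem1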
Ankit Agrawal Sept. 26, 2023, 2:52 p.m. UTC | #6
>>>> Good idea.  Fundamentally the device should not be creating NUMA
>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>> memory associated with those nodes.
>>>
>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>> using virtio-mem).
>>>
>>
>> We actually do not add the device memory on the host, instead
>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>> mmap function in vfio-pci variant driver code managing the device.
>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>> And I think host memory backend would need memory that is added on the
>> host.
>>
>> Moreover since we want to passthrough the entire device memory, the
>> -object memory-backend-ram would have to be passed a size that is equal
>> to the device memory. I wonder if that would be too much of a trouble
>> for an admin (or libvirt) triggering the Qemu process.
>>
>> Both these items are avoided by exposing the device memory as BAR as in the
>> current  implementation (referenced above) since it lets Qemu to naturally
>> discover the device memory region and do mmap.
>>
>
> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> on the device, not on the memory backend.
> 
> e.g., -device pc-dimm,node=3,memdev=mem1,...

Agreed, but we will still have the aforementioned issues, viz.:
1. The backing memory for the memory device would need to be allocated
on the host. However, we do not add the device memory on the host in this
case; instead, the QEMU VMA is mapped to the device memory physical
address using remap_pfn_range().
2. The memory device needs to be passed an allocation size such that all of
the device memory is mapped into the QEMU VMA. That size may not be readily
available to the admin/libvirt.

Based on the suggestions here, can we consider something like the
following?
1. Introduce a new -numa subparam 'devnode', which tells QEMU to mark
the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
structure to make it hotpluggable.
2. Create several NUMA nodes with 'devnode' which are supposed to be
associated with the vfio-pci device.
3. Pass the NUMA node start and count to associate the created nodes with
the device.

So, the command would look something like the following.
...
        -numa node,nodeid=2,devnode=on \
        -numa node,nodeid=3,devnode=on \
        -numa node,nodeid=4,devnode=on \
        -numa node,nodeid=5,devnode=on \
        -numa node,nodeid=6,devnode=on \
        -numa node,nodeid=7,devnode=on \
        -numa node,nodeid=8,devnode=on \
        -numa node,nodeid=9,devnode=on \
        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Thoughts?
David Hildenbrand Sept. 26, 2023, 4:54 p.m. UTC | #7
On 26.09.23 16:52, Ankit Agrawal wrote:
>>>>> Good idea.  Fundamentally the device should not be creating NUMA
>>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>>> memory associated with those nodes.
>>>>
>>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>>> using virtio-mem).
>>>>
>>>
>>> We actually do not add the device memory on the host, instead
>>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>>> mmap function in vfio-pci variant driver code managing the device.
>>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>>> And I think host memory backend would need memory that is added on the
>>> host.
>>>
>>> Moreover since we want to passthrough the entire device memory, the
>>> -object memory-backend-ram would have to be passed a size that is equal
>>> to the device memory. I wonder if that would be too much of a trouble
>>> for an admin (or libvirt) triggering the Qemu process.
>>>
>>> Both these items are avoided by exposing the device memory as BAR as in the
>>> current  implementation (referenced above) since it lets Qemu to naturally
>>> discover the device memory region and do mmap.
>>>
>>
>> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
>> on the device, not on the memory backend.
>>
>> e.g., -device pc-dimm,node=3,memdev=mem1,...
> 

Also CCing Gavin; I remember he once experimented with virtio-mem + 
multiple memory-less nodes and it was more or less working (because of 
MEM_AFFINITY_HOTPLUGGABLE only on the last node, see below).

> Agreed, but still we will have the aforementioned issues viz.
> 1. The backing memory for the memory device would need to be allocated
> on the host. However, we do not add the device memory on the host in this
> case. Instead the Qemu VMA is mapped to the device memory physical
> address using remap_pfn_range().

I don't see why that would be necessary ...

> 2. The memory device need to be passed an allocation size such that all of
> the device memory is mapped into the Qemu VMA. This may not be readily
> available to the admin/libvirt.

... or that. But your proposal roughly looks like what I had in mind, so 
let's focus on that.

> 
> Based on the suggestions here, can we consider something like the
> following?
> 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> structure to make it hotpluggable.

Is that "devnode=on" parameter required? Can't we simply expose any node 
that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

Right now, with "ordinary", fixed-location memory devices 
(DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an SRAT entry that 
covers the device memory region for these devices with 
MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, 
which does not quite work IIRC. All applicable nodes that don't have 
boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

In your example, which memory ranges would we use for these nodes in SRAT?

> 2. Create several NUMA nodes with 'devnode' which are supposed to be
> associated with the vfio-pci device.
> 3. Pass the numa node start and count to associate the nodes created.
> 
> So, the command would look something like the following.
> ...
>          -numa node,nodeid=2,devnode=on \
>          -numa node,nodeid=3,devnode=on \
>          -numa node,nodeid=4,devnode=on \
>          -numa node,nodeid=5,devnode=on \
>          -numa node,nodeid=6,devnode=on \
>          -numa node,nodeid=7,devnode=on \
>          -numa node,nodeid=8,devnode=on \
>          -numa node,nodeid=9,devnode=on \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Better an array/list like "numa-nodes=2-9"

... but how would the device actually use these nodes? (which for which?)
Alex Williamson Sept. 26, 2023, 7:14 p.m. UTC | #8
On Tue, 26 Sep 2023 18:54:53 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 26.09.23 16:52, Ankit Agrawal wrote:
> >>>>> Good idea.  Fundamentally the device should not be creating NUMA
> >>>>> nodes, the VM should be configured with NUMA nodes and the device
> >>>>> memory associated with those nodes.  
> >>>>
> >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
> >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
> >>>> using virtio-mem).
> >>>>  
> >>>
> >>> We actually do not add the device memory on the host, instead
> >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
> >>> mmap function in vfio-pci variant driver code managing the device.
> >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> >>> And I think host memory backend would need memory that is added on the
> >>> host.
> >>>
> >>> Moreover since we want to passthrough the entire device memory, the
> >>> -object memory-backend-ram would have to be passed a size that is equal
> >>> to the device memory. I wonder if that would be too much of a trouble
> >>> for an admin (or libvirt) triggering the Qemu process.
> >>>
> >>> Both these items are avoided by exposing the device memory as BAR as in the
> >>> current  implementation (referenced above) since it lets Qemu to naturally
> >>> discover the device memory region and do mmap.
> >>>  
> >>
> >> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> >> on the device, not on the memory backend.
> >>
> >> e.g., -device pc-dimm,node=3,memdev=mem1,...  
> >   
> 
> Alco CCing Gavin, I remember he once experimented with virtio-mem + 
> multiple memory-less nodes and it was quite working (because of 
> MEM_AFFINITY_HOTPLUGGABLE only on the last node, below).
> 
> > Agreed, but still we will have the aforementioned issues viz.
> > 1. The backing memory for the memory device would need to be allocated
> > on the host. However, we do not add the device memory on the host in this
> > case. Instead the Qemu VMA is mapped to the device memory physical
> > address using remap_pfn_range().  
> 
> I don't see why that would be necessary ...
> 
> > 2. The memory device need to be passed an allocation size such that all of
> > the device memory is mapped into the Qemu VMA. This may not be readily
> > available to the admin/libvirt.  
> 
> ... or that. But your proposal roughly looks like what I had in mind, so 
> let's focus on that.
> 
> > 
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.  
> 
> Is that "devnode=on" parameter required? Can't we simply expose any node 
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
> 
> Right now, with "ordinary", fixed-location memory devices 
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that 
> covers the device memory region for these devices with 
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, 
> which does not quite work IIRC. All applicable nodes that don't have 
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.
> 
> In your example, which memory ranges would we use for these nodes in SRAT?
> 
> > 2. Create several NUMA nodes with 'devnode' which are supposed to be
> > associated with the vfio-pci device.
> > 3. Pass the numa node start and count to associate the nodes created.
> > 
> > So, the command would look something like the following.
> > ...
> >          -numa node,nodeid=2,devnode=on \
> >          -numa node,nodeid=3,devnode=on \
> >          -numa node,nodeid=4,devnode=on \
> >          -numa node,nodeid=5,devnode=on \
> >          -numa node,nodeid=6,devnode=on \
> >          -numa node,nodeid=7,devnode=on \
> >          -numa node,nodeid=8,devnode=on \
> >          -numa node,nodeid=9,devnode=on \
> >          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \  

I don't see how these numa-node args on a vfio-pci device have any
general utility.  They're only used to create a firmware table, so why
not be explicit about it and define the firmware table as an
object?  For example:

	-numa node,nodeid=2 \
	-numa node,nodeid=3 \
	-numa node,nodeid=4 \
	-numa node,nodeid=5 \
	-numa node,nodeid=6 \
	-numa node,nodeid=7 \
	-numa node,nodeid=8 \
	-numa node,nodeid=9 \
	-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
	-object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

There are some suggestions in this thread that CXL could have similar
requirements, but I haven't found any evidence that these
dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
rather than an NVIDIA specific one.

It seems like we could almost meet the requirement for this table via
-acpitable, but I think we'd like to avoid the VM orchestration tool
from creating, compiling, and passing ACPI data blobs into the VM.
Thanks,

Alex
Ankit Agrawal Sept. 27, 2023, 7:14 a.m. UTC | #9
> >
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.
>
> Is that "devnode=on" parameter required? Can't we simply expose any node
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
>
> Right now, with "ordinary", fixed-location memory devices
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> covers the device memory region for these devices with
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> which does not quite work IIRC. All applicable nodes that don't have
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

Yeah, you're right that it isn't required. Exposing the node without any memory as
MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using
"devnode=on".

> In your example, which memory ranges would we use for these nodes in SRAT?

We set the Base Address and the Size to 0 in the SRAT memory affinity
structures. This is done through the following call:
build_srat_memory(table_data, 0, 0, i,
                  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
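
In context, the loop emitting these entries presumably looks something like
this sketch (variable names are illustrative; see patch 3 for the exact code):

for (i = numa_node_start; i < numa_node_start + numa_node_count; i++) {
    /* one zero-sized, hotpluggable SRAT memory affinity entry per device PXM */
    build_srat_memory(table_data, 0, 0, i,
                      MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
}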

This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug

I would reiterate that we are just emulating the bare-metal behavior here.


> I don't see how these numa-node args on a vfio-pci device have any
> general utility.  They're only used to create a firmware table, so why
> don't we be explicit about it and define the firmware table as an
> object?  For example:
>
>        -numa node,nodeid=2 \
>        -numa node,nodeid=3 \
>        -numa node,nodeid=4 \
>        -numa node,nodeid=5 \
>        -numa node,nodeid=6 \
>        -numa node,nodeid=7 \
>        -numa node,nodeid=8 \
>        -numa node,nodeid=9 \
>        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
>        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

Yeah, that is fine with me. If we agree with this approach, I can go
implement it.


> There are some suggestions in this thread that CXL could have similar
> requirements, but I haven't found any evidence that these
> dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> rather than an NVIDIA specific one.

Maybe Jason, Jonathan can chime in on this?


> It seems like we could almost meet the requirement for this table via
> -acpitable, but I think we'd like to avoid the VM orchestration tool
> from creating, compiling, and passing ACPI data blobs into the VM.
Jonathan Cameron Sept. 27, 2023, 11:33 a.m. UTC | #10
On Wed, 27 Sep 2023 07:14:28 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> > >
> > > Based on the suggestions here, can we consider something like the
> > > following?
> > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > > structure to make it hotpluggable.  
> >
> > Is that "devnode=on" parameter required? Can't we simply expose any node
> > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

That needs some checking of what extra stuff we'd instantiate on CPU-only
(or, once we implement them, Generic Initiator / Generic Port) nodes -
I'm definitely not keen on doing so for Generic Ports (which QEMU doesn't yet
do, though there have been some RFCs, I think).

> > Right now, with "ordinary", fixed-location memory devices
> > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> > covers the device memory region for these devices with
> > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> > which does not quite work IIRC. All applicable nodes that don't have
> > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.  
> 
> Yeah, you're right that it isn't required. Exposing the node without any memory as
> MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using
> "devnode=on".
> 
> > In your example, which memory ranges would we use for these nodes in SRAT?  
> 
> We are setting the Base Address and the Size as 0 in the SRAT memory affinity
> structures. This is done through the following:
> build_srat_memory(table_data, 0, 0, i,
>                   MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> 
> This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug
> 
> I would re-iterate that we are just emulating the baremetal behavior here.
> 
> 
> > I don't see how these numa-node args on a vfio-pci device have any
> > general utility.  They're only used to create a firmware table, so why
> > don't we be explicit about it and define the firmware table as an
> > object?  For example:
> >
> >        -numa node,nodeid=2 \
> >        -numa node,nodeid=3 \
> >        -numa node,nodeid=4 \
> >        -numa node,nodeid=5 \
> >        -numa node,nodeid=6 \
> >        -numa node,nodeid=7 \
> >        -numa node,nodeid=8 \
> >        -numa node,nodeid=9 \
> >        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
> >        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \  
> 
> Yeah, that is fine with me. If we agree with this approach, I can go
> implement it.
> 
> 
> > There are some suggestions in this thread that CXL could have similar
> > requirements,

On the CXL side of things, if we're talking memory devices (type 3), I'm not
sure what the use case of this feature will be.
Either we treat them as normal memory, in which case it will all be static
at VM boot (for SRAT anyway - we might plug things in and
out of ranges), or it will be whole-device hotplug and look like pc-dimm
hotplug (which should be into a statically defined range in SRAT).
Longer term, if we look at virtualizing dynamic capacity
devices (not sure we need to, other than possibly to leverage
sparse DAX etc. on top of them), then we might want to provide
emulated CXL fixed memory windows in the guest (which get their own
NUMA nodes anyway) and plug the memory into that. We'd probably hide
away interleaving etc. in the host, as all the guest should care about
is performance information, and I doubt we'd want to emulate the
complexity of address routing.

Similar to the host PA ranges used in CXL fixed memory windows, I'm not sure
we wouldn't just allow the guest to have 'all' possible setups that
might get plugged later by simply burning a lot of HPA space, and hence
just be able to use static SRAT nodes covering each region.
This would be less painful than for real PAs because, as we are
emulating the CXL devices (probably as one emulated type 3 device per
potential set of real devices in an interleave set), we can avoid
all the ordering constraints of CXL address decoders that end up eating
up host PA space.

Virtualizing DCD is going to be a fun topic (that's next year's
plumbers CXL uconf session sorted ;), but I can see it might be done completely
differently and look nothing like a CXL device, in which case maybe
what you have here will make sense.

Come to think of it, you 'could' potentially do that for your use case
if the regions are reasonably bound in maximum size at the cost of
large GPA usage?

CXL accelerators / GPUs etc are a different question but who has one
of those anyway? :)


> > but I haven't found any evidence that these
> > dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> > any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> > rather than an NVIDIA specific one.  
> 
> Maybe Jason, Jonathan can chime in on this?

I'm not aware of anything general around this.  A PCI device
can have a _PXM, and I think you could define subdevices each with a
_PXM of their own?  Those subdevices would need drivers to interpret
the structure anyway, so there is no real benefit over a _DSD that I can
immediately think of...

If we think this will be common long term, anyone want to take
multiple _PXM per device support as a proposal to ACPI?

So agreed, it's not general. If it's acceptable to have zero-length
NUMA nodes (and I think we have to emulate them, given that's what real
hardware is doing, even if some of us think the real hardware shouldn't
have done that!), then just spinning them up explicitly as nodes
+ device-specific stuff for the NVIDIA device seems fine to me.

> 
> 
> > It seems like we could almost meet the requirement for this table via
> > -acpitable, but I think we'd like to avoid the VM orchestration tool
> > from creating, compiling, and passing ACPI data blobs into the VM.  
>
Jason Gunthorpe Sept. 27, 2023, 1:53 p.m. UTC | #11
On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:

> CXL accelerators / GPUs etc are a different question but who has one
> of those anyway? :)

That's exactly what I mean when I say CXL will need it too. I keep
describing this current Grace & Hopper as pre-CXL HW. You can easily
imagine draping CXL around it. CXL doesn't solve the problem that
motivates all this hacking - Linux can't dynamically create NUMA
nodes.

So even with CXL, bare-metal FW will have to create these nodes so Linux
can consume them. Hackity hack hack..

Broadly, when you start to look at all of CXL, any device with coherent
memory and the ability to create SRIOV-VF-like things is going to have
the problem that Linux driver software will want a NUMA node for every
possible VF.

So we can view this as some NVIDIA-specific hack, but really it is a
"coherent memory and SRIOV" hack that should be needed for any device
of this class.

I was hoping the CXL world would fix this stuff, but I was told they
also doubled down and are creating unnecessary ACPI subtlety to avoid
needing dynamic NUMA nodes in Linux.. Sigh.

Jason
Alex Williamson Sept. 27, 2023, 2:24 p.m. UTC | #12
On Wed, 27 Sep 2023 10:53:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> 
> > CXL accelerators / GPUs etc are a different question but who has one
> > of those anyway? :)  
> 
> That's exactly what I mean when I say CXL will need it too. I keep
> describing this current Grace & Hopper as pre-CXL HW. You can easially
> imagine draping CXL around it. CXL doesn't solve the problem that
> motivates all this hackying - Linux can't dynamically create NUMA
> nodes.

Why is that and why aren't we pushing towards a solution of removing
that barrier so that we don't require the machine topology to be
configured to support this use case and guest OS limitations?  Thanks,

Alex
Vikram Sethi Sept. 27, 2023, 3:03 p.m. UTC | #13
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 27, 2023 9:25 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> 
> 
> On Wed, 27 Sep 2023 10:53:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> >
> > > CXL accelerators / GPUs etc are a different question but who has one
> > > of those anyway? :)
> >
> > That's exactly what I mean when I say CXL will need it too. I keep
> > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > imagine draping CXL around it. CXL doesn't solve the problem that
> > motivates all this hackying - Linux can't dynamically create NUMA
> > nodes.
> 
> Why is that and why aren't we pushing towards a solution of removing that
> barrier so that we don't require the machine topology to be configured to
> support this use case and guest OS limitations?  Thanks,
>

Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL-type devices,
there is additional information FW knows that the kernel doesn't. For example,
what the distance/latency between the CPU and the device NUMA node is. While CXL devices
present a CDAT table, which gives latency-type attributes within the device, there would still be some
guesswork needed in the kernel as to what the end-to-end latency/distance is.
It's probably not the best outcome to just consider this generically "far memory", because
whether or not it is further away than inter-socket memory access matters.
Pre-CXL devices such as the one in this patchset don't even have CDAT, so the kernel by itself
has no idea whether this latency/distance is less than or more than the inter-socket memory
access latency.
So, especially for devices present at boot, FW knows this information and should provide it.
Similarly, QEMU should pass along this information to VMs for the best outcomes.

Thanks
> Alex
Jason Gunthorpe Sept. 27, 2023, 3:42 p.m. UTC | #14
On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,

I haven't looked myself, but I've been told this is very
hard. Probably the NUMA concept needs to be split a bit so that the
memory allocator pool handle is not tied to the ACPI.

> Even if Linux could create NUMA nodes dynamically for coherent CXL
> or CXL type devices, there is additional information FW knows that
> the kernel doesn't. For example, what the distance/latency between
> CPU and the device NUMA node is.

But that should just be the existing normal PCIy stuff to define
affinity of the PCI function. Since the memory is logically linked to
the PCI function we have no issue from a topology perspective.

> While CXL devices present a CDAT table which gives latency type
> attributes within the device, there would still be some guesswork
> needed in the kernel as to what the end to end latency/distance is.

Is it non-uniform per CXL function?

> knows this information and should provide it.  Similarly, QEMU
> should pass along this information to VMs for the best outcomes.

Well yes, ideally QEMU would pass vCPU affinity for the vPCI functions
so existing NUMA-aware allocations in PCI drivers are optimized, e.g. we
already put queues in memory close to the PCI function.

I think it is important to keep separate the need to know the
topology/affinity/etc. and the need for the driver to have a Linux NUMA
node handle to operate the memory allocator pool APIs.

Regardless, it is what it is for the foreseeable future :(

Jason
Alex Williamson Sept. 27, 2023, 4:37 p.m. UTC | #15
On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> there is additional information FW knows that the kernel doesn't. For example,
> what the distance/latency between CPU and the device NUMA node is. While CXL devices
> present a CDAT table which gives latency type attributes within the device, there would still be some
> guesswork needed in the kernel as to what the end to end latency/distance is. 
> It's probably not the best outcome to just consider this generically far memory" because 
> is it further than Intersocket memory access or not matters. 
> Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> no idea if this latency/distance is less than or more than inter socket memory access latency
> even.
> So specially for devices present at boot, FW knows this information and should provide it. 
> Similarly, QEMU should pass along this information to VMs for the best outcomes.  

Yeah, AFAICT we're not doing any of that in this series.  We're only
creating some number of nodes for the guest driver to make use of and
describing in the generated _DSD the set of nodes associated for use by
the device.  How many nodes are required, how the guest driver
partitions coherent memory among the nodes, and how the guest assigns a
distance to the nodes is unspecified.

A glance at the CDAT spec seems brilliant in this regard; is there such
a table for this device, or could/should the vfio-pci variant driver
provide one?  I imagine this is how the VM admin or orchestration tool
would learn to configure nodes and the VMM would further virtualize
this table for the guest OS and drivers.  Thanks,

Alex
Jonathan Cameron Sept. 28, 2023, 4:04 p.m. UTC | #16
On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> there is additional information FW knows that the kernel doesn't. For example,
> what the distance/latency between CPU and the device NUMA node is. While CXL devices
> present a CDAT table which gives latency type attributes within the device, there would still be some
> guesswork needed in the kernel as to what the end to end latency/distance is. 
FWIW,

there shouldn't be any guesswork needed (for the light-load case anyway, which is what would be
in HMAT). That's what Generic Ports were added to SRAT for. Dave Jiang has a
patch set
https://lore.kernel.org/all/168695160531.3031571.4875512229068707023.stgit@djiang5-mobl3/
to do the maths...  For CXL there is no problem fully describing the access characteristics.

> It's probably not the best outcome to just consider this generically far memory" because 
> is it further than Intersocket memory access or not matters. 
> Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> no idea if this latency/distance is less than or more than inter socket memory access latency
> even.

Just because I'm feeling cheeky - you could emulate a DOE and CDAT :)?
Though I suppose you don't want to teach the guest driver about it.

> So specially for devices present at boot, FW knows this information and should provide it. 
> Similarly, QEMU should pass along this information to VMs for the best outcomes.
No problem with the argument that FW has the info and should provide it,
just on the 'how' part.

Jonathan

> 
> Thanks
> > Alex  
>
Jonathan Cameron Sept. 28, 2023, 4:15 p.m. UTC | #17
On Wed, 27 Sep 2023 12:42:40 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 27, 2023 9:25 AM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > > 
> > > 
> > > On Wed, 27 Sep 2023 10:53:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > > >  
> > > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > > of those anyway? :)  
> > > >
> > > > That's exactly what I mean when I say CXL will need it too. I keep
> > > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > > motivates all this hackying - Linux can't dynamically create NUMA
> > > > nodes.  
> > > 
> > > Why is that and why aren't we pushing towards a solution of removing that
> > > barrier so that we don't require the machine topology to be configured to
> > > support this use case and guest OS limitations?  Thanks,  
> 
> I haven't looked myself, but I've been told this is very
> hard. Probably the NUMA concept needs to be split a bit so that the
> memory allocator pool handle is not tied to the ACPI.
> 
> > Even if Linux could create NUMA nodes dynamically for coherent CXL
> > or CXL type devices, there is additional information FW knows that
> > the kernel doesn't. For example, what the distance/latency between
> > CPU and the device NUMA node is.  
> 
> But that should just be the existing normal PCIy stuff to define
> affinity of the PCI function. Since the memory is logically linked to
> the PCI function we have no issue from a topology perspective.

Agreed - if there is a node to associate it with, this is easy to cover.
If the aim is just to make the info available, we could just use Generic Initiator
nodes instead.  Those were added for memory-free accelerators described by the
BIOS, to fill the gap for things that aren't processors
or memory but for which bandwidth and latency numbers are useful.
This falls into your comment below about the need not being the topology
as such, but rather the node number needed for the memory allocator.

> 
> > While CXL devices present a CDAT table which gives latency type
> > attributes within the device, there would still be some guesswork
> > needed in the kernel as to what the end to end latency/distance is.  
> 
> Is it non-uniform per CXL function?

Could be, but if so it is all discoverable, though the mapping is to a
device PA range rather than to a 'function' as such.  You are certainly
allowed lots of different PA ranges with different characteristics.

> 
> > knows this information and should provide it.  Similarly, QEMU
> > should pass along this information to VMs for the best outcomes.  
> 
> Well yes, ideally qemu would pass vCPU affinity for the vPCI functions
> so existing NUMA aware allocations in PCI drivers are optimized. eg we
> put queues in memory close to the PCI function already.

I thought we did that via PXBs?

> 
> I think it is important to keep seperated the need to know the
> topology/affinity/etc and the need for the driver to have a Linux NUMA
> node handle to operate the memory alocator pool APIs.
> 
> Regardless, it is what it is for the foreseeable future :(
:(

> 
> Jason
Jonathan Cameron Sept. 28, 2023, 4:29 p.m. UTC | #18
On Wed, 27 Sep 2023 10:37:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Sep 2023 15:03:09 +0000
> Vikram Sethi <vsethi@nvidia.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 27, 2023 9:25 AM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > > 
> > > 
> > > On Wed, 27 Sep 2023 10:53:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >     
> > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > > >    
> > > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > > of those anyway? :)    
> > > >
> > > > That's exactly what I mean when I say CXL will need it too. I keep
> > > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > > motivates all this hackying - Linux can't dynamically create NUMA
> > > > nodes.    
> > > 
> > > Why is that and why aren't we pushing towards a solution of removing that
> > > barrier so that we don't require the machine topology to be configured to
> > > support this use case and guest OS limitations?  Thanks,
> > >    
> > 
> > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> > there is additional information FW knows that the kernel doesn't. For example,
> > what the distance/latency between CPU and the device NUMA node is. While CXL devices
> > present a CDAT table which gives latency type attributes within the device, there would still be some
> > guesswork needed in the kernel as to what the end to end latency/distance is. 
> > It's probably not the best outcome to just consider this generically far memory" because 
> > is it further than Intersocket memory access or not matters. 
> > Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> > no idea if this latency/distance is less than or more than inter socket memory access latency
> > even.
> > So specially for devices present at boot, FW knows this information and should provide it. 
> > Similarly, QEMU should pass along this information to VMs for the best outcomes.    
> 
> Yeah, AFAICT we're not doing any of that in this series.  We're only
> creating some number of nodes for the guest driver to make use of and
> describing in the generated _DSD the set of nodes associated for use by
> the device.  How many nodes are required, how the guest driver
> partitions coherent memory among the nodes, and how the guest assigns a
> distance to the nodes is unspecified.
> 
> A glance at the CDAT spec seems brilliant in this regard, is there such
> a table for this device or could/should the vfio-pci variant driver
> provide one?  I imagine this is how the VM admin or orchestration tool
> would learn to configure nodes and the VMM would further virtualize
> this table for the guest OS and drivers.  Thanks,

I like this proposal. The host bit seems doable.  However, I'm not sure
virtualizing it would help device drivers that have no awareness of it, unless
we slap some generic layer into the Linux driver core that can use this if
available.  It would certainly be easy enough to add emulation of it to the
guest's PCI function config space (as long as we aren't out of room!).

For CXL I think we will have to do this anyway. For a complex topology we
could either:
1) Present a matching topology in the guest with all the components involved
   including switches with right numbers of ports etc so the CDAT tables
   can be read directly from underlying hardware, or...
2) Present a somewhat matching topology in the guest having just messed with
   switches so that we can present their existence but not all the ports etc.
   So emulate a cut down CDAT for the switch.
3) Allow flattening everything and present a single CDAT with aggregate numbers
   at the EP. That will get a bit messy as we may need the kernel to ignore some
   parts it would normally account for (link widths on CXL buses etc.)

So I'm thinking 1 is too restrictive, hence the need to emulate the table anyway
(not to mention it might not be on the function being assigned anyway, as they
are normally only on Function 0 IIRC).

Gah. Too much of CXL emulation still to do in QEMU already....

Jonathan




> 
> Alex
>