Message ID | 20230915024559.6565-1-ankita@nvidia.com (mailing list archive) |
---|---|
Series | vfio: report NUMA nodes for device memory |
Hello Ankit,

On 9/15/23 04:45, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> For devices which allow CPU to cache coherently access their memory,
> it is sensible to expose such memory as NUMA nodes separate from
> the sysmem node. Qemu currently do not provide a mechanism for creation
> of NUMA nodes associated with a vfio-pci device.
>
> Implement a mechanism to create and associate a set of unique NUMA nodes
> with a vfio-pci device.
>
> NUMA node is created by inserting a series of the unique proximity
> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> at the time of bootup by the kernel to determine the NUMA configuration
> and is inflexible post that. Hence this feature is incompatible with
> device hotplug. The added node range associated with the device is
> communicated through ACPI DSD and can be fetched by the VM kernel or
> kernel modules. QEMU's VM SRAT and DSD builder code is modified
> accordingly.
>
> New command line params are introduced for admin to have a control on
> the NUMA node assignment.

This approach seems to bypass the NUMA framework in place in QEMU and
will be a challenge for the upper layers. QEMU is generally used from
libvirt when dealing with KVM guests.

Typically, a command line for a virt machine with NUMA nodes would look
like :

-object memory-backend-ram,id=ram-node0,size=1G \
-numa node,nodeid=0,memdev=ram-node0 \
-object memory-backend-ram,id=ram-node1,size=1G \
-numa node,nodeid=1,cpus=0-3,memdev=ram-node1

which defines 2 nodes, one with memory and all CPUs and a second with
only memory.

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1003 MB
node 0 free: 734 MB
node 1 cpus:
node 1 size: 975 MB
node 1 free: 968 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Could it be a new type of host memory backend ? Have you considered
this approach ?

Thanks,

C.

>
> It is expected for a vfio-pci driver to expose this feature through
> sysfs. Presence of the feature is checked to enable these code changes.
>
> Applied over v8.1.0-rc4.
>
> Ankit Agrawal (4):
>   vfio: new command line params for device memory NUMA nodes
>   vfio: assign default values to node params
>   hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
>   acpi/gpex: patch guest DSDT for dev mem information
>
>  hw/arm/virt-acpi-build.c    |  54 +++++++++++++
>  hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
>  hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h               |   2 +
>  include/hw/pci/pci_device.h |   3 +
>  5 files changed, 274 insertions(+)
>
On Fri, 15 Sep 2023 16:19:29 +0200 Cédric Le Goater <clg@redhat.com> wrote: > Hello Ankit, > > On 9/15/23 04:45, ankita@nvidia.com wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > > > For devices which allow CPU to cache coherently access their memory, > > it is sensible to expose such memory as NUMA nodes separate from > > the sysmem node. Qemu currently do not provide a mechanism for creation > > of NUMA nodes associated with a vfio-pci device. > > > > Implement a mechanism to create and associate a set of unique NUMA nodes > > with a vfio-pci device.> > > NUMA node is created by inserting a series of the unique proximity > > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once > > at the time of bootup by the kernel to determine the NUMA configuration > > and is inflexible post that. Hence this feature is incompatible with > > device hotplug. The added node range associated with the device is > > communicated through ACPI DSD and can be fetched by the VM kernel or > > kernel modules. QEMU's VM SRAT and DSD builder code is modified > > accordingly. > > > > New command line params are introduced for admin to have a control on > > the NUMA node assignment. > > This approach seems to bypass the NUMA framework in place in QEMU and > will be a challenge for the upper layers. QEMU is generally used from > libvirt when dealing with KVM guests. > > Typically, a command line for a virt machine with NUMA nodes would look > like : > > -object memory-backend-ram,id=ram-node0,size=1G \ > -numa node,nodeid=0,memdev=ram-node0 \ > -object memory-backend-ram,id=ram-node1,size=1G \ > -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 > > which defines 2 nodes, one with memory and all CPUs and a second with > only memory. > > # numactl -H > available: 2 nodes (0-1) > node 0 cpus: 0 1 2 3 > node 0 size: 1003 MB > node 0 free: 734 MB > node 1 cpus: > node 1 size: 975 MB > node 1 free: 968 MB > node distances: > node 0 1 > 0: 10 20 > 1: 20 10 > > > Could it be a new type of host memory backend ? Have you considered > this approach ? Good idea. Fundamentally the device should not be creating NUMA nodes, the VM should be configured with NUMA nodes and the device memory associated with those nodes. I think we're also dealing with a lot of very, very device specific behavior, so I question whether we shouldn't create a separate device for this beyond vifo-pci or vfio-pci-nohotplug. In particular, a PCI device typically only has association to a single proximity domain, so what sense does it make to describe the coherent memory as a PCI BAR to only then create a confusing mapping where the device has a proximity domain separate from a resources associated with the device? It's seeming like this device should create memory objects that can be associated as memory backing for command line specified NUMA nodes. Thanks, Alex
On 15.09.23 16:47, Alex Williamson wrote: > On Fri, 15 Sep 2023 16:19:29 +0200 > Cédric Le Goater <clg@redhat.com> wrote: > >> Hello Ankit, >> >> On 9/15/23 04:45, ankita@nvidia.com wrote: >>> From: Ankit Agrawal <ankita@nvidia.com> >>> >>> For devices which allow CPU to cache coherently access their memory, >>> it is sensible to expose such memory as NUMA nodes separate from >>> the sysmem node. Qemu currently do not provide a mechanism for creation >>> of NUMA nodes associated with a vfio-pci device. >>> >>> Implement a mechanism to create and associate a set of unique NUMA nodes >>> with a vfio-pci device.> >>> NUMA node is created by inserting a series of the unique proximity >>> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once >>> at the time of bootup by the kernel to determine the NUMA configuration >>> and is inflexible post that. Hence this feature is incompatible with >>> device hotplug. The added node range associated with the device is >>> communicated through ACPI DSD and can be fetched by the VM kernel or >>> kernel modules. QEMU's VM SRAT and DSD builder code is modified >>> accordingly. >>> >>> New command line params are introduced for admin to have a control on >>> the NUMA node assignment. >> >> This approach seems to bypass the NUMA framework in place in QEMU and >> will be a challenge for the upper layers. QEMU is generally used from >> libvirt when dealing with KVM guests. >> >> Typically, a command line for a virt machine with NUMA nodes would look >> like : >> >> -object memory-backend-ram,id=ram-node0,size=1G \ >> -numa node,nodeid=0,memdev=ram-node0 \ >> -object memory-backend-ram,id=ram-node1,size=1G \ >> -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 >> >> which defines 2 nodes, one with memory and all CPUs and a second with >> only memory. >> >> # numactl -H >> available: 2 nodes (0-1) >> node 0 cpus: 0 1 2 3 >> node 0 size: 1003 MB >> node 0 free: 734 MB >> node 1 cpus: >> node 1 size: 975 MB >> node 1 free: 968 MB >> node distances: >> node 0 1 >> 0: 10 20 >> 1: 20 10 >> >> >> Could it be a new type of host memory backend ? Have you considered >> this approach ? > > Good idea. Fundamentally the device should not be creating NUMA nodes, > the VM should be configured with NUMA nodes and the device memory > associated with those nodes. +1. That would also make it fly with DIMMs and virtio-mem, where you would want NUMA-less nodes ass well (imagine passing CXL memory to a VM using virtio-mem).
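[For illustration: a rough sketch of the kind of configuration being alluded to here — a guest NUMA node with no boot memory that is populated through a virtio-mem device. The sizes, object IDs and node numbers below are made up for the example and are not taken from the patch series.]

-m 4G,maxmem=20G \
-object memory-backend-ram,id=ram-node0,size=4G \
-numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \
-numa node,nodeid=1 \
-object memory-backend-ram,id=vmem0,size=16G \
-device virtio-mem-pci,id=vm0,memdev=vmem0,node=1,requested-size=0 \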
> >> > >> Typically, a command line for a virt machine with NUMA nodes would > >> look like : > >> > >> -object memory-backend-ram,id=ram-node0,size=1G \ > >> -numa node,nodeid=0,memdev=ram-node0 \ > >> -object memory-backend-ram,id=ram-node1,size=1G \ > >> -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 > >> > >> which defines 2 nodes, one with memory and all CPUs and a second with > >> only memory. > >> > >> # numactl -H > >> available: 2 nodes (0-1) > >> node 0 cpus: 0 1 2 3 > >> node 0 size: 1003 MB > >> node 0 free: 734 MB > >> node 1 cpus: > >> node 1 size: 975 MB > >> node 1 free: 968 MB > >> node distances: > >> node 0 1 > >> 0: 10 20 > >> 1: 20 10 > >> > >> > >> Could it be a new type of host memory backend ? Have you considered > >> this approach ? > > > > Good idea. Fundamentally the device should not be creating NUMA > > nodes, the VM should be configured with NUMA nodes and the device > > memory associated with those nodes. > > +1. That would also make it fly with DIMMs and virtio-mem, where you > would want NUMA-less nodes ass well (imagine passing CXL memory to a VM > using virtio-mem). > We actually do not add the device memory on the host, instead map it into the Qemu VMA using remap_pfn_range(). Please checkout the mmap function in vfio-pci variant driver code managing the device. https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ And I think host memory backend would need memory that is added on the host. Moreover since we want to passthrough the entire device memory, the -object memory-backend-ram would have to be passed a size that is equal to the device memory. I wonder if that would be too much of a trouble for an admin (or libvirt) triggering the Qemu process. Both these items are avoided by exposing the device memory as BAR as in the current implementation (referenced above) since it lets Qemu to naturally discover the device memory region and do mmap.
On 22.09.23 10:11, Ankit Agrawal wrote: >>>> >>>> Typically, a command line for a virt machine with NUMA nodes would >>>> look like : >>>> >>>> -object memory-backend-ram,id=ram-node0,size=1G \ >>>> -numa node,nodeid=0,memdev=ram-node0 \ >>>> -object memory-backend-ram,id=ram-node1,size=1G \ >>>> -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 >>>> >>>> which defines 2 nodes, one with memory and all CPUs and a second with >>>> only memory. >>>> >>>> # numactl -H >>>> available: 2 nodes (0-1) >>>> node 0 cpus: 0 1 2 3 >>>> node 0 size: 1003 MB >>>> node 0 free: 734 MB >>>> node 1 cpus: >>>> node 1 size: 975 MB >>>> node 1 free: 968 MB >>>> node distances: >>>> node 0 1 >>>> 0: 10 20 >>>> 1: 20 10 >>>> >>>> >>>> Could it be a new type of host memory backend ? Have you considered >>>> this approach ? >>> >>> Good idea. Fundamentally the device should not be creating NUMA >>> nodes, the VM should be configured with NUMA nodes and the device >>> memory associated with those nodes. >> >> +1. That would also make it fly with DIMMs and virtio-mem, where you >> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM >> using virtio-mem). >> > > We actually do not add the device memory on the host, instead > map it into the Qemu VMA using remap_pfn_range(). Please checkout the > mmap function in vfio-pci variant driver code managing the device. > https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ > And I think host memory backend would need memory that is added on the > host. > > Moreover since we want to passthrough the entire device memory, the > -object memory-backend-ram would have to be passed a size that is equal > to the device memory. I wonder if that would be too much of a trouble > for an admin (or libvirt) triggering the Qemu process. > > Both these items are avoided by exposing the device memory as BAR as in the > current implementation (referenced above) since it lets Qemu to naturally > discover the device memory region and do mmap. > Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured on the device, not on the memory backend. e.g., -device pc-dimm,node=3,memdev=mem1,...
>>>> Good idea. Fundamentally the device should not be creating NUMA
>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>> memory associated with those nodes.
>>>
>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>> using virtio-mem).
>>>
>>
>> We actually do not add the device memory on the host, instead
>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>> mmap function in vfio-pci variant driver code managing the device.
>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>> And I think host memory backend would need memory that is added on the
>> host.
>>
>> Moreover since we want to passthrough the entire device memory, the
>> -object memory-backend-ram would have to be passed a size that is equal
>> to the device memory. I wonder if that would be too much of a trouble
>> for an admin (or libvirt) triggering the Qemu process.
>>
>> Both these items are avoided by exposing the device memory as BAR as in the
>> current implementation (referenced above) since it lets Qemu to naturally
>> discover the device memory region and do mmap.
>>
>
> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> on the device, not on the memory backend.
>
> e.g., -device pc-dimm,node=3,memdev=mem1,...

Agreed, but still we will have the aforementioned issues, viz.
1. The backing memory for the memory device would need to be allocated
on the host. However, we do not add the device memory on the host in this
case. Instead the Qemu VMA is mapped to the device memory physical
address using remap_pfn_range().
2. The memory device needs to be passed an allocation size such that all of
the device memory is mapped into the Qemu VMA. This may not be readily
available to the admin/libvirt.

Based on the suggestions here, can we consider something like the
following?
1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
structure to make it hotpluggable.
2. Create several NUMA nodes with 'devnode' which are supposed to be
associated with the vfio-pci device.
3. Pass the numa node start and count to associate the nodes created.

So, the command would look something like the following.
...
-numa node,nodeid=2,devnode=on \
-numa node,nodeid=3,devnode=on \
-numa node,nodeid=4,devnode=on \
-numa node,nodeid=5,devnode=on \
-numa node,nodeid=6,devnode=on \
-numa node,nodeid=7,devnode=on \
-numa node,nodeid=8,devnode=on \
-numa node,nodeid=9,devnode=on \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Thoughts?
On 26.09.23 16:52, Ankit Agrawal wrote: >>>>> Good idea. Fundamentally the device should not be creating NUMA >>>>> nodes, the VM should be configured with NUMA nodes and the device >>>>> memory associated with those nodes. >>>> >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM >>>> using virtio-mem). >>>> >>> >>> We actually do not add the device memory on the host, instead >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the >>> mmap function in vfio-pci variant driver code managing the device. >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ >>> And I think host memory backend would need memory that is added on the >>> host. >>> >>> Moreover since we want to passthrough the entire device memory, the >>> -object memory-backend-ram would have to be passed a size that is equal >>> to the device memory. I wonder if that would be too much of a trouble >>> for an admin (or libvirt) triggering the Qemu process. >>> >>> Both these items are avoided by exposing the device memory as BAR as in the >>> current implementation (referenced above) since it lets Qemu to naturally >>> discover the device memory region and do mmap. >>> >> >> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured >> on the device, not on the memory backend. >> >> e.g., -device pc-dimm,node=3,memdev=mem1,... > Alco CCing Gavin, I remember he once experimented with virtio-mem + multiple memory-less nodes and it was quite working (because of MEM_AFFINITY_HOTPLUGGABLE only on the last node, below). > Agreed, but still we will have the aforementioned issues viz. > 1. The backing memory for the memory device would need to be allocated > on the host. However, we do not add the device memory on the host in this > case. Instead the Qemu VMA is mapped to the device memory physical > address using remap_pfn_range(). I don't see why that would be necessary ... > 2. The memory device need to be passed an allocation size such that all of > the device memory is mapped into the Qemu VMA. This may not be readily > available to the admin/libvirt. ... or that. But your proposal roughly looks like what I had in mind, so let's focus on that. > > Based on the suggestions here, can we consider something like the > following? > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity > structure to make it hotpluggable. Is that "devnode=on" parameter required? Can't we simply expose any node that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE? Right now, with "ordinary", fixed-location memory devices (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that covers the device memory region for these devices with MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, which does not quite work IIRC. All applicable nodes that don't have boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them. In your example, which memory ranges would we use for these nodes in SRAT? > 2. Create several NUMA nodes with 'devnode' which are supposed to be > associated with the vfio-pci device. > 3. Pass the numa node start and count to associate the nodes created. > > So, the command would look something like the following. > ... 
> -numa node,nodeid=2,devnode=on \
> -numa node,nodeid=3,devnode=on \
> -numa node,nodeid=4,devnode=on \
> -numa node,nodeid=5,devnode=on \
> -numa node,nodeid=6,devnode=on \
> -numa node,nodeid=7,devnode=on \
> -numa node,nodeid=8,devnode=on \
> -numa node,nodeid=9,devnode=on \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Better an array/list like "numa-nodes=2-9" ... but how would the device
actually use these nodes? (which for which?)
On Tue, 26 Sep 2023 18:54:53 +0200 David Hildenbrand <david@redhat.com> wrote: > On 26.09.23 16:52, Ankit Agrawal wrote: > >>>>> Good idea. Fundamentally the device should not be creating NUMA > >>>>> nodes, the VM should be configured with NUMA nodes and the device > >>>>> memory associated with those nodes. > >>>> > >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you > >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM > >>>> using virtio-mem). > >>>> > >>> > >>> We actually do not add the device memory on the host, instead > >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the > >>> mmap function in vfio-pci variant driver code managing the device. > >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ > >>> And I think host memory backend would need memory that is added on the > >>> host. > >>> > >>> Moreover since we want to passthrough the entire device memory, the > >>> -object memory-backend-ram would have to be passed a size that is equal > >>> to the device memory. I wonder if that would be too much of a trouble > >>> for an admin (or libvirt) triggering the Qemu process. > >>> > >>> Both these items are avoided by exposing the device memory as BAR as in the > >>> current implementation (referenced above) since it lets Qemu to naturally > >>> discover the device memory region and do mmap. > >>> > >> > >> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured > >> on the device, not on the memory backend. > >> > >> e.g., -device pc-dimm,node=3,memdev=mem1,... > > > > Alco CCing Gavin, I remember he once experimented with virtio-mem + > multiple memory-less nodes and it was quite working (because of > MEM_AFFINITY_HOTPLUGGABLE only on the last node, below). > > > Agreed, but still we will have the aforementioned issues viz. > > 1. The backing memory for the memory device would need to be allocated > > on the host. However, we do not add the device memory on the host in this > > case. Instead the Qemu VMA is mapped to the device memory physical > > address using remap_pfn_range(). > > I don't see why that would be necessary ... > > > 2. The memory device need to be passed an allocation size such that all of > > the device memory is mapped into the Qemu VMA. This may not be readily > > available to the admin/libvirt. > > ... or that. But your proposal roughly looks like what I had in mind, so > let's focus on that. > > > > > Based on the suggestions here, can we consider something like the > > following? > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity > > structure to make it hotpluggable. > > Is that "devnode=on" parameter required? Can't we simply expose any node > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE? > > Right now, with "ordinary", fixed-location memory devices > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that > covers the device memory region for these devices with > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, > which does not quite work IIRC. All applicable nodes that don't have > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them. > > In your example, which memory ranges would we use for these nodes in SRAT? > > > 2. Create several NUMA nodes with 'devnode' which are supposed to be > > associated with the vfio-pci device. > > 3. 
> > Pass the numa node start and count to associate the nodes created.
> >
> > So, the command would look something like the following.
> > ...
> > -numa node,nodeid=2,devnode=on \
> > -numa node,nodeid=3,devnode=on \
> > -numa node,nodeid=4,devnode=on \
> > -numa node,nodeid=5,devnode=on \
> > -numa node,nodeid=6,devnode=on \
> > -numa node,nodeid=7,devnode=on \
> > -numa node,nodeid=8,devnode=on \
> > -numa node,nodeid=9,devnode=on \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

I don't see how these numa-node args on a vfio-pci device have any
general utility. They're only used to create a firmware table, so why
don't we be explicit about it and define the firmware table as an
object? For example:

-numa node,nodeid=2 \
-numa node,nodeid=3 \
-numa node,nodeid=4 \
-numa node,nodeid=5 \
-numa node,nodeid=6 \
-numa node,nodeid=7 \
-numa node,nodeid=8 \
-numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
-object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

There are some suggestions in this thread that CXL could have similar
requirements, but I haven't found any evidence that these
dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
any way. If they are, maybe this would be a dev-mem-pxm-acpi object
rather than an NVIDIA specific one.

It seems like we could almost meet the requirement for this table via
-acpitable, but I think we'd like to avoid the VM orchestration tool
from creating, compiling, and passing ACPI data blobs into the VM.

Thanks,

Alex
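[For illustration: a hedged sketch of how an object along the lines proposed above might emit the dev-mem-pxm-{start,count} _DSD properties using QEMU's existing AML helpers. The variable names (dev, pxm_start, pxm_count) are assumptions for the example, not code from the posted series.]

    /*
     * Sketch only: attach a _DSD carrying the device-properties UUID to
     * the GPU's ACPI device node ('dev', assumed to already exist in the
     * DSDT), exposing the first proximity domain and the number of
     * domains reserved for its coherent memory.
     */
    Aml *dsd = aml_package(2);
    Aml *props = aml_package(2);
    Aml *start = aml_package(2);
    Aml *count = aml_package(2);

    aml_append(dsd, aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301"));
    aml_append(start, aml_string("dev-mem-pxm-start"));
    aml_append(start, aml_int(pxm_start));
    aml_append(count, aml_string("dev-mem-pxm-count"));
    aml_append(count, aml_int(pxm_count));
    aml_append(props, start);
    aml_append(props, count);
    aml_append(dsd, props);
    aml_append(dev, aml_name_decl("_DSD", dsd));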
> >
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.
>
> Is that "devnode=on" parameter required? Can't we simply expose any node
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
>
> Right now, with "ordinary", fixed-location memory devices
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> covers the device memory region for these devices with
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> which does not quite work IIRC. All applicable nodes that don't have
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

Yeah, you're right that it isn't required. Exposing the node without any
memory as MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than
using "devnode=on".

> In your example, which memory ranges would we use for these nodes in SRAT?

We are setting the Base Address and the Size as 0 in the SRAT memory affinity
structures. This is done through the following:

    build_srat_memory(table_data, 0, 0, i,
                      MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

This results in the following logs in the VM from the Linux ACPI SRAT parsing code:

[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug

I would re-iterate that we are just emulating the baremetal behavior here.

> I don't see how these numa-node args on a vfio-pci device have any
> general utility. They're only used to create a firmware table, so why
> don't we be explicit about it and define the firmware table as an
> object? For example:
>
> -numa node,nodeid=2 \
> -numa node,nodeid=3 \
> -numa node,nodeid=4 \
> -numa node,nodeid=5 \
> -numa node,nodeid=6 \
> -numa node,nodeid=7 \
> -numa node,nodeid=8 \
> -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
> -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

Yeah, that is fine with me. If we agree with this approach, I can go
implement it.

> There are some suggestions in this thread that CXL could have similar
> requirements, but I haven't found any evidence that these
> dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> any way. If they are, maybe this would be a dev-mem-pxm-acpi object
> rather than an NVIDIA specific one.

Maybe Jason, Jonathan can chime in on this?

> It seems like we could almost meet the requirement for this table via
> -acpitable, but I think we'd like to avoid the VM orchestration tool
> from creating, compiling, and passing ACPI data blobs into the VM.
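[For illustration: on the SRAT side, what is described above amounts to a small loop in the ACPI table builder. A hedged sketch, with dev_node_start and dev_node_count as illustrative names rather than the actual fields of the posted patches; 'table_data' is the SRAT GArray being built:]

    /*
     * Emit one zero-length, hotpluggable Memory Affinity entry per
     * proximity domain reserved for the device, mirroring what bare
     * metal firmware reports for this GPU.
     */
    for (int i = dev_node_start; i < dev_node_start + dev_node_count; i++) {
        build_srat_memory(table_data, 0, 0, i,
                          MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
    }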
On Wed, 27 Sep 2023 07:14:28 +0000 Ankit Agrawal <ankita@nvidia.com> wrote: > > > > > > Based on the suggestions here, can we consider something like the > > > following? > > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark > > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity > > > structure to make it hotpluggable. > > > > Is that "devnode=on" parameter required? Can't we simply expose any node > > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE? That needs some checking for what extra stuff we'll instantiate on CPU only (or once we implement them) Generic Initiator / Generic Port nodes - I'm definitely not keen on doing so for generic ports (which QEMU doesn't yet do though there have been some RFCs I think). > > Right now, with "ordinary", fixed-location memory devices > > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that > > covers the device memory region for these devices with > > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, > > which does not quite work IIRC. All applicable nodes that don't have > > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them. > > Yeah, you're right that it isn't required. Exposing the node without any memory as > MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using > "devnode=on". > > > In your example, which memory ranges would we use for these nodes in SRAT? > > We are setting the Base Address and the Size as 0 in the SRAT memory affinity > structures. This is done through the following: > build_srat_memory(table_data, 0, 0, i, > MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED); > > This results in the following logs in the VM from the Linux ACPI SRAT parsing code: > [ 0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug > [ 0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug > > I would re-iterate that we are just emulating the baremetal behavior here. > > > > I don't see how these numa-node args on a vfio-pci device have any > > general utility. They're only used to create a firmware table, so why > > don't we be explicit about it and define the firmware table as an > > object? For example: > > > > -numa node,nodeid=2 \ > > -numa node,nodeid=3 \ > > -numa node,nodeid=4 \ > > -numa node,nodeid=5 \ > > -numa node,nodeid=6 \ > > -numa node,nodeid=7 \ > > -numa node,nodeid=8 \ > > -numa node,nodeid=9 \ > > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \ > > -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \ > > Yeah, that is fine with me. If we agree with this approach, I can go > implement it. > > > > There are some suggestions in this thread that CXL could have similar > > requirements, For CXL side of things, if talking memory devices (type 3), I'm not sure what the usecase will be of this feature. 
Either we treat them as normal memory in which case it will all be static at boot of the VM (for SRAT anyway - we might plug things in and out of ranges), or it will be whole device hotplug and look like pc-dimm hotplug (which should be into a statically defined range in SRAT). Longer term if we look at virtualizing dynamic capacity devices (not sure we need to other that possibly to leverage sparse DAX etc on top of them) then we might want to provide emulated CXL Fixed memory windows in the guest (which get their own NUMA nodes anyway) + plug the memory into that. We'd probably hide away interleaving etc in the host as all the guest should care about is performance information and I doubt we'd want to emulate the complexity of address routing complexities. Similar to host PA ranges used in CXL fixed memory windows, I'm not sure we wouldn't just allow for the guest to have 'all' possible setups that might get plugged later by just burning a lot of HPA space and hence just be able to use static SRAT nodes covering each region. This would be less painful than for real PAs because as we are emulating the CXL devices, probably as one emulated type 3 device per potential set of real devices in an interleave set we can avoid all the ordering constraints of CXL address decoders that end up eating up Host PA space. Virtualizing DCD is going to be a fun topic (that's next year's plumbers CXL uconf session sorted ;), but I can see it might be done completely differently and look nothing like a CXL device, in which case maybe what you have here will make sense. Come to think of it, you 'could' potentially do that for your use case if the regions are reasonably bound in maximum size at the cost of large GPA usage? CXL accelerators / GPUs etc are a different question but who has one of those anyway? :) > > but I haven't found any evidence that these > > dev-mem-pxm-{start,count} attributes in the _DSD are standardized in > > any way. If they are, maybe this would be a dev-mem-pxm-acpi object > > rather than an NVIDIA specific one. > > Maybe Jason, Jonathan can chime in on this? I'm not aware of anything general around this. A PCI device can have a _PXM and I think you could define subdevices each with a _PXM of their own? Those subdevices would need drivers to interpret the structure anyway so not real benefit over a _DSD that I can immediately think of... If we think this will be common long term, anyone want to take multiple _PXM per device support as a proposal to ACPI? So agreed, it's not general, so if it's acceptable to have 0 length NUMA nodes (and I think we have to emulate them given that's what real hardware is doing even if some of us think the real hardware shouldn't have done that!) then just spinning them up explicitly as nodes + device specific stuff for the NVIDIA device seems fine to me. > > > > It seems like we could almost meet the requirement for this table via > > -acpitable, but I think we'd like to avoid the VM orchestration tool > > from creating, compiling, and passing ACPI data blobs into the VM. >
On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:

> CXL accelerators / GPUs etc are a different question but who has one
> of those anyway? :)

That's exactly what I mean when I say CXL will need it too. I keep
describing this current Grace & Hopper as pre-CXL HW. You can easily
imagine draping CXL around it. CXL doesn't solve the problem that
motivates all this hacking - Linux can't dynamically create NUMA
nodes. So even with CXL, bare metal FW will have to create these nodes
so Linux can consume them. Hackity hack hack..

Broadly, when you start to look at all of CXL, any device with coherent
memory and the ability to create SRIOV VF-like things is going to have
a problem that Linux driver software will want a NUMA node for every
possible VF.

So we can view this as some generic NVIDIA hack, but really it is a
"coherent memory and SRIOV" hack that should be needed for any device
of this class.

I was hoping the CXL world would fix this stuff, but I was told they
also doubled down and are creating unnecessary ACPI subtlety to avoid
needing dynamic NUMA nodes in Linux.. Sigh.

Jason
On Wed, 27 Sep 2023 10:53:36 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > CXL accelerators / GPUs etc are a different question but who has one > > of those anyway? :) > > That's exactly what I mean when I say CXL will need it too. I keep > describing this current Grace & Hopper as pre-CXL HW. You can easially > imagine draping CXL around it. CXL doesn't solve the problem that > motivates all this hackying - Linux can't dynamically create NUMA > nodes. Why is that and why aren't we pushing towards a solution of removing that barrier so that we don't require the machine topology to be configured to support this use case and guest OS limitations? Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, September 27, 2023 9:25 AM > To: Jason Gunthorpe <jgg@nvidia.com> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > On Wed, 27 Sep 2023 10:53:36 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > of those anyway? :) > > > > That's exactly what I mean when I say CXL will need it too. I keep > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > imagine draping CXL around it. CXL doesn't solve the problem that > > motivates all this hackying - Linux can't dynamically create NUMA > > nodes. > > Why is that and why aren't we pushing towards a solution of removing that > barrier so that we don't require the machine topology to be configured to > support this use case and guest OS limitations? Thanks, > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, there is additional information FW knows that the kernel doesn't. For example, what the distance/latency between CPU and the device NUMA node is. While CXL devices present a CDAT table which gives latency type attributes within the device, there would still be some guesswork needed in the kernel as to what the end to end latency/distance is. It's probably not the best outcome to just consider this generically far memory" because is it further than Intersocket memory access or not matters. Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has no idea if this latency/distance is less than or more than inter socket memory access latency even. So specially for devices present at boot, FW knows this information and should provide it. Similarly, QEMU should pass along this information to VMs for the best outcomes. Thanks > Alex
On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, September 27, 2023 9:25 AM > > To: Jason Gunthorpe <jgg@nvidia.com> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > > > > On Wed, 27 Sep 2023 10:53:36 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > > of those anyway? :) > > > > > > That's exactly what I mean when I say CXL will need it too. I keep > > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > > imagine draping CXL around it. CXL doesn't solve the problem that > > > motivates all this hackying - Linux can't dynamically create NUMA > > > nodes. > > > > Why is that and why aren't we pushing towards a solution of removing that > > barrier so that we don't require the machine topology to be configured to > > support this use case and guest OS limitations? Thanks, I haven't looked myself, but I've been told this is very hard. Probably the NUMA concept needs to be split a bit so that the memory allocator pool handle is not tied to the ACPI. > Even if Linux could create NUMA nodes dynamically for coherent CXL > or CXL type devices, there is additional information FW knows that > the kernel doesn't. For example, what the distance/latency between > CPU and the device NUMA node is. But that should just be the existing normal PCIy stuff to define affinity of the PCI function. Since the memory is logically linked to the PCI function we have no issue from a topology perspective. > While CXL devices present a CDAT table which gives latency type > attributes within the device, there would still be some guesswork > needed in the kernel as to what the end to end latency/distance is. Is it non-uniform per CXL function? > knows this information and should provide it. Similarly, QEMU > should pass along this information to VMs for the best outcomes. Well yes, ideally qemu would pass vCPU affinity for the vPCI functions so existing NUMA aware allocations in PCI drivers are optimized. eg we put queues in memory close to the PCI function already. I think it is important to keep seperated the need to know the topology/affinity/etc and the need for the driver to have a Linux NUMA node handle to operate the memory alocator pool APIs. Regardless, it is what it is for the foreseeable future :( Jason
On Wed, 27 Sep 2023 15:03:09 +0000 Vikram Sethi <vsethi@nvidia.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, September 27, 2023 9:25 AM > > To: Jason Gunthorpe <jgg@nvidia.com> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > > > > On Wed, 27 Sep 2023 10:53:36 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > > of those anyway? :) > > > > > > That's exactly what I mean when I say CXL will need it too. I keep > > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > > imagine draping CXL around it. CXL doesn't solve the problem that > > > motivates all this hackying - Linux can't dynamically create NUMA > > > nodes. > > > > Why is that and why aren't we pushing towards a solution of removing that > > barrier so that we don't require the machine topology to be configured to > > support this use case and guest OS limitations? Thanks, > > > > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, > there is additional information FW knows that the kernel doesn't. For example, > what the distance/latency between CPU and the device NUMA node is. While CXL devices > present a CDAT table which gives latency type attributes within the device, there would still be some > guesswork needed in the kernel as to what the end to end latency/distance is. > It's probably not the best outcome to just consider this generically far memory" because > is it further than Intersocket memory access or not matters. > Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has > no idea if this latency/distance is less than or more than inter socket memory access latency > even. > So specially for devices present at boot, FW knows this information and should provide it. > Similarly, QEMU should pass along this information to VMs for the best outcomes. Yeah, AFAICT we're not doing any of that in this series. We're only creating some number of nodes for the guest driver to make use of and describing in the generated _DSD the set of nodes associated for use by the device. How many nodes are required, how the guest driver partitions coherent memory among the nodes, and how the guest assigns a distance to the nodes is unspecified. A glance at the CDAT spec seems brilliant in this regard, is there such a table for this device or could/should the vfio-pci variant driver provide one? I imagine this is how the VM admin or orchestration tool would learn to configure nodes and the VMM would further virtualize this table for the guest OS and drivers. Thanks, Alex
On Wed, 27 Sep 2023 15:03:09 +0000 Vikram Sethi <vsethi@nvidia.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, September 27, 2023 9:25 AM > > To: Jason Gunthorpe <jgg@nvidia.com> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > > > > On Wed, 27 Sep 2023 10:53:36 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > > of those anyway? :) > > > > > > That's exactly what I mean when I say CXL will need it too. I keep > > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > > imagine draping CXL around it. CXL doesn't solve the problem that > > > motivates all this hackying - Linux can't dynamically create NUMA > > > nodes. > > > > Why is that and why aren't we pushing towards a solution of removing that > > barrier so that we don't require the machine topology to be configured to > > support this use case and guest OS limitations? Thanks, > > > > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, > there is additional information FW knows that the kernel doesn't. For example, > what the distance/latency between CPU and the device NUMA node is. While CXL devices > present a CDAT table which gives latency type attributes within the device, there would still be some > guesswork needed in the kernel as to what the end to end latency/distance is. FWIW Shouldn't be guess work needed (for light load case anyway which is what would be in HMAT). That's what the Generic Ports were added to SRAT for. Dave Jiang has a patch set https://lore.kernel.org/all/168695160531.3031571.4875512229068707023.stgit@djiang5-mobl3/ to do the maths... For CXL there is no problem fully describing the access characteristics. > It's probably not the best outcome to just consider this generically far memory" because > is it further than Intersocket memory access or not matters. > Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has > no idea if this latency/distance is less than or more than inter socket memory access latency > even. Just because I'm feeling cheeky - you could emulate a DOE and CDAT :)? Though I suppose you don't want to teach the guest driver about it. > So specially for devices present at boot, FW knows this information and should provide it. > Similarly, QEMU should pass along this information to VMs for the best outcomes. No problem with the argument that FW has the info and should provide it, just on the 'how' part. Jonathan > > Thanks > > Alex >
On Wed, 27 Sep 2023 12:42:40 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote: > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, September 27, 2023 9:25 AM > > > To: Jason Gunthorpe <jgg@nvidia.com> > > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > > > > > > > On Wed, 27 Sep 2023 10:53:36 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > > > of those anyway? :) > > > > > > > > That's exactly what I mean when I say CXL will need it too. I keep > > > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > > > imagine draping CXL around it. CXL doesn't solve the problem that > > > > motivates all this hackying - Linux can't dynamically create NUMA > > > > nodes. > > > > > > Why is that and why aren't we pushing towards a solution of removing that > > > barrier so that we don't require the machine topology to be configured to > > > support this use case and guest OS limitations? Thanks, > > I haven't looked myself, but I've been told this is very > hard. Probably the NUMA concept needs to be split a bit so that the > memory allocator pool handle is not tied to the ACPI. > > > Even if Linux could create NUMA nodes dynamically for coherent CXL > > or CXL type devices, there is additional information FW knows that > > the kernel doesn't. For example, what the distance/latency between > > CPU and the device NUMA node is. > > But that should just be the existing normal PCIy stuff to define > affinity of the PCI function. Since the memory is logically linked to > the PCI function we have no issue from a topology perspective. Agreed - if there is a node to associate it with, this is easy to cover. If aim is just to make the info available, could just Generic Initiator nodes instead. Those were added for memory free accelerators when the bios was describing them to fill the gap for things that aren't processors or memory but for which bandwidth and latency numbers were useful. This falls into your comment below on the need not being the topology as such, but rather the number needed for the memory allocator. > > > While CXL devices present a CDAT table which gives latency type > > attributes within the device, there would still be some guesswork > > needed in the kernel as to what the end to end latency/distance is. > > Is it non-uniform per CXL function? Could be, but if so it is all discoverable. Though mapping is to device PA range rather than 'function' as such. You are certainly allowed lots of different PA ranges with different characteristics. > > > knows this information and should provide it. Similarly, QEMU > > should pass along this information to VMs for the best outcomes. 
>
> Well yes, ideally qemu would pass vCPU affinity for the vPCI functions
> so existing NUMA aware allocations in PCI drivers are optimized. eg we
> put queues in memory close to the PCI function already.

I thought we did that via PXBs?

>
> I think it is important to keep seperated the need to know the
> topology/affinity/etc and the need for the driver to have a Linux NUMA
> node handle to operate the memory alocator pool APIs.
>
> Regardless, it is what it is for the foreseeable future :(

:(

>
> Jason
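[For illustration: expressing vPCI-function-to-node affinity via a PCI expander bridge, as referenced above. A minimal sketch assuming a host device at 0009:01:00.0; bus numbers, IDs and the node layout are arbitrary.]

-numa node,nodeid=1 \
-device pxb-pcie,id=pxb1,bus_nr=2,numa_node=1,bus=pcie.0 \
-device pcie-root-port,id=rp1,bus=pxb1,chassis=1 \
-device vfio-pci,host=0009:01:00.0,bus=rp1 \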
On Wed, 27 Sep 2023 10:37:37 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 27 Sep 2023 15:03:09 +0000 > Vikram Sethi <vsethi@nvidia.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, September 27, 2023 9:25 AM > > > To: Jason Gunthorpe <jgg@nvidia.com> > > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > > > > > > > On Wed, 27 Sep 2023 10:53:36 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > > > of those anyway? :) > > > > > > > > That's exactly what I mean when I say CXL will need it too. I keep > > > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > > > imagine draping CXL around it. CXL doesn't solve the problem that > > > > motivates all this hackying - Linux can't dynamically create NUMA > > > > nodes. > > > > > > Why is that and why aren't we pushing towards a solution of removing that > > > barrier so that we don't require the machine topology to be configured to > > > support this use case and guest OS limitations? Thanks, > > > > > > > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, > > there is additional information FW knows that the kernel doesn't. For example, > > what the distance/latency between CPU and the device NUMA node is. While CXL devices > > present a CDAT table which gives latency type attributes within the device, there would still be some > > guesswork needed in the kernel as to what the end to end latency/distance is. > > It's probably not the best outcome to just consider this generically far memory" because > > is it further than Intersocket memory access or not matters. > > Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has > > no idea if this latency/distance is less than or more than inter socket memory access latency > > even. > > So specially for devices present at boot, FW knows this information and should provide it. > > Similarly, QEMU should pass along this information to VMs for the best outcomes. > > Yeah, AFAICT we're not doing any of that in this series. We're only > creating some number of nodes for the guest driver to make use of and > describing in the generated _DSD the set of nodes associated for use by > the device. How many nodes are required, how the guest driver > partitions coherent memory among the nodes, and how the guest assigns a > distance to the nodes is unspecified. > > A glance at the CDAT spec seems brilliant in this regard, is there such > a table for this device or could/should the vfio-pci variant driver > provide one? 
> I imagine this is how the VM admin or orchestration tool
> would learn to configure nodes and the VMM would further virtualize
> this table for the guest OS and drivers. Thanks,

I like this proposal. The host bit seems doable. However, I'm not sure
virtualizing it would help for device drivers that have no awareness,
unless we slap some generic layer in the driver core in Linux that can
use this if available? It would certainly be easy enough to add
emulation into the guest's PCI function config space (as long as we
aren't out of room!)

For CXL I think we will have to do this anyway. For a complex topology
we could either:

1) Present a matching topology in the guest with all the components
   involved, including switches with the right numbers of ports etc, so
   the CDAT tables can be read directly from the underlying hardware,
   or...
2) Present a somewhat matching topology in the guest, having just messed
   with switches so that we can present their existence but not all the
   ports etc. So emulate a cut down CDAT for the switch.
3) Allow flattening everything and present a single CDAT with aggregate
   numbers at the EP. That will get a bit messy as we may need the
   kernel to ignore some parts it would normally account for (link
   widths on CXL buses etc.)

So I'm thinking 1 is too restrictive, hence the need to emulate the
table anyway (not to mention it might not be on the function being
assigned anyway, as they are normally only on Function 0 IIRC).

Gah. Too much CXL emulation still to do in QEMU already....

Jonathan

>
> Alex
>
From: Ankit Agrawal <ankita@nvidia.com>

For devices which allow the CPU to cache-coherently access their memory,
it is sensible to expose such memory as NUMA nodes separate from the
sysmem node. QEMU currently does not provide a mechanism for creation
of NUMA nodes associated with a vfio-pci device.

Implement a mechanism to create and associate a set of unique NUMA nodes
with a vfio-pci device.

A NUMA node is created by inserting a series of unique proximity domains
(PXM) in the VM SRAT ACPI table. The ACPI tables are read once at boot
time by the kernel to determine the NUMA configuration and are inflexible
after that. Hence this feature is incompatible with device hotplug. The
added node range associated with the device is communicated through the
ACPI _DSD and can be fetched by the VM kernel or kernel modules. QEMU's
VM SRAT and DSD builder code is modified accordingly.

New command line params are introduced for the admin to have control over
the NUMA node assignment.

A vfio-pci driver is expected to expose this feature through sysfs.
Presence of the feature is checked to enable these code changes.

Applied over v8.1.0-rc4.

Ankit Agrawal (4):
  vfio: new command line params for device memory NUMA nodes
  vfio: assign default values to node params
  hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  acpi/gpex: patch guest DSDT for dev mem information

 hw/arm/virt-acpi-build.c    |  54 +++++++++++++
 hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
 hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h               |   2 +
 include/hw/pci/pci_device.h |   3 +
 5 files changed, 274 insertions(+)