Message ID | 20231225045603.7654-1-ankita@nvidia.com (mailing list archive) |
---|---|
Series | acpi: report numa nodes for device memory using GI |
On Mon, 25 Dec 2023 10:26:01 +0530 <ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> There are upcoming devices which allow the CPU to cache-coherently access
> their memory. It is sensible to expose such memory as NUMA nodes separate
> from the sysmem node to the OS. The ACPI spec provides a scheme in SRAT
> called Generic Initiator Affinity Structure [1] to allow an association
> between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
> heterogeneous processors and accelerators, GPUs, and I/O devices with
> integrated compute or DMA engines).
>
> While a single node per device may cover several use cases, it is however
> insufficient for a full utilization of the NVIDIA GPU MIG
> (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> GPU device resources (including device memory) into several (up to 8)
> isolated instances. Each memory partition requires a dedicated NUMA
> node to operate. The partitions are not fixed and they can be created/deleted
> at runtime.
>
> Linux does not provide a means to dynamically create/destroy NUMA nodes
> and implementing such a feature is expected to be non-trivial. The nodes
> that the OS discovers at boot time while parsing SRAT remain fixed. So we
> utilize the GI Affinity structures, which allow association between nodes
> and devices. Multiple GI structures per device/BDF are possible, allowing
> creation of multiple nodes in the VM by exposing a unique PXM in each of these
> structures.
>
> Implement the mechanism to build the GI affinity structures as QEMU currently
> does not. Introduce a new acpi-generic-initiator object that allows an
> association of a set of nodes with a device. During SRAT creation, all such
> objects are identified and used to add the GI Affinity Structures. Currently,
> only PCI devices are supported. On a multi-device system, each device supporting
> the feature needs a unique acpi-generic-initiator object with its own set of
> NUMA nodes associated with it.
>
> The admin will create a range of 8 nodes and associate that with the device
> using the acpi-generic-initiator object. While a configuration of fewer than
> 8 nodes per device is allowed, such a configuration will prevent utilization of
> the feature to the fullest. This setting is applicable to all the Grace+Hopper
> systems. The following is an example of the QEMU command line arguments to
> create 8 nodes and link them to the device 'dev0':
>
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> -numa node,nodeid=8 -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
>

I'd find it helpful to see the resulting chunk of SRAT for these examples
(disassembled) in this cover letter and the patches (where there are more examples).

>>
>> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
>> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
>> -numa node,nodeid=8 -numa node,nodeid=9 \
>> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
>>
>
> I'd find it helpful to see the resulting chunk of SRAT for these examples
> (disassembled) in this cover letter and the patches (where there are more examples).

Ack. I'll document the resulting SRAT table as well.

On Thu, Jan 04, 2024 at 03:05:27AM +0000, Ankit Agrawal wrote:
>
> >>
> >> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> >> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> >> -numa node,nodeid=8 -numa node,nodeid=9 \
> >> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
> >>
> >
> > I'd find it helpful to see the resulting chunk of SRAT for these examples
> > (disassembled) in this cover letter and the patches (where there are more examples).
>
> Ack. I'll document the resulting SRAT table as well.

Still didn't happen so this is dropped for now.

>> >
>> > I'd find it helpful to see the resulting chunk of SRAT for these examples
>> > (disassembled) in this cover letter and the patches (where there are more examples).
>>
>> Ack. I'll document the resulting SRAT table as well.
>
> Still didn't happen so this is dropped for now.

Hi Michael, does this mean it is dropped from QEMU v9.0? FWIW, I'll post the
next version incorporating the feedback by next week.

From: Ankit Agrawal <ankita@nvidia.com>

There are upcoming devices which allow the CPU to cache-coherently access
their memory. It is sensible to expose such memory as NUMA nodes separate
from the sysmem node to the OS. The ACPI spec provides a scheme in SRAT
called Generic Initiator Affinity Structure [1] to allow an association
between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
heterogeneous processors and accelerators, GPUs, and I/O devices with
integrated compute or DMA engines).

While a single node per device may cover several use cases, it is however
insufficient for a full utilization of the NVIDIA GPU MIG
(Multi-Instance GPU) [2] feature. The feature allows partitioning of the
GPU device resources (including device memory) into several (up to 8)
isolated instances. Each memory partition requires a dedicated NUMA
node to operate. The partitions are not fixed and they can be created/deleted
at runtime.

Linux does not provide a means to dynamically create/destroy NUMA nodes
and implementing such a feature is expected to be non-trivial. The nodes
that the OS discovers at boot time while parsing SRAT remain fixed. So we
utilize the GI Affinity structures, which allow association between nodes
and devices. Multiple GI structures per device/BDF are possible, allowing
creation of multiple nodes in the VM by exposing a unique PXM in each of these
structures.

Implement the mechanism to build the GI affinity structures as QEMU currently
does not. Introduce a new acpi-generic-initiator object that allows an
association of a set of nodes with a device. During SRAT creation, all such
objects are identified and used to add the GI Affinity Structures. Currently,
only PCI devices are supported. On a multi-device system, each device supporting
the feature needs a unique acpi-generic-initiator object with its own set of
NUMA nodes associated with it.

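For illustration only (this is not code from the series): a minimal Python sketch of how one GI Affinity Structure per proximity domain could be laid out, following the 32-byte format in ACPI 6.3, Section 5.2.16.6 with a PCI device handle. The helper names and the example segment/BDF values are hypothetical. The BDF is treated as an opaque 16-bit value here; its exact bit layout should be checked against the spec.

```python
import struct

# Constants as read from ACPI 6.3, Section 5.2.16.6.
GI_TYPE = 5            # SRAT subtable type: Generic Initiator Affinity
GI_LENGTH = 32         # fixed structure length in bytes
HANDLE_TYPE_PCI = 1    # device handle type: PCI
FLAG_ENABLED = 1       # flags bit 0: entry enabled


def build_gi_affinity(pxm, segment, bdf):
    """Pack one GI Affinity Structure for a PCI device (hypothetical helper)."""
    # PCI device handle: segment (2 bytes), BDF (2 bytes), 12 reserved bytes.
    handle = struct.pack("<HH12x", segment, bdf)
    return struct.pack(
        "<BBBBI16sII",
        GI_TYPE,
        GI_LENGTH,
        0,                 # reserved
        HANDLE_TYPE_PCI,
        pxm,               # proximity domain (NUMA node)
        handle,
        FLAG_ENABLED,
        0,                 # reserved
    )


def build_gi_entries(nodes, segment, bdf):
    """One structure per node, all pointing at the same device."""
    return [build_gi_affinity(n, segment, bdf) for n in nodes]


if __name__ == "__main__":
    # Example values only: nodes 2..9, segment 0, an arbitrary BDF value.
    for entry in build_gi_entries(range(2, 10), segment=0, bdf=0x0020):
        print(entry.hex())
```

Each entry differs only in its proximity domain, which is how a single device/BDF ends up associated with several nodes.
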
The admin will create a range of 8 nodes and associate that with the device
using the acpi-generic-initiator object. While a configuration of fewer than
8 nodes per device is allowed, such a configuration will prevent utilization of
the feature to the fullest. This setting is applicable to all the Grace+Hopper
systems. The following is an example of the QEMU command line arguments to
create 8 nodes and link them to the device 'dev0':

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \

The performance benefits can be realized by providing the NUMA node distances
appropriately (through libvirt tags or QEMU params). The admin can get the
distance among nodes in hardware using `numactl -H`.

This series goes along with the vfio-pci variant driver [3] under review.

Applied over v8.2.0-rc4.

[1] ACPI Spec 6.3, Section 5.2.16.6
[2] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
[3] https://lore.kernel.org/all/20231212184613.3237-1-ankita@nvidia.com/

Link for v5:
https://lore.kernel.org/all/20231203060245.31593-1-ankita@nvidia.com/

v5 -> v6
- Updated commit message for the [1/2] and the cover letter.
- Updated the acpi-generic-initiator object comment description for
  clarity on the input host-nodes.
- Rebased to v8.2.0-rc4.

v4 -> v5
- Removed acpi-dev option until full support.
- The numa nodes are saved as bitmap instead of uint16List.
- Replaced asserts with exit calls.
- Addressed other miscellaneous comments.

v3 -> v4
- changed the ':' delimited way to a uint16 array to communicate the
  nodes associated with the device.
- added asserts to handle invalid inputs.
- addressed other miscellaneous v3 comments.

v2 -> v3
- changed param to accept a ':' delimited list of numa nodes, instead
  of a range.
- Removed nvidia-acpi-generic-initiator object.
- Addressed miscellaneous comments in v2.

v1 -> v2
- Removed dependency on sysfs to communicate the feature with variant module.
- Use GI Affinity SRAT structure instead of Memory Affinity.
- No DSDT entries needed to communicate the PXM for the device. SRAT GI
  structure is used instead.
- New objects introduced to establish link between device and nodes.

Ankit Agrawal (2):
  qom: new object to associate device to numa node
  hw/acpi: Implement the SRAT GI affinity structure

 hw/acpi/acpi-generic-initiator.c         | 169 +++++++++++++++++++++++
 hw/acpi/meson.build                      |   1 +
 hw/arm/virt-acpi-build.c                 |   3 +
 include/hw/acpi/acpi-generic-initiator.h |  53 +++++++
 qapi/qom.json                            |  17 +++
 5 files changed, 243 insertions(+)
 create mode 100644 hw/acpi/acpi-generic-initiator.c
 create mode 100644 include/hw/acpi/acpi-generic-initiator.h
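
As a convenience for anyone scripting the setup, the command line example from the cover letter can be generated rather than typed out. The following is a minimal Python sketch, not part of the series: the function name and its parameters are hypothetical, and the option strings simply mirror the example above (including the `host-nodes=first-last` range form, which is assumed to be the accepted syntax for this version of the patch).

```python
def gi_qemu_args(dev_id, host_bdf, first_node, count=8, gi_id="gi0"):
    """Build the -numa/-device/-object arguments for one device.

    Hypothetical helper; the option strings are copied from the cover
    letter example and are not validated against QEMU.
    """
    if count < 1:
        raise ValueError("at least one node is required")
    nodes = range(first_node, first_node + count)

    args = []
    # One empty NUMA node per potential MIG memory partition.
    for node in nodes:
        args += ["-numa", f"node,nodeid={node}"]

    # The passed-through device, as in the cover letter example.
    args += [
        "-device",
        f"vfio-pci-nohotplug,host={host_bdf},bus=pcie.0,"
        f"addr=04.0,rombar=0,id={dev_id}",
    ]

    # Link the whole node range to the device.
    args += [
        "-object",
        f"acpi-generic-initiator,id={gi_id},pci-dev={dev_id},"
        f"host-nodes={nodes[0]}-{nodes[-1]}",
    ]
    return args


if __name__ == "__main__":
    print(" ".join(gi_qemu_args("dev0", "0009:01:00.0", first_node=2)))
```

On a multi-device system the helper would be called once per device, with a distinct `gi_id`, `dev_id`, and a non-overlapping node range for each.
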