Message ID: 20250124183102.3976-1-ankita@nvidia.com (mailing list archive)
Series: vfio/nvgrace-gpu: Enable grace blackwell boards
On Fri, 24 Jan 2025 18:30:58 +0000 <ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's recently introduced Grace Blackwell (GB) Superchip, a
> continuation of the Grace Hopper (GH) Superchip, provides the CPU and
> GPU with cache-coherent access to each other's memory over an internal
> proprietary chip-to-chip (C2C) cache-coherent interconnect. The in-tree
> nvgrace-gpu driver manages the GH devices. The intention is to extend
> that support to the new Grace Blackwell boards.
>
> There is a HW defect on GH affecting the Multi-Instance GPU (MIG)
> feature [1] that necessitated a 1G region carved out from the device
> memory and mapped uncached. The 1G region is exposed as a fake BAR
> (comprising regions 2 and 3) to work around the issue.
>
> The GB systems differ from GH systems in the following aspects:
> 1. The aforementioned HW defect is fixed on GB systems.
> 2. There is a usable BAR1 (regions 2 and 3) on GB systems for the
>    GPUDirect RDMA feature [2].
>
> This patch series accommodates those GB changes by showing the real
> physical device BAR1 (regions 2 and 3) to the VM instead of the fake
> one. This takes care of both the differences.
>
> The presence of the fix for the HW defect is communicated by the
> firmware through a DVSEC PCI config register. The module reads this to
> take a different codepath on GB vs GH.
>
> To improve system bootup time, HBM training is moved out of UEFI on GB
> systems. Poll the register indicating the training state, and also
> check whether the C2C link status reports it as ready. Fail the probe
> if either check fails.
>
> Applied over next-20241220 and the required KVM patch (under review
> on the mailing list) to map the GPU device memory as cacheable [3].
> Tested on the Grace Blackwell platform by booting up a VM, loading the
> NVIDIA module [4] and running nvidia-smi in the VM.
>
> To run CUDA workloads, there is a dependency on the IOMMUFD and the
> Nested Page Table patches being worked on separately by Nicolin Chen
> (nicolinc@nvidia.com). NVIDIA has provided git repositories which
> include all the requisite kernel [5] and QEMU [6] patches in case one
> wants to try.
>
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
> Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
> Link: https://lore.kernel.org/all/20241118131958.4609-2-ankita@nvidia.com/ [3]
> Link: https://github.com/NVIDIA/open-gpu-kernel-modules [4]
> Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [5]
> Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [6]
>
> v5 -> v6

LGTM. I'll give others who have reviewed this a short opportunity to
take a final look. We're already in the merge window but I think we're
just wrapping up some loose ends and I don't see any benefit to holding
it back, so pending comments from others, I'll plan to include it early
next week. Thanks,

Alex

> * Updated the code based on Alex Williamson's suggestion to move the
>   device id enablement to a new patch and to use KBUILD_MODNAME
>   in place of "vfio-pci"
>
> v4 -> v5
> * Added code to enable the BAR0 region as per Alex Williamson's suggestion.
> * Updated code based on Kevin Tian's suggestion to replace the variable
>   with one representing the presence of the MIG bug. Also reorganized the
>   code to return early for Blackwell without any resmem processing.
> * Code comment updates.
>
> v3 -> v4
> * Added code to enable and restore device memory regions before reading
>   BAR0 registers as per Alex Williamson's suggestion.
>
> v2 -> v3
> * Incorporated Alex Williamson's suggestion to simplify patch 2/3.
> * Updated the code in 3/3 to use time_after() and other miscellaneous
>   suggestions from Alex Williamson.
>
> v1 -> v2
> * Rebased to next-20241220.
>
> v5:
> Link: https://lore.kernel.org/all/20250123174854.3338-1-ankita@nvidia.com/
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>
> Ankit Agrawal (4):
>   vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
>     resmem
>   vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
>   vfio/nvgrace-gpu: Check the HBM training and C2C link status
>   vfio/nvgrace-gpu: Add GB200 SKU to the devid table
>
>  drivers/vfio/pci/nvgrace-gpu/main.c | 169 ++++++++++++++++++++++++----
>  1 file changed, 147 insertions(+), 22 deletions(-)
>
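For readers unfamiliar with how a driver can discover a firmware-advertised
fix of the kind described in the cover letter, the following is a minimal
sketch of reading an NVIDIA vendor DVSEC capability from PCI config space.
It is not the code from this series: the DVSEC ID, register offset and bit
are hypothetical placeholders.

#include <linux/bits.h>
#include <linux/pci.h>

#define NVGRACE_EXAMPLE_DVSEC_ID        0x1     /* placeholder DVSEC ID   */
#define NVGRACE_EXAMPLE_DVSEC_REG       0x0c    /* placeholder reg offset */
#define NVGRACE_EXAMPLE_MIG_FIXED       BIT(0)  /* placeholder "fixed" bit */

static bool nvgrace_example_has_mig_fix(struct pci_dev *pdev)
{
        u16 dvsec;
        u32 val;

        /* Locate the NVIDIA vendor-specific extended capability, if any. */
        dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
                                          NVGRACE_EXAMPLE_DVSEC_ID);
        if (!dvsec)
                return false;   /* Treat a missing DVSEC as "defect present". */

        if (pci_read_config_dword(pdev, dvsec + NVGRACE_EXAMPLE_DVSEC_REG, &val))
                return false;

        /*
         * Fix present: skip the 1G uncached resmem carve-out and expose
         * the real BAR1 (regions 2/3) instead of the fake BAR.
         */
        return val & NVGRACE_EXAMPLE_MIG_FIXED;
}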
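The v4 -> v5 note about enabling the BAR0 region before reading its
registers could look roughly like the sketch below: enable PCI memory
decode, read the register through a temporary mapping, then restore the
original command register value. Again, the helper name and the whole-BAR
mapping are illustrative assumptions, not the series' implementation.

#include <linux/io.h>
#include <linux/pci.h>

static int nvgrace_example_read_bar0(struct pci_dev *pdev, u32 offset, u32 *val)
{
        void __iomem *regs;
        u16 cmd;
        int ret = 0;

        /* Remember the current command register and enable memory decode. */
        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
        if (!(cmd & PCI_COMMAND_MEMORY))
                pci_write_config_word(pdev, PCI_COMMAND,
                                      cmd | PCI_COMMAND_MEMORY);

        regs = pci_iomap(pdev, 0, 0);   /* map all of BAR0 */
        if (!regs) {
                ret = -ENOMEM;
                goto restore;
        }

        *val = readl(regs + offset);
        pci_iounmap(pdev, regs);

restore:
        /* Put the command register back the way we found it. */
        pci_write_config_word(pdev, PCI_COMMAND, cmd);
        return ret;
}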
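And the HBM-training / C2C-link check described in the cover letter amounts
to a bounded poll. Here is a sketch using time_after(), building on the
hypothetical read helper from the previous block; the register offset, the
"ready" value and the timeout are placeholders.

#include <linux/delay.h>
#include <linux/jiffies.h>

#define NVGRACE_EXAMPLE_STATUS_OFF      0x0     /* placeholder offset  */
#define NVGRACE_EXAMPLE_STATUS_READY    0x1     /* placeholder value   */
#define NVGRACE_EXAMPLE_TIMEOUT_MS      5000    /* placeholder timeout */

static int nvgrace_example_wait_ready(struct pci_dev *pdev)
{
        unsigned long deadline =
                jiffies + msecs_to_jiffies(NVGRACE_EXAMPLE_TIMEOUT_MS);
        u32 status;
        int ret;

        do {
                ret = nvgrace_example_read_bar0(pdev,
                                                NVGRACE_EXAMPLE_STATUS_OFF,
                                                &status);
                if (ret)
                        return ret;
                if (status == NVGRACE_EXAMPLE_STATUS_READY)
                        return 0;
                msleep(20);
        } while (!time_after(jiffies, deadline));

        /* Caller fails the probe when the device never becomes ready. */
        return -ETIMEDOUT;
}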
> On Jan 24, 2025, at 12:30 PM, Ankit Agrawal <ankita@nvidia.com> wrote:
>
> v5 -> v6
> * Updated the code based on Alex Williamson's suggestion to move the
>   device id enablement to a new patch and using KBUILD_MODNAME
>   in place of "vfio-pci"
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Tested series with Grace-Blackwell and Grace-Hopper.

Tested-by: Matthew R. Ochs <mochs@nvidia.com>
>> v5 -> v6
>> * Updated the code based on Alex Williamson's suggestion to move the
>>   device id enablement to a new patch and using KBUILD_MODNAME
>>   in place of "vfio-pci"
>>
>> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>>
>
> Tested series with Grace-Blackwell and Grace-Hopper.
>
> Tested-by: Matthew R. Ochs <mochs@nvidia.com>

Thank you so much, Matt!
>>
>> v5 -> v6
>
> LGTM. I'll give others who have reviewed this a short opportunity to
> take a final look. We're already in the merge window but I think we're
> just wrapping up some loose ends and I don't see any benefit to holding
> it back, so pending comments from others, I'll plan to include it early
> next week. Thanks,
>
> Alex

Thank you very much, Alex, for guiding this through!

- Ankit