
[v6,0/4] vfio/nvgrace-gpu: Enable grace blackwell boards

Message ID 20250124183102.3976-1-ankita@nvidia.com
Series vfio/nvgrace-gpu: Enable grace blackwell boards

Message

Ankit Agrawal Jan. 24, 2025, 6:30 p.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) Superchip. Both give the CPU and
GPU cache-coherent access to each other's memory over an internal
proprietary chip-to-chip (C2C) interconnect. The in-tree nvgrace-gpu
driver manages the GH devices; this series extends that support to the
new Grace Blackwell boards.

The GH hardware has a defect affecting the Multi-Instance GPU (MIG)
feature [1] that necessitated carving out 1G of device memory and
mapping it uncached. That 1G region is exposed as a fake BAR
(comprising regions 2 and 3) to work around the issue.

The GB systems differ from GH systems in the following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. GB systems have a usable BAR1 (regions 2 and 3) for the GPUDirect
RDMA feature [2].

This patch series accommodates those GB changes by exposing the real
physical device BAR1 (regions 2 and 3) to the VM instead of the fake
one, which takes care of both differences.
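
Conceptually, the change boils down to skipping the resmem/fake-BAR
setup when the defect is absent. The sketch below is for illustration
only; the helper and variable names are placeholders, not necessarily
the ones used in the driver.

/*
 * Illustrative sketch: when firmware reports the MIG HW defect as
 * fixed (Blackwell), skip the 1G resmem carve-out and fake-BAR
 * emulation and let the real BAR1 be passed through.
 * nvgrace_gpu_init_fake_bar() is a placeholder name.
 */
static int nvgrace_gpu_setup_device_mem(struct pci_dev *pdev,
                                        bool has_mig_hw_bug)
{
        if (!has_mig_hw_bug)
                return 0;       /* Blackwell: real BAR1 (regions 2 and 3) as-is */

        /* Hopper: carve out 1G uncached resmem and emulate the fake BAR. */
        return nvgrace_gpu_init_fake_bar(pdev);
}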

The firmware communicates the presence of the HW defect fix through a
DVSEC PCI config register. The module reads this register to take the
GB codepath rather than the GH one.
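
For illustration, a minimal sketch of how such a DVSEC lookup could
look. The DVSEC ID, register offset and bit below are placeholders
chosen for the example, not the values the driver actually uses.

#include <linux/bits.h>
#include <linux/pci.h>

/* Placeholder DVSEC ID, offset and bit, for illustration only. */
#define EXAMPLE_NVIDIA_DVSEC_ID         0x3
#define EXAMPLE_DVSEC_CAP_OFF           0xc
#define EXAMPLE_DVSEC_MIG_FIX_BIT       BIT(0)

static bool nvgrace_gpu_mig_fix_present(struct pci_dev *pdev)
{
        u16 pos;
        u32 val;

        /* Locate the vendor-specific DVSEC capability in config space. */
        pos = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
                                        EXAMPLE_NVIDIA_DVSEC_ID);
        if (!pos)
                return false;

        /* Read the capability dword and test the "defect fixed" bit. */
        pci_read_config_dword(pdev, pos + EXAMPLE_DVSEC_CAP_OFF, &val);
        return val & EXAMPLE_DVSEC_MIG_FIX_BIT;
}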

To improve system bootup time, HBM training is moved out of UEFI on GB
systems. The driver therefore polls the register indicating the
training state and also checks whether the C2C link is ready, failing
the probe if either check does not pass.
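
A rough sketch of the polling idea follows. The timeout, BAR0 offset
and "done" encoding are made-up placeholders; the real values come
from the device specification.

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/jiffies.h>

#define EXAMPLE_POLL_TIMEOUT_MS         5000
#define EXAMPLE_HBM_STATUS_OFF          0x0     /* placeholder BAR0 offset */
#define EXAMPLE_HBM_TRAINING_DONE       0x1     /* placeholder "done" value */

static int nvgrace_gpu_wait_hbm_training(void __iomem *bar0)
{
        unsigned long timeout = jiffies +
                                msecs_to_jiffies(EXAMPLE_POLL_TIMEOUT_MS);

        /* Poll the training-status register until it reports done. */
        do {
                if (readl(bar0 + EXAMPLE_HBM_STATUS_OFF) ==
                    EXAMPLE_HBM_TRAINING_DONE)
                        return 0;
                msleep(20);
        } while (!time_after(jiffies, timeout));

        /* One last read in case training completed right at the timeout. */
        if (readl(bar0 + EXAMPLE_HBM_STATUS_OFF) == EXAMPLE_HBM_TRAINING_DONE)
                return 0;

        return -ETIME;
}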

The series applies over next-20241220 and the required KVM patch
(under review on the mailing list) that maps the GPU device memory as
cacheable [3]. It was tested on the Grace Blackwell platform by booting
up a VM, loading the NVIDIA module [4] and running nvidia-smi in the
VM.

Running CUDA workloads additionally depends on the IOMMUFD and Nested
Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories that
include all the requisite kernel [5] and QEMU [6] patches in case one
wants to try.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
Link: https://lore.kernel.org/all/20241118131958.4609-2-ankita@nvidia.com/ [3]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [4]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [5]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [6]

v5 -> v6
* Updated the code based on Alex Williamson's suggestion to move the
  device ID enablement to a new patch and to use KBUILD_MODNAME
  in place of "vfio-pci".

v4 -> v5
* Added code to enable the BAR0 region as per Alex Williamson's suggestion.
* Updated the code based on Kevin Tian's suggestion to replace the
  variable with one representing the presence of the MIG HW bug. Also
  reorganized the code to return early for Blackwell without any
  resmem processing.
* Code comment updates.

v3 -> v4
* Added code to enable and restore device memory regions before reading
  BAR0 registers as per Alex Williamson's suggestion.

v2 -> v3
* Incorporated Alex Williamson's suggestion to simplify patch 2/3.
* Updated the code in 3/3 to use time_after() and other miscellaneous
  suggestions from Alex Williamson.

v1 -> v2
* Rebased to next-20241220.

v5:
Link: https://lore.kernel.org/all/20250123174854.3338-1-ankita@nvidia.com/

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Ankit Agrawal (4):
  vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
    resmem
  vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
  vfio/nvgrace-gpu: Check the HBM training and C2C link status
  vfio/nvgrace-gpu: Add GB200 SKU to the devid table

 drivers/vfio/pci/nvgrace-gpu/main.c | 169 ++++++++++++++++++++++++----
 1 file changed, 147 insertions(+), 22 deletions(-)

Comments

Alex Williamson Jan. 24, 2025, 10:05 p.m. UTC | #1
On Fri, 24 Jan 2025 18:30:58 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> [...]
>
> v5 -> v6

LGTM.  I'll give others who have reviewed this a short opportunity to
take a final look.  We're already in the merge window but I think we're
just wrapping up some loose ends and I don't see any benefit to holding
it back, so pending comments from others, I'll plan to include it early
next week.  Thanks,

Alex

Matthew R. Ochs Jan. 28, 2025, 2:03 a.m. UTC | #2
> On Jan 24, 2025, at 12:30 PM, Ankit Agrawal <ankita@nvidia.com> wrote:
> 
> v5 -> v6
> * Updated the code based on Alex Williamson's suggestion to move the
>  device id enablement to a new patch and using KBUILD_MODNAME
>  in place of "vfio-pci"
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> 

Tested series with Grace-Blackwell and Grace-Hopper.

Tested-by: Matthew R. Ochs <mochs@nvidia.com>
Ankit Agrawal Jan. 28, 2025, 5:11 a.m. UTC | #3
>> v5 -> v6
>> * Updated the code based on Alex Williamson's suggestion to move the
>>  device id enablement to a new patch and using KBUILD_MODNAME
>>  in place of "vfio-pci"
>>
>> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>>
>
> Tested series with Grace-Blackwell and Grace-Hopper.
> 
> Tested-by: Matthew R. Ochs <mochs@nvidia.com>

Thank you so much, Matt!
Ankit Agrawal Jan. 29, 2025, 2:18 a.m. UTC | #4
>>
>> v5 -> v6
>
> LGTM.  I'll give others who have reviewed this a short opportunity to
> take a final look.  We're already in the merge window but I think we're
> just wrapping up some loose ends and I don't see any benefit to holding
> it back, so pending comments from others, I'll plan to include it early
> next week.  Thanks,
> 
> Alex

Thank you very much, Alex, for guiding this through!

- Ankit