diff mbox series

[v4,2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM

Message ID 20250117233704.3374-3-ankita@nvidia.com (mailing list archive)
State New
Series vfio/nvgrace-gpu: Enable grace blackwell boards

Commit Message

Ankit Agrawal Jan. 17, 2025, 11:37 p.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

There is a HW defect on Grace Hopper (GH) affecting support for the
Multi-Instance GPU (MIG) feature [1] that necessitated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3)
to work around the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical
device BAR1 (region 2 and 3) to the VM instead of the fake one. This
takes care of both differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 66 ++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 24 deletions(-)

Comments

Tian, Kevin Jan. 20, 2025, 7:29 a.m. UTC | #1
> From: ankita@nvidia.com <ankita@nvidia.com>
> Sent: Saturday, January 18, 2025 7:37 AM
> @@ -780,23 +787,31 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
>  	 * memory (usemem) is added to the kernel for usage by the VM
>  	 * workloads. Make the usable memory size memblock aligned.
>  	 */
> -	if (check_sub_overflow(memlength, RESMEM_SIZE,
> +	if (check_sub_overflow(memlength, resmem_size,
>  			       &nvdev->usemem.memlength)) {
>  		ret = -EOVERFLOW;
>  		goto done;
>  	}
> 
> -	/*
> -	 * The USEMEM part of the device memory has to be MEMBLK_SIZE
> -	 * aligned. This is a hardwired ABI value between the GPU FW and
> -	 * VFIO driver. The VM device driver is also aware of it and make
> -	 * use of the value for its calculation to determine USEMEM size.
> -	 */
> -	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
> -					     MEMBLK_SIZE);
> -	if (nvdev->usemem.memlength == 0) {
> -		ret = -EINVAL;
> -		goto done;
> +	if (!nvdev->has_mig_hw_bug_fix) {
> +		/*
> +		 * If the device memory is split to workaround the MIG bug,
> +		 * the USEMEM part of the device memory has to be MEMBLK_SIZE
> +		 * aligned. This is a hardwired ABI value between the GPU FW and
> +		 * VFIO driver. The VM device driver is also aware of it and make
> +		 * use of the value for its calculation to determine USEMEM size.
> +		 * Note that the device memory may not be 512M aligned.
> +		 *
> +		 * If the hardware has the fix for MIG, there is no requirement
> +		 * for splitting the device memory to create RESMEM. The entire
> +		 * device memory is usable and will be USEMEM.

Just double confirm. With the fix it's not required to have the usemem
512M aligned, or does hardware guarantee that usemem is always 
512M aligned?

And it's clearer to return early when the fix is there so the majority of
the existing code can be left intact instead of causing unnecessary
indent here.

Ankit Agrawal Jan. 20, 2025, 5:13 p.m. UTC | #2
>> +     if (!nvdev->has_mig_hw_bug_fix) {
>> +             /*
>> +              * If the device memory is split to workaround the MIG bug,
>> +              * the USEMEM part of the device memory has to be MEMBLK_SIZE
>> +              * aligned. This is a hardwired ABI value between the GPU FW and
>> +              * VFIO driver. The VM device driver is also aware of it and make
>> +              * use of the value for its calculation to determine USEMEM size.
>> +              * Note that the device memory may not be 512M aligned.
>> +              *
>> +              * If the hardware has the fix for MIG, there is no requirement
>> +              * for splitting the device memory to create RESMEM. The entire
>> +              * device memory is usable and will be USEMEM.
>
> Just double confirm. With the fix it's not required to have the usemem
> 512M aligned, or does hardware guarantee that usemem is always
> 512M aligned?

The first one: on devices without the MIG bug, the device memory
passed to the VM need not be 512M aligned. Such devices may still have
memory that is not 512M aligned.
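
To make the arithmetic concrete, here is a minimal userspace sketch of how
the usable memory length follows from the two cases above. The helper names
(usemem_length, round_down_pow2) and the local SZ_* macros are illustrative
stand-ins and not the driver's code; only the 1G carve-out and the 512M
alignment are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the kernel's size constants. */
#define SZ_512M (512ULL << 20)
#define SZ_1G   (1ULL << 30)

/* Round x down to a multiple of align; align must be a power of two. */
uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/*
 * Hypothetical helper: usable memory length for a device with
 * memlength bytes of device memory. Returns 0 when no usable
 * memory would remain.
 */
uint64_t usemem_length(uint64_t memlength, bool has_mig_hw_bug_fix)
{
	/* With the hardware fix nothing is carved out or rounded. */
	if (has_mig_hw_bug_fix)
		return memlength;

	/* Without the fix, 1G is reserved from the end ... */
	if (memlength < SZ_1G)
		return 0;

	/* ... and the remainder is rounded down to a 512M multiple. */
	return round_down_pow2(memlength - SZ_1G, SZ_512M);
}
```

For a hypothetical device with 96G + 256M of memory, this gives 95G of
usable memory when the carve-out applies, and the full 96G + 256M when it
does not.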

> And it's clearer to return early when the fix is there so the majority of
> the existing code can be left intact instead of causing unnecessary
> indent here.

I think that can be done. We calculate nvdev->usemem.bar_size down
the function, but I suppose that can be moved up before returning
early.
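
A rough userspace sketch of what the early-return shape could look like,
with the usemem BAR size computed before returning. This is not the code of
any later revision: the struct and function names (mem_region, nvdev_sketch,
init_regions, roundup_pow2) are hypothetical, the placement of resmem right
after usemem is an assumption read off the diagram in the patch comment, and
the address overflow checks the driver does with check_add_overflow() are
left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_512M (512ULL << 20)
#define SZ_1G   (1ULL << 30)

/* Hypothetical, simplified versions of the driver's structures. */
struct mem_region {
	uint64_t memphys;
	uint64_t memlength;
	uint64_t bar_size;
};

struct nvdev_sketch {
	bool has_mig_hw_bug_fix;
	struct mem_region usemem;
	struct mem_region resmem;
};

/* Smallest power of two >= x; caller guarantees 0 < x <= 2^63. */
uint64_t roundup_pow2(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

int init_regions(struct nvdev_sketch *nvdev, uint64_t memphys,
		 uint64_t memlength)
{
	if (memlength == 0 || memlength > (1ULL << 63))
		return -EINVAL;

	nvdev->usemem.memphys = memphys;
	nvdev->resmem = (struct mem_region){ 0 };

	if (nvdev->has_mig_hw_bug_fix) {
		/* No carve-out: everything is usemem, then return early. */
		nvdev->usemem.memlength = memlength;
		nvdev->usemem.bar_size = roundup_pow2(memlength);
		return 0;
	}

	/* Carve-out path: reserve 1G, align usemem down to 512M. */
	if (memlength < SZ_1G)
		return -EOVERFLOW;
	nvdev->usemem.memlength = (memlength - SZ_1G) & ~(SZ_512M - 1);
	if (nvdev->usemem.memlength == 0)
		return -EINVAL;

	/* Assumption: resmem is whatever remains after usemem. */
	nvdev->resmem.memphys = memphys + nvdev->usemem.memlength;
	nvdev->resmem.memlength = memlength - nvdev->usemem.memlength;

	nvdev->usemem.bar_size = roundup_pow2(nvdev->usemem.memlength);
	nvdev->resmem.bar_size = roundup_pow2(nvdev->resmem.memlength);
	return 0;
}
```

With this shape the carve-out path keeps its original indentation, and
resmem stays zeroed on hardware with the fix, so a check such as
nvdev->resmem.memlength != 0 is enough to tell whether the region exists.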

Patch

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 85eacafaffdf..e6fe5bc8940f 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -17,9 +17,6 @@ 
 #define RESMEM_REGION_INDEX VFIO_PCI_BAR2_REGION_INDEX
 #define USEMEM_REGION_INDEX VFIO_PCI_BAR4_REGION_INDEX
 
-/* Memory size expected as non cached and reserved by the VM driver */
-#define RESMEM_SIZE SZ_1G
-
 /* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
 #define MEMBLK_SIZE SZ_512M
 
@@ -72,7 +69,7 @@  nvgrace_gpu_memregion(int index,
 	if (index == USEMEM_REGION_INDEX)
 		return &nvdev->usemem;
 
-	if (index == RESMEM_REGION_INDEX)
+	if (nvdev->resmem.memlength && index == RESMEM_REGION_INDEX)
 		return &nvdev->resmem;
 
 	return NULL;
@@ -757,21 +754,31 @@  nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 			      u64 memphys, u64 memlength)
 {
 	int ret = 0;
+	u64 resmem_size = 0;
 
 	/*
-	 * The VM GPU device driver needs a non-cacheable region to support
-	 * the MIG feature. Since the device memory is mapped as NORMAL cached,
-	 * carve out a region from the end with a different NORMAL_NC
-	 * property (called as reserved memory and represented as resmem). This
-	 * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
-	 * exposing the rest (termed as usable memory and represented using usemem)
-	 * as cacheable 64b BAR (region 4 and 5).
+	 * On Grace Hopper systems, the VM GPU device driver needs a non-cacheable
+	 * region to support the MIG feature owing to a hardware bug. Since the
+	 * device memory is mapped as NORMAL cached, carve out a region from the end
+	 * with a different NORMAL_NC property (called as reserved memory and
+	 * represented as resmem). This region then is exposed as a 64b BAR
+	 * (region 2 and 3) to the VM, while exposing the rest (termed as usable
+	 * memory and represented using usemem) as cacheable 64b BAR (region 4 and 5).
 	 *
 	 *               devmem (memlength)
 	 * |-------------------------------------------------|
 	 * |                                           |
 	 * usemem.memphys                              resmem.memphys
+	 *
+	 * This hardware bug is fixed on the Grace Blackwell platforms and the
+	 * presence of fix can be determined through nvdev->has_mig_hw_bug_fix.
+	 * Thus on systems with the hardware fix, there is no need to partition
+	 * the GPU device memory and the entire memory is usable and mapped as
+	 * NORMAL cached.
 	 */
+	if (!nvdev->has_mig_hw_bug_fix)
+		resmem_size = SZ_1G;
+
 	nvdev->usemem.memphys = memphys;
 
 	/*
@@ -780,23 +787,31 @@  nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 	 * memory (usemem) is added to the kernel for usage by the VM
 	 * workloads. Make the usable memory size memblock aligned.
 	 */
-	if (check_sub_overflow(memlength, RESMEM_SIZE,
+	if (check_sub_overflow(memlength, resmem_size,
 			       &nvdev->usemem.memlength)) {
 		ret = -EOVERFLOW;
 		goto done;
 	}
 
-	/*
-	 * The USEMEM part of the device memory has to be MEMBLK_SIZE
-	 * aligned. This is a hardwired ABI value between the GPU FW and
-	 * VFIO driver. The VM device driver is also aware of it and make
-	 * use of the value for its calculation to determine USEMEM size.
-	 */
-	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
-					     MEMBLK_SIZE);
-	if (nvdev->usemem.memlength == 0) {
-		ret = -EINVAL;
-		goto done;
+	if (!nvdev->has_mig_hw_bug_fix) {
+		/*
+		 * If the device memory is split to workaround the MIG bug,
+		 * the USEMEM part of the device memory has to be MEMBLK_SIZE
+		 * aligned. This is a hardwired ABI value between the GPU FW and
+		 * VFIO driver. The VM device driver is also aware of it and make
+		 * use of the value for its calculation to determine USEMEM size.
+		 * Note that the device memory may not be 512M aligned.
+		 *
+		 * If the hardware has the fix for MIG, there is no requirement
+		 * for splitting the device memory to create RESMEM. The entire
+		 * device memory is usable and will be USEMEM.
+		 */
+		nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
+						     MEMBLK_SIZE);
+		if (nvdev->usemem.memlength == 0) {
+			ret = -EINVAL;
+			goto done;
+		}
 	}
 
 	if ((check_add_overflow(nvdev->usemem.memphys,
@@ -813,7 +828,10 @@  nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 	 * the BAR size for them.
 	 */
 	nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
-	nvdev->resmem.bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
+
+	if (nvdev->resmem.memlength)
+		nvdev->resmem.bar_size =
+			roundup_pow_of_two(nvdev->resmem.memlength);
 done:
 	return ret;
 }