| Message ID | 20250117152334.2786-3-ankita@nvidia.com (mailing list archive) |
|---|---|
| State | New |
| Series | vfio/nvgrace-gpu: Enable grace blackwell boards |
On Fri, 17 Jan 2025 15:23:33 +0000 <ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> There is a HW defect on Grace Hopper (GH) to support the
> Multi-Instance GPU (MIG) feature [1] that necessitated the presence
> of a 1G region carved out from the device memory and mapped as
> uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3)
> to work around the issue.
>
> The Grace Blackwell systems (GB) differ from GH systems in the following
> aspects:
> 1. The aforementioned HW defect is fixed on GB systems.
> 2. There is a usable BAR1 (region 2 and 3) on GB systems for the
>    GPUdirect RDMA feature [2].
>
> This patch accommodates those GB changes by showing the 64b physical
> device BAR1 (region 2 and 3) to the VM instead of the fake one. This
> takes care of both the differences.
>
> Moreover, the entire device memory is exposed on GB as cacheable to
> the VM as there is no carveout required.
>
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
> Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 65 ++++++++++++++++++-----------
>  1 file changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 85eacafaffdf..89d38e3c0261 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -17,9 +17,6 @@
>  #define RESMEM_REGION_INDEX VFIO_PCI_BAR2_REGION_INDEX
>  #define USEMEM_REGION_INDEX VFIO_PCI_BAR4_REGION_INDEX
>
> -/* Memory size expected as non cached and reserved by the VM driver */
> -#define RESMEM_SIZE SZ_1G
> -
>  /* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
>  #define MEMBLK_SIZE SZ_512M
>
> @@ -72,7 +69,7 @@ nvgrace_gpu_memregion(int index,
>  	if (index == USEMEM_REGION_INDEX)
>  		return &nvdev->usemem;
>
> -	if (index == RESMEM_REGION_INDEX)
> +	if (nvdev->resmem.memlength && index == RESMEM_REGION_INDEX)
>  		return &nvdev->resmem;
>
>  	return NULL;
> @@ -757,21 +754,31 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
>  			      u64 memphys, u64 memlength)
>  {
>  	int ret = 0;
> +	u64 resmem_size = 0;
>
>  	/*
> -	 * The VM GPU device driver needs a non-cacheable region to support
> -	 * the MIG feature. Since the device memory is mapped as NORMAL cached,
> -	 * carve out a region from the end with a different NORMAL_NC
> -	 * property (called as reserved memory and represented as resmem). This
> -	 * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
> -	 * exposing the rest (termed as usable memory and represented using usemem)
> -	 * as cacheable 64b BAR (region 4 and 5).
> +	 * On Grace Hopper systems, the VM GPU device driver needs a non-cacheable
> +	 * region to support the MIG feature owing to a hardware bug. Since the
> +	 * device memory is mapped as NORMAL cached, carve out a region from the end
> +	 * with a different NORMAL_NC property (called as reserved memory and
> +	 * represented as resmem). This region then is exposed as a 64b BAR
> +	 * (region 2 and 3) to the VM, while exposing the rest (termed as usable
> +	 * memory and represented using usemem) as cacheable 64b BAR (region 4 and 5).
>  	 *
>  	 *               devmem (memlength)
>  	 * |-------------------------------------------------|
>  	 * |                                                 |
>  	 * usemem.memphys                        resmem.memphys
> +	 *
> +	 * This hardware bug is fixed on the Grace Blackwell platforms and the
> +	 * presence of fix can be determined through nvdev->has_mig_hw_bug_fix.
> +	 * Thus on systems with the hardware fix, there is no need to partition
> +	 * the GPU device memory and the entire memory is usable and mapped as
> +	 * NORMAL cached.
>  	 */
> +	if (!nvdev->has_mig_hw_bug_fix)
> +		resmem_size = SZ_1G;
> +
>  	nvdev->usemem.memphys = memphys;
>
>  	/*
> @@ -780,23 +787,30 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
>  	 * memory (usemem) is added to the kernel for usage by the VM
>  	 * workloads. Make the usable memory size memblock aligned.
>  	 */
> -	if (check_sub_overflow(memlength, RESMEM_SIZE,
> +	if (check_sub_overflow(memlength, resmem_size,
>  			       &nvdev->usemem.memlength)) {
>  		ret = -EOVERFLOW;
>  		goto done;
>  	}
>
> -	/*
> -	 * The USEMEM part of the device memory has to be MEMBLK_SIZE
> -	 * aligned. This is a hardwired ABI value between the GPU FW and
> -	 * VFIO driver. The VM device driver is also aware of it and make
> -	 * use of the value for its calculation to determine USEMEM size.
> -	 */
> -	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
> -					     MEMBLK_SIZE);
> -	if (nvdev->usemem.memlength == 0) {
> -		ret = -EINVAL;
> -		goto done;
> +	if (!nvdev->has_mig_hw_bug_fix) {
> +		/*
> +		 * If the device memory is split to workaround the MIG bug,
> +		 * the USEMEM part of the device memory has to be MEMBLK_SIZE
> +		 * aligned. This is a hardwired ABI value between the GPU FW and
> +		 * VFIO driver. The VM device driver is also aware of it and make
> +		 * use of the value for its calculation to determine USEMEM size.
> +		 *
> +		 * If the hardware has the fix for MIG, there is no requirement
> +		 * for splitting the device memory to create RESMEM. The entire
> +		 * device memory is usable and will be USEMEM.
> +		 */
> +		nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
> +						     MEMBLK_SIZE);
> +		if (nvdev->usemem.memlength == 0) {
> +			ret = -EINVAL;
> +			goto done;
> +		}

Why does this operation need to be predicated on the buggy device?
Does GB have memory that's not a multiple of 512MB? I was expecting
this would be a no-op on GB and therefore wouldn't need to be
conditional.

Thanks,
Alex

> +	}
>
>  	if ((check_add_overflow(nvdev->usemem.memphys,
> @@ -813,7 +827,10 @@
>  	 * the BAR size for them.
>  	 */
>  	nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
> -	nvdev->resmem.bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
> +
> +	if (nvdev->resmem.memlength)
> +		nvdev->resmem.bar_size =
> +			roundup_pow_of_two(nvdev->resmem.memlength);
>  done:
>  	return ret;
>  }
>> +	if (!nvdev->has_mig_hw_bug_fix) {
>> +		/*
>> +		 * If the device memory is split to workaround the MIG bug,
>> +		 * the USEMEM part of the device memory has to be MEMBLK_SIZE
>> +		 * aligned. This is a hardwired ABI value between the GPU FW and
>> +		 * VFIO driver. The VM device driver is also aware of it and make
>> +		 * use of the value for its calculation to determine USEMEM size.
>> +		 *
>> +		 * If the hardware has the fix for MIG, there is no requirement
>> +		 * for splitting the device memory to create RESMEM. The entire
>> +		 * device memory is usable and will be USEMEM.
>> +		 */
>> +		nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
>> +						     MEMBLK_SIZE);
>> +		if (nvdev->usemem.memlength == 0) {
>> +			ret = -EINVAL;
>> +			goto done;
>> +		}
>
> Why does this operation need to be predicated on the buggy device?
> Does GB have memory that's not a multiple of 512MB? I was expecting
> this would be a no-op on GB and therefore wouldn't need to be
> conditional.
>
> Thanks,
>
> Alex

Thanks Alex, yeah the device memory size is not necessarily 512M aligned.