[v1,0/3] vfio/nvgrace-gpu: Enable grace blackwell boards

Message ID 20241006102722.3991-1-ankita@nvidia.com

Message

Ankit Agrawal Oct. 6, 2024, 10:27 a.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is the
successor to the Grace Hopper (GH) Superchip; like GH, it gives the
CPU and GPU cache-coherent access to each other's memory over an
internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
The in-tree nvgrace-gpu driver manages the GH devices. The intention
is to extend that support to the new Grace Blackwell boards.

There is a HW defect on GH affecting the Multi-Instance GPU (MIG)
feature [1], which necessitates carving out 1G from the device memory
and mapping it uncached. The 1G region is exposed as a fake BAR
(comprising regions 2 and 3) to work around the issue.
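
Roughly, the GH-time workaround looks like the sketch below. This is
illustrative only: the helper name and the assumption that the
carve-out sits at the top of device memory are placeholders, not the
driver's actual code.

#include <linux/io.h>
#include <linux/sizes.h>
#include <linux/types.h>

/*
 * Sketch only: map the 1G carve-out with uncached/write-combining
 * attributes so it can later be surfaced to the VM as the fake BAR
 * built from regions 2 and 3.
 */
static void __iomem *nvgrace_gpu_map_resmem_sketch(phys_addr_t devmem_end)
{
        phys_addr_t resmem_base = devmem_end - SZ_1G;   /* assumed layout */

        return ioremap_wc(resmem_base, SZ_1G);
}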

The GB systems differ from GH systems in the following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (regions 2 and 3) on GB systems for the
GPUDirect RDMA feature [2].

This patch series accommodates those GB changes by exposing the real
physical device BAR1 (regions 2 and 3) to the VM instead of the fake
one, which takes care of both differences.

The presence of the fix for the HW defect is communicated by the
firmware through a DVSEC PCI config register. The module reads this
register to take a different code path on GB vs GH.
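
A minimal sketch of that check follows; the DVSEC ID, register offset
and bit position are placeholder assumptions, not the values used in
the actual patch.

#include <linux/bits.h>
#include <linux/pci.h>

#define NVGRACE_DVSEC_ID_SKETCH         0x4     /* placeholder DVSEC ID */
#define NVGRACE_DVSEC_FIX_REG_SKETCH    0x08    /* placeholder offset */
#define NVGRACE_MIG_FIX_PRESENT         BIT(0)  /* placeholder bit */

/* Returns true when firmware reports the MIG HW defect as fixed (GB). */
static bool nvgrace_gpu_has_mig_hw_fix(struct pci_dev *pdev)
{
        u16 pos;
        u32 val;

        pos = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
                                        NVGRACE_DVSEC_ID_SKETCH);
        if (!pos)
                return false;   /* GH: keep the fake uncached resmem BAR */

        pci_read_config_dword(pdev, pos + NVGRACE_DVSEC_FIX_REG_SKETCH, &val);

        /* GB: skip the carve-out and expose the physical BAR1 instead. */
        return val & NVGRACE_MIG_FIX_PRESENT;
}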

To improve system boot time, HBM training has been moved out of UEFI
on GB systems. The driver polls the register indicating the training
state and also checks whether the C2C link is ready, failing the probe
if either check fails.
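
A rough sketch of that probe-time check is below; register offsets,
expected values and the 30-second timeout are illustrative assumptions
only.

#include <linux/io.h>
#include <linux/iopoll.h>

#define HBM_TRAINING_STATUS_SKETCH      0x0     /* placeholder offset */
#define HBM_TRAINING_DONE_SKETCH        0xff    /* placeholder value */
#define C2C_LINK_STATUS_SKETCH          0x4     /* placeholder offset */
#define C2C_LINK_READY_SKETCH           0x1     /* placeholder value */

static int nvgrace_gpu_wait_device_ready(void __iomem *regs)
{
        u32 val;
        int ret;

        /* Wait for HBM training, now done outside UEFI, to complete. */
        ret = readl_poll_timeout(regs + HBM_TRAINING_STATUS_SKETCH, val,
                                 val == HBM_TRAINING_DONE_SKETCH,
                                 1000, 30 * 1000 * 1000);
        if (ret)
                return ret;     /* fail the probe on a training timeout */

        /* Then confirm the C2C link has come up. */
        if (readl(regs + C2C_LINK_STATUS_SKETCH) != C2C_LINK_READY_SKETCH)
                return -ENODEV;

        return 0;
}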

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Applied over next-20241003.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Ankit Agrawal (3):
  vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
    resmem
  vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
  vfio/nvgrace-gpu: Check the HBM training and C2C link status

 drivers/vfio/pci/nvgrace-gpu/main.c | 115 ++++++++++++++++++++++++++--
 1 file changed, 107 insertions(+), 8 deletions(-)

Comments

Alex Williamson Oct. 7, 2024, 2:19 p.m. UTC | #1
On Sun, 6 Oct 2024 10:27:19 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA's recently introduced Grace Blackwell (GB) Superchip is the
> successor to the Grace Hopper (GH) Superchip; like GH, it gives the
> CPU and GPU cache-coherent access to each other's memory over an
> internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
> The in-tree nvgrace-gpu driver manages the GH devices. The intention
> is to extend that support to the new Grace Blackwell boards.

Where do we stand on QEMU enablement of GH, or the GB support here?
IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
being the means through which the community could make use of this
driver, but there seem to be a number of pieces missing for that
support.  Thanks,

Alex

Ankit Agrawal Oct. 7, 2024, 4:37 p.m. UTC | #2
>>
>> NVIDIA's recently introduced Grace Blackwell (GB) Superchip is the
>> successor to the Grace Hopper (GH) Superchip; like GH, it gives the
>> CPU and GPU cache-coherent access to each other's memory over an
>> internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
>> The in-tree nvgrace-gpu driver manages the GH devices. The intention
>> is to extend that support to the new Grace Blackwell boards.
>
> Where do we stand on QEMU enablement of GH, or the GB support here?
> IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
> being the means through which the community could make use of this
> driver, but there seem to be a number of pieces missing for that
> support.  Thanks,
> 
> Alex

Hi Alex, the QEMU enablement changes for GH are already in QEMU 9.0.
This is the generic initiator change that got merged:
https://lore.kernel.org/all/20240308145525.10886-1-ankita@nvidia.com/

The missing pieces are actually on the KVM/kernel side:
1. KVM needs to map the device memory as Normal. The KVM patch was
proposed here and needs a refresh to address the review suggestions:
https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
2. The ECC handling series for the GPU device memory that is mapped
through remap_pfn_range():
https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/

With those changes, GH would be functional with QEMU 9.0.
We discovered a separate QEMU issue while verifying Grace Blackwell,
where the 512G of highmem proved too small:
https://github.com/qemu/qemu/blob/v9.0.0/hw/arm/virt.c#L211
We are planning to float a proposal to fix that.

Thanks
Ankit Agrawal
Alex Williamson Oct. 7, 2024, 9:16 p.m. UTC | #3
On Mon, 7 Oct 2024 16:37:12 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> >>
> >> NVIDIA's recently introduced Grace Blackwell (GB) Superchip is the
> >> successor to the Grace Hopper (GH) Superchip; like GH, it gives the
> >> CPU and GPU cache-coherent access to each other's memory over an
> >> internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
> >> The in-tree nvgrace-gpu driver manages the GH devices. The intention
> >> is to extend that support to the new Grace Blackwell boards.
> >
> > Where do we stand on QEMU enablement of GH, or the GB support here?
> > IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
> > being the means through which the community could make use of this
> > driver, but there seem to be a number of pieces missing for that
> > support.  Thanks,
> > 
> > Alex  
> 
> Hi Alex, the QEMU enablement changes for GH are already in QEMU 9.0.
> This is the generic initiator change that got merged:
> https://lore.kernel.org/all/20240308145525.10886-1-ankita@nvidia.com/
> 
> The missing pieces are actually on the KVM/kernel side:
> 1. KVM needs to map the device memory as Normal. The KVM patch was
> proposed here and needs a refresh to address the review suggestions:
> https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
> 2. The ECC handling series for the GPU device memory that is mapped
> through remap_pfn_range():
> https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/
> 
> With those changes, GH would be functional with QEMU 9.0.

Sure, unless we note that those series were posted a year ago, which
makes it much harder to claim that we're actively enabling upstream
testing for this driver that we're now trying to extend to new
hardware.  Thanks,

Alex

Ankit Agrawal Oct. 8, 2024, 7:22 a.m. UTC | #4
>>
>> Hi Alex, the QEMU enablement changes for GH are already in QEMU 9.0.
>> This is the generic initiator change that got merged:
>> https://lore.kernel.org/all/20240308145525.10886-1-ankita@nvidia.com/
>>
>> The missing pieces are actually on the KVM/kernel side:
>> 1. KVM needs to map the device memory as Normal. The KVM patch was
>> proposed here and needs a refresh to address the review suggestions:
>> https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
>> 2. The ECC handling series for the GPU device memory that is mapped
>> through remap_pfn_range():
>> https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/
>>
>> With those changes, GH would be functional with QEMU 9.0.
>
> Sure, unless we note that those series were posted a year ago, which
> makes it much harder to claim that we're actively enabling upstream
> testing for this driver that we're now trying to extend to new
> hardware.  Thanks,
>
> Alex

Right, I am working to implement the leftover items mentioned above.
The work to refresh them is ongoing and I will be posting it shortly,
starting with the KVM patch.