diff mbox series

[RFC,08/12] vfio/pci: Create host unaccessible dma-buf for private device

Message ID 20250107142719.179636-9-yilun.xu@linux.intel.com (mailing list archive)
State New
Headers show
Series Private MMIO support for private assigned dev | expand

Commit Message

Xu Yilun Jan. 7, 2025, 2:27 p.m. UTC
Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
for private assignment. For these private assigned devices, disallow
host accessing their MMIO resources.

Since the MMIO regions for private assignment are not accessible from
host, remove the VFIO_REGION_INFO_FLAG_MMAP/READ/WRITE for these
regions, instead add a new VFIO_REGION_INFO_FLAG_PRIVATE flag to
indicate users should create dma-buf for MMIO mapping in KVM MMU.

Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
 drivers/vfio/device_cdev.c       |  9 ++++++++-
 drivers/vfio/pci/vfio_pci_core.c | 14 ++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h |  2 ++
 drivers/vfio/pci/vfio_pci_rdwr.c |  3 +++
 include/linux/vfio.h             |  1 +
 include/uapi/linux/vfio.h        |  5 ++++-
 6 files changed, 32 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe Jan. 8, 2025, 1:30 p.m. UTC | #1
On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> for private assignment. For these private assigned devices, disallow
> host accessing their MMIO resources.

Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
to enforce this?

Jason
Xu Yilun Jan. 8, 2025, 4:57 p.m. UTC | #2
On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > for private assignment. For these private assigned devices, disallow
> > host accessing their MMIO resources.
> 
> Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> to enforce this?

MM.. maybe I should not say 'host', instead 'userspace'.

I think the kernel part VMM (KVM) has the responsibility to enforce the
correct behavior of the userspace part VMM (QEMU). QEMU has no way to
touch private memory/MMIO intentionally or accidently. IIUC that's one
of the initiative guest_memfd is introduced for private memory. Private
MMIO follows.

Thanks,
Yilun

> 
> Jason
Jason Gunthorpe Jan. 9, 2025, 2:40 p.m. UTC | #3
On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote:
> On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > > for private assignment. For these private assigned devices, disallow
> > > host accessing their MMIO resources.
> > 
> > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> > to enforce this?
> 
> MM.. maybe I should not say 'host', instead 'userspace'.
> 
> I think the kernel part VMM (KVM) has the responsibility to enforce the
> correct behavior of the userspace part VMM (QEMU). QEMU has no way to
> touch private memory/MMIO intentionally or accidently. IIUC that's one
> of the initiative guest_memfd is introduced for private memory. Private
> MMIO follows.

Okay, but then why is it a flag like that? I'm expecting a much
broader system here to make the VFIO device into a confidential device
(like setup the TDI) where we'd have to enforce the private things,
communicate with some secure world to assign it, and so on.

I want to see a fuller solution to the CC problem in VFIO before we
can be sure what is the correct UAPI. In other words, make the
VFIO device into a CC device should also prevent mmaping it and so on.

So, I would take this out and defer VFIO enforcment to a series which
does fuller CC enablement of VFIO.

The precursor work should just be avoiding requiring a VMA when
installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by
using a FD to get the CPU pfns into iommufd and kvm as you are
showing.

This works just fine for non-CC devices anyhow and is the necessary
building block for making a TDI interface in VFIO.

Jason
Xu Yilun Jan. 9, 2025, 4:40 p.m. UTC | #4
On Thu, Jan 09, 2025 at 10:40:51AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote:
> > On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > > > for private assignment. For these private assigned devices, disallow
> > > > host accessing their MMIO resources.
> > > 
> > > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> > > to enforce this?
> > 
> > MM.. maybe I should not say 'host', instead 'userspace'.
> > 
> > I think the kernel part VMM (KVM) has the responsibility to enforce the
> > correct behavior of the userspace part VMM (QEMU). QEMU has no way to
> > touch private memory/MMIO intentionally or accidently. IIUC that's one
> > of the initiative guest_memfd is introduced for private memory. Private
> > MMIO follows.
> 
> Okay, but then why is it a flag like that? I'm expecting a much

This flag is a prerequisite for setting up TDI, or part of the
requirement to make a "TDI capable" assigned device. It prevents the
userspace mapping at the first place, even as a shared device.

We want the device firstly appear as a shared device in CoCo-VM, then
do TDI setup (via a tsm verb "bind"). This late bind approach avoids
changing the CoCo VM startup routine. In contrast, early bind would
easily be broken, especially if bios is not aware of the TDI rule.

So then we face with the shared <-> private device conversion in CoCo VM,
and in turn shared <-> private MMIO conversion. MMIO region has only one
physical backend so it is a bit like in-place conversion which is
complicated. I wanna simply the MMIO conversion routine based on the fact
that VMM never needs to access assigned MMIO for feature emulation, so
always disallow userspace MMIO mapping during the whole lifecycle. That's
why the flag is introduced.

Patch 6 has similar discription.

> broader system here to make the VFIO device into a confidential device
> (like setup the TDI) where we'd have to enforce the private things,

I plan to introduce a new VFIO ioctl to setup the TDI.

> communicate with some secure world to assign it, and so on.

Yes, the new VFIO ioctl will communicate with PCI TSM.

> 
> I want to see a fuller solution to the CC problem in VFIO before we

MM.. I have something but need more preparation. Whether send out or
make a public repo, I'll discuss with internal.

> can be sure what is the correct UAPI. In other words, make the
> VFIO device into a CC device should also prevent mmaping it and so on.

My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI).

> 
> So, I would take this out and defer VFIO enforcment to a series which
> does fuller CC enablement of VFIO.
> 
> The precursor work should just be avoiding requiring a VMA when
> installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by
> using a FD to get the CPU pfns into iommufd and kvm as you are
> showing.
> 
> This works just fine for non-CC devices anyhow and is the necessary

Yes. It carries out the idea of "KVM maps MMIO resources without firstly
mapping into the host" even for normal VM. That's why I think it could
be an independent patchset.

Thanks,
Yilun

> building block for making a TDI interface in VFIO.
> 
> Jason
Jason Gunthorpe Jan. 10, 2025, 1:31 p.m. UTC | #5
On Fri, Jan 10, 2025 at 12:40:28AM +0800, Xu Yilun wrote:

> So then we face with the shared <-> private device conversion in CoCo VM,
> and in turn shared <-> private MMIO conversion. MMIO region has only one
> physical backend so it is a bit like in-place conversion which is
> complicated. I wanna simply the MMIO conversion routine based on the fact
> that VMM never needs to access assigned MMIO for feature emulation, so
> always disallow userspace MMIO mapping during the whole lifecycle. That's
> why the flag is introduced.

The VMM can simply not map it if for these cases. As part of the TDI
flow the kernel can validate it is not mapped.
 
> > can be sure what is the correct UAPI. In other words, make the
> > VFIO device into a CC device should also prevent mmaping it and so on.
> 
> My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI).

I think you need to start the TDI process much earlier. Some arches
are going to need work to prepare the TDI before the VM is started.

The other issue here is that Intel is somewhat different from others
and when we build uapi for TDI it has to accommodate everyone.

> Yes. It carries out the idea of "KVM maps MMIO resources without firstly
> mapping into the host" even for normal VM. That's why I think it could
> be an independent patchset.

Yes, just remove this patch and other TDI focused stuff. Just
infrastructure to move to FD based mapping instead of VMA.

Jason
diff mbox series

Patch

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index bb1817bd4ff3..919285c1cd7a 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -75,7 +75,10 @@  long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
 	if (copy_from_user(&bind, arg, minsz))
 		return -EFAULT;
 
-	if (bind.argsz < minsz || bind.flags || bind.iommufd < 0)
+	if (bind.argsz < minsz || bind.iommufd < 0)
+		return -EINVAL;
+
+	if (bind.flags & ~(VFIO_DEVICE_BIND_IOMMUFD_PRIVATE))
 		return -EINVAL;
 
 	/* BIND_IOMMUFD only allowed for cdev fds */
@@ -118,6 +121,9 @@  long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
 		goto out_close_device;
 
 	device->cdev_opened = true;
+	if (bind.flags & VFIO_DEVICE_BIND_IOMMUFD_PRIVATE)
+		device->is_private = true;
+
 	/*
 	 * Paired with smp_load_acquire() in vfio_device_fops::ioctl/
 	 * read/write/mmap
@@ -151,6 +157,7 @@  void vfio_df_unbind_iommufd(struct vfio_device_file *df)
 		return;
 
 	mutex_lock(&device->dev_set->lock);
+	device->is_private = false;
 	vfio_df_close(df);
 	vfio_device_put_kvm(device);
 	iommufd_ctx_put(df->iommufd);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f69eda5956ad..11c735dfe1f7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1005,6 +1005,12 @@  static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
 	return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
 }
 
+bool is_vfio_pci_bar_private(struct vfio_pci_core_device *vdev, int bar)
+{
+	/* Any mmap supported bar can be used as vfio dmabuf */
+	return vdev->bar_mmap_supported[bar] && vdev->vdev.is_private;
+}
+
 static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
 					  struct vfio_region_info __user *arg)
 {
@@ -1035,6 +1041,11 @@  static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
 			break;
 		}
 
+		if (is_vfio_pci_bar_private(vdev, info.index)) {
+			info.flags = VFIO_REGION_INFO_FLAG_PRIVATE;
+			break;
+		}
+
 		info.flags = VFIO_REGION_INFO_FLAG_READ |
 			     VFIO_REGION_INFO_FLAG_WRITE;
 		if (vdev->bar_mmap_supported[info.index]) {
@@ -1735,6 +1746,9 @@  int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
 
+	if (vdev->vdev.is_private)
+		return -EINVAL;
+
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d27f383f3931..2b61e35145fd 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -126,4 +126,6 @@  static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
 }
 #endif
 
+bool is_vfio_pci_bar_private(struct vfio_pci_core_device *vdev, int bar);
+
 #endif
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 66b72c289284..e385f7f63414 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -242,6 +242,9 @@  ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 	struct resource *res = &vdev->pdev->resource[bar];
 	ssize_t done;
 
+	if (is_vfio_pci_bar_private(vdev, bar))
+		return -EINVAL;
+
 	if (pci_resource_start(pdev, bar))
 		end = pci_resource_len(pdev, bar);
 	else if (bar == PCI_ROM_RESOURCE &&
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 2258b0585330..e99d856c6cd8 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -69,6 +69,7 @@  struct vfio_device {
 	struct iommufd_device *iommufd_device;
 	u8 iommufd_attached:1;
 #endif
+	u8 is_private:1;
 	u8 cdev_opened:1;
 #ifdef CONFIG_DEBUG_FS
 	/*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index f43dfbde7352..6a1c703e3185 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -275,6 +275,7 @@  struct vfio_region_info {
 #define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
 #define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
 #define VFIO_REGION_INFO_FLAG_CAPS	(1 << 3) /* Info supports caps */
+#define VFIO_REGION_INFO_FLAG_PRIVATE	(1 << 4) /* Region supports private MMIO */
 	__u32	index;		/* Region index */
 	__u32	cap_offset;	/* Offset within info struct of first cap */
 	__aligned_u64	size;	/* Region size (bytes) */
@@ -904,7 +905,8 @@  struct vfio_device_feature {
  * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 18,
  *				   struct vfio_device_bind_iommufd)
  * @argsz:	 User filled size of this data.
- * @flags:	 Must be 0.
+ * @flags:	 Optional device initialization flags:
+ *		 VFIO_DEVICE_BIND_IOMMUFD_PRIVATE:	for private assignment
  * @iommufd:	 iommufd to bind.
  * @out_devid:	 The device id generated by this bind. devid is a handle for
  *		 this device/iommufd bond and can be used in IOMMUFD commands.
@@ -921,6 +923,7 @@  struct vfio_device_feature {
 struct vfio_device_bind_iommufd {
 	__u32		argsz;
 	__u32		flags;
+#define VFIO_DEVICE_BIND_IOMMUFD_PRIVATE	(1 << 0)
 	__s32		iommufd;
 	__u32		out_devid;
 };