Message ID | 20250107142719.179636-9-yilun.xu@linux.intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Private MMIO support for private assigned dev | expand |
On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote: > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as > for private assignment. For these private assigned devices, disallow > host accessing their MMIO resources. Why? Shouldn't the VMM simply not call mmap? Why does the kernel have to enforce this? Jason
On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote: > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote: > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as > > for private assignment. For these private assigned devices, disallow > > host accessing their MMIO resources. > > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have > to enforce this? MM.. maybe I should not say 'host', instead 'userspace'. I think the kernel part VMM (KVM) has the responsibility to enforce the correct behavior of the userspace part VMM (QEMU). QEMU has no way to touch private memory/MMIO intentionally or accidently. IIUC that's one of the initiative guest_memfd is introduced for private memory. Private MMIO follows. Thanks, Yilun > > Jason
On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote: > On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote: > > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote: > > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as > > > for private assignment. For these private assigned devices, disallow > > > host accessing their MMIO resources. > > > > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have > > to enforce this? > > MM.. maybe I should not say 'host', instead 'userspace'. > > I think the kernel part VMM (KVM) has the responsibility to enforce the > correct behavior of the userspace part VMM (QEMU). QEMU has no way to > touch private memory/MMIO intentionally or accidently. IIUC that's one > of the initiative guest_memfd is introduced for private memory. Private > MMIO follows. Okay, but then why is it a flag like that? I'm expecting a much broader system here to make the VFIO device into a confidential device (like setup the TDI) where we'd have to enforce the private things, communicate with some secure world to assign it, and so on. I want to see a fuller solution to the CC problem in VFIO before we can be sure what is the correct UAPI. In other words, make the VFIO device into a CC device should also prevent mmaping it and so on. So, I would take this out and defer VFIO enforcment to a series which does fuller CC enablement of VFIO. The precursor work should just be avoiding requiring a VMA when installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by using a FD to get the CPU pfns into iommufd and kvm as you are showing. This works just fine for non-CC devices anyhow and is the necessary building block for making a TDI interface in VFIO. Jason
On Thu, Jan 09, 2025 at 10:40:51AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote: > > On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote: > > > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote: > > > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as > > > > for private assignment. For these private assigned devices, disallow > > > > host accessing their MMIO resources. > > > > > > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have > > > to enforce this? > > > > MM.. maybe I should not say 'host', instead 'userspace'. > > > > I think the kernel part VMM (KVM) has the responsibility to enforce the > > correct behavior of the userspace part VMM (QEMU). QEMU has no way to > > touch private memory/MMIO intentionally or accidently. IIUC that's one > > of the initiative guest_memfd is introduced for private memory. Private > > MMIO follows. > > Okay, but then why is it a flag like that? I'm expecting a much This flag is a prerequisite for setting up TDI, or part of the requirement to make a "TDI capable" assigned device. It prevents the userspace mapping at the first place, even as a shared device. We want the device firstly appear as a shared device in CoCo-VM, then do TDI setup (via a tsm verb "bind"). This late bind approach avoids changing the CoCo VM startup routine. In contrast, early bind would easily be broken, especially if bios is not aware of the TDI rule. So then we face with the shared <-> private device conversion in CoCo VM, and in turn shared <-> private MMIO conversion. MMIO region has only one physical backend so it is a bit like in-place conversion which is complicated. I wanna simply the MMIO conversion routine based on the fact that VMM never needs to access assigned MMIO for feature emulation, so always disallow userspace MMIO mapping during the whole lifecycle. That's why the flag is introduced. Patch 6 has similar discription. > broader system here to make the VFIO device into a confidential device > (like setup the TDI) where we'd have to enforce the private things, I plan to introduce a new VFIO ioctl to setup the TDI. > communicate with some secure world to assign it, and so on. Yes, the new VFIO ioctl will communicate with PCI TSM. > > I want to see a fuller solution to the CC problem in VFIO before we MM.. I have something but need more preparation. Whether send out or make a public repo, I'll discuss with internal. > can be sure what is the correct UAPI. In other words, make the > VFIO device into a CC device should also prevent mmaping it and so on. My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI). > > So, I would take this out and defer VFIO enforcment to a series which > does fuller CC enablement of VFIO. > > The precursor work should just be avoiding requiring a VMA when > installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by > using a FD to get the CPU pfns into iommufd and kvm as you are > showing. > > This works just fine for non-CC devices anyhow and is the necessary Yes. It carries out the idea of "KVM maps MMIO resources without firstly mapping into the host" even for normal VM. That's why I think it could be an independent patchset. Thanks, Yilun > building block for making a TDI interface in VFIO. > > Jason
On Fri, Jan 10, 2025 at 12:40:28AM +0800, Xu Yilun wrote: > So then we face with the shared <-> private device conversion in CoCo VM, > and in turn shared <-> private MMIO conversion. MMIO region has only one > physical backend so it is a bit like in-place conversion which is > complicated. I wanna simply the MMIO conversion routine based on the fact > that VMM never needs to access assigned MMIO for feature emulation, so > always disallow userspace MMIO mapping during the whole lifecycle. That's > why the flag is introduced. The VMM can simply not map it if for these cases. As part of the TDI flow the kernel can validate it is not mapped. > > can be sure what is the correct UAPI. In other words, make the > > VFIO device into a CC device should also prevent mmaping it and so on. > > My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI). I think you need to start the TDI process much earlier. Some arches are going to need work to prepare the TDI before the VM is started. The other issue here is that Intel is somewhat different from others and when we build uapi for TDI it has to accommodate everyone. > Yes. It carries out the idea of "KVM maps MMIO resources without firstly > mapping into the host" even for normal VM. That's why I think it could > be an independent patchset. Yes, just remove this patch and other TDI focused stuff. Just infrastructure to move to FD based mapping instead of VMA. Jason
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index bb1817bd4ff3..919285c1cd7a 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -75,7 +75,10 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df, if (copy_from_user(&bind, arg, minsz)) return -EFAULT; - if (bind.argsz < minsz || bind.flags || bind.iommufd < 0) + if (bind.argsz < minsz || bind.iommufd < 0) + return -EINVAL; + + if (bind.flags & ~(VFIO_DEVICE_BIND_IOMMUFD_PRIVATE)) return -EINVAL; /* BIND_IOMMUFD only allowed for cdev fds */ @@ -118,6 +121,9 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df, goto out_close_device; device->cdev_opened = true; + if (bind.flags & VFIO_DEVICE_BIND_IOMMUFD_PRIVATE) + device->is_private = true; + /* * Paired with smp_load_acquire() in vfio_device_fops::ioctl/ * read/write/mmap @@ -151,6 +157,7 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df) return; mutex_lock(&device->dev_set->lock); + device->is_private = false; vfio_df_close(df); vfio_device_put_kvm(device); iommufd_ctx_put(df->iommufd); diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index f69eda5956ad..11c735dfe1f7 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1005,6 +1005,12 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev, return copy_to_user(arg, &info, minsz) ? -EFAULT : 0; } +bool is_vfio_pci_bar_private(struct vfio_pci_core_device *vdev, int bar) +{ + /* Any mmap supported bar can be used as vfio dmabuf */ + return vdev->bar_mmap_supported[bar] && vdev->vdev.is_private; +} + static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev, struct vfio_region_info __user *arg) { @@ -1035,6 +1041,11 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev, break; } + if (is_vfio_pci_bar_private(vdev, info.index)) { + info.flags = VFIO_REGION_INFO_FLAG_PRIVATE; + break; + } + info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE; if (vdev->bar_mmap_supported[info.index]) { @@ -1735,6 +1746,9 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma u64 phys_len, req_len, pgoff, req_start; int ret; + if (vdev->vdev.is_private) + return -EINVAL; + index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT); if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions) diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index d27f383f3931..2b61e35145fd 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -126,4 +126,6 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, } #endif +bool is_vfio_pci_bar_private(struct vfio_pci_core_device *vdev, int bar); + #endif diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 66b72c289284..e385f7f63414 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -242,6 +242,9 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf, struct resource *res = &vdev->pdev->resource[bar]; ssize_t done; + if (is_vfio_pci_bar_private(vdev, bar)) + return -EINVAL; + if (pci_resource_start(pdev, bar)) end = pci_resource_len(pdev, bar); else if (bar == PCI_ROM_RESOURCE && diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 2258b0585330..e99d856c6cd8 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -69,6 +69,7 @@ struct vfio_device { struct iommufd_device *iommufd_device; u8 iommufd_attached:1; #endif + u8 is_private:1; u8 cdev_opened:1; #ifdef CONFIG_DEBUG_FS /* diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index f43dfbde7352..6a1c703e3185 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -275,6 +275,7 @@ struct vfio_region_info { #define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */ #define VFIO_REGION_INFO_FLAG_MMAP (1 << 2) /* Region supports mmap */ #define VFIO_REGION_INFO_FLAG_CAPS (1 << 3) /* Info supports caps */ +#define VFIO_REGION_INFO_FLAG_PRIVATE (1 << 4) /* Region supports private MMIO */ __u32 index; /* Region index */ __u32 cap_offset; /* Offset within info struct of first cap */ __aligned_u64 size; /* Region size (bytes) */ @@ -904,7 +905,8 @@ struct vfio_device_feature { * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 18, * struct vfio_device_bind_iommufd) * @argsz: User filled size of this data. - * @flags: Must be 0. + * @flags: Optional device initialization flags: + * VFIO_DEVICE_BIND_IOMMUFD_PRIVATE: for private assignment * @iommufd: iommufd to bind. * @out_devid: The device id generated by this bind. devid is a handle for * this device/iommufd bond and can be used in IOMMUFD commands. @@ -921,6 +923,7 @@ struct vfio_device_feature { struct vfio_device_bind_iommufd { __u32 argsz; __u32 flags; +#define VFIO_DEVICE_BIND_IOMMUFD_PRIVATE (1 << 0) __s32 iommufd; __u32 out_devid; };
Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as for private assignment. For these private assigned devices, disallow host accessing their MMIO resources. Since the MMIO regions for private assignment are not accessible from host, remove the VFIO_REGION_INFO_FLAG_MMAP/READ/WRITE for these regions, instead add a new VFIO_REGION_INFO_FLAG_PRIVATE flag to indicate users should create dma-buf for MMIO mapping in KVM MMU. Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- drivers/vfio/device_cdev.c | 9 ++++++++- drivers/vfio/pci/vfio_pci_core.c | 14 ++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 2 ++ drivers/vfio/pci/vfio_pci_rdwr.c | 3 +++ include/linux/vfio.h | 1 + include/uapi/linux/vfio.h | 5 ++++- 6 files changed, 32 insertions(+), 2 deletions(-)