Message ID | 20201119142737.17574-1-justin.he@arm.com (mailing list archive)
---|---
State | New, archived
Series | vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
On Thu, 19 Nov 2020 22:27:37 +0800
Jia He <justin.he@arm.com> wrote:

> The permission of vfio iommu is different and incompatible with vma
> permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> map a dma region with NONE permission.
>
> This corner case will be exposed in coming virtio_fs cache_size
> commit [1]
>  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>    memory_region_init_ram_ptr()
>  - re-mmap the above area with read/write authority.
>  - vfio_dma_map() will be invoked when vfio device is hotplug added.
>
> qemu:
>   vfio_listener_region_add()
>     vfio_dma_map(..., readonly=false)
>       map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
>       ioctl(VFIO_IOMMU_MAP_DMA)
>
> kernel:
>   vfio_dma_do_map()
>     vfio_pin_map_dma()
>       vfio_pin_pages_remote()
>         vaddr_get_pfn()
>           ...
>             check_vma_flags() failed! because
>             vm_flags hasn't VM_WRITE && gup_flags
>             has FOLL_WRITE
>
> It will report error in qemu log when hotplug adding(vfio) a nvme disk
> to qemu guest on an Ampere EMAG server:
>   "VFIO_MAP_DMA failed: Bad address"

I don't fully understand the argument here, I think this is suggesting
that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has
NONE permission, the kernel can ignore read/write permission by using
FOLL_FORCE. Not only is QEMU not the only userspace driver for vfio,
but regardless of that, we can't trust the behavior of any given
userspace driver. Bypassing the permission check with FOLL_FORCE seems
like it's placing the trust in the user, which seems like a security
issue. Thanks,

Alex

> [1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502
>
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 67e827638995..33faa6b7dbd4 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  		flags |= FOLL_WRITE;
>
>  	mmap_read_lock(mm);
> -	ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
> +	ret = pin_user_pages_remote(mm, vaddr, 1,
> +				    flags | FOLL_LONGTERM | FOLL_FORCE,
>  				    page, NULL, NULL);
>  	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
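For context, here is a minimal userspace sketch of the VFIO_IOMMU_MAP_DMA call that the qemu-side chain in the commit message ends in. This is not QEMU source; the container fd, iova and size are placeholders, and only the struct and flag names come from the vfio uapi.

```c
/*
 * Minimal userspace sketch (not QEMU source) of the VFIO_IOMMU_MAP_DMA
 * ioctl.  container_fd, iova and size are placeholders; the struct and
 * flag names come from <linux/vfio.h>.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int dma_map_rw(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	/* readonly=false on the QEMU side means both READ and WRITE are requested. */
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uintptr_t)vaddr;   /* process virtual address backing the region */
	map.iova  = iova;               /* IOVA the device will use */
	map.size  = size;

	/* Fails with EFAULT ("Bad address") when the backing vma lacks VM_WRITE. */
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```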
Hi Alex, thanks for the comments.
See mine below:

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, November 20, 2020 1:05 AM
> To: Justin He <Justin.He@arm.com>
> Cc: Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
>
> On Thu, 19 Nov 2020 22:27:37 +0800
> Jia He <justin.he@arm.com> wrote:
>
> > The permission of vfio iommu is different and incompatible with vma
> > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> > map a dma region with NONE permission.
> >
> > This corner case will be exposed in coming virtio_fs cache_size
> > commit [1]
> >  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >    memory_region_init_ram_ptr()
> >  - re-mmap the above area with read/write authority.
> >  - vfio_dma_map() will be invoked when vfio device is hotplug added.
> >
> > qemu:
> >   vfio_listener_region_add()
> >     vfio_dma_map(..., readonly=false)
> >       map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
> >       ioctl(VFIO_IOMMU_MAP_DMA)
> >
> > kernel:
> >   vfio_dma_do_map()
> >     vfio_pin_map_dma()
> >       vfio_pin_pages_remote()
> >         vaddr_get_pfn()
> >           ...
> >             check_vma_flags() failed! because
> >             vm_flags hasn't VM_WRITE && gup_flags
> >             has FOLL_WRITE
> >
> > It will report error in qemu log when hotplug adding(vfio) a nvme disk
> > to qemu guest on an Ampere EMAG server:
> >   "VFIO_MAP_DMA failed: Bad address"
>
> I don't fully understand the argument here, I think this is suggesting
> that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has
> NONE permission, the kernel can ignore read/write permission by using
> FOLL_FORCE. Not only is QEMU not the only userspace driver for vfio,
> but regardless of that, we can't trust the behavior of any given
> userspace driver. Bypassing the permission check with FOLL_FORCE seems
> like it's placing the trust in the user, which seems like a security
> issue. Thanks,

Yes, this might have side impact on security.
But besides this simple fix(adding FOLL_FORCE), do you think it is a good
idea that:
Qemu provides a special vfio_dma_map_none_perm() to allow mapping a
region with NONE permission?

Thanks for any suggestion.

--
Cheers,
Justin (Jia He)

>
> Alex
>
> > [1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502
> >
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 67e827638995..33faa6b7dbd4 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >  		flags |= FOLL_WRITE;
> >
> >  	mmap_read_lock(mm);
> > -	ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
> > +	ret = pin_user_pages_remote(mm, vaddr, 1,
> > +				    flags | FOLL_LONGTERM | FOLL_FORCE,
> >  				    page, NULL, NULL);
> >  	if (ret == 1) {
> >  		*pfn = page_to_pfn(page[0]);
On Mon, 23 Nov 2020 02:37:32 +0000
Justin He <Justin.He@arm.com> wrote:

> Hi Alex, thanks for the comments.
> See mine below:
>
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, November 20, 2020 1:05 AM
> > To: Justin He <Justin.He@arm.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
> >
> > On Thu, 19 Nov 2020 22:27:37 +0800
> > Jia He <justin.he@arm.com> wrote:
> >
> > > The permission of vfio iommu is different and incompatible with vma
> > > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> > > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> > > map a dma region with NONE permission.
> > >
> > > This corner case will be exposed in coming virtio_fs cache_size
> > > commit [1]
> > >  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> > >    memory_region_init_ram_ptr()
> > >  - re-mmap the above area with read/write authority.
> > >  - vfio_dma_map() will be invoked when vfio device is hotplug added.
> > >
> > > qemu:
> > >   vfio_listener_region_add()
> > >     vfio_dma_map(..., readonly=false)
> > >       map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
> > >       ioctl(VFIO_IOMMU_MAP_DMA)
> > >
> > > kernel:
> > >   vfio_dma_do_map()
> > >     vfio_pin_map_dma()
> > >       vfio_pin_pages_remote()
> > >         vaddr_get_pfn()
> > >           ...
> > >             check_vma_flags() failed! because
> > >             vm_flags hasn't VM_WRITE && gup_flags
> > >             has FOLL_WRITE
> > >
> > > It will report error in qemu log when hotplug adding(vfio) a nvme disk
> > > to qemu guest on an Ampere EMAG server:
> > >   "VFIO_MAP_DMA failed: Bad address"
> >
> > I don't fully understand the argument here, I think this is suggesting
> > that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has
> > NONE permission, the kernel can ignore read/write permission by using
> > FOLL_FORCE. Not only is QEMU not the only userspace driver for vfio,
> > but regardless of that, we can't trust the behavior of any given
> > userspace driver. Bypassing the permission check with FOLL_FORCE seems
> > like it's placing the trust in the user, which seems like a security
> > issue. Thanks,
>
> Yes, this might have side impact on security.
> But besides this simple fix(adding FOLL_FORCE), do you think it is a good
> idea that:
> Qemu provides a special vfio_dma_map_none_perm() to allow mapping a
> region with NONE permission?

If NONE permission implies that we use FOLL_FORCE as described here to
ignore the r+w permissions and trust that the user knows what they're
doing, that seems like a non-starter. Ultimately I think what you're
describing is a scenario where our current permission check fails and
the solution is probably to extend the check to account for other ways
that a user may have access to a vma rather than bypass the check.
Thanks,

Alex
Hi, Jia,

On Thu, Nov 19, 2020 at 10:27:37PM +0800, Jia He wrote:
> The permission of vfio iommu is different and incompatible with vma
> permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> map a dma region with NONE permission.
>
> This corner case will be exposed in coming virtio_fs cache_size
> commit [1]
>  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>    memory_region_init_ram_ptr()
>  - re-mmap the above area with read/write authority.

If iiuc here we'll remap the above PROT_NONE into PROT_READ|PROT_WRITE,
then...

>  - vfio_dma_map() will be invoked when vfio device is hotplug added.

... here I'm slightly confused on why VFIO_IOMMU_MAP_DMA would encounter vma
check fail - aren't they already get rw permissions?

I'd appreciate if you could explain why vfio needs to dma map some PROT_NONE
pages after all, and whether QEMU would be able to postpone the vfio map of
those PROT_NONE pages until they got to become with RW permissions.

Thanks,
Hi Peter

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, November 25, 2020 2:12 AM
> To: Justin He <Justin.He@arm.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
>
> Hi, Jia,
>
> On Thu, Nov 19, 2020 at 10:27:37PM +0800, Jia He wrote:
> > The permission of vfio iommu is different and incompatible with vma
> > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> > map a dma region with NONE permission.
> >
> > This corner case will be exposed in coming virtio_fs cache_size
> > commit [1]
> >  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >    memory_region_init_ram_ptr()
> >  - re-mmap the above area with read/write authority.
>
> If iiuc here we'll remap the above PROT_NONE into PROT_READ|PROT_WRITE,
> then...
>
> >  - vfio_dma_map() will be invoked when vfio device is hotplug added.
>
> ... here I'm slightly confused on why VFIO_IOMMU_MAP_DMA would encounter vma
> check fail - aren't they already get rw permissions?

No, we haven't got the vma rw permission yet, but the default permission in
this case is rw by default. When qemu side invoke vfio_dma_map(), the rw of
iommu will be automatically added [1] [2] (currently map a NONE region is not
supported in qemu vfio).

[1] https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=6ff1daa763f87a1ed5351bcc19aeb027c43b8a8f;hb=HEAD#l479
[2] https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=6ff1daa763f87a1ed5351bcc19aeb027c43b8a8f;hb=HEAD#l486

But at kernel side, the vma permission is created by PROT_NONE. Then the check
in check_vma_flags() at [3] will be failed.

[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c#n929

>
> I'd appreciate if you could explain why vfio needs to dma map some
> PROT_NONE

Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
region of that cache window with read or write permission. I guess this might
be an security concern. Just CC virtiofs expert Stefan to answer it more
accurately.

--
Cheers,
Justin (Jia He)

> pages after all, and whether QEMU would be able to postpone the vfio map of
> those PROT_NONE pages until they got to become with RW permissions.
>
> Thanks,
>
> --
> Peter Xu
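For reference, below is a simplified paraphrase of the write-permission check in check_vma_flags() that [3] points at. It is not the literal mm/gup.c source (the VM_IO/VM_PFNMAP, read-side and COW-mapping checks are omitted); it only illustrates why a FOLL_WRITE pin of a vma created with PROT_NONE returns -EFAULT unless FOLL_FORCE is set.

```c
/*
 * Simplified paraphrase of the write-permission part of check_vma_flags()
 * in mm/gup.c (see [3] above); not the literal kernel source.
 */
#include <linux/mm.h>

static int vma_write_check_simplified(vm_flags_t vm_flags,
				      unsigned int gup_flags)
{
	if ((gup_flags & FOLL_WRITE) && !(vm_flags & VM_WRITE)) {
		/*
		 * A write pin of a non-writable vma (e.g. one created with
		 * PROT_NONE) is rejected unless FOLL_FORCE is set -- this
		 * -EFAULT is what QEMU reports as
		 * "VFIO_MAP_DMA failed: Bad address".
		 */
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
	}
	return 0;
}
```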
On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > I'd appreciate if you could explain why vfio needs to dma map some
> > PROT_NONE
>
> Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> region of that cache window with read or write permission. I guess this might
> be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.

Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.

Thanks,
On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > I'd appreciate if you could explain why vfio needs to dma map some
> > > PROT_NONE
> >
> > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > region of that cache window with read or write permission. I guess this might
> > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
>
> Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
> whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.

Userspace processes sometimes use PROT_NONE to reserve virtual address
space. That way future mmap(NULL, ...) calls will not accidentally
allocate an address from the reserved range.

virtio-fs needs to do this because the DAX window mappings change at
runtime. Initially the entire DAX window is just reserved using
PROT_NONE. When it's time to mmap a portion of a file into the DAX
window an mmap(fixed_addr, ...) call will be made.

Stefan
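A minimal standalone sketch of the reservation pattern Stefan describes, with placeholder sizes and offsets rather than virtio-fs's actual values: reserve the whole window with PROT_NONE, then later map a file range into it at a fixed address.

```c
/*
 * Minimal sketch of the reservation pattern described above.  The window
 * size, file descriptor, offsets and lengths are placeholders.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t window_size = 1UL << 30;   /* whole DAX window, reserved only */
	void *window;

	/* Reserve address space: no access permissions, no populated pages. */
	window = mmap(NULL, window_size, PROT_NONE,
		      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (window == MAP_FAILED) {
		perror("mmap reserve");
		return 1;
	}

	/*
	 * Later, a sub-range of the window is mapped with real permissions,
	 * e.g. (fd, file_off, map_len and slot_off being placeholders):
	 *
	 *   mmap((char *)window + slot_off, map_len, PROT_READ | PROT_WRITE,
	 *        MAP_SHARED | MAP_FIXED, fd, file_off);
	 */
	munmap(window, window_size);
	return 0;
}
```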
Hi, Stefan,

On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > PROT_NONE
> > >
> > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > region of that cache window with read or write permission. I guess this might
> > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> >
> > Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
> > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
>
> Userspace processes sometimes use PROT_NONE to reserve virtual address
> space. That way future mmap(NULL, ...) calls will not accidentally
> allocate an address from the reserved range.
>
> virtio-fs needs to do this because the DAX window mappings change at
> runtime. Initially the entire DAX window is just reserved using
> PROT_NONE. When it's time to mmap a portion of a file into the DAX
> window an mmap(fixed_addr, ...) call will be made.

Yes I can understand the rational on why the region is reserved. However IMHO
the real question is why such reservation behavior should affect qemu memory
layout, and even further to VFIO mappings.

Note that PROT_NONE should likely mean that there's no backing page at all in
this case. Since vfio will pin all the pages before mapping the DMAs, it also
means that it's at least inefficient, because when we try to map all the
PROT_NONE pages we'll try to fault in every single page of it, even if they may
not ever be used.

So I still think this patch is not doing the right thing. Instead we should
somehow teach qemu that the virtiofs memory region should only be the size of
enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
PROT_NONE region.

Thanks,
On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > PROT_NONE
> > > >
> > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > > region of that cache window with read or write permission. I guess this might
> > > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> > >
> > > Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
> > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
> >
> > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > space. That way future mmap(NULL, ...) calls will not accidentally
> > allocate an address from the reserved range.
> >
> > virtio-fs needs to do this because the DAX window mappings change at
> > runtime. Initially the entire DAX window is just reserved using
> > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > window an mmap(fixed_addr, ...) call will be made.
>
> Yes I can understand the rational on why the region is reserved. However IMHO
> the real question is why such reservation behavior should affect qemu memory
> layout, and even further to VFIO mappings.
>
> Note that PROT_NONE should likely mean that there's no backing page at all in
> this case. Since vfio will pin all the pages before mapping the DMAs, it also
> means that it's at least inefficient, because when we try to map all the
> PROT_NONE pages we'll try to fault in every single page of it, even if they may
> not ever be used.
>
> So I still think this patch is not doing the right thing. Instead we should
> somehow teach qemu that the virtiofs memory region should only be the size of
> enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> PROT_NONE region.

virtio-fs was not implemented with IOMMUs in mind. The idea is just to
install a kvm.ko memory region that exposes the DAX window.

Perhaps we need to treat the DAX window like an IOMMU? That way the
virtio-fs code can send map/unmap notifications and hw/vfio/ can
propagate them to the host kernel.

Stefan
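A hypothetical sketch of what "treat the DAX window like an IOMMU" could look like on the QEMU side. The function name and its parameters are invented for illustration only; IOMMUTLBEntry, IOMMU_RO/IOMMU_RW and memory_region_notify_iommu() are the vIOMMU notification API QEMU already had at the time of this thread.

```c
/*
 * Hypothetical sketch only: virtio_fs_dax_map_notify() and its parameters
 * are invented for illustration; they are not existing virtio-fs code.
 */
#include "qemu/osdep.h"
#include "exec/memory.h"

static void virtio_fs_dax_map_notify(IOMMUMemoryRegion *dax_iommu_mr,
                                     hwaddr iova, hwaddr translated_addr,
                                     hwaddr len, bool writable)
{
    IOMMUTLBEntry entry = {
        .target_as       = &address_space_memory,
        .iova            = iova,
        .translated_addr = translated_addr,
        .addr_mask       = len - 1,          /* len assumed to be a power of two */
        .perm            = writable ? IOMMU_RW : IOMMU_RO,
    };

    /* Listeners such as hw/vfio/common.c would turn this into a DMA (un)map. */
    memory_region_notify_iommu(dax_iommu_mr, 0, entry);
}
```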
On Thu, Dec 03, 2020 at 11:20:02AM +0000, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
> > On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> > > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > > > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > > PROT_NONE
> > > > >
> > > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > > > region of that cache window with read or write permission. I guess this might
> > > > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> > > >
> > > > Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
> > > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
> > >
> > > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > > space. That way future mmap(NULL, ...) calls will not accidentally
> > > allocate an address from the reserved range.
> > >
> > > virtio-fs needs to do this because the DAX window mappings change at
> > > runtime. Initially the entire DAX window is just reserved using
> > > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > > window an mmap(fixed_addr, ...) call will be made.
> >
> > Yes I can understand the rational on why the region is reserved. However IMHO
> > the real question is why such reservation behavior should affect qemu memory
> > layout, and even further to VFIO mappings.
> >
> > Note that PROT_NONE should likely mean that there's no backing page at all in
> > this case. Since vfio will pin all the pages before mapping the DMAs, it also
> > means that it's at least inefficient, because when we try to map all the
> > PROT_NONE pages we'll try to fault in every single page of it, even if they may
> > not ever be used.
> >
> > So I still think this patch is not doing the right thing. Instead we should
> > somehow teach qemu that the virtiofs memory region should only be the size of
> > enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> > PROT_NONE region.
>
> virtio-fs was not implemented with IOMMUs in mind. The idea is just to
> install a kvm.ko memory region that exposes the DAX window.
>
> Perhaps we need to treat the DAX window like an IOMMU? That way the
> virtio-fs code can send map/unmap notifications and hw/vfio/ can
> propagate them to the host kernel.

Sounds right. One more thing to mention is that we may need to avoid tearing
down the whole old DMA region when resizing the PROT_READ|PROT_WRITE region
into e.g. a bigger one to cover some of the previusly PROT_NONE part, as long
as if the before-resizing region is still possible to be accessed from any
hardware. It smells like something David is working with virtio-mem, not sure
whether there's any common infrastructure that could be shared.

Thanks,
On Thu, 3 Dec 2020 10:43:22 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Dec 03, 2020 at 11:20:02AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
> > > On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> > > > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > > > > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > > > PROT_NONE
> > > > > >
> > > > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > > > > region of that cache window with read or write permission. I guess this might
> > > > > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> > > > >
> > > > > Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
> > > > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > > > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
> > > >
> > > > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > > > space. That way future mmap(NULL, ...) calls will not accidentally
> > > > allocate an address from the reserved range.
> > > >
> > > > virtio-fs needs to do this because the DAX window mappings change at
> > > > runtime. Initially the entire DAX window is just reserved using
> > > > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > > > window an mmap(fixed_addr, ...) call will be made.
> > >
> > > Yes I can understand the rational on why the region is reserved. However IMHO
> > > the real question is why such reservation behavior should affect qemu memory
> > > layout, and even further to VFIO mappings.
> > >
> > > Note that PROT_NONE should likely mean that there's no backing page at all in
> > > this case. Since vfio will pin all the pages before mapping the DMAs, it also
> > > means that it's at least inefficient, because when we try to map all the
> > > PROT_NONE pages we'll try to fault in every single page of it, even if they may
> > > not ever be used.
> > >
> > > So I still think this patch is not doing the right thing. Instead we should
> > > somehow teach qemu that the virtiofs memory region should only be the size of
> > > enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> > > PROT_NONE region.
> >
> > virtio-fs was not implemented with IOMMUs in mind. The idea is just to
> > install a kvm.ko memory region that exposes the DAX window.
> >
> > Perhaps we need to treat the DAX window like an IOMMU? That way the
> > virtio-fs code can send map/unmap notifications and hw/vfio/ can
> > propagate them to the host kernel.
>
> Sounds right. One more thing to mention is that we may need to avoid tearing
> down the whole old DMA region when resizing the PROT_READ|PROT_WRITE region
> into e.g. a bigger one to cover some of the previusly PROT_NONE part, as long
> as if the before-resizing region is still possible to be accessed from any
> hardware. It smells like something David is working with virtio-mem, not sure
> whether there's any common infrastructure that could be shared.

Yes, very similar to his RamDiscardMgr which works on a granularity
basis. vfio maps in granularity chunks and unmaps can span multiple
chunks. This usage might be similar enough to use as-is. Thanks,

Alex
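A rough userspace sketch of the granularity idea Alex mentions, assuming a caller-chosen chunk size (all names and values here are placeholders): each chunk gets its own VFIO_IOMMU_MAP_DMA, so a later VFIO_IOMMU_UNMAP_DMA whose range covers several whole chunks can remove them together.

```c
/*
 * Sketch of granularity-based mapping with a caller-chosen chunk size.
 * The structs and ioctls are from the vfio uapi; everything else is a
 * placeholder.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_in_chunks(int container_fd, void *vaddr, uint64_t iova,
			 uint64_t size, uint64_t chunk_size)
{
	for (uint64_t off = 0; off < size; off += chunk_size) {
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
			.vaddr = (uintptr_t)vaddr + off,
			.iova  = iova + off,
			.size  = chunk_size,
		};
		/* One mapping per chunk, so chunks can be torn down independently. */
		if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map))
			return -1;
	}
	return 0;
}

static int unmap_span(int container_fd, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = iova,	/* may cover several previously mapped chunks */
		.size  = size,
	};
	return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}
```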
On 03.12.20 16:43, Peter Xu wrote:
> On Thu, Dec 03, 2020 at 11:20:02AM +0000, Stefan Hajnoczi wrote:
>> On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
>>> On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
>>>> On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
>>>>> On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
>>>>>>> I'd appreciate if you could explain why vfio needs to dma map some
>>>>>>> PROT_NONE
>>>>>>
>>>>>> Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
>>>>>> region of that cache window with read or write permission. I guess this might
>>>>>> be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
>>>>>
>>>>> Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
>>>>> whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
>>>>> PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
>>>>
>>>> Userspace processes sometimes use PROT_NONE to reserve virtual address
>>>> space. That way future mmap(NULL, ...) calls will not accidentally
>>>> allocate an address from the reserved range.
>>>>
>>>> virtio-fs needs to do this because the DAX window mappings change at
>>>> runtime. Initially the entire DAX window is just reserved using
>>>> PROT_NONE. When it's time to mmap a portion of a file into the DAX
>>>> window an mmap(fixed_addr, ...) call will be made.
>>>
>>> Yes I can understand the rational on why the region is reserved. However IMHO
>>> the real question is why such reservation behavior should affect qemu memory
>>> layout, and even further to VFIO mappings.
>>>
>>> Note that PROT_NONE should likely mean that there's no backing page at all in
>>> this case. Since vfio will pin all the pages before mapping the DMAs, it also
>>> means that it's at least inefficient, because when we try to map all the
>>> PROT_NONE pages we'll try to fault in every single page of it, even if they may
>>> not ever be used.
>>>
>>> So I still think this patch is not doing the right thing. Instead we should
>>> somehow teach qemu that the virtiofs memory region should only be the size of
>>> enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
>>> PROT_NONE region.
>>
>> virtio-fs was not implemented with IOMMUs in mind. The idea is just to
>> install a kvm.ko memory region that exposes the DAX window.
>>
>> Perhaps we need to treat the DAX window like an IOMMU? That way the
>> virtio-fs code can send map/unmap notifications and hw/vfio/ can
>> propagate them to the host kernel.
>
> Sounds right. One more thing to mention is that we may need to avoid tearing
> down the whole old DMA region when resizing the PROT_READ|PROT_WRITE region
> into e.g. a bigger one to cover some of the previusly PROT_NONE part, as long
> as if the before-resizing region is still possible to be accessed from any
> hardware. It smells like something David is working with virtio-mem, not sure
> whether there's any common infrastructure that could be shared.

"somehow teach qemu that the virtiofs memory region should only be the size of
enabled regions" - for virtio-mem, I'm working on resizeable RAM blocks/RAM
memory regions. Fairly complicated, that's why I deferred upstreaming it and
still need to implement plenty of special cases for atomic resizes (e.g.,
vhost-user). But it's only one part of the puzzle for virtio-fs.
AFAIU, it's not only about resizing the region for virtio-fs - we can have
PROT_NONE holes anywhere inside there.

In vfio, you cannot shrink mappings atomically. Growing works, but requires
additional mappings (-> bad). So assume you mapped a file with size X and want
to resize it. You first have to unmap + remap with the new size. This is not
atomic, thus problematic.

For virtio-mem, we have a fix block size and can map/unmap in that granularity
whenever we populate/discard memory within the device-managed region. I don't
think that applies for files in case of virtio-fs.

The real question is: do we even *need* DMA from vfio devices to virtio-fs
regions? If not (do guests rely on it? what does the spec state?), just don't
care about vfio at all and don't map anything. But maybe I am missing
something important

Q1: Is the virtio-fs region mapped into system address space, so we have to
    map everything without a vIOMMU?
Q2: Is DMA from vfio devices to virtio-fs regions a valid use case?
On Thu, Dec 03, 2020 at 05:01:32PM +0100, David Hildenbrand wrote:
> The real question is: do we even *need* DMA from vfio devices to
> virtio-fs regions? If not (do guests rely on it? what does the spec
> state?), just don't care about vfio at all and don't map anything.

Can DMA to/from the virtio-fs DAX Window happen? Yes, the guest could do
it although it's rare.

Is it a great idea to combine VFIO and virtio-fs DAX? Maybe not. It
involves pinning host page cache pages O_o.

Stefan
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 67e827638995..33faa6b7dbd4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		flags |= FOLL_WRITE;
 
 	mmap_read_lock(mm);
-	ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
+	ret = pin_user_pages_remote(mm, vaddr, 1,
+				    flags | FOLL_LONGTERM | FOLL_FORCE,
 				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
The permission of vfio iommu is different and incompatible with vma
permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
map a dma region with NONE permission.

This corner case will be exposed in coming virtio_fs cache_size
commit [1]
 - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
   memory_region_init_ram_ptr()
 - re-mmap the above area with read/write authority.
 - vfio_dma_map() will be invoked when vfio device is hotplug added.

qemu:
  vfio_listener_region_add()
    vfio_dma_map(..., readonly=false)
      map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
      ioctl(VFIO_IOMMU_MAP_DMA)

kernel:
  vfio_dma_do_map()
    vfio_pin_map_dma()
      vfio_pin_pages_remote()
        vaddr_get_pfn()
          ...
            check_vma_flags() failed! because
            vm_flags hasn't VM_WRITE && gup_flags
            has FOLL_WRITE

It will report error in qemu log when hotplug adding(vfio) a nvme disk
to qemu guest on an Ampere EMAG server:
  "VFIO_MAP_DMA failed: Bad address"

[1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502

Signed-off-by: Jia He <justin.he@arm.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)