vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()

Message ID 20201119142737.17574-1-justin.he@arm.com (mailing list archive)
State New, archived
Series vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()

Commit Message

Jia He Nov. 19, 2020, 2:27 p.m. UTC
The vfio iommu permissions are different from, and incompatible with, the vma
permissions. If iotlb->perm is IOMMU_NONE (e.g. on the qemu side), qemu
simply calls the unmap ioctl() instead of mapping. Hence vfio_dma_map() cannot
map a dma region with NONE permission.

This corner case will be exposed by the upcoming virtio_fs cache_size
commit [1]:
 - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
   memory_region_init_ram_ptr()
 - re-mmap the above area with read/write permission.
 - vfio_dma_map() is invoked when the vfio device is hotplug-added.

qemu:
vfio_listener_region_add()
	vfio_dma_map(..., readonly=false)
		map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
		ioctl(VFIO_IOMMU_MAP_DMA)

kernel:
vfio_dma_do_map()
	vfio_pin_map_dma()
		vfio_pin_pages_remote()
			vaddr_get_pfn()
			...
				check_vma_flags() failed! because
				vm_flags hasn't VM_WRITE && gup_flags
				has FOLL_WRITE

The error below is reported in the qemu log when hotplug-adding (via vfio) an
nvme disk to a qemu guest on an Ampere EMAG server:
"VFIO_MAP_DMA failed: Bad address"

[1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502
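
For reference, a minimal userspace sketch of the sequence above (illustrative
only: VFIO container/group setup is omitted, container_fd is assumed to already
be a configured type1 container, and the sizes are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static int dma_map_rw(int container_fd, void *vaddr, size_t size, uint64_t iova)
{
	struct vfio_iommu_type1_dma_map map;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	/* qemu sets READ|WRITE for any readonly=false region */
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uintptr_t)vaddr;
	map.iova  = iova;
	map.size  = size;

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

static void reproduce(int container_fd)
{
	size_t size = 64UL << 20;

	/* 1. reserve the whole area with no permissions */
	void *buf = mmap(NULL, size, PROT_NONE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	/* 2. remap only a sub-range read/write */
	mmap(buf, 2UL << 20, PROT_READ | PROT_WRITE,
	     MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);

	/* 3. mapping the whole region fails: the PROT_NONE part has no
	 *    VM_WRITE, so the FOLL_WRITE pin is rejected and the ioctl
	 *    returns -EFAULT ("Bad address") */
	if (dma_map_rw(container_fd, buf, size, 0) < 0)
		perror("VFIO_IOMMU_MAP_DMA");
}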

Signed-off-by: Jia He <justin.he@arm.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Alex Williamson Nov. 19, 2020, 5:05 p.m. UTC | #1
On Thu, 19 Nov 2020 22:27:37 +0800
Jia He <justin.he@arm.com> wrote:

> The permission of vfio iommu is different and incompatible with vma
> permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> map a dma region with NONE permission.
> 
> This corner case will be exposed in coming virtio_fs cache_size
> commit [1]
>  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>    memory_region_init_ram_ptr()
>  - re-mmap the above area with read/write authority.
>  - vfio_dma_map() will be invoked when vfio device is hotplug added.
> 
> qemu:
> vfio_listener_region_add()
> 	vfio_dma_map(..., readonly=false)
> 		map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
> 		ioctl(VFIO_IOMMU_MAP_DMA)
> 
> kernel:
> vfio_dma_do_map()
> 	vfio_pin_map_dma()
> 		vfio_pin_pages_remote()
> 			vaddr_get_pfn()
> 			...
> 				check_vma_flags() failed! because
> 				vm_flags hasn't VM_WRITE && gup_flags
> 				has FOLL_WRITE
> 
> It will report error in qemu log when hotplug adding(vfio) a nvme disk
> to qemu guest on an Ampere EMAG server:
> "VFIO_MAP_DMA failed: Bad address"

I don't fully understand the argument here, I think this is suggesting
that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has
NONE permission, the kernel can ignore read/write permission by using
FOLL_FORCE.  Not only is QEMU not the only userspace driver for vfio,
but regardless of that, we can't trust the behavior of any given
userspace driver.  Bypassing the permission check with FOLL_FORCE seems
like it's placing the trust in the user, which seems like a security
issue.  Thanks,

Alex


> [1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502
> 
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 67e827638995..33faa6b7dbd4 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  		flags |= FOLL_WRITE;
>  
>  	mmap_read_lock(mm);
> -	ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
> +	ret = pin_user_pages_remote(mm, vaddr, 1,
> +				    flags | FOLL_LONGTERM | FOLL_FORCE,
>  				    page, NULL, NULL);
>  	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
Jia He Nov. 23, 2020, 2:37 a.m. UTC | #2
Hi Alex, thanks for the comments.
Please see my replies below:

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, November 20, 2020 1:05 AM
> To: Justin He <Justin.He@arm.com>
> Cc: Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in
> vfio_pin_pages_remote()
>
> On Thu, 19 Nov 2020 22:27:37 +0800
> Jia He <justin.he@arm.com> wrote:
>
> > The permission of vfio iommu is different and incompatible with vma
> > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> > map a dma region with NONE permission.
> >
> > This corner case will be exposed in coming virtio_fs cache_size
> > commit [1]
> >  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >    memory_region_init_ram_ptr()
> >  - re-mmap the above area with read/write authority.
> >  - vfio_dma_map() will be invoked when vfio device is hotplug added.
> >
> > qemu:
> > vfio_listener_region_add()
> > 	vfio_dma_map(..., readonly=false)
> > 		map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
> > 		ioctl(VFIO_IOMMU_MAP_DMA)
> >
> > kernel:
> > vfio_dma_do_map()
> > 	vfio_pin_map_dma()
> > 		vfio_pin_pages_remote()
> > 			vaddr_get_pfn()
> > 			...
> > 				check_vma_flags() failed! because
> > 				vm_flags hasn't VM_WRITE && gup_flags
> > 				has FOLL_WRITE
> >
> > It will report error in qemu log when hotplug adding(vfio) a nvme disk
> > to qemu guest on an Ampere EMAG server:
> > "VFIO_MAP_DMA failed: Bad address"
>
> I don't fully understand the argument here, I think this is suggesting
> that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has
> NONE permission, the kernel can ignore read/write permission by using
> FOLL_FORCE.  Not only is QEMU not the only userspace driver for vfio,
> but regardless of that, we can't trust the behavior of any given
> userspace driver.  Bypassing the permission check with FOLL_FORCE seems
> like it's placing the trust in the user, which seems like a security
> issue.  Thanks,
Yes, this might have a security impact.
But besides this simple fix (adding FOLL_FORCE), do you think it would be a
good idea for qemu to provide a special vfio_dma_map_none_perm() to allow
mapping a region with NONE permission?

Thanks for any suggestion.

--
Cheers,
Justin (Jia He)
>
> Alex
>
>
> > [1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-
> dev/hw/virtio/vhost-user-fs.c#L502
> >
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> > index 67e827638995..33faa6b7dbd4 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm,
> unsigned long vaddr,
> >  flags |= FOLL_WRITE;
> >
> >  mmap_read_lock(mm);
> > -ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
> > +ret = pin_user_pages_remote(mm, vaddr, 1,
> > +    flags | FOLL_LONGTERM | FOLL_FORCE,
> >      page, NULL, NULL);
> >  if (ret == 1) {
> >  *pfn = page_to_pfn(page[0]);

Alex Williamson Nov. 24, 2020, 5:07 p.m. UTC | #3
On Mon, 23 Nov 2020 02:37:32 +0000
Justin He <Justin.He@arm.com> wrote:

> Hi Alex, thanks for the comments.
> See mine below:
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, November 20, 2020 1:05 AM
> > To: Justin He <Justin.He@arm.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org
> > Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in
> > vfio_pin_pages_remote()
> >
> > On Thu, 19 Nov 2020 22:27:37 +0800
> > Jia He <justin.he@arm.com> wrote:
> >  
> > > The permission of vfio iommu is different and incompatible with vma
> > > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> > > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> > > map a dma region with NONE permission.
> > >
> > > This corner case will be exposed in coming virtio_fs cache_size
> > > commit [1]
> > >  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> > >    memory_region_init_ram_ptr()
> > >  - re-mmap the above area with read/write authority.
> > >  - vfio_dma_map() will be invoked when vfio device is hotplug added.
> > >
> > > qemu:
> > > vfio_listener_region_add()
> > > vfio_dma_map(..., readonly=false)
> > > map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
> > > ioctl(VFIO_IOMMU_MAP_DMA)
> > >
> > > kernel:
> > > vfio_dma_do_map()
> > > vfio_pin_map_dma()
> > > vfio_pin_pages_remote()
> > > vaddr_get_pfn()
> > > ...
> > > check_vma_flags() failed! because
> > > vm_flags hasn't VM_WRITE && gup_flags
> > > has FOLL_WRITE
> > >
> > > It will report error in qemu log when hotplug adding(vfio) a nvme disk
> > > to qemu guest on an Ampere EMAG server:
> > > "VFIO_MAP_DMA failed: Bad address"  
> >
> > I don't fully understand the argument here, I think this is suggesting
> > that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has
> > NONE permission, the kernel can ignore read/write permission by using
> > FOLL_FORCE.  Not only is QEMU not the only userspace driver for vfio,
> > but regardless of that, we can't trust the behavior of any given
> > userspace driver.  Bypassing the permission check with FOLL_FORCE seems
> > like it's placing the trust in the user, which seems like a security
> > issue.  Thanks,  
> Yes, this might have side impact on security.
> But besides this simple fix(adding FOLL_FORCE), do you think it is a good
> idea that:
> Qemu provides a special vfio_dma_map_none_perm() to allow mapping a
> region with NONE permission?

If NONE permission implies that we use FOLL_FORCE as described here to
ignore the r+w permissions and trust that the user knows what they're
doing, that seems like a non-starter.  Ultimately I think what you're
describing is a scenario where our current permission check fails and
the solution is probably to extend the check to account for other ways
that a user may have access to a vma rather than bypass the check.
Thanks,

Alex
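
Purely as an illustration of "extend the check rather than bypass it" (this is
not merged code, and the helper name is invented), the direction could look
like only relaxing the pin for vmas where write access is at least attainable:

/* Hypothetical sketch only: instead of passing FOLL_FORCE unconditionally,
 * vaddr_get_pfn() could first ask whether write access to this vma is at
 * least attainable.  vfio_vma_may_write() is an invented name. */
static bool vfio_vma_may_write(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_WRITE)
		return true;
	/* a private mapping that mprotect(PROT_WRITE) could legally make
	 * writable, e.g. the PROT_NONE-then-remap pattern discussed here */
	return (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
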
Peter Xu Nov. 24, 2020, 6:12 p.m. UTC | #4
Hi, Jia,

On Thu, Nov 19, 2020 at 10:27:37PM +0800, Jia He wrote:
> The permission of vfio iommu is different and incompatible with vma
> permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> map a dma region with NONE permission.
> 
> This corner case will be exposed in coming virtio_fs cache_size
> commit [1]
>  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>    memory_region_init_ram_ptr()
>  - re-mmap the above area with read/write authority.

If iiuc here we'll remap the above PROT_NONE into PROT_READ|PROT_WRITE, then...

>  - vfio_dma_map() will be invoked when vfio device is hotplug added.

... here I'm slightly confused about why VFIO_IOMMU_MAP_DMA would encounter a
vma check failure - haven't those pages already got rw permissions?

I'd appreciate it if you could explain why vfio needs to dma map some PROT_NONE
pages in the first place, and whether QEMU would be able to postpone the vfio
map of those PROT_NONE pages until they actually gain RW permissions.

Thanks,
Jia He Nov. 25, 2020, 1:05 a.m. UTC | #5
Hi Peter

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, November 25, 2020 2:12 AM
> To: Justin He <Justin.He@arm.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in
> vfio_pin_pages_remote()
>
> Hi, Jia,
>
> On Thu, Nov 19, 2020 at 10:27:37PM +0800, Jia He wrote:
> > The permission of vfio iommu is different and incompatible with vma
> > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will
> > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't
> > map a dma region with NONE permission.
> >
> > This corner case will be exposed in coming virtio_fs cache_size
> > commit [1]
> >  - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >    memory_region_init_ram_ptr()
> >  - re-mmap the above area with read/write authority.
>
> If iiuc here we'll remap the above PROT_NONE into PROT_READ|PROT_WRITE,
> then...
>
> >  - vfio_dma_map() will be invoked when vfio device is hotplug added.
>
> ... here I'm slightly confused on why VFIO_IOMMU_MAP_DMA would encounter
> vma
> check fail - aren't they already get rw permissions?

No, we haven't got the vma rw permission yet, but the mapping permission in
this case defaults to rw.

When the qemu side invokes vfio_dma_map(), the iommu rw permissions are
automatically added [1] [2] (currently, mapping a NONE region is not supported
in qemu vfio).
[1] https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=6ff1daa763f87a1ed5351bcc19aeb027c43b8a8f;hb=HEAD#l479
[2] https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=6ff1daa763f87a1ed5351bcc19aeb027c43b8a8f;hb=HEAD#l486

But on the kernel side, the vma was created with PROT_NONE permission.

Hence the check in check_vma_flags() at [3] fails.
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c#n929
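
For reference, the failing check behaves roughly as sketched below (a
paraphrased, heavily simplified reading of check_vma_flags() in mm/gup.c, not
the verbatim source):

/* simplified paraphrase of the FOLL_WRITE handling in check_vma_flags() */
static int check_vma_flags_paraphrased(vm_flags_t vm_flags,
				       unsigned int gup_flags)
{
	if (gup_flags & FOLL_WRITE) {
		if (!(vm_flags & VM_WRITE)) {
			/* without FOLL_FORCE, pinning for write is rejected:
			 * this is the -EFAULT qemu sees as "Bad address" */
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			/* even with FOLL_FORCE, only private COW-able mappings
			 * (VM_MAYWRITE set, VM_SHARED clear) are allowed */
			if (!is_cow_mapping(vm_flags))
				return -EFAULT;
		}
	}
	return 0;
}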

>
> I'd appreciate if you could explain why vfio needs to dma map some
> PROT_NONE

Virtiofs first maps a PROT_NONE cache window region, then remaps sub-regions
of that cache window with read or write permission. I guess this might be a
security concern. CCing virtiofs expert Stefan to answer it more accurately.


--
Cheers,
Justin (Jia He)


> pages after all, and whether QEMU would be able to postpone the vfio map of
> those PROT_NONE pages until they got to become with RW permissions.
>
> Thanks,
>
> --
> Peter Xu

Peter Xu Nov. 25, 2020, 3:57 p.m. UTC | #6
On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > I'd appreciate if you could explain why vfio needs to dma map some
> > PROT_NONE
> 
> Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> region of that cache window with read or write permission. I guess this might
> be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.

Yep.  Since my previous sentence was cut off, I'll rephrase: I was wondering
whether qemu could do the vfio maps only after it remaps the PROT_NONE regions
into PROT_READ|PROT_WRITE ones, rather than trying to map dma pages while they
are still PROT_NONE.

Thanks,
Stefan Hajnoczi Dec. 2, 2020, 2:33 p.m. UTC | #7
On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > I'd appreciate if you could explain why vfio needs to dma map some
> > > PROT_NONE
> > 
> > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > region of that cache window with read or write permission. I guess this might
> > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> 
> Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
> whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.

Userspace processes sometimes use PROT_NONE to reserve virtual address
space. That way future mmap(NULL, ...) calls will not accidentally
allocate an address from the reserved range.

virtio-fs needs to do this because the DAX window mappings change at
runtime. Initially the entire DAX window is just reserved using
PROT_NONE. When it's time to mmap a portion of a file into the DAX
window an mmap(fixed_addr, ...) call will be made.

Stefan
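
A minimal sketch of that reservation pattern (illustrative only; the sizes,
offsets and file path are assumptions, not taken from the virtio-fs code):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t window_size = 1UL << 30;		/* assumed 1 GiB window */

	/* Reserve virtual address space only: no permissions, no access. */
	void *window = mmap(NULL, window_size, PROT_NONE,
			    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (window == MAP_FAILED) {
		perror("reserve window");
		return 1;
	}

	/* Later, map a file section at a fixed offset inside the window. */
	int fd = open("/path/to/backing/file", O_RDWR);	/* assumed path */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *sub = mmap((char *)window + (1UL << 20), 2UL << 20,
			 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
	if (sub == MAP_FAILED) {
		perror("map section");
		return 1;
	}

	/* The rest of the window stays PROT_NONE until it is needed. */
	return 0;
}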
Peter Xu Dec. 2, 2020, 3:45 p.m. UTC | #8
Hi, Stefan,

On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > PROT_NONE
> > > 
> > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > region of that cache window with read or write permission. I guess this might
> > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> > 
> > Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
> > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
> 
> Userspace processes sometimes use PROT_NONE to reserve virtual address
> space. That way future mmap(NULL, ...) calls will not accidentally
> allocate an address from the reserved range.
> 
> virtio-fs needs to do this because the DAX window mappings change at
> runtime. Initially the entire DAX window is just reserved using
> PROT_NONE. When it's time to mmap a portion of a file into the DAX
> window an mmap(fixed_addr, ...) call will be made.

Yes, I can understand the rationale for why the region is reserved.  However IMHO
the real question is why such reservation behavior should affect qemu memory
layout, and even further to VFIO mappings.

Note that PROT_NONE should likely mean that there's no backing page at all in
this case.  Since vfio will pin all the pages before mapping the DMAs, it also
means that it's at least inefficient, because when we try to map all the
PROT_NONE pages we'll try to fault in every single page of it, even if they may
not ever be used.

So I still think this patch is not doing the right thing.  Instead we should
somehow teach qemu that the virtiofs memory region should only be the size of
enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
PROT_NONE region.

Thanks,
Stefan Hajnoczi Dec. 3, 2020, 11:20 a.m. UTC | #9
On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > PROT_NONE
> > > > 
> > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > > region of that cache window with read or write permission. I guess this might
> > > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> > > 
> > > Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
> > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
> > 
> > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > space. That way future mmap(NULL, ...) calls will not accidentally
> > allocate an address from the reserved range.
> > 
> > virtio-fs needs to do this because the DAX window mappings change at
> > runtime. Initially the entire DAX window is just reserved using
> > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > window an mmap(fixed_addr, ...) call will be made.
> 
> Yes I can understand the rational on why the region is reserved.  However IMHO
> the real question is why such reservation behavior should affect qemu memory
> layout, and even further to VFIO mappings.
> 
> Note that PROT_NONE should likely mean that there's no backing page at all in
> this case.  Since vfio will pin all the pages before mapping the DMAs, it also
> means that it's at least inefficient, because when we try to map all the
> PROT_NONE pages we'll try to fault in every single page of it, even if they may
> not ever be used.
> 
> So I still think this patch is not doing the right thing.  Instead we should
> somehow teach qemu that the virtiofs memory region should only be the size of
> enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> PROT_NONE region.

virtio-fs was not implemented with IOMMUs in mind. The idea is just to
install a kvm.ko memory region that exposes the DAX window.

Perhaps we need to treat the DAX window like an IOMMU? That way the
virtio-fs code can send map/unmap notifications and hw/vfio/ can
propagate them to the host kernel.

Stefan
Peter Xu Dec. 3, 2020, 3:43 p.m. UTC | #10
On Thu, Dec 03, 2020 at 11:20:02AM +0000, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
> > On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> > > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > > > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > > PROT_NONE
> > > > > 
> > > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > > > region of that cache window with read or write permission. I guess this might
> > > > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> > > > 
> > > > Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
> > > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
> > > 
> > > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > > space. That way future mmap(NULL, ...) calls will not accidentally
> > > allocate an address from the reserved range.
> > > 
> > > virtio-fs needs to do this because the DAX window mappings change at
> > > runtime. Initially the entire DAX window is just reserved using
> > > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > > window an mmap(fixed_addr, ...) call will be made.
> > 
> > Yes I can understand the rational on why the region is reserved.  However IMHO
> > the real question is why such reservation behavior should affect qemu memory
> > layout, and even further to VFIO mappings.
> > 
> > Note that PROT_NONE should likely mean that there's no backing page at all in
> > this case.  Since vfio will pin all the pages before mapping the DMAs, it also
> > means that it's at least inefficient, because when we try to map all the
> > PROT_NONE pages we'll try to fault in every single page of it, even if they may
> > not ever be used.
> > 
> > So I still think this patch is not doing the right thing.  Instead we should
> > somehow teach qemu that the virtiofs memory region should only be the size of
> > enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> > PROT_NONE region.
> 
> virtio-fs was not implemented with IOMMUs in mind. The idea is just to
> install a kvm.ko memory region that exposes the DAX window.
> 
> Perhaps we need to treat the DAX window like an IOMMU? That way the
> virtio-fs code can send map/unmap notifications and hw/vfio/ can
> propagate them to the host kernel.

Sounds right.  One more thing to mention is that we may need to avoid tearing
down the whole old DMA region when resizing the PROT_READ|PROT_WRITE region
into e.g. a bigger one that covers some of the previously PROT_NONE part, as
long as the before-resizing region can still be accessed from any hardware.
It smells like something David is working on with virtio-mem; not sure
whether there's any common infrastructure that could be shared.

Thanks,
Alex Williamson Dec. 3, 2020, 3:55 p.m. UTC | #11
On Thu, 3 Dec 2020 10:43:22 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Dec 03, 2020 at 11:20:02AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:  
> > > On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:  
> > > > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:  
> > > > > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:  
> > > > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > > > PROT_NONE  
> > > > > > 
> > > > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > > > > region of that cache window with read or write permission. I guess this might
> > > > > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.  
> > > > > 
> > > > > Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
> > > > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > > > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.  
> > > > 
> > > > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > > > space. That way future mmap(NULL, ...) calls will not accidentally
> > > > allocate an address from the reserved range.
> > > > 
> > > > virtio-fs needs to do this because the DAX window mappings change at
> > > > runtime. Initially the entire DAX window is just reserved using
> > > > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > > > window an mmap(fixed_addr, ...) call will be made.  
> > > 
> > > Yes I can understand the rational on why the region is reserved.  However IMHO
> > > the real question is why such reservation behavior should affect qemu memory
> > > layout, and even further to VFIO mappings.
> > > 
> > > Note that PROT_NONE should likely mean that there's no backing page at all in
> > > this case.  Since vfio will pin all the pages before mapping the DMAs, it also
> > > means that it's at least inefficient, because when we try to map all the
> > > PROT_NONE pages we'll try to fault in every single page of it, even if they may
> > > not ever be used.
> > > 
> > > So I still think this patch is not doing the right thing.  Instead we should
> > > somehow teach qemu that the virtiofs memory region should only be the size of
> > > enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> > > PROT_NONE region.  
> > 
> > virtio-fs was not implemented with IOMMUs in mind. The idea is just to
> > install a kvm.ko memory region that exposes the DAX window.
> > 
> > Perhaps we need to treat the DAX window like an IOMMU? That way the
> > virtio-fs code can send map/unmap notifications and hw/vfio/ can
> > propagate them to the host kernel.  
> 
> Sounds right.  One more thing to mention is that we may need to avoid tearing
> down the whole old DMA region when resizing the PROT_READ|PROT_WRITE region
> into e.g. a bigger one to cover some of the previusly PROT_NONE part, as long
> as if the before-resizing region is still possible to be accessed from any
> hardware.  It smells like something David is working with virtio-mem, not sure
> whether there's any common infrastructure that could be shared.

Yes, very similar to his RamDiscardMgr which works on a granularity
basis.  vfio maps in granularity chunks and unmaps can span multiple
chunks.  This usage might be similar enough to use as-is.  Thanks,

Alex
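
To illustrate the granularity idea (a sketch only, not RamDiscardMgr code; the
2 MiB granule, helper names and an already-configured container_fd are
assumptions): if the window is only ever mapped in fixed-size granules, any
granule-aligned sub-range can later be unmapped without disturbing its
neighbours:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define GRANULE	(2UL << 20)	/* assumed 2 MiB map/unmap granularity */

static int map_granules(int container_fd, uint64_t vaddr, uint64_t iova,
			uint64_t size)
{
	/* Map one granule at a time so each becomes its own vfio_dma entry. */
	for (uint64_t off = 0; off < size; off += GRANULE) {
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
			.vaddr = vaddr + off,
			.iova  = iova + off,
			.size  = GRANULE,
		};
		if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0)
			return -1;
	}
	return 0;
}

static int unmap_range(int container_fd, uint64_t iova, uint64_t size)
{
	/* A single unmap may span many previously mapped granules, while the
	 * granules outside [iova, iova + size) stay mapped. */
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = iova,
		.size  = size,
	};
	return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}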
David Hildenbrand Dec. 3, 2020, 4:01 p.m. UTC | #12
On 03.12.20 16:43, Peter Xu wrote:
> On Thu, Dec 03, 2020 at 11:20:02AM +0000, Stefan Hajnoczi wrote:
>> On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
>>> On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
>>>> On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
>>>>> On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
>>>>>>> I'd appreciate if you could explain why vfio needs to dma map some
>>>>>>> PROT_NONE
>>>>>>
>>>>>> Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
>>>>>> region of that cache window with read or write permission. I guess this might
>>>>>> be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
>>>>>
>>>>> Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
>>>>> whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
>>>>> PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
>>>>
>>>> Userspace processes sometimes use PROT_NONE to reserve virtual address
>>>> space. That way future mmap(NULL, ...) calls will not accidentally
>>>> allocate an address from the reserved range.
>>>>
>>>> virtio-fs needs to do this because the DAX window mappings change at
>>>> runtime. Initially the entire DAX window is just reserved using
>>>> PROT_NONE. When it's time to mmap a portion of a file into the DAX
>>>> window an mmap(fixed_addr, ...) call will be made.
>>>
>>> Yes I can understand the rational on why the region is reserved.  However IMHO
>>> the real question is why such reservation behavior should affect qemu memory
>>> layout, and even further to VFIO mappings.
>>>
>>> Note that PROT_NONE should likely mean that there's no backing page at all in
>>> this case.  Since vfio will pin all the pages before mapping the DMAs, it also
>>> means that it's at least inefficient, because when we try to map all the
>>> PROT_NONE pages we'll try to fault in every single page of it, even if they may
>>> not ever be used.
>>>
>>> So I still think this patch is not doing the right thing.  Instead we should
>>> somehow teach qemu that the virtiofs memory region should only be the size of
>>> enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
>>> PROT_NONE region.
>>
>> virtio-fs was not implemented with IOMMUs in mind. The idea is just to
>> install a kvm.ko memory region that exposes the DAX window.
>>
>> Perhaps we need to treat the DAX window like an IOMMU? That way the
>> virtio-fs code can send map/unmap notifications and hw/vfio/ can
>> propagate them to the host kernel.
> 
> Sounds right.  One more thing to mention is that we may need to avoid tearing
> down the whole old DMA region when resizing the PROT_READ|PROT_WRITE region
> into e.g. a bigger one to cover some of the previusly PROT_NONE part, as long
> as if the before-resizing region is still possible to be accessed from any
> hardware.  It smells like something David is working with virtio-mem, not sure
> whether there's any common infrastructure that could be shared.

"somehow teach qemu that the virtiofs memory region should only be the
size of enabled regions" - for virtio-mem, I'm working on resizeable RAM
blocks/RAM memory regions. Fairly complicated, that's why I deferred
upstreaming it and still need to implement plenty of special cases for
atomic resizes (e.g., vhost-user).

But it's only one part of the puzzle for virtio-fs. AFAIU, it's not only
about resizing the region for virtio-fs - we can have PROT_NONE holes
anywhere inside there.

In vfio, you cannot shrink mappings atomically. Growing works, but
requires additional mappings (-> bad). So assume you mapped a file with
size X and want to resize it. You first have to unmap + remap with the
new size. This is not atomic, thus problematic.

For virtio-mem, we have a fixed block size and can map/unmap at that
granularity whenever we populate/discard memory within the
device-managed region. I don't think that applies to files in the case of
virtio-fs.


The real question is: do we even *need* DMA from vfio devices to
virtio-fs regions? If not (do guests rely on it? what does the spec
state?), just don't care about vfio at all and don't map anything.

But maybe I am missing something important.

Q1: Is the virtio-fs region mapped into system address space, so we have
to map everything without a vIOMMU?

Q2: Is DMA from vfio devices to virtio-fs regions a valid use case?
Stefan Hajnoczi Dec. 7, 2020, 2:48 p.m. UTC | #13
On Thu, Dec 03, 2020 at 05:01:32PM +0100, David Hildenbrand wrote:
> The real question is: do we even *need* DMA from vfio devices to
> virtio-fs regions? If not (do guests rely on it? what does the spec
> state?), just don't care about vfio at all and don't map anything.

Can DMA to/from the virtio-fs DAX Window happen? Yes, the guest could do
it although it's rare.

Is it a great idea to combine VFIO and virtio-fs DAX? Maybe not. It
involves pinning host page cache pages O_o.

Stefan

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 67e827638995..33faa6b7dbd4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -453,7 +453,8 @@  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		flags |= FOLL_WRITE;
 
 	mmap_read_lock(mm);
-	ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
+	ret = pin_user_pages_remote(mm, vaddr, 1,
+				    flags | FOLL_LONGTERM | FOLL_FORCE,
 				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);