[3/3] vfio: disable filesystem-dax page pinning

Message ID 151778553083.7139.6601964812589807125.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State Accepted
Commit 94db151dc892

Commit Message

Dan Williams Feb. 4, 2018, 11:05 p.m. UTC
Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding off filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.

RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.

Note that xfs and ext4 still report:

   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)
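
The policy described above can be modeled in userspace. The following is a minimal sketch, not the kernel code: `model_page`, `model_vma`, `model_gup()` and `model_pin_longterm()` are hypothetical stand-ins for `struct page`, `struct vm_area_struct`, `get_user_pages_remote()` and the check this patch adds after it. It assumes that a successful lookup takes one page reference and that a refused pin must drop it again.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct page and vm_area_struct. */
struct model_page {
	int refcount;
};

struct model_vma {
	bool fsdax;			/* models vma_is_fsdax(vma) */
	struct model_page *backing;	/* the single page behind this mapping */
};

/* Models get_user_pages_remote(): take a reference and report the vma. */
static int model_gup(struct model_vma *vma, struct model_page **page,
		     struct model_vma **vmas)
{
	if (!vma || !vma->backing)
		return -EFAULT;
	vma->backing->refcount++;
	page[0] = vma->backing;
	vmas[0] = vma;
	return 1;
}

/*
 * Models the added check: the pin lifetime is controlled by the caller,
 * so a filesystem-DAX mapping is refused and the reference is released.
 */
static int model_pin_longterm(struct model_vma *vma, struct model_page **page)
{
	struct model_vma *vmas[1];
	int ret = model_gup(vma, page, vmas);

	if (ret > 0 && vmas[0]->fsdax) {
		ret = -EOPNOTSUPP;
		page[0]->refcount--;	/* models put_page(page[0]) */
		page[0] = NULL;
	}
	return ret;
}
```

In this model a caller sees either a held reference and a return value of 1, or -EOPNOTSUPP with no reference left behind, so a refused pin cannot leak a page.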

Comments

Haozhong Zhang Feb. 5, 2018, 3:46 a.m. UTC | #1
On 02/04/18 15:05 -0800, Dan Williams wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent the situation of
> userspace holding off filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>  
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);

vmas is not used subsequently if this branch is taken, so can we use
NULL here?

Thanks,
Haozhong

>  	} else {
>  		unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  		down_read(&mm->mmap_sem);
>  		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -					    NULL, NULL);
> +					    vmas, NULL);
> +		/*
> +		 * The lifetime of a vaddr_get_pfn() page pin is
> +		 * userspace-controlled. In the fs-dax case this could
> +		 * lead to indefinite stalls in filesystem operations.
> +		 * Disallow attempts to pin fs-dax pages via this
> +		 * interface.
> +		 */
> +		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> +			ret = -EOPNOTSUPP;
> +			put_page(page[0]);
> +		}
>  		up_read(&mm->mmap_sem);
>  	}
>  
>
Dan Williams Feb. 5, 2018, 3:54 a.m. UTC | #2
On Sun, Feb 4, 2018 at 7:46 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote:
> On 02/04/18 15:05 -0800, Dan Williams wrote:
>> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
>> page cache indirection a DAX mapping maps filesystem blocks directly.
>> This means that the filesystem must not modify a file's block map while
>> any page in a mapping is pinned. In order to prevent the situation of
>> userspace holding off filesystem operations indefinitely, disallow
>> 'longterm' Filesystem-DAX mappings.
>>
>> RDMA has the same conflict and the plan there is to add a 'with lease'
>> mechanism to allow the kernel to notify userspace that the mapping is
>> being torn down for block-map maintenance. Perhaps something similar can
>> be put in place for vfio.
>>
>> Note that xfs and ext4 still report:
>>
>>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>>
>> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
>> of the last hurdles to remove that designation.
>>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: kvm@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e30e29ae4819..45657e2b1ff7 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  {
>>       struct page *page[1];
>>       struct vm_area_struct *vma;
>> +     struct vm_area_struct *vmas[1];
>>       int ret;
>>
>>       if (mm == current->mm) {
>> -             ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
>> -                                       page);
>> +             ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
>> +                                           page, vmas);
>
> vmas is not used subsequently if this branch is taken, so can we use
> NULL here?

I'd rather go the other way and refactor this a bit further to skip
the find_vma_intersection() below since get_user_pages() already does
that work.
Alex Williamson Feb. 5, 2018, 9:44 p.m. UTC | #3
On Sun, 04 Feb 2018 15:05:30 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent the situation of
> userspace holding off filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)

This isn't free: a vfio mapping and unmapping unit test shows a ~1.5%
increase in system time from losing access to gup_fast(). Also, I think
tce_iommu_use_page() is going to have the same problem, since it
provides the same sort of functionality for a different vfio IOMMU
backend.  Please take this through your tree and I'll add a todo list
item to see how we might improve this.

Acked-by: Alex Williamson <alex.williamson@redhat.com>

Thanks,
Alex

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>  
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);
>  	} else {
>  		unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  		down_read(&mm->mmap_sem);
>  		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -					    NULL, NULL);
> +					    vmas, NULL);
> +		/*
> +		 * The lifetime of a vaddr_get_pfn() page pin is
> +		 * userspace-controlled. In the fs-dax case this could
> +		 * lead to indefinite stalls in filesystem operations.
> +		 * Disallow attempts to pin fs-dax pages via this
> +		 * interface.
> +		 */
> +		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> +			ret = -EOPNOTSUPP;
> +			put_page(page[0]);
> +		}
>  		up_read(&mm->mmap_sem);
>  	}
>  
>
Dan Williams Feb. 5, 2018, 10:01 p.m. UTC | #4
On Mon, Feb 5, 2018 at 1:44 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Sun, 04 Feb 2018 15:05:30 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
>> page cache indirection a DAX mapping maps filesystem blocks directly.
>> This means that the filesystem must not modify a file's block map while
>> any page in a mapping is pinned. In order to prevent the situation of
>> userspace holding off filesystem operations indefinitely, disallow
>> 'longterm' Filesystem-DAX mappings.
>>
>> RDMA has the same conflict and the plan there is to add a 'with lease'
>> mechanism to allow the kernel to notify userspace that the mapping is
>> being torn down for block-map maintenance. Perhaps something similar can
>> be put in place for vfio.
>>
>> Note that xfs and ext4 still report:
>>
>>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>>
>> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
>> of the last hurdles to remove that designation.
>>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: kvm@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> This isn't free: a vfio mapping and unmapping unit test shows a ~1.5%
> increase in system time from losing access to gup_fast(). Also, I think
> tce_iommu_use_page() is going to have the same problem, since it
> provides the same sort of functionality for a different vfio IOMMU
> backend.  Please take this through your tree and I'll add a todo list
> item to see how we might improve this.
>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>

Thanks Alex.
Haozhong Zhang Feb. 6, 2018, 7:53 a.m. UTC | #5
Hi Dan,

On 02/04/18 15:05 -0800, Dan Williams wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent the situation of
> userspace holding off filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: <stable@vger.kernel.org>
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>  
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);
>  	} else {
>  		unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>  		down_read(&mm->mmap_sem);
>  		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -					    NULL, NULL);
> +					    vmas, NULL);
> +		/*
> +		 * The lifetime of a vaddr_get_pfn() page pin is
> +		 * userspace-controlled. In the fs-dax case this could
> +		 * lead to indefinite stalls in filesystem operations.
> +		 * Disallow attempts to pin fs-dax pages via this
> +		 * interface.
> +		 */
> +		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> +			ret = -EOPNOTSUPP;
> +			put_page(page[0]);
> +		}
>  		up_read(&mm->mmap_sem);
>  	}
>  
> 

Besides this patch series, are any other patches needed to make
vma_is_fsdax() work with device-dax?

I applied this patch series on the libnvdimm-for-next branch of the
nvdimm tree (ee95f4059a83), and found that it also breaks device-dax
mapping with vfio. It can be reproduced with the following steps:

1. Attach PCI device at BDF 0000:03:10.2 to vfio-pci.
   # modprobe vfio-pci
   # lspci -n -s 0000:03:10.2
   03:10.2 0200: 8086:1515 (rev 01)
   # echo 0000:03:10.2 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
   # echo 8086:1515 > /sys/bus/pci/drivers/vfio-pci/new_id

2. Use RAM to emulate NVDIMM and create a device-dax device /dev/dax0.0
   # cat /proc/iomem
   ...
   100000000-2ffffffff : Persistent Memory (legacy)
     100000000-2ffffffff : namespace0.0
   ...

   # ndctl create-namespace -f -e namespace0.0 -m dax
   {
     "dev":"namespace0.0",
     "mode":"dax",
     "size":8453619712,
     "uuid":"e1db00bc-f830-4f1b-ac18-091ae7df4f93",
     "daxdevs":[
       {
         "chardev":"dax0.0",
         "size":8453619712
       }
     ]
   }

3. Create a VM with the PCI device from step 1 and the device-dax
   device from step 2.
   # qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on -smp host \
                        -m 4G,slots=32,maxmem=128G \
                        -drive file=VM_DISK_IMG.img,format=raw,if=virtio \
                        -object memory-backend-file,id=nv_be1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M \
                        -device nvdimm,id=nv1,memdev=nv_be1 \
                        -device ioh3420,id=root.0,slot=4 \
                        -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6

   It then fails with the following QEMU error messages:
     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: VFIO_MAP_DMA: -95
     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio_dma_map(0x5643804a92c0, 0x140000000, 0xffe00000, 0x7f2ed5200000) = -95 (Operation not supported)
     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio error: 0000:03:10.2: failed to setup container for group 52: memory listener initialization failed for container: Operation not supported

   I added the following debug message after the
   get_user_pages_longterm() call in this patch:
       if (vmas[0] && vma_is_dax(vmas[0]))
               printk(KERN_DEBUG "%s: longterm failed for pfn 0x%lx, ret %d\n",
                      __func__, page_to_pfn(page[0]), ret);
   It shows that get_user_pages_longterm() returns -EOPNOTSUPP on the
   first device-dax page mapping.



Haozhong
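
The thread does not identify why a device-dax mapping is treated like fs-dax. One hypothesis, an assumption that nothing above confirms, is that the helper tells the two apart by checking whether the backing inode is a character device, and that the comparison does not mask off the permission bits. The sketch below is plain userspace C and only illustrates that class of mistake. The `MODEL_*` constants and both function names are hypothetical; the constant values mirror the usual Linux file-type bits.

```c
#include <stdbool.h>

/* Hypothetical model constants mirroring the usual Linux mode layout. */
#define MODEL_IFMT	0170000u	/* mask selecting the file-type bits */
#define MODEL_IFCHR	0020000u	/* character device */
#define MODEL_IFREG	0100000u	/* regular file */

/* Exact comparison: true only if no permission bits are set at all. */
static bool is_chardev_exact(unsigned int mode)
{
	return mode == MODEL_IFCHR;
}

/* Masked comparison: ignores permission bits, like S_ISCHR() does. */
static bool is_chardev_masked(unsigned int mode)
{
	return (mode & MODEL_IFMT) == MODEL_IFCHR;
}
```

A device node such as /dev/dax0.0 normally carries permission bits, so under this assumption the exact comparison would report "not a character device" and the mapping would fall through to the fs-dax case, while the masked comparison would classify it correctly.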
Dan Williams Feb. 6, 2018, 3:09 p.m. UTC | #6
On Mon, Feb 5, 2018 at 11:53 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Hi Dan,
>
> On 02/04/18 15:05 -0800, Dan Williams wrote:
>> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
>> page cache indirection a DAX mapping maps filesystem blocks directly.
>> This means that the filesystem must not modify a file's block map while
>> any page in a mapping is pinned. In order to prevent the situation of
>> userspace holding off filesystem operations indefinitely, disallow
>> 'longterm' Filesystem-DAX mappings.
>>
>> RDMA has the same conflict and the plan there is to add a 'with lease'
>> mechanism to allow the kernel to notify userspace that the mapping is
>> being torn down for block-map maintenance. Perhaps something similar can
>> be put in place for vfio.
>>
>> Note that xfs and ext4 still report:
>>
>>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>>
>> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
>> of the last hurdles to remove that designation.
>>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: kvm@vger.kernel.org
>> Cc: <stable@vger.kernel.org>
>> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e30e29ae4819..45657e2b1ff7 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  {
>>       struct page *page[1];
>>       struct vm_area_struct *vma;
>> +     struct vm_area_struct *vmas[1];
>>       int ret;
>>
>>       if (mm == current->mm) {
>> -             ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
>> -                                       page);
>> +             ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
>> +                                           page, vmas);
>>       } else {
>>               unsigned int flags = 0;
>>
>> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>
>>               down_read(&mm->mmap_sem);
>>               ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
>> -                                         NULL, NULL);
>> +                                         vmas, NULL);
>> +             /*
>> +              * The lifetime of a vaddr_get_pfn() page pin is
>> +              * userspace-controlled. In the fs-dax case this could
>> +              * lead to indefinite stalls in filesystem operations.
>> +              * Disallow attempts to pin fs-dax pages via this
>> +              * interface.
>> +              */
>> +             if (ret > 0 && vma_is_fsdax(vmas[0])) {
>> +                     ret = -EOPNOTSUPP;
>> +                     put_page(page[0]);
>> +             }
>>               up_read(&mm->mmap_sem);
>>       }
>>
>>
>
> Besides this patch series, are any other patches needed to make
> vma_is_fsdax() work with device-dax?
>
> I applied this patch series on the libnvdimm-for-next branch of the
> nvdimm tree (ee95f4059a83), and found that it also breaks device-dax
> mapping with vfio. It can be reproduced with the following steps:
>
> 1. Attach PCI device at BDF 0000:03:10.2 to vfio-pci.
>    # modprobe vfio-pci
>    # lspci -n -s 0000:03:10.2
>    03:10.2 0200: 8086:1515 (rev 01)
>    # echo 0000:03:10.2 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
>    # echo 8086:1515 > /sys/bus/pci/drivers/vfio-pci/new_id
>
> 2. Use RAM to emulate NVDIMM and create a device-dax device /dev/dax0.0
>    # cat /proc/iomem
>    ...
>    100000000-2ffffffff : Persistent Memory (legacy)
>      100000000-2ffffffff : namespace0.0
>    ...
>
>    # ndctl create-namespace -f -e namespace0.0 -m dax
>    {
>      "dev":"namespace0.0",
>      "mode":"dax",
>      "size":8453619712,
>      "uuid":"e1db00bc-f830-4f1b-ac18-091ae7df4f93",
>      "daxdevs":[
>        {
>          "chardev":"dax0.0",
>          "size":8453619712
>        }
>      ]
>    }
>
> 3. Create a VM with the PCI device from step 1 and the device-dax
>    device from step 2.
>    # qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on -smp host \
>                         -m 4G,slots=32,maxmem=128G \
>                         -drive file=VM_DISK_IMG.img,format=raw,if=virtio \
>                         -object memory-backend-file,id=nv_be1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M \
>                         -device nvdimm,id=nv1,memdev=nv_be1 \
>                         -device ioh3420,id=root.0,slot=4 \
>                         -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6
>
>    It then fails with the following QEMU error messages:
>      qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: VFIO_MAP_DMA: -95
>      qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio_dma_map(0x5643804a92c0, 0x140000000, 0xffe00000, 0x7f2ed5200000) = -95 (Operation not supported)
>      qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio error: 0000:03:10.2: failed to setup container for group 52: memory listener initialization failed for container: Operation not supported
>
>    I added the following debug message after the
>    get_user_pages_longterm() call in this patch:
>        if (vmas[0] && vma_is_dax(vmas[0]))
>                printk(KERN_DEBUG "%s: longterm failed for pfn 0x%lx, ret %d\n",
>                       __func__, page_to_pfn(page[0]), ret);
>    It shows that get_user_pages_longterm() returns -EOPNOTSUPP on the
>    first device-dax page mapping.

Thanks for that thorough debug, I'll take a look today.

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e30e29ae4819..45657e2b1ff7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -338,11 +338,12 @@  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct vm_area_struct *vmas[1];
 	int ret;
 
 	if (mm == current->mm) {
-		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
-					  page);
+		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
+					      page, vmas);
 	} else {
 		unsigned int flags = 0;
 
@@ -351,7 +352,18 @@  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 		down_read(&mm->mmap_sem);
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    NULL, NULL);
+					    vmas, NULL);
+		/*
+		 * The lifetime of a vaddr_get_pfn() page pin is
+		 * userspace-controlled. In the fs-dax case this could
+		 * lead to indefinite stalls in filesystem operations.
+		 * Disallow attempts to pin fs-dax pages via this
+		 * interface.
+		 */
+		if (ret > 0 && vma_is_fsdax(vmas[0])) {
+			ret = -EOPNOTSUPP;
+			put_page(page[0]);
+		}
 		up_read(&mm->mmap_sem);
 	}