[v2] vfio/pci: Handle concurrent vma faults

Message ID	161540257788.10151.6284852774772157400.stgit@gimli.home (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> Subject: [PATCH v2] vfio/pci: Handle concurrent vma faults From: Alex Williamson <alex.williamson@redhat.com> To: alex.williamson@redhat.com Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, jgg@nvidia.com, peterx@redhat.com, prime.zeng@hisilicon.com, cohuck@redhat.com Date: Wed, 10 Mar 2021 11:58:07 -0700 Message-ID: <161540257788.10151.6284852774772157400.stgit@gimli.home> User-Agent: StGit/0.21-2-g8ef5 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk
Series	[v2] vfio/pci: Handle concurrent vma faults \| expand [v2] vfio/pci: Handle concurrent vma faults

Alex Williamson March 10, 2021, 6:58 p.m. UTC

vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
from within a vm_ops fault handler.  This function will trigger a
BUG_ON if it encounters a populated pte within the remapped range,
where any fault is meant to populate the entire vma.  Concurrent
inflight faults to the same vma will therefore hit this issue,
triggering traces such as:

[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name:  , BIOS HixxxxFPGA 1P B600 V121-1
[ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : ffff80001068bbd0
[ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
[ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
[ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
[ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
[ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
[ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
[ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
[ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
[ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
[ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
[ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
[ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
[ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
[ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
[ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)

Switch to using vmf_insert_pfn() to allow replacing mappings, and
include decrypted memory protection as formerly provided by
io_remap_pfn_range().  Tracking of vmas is also updated to
prevent duplicate entries.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Suggested-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

v2: Set decrypted pgprot in mmap, use non-_prot vmf_insert_pfn()
    as suggested by Jason G.

 drivers/vfio/pci/vfio_pci.c |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Alex Williamson June 28, 2021, 4:46 p.m. UTC | #1

On Wed, 10 Mar 2021 11:58:07 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> from within a vm_ops fault handler.  This function will trigger a
> BUG_ON if it encounters a populated pte within the remapped range,
> where any fault is meant to populate the entire vma.  Concurrent
> inflight faults to the same vma will therefore hit this issue,
> triggering traces such as:
> 
> [ 1591.733256] kernel BUG at mm/memory.c:2177!
> [ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
> [ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
> [ 1591.770735] Hardware name:  , BIOS HixxxxFPGA 1P B600 V121-1
> [ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
> [ 1591.786134] pc : remap_pfn_range+0x214/0x340
> [ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
> [ 1591.799117] sp : ffff80001068bbd0
> [ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
> [ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
> [ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
> [ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
> [ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
> [ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
> [ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
> [ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
> [ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
> [ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
> [ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
> [ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
> [ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
> [ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
> [ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
> [ 1591.910234] Call trace:
> [ 1591.914837]  remap_pfn_range+0x214/0x340
> [ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
> [ 1591.931200]  __do_fault+0x44/0x12c
> [ 1591.937031]  handle_mm_fault+0xcc8/0x1230
> [ 1591.942475]  do_page_fault+0x16c/0x484
> [ 1591.948635]  do_translation_fault+0xbc/0xd8
> [ 1591.954171]  do_mem_abort+0x4c/0xc0
> [ 1591.960316]  el0_da+0x40/0x80
> [ 1591.965585]  el0_sync_handler+0x168/0x1b0
> [ 1591.971608]  el0_sync+0x174/0x180
> [ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)
> 
> Switch to using vmf_insert_pfn() to allow replacing mappings, and
> include decrypted memory protection as formerly provided by
> io_remap_pfn_range().  Tracking of vmas is also updated to
> prevent duplicate entries.
> 
> Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
> Suggested-by: Zeng Tao <prime.zeng@hisilicon.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> v2: Set decrypted pgprot in mmap, use non-_prot vmf_insert_pfn()
>     as suggested by Jason G.

IIRC, there were no blocking issues on this patch as an interim fix to
resolve the concurrent fault issues with io_remap_pfn_range().
Unfortunately it also got no Reviewed-by or Tested-by feedback.  I'd
like to put this in for v5.14 (should have gone in earlier).  Any final
comments?  Thanks,

Alex

> 
>  drivers/vfio/pci/vfio_pci.c |   30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b44578..73e125d73640 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
>  {
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  
> +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> +		if (mmap_vma->vma == vma)
> +			return 0; /* Swallow the error, the vma is tracked */
> +	}
> +
>  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>  	if (!mmap_vma)
>  		return -ENOMEM;
> @@ -1612,31 +1617,31 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
> -	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> +	vm_fault_t ret = VM_FAULT_SIGBUS;
>  
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
>  
> -	if (!__vfio_pci_memory_enabled(vdev)) {
> -		ret = VM_FAULT_SIGBUS;
> -		mutex_unlock(&vdev->vma_lock);
> +	if (!__vfio_pci_memory_enabled(vdev))
>  		goto up_out;
> +
> +	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> +		ret = vmf_insert_pfn(vma, vaddr, pfn);
> +		if (ret != VM_FAULT_NOPAGE) {
> +			zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
> +			goto up_out;
> +		}
>  	}
>  
>  	if (__vfio_pci_add_vma(vdev, vma)) {
>  		ret = VM_FAULT_OOM;
> -		mutex_unlock(&vdev->vma_lock);
> -		goto up_out;
> +		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
>  	}
>  
> -	mutex_unlock(&vdev->vma_lock);
> -
> -	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
> -		ret = VM_FAULT_SIGBUS;
> -
>  up_out:
>  	up_read(&vdev->memory_lock);
> +	mutex_unlock(&vdev->vma_lock);
>  	return ret;
>  }
>  
> @@ -1702,6 +1707,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  
>  	vma->vm_private_data = vdev;
>  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
>  
>  	/*
>

Jason Gunthorpe June 28, 2021, 5:30 p.m. UTC | #2

On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:
> On Wed, 10 Mar 2021 11:58:07 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > from within a vm_ops fault handler.  This function will trigger a
> > BUG_ON if it encounters a populated pte within the remapped range,
> > where any fault is meant to populate the entire vma.  Concurrent
> > inflight faults to the same vma will therefore hit this issue,
> > triggering traces such as:

If it is just about concurrancy can the vma_lock enclose
io_remap_pfn_range() ?

> IIRC, there were no blocking issues on this patch as an interim fix to
> resolve the concurrent fault issues with io_remap_pfn_range().
> Unfortunately it also got no Reviewed-by or Tested-by feedback.  I'd
> like to put this in for v5.14 (should have gone in earlier).  Any final
> comments?  Thanks,

I assume there is a reason why vm_lock can't be used here, so I
wouldn't object, though I don't especially like the loss of tracking
either.

Jason

Alex Williamson June 28, 2021, 6:36 p.m. UTC | #3

On Mon, 28 Jun 2021 14:30:28 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:
> > On Wed, 10 Mar 2021 11:58:07 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > > from within a vm_ops fault handler.  This function will trigger a
> > > BUG_ON if it encounters a populated pte within the remapped range,
> > > where any fault is meant to populate the entire vma.  Concurrent
> > > inflight faults to the same vma will therefore hit this issue,
> > > triggering traces such as:  
> 
> If it is just about concurrancy can the vma_lock enclose
> io_remap_pfn_range() ?

We could extend vma_lock around io_remap_pfn_range(), but that alone
would just block the concurrent faults to the same vma and once we
released them they'd still hit the BUG_ON in io_remap_pfn_range()
because the page is no longer pte_none().  We'd need to combine that
with something like __vfio_pci_add_vma() returning -EEXIST to skip the
io_remap_pfn_range(), but I've been advised that we shouldn't be
calling io_remap_pfn_range() from within the fault handler anyway, we
should be using something like vmf_insert_pfn() instead, which I
understand can be called safely in the same situation.  That's rather
the testing I was hoping someone who reproduced the issue previously
could validate.

> > IIRC, there were no blocking issues on this patch as an interim fix to
> > resolve the concurrent fault issues with io_remap_pfn_range().
> > Unfortunately it also got no Reviewed-by or Tested-by feedback.  I'd
> > like to put this in for v5.14 (should have gone in earlier).  Any final
> > comments?  Thanks,  
> 
> I assume there is a reason why vm_lock can't be used here, so I
> wouldn't object, though I don't especially like the loss of tracking
> either.

There's no loss of tracking here, we were only expecting a single fault
per vma to add the vma to our list.  This just skips adding duplicates
in these cases where we can have multiple faults in-flight.  Thanks,

Alex

Jason Gunthorpe June 28, 2021, 6:52 p.m. UTC | #4

On Mon, Jun 28, 2021 at 12:36:21PM -0600, Alex Williamson wrote:
> On Mon, 28 Jun 2021 14:30:28 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:
> > > On Wed, 10 Mar 2021 11:58:07 -0700
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >   
> > > > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > > > from within a vm_ops fault handler.  This function will trigger a
> > > > BUG_ON if it encounters a populated pte within the remapped range,
> > > > where any fault is meant to populate the entire vma.  Concurrent
> > > > inflight faults to the same vma will therefore hit this issue,
> > > > triggering traces such as:  
> > 
> > If it is just about concurrancy can the vma_lock enclose
> > io_remap_pfn_range() ?
> 
> We could extend vma_lock around io_remap_pfn_range(), but that alone
> would just block the concurrent faults to the same vma and once we
> released them they'd still hit the BUG_ON in io_remap_pfn_range()
> because the page is no longer pte_none().  We'd need to combine that
> with something like __vfio_pci_add_vma() returning -EEXIST to skip the
> io_remap_pfn_range(), but I've been advised that we shouldn't be
> calling io_remap_pfn_range() from within the fault handler anyway, we
> should be using something like vmf_insert_pfn() instead, which I
> understand can be called safely in the same situation.  That's rather
> the testing I was hoping someone who reproduced the issue previously
> could validate.

Yes, using the vmf_ stuff is 'righter' for sure, but there isn't
really a vmf for IO mappings..

> > I assume there is a reason why vm_lock can't be used here, so I
> > wouldn't object, though I don't especially like the loss of tracking
> > either.
> 
> There's no loss of tracking here, we were only expecting a single fault
> per vma to add the vma to our list.  This just skips adding duplicates
> in these cases where we can have multiple faults in-flight.  Thanks,

I mean the arch tracking of IO maps that is hidden inside ioremap_pfn

Jason

Alex Williamson June 28, 2021, 7:30 p.m. UTC | #5

On Mon, 28 Jun 2021 15:52:42 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Jun 28, 2021 at 12:36:21PM -0600, Alex Williamson wrote:
> > On Mon, 28 Jun 2021 14:30:28 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:  
> > > > On Wed, 10 Mar 2021 11:58:07 -0700
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > >     
> > > > > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > > > > from within a vm_ops fault handler.  This function will trigger a
> > > > > BUG_ON if it encounters a populated pte within the remapped range,
> > > > > where any fault is meant to populate the entire vma.  Concurrent
> > > > > inflight faults to the same vma will therefore hit this issue,
> > > > > triggering traces such as:    
> > > 
> > > If it is just about concurrancy can the vma_lock enclose
> > > io_remap_pfn_range() ?  
> > 
> > We could extend vma_lock around io_remap_pfn_range(), but that alone
> > would just block the concurrent faults to the same vma and once we
> > released them they'd still hit the BUG_ON in io_remap_pfn_range()
> > because the page is no longer pte_none().  We'd need to combine that
> > with something like __vfio_pci_add_vma() returning -EEXIST to skip the
> > io_remap_pfn_range(), but I've been advised that we shouldn't be
> > calling io_remap_pfn_range() from within the fault handler anyway, we
> > should be using something like vmf_insert_pfn() instead, which I
> > understand can be called safely in the same situation.  That's rather
> > the testing I was hoping someone who reproduced the issue previously
> > could validate.  
> 
> Yes, using the vmf_ stuff is 'righter' for sure, but there isn't
> really a vmf for IO mappings..
> 
> > > I assume there is a reason why vm_lock can't be used here, so I
> > > wouldn't object, though I don't especially like the loss of tracking
> > > either.  
> > 
> > There's no loss of tracking here, we were only expecting a single fault
> > per vma to add the vma to our list.  This just skips adding duplicates
> > in these cases where we can have multiple faults in-flight.  Thanks,  
> 
> I mean the arch tracking of IO maps that is hidden inside ioremap_pfn

Ok, so I take it you'd feel more comfortable with something like this,
right?  Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 759dfb118712..74fc66cf9cf4 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1584,6 +1584,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
 
 	mutex_lock(&vdev->vma_lock);
@@ -1591,24 +1592,33 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 
 	if (!__vfio_pci_memory_enabled(vdev)) {
 		ret = VM_FAULT_SIGBUS;
-		mutex_unlock(&vdev->vma_lock);
 		goto up_out;
 	}
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		mutex_unlock(&vdev->vma_lock);
-		goto up_out;
+	/*
+	 * Skip existing vmas, assume concurrent in-flight faults to avoid
+	 * BUG_ON from io_remap_pfn_range() hitting !pte_none() pages.
+	 */
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma)
+			goto up_out;
 	}
 
-	mutex_unlock(&vdev->vma_lock);
-
 	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
+			       vma->vm_end - vma->vm_start,
+			       vma->vm_page_prot)) {
 		ret = VM_FAULT_SIGBUS;
+		goto up_out;
+	}
+
+	if (__vfio_pci_add_vma(vdev, vma)) {
+		ret = VM_FAULT_OOM;
+		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+	}
 
 up_out:
 	up_read(&vdev->memory_lock);
+	mutex_unlock(&vdev->vma_lock);
 	return ret;
 }

Jason Gunthorpe June 28, 2021, 11:26 p.m. UTC | #6

On Mon, Jun 28, 2021 at 01:30:19PM -0600, Alex Williamson wrote:
> On Mon, 28 Jun 2021 15:52:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Jun 28, 2021 at 12:36:21PM -0600, Alex Williamson wrote:
> > > On Mon, 28 Jun 2021 14:30:28 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:  
> > > > > On Wed, 10 Mar 2021 11:58:07 -0700
> > > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > >     
> > > > > > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > > > > > from within a vm_ops fault handler.  This function will trigger a
> > > > > > BUG_ON if it encounters a populated pte within the remapped range,
> > > > > > where any fault is meant to populate the entire vma.  Concurrent
> > > > > > inflight faults to the same vma will therefore hit this issue,
> > > > > > triggering traces such as:    
> > > > 
> > > > If it is just about concurrancy can the vma_lock enclose
> > > > io_remap_pfn_range() ?  
> > > 
> > > We could extend vma_lock around io_remap_pfn_range(), but that alone
> > > would just block the concurrent faults to the same vma and once we
> > > released them they'd still hit the BUG_ON in io_remap_pfn_range()
> > > because the page is no longer pte_none().  We'd need to combine that
> > > with something like __vfio_pci_add_vma() returning -EEXIST to skip the
> > > io_remap_pfn_range(), but I've been advised that we shouldn't be
> > > calling io_remap_pfn_range() from within the fault handler anyway, we
> > > should be using something like vmf_insert_pfn() instead, which I
> > > understand can be called safely in the same situation.  That's rather
> > > the testing I was hoping someone who reproduced the issue previously
> > > could validate.  
> > 
> > Yes, using the vmf_ stuff is 'righter' for sure, but there isn't
> > really a vmf for IO mappings..
> > 
> > > > I assume there is a reason why vm_lock can't be used here, so I
> > > > wouldn't object, though I don't especially like the loss of tracking
> > > > either.  
> > > 
> > > There's no loss of tracking here, we were only expecting a single fault
> > > per vma to add the vma to our list.  This just skips adding duplicates
> > > in these cases where we can have multiple faults in-flight.  Thanks,  
> > 
> > I mean the arch tracking of IO maps that is hidden inside ioremap_pfn
> 
> Ok, so I take it you'd feel more comfortable with something like this,
> right?  Thanks,

I think so, it doesn't abuse the arch code, but it does abuse not
using vmf_* in a fault handler.

> index 759dfb118712..74fc66cf9cf4 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1584,6 +1584,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
> +	struct vfio_pci_mmap_vma *mmap_vma;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
>  
>  	mutex_lock(&vdev->vma_lock);
> @@ -1591,24 +1592,33 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  
>  	if (!__vfio_pci_memory_enabled(vdev)) {
>  		ret = VM_FAULT_SIGBUS;
> -		mutex_unlock(&vdev->vma_lock);
>  		goto up_out;
>  	}
>  
> -	if (__vfio_pci_add_vma(vdev, vma)) {
> -		ret = VM_FAULT_OOM;
> -		mutex_unlock(&vdev->vma_lock);
> -		goto up_out;


> +	/*
> +	 * Skip existing vmas, assume concurrent in-flight faults to avoid
> +	 * BUG_ON from io_remap_pfn_range() hitting !pte_none() pages.
> +	 */
> +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> +		if (mmap_vma->vma == vma)
> +			goto up_out;
>  	}
>  
> -	mutex_unlock(&vdev->vma_lock);
> -
>  	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
> +			       vma->vm_end - vma->vm_start,
> +			       vma->vm_page_prot)) {
>  		ret = VM_FAULT_SIGBUS;
> +		goto up_out;
> +	}

I suppose io_remap_pfn_range can fail inside after partially
populating the range, ie if it fails to allocate another pte table or
something.

Since partial allocations are not allowed we'd have to zap it here
too.

I suppose the other idea is to do the io_remap_pfn_range() when the
mmap becomes valid and the zap when it becomes invalid and just have
the fault handler always fail. That way we don't abuse anything.

Was there some tricky locking reason why this didn't work? Does it get
better with the address_space?

Jason

Alex Williamson June 29, 2021, 2:11 p.m. UTC | #7

On Mon, 28 Jun 2021 20:26:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Jun 28, 2021 at 01:30:19PM -0600, Alex Williamson wrote:
> > On Mon, 28 Jun 2021 15:52:42 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Jun 28, 2021 at 12:36:21PM -0600, Alex Williamson wrote:  
> > > > On Mon, 28 Jun 2021 14:30:28 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >     
> > > > > On Mon, Jun 28, 2021 at 10:46:53AM -0600, Alex Williamson wrote:    
> > > > > > On Wed, 10 Mar 2021 11:58:07 -0700
> > > > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > > >       
> > > > > > > vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
> > > > > > > from within a vm_ops fault handler.  This function will trigger a
> > > > > > > BUG_ON if it encounters a populated pte within the remapped range,
> > > > > > > where any fault is meant to populate the entire vma.  Concurrent
> > > > > > > inflight faults to the same vma will therefore hit this issue,
> > > > > > > triggering traces such as:      
> > > > > 
> > > > > If it is just about concurrancy can the vma_lock enclose
> > > > > io_remap_pfn_range() ?    
> > > > 
> > > > We could extend vma_lock around io_remap_pfn_range(), but that alone
> > > > would just block the concurrent faults to the same vma and once we
> > > > released them they'd still hit the BUG_ON in io_remap_pfn_range()
> > > > because the page is no longer pte_none().  We'd need to combine that
> > > > with something like __vfio_pci_add_vma() returning -EEXIST to skip the
> > > > io_remap_pfn_range(), but I've been advised that we shouldn't be
> > > > calling io_remap_pfn_range() from within the fault handler anyway, we
> > > > should be using something like vmf_insert_pfn() instead, which I
> > > > understand can be called safely in the same situation.  That's rather
> > > > the testing I was hoping someone who reproduced the issue previously
> > > > could validate.    
> > > 
> > > Yes, using the vmf_ stuff is 'righter' for sure, but there isn't
> > > really a vmf for IO mappings..
> > >   
> > > > > I assume there is a reason why vm_lock can't be used here, so I
> > > > > wouldn't object, though I don't especially like the loss of tracking
> > > > > either.    
> > > > 
> > > > There's no loss of tracking here, we were only expecting a single fault
> > > > per vma to add the vma to our list.  This just skips adding duplicates
> > > > in these cases where we can have multiple faults in-flight.  Thanks,    
> > > 
> > > I mean the arch tracking of IO maps that is hidden inside ioremap_pfn  
> > 
> > Ok, so I take it you'd feel more comfortable with something like this,
> > right?  Thanks,  
> 
> I think so, it doesn't abuse the arch code, but it does abuse not
> using vmf_* in a fault handler.
> 
> > index 759dfb118712..74fc66cf9cf4 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1584,6 +1584,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > +	struct vfio_pci_mmap_vma *mmap_vma;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> >  
> >  	mutex_lock(&vdev->vma_lock);
> > @@ -1591,24 +1592,33 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  
> >  	if (!__vfio_pci_memory_enabled(vdev)) {
> >  		ret = VM_FAULT_SIGBUS;
> > -		mutex_unlock(&vdev->vma_lock);
> >  		goto up_out;
> >  	}
> >  
> > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > -		ret = VM_FAULT_OOM;
> > -		mutex_unlock(&vdev->vma_lock);
> > -		goto up_out;  
> 
> 
> > +	/*
> > +	 * Skip existing vmas, assume concurrent in-flight faults to avoid
> > +	 * BUG_ON from io_remap_pfn_range() hitting !pte_none() pages.
> > +	 */
> > +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +		if (mmap_vma->vma == vma)
> > +			goto up_out;
> >  	}
> >  
> > -	mutex_unlock(&vdev->vma_lock);
> > -
> >  	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > -			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
> > +			       vma->vm_end - vma->vm_start,
> > +			       vma->vm_page_prot)) {
> >  		ret = VM_FAULT_SIGBUS;
> > +		goto up_out;
> > +	}  
> 
> I suppose io_remap_pfn_range can fail inside after partially
> populating the range, ie if it fails to allocate another pte table or
> something.
> 
> Since partial allocations are not allowed we'd have to zap it here
> too.

Yup, easy enough.
 
> I suppose the other idea is to do the io_remap_pfn_range() when the
> mmap becomes valid and the zap when it becomes invalid and just have
> the fault handler always fail. That way we don't abuse anything.

That's coming, but it's a more substantial change, I'd rather fix this
within the current framework first.

> Was there some tricky locking reason why this didn't work? Does it get
> better with the address_space?

Yes it does, we can create the reverse of unmmap_mapping_range() using
the address space solution, which hugely simplifies zapping and
re-inserting BAR mappings.  This has been stalled due to lack of time
to work out the notifier issues for dropping IOMMU mappings, but it
also stands on its own, so I'll see if I can get that far in the series
reposted.  v3 w/ zap on io_remap_pfn_range() error path coming.  Thanks,

Alex

[v2] vfio/pci: Handle concurrent vma faults

Commit Message

Comments

Patch