vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()
diff mbox series

Message ID 20200416225057.8449-1-sean.j.christopherson@intel.com
State New
Headers show
Series
  • vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()
Related show

Commit Message

Sean Christopherson April 16, 2020, 10:50 p.m. UTC
Use follow_pfn() to get the PFN of a PFNMAP VMA instead of assuming that
vma->vm_pgoff holds the base PFN of the VMA.  This fixes a bug where
attempting to do VFIO_IOMMU_MAP_DMA on an arbitrary PFNMAP'd region of
memory calculates garbage for the PFN.

Hilariously, this only got detected because the first "PFN" calculated
by vaddr_get_pfn() is PFN 0 (vma->vm_pgoff==0), and iommu_iova_to_phys()
uses PA==0 as an error, which triggers a WARN in vfio_unmap_unpin()
because the translation "failed".  PFN 0 is now unconditionally reserved
on x86 in order to mitigate L1TF, which causes is_invalid_reserved_pfn()
to return true and in turns results in vaddr_get_pfn() returning success
for PFN 0.  Eventually the bogus calculation runs into PFNs that aren't
reserved and leads to failure in vfio_pin_map_dma().  The subsequent
call to vfio_remove_dma() attempts to unmap PFN 0 and WARNs.

  WARNING: CPU: 8 PID: 5130 at drivers/vfio/vfio_iommu_type1.c:750 vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
  Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ...
  CPU: 8 PID: 5130 Comm: sgx Tainted: G        W         5.6.0-rc5-705d787c7fee-vfio+ #3
  Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.D00.X119.B49.1803010910 03/01/2018
  RIP: 0010:vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
  Code: <0f> 0b 49 81 c5 00 10 00 00 e9 c5 fe ff ff bb 00 10 00 00 e9 3d fe
  RSP: 0018:ffffbeb5039ebda8 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff9a55cbf8d480 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a52b771c200
  RBP: 0000000000000000 R08: 0000000000000040 R09: 00000000fffffff2
  R10: 0000000000000001 R11: ffff9a51fa896000 R12: 0000000184010000
  R13: 0000000184000000 R14: 0000000000010000 R15: ffff9a55cb66ea08
  FS:  00007f15d3830b40(0000) GS:ffff9a55d5600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000561cf39429e0 CR3: 000000084f75f005 CR4: 00000000003626e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   vfio_remove_dma+0x17/0x70 [vfio_iommu_type1]
   vfio_iommu_type1_ioctl+0x9e3/0xa7b [vfio_iommu_type1]
   ksys_ioctl+0x92/0xb0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4c/0x180
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15d04c75d7
  Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48

Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---

I'm mostly confident this is correct from the standpoint that it generates
the correct VA->PA.  I'm far less confident the end result is what VFIO
wants, there appears to be a fair bit of magic going on that I don't fully
understand, e.g. I'm a bit mystified as to how this ever worked in any
capacity.

Mapping PFNMAP VMAs into the IOMMU without using a mmu_notifier also seems
dangerous, e.g. if the subsystem associated with the VMA unmaps/remaps the
VMA then the IOMMU will end up with stale translations.

Last thought, using PA==0 for the error seems unnecessarily risky, e.g.
why not use something similar to KVM_PFN_ERR_* or an explicit return code?

 drivers/vfio/vfio_iommu_type1.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Sean Christopherson April 16, 2020, 11:29 p.m. UTC | #1
On Thu, Apr 16, 2020 at 03:50:57PM -0700, Sean Christopherson wrote:
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Egad, I completely spaced, this should have:

Reported-by: Ankit Amlani <ankit.amlani@intel.com>
Alex Williamson April 20, 2020, 7:40 p.m. UTC | #2
Hi Sean,

On Thu, 16 Apr 2020 15:50:57 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> Use follow_pfn() to get the PFN of a PFNMAP VMA instead of assuming that
> vma->vm_pgoff holds the base PFN of the VMA.  This fixes a bug where
> attempting to do VFIO_IOMMU_MAP_DMA on an arbitrary PFNMAP'd region of
> memory calculates garbage for the PFN.
> 
> Hilariously, this only got detected because the first "PFN" calculated
> by vaddr_get_pfn() is PFN 0 (vma->vm_pgoff==0), and iommu_iova_to_phys()
> uses PA==0 as an error, which triggers a WARN in vfio_unmap_unpin()
> because the translation "failed".  PFN 0 is now unconditionally reserved
> on x86 in order to mitigate L1TF, which causes is_invalid_reserved_pfn()
> to return true and in turns results in vaddr_get_pfn() returning success
> for PFN 0.  Eventually the bogus calculation runs into PFNs that aren't
> reserved and leads to failure in vfio_pin_map_dma().  The subsequent
> call to vfio_remove_dma() attempts to unmap PFN 0 and WARNs.
> 
>   WARNING: CPU: 8 PID: 5130 at drivers/vfio/vfio_iommu_type1.c:750 vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
>   Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ...
>   CPU: 8 PID: 5130 Comm: sgx Tainted: G        W         5.6.0-rc5-705d787c7fee-vfio+ #3
>   Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.D00.X119.B49.1803010910 03/01/2018
>   RIP: 0010:vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
>   Code: <0f> 0b 49 81 c5 00 10 00 00 e9 c5 fe ff ff bb 00 10 00 00 e9 3d fe
>   RSP: 0018:ffffbeb5039ebda8 EFLAGS: 00010246
>   RAX: 0000000000000000 RBX: ffff9a55cbf8d480 RCX: 0000000000000000
>   RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a52b771c200
>   RBP: 0000000000000000 R08: 0000000000000040 R09: 00000000fffffff2
>   R10: 0000000000000001 R11: ffff9a51fa896000 R12: 0000000184010000
>   R13: 0000000184000000 R14: 0000000000010000 R15: ffff9a55cb66ea08
>   FS:  00007f15d3830b40(0000) GS:ffff9a55d5600000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000561cf39429e0 CR3: 000000084f75f005 CR4: 00000000003626e0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   Call Trace:
>    vfio_remove_dma+0x17/0x70 [vfio_iommu_type1]
>    vfio_iommu_type1_ioctl+0x9e3/0xa7b [vfio_iommu_type1]
>    ksys_ioctl+0x92/0xb0
>    __x64_sys_ioctl+0x16/0x20
>    do_syscall_64+0x4c/0x180
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   RIP: 0033:0x7f15d04c75d7
>   Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
> 
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> 
> I'm mostly confident this is correct from the standpoint that it generates
> the correct VA->PA.  I'm far less confident the end result is what VFIO
> wants, there appears to be a fair bit of magic going on that I don't fully
> understand, e.g. I'm a bit mystified as to how this ever worked in any
> capacity.

Yeah, that magic was copied from KVM's hva_to_pfn(), which split this
part out into hva_to_pfn_remapped() in 92176a8ede57 and then in
add6a0cd1c5b adopted a follow_pfn() approach, but also added forcing a
user fault and retry mechanism, iiuc.  Cc'ing Paolo and Andrea to see
if we should consider something similar.  We'd be forcing the fault on
user mapping, not first access though, so I'm not sure if it's still
useful.
 
> Mapping PFNMAP VMAs into the IOMMU without using a mmu_notifier also
> seems dangerous, e.g. if the subsystem associated with the VMA
> unmaps/remaps the VMA then the IOMMU will end up with stale
> translations.

The original use case was to support mapping MMIO ranges between
devices to support p2p within a VM instance, so remapping the VMA was
not a concern.  But yes, as this might be used beyond that limited
case for something like rdma, it should be expanded.  Patches?

> Last thought, using PA==0 for the error seems unnecessarily risky,
> e.g. why not use something similar to KVM_PFN_ERR_* or an explicit
> return code?

We're just consuming what the IOMMU driver provide.  Both Intel and AMD
return zero for a page not found.  In retrospect, yeah, we probably
should have balked at that.

>  drivers/vfio/vfio_iommu_type1.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c index 85b32c325282..c2ada190c5cb
> 100644 --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -342,8 +342,8 @@ static int vaddr_get_pfn(struct mm_struct *mm,
> unsigned long vaddr, vma = find_vma_intersection(mm, vaddr, vaddr +
> 1); 
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> -		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) +
> vma->vm_pgoff;
> -		if (is_invalid_reserved_pfn(*pfn))
> +		if (!follow_pfn(vma, vaddr, pfn) &&
> +		    is_invalid_reserved_pfn(*pfn))
>  			ret = 0;
>  	}
>  done:

Should we consume that error code?

	ret = follow_pfn(vma, vaddr, pfn);
	if (!ret && !is_invalid_reserved_pfn(*pfn))
		ret = -EINVAL;

Thanks,
Alex
Sean Christopherson April 21, 2020, 3:34 a.m. UTC | #3
On Mon, Apr 20, 2020 at 01:40:05PM -0600, Alex Williamson wrote:
> > I'm mostly confident this is correct from the standpoint that it generates
> > the correct VA->PA.  I'm far less confident the end result is what VFIO
> > wants, there appears to be a fair bit of magic going on that I don't fully
> > understand, e.g. I'm a bit mystified as to how this ever worked in any
> > capacity.
> 
> Yeah, that magic was copied from KVM's hva_to_pfn(), which split this
> part out into hva_to_pfn_remapped() in 92176a8ede57 and then in

Wowsers.  I don't suppose anyone knows how/if KVM prevented that BUG_ON()
in hva_to_pfn() from being triggered by a malicious/miconfigured userspace?

> add6a0cd1c5b adopted a follow_pfn() approach, but also added forcing a
> user fault and retry mechanism, iiuc.  Cc'ing Paolo and Andrea to see
> if we should consider something similar.  We'd be forcing the fault on
> user mapping, not first access though, so I'm not sure if it's still
> useful.

Hmm, because the fault would trigger on map, userspace could provide the
same effective result by touching the page before calling into VFIO, i.e.
doesn't seem like adding fixup_user_fault() would add much other than
complexity.

> > Mapping PFNMAP VMAs into the IOMMU without using a mmu_notifier also
> > seems dangerous, e.g. if the subsystem associated with the VMA
> > unmaps/remaps the VMA then the IOMMU will end up with stale
> > translations.
> 
> The original use case was to support mapping MMIO ranges between
> devices to support p2p within a VM instance, so remapping the VMA was
> not a concern.  But yes, as this might be used beyond that limited
> case for something like rdma, it should be expanded.  Patches?

Heh, I don't have a use case for any of this.  Quite the opposite actually,
this was encountered because the VFIO memory listener in Qemu was trying to
map SGX EPC memory for DMA.  I segued into this patch only because the WARN
on PA!=0 caught my eye.

> > Last thought, using PA==0 for the error seems unnecessarily risky,
> > e.g. why not use something similar to KVM_PFN_ERR_* or an explicit
> > return code?
> 
> We're just consuming what the IOMMU driver provide.  Both Intel and AMD
> return zero for a page not found.  In retrospect, yeah, we probably
> should have balked at that.
> 
> >  drivers/vfio/vfio_iommu_type1.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 85b32c325282..c2ada190c5cb
> > 100644 --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -342,8 +342,8 @@ static int vaddr_get_pfn(struct mm_struct *mm,
> > unsigned long vaddr, vma = find_vma_intersection(mm, vaddr, vaddr +
> > 1); 
> >  	if (vma && vma->vm_flags & VM_PFNMAP) {
> > -		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) +
> > vma->vm_pgoff;
> > -		if (is_invalid_reserved_pfn(*pfn))
> > +		if (!follow_pfn(vma, vaddr, pfn) &&
> > +		    is_invalid_reserved_pfn(*pfn))
> >  			ret = 0;
> >  	}
> >  done:
> 
> Should we consume that error code?
> 
> 	ret = follow_pfn(vma, vaddr, pfn);
> 	if (!ret && !is_invalid_reserved_pfn(*pfn))
> 		ret = -EINVAL;

Not sure it matters?  gup() returns -EINVAL on PFNMAP, follow_pfn() returns
-EINVAL for all error cases, and the delta would also return -EINVAL.
Generally speaking, letting the first error "win" usually seems like the
way to go, e.g. to avoid squashing a meaningful error code.  But AFAICT
it's a moot point.

Patch
diff mbox series

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 85b32c325282..c2ada190c5cb 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -342,8 +342,8 @@  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-		if (is_invalid_reserved_pfn(*pfn))
+		if (!follow_pfn(vma, vaddr, pfn) &&
+		    is_invalid_reserved_pfn(*pfn))
 			ret = 0;
 	}
 done: