[09/19] mm: New follow_pfnmap API

Message ID	20240809160909.1023470-10-peterx@redhat.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Peter Xu <peterx@redhat.com> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Sean Christopherson <seanjc@google.com>, Oscar Salvador <osalvador@suse.de>, Jason Gunthorpe <jgg@nvidia.com>, Axel Rasmussen <axelrasmussen@google.com>, linux-arm-kernel@lists.infradead.org, x86@kernel.org, peterx@redhat.com, Will Deacon <will@kernel.org>, Gavin Shan <gshan@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Zi Yan <ziy@nvidia.com>, Andrew Morton <akpm@linux-foundation.org>, Catalin Marinas <catalin.marinas@arm.com>, Ingo Molnar <mingo@redhat.com>, Alistair Popple <apopple@nvidia.com>, Borislav Petkov <bp@alien8.de>, David Hildenbrand <david@redhat.com>, Thomas Gleixner <tglx@linutronix.de>, kvm@vger.kernel.org, Dave Hansen <dave.hansen@linux.intel.com>, Alex Williamson <alex.williamson@redhat.com>, Yan Zhao <yan.y.zhao@intel.com> Subject: [PATCH 09/19] mm: New follow_pfnmap API Date: Fri, 9 Aug 2024 12:08:59 -0400 Message-ID: <20240809160909.1023470-10-peterx@redhat.com> In-Reply-To: <20240809160909.1023470-1-peterx@redhat.com> References: <20240809160909.1023470-1-peterx@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII"; x-default=true Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: Support huge pfnmaps \| expand [00/19] mm: Support huge pfnmaps [01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud [02/19] mm: Drop is_huge_zero_pud() [03/19] mm: Mark special bits for huge pfn mappings when inject [04/19] mm: Allow THP orders for PFNMAPs [05/19] mm/gup: Detect huge pfnmap entries in gup-fast [07/19] mm/fork: Accept huge pfnmap entries [08/19] mm: Always define pxx_pgprot() [09/19] mm: New follow_pfnmap API [10/19] KVM: Use follow_pfnmap API [11/19] s390/pci_mmio: Use follow_pfnmap API [12/19] mm/x86/pat: Use the new follow_pfnmap API [13/19] vfio: Use the new follow_pfnmap API [14/19] acrn: Use the new follow_pfnmap API [15/19] mm/access_process_vm: Use the new follow_pfnmap API [16/19] mm: Remove follow_pte() [17/19] mm/x86: Support large pfn mappings [18/19] mm/arm64: Support large pfn mappings [19/19] vfio/pci: Implement huge_fault support

Peter Xu Aug. 9, 2024, 4:08 p.m. UTC

Introduce a pair of APIs to follow pfn mappings to get entry information.
It's very similar to what follow_pte() does before, but different in that
it recognizes huge pfn mappings.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  31 ++++++++++
 mm/memory.c        | 147 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 178 insertions(+)

Jason Gunthorpe Aug. 14, 2024, 1:19 p.m. UTC | #1

On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:

> +/**
> + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> + * @args: Pointer to struct @follow_pfnmap_args
> + *
> + * The caller needs to setup args->vma and args->address to point to the
> + * virtual address as the target of such lookup.  On a successful return,
> + * the results will be put into other output fields.
> + *
> + * After the caller finished using the fields, the caller must invoke
> + * another follow_pfnmap_end() to proper releases the locks and resources
> + * of such look up request.
> + *
> + * During the start() and end() calls, the results in @args will be valid
> + * as proper locks will be held.  After the end() is called, all the fields
> + * in @follow_pfnmap_args will be invalid to be further accessed.
> + *
> + * If the PTE maps a refcounted page, callers are responsible to protect
> + * against invalidation with MMU notifiers; otherwise access to the PFN at
> + * a later point in time can trigger use-after-free.
> + *
> + * Only IO mappings and raw PFN mappings are allowed.  

What does this mean? The paragraph before said this can return a
refcounted page?

> + * The mmap semaphore
> + * should be taken for read, and the mmap semaphore cannot be released
> + * before the end() is invoked.

This function is not safe for IO mappings and PFNs either, VFIO has a
known security issue to call it. That should be emphasised in the
comment.

The caller must be protected by mmu notifiers or other locking that
guarentees the PTE cannot be removed while the caller is using it. In
all cases. 

Since this hold the PTL until end is it always safe to use the
returned address before calling end?

Jason

Peter Xu Aug. 14, 2024, 6:24 p.m. UTC | #2

On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> 
> > +/**
> > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > + * @args: Pointer to struct @follow_pfnmap_args
> > + *
> > + * The caller needs to setup args->vma and args->address to point to the
> > + * virtual address as the target of such lookup.  On a successful return,
> > + * the results will be put into other output fields.
> > + *
> > + * After the caller finished using the fields, the caller must invoke
> > + * another follow_pfnmap_end() to proper releases the locks and resources
> > + * of such look up request.
> > + *
> > + * During the start() and end() calls, the results in @args will be valid
> > + * as proper locks will be held.  After the end() is called, all the fields
> > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > + *
> > + * If the PTE maps a refcounted page, callers are responsible to protect
> > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > + * a later point in time can trigger use-after-free.
> > + *
> > + * Only IO mappings and raw PFN mappings are allowed.  
> 
> What does this mean? The paragraph before said this can return a
> refcounted page?

This came from the old follow_pte(), I kept that as I suppose we should
allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
only the pfnmap matters where huge mappings can start to appear.

> 
> > + * The mmap semaphore
> > + * should be taken for read, and the mmap semaphore cannot be released
> > + * before the end() is invoked.
> 
> This function is not safe for IO mappings and PFNs either, VFIO has a
> known security issue to call it. That should be emphasised in the
> comment.

Any elaboration on this?  I could have missed that..

> 
> The caller must be protected by mmu notifiers or other locking that
> guarentees the PTE cannot be removed while the caller is using it. In
> all cases. 
> 
> Since this hold the PTL until end is it always safe to use the
> returned address before calling end?

I suppose so?  As the pgtable is stable, I thought it means it's safe, but
I'm not sure now when you mentioned there's a VFIO known issue, so I could
have overlooked something.  There's no address returned, but pfn, pgprot,
write, etc.

The user needs to do proper mapping if they need an usable address,
e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
change.

Thanks,

Jason Gunthorpe Aug. 14, 2024, 10:14 p.m. UTC | #3

On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > 
> > > +/**
> > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > + * @args: Pointer to struct @follow_pfnmap_args
> > > + *
> > > + * The caller needs to setup args->vma and args->address to point to the
> > > + * virtual address as the target of such lookup.  On a successful return,
> > > + * the results will be put into other output fields.
> > > + *
> > > + * After the caller finished using the fields, the caller must invoke
> > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > + * of such look up request.
> > > + *
> > > + * During the start() and end() calls, the results in @args will be valid
> > > + * as proper locks will be held.  After the end() is called, all the fields
> > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > + *
> > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > + * a later point in time can trigger use-after-free.
> > > + *
> > > + * Only IO mappings and raw PFN mappings are allowed.  
> > 
> > What does this mean? The paragraph before said this can return a
> > refcounted page?
> 
> This came from the old follow_pte(), I kept that as I suppose we should
> allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> only the pfnmap matters where huge mappings can start to appear.

If that is the intention it should actively block returning anything
that is vm_normal_page() not check the VM flags, see the other
discussion..

It makes sense as a restriction if you call the API follow pfnmap.

> > > + * The mmap semaphore
> > > + * should be taken for read, and the mmap semaphore cannot be released
> > > + * before the end() is invoked.
> > 
> > This function is not safe for IO mappings and PFNs either, VFIO has a
> > known security issue to call it. That should be emphasised in the
> > comment.
> 
> Any elaboration on this?  I could have missed that..

Just because the memory is a PFN or IO doesn't mean it is safe to
access it without a refcount. There are many driver scenarios where
revoking a PFN from mmap needs to be a hard fence that nothing else
has access to that PFN. Otherwise it is a security problem for that
driver.

> I suppose so?  As the pgtable is stable, I thought it means it's safe, but
> I'm not sure now when you mentioned there's a VFIO known issue, so I could
> have overlooked something.  There's no address returned, but pfn, pgprot,
> write, etc.

zap/etc will wait on the PTL, I think, so it should be safe for at
least the issues I am thinking of.

> The user needs to do proper mapping if they need an usable address,
> e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> change.

No, you can't take the phys_addr_t outside the start/end region that
explicitly holds the lock protecting it. This is what the comment must
warn against doing.

Jason

Peter Xu Aug. 15, 2024, 3:41 p.m. UTC | #4

On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > > 
> > > > +/**
> > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > + *
> > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > + * virtual address as the target of such lookup.  On a successful return,
> > > > + * the results will be put into other output fields.
> > > > + *
> > > > + * After the caller finished using the fields, the caller must invoke
> > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > + * of such look up request.
> > > > + *
> > > > + * During the start() and end() calls, the results in @args will be valid
> > > > + * as proper locks will be held.  After the end() is called, all the fields
> > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > + *
> > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > + * a later point in time can trigger use-after-free.
> > > > + *
> > > > + * Only IO mappings and raw PFN mappings are allowed.  
> > > 
> > > What does this mean? The paragraph before said this can return a
> > > refcounted page?
> > 
> > This came from the old follow_pte(), I kept that as I suppose we should
> > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > only the pfnmap matters where huge mappings can start to appear.
> 
> If that is the intention it should actively block returning anything
> that is vm_normal_page() not check the VM flags, see the other
> discussion..

The restriction should only be applied to the vma attributes, not a
specific pte mapping, IMHO.

I mean, the comment was describing "which VMA is allowed to use this
function", reflecting that we'll fail at anything !PFNMAP && !IO.

It seems legal to have private mappings of them, where vm_normal_page() can
return true here for some of the mappings under PFNMAP|IO. IIUC either the
old follow_pte() or follow_pfnmap*() API cared much on this part yet so
far.

> 
> It makes sense as a restriction if you call the API follow pfnmap.

I'm open to any better suggestion to names.  Again, I think here it's more
about the vma attribute, not "every mapping under the memory range".

> 
> > > > + * The mmap semaphore
> > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > + * before the end() is invoked.
> > > 
> > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > known security issue to call it. That should be emphasised in the
> > > comment.
> > 
> > Any elaboration on this?  I could have missed that..
> 
> Just because the memory is a PFN or IO doesn't mean it is safe to
> access it without a refcount. There are many driver scenarios where
> revoking a PFN from mmap needs to be a hard fence that nothing else
> has access to that PFN. Otherwise it is a security problem for that
> driver.

Oh ok, I suppose you meant the VFIO whole thing on "zapping mapping when
MMIO disabled"?  If so I get it.  More below.

> 
> > I suppose so?  As the pgtable is stable, I thought it means it's safe, but
> > I'm not sure now when you mentioned there's a VFIO known issue, so I could
> > have overlooked something.  There's no address returned, but pfn, pgprot,
> > write, etc.
> 
> zap/etc will wait on the PTL, I think, so it should be safe for at
> least the issues I am thinking of.
> 
> > The user needs to do proper mapping if they need an usable address,
> > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > change.
> 
> No, you can't take the phys_addr_t outside the start/end region that
> explicitly holds the lock protecting it. This is what the comment must
> warn against doing.

I think the comment has that part covered more or less:

 * During the start() and end() calls, the results in @args will be valid
 * as proper locks will be held.  After the end() is called, all the fields
 * in @follow_pfnmap_args will be invalid to be further accessed.

Feel free to suggest anything that will make it better.

For generic_access_phys() as a specific example: I think it is safe to map
the pfn even after end().  I meant here the "map" operation is benign with
ioremap_prot(), afaiu: it doesn't include an access on top of the mapping
yet.

After the map, it rewalks the pgtable, making sure PFN is still there and
valid, and it'll only access it this time before end():

	if (write)
		memcpy_toio(maddr + offset, buf, len);
	else
		memcpy_fromio(buf, maddr + offset, len);
	ret = len;
	follow_pfnmap_end(&args);

If PFN changed, it properly releases the mapping:

	if ((prot != pgprot_val(args.pgprot)) ||
	    (phys_addr != (args.pfn << PAGE_SHIFT)) ||
	    (writable != args.writable)) {
		follow_pfnmap_end(&args);
		iounmap(maddr);
		goto retry;
	}

Then taking the example of VFIO: there's no risk of racing with a
concurrent zapping as far as I can see, because otherwise it'll see pfn
changed.

Thanks,

Jason Gunthorpe Aug. 15, 2024, 4:16 p.m. UTC | #5

On Thu, Aug 15, 2024 at 11:41:39AM -0400, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > > > 
> > > > > +/**
> > > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > > + *
> > > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > > + * virtual address as the target of such lookup.  On a successful return,
> > > > > + * the results will be put into other output fields.
> > > > > + *
> > > > > + * After the caller finished using the fields, the caller must invoke
> > > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > > + * of such look up request.
> > > > > + *
> > > > > + * During the start() and end() calls, the results in @args will be valid
> > > > > + * as proper locks will be held.  After the end() is called, all the fields
> > > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > > + *
> > > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > > + * a later point in time can trigger use-after-free.
> > > > > + *
> > > > > + * Only IO mappings and raw PFN mappings are allowed.  
> > > > 
> > > > What does this mean? The paragraph before said this can return a
> > > > refcounted page?
> > > 
> > > This came from the old follow_pte(), I kept that as I suppose we should
> > > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > > only the pfnmap matters where huge mappings can start to appear.
> > 
> > If that is the intention it should actively block returning anything
> > that is vm_normal_page() not check the VM flags, see the other
> > discussion..
> 
> The restriction should only be applied to the vma attributes, not a
> specific pte mapping, IMHO.
> 
> I mean, the comment was describing "which VMA is allowed to use this
> function", reflecting that we'll fail at anything !PFNMAP && !IO.
> 
> It seems legal to have private mappings of them, where vm_normal_page() can
> return true here for some of the mappings under PFNMAP|IO. IIUC either the
> old follow_pte() or follow_pfnmap*() API cared much on this part yet so
> far.

Why? Either the function only returns PFN map no-struct page things or
it returns struct page stuff too, in which case why bother to check
the VMA flags if the caller already has to be correct for struct page
backed results?

This function is only safe to use under the proper locking, and under
those rules it doesn't matter at all what the result is..
> > > > > + * The mmap semaphore
> > > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > > + * before the end() is invoked.
> > > > 
> > > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > > known security issue to call it. That should be emphasised in the
> > > > comment.
> > > 
> > > Any elaboration on this?  I could have missed that..
> > 
> > Just because the memory is a PFN or IO doesn't mean it is safe to
> > access it without a refcount. There are many driver scenarios where
> > revoking a PFN from mmap needs to be a hard fence that nothing else
> > has access to that PFN. Otherwise it is a security problem for that
> > driver.
> 
> Oh ok, I suppose you meant the VFIO whole thing on "zapping mapping when
> MMIO disabled"?  If so I get it.  More below.

And more..

> > > The user needs to do proper mapping if they need an usable address,
> > > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > > change.
> > 
> > No, you can't take the phys_addr_t outside the start/end region that
> > explicitly holds the lock protecting it. This is what the comment must
> > warn against doing.
> 
> I think the comment has that part covered more or less:
> 
>  * During the start() and end() calls, the results in @args will be valid
>  * as proper locks will be held.  After the end() is called, all the fields
>  * in @follow_pfnmap_args will be invalid to be further accessed.
> 
> Feel free to suggest anything that will make it better.

Be much more specific and scary:

  Any physical address obtained through this API is only valid while
  the @follow_pfnmap_args. Continuing to use the address after end(),
  without some other means to synchronize with page table updates
  will create a security bug.
 
> For generic_access_phys() as a specific example: I think it is safe to map
> the pfn even after end(). 

The map could be safe, but also the memory could be hot unplugged as a
race. I don't know either way if all arch code is safe for that.

> After the map, it rewalks the pgtable, making sure PFN is still there and
> valid, and it'll only access it this time before end():
> 
> 	if (write)
> 		memcpy_toio(maddr + offset, buf, len);
> 	else
> 		memcpy_fromio(buf, maddr + offset, len);
> 	ret = len;
> 	follow_pfnmap_end(&args);

Yes
 
> If PFN changed, it properly releases the mapping:
> 
> 	if ((prot != pgprot_val(args.pgprot)) ||
> 	    (phys_addr != (args.pfn << PAGE_SHIFT)) ||
> 	    (writable != args.writable)) {
> 		follow_pfnmap_end(&args);
> 		iounmap(maddr);
> 		goto retry;
> 	}
> 
> Then taking the example of VFIO: there's no risk of racing with a
> concurrent zapping as far as I can see, because otherwise it'll see pfn
> changed.

VFIO dumps the physical address into the IOMMU and ignores
zap. Concurrent zap results in a UAF through the IOMMU mapping.

Jason

Peter Xu Aug. 15, 2024, 5:21 p.m. UTC | #6

On Thu, Aug 15, 2024 at 01:16:03PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2024 at 11:41:39AM -0400, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > > > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > > > > 
> > > > > > +/**
> > > > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > > > + *
> > > > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > > > + * virtual address as the target of such lookup.  On a successful return,
> > > > > > + * the results will be put into other output fields.
> > > > > > + *
> > > > > > + * After the caller finished using the fields, the caller must invoke
> > > > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > > > + * of such look up request.
> > > > > > + *
> > > > > > + * During the start() and end() calls, the results in @args will be valid
> > > > > > + * as proper locks will be held.  After the end() is called, all the fields
> > > > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > > > + *
> > > > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > > > + * a later point in time can trigger use-after-free.
> > > > > > + *
> > > > > > + * Only IO mappings and raw PFN mappings are allowed.  
> > > > > 
> > > > > What does this mean? The paragraph before said this can return a
> > > > > refcounted page?
> > > > 
> > > > This came from the old follow_pte(), I kept that as I suppose we should
> > > > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > > > only the pfnmap matters where huge mappings can start to appear.
> > > 
> > > If that is the intention it should actively block returning anything
> > > that is vm_normal_page() not check the VM flags, see the other
> > > discussion..
> > 
> > The restriction should only be applied to the vma attributes, not a
> > specific pte mapping, IMHO.
> > 
> > I mean, the comment was describing "which VMA is allowed to use this
> > function", reflecting that we'll fail at anything !PFNMAP && !IO.
> > 
> > It seems legal to have private mappings of them, where vm_normal_page() can
> > return true here for some of the mappings under PFNMAP|IO. IIUC either the
> > old follow_pte() or follow_pfnmap*() API cared much on this part yet so
> > far.
> 
> Why? Either the function only returns PFN map no-struct page things or
> it returns struct page stuff too, in which case why bother to check
> the VMA flags if the caller already has to be correct for struct page
> backed results?
> 
> This function is only safe to use under the proper locking, and under
> those rules it doesn't matter at all what the result is..

Do you mean we should drop the PFNMAP|IO check?  I didn't see all the
callers to say that they won't rely on proper failing of !PFNMAP&&!IO vmas
to work alright.  So I assume we should definitely keep them around.

Or I could have totally missed what you're suggesting here..

> > > > > > + * The mmap semaphore
> > > > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > > > + * before the end() is invoked.
> > > > > 
> > > > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > > > known security issue to call it. That should be emphasised in the
> > > > > comment.
> > > > 
> > > > Any elaboration on this?  I could have missed that..
> > > 
> > > Just because the memory is a PFN or IO doesn't mean it is safe to
> > > access it without a refcount. There are many driver scenarios where
> > > revoking a PFN from mmap needs to be a hard fence that nothing else
> > > has access to that PFN. Otherwise it is a security problem for that
> > > driver.
> > 
> > Oh ok, I suppose you meant the VFIO whole thing on "zapping mapping when
> > MMIO disabled"?  If so I get it.  More below.
> 
> And more..
> 
> > > > The user needs to do proper mapping if they need an usable address,
> > > > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > > > change.
> > > 
> > > No, you can't take the phys_addr_t outside the start/end region that
> > > explicitly holds the lock protecting it. This is what the comment must
> > > warn against doing.
> > 
> > I think the comment has that part covered more or less:
> > 
> >  * During the start() and end() calls, the results in @args will be valid
> >  * as proper locks will be held.  After the end() is called, all the fields
> >  * in @follow_pfnmap_args will be invalid to be further accessed.
> > 
> > Feel free to suggest anything that will make it better.
> 
> Be much more specific and scary:
> 
>   Any physical address obtained through this API is only valid while
>   the @follow_pfnmap_args. Continuing to use the address after end(),
>   without some other means to synchronize with page table updates
>   will create a security bug.

Some misuse on wordings here (e.g. we don't return PA but PFN), and some
sentence doesn't seem to be complete.. but I think I get the "scary" part
of it.  How about this, appending the scary part to the end?

 * During the start() and end() calls, the results in @args will be valid
 * as proper locks will be held.  After the end() is called, all the fields
 * in @follow_pfnmap_args will be invalid to be further accessed.  Further
 * use of such information after end() may require proper synchronizations
 * by the caller with page table updates, otherwise it can create a
 * security bug.

>  
> > For generic_access_phys() as a specific example: I think it is safe to map
> > the pfn even after end(). 
> 
> The map could be safe, but also the memory could be hot unplugged as a
> race. I don't know either way if all arch code is safe for that.

I hope it's ok, or we have similar problem with follow_pte() for all
theseyears.. in all cases, this sounds like another thing to be checked
outside of scope of this patch..

> 
> > After the map, it rewalks the pgtable, making sure PFN is still there and
> > valid, and it'll only access it this time before end():
> > 
> > 	if (write)
> > 		memcpy_toio(maddr + offset, buf, len);
> > 	else
> > 		memcpy_fromio(buf, maddr + offset, len);
> > 	ret = len;
> > 	follow_pfnmap_end(&args);
> 
> Yes
>  
> > If PFN changed, it properly releases the mapping:
> > 
> > 	if ((prot != pgprot_val(args.pgprot)) ||
> > 	    (phys_addr != (args.pfn << PAGE_SHIFT)) ||
> > 	    (writable != args.writable)) {
> > 		follow_pfnmap_end(&args);
> > 		iounmap(maddr);
> > 		goto retry;
> > 	}
> > 
> > Then taking the example of VFIO: there's no risk of racing with a
> > concurrent zapping as far as I can see, because otherwise it'll see pfn
> > changed.
> 
> VFIO dumps the physical address into the IOMMU and ignores
> zap. Concurrent zap results in a UAF through the IOMMU mapping.

Ohhh, so this is what I'm missing..

It worked for generic mem only because VFIO pins them upfront so any form
of zapping won't throw things away, but we can't pin pfnmaps yet so far.

It sounds like we need some mmu notifiers when mapping the IOMMU pgtables,
as long as there's MMIO-region / P2P involved.  It'll make sure when
tearing down the BAR mappings, the devices will at least see the same view
as the processors.

Thanks,

Jason Gunthorpe Aug. 15, 2024, 5:24 p.m. UTC | #7

On Thu, Aug 15, 2024 at 01:21:01PM -0400, Peter Xu wrote:
> > Why? Either the function only returns PFN map no-struct page things or
> > it returns struct page stuff too, in which case why bother to check
> > the VMA flags if the caller already has to be correct for struct page
> > backed results?
> > 
> > This function is only safe to use under the proper locking, and under
> > those rules it doesn't matter at all what the result is..
> 
> Do you mean we should drop the PFNMAP|IO check?

Yeah

>  I didn't see all the
> callers to say that they won't rely on proper failing of !PFNMAP&&!IO vmas
> to work alright.  So I assume we should definitely keep them around.

But as before, if we care about this we should be using vm_normal_page
as that is sort of abusing the PFNMAP flags.

> >   Any physical address obtained through this API is only valid while
> >   the @follow_pfnmap_args. Continuing to use the address after end(),
> >   without some other means to synchronize with page table updates
> >   will create a security bug.
> 
> Some misuse on wordings here (e.g. we don't return PA but PFN), and some
> sentence doesn't seem to be complete.. but I think I get the "scary" part
> of it.  How about this, appending the scary part to the end?
> 
>  * During the start() and end() calls, the results in @args will be valid
>  * as proper locks will be held.  After the end() is called, all the fields
>  * in @follow_pfnmap_args will be invalid to be further accessed.  Further
>  * use of such information after end() may require proper synchronizations
>  * by the caller with page table updates, otherwise it can create a
>  * security bug.

I would specifically emphasis that the pfn may not be used after
end. That is the primary mistake people have made.

They think it is a PFN so it is safe.

> It sounds like we need some mmu notifiers when mapping the IOMMU pgtables,
> as long as there's MMIO-region / P2P involved.  It'll make sure when
> tearing down the BAR mappings, the devices will at least see the same view
> as the processors.

I think the mmu notifiers can trigger too often for this to be
practical for DMA :(

Jason

Peter Xu Aug. 15, 2024, 6:52 p.m. UTC | #8

On Thu, Aug 15, 2024 at 02:24:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2024 at 01:21:01PM -0400, Peter Xu wrote:
> > > Why? Either the function only returns PFN map no-struct page things or
> > > it returns struct page stuff too, in which case why bother to check
> > > the VMA flags if the caller already has to be correct for struct page
> > > backed results?
> > > 
> > > This function is only safe to use under the proper locking, and under
> > > those rules it doesn't matter at all what the result is..
> > 
> > Do you mean we should drop the PFNMAP|IO check?
> 
> Yeah
> 
> >  I didn't see all the
> > callers to say that they won't rely on proper failing of !PFNMAP&&!IO vmas
> > to work alright.  So I assume we should definitely keep them around.
> 
> But as before, if we care about this we should be using vm_normal_page
> as that is sort of abusing the PFNMAP flags.

I can't say it's abusing..  Taking access_remote_vm() as example again, it
can go back as far as 2008 with Rik's commit here:

    commit 28b2ee20c7cba812b6f2ccf6d722cf86d00a84dc
    Author: Rik van Riel <riel@redhat.com>
    Date:   Wed Jul 23 21:27:05 2008 -0700

    access_process_vm device memory infrastructure

So it starts with having GUP failing pfnmaps first for remote vm access, as
what we also do right now with check_vma_flags(), then this whole walker is
a remedy for that.

It isn't used at all for normal VMAs, unless it's a private pfnmap mapping
which should be extremely rare, or if it's IO+!PFNMAP, which is a world I
am not familiar with..

In all cases, I hope we can still leave this alone in the huge pfnmap
effort, as they do not yet to be closely relevant.  From that POV, this
patch as simple as "teach follow_pte() to know huge mappings", while it's
just that we can't modify on top when the old interface won't work when
stick with pte_t.  Most of the rest was inherited from follow_pte();
there're still some trivial changes elsewhere, but here on the vma flag
check we stick the same with old.

> 
> > >   Any physical address obtained through this API is only valid while
> > >   the @follow_pfnmap_args. Continuing to use the address after end(),
> > >   without some other means to synchronize with page table updates
> > >   will create a security bug.
> > 
> > Some misuse on wordings here (e.g. we don't return PA but PFN), and some
> > sentence doesn't seem to be complete.. but I think I get the "scary" part
> > of it.  How about this, appending the scary part to the end?
> > 
> >  * During the start() and end() calls, the results in @args will be valid
> >  * as proper locks will be held.  After the end() is called, all the fields
> >  * in @follow_pfnmap_args will be invalid to be further accessed.  Further
> >  * use of such information after end() may require proper synchronizations
> >  * by the caller with page table updates, otherwise it can create a
> >  * security bug.
> 
> I would specifically emphasis that the pfn may not be used after
> end. That is the primary mistake people have made.
> 
> They think it is a PFN so it is safe.

I understand your concern. It's just that it seems still legal to me to use
it as long as proper action is taken.

I hope "require proper synchronizations" would be the best way to phrase
this matter, but maybe you have even better suggestion to put this, which
I'm definitely open to that too.

> 
> > It sounds like we need some mmu notifiers when mapping the IOMMU pgtables,
> > as long as there's MMIO-region / P2P involved.  It'll make sure when
> > tearing down the BAR mappings, the devices will at least see the same view
> > as the processors.
> 
> I think the mmu notifiers can trigger too often for this to be
> practical for DMA :(

I guess the DMAs are fine as normally the notifier will be no-op, as long
as the BAR enable/disable happens rare.  But yeah, I see you point, and
that is probably a concern if those notifier needs to be kicked off and
walk a bunch of MMIO regions, even if 99% of the cases it'll do nothing.

Thanks,

Sean Christopherson Aug. 16, 2024, 11:12 p.m. UTC | #9

On Fri, Aug 09, 2024, Peter Xu wrote:
> Introduce a pair of APIs to follow pfn mappings to get entry information.
> It's very similar to what follow_pte() does before, but different in that
> it recognizes huge pfn mappings.

...

> +int follow_pfnmap_start(struct follow_pfnmap_args *args);
> +void follow_pfnmap_end(struct follow_pfnmap_args *args);

I find the start+end() terminology to be unintuitive.  E.g. I had to look at the
implementation to understand why KVM invoke fixup_user_fault() if follow_pfnmap_start()
failed.

What about follow_pfnmap_and_lock()?  And then maybe follow_pfnmap_unlock()?
Though that second one reads a little weird.

> + * Return: zero on success, -ve otherwise.

ve?

> +int follow_pfnmap_start(struct follow_pfnmap_args *args)
> +{
> +	struct vm_area_struct *vma = args->vma;
> +	unsigned long address = args->address;
> +	struct mm_struct *mm = vma->vm_mm;
> +	spinlock_t *lock;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp, p4d;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;
> +	pte_t *ptep, pte;
> +
> +	pfnmap_lockdep_assert(vma);
> +
> +	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> +		goto out;
> +
> +	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> +		goto out;

Why use goto intead of simply?

		return -EINVAL;

That's relevant because I think the cases where no PxE is found should return
-ENOENT, not -EINVAL.  E.g. if the caller doesn't precheck, then it can bail
immediately on EINVAL, but know that it's worth trying to fault-in the pfn on
ENOENT. 

> +retry:
> +	pgdp = pgd_offset(mm, address);
> +	if (pgd_none(*pgdp) || unlikely(pgd_bad(*pgdp)))
> +		goto out;
> +
> +	p4dp = p4d_offset(pgdp, address);
> +	p4d = READ_ONCE(*p4dp);
> +	if (p4d_none(p4d) || unlikely(p4d_bad(p4d)))
> +		goto out;
> +
> +	pudp = pud_offset(p4dp, address);
> +	pud = READ_ONCE(*pudp);
> +	if (pud_none(pud))
> +		goto out;
> +	if (pud_leaf(pud)) {
> +		lock = pud_lock(mm, pudp);
> +		if (!unlikely(pud_leaf(pud))) {
> +			spin_unlock(lock);
> +			goto retry;
> +		}
> +		pfnmap_args_setup(args, lock, NULL, pud_pgprot(pud),
> +				  pud_pfn(pud), PUD_MASK, pud_write(pud),
> +				  pud_special(pud));
> +		return 0;
> +	}
> +
> +	pmdp = pmd_offset(pudp, address);
> +	pmd = pmdp_get_lockless(pmdp);
> +	if (pmd_leaf(pmd)) {
> +		lock = pmd_lock(mm, pmdp);
> +		if (!unlikely(pmd_leaf(pmd))) {
> +			spin_unlock(lock);
> +			goto retry;
> +		}
> +		pfnmap_args_setup(args, lock, NULL, pmd_pgprot(pmd),
> +				  pmd_pfn(pmd), PMD_MASK, pmd_write(pmd),
> +				  pmd_special(pmd));
> +		return 0;
> +	}
> +
> +	ptep = pte_offset_map_lock(mm, pmdp, address, &lock);
> +	if (!ptep)
> +		goto out;
> +	pte = ptep_get(ptep);
> +	if (!pte_present(pte))
> +		goto unlock;
> +	pfnmap_args_setup(args, lock, ptep, pte_pgprot(pte),
> +			  pte_pfn(pte), PAGE_MASK, pte_write(pte),
> +			  pte_special(pte));
> +	return 0;
> +unlock:
> +	pte_unmap_unlock(ptep, lock);
> +out:
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(follow_pfnmap_start);

David Hildenbrand Aug. 17, 2024, 11:05 a.m. UTC | #10

On 17.08.24 01:12, Sean Christopherson wrote:
> On Fri, Aug 09, 2024, Peter Xu wrote:
>> Introduce a pair of APIs to follow pfn mappings to get entry information.
>> It's very similar to what follow_pte() does before, but different in that
>> it recognizes huge pfn mappings.
> 
> ...
> 
>> +int follow_pfnmap_start(struct follow_pfnmap_args *args);
>> +void follow_pfnmap_end(struct follow_pfnmap_args *args);
> 
> I find the start+end() terminology to be unintuitive.  E.g. I had to look at the
> implementation to understand why KVM invoke fixup_user_fault() if follow_pfnmap_start()
> failed.

It roughly matches folio_walk_start() / folio_walk_end(), that I 
recently introduced.

Maybe we should call it pfnmap_walk_start() / pfnmap_walk_end() here, to 
remove the old "follow" semantics for good.

> 
> What about follow_pfnmap_and_lock()?  And then maybe follow_pfnmap_unlock()?
> Though that second one reads a little weird.

Yes, I prefer start/end (lock/unlock reads like an implementation 
detail). But whatever we do, let's try doing something that is 
consistent with existing stuff.

Peter Xu Aug. 21, 2024, 7:10 p.m. UTC | #11

On Fri, Aug 16, 2024 at 04:12:24PM -0700, Sean Christopherson wrote:
> On Fri, Aug 09, 2024, Peter Xu wrote:
> > Introduce a pair of APIs to follow pfn mappings to get entry information.
> > It's very similar to what follow_pte() does before, but different in that
> > it recognizes huge pfn mappings.
> 
> ...
> 
> > +int follow_pfnmap_start(struct follow_pfnmap_args *args);
> > +void follow_pfnmap_end(struct follow_pfnmap_args *args);
> 
> I find the start+end() terminology to be unintuitive.  E.g. I had to look at the
> implementation to understand why KVM invoke fixup_user_fault() if follow_pfnmap_start()
> failed.
> 
> What about follow_pfnmap_and_lock()?  And then maybe follow_pfnmap_unlock()?
> Though that second one reads a little weird.

If to go with the _lock() I tend to drop "and" to follow_pfnmap_[un]lock().
However looks like David preferred me keeping the name, so we don't reach a
quorum yet.  I'm happy to change the name as long as we have enough votes..

> 
> > + * Return: zero on success, -ve otherwise.
> 
> ve?

This one came from the old follow_pte() and I kept it. I only knew this
after search: a short way to write "negative" (while positive is "+ve").

Doesn't look like something productive.. I'll spell it out in the next
version.

> 
> > +int follow_pfnmap_start(struct follow_pfnmap_args *args)
> > +{
> > +	struct vm_area_struct *vma = args->vma;
> > +	unsigned long address = args->address;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	spinlock_t *lock;
> > +	pgd_t *pgdp;
> > +	p4d_t *p4dp, p4d;
> > +	pud_t *pudp, pud;
> > +	pmd_t *pmdp, pmd;
> > +	pte_t *ptep, pte;
> > +
> > +	pfnmap_lockdep_assert(vma);
> > +
> > +	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > +		goto out;
> > +
> > +	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> > +		goto out;
> 
> Why use goto intead of simply?
> 
> 		return -EINVAL;
> 
> That's relevant because I think the cases where no PxE is found should return
> -ENOENT, not -EINVAL.  E.g. if the caller doesn't precheck, then it can bail
> immediately on EINVAL, but know that it's worth trying to fault-in the pfn on
> ENOENT. 

I tend to avoid changing the retval in this series to make the goal of this
patchset simple.

One issue is I _think_ there's one ioctl() that will rely on this retval:

      acrn_dev_ioctl ->
        acrn_vm_memseg_map ->
          acrn_vm_ram_map ->
            follow_pfnmap_start

So we may want to try check with people to not break it..

Thanks,

[09/19] mm: New follow_pfnmap API

Commit Message

Comments

Patch