
[1/2] mm: provide a sane PTE walking API for modules

Message ID 20210205103259.42866-2-pbonzini@redhat.com (mailing list archive)
State New, archived
Series KVM: do not assume PTE is writable after follow_pfn

Commit Message

Paolo Bonzini Feb. 5, 2021, 10:32 a.m. UTC
Currently, the follow_pfn function is exported for modules but
follow_pte is not.  However, follow_pfn is very easy to misuse,
because it gives the caller no way to check the PTE's protection
bits (so most of its callers simply assume the page is writable!)
and because it returns only after the page table lock has already
been dropped.

Provide instead a simplified version of follow_pte that does
not have the pmdpp and range arguments.  The older version
survives as follow_invalidate_pte() for use by fs/dax.c.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/s390/pci/pci_mmio.c |  2 +-
 fs/dax.c                 |  5 +++--
 include/linux/mm.h       |  6 ++++--
 mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
 4 files changed, 38 insertions(+), 10 deletions(-)
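
To make the misuse concrete, here is a minimal sketch (not part of the
patch) of a caller using the new follow_pte() API.  The helper name and
error handling are hypothetical, but the pattern it shows, checking the
PTE's protection bits while the page table lock is still held, is
exactly what follow_pfn() cannot express:

/*
 * Hypothetical caller sketch: resolve a user address to a PFN, but
 * only if the mapping is writable.  The caller must hold mmap_lock
 * for read, per the follow_pte() documentation.
 */
static int pfn_if_writable(struct vm_area_struct *vma, unsigned long addr,
			   unsigned long *pfn)
{
	pte_t *ptep;
	spinlock_t *ptl;
	int ret;

	ret = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
	if (ret)
		return ret;

	/* The permission check that follow_pfn() callers cannot do. */
	if (!pte_write(*ptep)) {
		pte_unmap_unlock(ptep, ptl);
		return -EFAULT;
	}

	*pfn = pte_pfn(*ptep);
	/* The PTE contents are only stable until the lock is dropped. */
	pte_unmap_unlock(ptep, ptl);
	return 0;
}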

Comments

Jason Gunthorpe Feb. 5, 2021, 1:49 p.m. UTC | #1
On Fri, Feb 05, 2021 at 05:32:58AM -0500, Paolo Bonzini wrote:
> Currently, the follow_pfn function is exported for modules but
> follow_pte is not.  However, follow_pfn is very easy to misuse,
> because it gives the caller no way to check the PTE's protection
> bits (so most of its callers simply assume the page is writable!)
> and because it returns only after the page table lock has already
> been dropped.
> 
> Provide instead a simplified version of follow_pte that does
> not have the pmdpp and range arguments.  The older version
> survives as follow_invalidate_pte() for use by fs/dax.c.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/s390/pci/pci_mmio.c |  2 +-
>  fs/dax.c                 |  5 +++--
>  include/linux/mm.h       |  6 ++++--
>  mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
>  4 files changed, 38 insertions(+), 10 deletions(-)

Looks good to me, thanks

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Christoph Hellwig Feb. 8, 2021, 5:39 p.m. UTC | #2
> +int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
> +			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
> +			  spinlock_t **ptlp);

This adds a very pointless, overly long line.

> +/**
> + * follow_pte - look up PTE at a user virtual address
> + * @mm: the mm_struct of the target address space
> + * @address: user virtual address
> + * @ptepp: location to store found PTE
> + * @ptlp: location to store the lock for the PTE
> + *
> + * On a successful return, the pointer to the PTE is stored in @ptepp;
> + * the corresponding lock is taken and its location is stored in @ptlp.
> + * The contents of the PTE are only stable until @ptlp is released;
> + * any further use, if any, must be protected against invalidation
> + * with MMU notifiers.
> + *
> + * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
> + * should be taken for read.
> + *
> + * Return: zero on success, -ve otherwise.
> + */
> +int follow_pte(struct mm_struct *mm, unsigned long address,
> +	       pte_t **ptepp, spinlock_t **ptlp)
> +{
> +	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
> +}
> +EXPORT_SYMBOL_GPL(follow_pte);

I still don't think this is good as a general API.  Please document this
as KVM only for now, and hopefully next merge window I'll finish an
export variant restricting us to specific modules.
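
One plausible shape for such a restricted export is the kernel's
existing symbol namespace mechanism; this is only an illustrative
sketch, not Christoph's actual series, and the namespace name below
is made up:

/* Export into a dedicated symbol namespace so that only modules
 * which explicitly import the namespace can link against
 * follow_pte(). */
EXPORT_SYMBOL_NS_GPL(follow_pte, MM_PTE_WALK);

/* ...and in the consuming module, e.g. kvm: */
MODULE_IMPORT_NS(MM_PTE_WALK);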
Paolo Bonzini Feb. 8, 2021, 6:18 p.m. UTC | #3
On 08/02/21 18:39, Christoph Hellwig wrote:
>> +int follow_pte(struct mm_struct *mm, unsigned long address,
>> +	       pte_t **ptepp, spinlock_t **ptlp)
>> +{
>> +	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
>> +}
>> +EXPORT_SYMBOL_GPL(follow_pte);
>
> I still don't think this is good as a general API.  Please document this
> as KVM only for now, and hopefully next merge window I'll finish an
> export variant restricting us to specific modules.

Fair enough.  I would expect that pretty much everyone using follow_pfn 
will at least want to switch to this one (as it's less bad and not 
impossible to use correctly), but I'll squash this in:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 90b527260edf..24b292fce8e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1659,8 +1659,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
  int
  copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
  int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
-			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-			  spinlock_t **ptlp);
+			  struct mmu_notifier_range *range, pte_t **ptepp,
+			  pmd_t **pmdpp, spinlock_t **ptlp);
  int follow_pte(struct mm_struct *mm, unsigned long address,
  	       pte_t **ptepp, spinlock_t **ptlp);
  int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 3632f7416248..c8679b15c004 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4792,6 +4792,9 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
   * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
   * should be taken for read.
   *
+ * KVM uses this function.  While it is arguably less bad than
+ * ``follow_pfn``, it is not a good general-purpose API.
+ *
   * Return: zero on success, -ve otherwise.
   */
  int follow_pte(struct mm_struct *mm, unsigned long address,
@@ -4809,6 +4812,9 @@ EXPORT_SYMBOL_GPL(follow_pte);
   *
   * Only IO mappings and raw PFN mappings are allowed.
   *
+ * This function does not allow the caller to read the permissions
+ * of the PTE.  Do not use it.
+ *
   * Return: zero and the pfn at @pfn on success, -ve otherwise.
   */
  int follow_pfn(struct vm_area_struct *vma, unsigned long address,


(apologies in advance if Thunderbird destroys the patch).

Paolo
Christoph Hellwig Feb. 9, 2021, 8:14 a.m. UTC | #4
On Mon, Feb 08, 2021 at 07:18:56PM +0100, Paolo Bonzini wrote:
> Fair enough.  I would expect that pretty much everyone using follow_pfn will
> at least want to switch to this one (as it's less bad and not impossible to
> use correctly), but I'll squash this in:


Daniel looked into them, so he may correct me, but the other follow_pfn
users and their destiny are:

 - SGX, which is not modular and I think I just saw a patch to kill them
 - v4l videobuf and frame vector: I think those are going away
   entirely as they implement a rather broken pre-dmabuf P2P scheme
 - vfio: should use MMU notifiers eventually

Daniel, what happened to your follow_pfn series?
Paolo Bonzini Feb. 9, 2021, 9:19 a.m. UTC | #5
On 09/02/21 09:14, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:18:56PM +0100, Paolo Bonzini wrote:
>> Fair enough.  I would expect that pretty much everyone using follow_pfn will
>> at least want to switch to this one (as it's less bad and not impossible to
>> use correctly), but I'll squash this in:
> 
> 
> Daniel looked into them, so he may correct me, but the other follow_pfn
> users and their destiny are:
> 
>   - SGX, which is not modular and I think I just saw a patch to kill them
>   - v4l videobuf and frame vector: I think those are going away
>     entirely as they implement a rather broken pre-dmabuf P2P scheme
>   - vfio: should use MMU notifiers eventually

Yes, I'm thinking mostly of vfio, which could use follow_pte as a 
short-term fix for just the missing permission check.

There's also s390 PCI, which is likewise not modular.

Paolo

> Daniel, what happened to your follow_pfn series?

Patch

diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 18f2d10c3176..5922dce94bfd 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -170,7 +170,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, mmio_addr, &ptep, &ptl);
 	if (ret)
 		goto out_unlock_mmap;
 
@@ -311,7 +311,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, mmio_addr, &ptep, &ptl);
 	if (ret)
 		goto out_unlock_mmap;
 
diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..b14874c7b983 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -810,11 +810,12 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 		address = pgoff_address(index, vma);
 
 		/*
-		 * Note because we provide range to follow_pte it will call
+		 * follow_invalidate_pte() will use the range to call
 		 * mmu_notifier_invalidate_range_start() on our behalf before
 		 * taking any lock.
 		 */
-		if (follow_pte(vma->vm_mm, address, &range, &ptep, &pmdp, &ptl))
+		if (follow_invalidate_pte(vma->vm_mm, address, &range, &ptep,
+					  &pmdp, &ptl))
 			continue;
 
 		/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..90b527260edf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1658,9 +1658,11 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
+int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
+			  spinlock_t **ptlp);
 int follow_pte(struct mm_struct *mm, unsigned long address,
-		struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-		spinlock_t **ptlp);
+	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..b247d296931c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4709,9 +4709,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-int follow_pte(struct mm_struct *mm, unsigned long address,
-	       struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
-	       spinlock_t **ptlp)
+int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range, pte_t **ptepp,
+			  pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4776,6 +4776,31 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
 	return -EINVAL;
 }
 
+/**
+ * follow_pte - look up PTE at a user virtual address
+ * @mm: the mm_struct of the target address space
+ * @address: user virtual address
+ * @ptepp: location to store found PTE
+ * @ptlp: location to store the lock for the PTE
+ *
+ * On a successful return, the pointer to the PTE is stored in @ptepp;
+ * the corresponding lock is taken and its location is stored in @ptlp.
+ * The contents of the PTE are only stable until @ptlp is released;
+ * any further use, if any, must be protected against invalidation
+ * with MMU notifiers.
+ *
+ * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
+ * should be taken for read.
+ *
+ * Return: zero on success, -ve otherwise.
+ */
+int follow_pte(struct mm_struct *mm, unsigned long address,
+	       pte_t **ptepp, spinlock_t **ptlp)
+{
+	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
+}
+EXPORT_SYMBOL_GPL(follow_pte);
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4796,7 +4821,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, NULL, &ptep, NULL, &ptl);
+	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
 	if (ret)
 		return ret;
 	*pfn = pte_pfn(*ptep);
@@ -4817,7 +4842,7 @@ int follow_phys(struct vm_area_struct *vma,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		goto out;
 
-	if (follow_pte(vma->vm_mm, address, NULL, &ptep, NULL, &ptl))
+	if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
 		goto out;
 	pte = *ptep;