diff mbox series

[RFC,1/4] mm: pagewalk: add the ability to install PTEs

Message ID 59e218670565accf978aeb8cf4745de4c0738773.1727440966.git.lorenzo.stoakes@oracle.com (mailing list archive)
State New
Headers show
Series implement lightweight guard pages | expand

Commit Message

Lorenzo Stoakes Sept. 27, 2024, 12:51 p.m. UTC
The existing generic pagewalk logic permits the walking of page tables,
invoking callbacks at individual page table levels via user-provided
mm_walk_ops callbacks.

This is useful for traversing existing page table entries, but precludes
the ability to establish new ones.

Existing mechanism for performing a walk which also installs page table
entries if necessary are heavily duplicated throughout the kernel, each
with semantic differences from one another and largely unavailable for use
elsewhere.

Rather than add yet another implementation, we extend the generic pagewalk
logic to enable the installation of page table entries by adding a new
install_pte() callback in mm_walk_ops. If this is specified, then upon
encountering a missing page table entry, we allocate and install a new one
and continue the traversal.

If a THP huge page is encountered, we make use of existing logic to split
it. Then once we reach the PTE level, we invoke the install_pte() callback
which provides a PTE entry to install. We do not support hugetlb at this
stage.

If this function returns an error, or an allocation fails during the
operation, we abort the operation altogether. It is up to the caller to
deal appropriately with partially populated page table ranges.

If install_pte() is defined, the semantics of pte_entry() change - this
callback is then only invoked if the entry already exists. This is a useful
property, as it allows a caller to handle existing PTEs while installing
new ones where necessary in the specified range.

If install_pte() is not defined, then there is no functional difference to
this patch, so all existing logic will work precisely as it did before.

As we only permit the installation of PTEs where a mapping does not already
exist there is no need for TLB management, however we do invoke
update_mmu_cache() for architectures which require manual maintenance of
mappings for other CPUs.

We explicitly do not allow the existing page walk API to expose this
feature as it is dangerous and intended for internal mm use only. Therefore
we provide a new walk_page_range_mm() function exposed only to
mm/internal.h.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/pagewalk.h |  18 +++-
 mm/internal.h            |   6 ++
 mm/pagewalk.c            | 174 ++++++++++++++++++++++++++-------------
 3 files changed, 136 insertions(+), 62 deletions(-)

Comments

Lorenzo Stoakes Sept. 30, 2024, 4:05 p.m. UTC | #1
On Fri, Sep 27, 2024 at 01:51:11PM GMT, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
>
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
>
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
>
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
>
> If a THP huge page is encountered, we make use of existing logic to split
> it. Then once we reach the PTE level, we invoke the install_pte() callback
> which provides a PTE entry to install. We do not support hugetlb at this
> stage.
>
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
>
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
>
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
>
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/pagewalk.h |  18 +++-
>  mm/internal.h            |   6 ++
>  mm/pagewalk.c            | 174 ++++++++++++++++++++++++++-------------
>  3 files changed, 136 insertions(+), 62 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index f5eb5a32aeed..9700a29f8afb 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -25,12 +25,15 @@ enum page_walk_lock {
>   *			this handler is required to be able to handle
>   *			pmd_trans_huge() pmds.  They may simply choose to
>   *			split_huge_page() instead of handling it explicitly.
> - * @pte_entry:		if set, called for each PTE (lowest-level) entry,
> - *			including empty ones
> + * @pte_entry:		if set, called for each PTE (lowest-level) entry
> + *			including empty ones, except if @install_pte is set.
> + *			If @install_pte is set, @pte_entry is called only for
> + *			existing PTEs.
>   * @pte_hole:		if set, called for each hole at all levels,
>   *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
>   *			Any folded depths (where PTRS_PER_P?D is equal to 1)
> - *			are skipped.
> + *			are skipped. If @install_pte is specified, this will
> + *			not trigger for any populated ranges.
>   * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
>   *			function is called with the vma lock held, in order to
>   *			protect against a concurrent freeing of the pte_t* or
> @@ -51,6 +54,13 @@ enum page_walk_lock {
>   * @pre_vma:            if set, called before starting walk on a non-null vma.
>   * @post_vma:           if set, called after a walk on a non-null vma, provided
>   *                      that @pre_vma and the vma walk succeeded.
> + * @install_pte:        if set, missing page table entries are installed and
> + *                      thus all levels are always walked in the specified
> + *                      range. This callback is then invoked at the PTE level
> + *                      (having split any THP pages prior), providing the PTE to
> + *                      install. If allocations fail, the walk is aborted. This
> + *                      operation is only available for userland memory. Not
> + *                      usable for hugetlb ranges.
>   *
>   * p?d_entry callbacks are called even if those levels are folded on a
>   * particular architecture/configuration.
> @@ -76,6 +86,8 @@ struct mm_walk_ops {
>  	int (*pre_vma)(unsigned long start, unsigned long end,
>  		       struct mm_walk *walk);
>  	void (*post_vma)(struct mm_walk *walk);
> +	int (*install_pte)(unsigned long addr, unsigned long next,
> +			   pte_t *ptep, struct mm_walk *walk);
>  	enum page_walk_lock walk_lock;
>  };
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 93083bbeeefa..1bfe45b7fa08 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
>  #include <linux/pagemap.h>
> +#include <linux/pagewalk.h>
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> @@ -1443,4 +1444,9 @@ static inline void accept_page(struct page *page)
>  }
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
>
> +/* pagewalk.c */
> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		void *private);
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 461ea3bbd8d9..c3b9624948c1 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -6,6 +6,8 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>

We need to add an include here for asm/tlbflush.h I believe to make
update_mmu_cache() available, this was overlooked as for x86 this is included
through some other header.

I will add on respin.

>
> +#include "internal.h"
> +
>  /*
>   * We want to know the real level where a entry is located ignoring any
>   * folding of levels which may be happening. For example if p4d is folded then
> @@ -29,9 +31,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>  	int err = 0;
>
>  	for (;;) {
> -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> -		if (err)
> -		       break;
> +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> +			pte_t new_pte;
> +
> +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> +					       walk);
> +			if (err)
> +				break;
> +
> +			set_pte_at(walk->mm, addr, pte, new_pte);
> +			/* Non-present before, so for arches that need it. */
> +			if (!WARN_ON_ONCE(walk->no_vma))
> +				update_mmu_cache(walk->vma, addr, pte);
> +		} else {
> +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> +			if (err)
> +				break;
> +		}
>  		if (addr >= end - PAGE_SIZE)
>  			break;
>  		addr += PAGE_SIZE;
> @@ -89,11 +105,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  again:
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_none(*pmd)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pte_alloc(walk->mm, pmd);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>
>  		walk->action = ACTION_SUBTREE;
> @@ -116,7 +135,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  		 */
>  		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
>  		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		    !(ops->pte_entry || ops->install_pte))
>  			continue;
>
>  		if (walk->vma)
> @@ -148,11 +167,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>
>  		walk->action = ACTION_SUBTREE;
> @@ -167,7 +189,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>
>  		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
>  		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		    !(ops->pmd_entry || ops->pte_entry || ops->install_pte))
>  			continue;
>
>  		if (walk->vma)
> @@ -196,18 +218,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	do {
>  		next = p4d_addr_end(addr, end);
>  		if (p4d_none_or_clear_bad(p4d)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pud_alloc(walk->mm, p4d, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  		if (ops->p4d_entry) {
>  			err = ops->p4d_entry(p4d, addr, next, walk);
>  			if (err)
>  				break;
>  		}
> -		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> +		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
> +		    ops->install_pte)
>  			err = walk_pud_range(p4d, addr, next, walk);
>  		if (err)
>  			break;
> @@ -231,18 +257,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none_or_clear_bad(pgd)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __p4d_alloc(walk->mm, pgd, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, 0, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  		if (ops->pgd_entry) {
>  			err = ops->pgd_entry(pgd, addr, next, walk);
>  			if (err)
>  				break;
>  		}
> -		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> +		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> +		    ops->pte_entry || ops->install_pte)
>  			err = walk_p4d_range(pgd, addr, next, walk);
>  		if (err)
>  			break;
> @@ -334,6 +364,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  	int err = 0;
>  	struct vm_area_struct *vma = walk->vma;
>  	const struct mm_walk_ops *ops = walk->ops;
> +	bool is_hugetlb = is_vm_hugetlb_page(vma);
> +
> +	/* We do not support hugetlb PTE installation. */
> +	if (ops->install_pte && is_hugetlb)
> +		return -EINVAL;
>
>  	if (ops->pre_vma) {
>  		err = ops->pre_vma(start, end, walk);
> @@ -341,7 +376,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  			return err;
>  	}
>
> -	if (is_vm_hugetlb_page(vma)) {
> +	if (is_hugetlb) {
>  		if (ops->hugetlb_entry)
>  			err = walk_hugetlb_range(start, end, walk);
>  	} else
> @@ -380,47 +415,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>  #endif
>  }
>
> -/**
> - * walk_page_range - walk page table with caller specific callbacks
> - * @mm:		mm_struct representing the target process of page table walk
> - * @start:	start address of the virtual address range
> - * @end:	end address of the virtual address range
> - * @ops:	operation to call during the walk
> - * @private:	private data for callbacks' usage
> - *
> - * Recursively walk the page table tree of the process represented by @mm
> - * within the virtual address range [@start, @end). During walking, we can do
> - * some caller-specific works for each entry, by setting up pmd_entry(),
> - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> - * callbacks, the associated entries/pages are just ignored.
> - * The return values of these callbacks are commonly defined like below:
> - *
> - *  - 0  : succeeded to handle the current entry, and if you don't reach the
> - *         end address yet, continue to walk.
> - *  - >0 : succeeded to handle the current entry, and return to the caller
> - *         with caller specific value.
> - *  - <0 : failed to handle the current entry, and return to the caller
> - *         with error code.
> - *
> - * Before starting to walk page table, some callers want to check whether
> - * they really want to walk over the current vma, typically by checking
> - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> - * purpose.
> - *
> - * If operations need to be staged before and committed after a vma is walked,
> - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> - * since it is intended to handle commit-type operations, can't return any
> - * errors.
> - *
> - * struct mm_walk keeps current values of some common data like vma and pmd,
> - * which are useful for the access from callbacks. If you want to pass some
> - * caller-specific data to callbacks, @private should be helpful.
> - *
> - * Locking:
> - *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> - *   because these function traverse vma list and/or access to vma's data.
> - */
> -int walk_page_range(struct mm_struct *mm, unsigned long start,
> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private)
>  {
> @@ -479,6 +474,57 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>  	return err;
>  }
>
> +/**
> + * walk_page_range - walk page table with caller specific callbacks
> + * @mm:		mm_struct representing the target process of page table walk
> + * @start:	start address of the virtual address range
> + * @end:	end address of the virtual address range
> + * @ops:	operation to call during the walk
> + * @private:	private data for callbacks' usage
> + *
> + * Recursively walk the page table tree of the process represented by @mm
> + * within the virtual address range [@start, @end). During walking, we can do
> + * some caller-specific works for each entry, by setting up pmd_entry(),
> + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> + * callbacks, the associated entries/pages are just ignored.
> + * The return values of these callbacks are commonly defined like below:
> + *
> + *  - 0  : succeeded to handle the current entry, and if you don't reach the
> + *         end address yet, continue to walk.
> + *  - >0 : succeeded to handle the current entry, and return to the caller
> + *         with caller specific value.
> + *  - <0 : failed to handle the current entry, and return to the caller
> + *         with error code.
> + *
> + * Before starting to walk page table, some callers want to check whether
> + * they really want to walk over the current vma, typically by checking
> + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> + * purpose.
> + *
> + * If operations need to be staged before and committed after a vma is walked,
> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> + * since it is intended to handle commit-type operations, can't return any
> + * errors.
> + *
> + * struct mm_walk keeps current values of some common data like vma and pmd,
> + * which are useful for the access from callbacks. If you want to pass some
> + * caller-specific data to callbacks, @private should be helpful.
> + *
> + * Locking:
> + *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> + *   because these function traverse vma list and/or access to vma's data.
> + */
> +int walk_page_range(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		void *private)
> +{
> +	/* For internal use only. */
> +	if (ops->install_pte)
> +		return -EINVAL;
> +
> +	return walk_page_range_mm(mm, start, end, ops, private);
> +}
> +
>  /**
>   * walk_page_range_novma - walk a range of pagetables not backed by a vma
>   * @mm:		mm_struct representing the target process of page table walk
> @@ -494,7 +540,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * walking the kernel pages tables or page tables for firmware.
>   *
>   * Note: Be careful to walk the kernel pages tables, the caller may be need to
> - * take other effective approache (mmap lock may be insufficient) to prevent
> + * take other effective approaches (mmap lock may be insufficient) to prevent
>   * the intermediate kernel page tables belonging to the specified address range
>   * from being freed (e.g. memory hot-remove).
>   */
> @@ -511,7 +557,7 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  		.no_vma		= true
>  	};
>
> -	if (start >= end || !walk.mm)
> +	if (start >= end || !walk.mm || ops->install_pte)
>  		return -EINVAL;
>
>  	/*
> @@ -556,6 +602,9 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>  		return -EINVAL;
>  	if (start < vma->vm_start || end > vma->vm_end)
>  		return -EINVAL;
> +	/* For internal use only. */
> +	if (ops->install_pte)
> +		return -EINVAL;
>
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
>  	process_vma_walk_lock(vma, ops->walk_lock);
> @@ -574,6 +623,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>
>  	if (!walk.mm)
>  		return -EINVAL;
> +	/* For internal use only. */
> +	if (ops->install_pte)
> +		return -EINVAL;
>
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
>  	process_vma_walk_lock(vma, ops->walk_lock);
> @@ -623,6 +675,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>  	unsigned long start_addr, end_addr;
>  	int err = 0;
>
> +	/* For internal use only. */
> +	if (ops->install_pte)
> +		return -EINVAL;
> +
>  	lockdep_assert_held(&mapping->i_mmap_rwsem);
>  	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
>  				  first_index + nr - 1) {
> --
> 2.46.2
>
Jann Horn Oct. 11, 2024, 6:11 p.m. UTC | #2
On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
[...]
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Jann Horn <jannh@google.com>
Lorenzo Stoakes Oct. 14, 2024, 11:10 a.m. UTC | #3
On Fri, Oct 11, 2024 at 08:11:28PM +0200, Jann Horn wrote:
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> [...]
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Jann Horn <jannh@google.com>

Thanks!
Christoph Hellwig Oct. 15, 2024, 6:47 a.m. UTC | #4
Hi Lorenzo,

sorry for only replying to this so late.

On Fri, Sep 27, 2024 at 01:51:11PM +0100, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
> 
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
> 
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.

I do like the idea of having common code for installing page tables!

Minor nits below:

> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private)

It would be good to have a minimum level of documentation for this
function, including how it differs from walk_page_range and why
it should remain internal.

> +	/* For internal use only. */
> +	if (ops->install_pte)
> +		return -EINVAL;

And this should probably be expanded a bit, including that no exported
symbol should allow inserting arbitrary PTEs.  Maybe best done with
a helper to share that comment with the other places that have this
check.
Lorenzo Stoakes Oct. 15, 2024, 7:27 a.m. UTC | #5
On Mon, Oct 14, 2024 at 11:47:58PM -0700, Christoph Hellwig wrote:
> Hi Lorenzo,
>
> sorry for only replying to this so late.

No worries, and thanks for taking a look! :)

>
> On Fri, Sep 27, 2024 at 01:51:11PM +0100, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
>
> I do like the idea of having common code for installing page tables!
>

Awesome.

> Minor nits below:
>
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> >  		unsigned long end, const struct mm_walk_ops *ops,
> >  		void *private)
>
> It would be good to have a minimum level of documentation for this
> function, including how it differs from walk_page_range and why
> it should remain internal.

Will add on respin!

>
> > +	/* For internal use only. */
> > +	if (ops->install_pte)
> > +		return -EINVAL;
>
> And this should probably be expanded a bit, including that no exported
> symbol should allow inserting arbitrary PTEs.  Maybe best done with
> a helper to share that comment with the other places that have this
> check.
>

Yeah a helper makes sense actually, a more general 'are these ops valid?'
thing. Will update on next respin with some explanation.

The next iteration I plan to un-RFC as seems generally the concept is
unopposed for this series, will make these changes then.

Thanks!
diff mbox series

Patch

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index f5eb5a32aeed..9700a29f8afb 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -25,12 +25,15 @@  enum page_walk_lock {
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds.  They may simply choose to
  *			split_huge_page() instead of handling it explicitly.
- * @pte_entry:		if set, called for each PTE (lowest-level) entry,
- *			including empty ones
+ * @pte_entry:		if set, called for each PTE (lowest-level) entry
+ *			including empty ones, except if @install_pte is set.
+ *			If @install_pte is set, @pte_entry is called only for
+ *			existing PTEs.
  * @pte_hole:		if set, called for each hole at all levels,
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
- *			are skipped.
+ *			are skipped. If @install_pte is specified, this will
+ *			not trigger for any populated ranges.
  * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
  *			function is called with the vma lock held, in order to
  *			protect against a concurrent freeing of the pte_t* or
@@ -51,6 +54,13 @@  enum page_walk_lock {
  * @pre_vma:            if set, called before starting walk on a non-null vma.
  * @post_vma:           if set, called after a walk on a non-null vma, provided
  *                      that @pre_vma and the vma walk succeeded.
+ * @install_pte:        if set, missing page table entries are installed and
+ *                      thus all levels are always walked in the specified
+ *                      range. This callback is then invoked at the PTE level
+ *                      (having split any THP pages prior), providing the PTE to
+ *                      install. If allocations fail, the walk is aborted. This
+ *                      operation is only available for userland memory. Not
+ *                      usable for hugetlb ranges.
  *
  * p?d_entry callbacks are called even if those levels are folded on a
  * particular architecture/configuration.
@@ -76,6 +86,8 @@  struct mm_walk_ops {
 	int (*pre_vma)(unsigned long start, unsigned long end,
 		       struct mm_walk *walk);
 	void (*post_vma)(struct mm_walk *walk);
+	int (*install_pte)(unsigned long addr, unsigned long next,
+			   pte_t *ptep, struct mm_walk *walk);
 	enum page_walk_lock walk_lock;
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 93083bbeeefa..1bfe45b7fa08 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -12,6 +12,7 @@ 
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/pagemap.h>
+#include <linux/pagewalk.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -1443,4 +1444,9 @@  static inline void accept_page(struct page *page)
 }
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
+/* pagewalk.c */
+int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 461ea3bbd8d9..c3b9624948c1 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -6,6 +6,8 @@ 
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include "internal.h"
+
 /*
  * We want to know the real level where a entry is located ignoring any
  * folding of levels which may be happening. For example if p4d is folded then
@@ -29,9 +31,23 @@  static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
 	int err = 0;
 
 	for (;;) {
-		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
-		if (err)
-		       break;
+		if (ops->install_pte && pte_none(ptep_get(pte))) {
+			pte_t new_pte;
+
+			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
+					       walk);
+			if (err)
+				break;
+
+			set_pte_at(walk->mm, addr, pte, new_pte);
+			/* Non-present before, so for arches that need it. */
+			if (!WARN_ON_ONCE(walk->no_vma))
+				update_mmu_cache(walk->vma, addr, pte);
+		} else {
+			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
+			if (err)
+				break;
+		}
 		if (addr >= end - PAGE_SIZE)
 			break;
 		addr += PAGE_SIZE;
@@ -89,11 +105,14 @@  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 again:
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pte_alloc(walk->mm, pmd);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -116,7 +135,7 @@  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		 */
 		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
 		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pte_entry))
+		    !(ops->pte_entry || ops->install_pte))
 			continue;
 
 		if (walk->vma)
@@ -148,11 +167,14 @@  static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
  again:
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pmd_alloc(walk->mm, pud, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -167,7 +189,7 @@  static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
 		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pmd_entry || ops->pte_entry))
+		    !(ops->pmd_entry || ops->pte_entry || ops->install_pte))
 			continue;
 
 		if (walk->vma)
@@ -196,18 +218,22 @@  static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pud_alloc(walk->mm, p4d, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 		if (ops->p4d_entry) {
 			err = ops->p4d_entry(p4d, addr, next, walk);
 			if (err)
 				break;
 		}
-		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
+		    ops->install_pte)
 			err = walk_pud_range(p4d, addr, next, walk);
 		if (err)
 			break;
@@ -231,18 +257,22 @@  static int walk_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __p4d_alloc(walk->mm, pgd, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 		if (ops->pgd_entry) {
 			err = ops->pgd_entry(pgd, addr, next, walk);
 			if (err)
 				break;
 		}
-		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
+		    ops->pte_entry || ops->install_pte)
 			err = walk_p4d_range(pgd, addr, next, walk);
 		if (err)
 			break;
@@ -334,6 +364,11 @@  static int __walk_page_range(unsigned long start, unsigned long end,
 	int err = 0;
 	struct vm_area_struct *vma = walk->vma;
 	const struct mm_walk_ops *ops = walk->ops;
+	bool is_hugetlb = is_vm_hugetlb_page(vma);
+
+	/* We do not support hugetlb PTE installation. */
+	if (ops->install_pte && is_hugetlb)
+		return -EINVAL;
 
 	if (ops->pre_vma) {
 		err = ops->pre_vma(start, end, walk);
@@ -341,7 +376,7 @@  static int __walk_page_range(unsigned long start, unsigned long end,
 			return err;
 	}
 
-	if (is_vm_hugetlb_page(vma)) {
+	if (is_hugetlb) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
@@ -380,47 +415,7 @@  static inline void process_vma_walk_lock(struct vm_area_struct *vma,
 #endif
 }
 
-/**
- * walk_page_range - walk page table with caller specific callbacks
- * @mm:		mm_struct representing the target process of page table walk
- * @start:	start address of the virtual address range
- * @end:	end address of the virtual address range
- * @ops:	operation to call during the walk
- * @private:	private data for callbacks' usage
- *
- * Recursively walk the page table tree of the process represented by @mm
- * within the virtual address range [@start, @end). During walking, we can do
- * some caller-specific works for each entry, by setting up pmd_entry(),
- * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
- * callbacks, the associated entries/pages are just ignored.
- * The return values of these callbacks are commonly defined like below:
- *
- *  - 0  : succeeded to handle the current entry, and if you don't reach the
- *         end address yet, continue to walk.
- *  - >0 : succeeded to handle the current entry, and return to the caller
- *         with caller specific value.
- *  - <0 : failed to handle the current entry, and return to the caller
- *         with error code.
- *
- * Before starting to walk page table, some callers want to check whether
- * they really want to walk over the current vma, typically by checking
- * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
- * purpose.
- *
- * If operations need to be staged before and committed after a vma is walked,
- * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
- * since it is intended to handle commit-type operations, can't return any
- * errors.
- *
- * struct mm_walk keeps current values of some common data like vma and pmd,
- * which are useful for the access from callbacks. If you want to pass some
- * caller-specific data to callbacks, @private should be helpful.
- *
- * Locking:
- *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
- *   because these function traverse vma list and/or access to vma's data.
- */
-int walk_page_range(struct mm_struct *mm, unsigned long start,
+int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private)
 {
@@ -479,6 +474,57 @@  int walk_page_range(struct mm_struct *mm, unsigned long start,
 	return err;
 }
 
+/**
+ * walk_page_range - walk page table with caller specific callbacks
+ * @mm:		mm_struct representing the target process of page table walk
+ * @start:	start address of the virtual address range
+ * @end:	end address of the virtual address range
+ * @ops:	operation to call during the walk
+ * @private:	private data for callbacks' usage
+ *
+ * Recursively walk the page table tree of the process represented by @mm
+ * within the virtual address range [@start, @end). During walking, we can do
+ * some caller-specific works for each entry, by setting up pmd_entry(),
+ * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
+ * callbacks, the associated entries/pages are just ignored.
+ * The return values of these callbacks are commonly defined like below:
+ *
+ *  - 0  : succeeded to handle the current entry, and if you don't reach the
+ *         end address yet, continue to walk.
+ *  - >0 : succeeded to handle the current entry, and return to the caller
+ *         with caller specific value.
+ *  - <0 : failed to handle the current entry, and return to the caller
+ *         with error code.
+ *
+ * Before starting to walk page table, some callers want to check whether
+ * they really want to walk over the current vma, typically by checking
+ * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
+ * purpose.
+ *
+ * If operations need to be staged before and committed after a vma is walked,
+ * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
+ * since it is intended to handle commit-type operations, can't return any
+ * errors.
+ *
+ * struct mm_walk keeps current values of some common data like vma and pmd,
+ * which are useful for the access from callbacks. If you want to pass some
+ * caller-specific data to callbacks, @private should be helpful.
+ *
+ * Locking:
+ *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
+ *   because these function traverse vma list and/or access to vma's data.
+ */
+int walk_page_range(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private)
+{
+	/* For internal use only. */
+	if (ops->install_pte)
+		return -EINVAL;
+
+	return walk_page_range_mm(mm, start, end, ops, private);
+}
+
 /**
  * walk_page_range_novma - walk a range of pagetables not backed by a vma
  * @mm:		mm_struct representing the target process of page table walk
@@ -494,7 +540,7 @@  int walk_page_range(struct mm_struct *mm, unsigned long start,
  * walking the kernel pages tables or page tables for firmware.
  *
  * Note: Be careful to walk the kernel pages tables, the caller may be need to
- * take other effective approache (mmap lock may be insufficient) to prevent
+ * take other effective approaches (mmap lock may be insufficient) to prevent
  * the intermediate kernel page tables belonging to the specified address range
  * from being freed (e.g. memory hot-remove).
  */
@@ -511,7 +557,7 @@  int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 		.no_vma		= true
 	};
 
-	if (start >= end || !walk.mm)
+	if (start >= end || !walk.mm || ops->install_pte)
 		return -EINVAL;
 
 	/*
@@ -556,6 +602,9 @@  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 		return -EINVAL;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
+	/* For internal use only. */
+	if (ops->install_pte)
+		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
@@ -574,6 +623,9 @@  int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 
 	if (!walk.mm)
 		return -EINVAL;
+	/* For internal use only. */
+	if (ops->install_pte)
+		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
@@ -623,6 +675,10 @@  int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 	unsigned long start_addr, end_addr;
 	int err = 0;
 
+	/* For internal use only. */
+	if (ops->install_pte)
+		return -EINVAL;
+
 	lockdep_assert_held(&mapping->i_mmap_rwsem);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
 				  first_index + nr - 1) {