diff mbox series

[v3,04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

Message ID 20210527201927.29586-5-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series userfaultfd-wp: Support shmem and hugetlbfs | expand

Commit Message

Peter Xu May 27, 2021, 8:19 p.m. UTC
This patch introduces a very special swap-like pte for file-backed memories.

Currently it's only defined for x86_64 only, but as long as any arch that can
properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it should
conceptually work too.

We will use this special pte to arm the ptes that got either unmapped or
swapped out for a file-backed region that was previously wr-protected.  This
special pte could trigger a page fault just like swap entries, and as long as
the page fault will satisfy pte_none()==false && pte_present()==false.

Then we can revive the special pte into a normal pte backed by the page cache.

This idea is greatly inspired by Hugh and Andrea in the discussion, which is
referenced in the links below.

The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as the
special pte.  The current solution (as pointed out by Andrea) is slightly
preferred in that we don't even need swp_entry_t knowledge at all in trapping
these accesses.  Meanwhile, we also reuse _PAGE_SWP_UFFD_WP from the anonymous
swp entries.

This patch only introduces the special pte and its operators.  It's not yet
applied to have any functional difference.

Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h     | 28 ++++++++++++++++++++++++++++
 include/asm-generic/pgtable_uffd.h |  3 +++
 include/linux/userfaultfd_k.h      | 21 +++++++++++++++++++++
 3 files changed, 52 insertions(+)

Comments

Alistair Popple May 28, 2021, 8:32 a.m. UTC | #1
On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> This patch introduces a very special swap-like pte for file-backed memories.
> 
> Currently it's only defined for x86_64 only, but as long as any arch that
> can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> should conceptually work too.
> 
> We will use this special pte to arm the ptes that got either unmapped or
> swapped out for a file-backed region that was previously wr-protected.  This
> special pte could trigger a page fault just like swap entries, and as long
> as the page fault will satisfy pte_none()==false && pte_present()==false.
> 
> Then we can revive the special pte into a normal pte backed by the page
> cache.
> 
> This idea is greatly inspired by Hugh and Andrea in the discussion, which is
> referenced in the links below.
> 
> The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as
> the special pte.  The current solution (as pointed out by Andrea) is
> slightly preferred in that we don't even need swp_entry_t knowledge at all
> in trapping these accesses.  Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> from the anonymous swp entries.

So to confirm my understanding the reason you use this special swap pte 
instead of a new swp_type is that you only need the fault and have no extra 
information that needs storing in the pte?

Personally I think it might be better to define a new swp_type for this rather 
than introducing a new arch-specific concept. swp_type entries are portable so 
wouldn't need extra arch-specific bits defined. And as I understand things not 
all architectures (eg. ARM) have spare bits in their swap entry encoding 
anyway so would have to reserve a bit specifically for this which would be 
less efficient than using a swp_type.

Anyway it seems I missed the initial discussion so don't have a strong opinion 
here, mainly just wanted to check my understanding of what's required and how 
these special entries work.

> This patch only introduces the special pte and its operators.  It's not yet
> applied to have any functional difference.
> 
> Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
> Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/include/asm/pgtable.h     | 28 ++++++++++++++++++++++++++++
>  include/asm-generic/pgtable_uffd.h |  3 +++
>  include/linux/userfaultfd_k.h      | 21 +++++++++++++++++++++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index b1099f2d9800..9781ba2da049 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1329,6 +1329,34 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t
> pmd) #endif
> 
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +
> +/*
> + * This is a very special swap-like pte that marks this pte as
> "wr-protected" + * by userfaultfd-wp.  It should only exist for file-backed
> memory where the + * page (previously got wr-protected) has been unmapped
> or swapped out. + *
> + * For anonymous memories, the userfaultfd-wp _PAGE_SWP_UFFD_WP bit is kept
> + * along with a real swp entry instead.
> + *
> + * Let's make some rules for this special pte:
> + *
> + * (1) pte_none()==false, so that it'll not trigger a missing page fault.
> + *
> + * (2) pte_present()==false, so that it's recognized as swap (is_swap_pte).
> + *
> + * (3) pte_swp_uffd_wp()==true, so it can be tested just like a swap pte
> that + *     contains a valid swap entry, so that we can check a swap pte
> always + *     using "is_swap_pte() && pte_swp_uffd_wp()" without caring
> about whether + *     there's one swap entry inside of the pte.
> + *
> + * (4) It should not be a valid swap pte anywhere, so that when we see this
> pte + *     we know it does not contain a swap entry.
> + *
> + * For x86, the simplest special pte which satisfies all of above should be
> the + * pte with only _PAGE_SWP_UFFD_WP bit set (where
> swp_type==swp_offset==0). + */
> +#define  UFFD_WP_SWP_PTE_SPECIAL  __pte(_PAGE_SWP_UFFD_WP)
> +
>  static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
>  {
>  	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
> diff --git a/include/asm-generic/pgtable_uffd.h
> b/include/asm-generic/pgtable_uffd.h index 828966d4c281..95e9811ce9d1
> 100644
> --- a/include/asm-generic/pgtable_uffd.h
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -2,6 +2,9 @@
>  #define _ASM_GENERIC_PGTABLE_UFFD_H
> 
>  #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +
> +#define  UFFD_WP_SWP_PTE_SPECIAL  __pte(0)
> +
>  static __always_inline int pte_uffd_wp(pte_t pte)
>  {
>  	return 0;
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 331d2ccf0bcc..93f932b53a71 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct
> *vma, extern void userfaultfd_unmap_complete(struct mm_struct *mm,
>  				       struct list_head *uf);
> 
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> +	WARN_ON_ONCE(vma_is_anonymous(vma));
> +	return UFFD_WP_SWP_PTE_SPECIAL;
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> +	return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
> +}
> +
>  #else /* CONFIG_USERFAULTFD */
> 
>  /* mm helpers */
> @@ -234,6 +245,16 @@ static inline void userfaultfd_unmap_complete(struct
> mm_struct *mm, {
>  }
> 
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> +	return __pte(0);
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> +	return false;
> +}
> +
>  #endif /* CONFIG_USERFAULTFD */
> 
>  #endif /* _LINUX_USERFAULTFD_K_H */
Peter Xu May 28, 2021, 12:56 p.m. UTC | #2
On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > This patch introduces a very special swap-like pte for file-backed memories.
> > 
> > Currently it's only defined for x86_64 only, but as long as any arch that
> > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> > should conceptually work too.
> > 
> > We will use this special pte to arm the ptes that got either unmapped or
> > swapped out for a file-backed region that was previously wr-protected.  This
> > special pte could trigger a page fault just like swap entries, and as long
> > as the page fault will satisfy pte_none()==false && pte_present()==false.
> > 
> > Then we can revive the special pte into a normal pte backed by the page
> > cache.
> > 
> > This idea is greatly inspired by Hugh and Andrea in the discussion, which is
> > referenced in the links below.
> > 
> > The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as
> > the special pte.  The current solution (as pointed out by Andrea) is
> > slightly preferred in that we don't even need swp_entry_t knowledge at all
> > in trapping these accesses.  Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> > from the anonymous swp entries.
> 
> So to confirm my understanding the reason you use this special swap pte 
> instead of a new swp_type is that you only need the fault and have no extra 
> information that needs storing in the pte?

Yes.

> 
> Personally I think it might be better to define a new swp_type for this rather 
> than introducing a new arch-specific concept.

The concept should not be arch-specific, it's the pte that's arch-specific.

> swp_type entries are portable so wouldn't need extra arch-specific bits
> defined. And as I understand things not all architectures (eg. ARM) have
> spare bits in their swap entry encoding anyway so would have to reserve a bit
> specifically for this which would be less efficient than using a swp_type.

It looks a trade-off to me: I think it's fine to use swap type in my series, as
you said it's portable, but it will also waste the swap address space for the
arch when the arch enables it.

The format of the special pte to trigger the fault in this series should be
only a small portion of the code change.  The main logic should still be the
same - we just replace this pte with that one.  IMHO it also means the format
can be changed in the future, it's just that I don't know whether it's wise to
take over a new swap type from start.

> 
> Anyway it seems I missed the initial discussion so don't have a strong opinion 
> here, mainly just wanted to check my understanding of what's required and how 
> these special entries work.

Thanks for mentioning this and join the discussion. I don't know ARM enough so
good to know we may have issue on finding the bits.  Actually before finding
this bit for file-backed uffd-wp specifically, we need to firstly find a bit in
the normal pte for ARM too anyways (see _PAGE_UFFD_WP).  If there's no strong
reason to switch to a new swap type, I'd tend to leave all these to the future
when we enable them on ARM.
Alistair Popple June 3, 2021, 11:53 a.m. UTC | #3
On Friday, 28 May 2021 10:56:02 PM AEST Peter Xu wrote:
> On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > > This patch introduces a very special swap-like pte for file-backed
> > > memories.
> > > 
> > > Currently it's only defined for x86_64 only, but as long as any arch
> > > that
> > > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> > > should conceptually work too.
> > > 
> > > We will use this special pte to arm the ptes that got either unmapped or
> > > swapped out for a file-backed region that was previously wr-protected. 
> > > This special pte could trigger a page fault just like swap entries, and
> > > as long as the page fault will satisfy pte_none()==false &&
> > > pte_present()==false.
> > > 
> > > Then we can revive the special pte into a normal pte backed by the page
> > > cache.
> > > 
> > > This idea is greatly inspired by Hugh and Andrea in the discussion,
> > > which is referenced in the links below.
> > > 
> > > The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0
> > > as
> > > the special pte.  The current solution (as pointed out by Andrea) is
> > > slightly preferred in that we don't even need swp_entry_t knowledge at
> > > all
> > > in trapping these accesses.  Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> > > from the anonymous swp entries.
> > 
> > So to confirm my understanding the reason you use this special swap pte
> > instead of a new swp_type is that you only need the fault and have no
> > extra
> > information that needs storing in the pte?
> 
> Yes.
> 
> > Personally I think it might be better to define a new swp_type for this
> > rather than introducing a new arch-specific concept.
> 
> The concept should not be arch-specific, it's the pte that's arch-specific.

Right, agree this is a minor detail.
 
> > swp_type entries are portable so wouldn't need extra arch-specific bits
> > defined. And as I understand things not all architectures (eg. ARM) have
> > spare bits in their swap entry encoding anyway so would have to reserve a
> > bit specifically for this which would be less efficient than using a
> > swp_type.
> It looks a trade-off to me: I think it's fine to use swap type in my series,
> as you said it's portable, but it will also waste the swap address space
> for the arch when the arch enables it.
> 
> The format of the special pte to trigger the fault in this series should be
> only a small portion of the code change.  The main logic should still be the
> same - we just replace this pte with that one.  IMHO it also means the
> format can be changed in the future, it's just that I don't know whether
> it's wise to take over a new swap type from start.
>
> > Anyway it seems I missed the initial discussion so don't have a strong
> > opinion here, mainly just wanted to check my understanding of what's
> > required and how these special entries work.
> 
> Thanks for mentioning this and join the discussion. I don't know ARM enough
> so good to know we may have issue on finding the bits.  Actually before
> finding this bit for file-backed uffd-wp specifically, we need to firstly
> find a bit in the normal pte for ARM too anyways (see _PAGE_UFFD_WP).  If
> there's no strong reason to switch to a new swap type, I'd tend to leave
> all these to the future when we enable them on ARM.

Yeah, makes sense to me. As you say it should be easy to change and other 
architectures need to find another bit anyway. Not sure how useful it will be 
but I'll try and take a look over the rest of the series as well.

> --
> Peter Xu
Peter Xu June 3, 2021, 2:51 p.m. UTC | #4
On Thu, Jun 03, 2021 at 09:53:45PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 10:56:02 PM AEST Peter Xu wrote:
> > On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> > > On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > > > This patch introduces a very special swap-like pte for file-backed
> > > > memories.
> > > > 
> > > > Currently it's only defined for x86_64 only, but as long as any arch
> > > > that
> > > > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> > > > should conceptually work too.
> > > > 
> > > > We will use this special pte to arm the ptes that got either unmapped or
> > > > swapped out for a file-backed region that was previously wr-protected. 
> > > > This special pte could trigger a page fault just like swap entries, and
> > > > as long as the page fault will satisfy pte_none()==false &&
> > > > pte_present()==false.
> > > > 
> > > > Then we can revive the special pte into a normal pte backed by the page
> > > > cache.
> > > > 
> > > > This idea is greatly inspired by Hugh and Andrea in the discussion,
> > > > which is referenced in the links below.
> > > > 
> > > > The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0
> > > > as
> > > > the special pte.  The current solution (as pointed out by Andrea) is
> > > > slightly preferred in that we don't even need swp_entry_t knowledge at
> > > > all
> > > > in trapping these accesses.  Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> > > > from the anonymous swp entries.
> > > 
> > > So to confirm my understanding the reason you use this special swap pte
> > > instead of a new swp_type is that you only need the fault and have no
> > > extra
> > > information that needs storing in the pte?
> > 
> > Yes.
> > 
> > > Personally I think it might be better to define a new swp_type for this
> > > rather than introducing a new arch-specific concept.
> > 
> > The concept should not be arch-specific, it's the pte that's arch-specific.
> 
> Right, agree this is a minor detail.

I can't say it's a minor detail, as that's still indeed one of the major ideas
that I'd like to get comment for within the whole series.  It's currently an
outcome from previous discussion with Andrea and Hugh, but of course if there's
better idea with reasoning I can always consider to rework the series.

>  
> > > swp_type entries are portable so wouldn't need extra arch-specific bits
> > > defined. And as I understand things not all architectures (eg. ARM) have
> > > spare bits in their swap entry encoding anyway so would have to reserve a
> > > bit specifically for this which would be less efficient than using a
> > > swp_type.
> > It looks a trade-off to me: I think it's fine to use swap type in my series,
> > as you said it's portable, but it will also waste the swap address space
> > for the arch when the arch enables it.
> > 
> > The format of the special pte to trigger the fault in this series should be
> > only a small portion of the code change.  The main logic should still be the
> > same - we just replace this pte with that one.  IMHO it also means the
> > format can be changed in the future, it's just that I don't know whether
> > it's wise to take over a new swap type from start.
> >
> > > Anyway it seems I missed the initial discussion so don't have a strong
> > > opinion here, mainly just wanted to check my understanding of what's
> > > required and how these special entries work.
> > 
> > Thanks for mentioning this and join the discussion. I don't know ARM enough
> > so good to know we may have issue on finding the bits.  Actually before
> > finding this bit for file-backed uffd-wp specifically, we need to firstly
> > find a bit in the normal pte for ARM too anyways (see _PAGE_UFFD_WP).  If
> > there's no strong reason to switch to a new swap type, I'd tend to leave
> > all these to the future when we enable them on ARM.
> 
> Yeah, makes sense to me. As you say it should be easy to change and other 
> architectures need to find another bit anyway. Not sure how useful it will be 
> but I'll try and take a look over the rest of the series as well.

I'll highly appreciate that.  Thanks Alistair!
Alistair Popple June 4, 2021, 12:55 a.m. UTC | #5
On Friday, 4 June 2021 12:51:19 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Thu, Jun 03, 2021 at 09:53:45PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 10:56:02 PM AEST Peter Xu wrote:
> > > On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> > > > On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > > > > This patch introduces a very special swap-like pte for file-backed
> > > > > memories.
> > > > > 
> > > > > Currently it's only defined for x86_64 only, but as long as any arch
> > > > > that
> > > > > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested,
> > > > > it
> > > > > should conceptually work too.
> > > > > 
> > > > > We will use this special pte to arm the ptes that got either
> > > > > unmapped or
> > > > > swapped out for a file-backed region that was previously
> > > > > wr-protected.
> > > > > This special pte could trigger a page fault just like swap entries,
> > > > > and
> > > > > as long as the page fault will satisfy pte_none()==false &&
> > > > > pte_present()==false.
> > > > > 
> > > > > Then we can revive the special pte into a normal pte backed by the
> > > > > page
> > > > > cache.
> > > > > 
> > > > > This idea is greatly inspired by Hugh and Andrea in the discussion,
> > > > > which is referenced in the links below.
> > > > > 
> > > > > The other idea (from Hugh) is that we use swp_type==1 and
> > > > > swp_offset=0
> > > > > as
> > > > > the special pte.  The current solution (as pointed out by Andrea) is
> > > > > slightly preferred in that we don't even need swp_entry_t knowledge
> > > > > at
> > > > > all
> > > > > in trapping these accesses.  Meanwhile, we also reuse
> > > > > _PAGE_SWP_UFFD_WP
> > > > > from the anonymous swp entries.
> > > > 
> > > > So to confirm my understanding the reason you use this special swap
> > > > pte
> > > > instead of a new swp_type is that you only need the fault and have no
> > > > extra
> > > > information that needs storing in the pte?
> > > 
> > > Yes.
> > > 
> > > > Personally I think it might be better to define a new swp_type for
> > > > this
> > > > rather than introducing a new arch-specific concept.
> > > 
> > > The concept should not be arch-specific, it's the pte that's
> > > arch-specific.
> > 
> > Right, agree this is a minor detail.
> 
> I can't say it's a minor detail, as that's still indeed one of the major
> ideas that I'd like to get comment for within the whole series.  It's
> currently an outcome from previous discussion with Andrea and Hugh, but of
> course if there's better idea with reasoning I can always consider to
> rework the series.

Sorry, I wasn't very clear there. What I meant is the high level arch-
independent concept of using a special swap pte for this is the most important 
aspect of the design and looks good to me.

The detail which is perhaps less important is whether to implement this using 
a new swap entry type or arch-specific swap bit. The argument for using a swap 
type is it will work across architectures due to the use of pte_to_swp_entry() 
and swp_entry_to_pte() to convert to and from the arch-dependent and 
independent representations.

The argument against seems to have been that it is wasting a swap type. 
However if I'm understanding correctly that's not true for all architectures, 
and needing to reserve a bit is more wasteful than using a swap type. For 
example ARM encodes swap entries like so:

 * Encode and decode a swap entry.  Swap entries are stored in the Linux
 * page tables as follows:
 *
 *   3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
 *   1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
 *   <--------------- offset ------------------------> < type -> 0 0

So the only way to get a spare bit is to reduce the width of type (or offset) 
which would halve the number of swap types. And if I understand correctly the 
same argument might apply to x86 - the spare bit being used here could instead 
be used to expand the width of type if a lack of available swap types is a 
concern.

> > > > swp_type entries are portable so wouldn't need extra arch-specific
> > > > bits
> > > > defined. And as I understand things not all architectures (eg. ARM)
> > > > have
> > > > spare bits in their swap entry encoding anyway so would have to
> > > > reserve a
> > > > bit specifically for this which would be less efficient than using a
> > > > swp_type.
> > > 
> > > It looks a trade-off to me: I think it's fine to use swap type in my
> > > series, as you said it's portable, but it will also waste the swap
> > > address space for the arch when the arch enables it.
> > > 
> > > The format of the special pte to trigger the fault in this series should
> > > be
> > > only a small portion of the code change.  The main logic should still be
> > > the same - we just replace this pte with that one.  IMHO it also means
> > > the format can be changed in the future, it's just that I don't know
> > > whether it's wise to take over a new swap type from start.
> > > 
> > > > Anyway it seems I missed the initial discussion so don't have a strong
> > > > opinion here, mainly just wanted to check my understanding of what's
> > > > required and how these special entries work.
> > > 
> > > Thanks for mentioning this and join the discussion. I don't know ARM
> > > enough
> > > so good to know we may have issue on finding the bits.  Actually before
> > > finding this bit for file-backed uffd-wp specifically, we need to
> > > firstly
> > > find a bit in the normal pte for ARM too anyways (see _PAGE_UFFD_WP). 
> > > If
> > > there's no strong reason to switch to a new swap type, I'd tend to leave
> > > all these to the future when we enable them on ARM.
> > 
> > Yeah, makes sense to me. As you say it should be easy to change and other
> > architectures need to find another bit anyway. Not sure how useful it will
> > be but I'll try and take a look over the rest of the series as well.
> 
> I'll highly appreciate that.  Thanks Alistair!
> 
> --
> Peter Xu
Hugh Dickins June 4, 2021, 3:14 a.m. UTC | #6
On Fri, 4 Jun 2021, Alistair Popple wrote:
> 
> The detail which is perhaps less important is whether to implement this using 
> a new swap entry type or arch-specific swap bit. The argument for using a swap 
> type is it will work across architectures due to the use of pte_to_swp_entry() 
> and swp_entry_to_pte() to convert to and from the arch-dependent and 
> independent representations.
> 
> The argument against seems to have been that it is wasting a swap type. 
> However if I'm understanding correctly that's not true for all architectures, 
> and needing to reserve a bit is more wasteful than using a swap type.

I'm on the outside, not paying much attention here,
but thought Peter would have cleared this up already.

My understanding is that it does *not* use an additional arch-dependent
bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
implementing UFFD WP) to an additional use.  That's why I called this
design (from Andrea) more elegant than mine (swap type business).

If I've got that wrong, and yet another arch-dependent bit is needed,
then I very much agree with you: finding arch-dependent pte bits is a
much tougher job than another play with swap type.

(And "more elegant" might not be "easier to understand": you decide.)

Hugh
Alistair Popple June 4, 2021, 6:16 a.m. UTC | #7
On Friday, 4 June 2021 1:14:31 PM AEST Hugh Dickins wrote:
> On Fri, 4 Jun 2021, Alistair Popple wrote:
> >
> > The detail which is perhaps less important is whether to implement this 
using
> > a new swap entry type or arch-specific swap bit. The argument for using a 
swap
> > type is it will work across architectures due to the use of 
pte_to_swp_entry()
> > and swp_entry_to_pte() to convert to and from the arch-dependent and
> > independent representations.
> >
> > The argument against seems to have been that it is wasting a swap type.
> > However if I'm understanding correctly that's not true for all 
architectures,
> > and needing to reserve a bit is more wasteful than using a swap type.
> 
> I'm on the outside, not paying much attention here,
> but thought Peter would have cleared this up already.
> 
> My understanding is that it does *not* use an additional arch-dependent
> bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
> implementing UFFD WP) to an additional use.  That's why I called this
> design (from Andrea) more elegant than mine (swap type business).

Oh my bad, I had somehow missed this was reusing an *existing* arch-dependent 
swap bit (_PAGE_SWP_UFFD_WP, although the same argument could apply) even 
though it's in the commit message. Obviously I should have read that more 
carefully, apologies for the noise but thanks for the clarification.

> If I've got that wrong, and yet another arch-dependent bit is needed,
> then I very much agree with you: finding arch-dependent pte bits is a
> much tougher job than another play with swap type.
> 
> (And "more elegant" might not be "easier to understand": you decide.)

Agree, that's a somewhat subjective debate. Conceptually I don't think this is 
particularly difficult to understand. It just adds another slightly different 
class of special swap pte's to know about.

 - Alistair

> Hugh
Peter Xu June 4, 2021, 4:01 p.m. UTC | #8
On Fri, Jun 04, 2021 at 04:16:30PM +1000, Alistair Popple wrote:
> > My understanding is that it does *not* use an additional arch-dependent
> > bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
> > implementing UFFD WP) to an additional use.  That's why I called this
> > design (from Andrea) more elegant than mine (swap type business).
> 
> Oh my bad, I had somehow missed this was reusing an *existing* arch-dependent 
> swap bit (_PAGE_SWP_UFFD_WP, although the same argument could apply) even 
> though it's in the commit message. Obviously I should have read that more 
> carefully, apologies for the noise but thanks for the clarification.

Right, as Hugh mentioned what this series wanted to use is one explicit pte
that no one should ever be using, so ideally that should be the most saving way
per address-space pov.

Meanwhile I think that pte can actually be not related to _PAGE_UFFD_WP at all,
as long as it's a specific pte value then it will service the same goal (even
if to reuse a new swp type, I'll probably only use one pte for it and leave the
rest for other use; but who knows who will start to use the rest!).

I kept using it because that's suggested by Andrea (it actually has
type==off==0 as Hugh suggested too - so it keeps a suggestion of both!) and
it's a good idea to use it since (1) it's never used by anyone before, and (2)
it is _somehow_ related to uffd-wp itself already by having that specific bit
set in the special pte, while that's also the only bit set for the u64 field.

It looks very nice too when debug, because when I dump the ptes it reads 0x4 on
x86.. so the pte value is even easy to read as a number. :)

However I can see that it is less easy to follow than the swap type solution.
In all cases it's still something worth thinking about before using up the swap
types - it's not so rich there, and we keep shrinking MAX_SWAPFILES.. so let's
see whether uffd-wp could be the 1st one to open a new field for unused
"invalid/swap pte" address space.

Meanwhile, I did have a look at ARM on supporting uffd-wp in general, starting
from anonymous pages.  I doubt whether it can be done for old arms (uffd-wp not
even supported on 32bit x86 after all), but for ARM64 I see it has:

For normal ptes:

/*
 * Level 3 descriptor (PTE).
 */
#define PTE_VALID		(_AT(pteval_t, 1) << 0)
#define PTE_TYPE_MASK		(_AT(pteval_t, 3) << 0)
#define PTE_TYPE_PAGE		(_AT(pteval_t, 3) << 0)
#define PTE_TABLE_BIT		(_AT(pteval_t, 1) << 1)
#define PTE_USER		(_AT(pteval_t, 1) << 6)		/* AP[1] */
#define PTE_RDONLY		(_AT(pteval_t, 1) << 7)		/* AP[2] */
#define PTE_SHARED		(_AT(pteval_t, 3) << 8)		/* SH[1:0], inner shareable */
#define PTE_AF			(_AT(pteval_t, 1) << 10)	/* Access Flag */
#define PTE_NG			(_AT(pteval_t, 1) << 11)	/* nG */
#define PTE_GP			(_AT(pteval_t, 1) << 50)	/* BTI guarded */
#define PTE_DBM			(_AT(pteval_t, 1) << 51)	/* Dirty Bit Management */
#define PTE_CONT		(_AT(pteval_t, 1) << 52)	/* Contiguous range */
#define PTE_PXN			(_AT(pteval_t, 1) << 53)	/* Privileged XN */
#define PTE_UXN			(_AT(pteval_t, 1) << 54)	/* User XN */

For swap ptes:

/*
 * Encode and decode a swap entry:
 *	bits 0-1:	present (must be zero)
 *	bits 2-7:	swap type
 *	bits 8-57:	swap offset
 *	bit  58:	PTE_PROT_NONE (must be zero)
 */

So I feel like we still have chance there at least for 64bit ARM? As both
normal/swap ptes have some bits free (bits 2-5,9 for normal ptes; bits 59-63
for swap ptes).  But as I know little on ARM64, I hope I looked at the right
things..

Thanks,
Alistair Popple June 8, 2021, 1:18 p.m. UTC | #9
On Saturday, 5 June 2021 2:01:59 AM AEST Peter Xu wrote:
> On Fri, Jun 04, 2021 at 04:16:30PM +1000, Alistair Popple wrote:
> > > My understanding is that it does *not* use an additional arch-dependent
> > > bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
> > > implementing UFFD WP) to an additional use.  That's why I called this
> > > design (from Andrea) more elegant than mine (swap type business).
> >
> > Oh my bad, I had somehow missed this was reusing an *existing* arch-dependent
> > swap bit (_PAGE_SWP_UFFD_WP, although the same argument could apply) even
> > though it's in the commit message. Obviously I should have read that more
> > carefully, apologies for the noise but thanks for the clarification.
> 
> Right, as Hugh mentioned what this series wanted to use is one explicit pte
> that no one should ever be using, so ideally that should be the most saving way
> per address-space pov.
> 
> Meanwhile I think that pte can actually be not related to _PAGE_UFFD_WP at all,
> as long as it's a specific pte value then it will service the same goal (even
> if to reuse a new swp type, I'll probably only use one pte for it and leave the
> rest for other use; but who knows who will start to use the rest!).
> 
> I kept using it because that's suggested by Andrea (it actually has
> type==off==0 as Hugh suggested too - so it keeps a suggestion of both!) and
> it's a good idea to use it since (1) it's never used by anyone before, and (2)
> it is _somehow_ related to uffd-wp itself already by having that specific bit
> set in the special pte, while that's also the only bit set for the u64 field.
> 
> It looks very nice too when debug, because when I dump the ptes it reads 0x4 on
> x86.. so the pte value is even easy to read as a number. :)
> 
> However I can see that it is less easy to follow than the swap type solution.
> In all cases it's still something worth thinking about before using up the swap
> types - it's not so rich there, and we keep shrinking MAX_SWAPFILES.. so let's
> see whether uffd-wp could be the 1st one to open a new field for unused
> "invalid/swap pte" address space.

Agreed, that matches with what I was thinking as well. If we do end up having
more swap types such as this which don't need to store much information in
the swap pte itself we could define a special swap type (eg. this bit) for
that.

> Meanwhile, I did have a look at ARM on supporting uffd-wp in general, starting
> from anonymous pages.  I doubt whether it can be done for old arms (uffd-wp not
> even supported on 32bit x86 after all), but for ARM64 I see it has:
> 
> For normal ptes:
> 
> /*
>  * Level 3 descriptor (PTE).
>  */
> #define PTE_VALID               (_AT(pteval_t, 1) << 0)
> #define PTE_TYPE_MASK           (_AT(pteval_t, 3) << 0)
> #define PTE_TYPE_PAGE           (_AT(pteval_t, 3) << 0)
> #define PTE_TABLE_BIT           (_AT(pteval_t, 1) << 1)
> #define PTE_USER                (_AT(pteval_t, 1) << 6)         /* AP[1] */
> #define PTE_RDONLY              (_AT(pteval_t, 1) << 7)         /* AP[2] */
> #define PTE_SHARED              (_AT(pteval_t, 3) << 8)         /* SH[1:0], inner shareable */
> #define PTE_AF                  (_AT(pteval_t, 1) << 10)        /* Access Flag */
> #define PTE_NG                  (_AT(pteval_t, 1) << 11)        /* nG */
> #define PTE_GP                  (_AT(pteval_t, 1) << 50)        /* BTI guarded */
> #define PTE_DBM                 (_AT(pteval_t, 1) << 51)        /* Dirty Bit Management */
> #define PTE_CONT                (_AT(pteval_t, 1) << 52)        /* Contiguous range */
> #define PTE_PXN                 (_AT(pteval_t, 1) << 53)        /* Privileged XN */
> #define PTE_UXN                 (_AT(pteval_t, 1) << 54)        /* User XN */
> 
> For swap ptes:
> 
> /*
>  * Encode and decode a swap entry:
>  *      bits 0-1:       present (must be zero)
>  *      bits 2-7:       swap type
>  *      bits 8-57:      swap offset
>  *      bit  58:        PTE_PROT_NONE (must be zero)
>  */
> 
> So I feel like we still have chance there at least for 64bit ARM? As both
> normal/swap ptes have some bits free (bits 2-5,9 for normal ptes; bits 59-63
> for swap ptes).  But as I know little on ARM64, I hope I looked at the right
> things..

I don't claim to be an expert there either. Given there's already a bit
defined for x86 anyway (which is what I missed) I now think the special
swap idea is ok, although I still need to look at the rest of the series.

> Thanks,
> 
> --
> Peter Xu
>
Alistair Popple June 9, 2021, 1:06 p.m. UTC | #10
On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:

[...]

> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> index 828966d4c281..95e9811ce9d1 100644
> --- a/include/asm-generic/pgtable_uffd.h
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -2,6 +2,9 @@
>  #define _ASM_GENERIC_PGTABLE_UFFD_H
>  
>  #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +
> +#define  UFFD_WP_SWP_PTE_SPECIAL  __pte(0)
> +
>  static __always_inline int pte_uffd_wp(pte_t pte)
>  {
>  	return 0;
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 331d2ccf0bcc..93f932b53a71 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
>  extern void userfaultfd_unmap_complete(struct mm_struct *mm,
>  				       struct list_head *uf);
>  
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> +	WARN_ON_ONCE(vma_is_anonymous(vma));
> +	return UFFD_WP_SWP_PTE_SPECIAL;
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> +	return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
> +}
> +

Sorry, only just noticed this but do we need to define a different version of
this helper that returns false for CONFIG_HAVE_ARCH_USERFAULTFD_WP=n to avoid
spurious matches with __pte(0) on architectures supporting userfaultfd but not
userfaultfd-wp?

>  #else /* CONFIG_USERFAULTFD */
>  
>  /* mm helpers */
> @@ -234,6 +245,16 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
>  {
>  }
>  
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> +	return __pte(0);
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> +	return false;
> +}
> +
>  #endif /* CONFIG_USERFAULTFD */
>  
>  #endif /* _LINUX_USERFAULTFD_K_H */
>
Peter Xu June 9, 2021, 2:43 p.m. UTC | #11
On Wed, Jun 09, 2021 at 11:06:32PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> 
> [...]
> 
> > diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> > index 828966d4c281..95e9811ce9d1 100644
> > --- a/include/asm-generic/pgtable_uffd.h
> > +++ b/include/asm-generic/pgtable_uffd.h
> > @@ -2,6 +2,9 @@
> >  #define _ASM_GENERIC_PGTABLE_UFFD_H
> >  
> >  #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> > +
> > +#define  UFFD_WP_SWP_PTE_SPECIAL  __pte(0)
> > +
> >  static __always_inline int pte_uffd_wp(pte_t pte)
> >  {
> >  	return 0;
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 331d2ccf0bcc..93f932b53a71 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
> >  extern void userfaultfd_unmap_complete(struct mm_struct *mm,
> >  				       struct list_head *uf);
> >  
> > +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> > +{
> > +	WARN_ON_ONCE(vma_is_anonymous(vma));
> > +	return UFFD_WP_SWP_PTE_SPECIAL;
> > +}
> > +
> > +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> > +{
> > +	return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
> > +}
> > +
> 
> Sorry, only just noticed this but do we need to define a different version of
> this helper that returns false for CONFIG_HAVE_ARCH_USERFAULTFD_WP=n to avoid
> spurious matches with __pte(0) on architectures supporting userfaultfd but not
> userfaultfd-wp?

Good point.. Yes we definitely don't want the empty pte to be recognized as the
special pte..  I'll squash below into the same patch:

----8<----
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 489fb375e66c..23ca449240d1 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -177,7 +177,11 @@ static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
 
 static inline bool pte_swp_uffd_wp_special(pte_t pte)
 {
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
        return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
+#else
+       return false;
+#fi
 }
 
 #else /* CONFIG_USERFAULTFD */
----8<----

I'll see whether I can give some dry run without HAVE_ARCH_USERFAULTFD_WP but
with USERFAULTFD.

Thanks for spotting that!
diff mbox series

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..9781ba2da049 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1329,6 +1329,34 @@  static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+/*
+ * This is a very special swap-like pte that marks this pte as "wr-protected"
+ * by userfaultfd-wp.  It should only exist for file-backed memory where the
+ * page (previously got wr-protected) has been unmapped or swapped out.
+ *
+ * For anonymous memories, the userfaultfd-wp _PAGE_SWP_UFFD_WP bit is kept
+ * along with a real swp entry instead.
+ *
+ * Let's make some rules for this special pte:
+ *
+ * (1) pte_none()==false, so that it'll not trigger a missing page fault.
+ *
+ * (2) pte_present()==false, so that it's recognized as swap (is_swap_pte).
+ *
+ * (3) pte_swp_uffd_wp()==true, so it can be tested just like a swap pte that
+ *     contains a valid swap entry, so that we can check a swap pte always
+ *     using "is_swap_pte() && pte_swp_uffd_wp()" without caring about whether
+ *     there's one swap entry inside of the pte.
+ *
+ * (4) It should not be a valid swap pte anywhere, so that when we see this pte
+ *     we know it does not contain a swap entry.
+ *
+ * For x86, the simplest special pte which satisfies all of above should be the
+ * pte with only _PAGE_SWP_UFFD_WP bit set (where swp_type==swp_offset==0).
+ */
+#define  UFFD_WP_SWP_PTE_SPECIAL  __pte(_PAGE_SWP_UFFD_WP)
+
 static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 828966d4c281..95e9811ce9d1 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,6 +2,9 @@ 
 #define _ASM_GENERIC_PGTABLE_UFFD_H
 
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+#define  UFFD_WP_SWP_PTE_SPECIAL  __pte(0)
+
 static __always_inline int pte_uffd_wp(pte_t pte)
 {
 	return 0;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 331d2ccf0bcc..93f932b53a71 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -145,6 +145,17 @@  extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
 
+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+	WARN_ON_ONCE(vma_is_anonymous(vma));
+	return UFFD_WP_SWP_PTE_SPECIAL;
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+	return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
+}
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -234,6 +245,16 @@  static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 {
 }
 
+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+	return __pte(0);
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */