diff mbox series

mm/page_table_check: Support userfault wr-protect entries

Message ID 20240415205259.2535077-1-peterx@redhat.com (mailing list archive)
State New
Headers show
Series mm/page_table_check: Support userfault wr-protect entries | expand

Commit Message

Peter Xu April 15, 2024, 8:52 p.m. UTC
Allow page_table_check hooks to check over userfaultfd wr-protect criteria
upon pgtable updates.  The rule is no co-existance allowed for any writable
flag against userfault wr-protect flag.

This should be better than c2da319c2e, where we used to only sanitize such
issues during a pgtable walk, but when hitting such issue we don't have a
good chance to know where does that writable bit came from [1], so that
even the pgtable walk exposes a kernel bug (which is still helpful on
triaging) but not easy to track and debug.

Now we switch to track the source.  It's much easier too with the recent
introduction of page table check.

There are some limitations with using the page table check here for
userfaultfd wr-protect purpose:

  - It is only enabled with explicit enablement of page table check configs
  and/or boot parameters, but should be good enough to track at least
  syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
  x86 [1].  We used to have DEBUG_VM but it's now off for most distros,
  while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which
  is similar.

  - It conditionally works with the ptep_modify_prot API.  It will be
  bypassed when e.g. XEN PV is enabled, however still work for most of the
  rest scenarios, which should be the common cases so should be good
  enough.

  - Hugetlb check is a bit hairy, as the page table check cannot identify
  hugetlb pte or normal pte via trapping at set_pte_at(), because of the
  current design where hugetlb maps every layers to pte_t... For example,
  the default set_huge_pte_at() can invoke set_pte_at() directly and lose
  the hugetlb context, treating it the same as a normal pte_t. So far it's
  fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as
  long as supported (x86 only).  It'll be a bigger problem when we'll
  define _PAGE_UFFD_WP differently at various pgtable levels, because then
  one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now
  we can leave this for later too.

This patch also removes commit c2da319c2e altogether, as we have something
better now.

[1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/

Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 18 +-----------------
 mm/page_table_check.c          | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 18 deletions(-)

Comments

Pasha Tatashin April 16, 2024, 9:34 p.m. UTC | #1
Hi Peter,

Thanks for this patch, I like this extra checking logic, my comments below:

On Mon, Apr 15, 2024 at 4:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> Allow page_table_check hooks to check over userfaultfd wr-protect criteria
> upon pgtable updates.  The rule is no co-existance allowed for any writable
> flag against userfault wr-protect flag.
>
> This should be better than c2da319c2e, where we used to only sanitize such
> issues during a pgtable walk, but when hitting such issue we don't have a
> good chance to know where does that writable bit came from [1], so that
> even the pgtable walk exposes a kernel bug (which is still helpful on
> triaging) but not easy to track and debug.
>
> Now we switch to track the source.  It's much easier too with the recent
> introduction of page table check.
>
> There are some limitations with using the page table check here for
> userfaultfd wr-protect purpose:
>
>   - It is only enabled with explicit enablement of page table check configs
>   and/or boot parameters, but should be good enough to track at least
>   syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
>   x86 [1].  We used to have DEBUG_VM but it's now off for most distros,
>   while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which
>   is similar.
>
>   - It conditionally works with the ptep_modify_prot API.  It will be
>   bypassed when e.g. XEN PV is enabled, however still work for most of the
>   rest scenarios, which should be the common cases so should be good
>   enough.
>
>   - Hugetlb check is a bit hairy, as the page table check cannot identify
>   hugetlb pte or normal pte via trapping at set_pte_at(), because of the
>   current design where hugetlb maps every layers to pte_t... For example,
>   the default set_huge_pte_at() can invoke set_pte_at() directly and lose
>   the hugetlb context, treating it the same as a normal pte_t. So far it's
>   fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as
>   long as supported (x86 only).  It'll be a bigger problem when we'll
>   define _PAGE_UFFD_WP differently at various pgtable levels, because then
>   one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now
>   we can leave this for later too.
>
> This patch also removes commit c2da319c2e altogether, as we have something
> better now.
>
> [1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/
>
> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/include/asm/pgtable.h | 18 +-----------------
>  mm/page_table_check.c          | 32 +++++++++++++++++++++++++++++++-

Please add the new logic to: Documentation/mm/page_table_check.rst

>  2 files changed, 32 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 273f7557218c..65b8e5bb902c 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -388,23 +388,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>  static inline int pte_uffd_wp(pte_t pte)
>  {
> -       bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
> -
> -#ifdef CONFIG_DEBUG_VM
> -       /*
> -        * Having write bit for wr-protect-marked present ptes is fatal,
> -        * because it means the uffd-wp bit will be ignored and write will
> -        * just go through.
> -        *
> -        * Use any chance of pgtable walking to verify this (e.g., when
> -        * page swapped out or being migrated for all purposes). It means
> -        * something is already wrong.  Tell the admin even before the
> -        * process crashes. We also nail it with wrong pgtable setup.
> -        */
> -       WARN_ON_ONCE(wp && pte_write(pte));
> -#endif
> -
> -       return wp;
> +       return pte_flags(pte) & _PAGE_UFFD_WP;
>  }
>
>  static inline pte_t pte_mkuffd_wp(pte_t pte)
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index af69c3c8f7c2..d4eb1212f0f5 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -7,6 +7,8 @@
>  #include <linux/kstrtox.h>
>  #include <linux/mm.h>
>  #include <linux/page_table_check.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>
>  #undef pr_fmt
>  #define pr_fmt(fmt)    "page_table_check: " fmt
> @@ -182,6 +184,23 @@ void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud)
>  }
>  EXPORT_SYMBOL(__page_table_check_pud_clear);
>
> +/* Whether the swap entry cached writable information */
> +static inline bool swap_cached_writable(swp_entry_t entry)
> +{
> +       unsigned type = swp_type(entry);
> +
> +       return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
> +           type == SWP_MIGRATION_WRITE;
> +}
> +
> +static inline void __page_table_check_pte(pte_t pte)

may be something like:
page_table_check_new_pte() ? Naming is starting to get confusing. The
idea for this function is to check the pte that is about to be set
into the page table.

> +{
> +       if (pte_present(pte) && pte_uffd_wp(pte))
> +               WARN_ON_ONCE(pte_write(pte));
> +       else if (is_swap_pte(pte) && pte_swp_uffd_wp(pte))
> +               WARN_ON_ONCE(swap_cached_writable(pte_to_swp_entry(pte)));
> +}
> +
>  void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>                 unsigned int nr)
>  {
> @@ -190,18 +209,29 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>         if (&init_mm == mm)
>                 return;
>
> -       for (i = 0; i < nr; i++)
> +       for (i = 0; i < nr; i++) {
> +               __page_table_check_pte(pte);

This should really be called only once after this loop.

>                 __page_table_check_pte_clear(mm, ptep_get(ptep + i));
> +       }
>         if (pte_user_accessible_page(pte))
>                 page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
>
> +static inline void __page_table_check_pmd(pmd_t pmd)

page_table_check_new_pmd() ?

> +{
> +       if (pmd_present(pmd) && pmd_uffd_wp(pmd))
> +               WARN_ON_ONCE(pmd_write(pmd));
> +       else if (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))
> +               WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
> +}
> +
>  void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
>  {
>         if (&init_mm == mm)
>                 return;
>
> +       __page_table_check_pmd(pmd);
>         __page_table_check_pmd_clear(mm, *pmdp);
>         if (pmd_user_accessible_page(pmd)) {
>                 page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,
> --
> 2.44.0
>

Thanks,
Pasha
Peter Xu April 17, 2024, 4:53 p.m. UTC | #2
On Tue, Apr 16, 2024 at 05:34:50PM -0400, Pasha Tatashin wrote:
> Hi Peter,

Pasha,

> 
> Thanks for this patch, I like this extra checking logic, my comments below:

Thanks for taking a look.

> 
> On Mon, Apr 15, 2024 at 4:53 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Allow page_table_check hooks to check over userfaultfd wr-protect criteria
> > upon pgtable updates.  The rule is no co-existance allowed for any writable
> > flag against userfault wr-protect flag.
> >
> > This should be better than c2da319c2e, where we used to only sanitize such
> > issues during a pgtable walk, but when hitting such issue we don't have a
> > good chance to know where does that writable bit came from [1], so that
> > even the pgtable walk exposes a kernel bug (which is still helpful on
> > triaging) but not easy to track and debug.
> >
> > Now we switch to track the source.  It's much easier too with the recent
> > introduction of page table check.
> >
> > There are some limitations with using the page table check here for
> > userfaultfd wr-protect purpose:
> >
> >   - It is only enabled with explicit enablement of page table check configs
> >   and/or boot parameters, but should be good enough to track at least
> >   syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
> >   x86 [1].  We used to have DEBUG_VM but it's now off for most distros,
> >   while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which
> >   is similar.
> >
> >   - It conditionally works with the ptep_modify_prot API.  It will be
> >   bypassed when e.g. XEN PV is enabled, however still work for most of the
> >   rest scenarios, which should be the common cases so should be good
> >   enough.
> >
> >   - Hugetlb check is a bit hairy, as the page table check cannot identify
> >   hugetlb pte or normal pte via trapping at set_pte_at(), because of the
> >   current design where hugetlb maps every layers to pte_t... For example,
> >   the default set_huge_pte_at() can invoke set_pte_at() directly and lose
> >   the hugetlb context, treating it the same as a normal pte_t. So far it's
> >   fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as
> >   long as supported (x86 only).  It'll be a bigger problem when we'll
> >   define _PAGE_UFFD_WP differently at various pgtable levels, because then
> >   one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now
> >   we can leave this for later too.
> >
> > This patch also removes commit c2da319c2e altogether, as we have something
> > better now.
> >
> > [1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/
> >
> > Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/include/asm/pgtable.h | 18 +-----------------
> >  mm/page_table_check.c          | 32 +++++++++++++++++++++++++++++++-
> 
> Please add the new logic to: Documentation/mm/page_table_check.rst

Will do.

> 
> >  2 files changed, 32 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index 273f7557218c..65b8e5bb902c 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -388,23 +388,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
> >  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> >  static inline int pte_uffd_wp(pte_t pte)
> >  {
> > -       bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
> > -
> > -#ifdef CONFIG_DEBUG_VM
> > -       /*
> > -        * Having write bit for wr-protect-marked present ptes is fatal,
> > -        * because it means the uffd-wp bit will be ignored and write will
> > -        * just go through.
> > -        *
> > -        * Use any chance of pgtable walking to verify this (e.g., when
> > -        * page swapped out or being migrated for all purposes). It means
> > -        * something is already wrong.  Tell the admin even before the
> > -        * process crashes. We also nail it with wrong pgtable setup.
> > -        */
> > -       WARN_ON_ONCE(wp && pte_write(pte));
> > -#endif
> > -
> > -       return wp;
> > +       return pte_flags(pte) & _PAGE_UFFD_WP;
> >  }
> >
> >  static inline pte_t pte_mkuffd_wp(pte_t pte)
> > diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> > index af69c3c8f7c2..d4eb1212f0f5 100644
> > --- a/mm/page_table_check.c
> > +++ b/mm/page_table_check.c
> > @@ -7,6 +7,8 @@
> >  #include <linux/kstrtox.h>
> >  #include <linux/mm.h>
> >  #include <linux/page_table_check.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >
> >  #undef pr_fmt
> >  #define pr_fmt(fmt)    "page_table_check: " fmt
> > @@ -182,6 +184,23 @@ void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud)
> >  }
> >  EXPORT_SYMBOL(__page_table_check_pud_clear);
> >
> > +/* Whether the swap entry cached writable information */
> > +static inline bool swap_cached_writable(swp_entry_t entry)
> > +{
> > +       unsigned type = swp_type(entry);
> > +
> > +       return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
> > +           type == SWP_MIGRATION_WRITE;
> > +}
> > +
> > +static inline void __page_table_check_pte(pte_t pte)
> 
> may be something like:
> page_table_check_new_pte() ? Naming is starting to get confusing. The
> idea for this function is to check the pte that is about to be set
> into the page table.

But then we keep __page_table_check_ptes_set() as is?

It feels more natural if we keep using those underscores if all the rest
does so.  The "_new" is also not matching with what you used to have as
"_set".  If you see that's how I carefully chose the current name, with the
hope to match everything..

No strong opinions on these, but let me know your final choice of such
name.  I'm happy to align that to your preference.

> 
> > +{
> > +       if (pte_present(pte) && pte_uffd_wp(pte))
> > +               WARN_ON_ONCE(pte_write(pte));
> > +       else if (is_swap_pte(pte) && pte_swp_uffd_wp(pte))
> > +               WARN_ON_ONCE(swap_cached_writable(pte_to_swp_entry(pte)));
> > +}
> > +
> >  void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> >                 unsigned int nr)
> >  {
> > @@ -190,18 +209,29 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> >         if (&init_mm == mm)
> >                 return;
> >
> > -       for (i = 0; i < nr; i++)
> > +       for (i = 0; i < nr; i++) {
> > +               __page_table_check_pte(pte);
> 
> This should really be called only once after this loop.

This is also my intention to keep it in the loop just to make it as generic
e.g. to have no assumption of "ignoring PFNs", and I didn't worry on perf
much as we'll read/write these ptes anyway, also because it's only enabled
for debugging kernels.

But I made it at least inaccurate by checking pte not *ptep..

How about I move it out, rename it to __page_table_check_pte_flags(pte)?

> 
> >                 __page_table_check_pte_clear(mm, ptep_get(ptep + i));
> > +       }
> >         if (pte_user_accessible_page(pte))
> >                 page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
> >  }
> >  EXPORT_SYMBOL(__page_table_check_ptes_set);
> >
> > +static inline void __page_table_check_pmd(pmd_t pmd)
> 
> page_table_check_new_pmd() ?

This is the same "careful choice" of mine on that, same reasoning as
above on the pte helpers. :)

But again, let me know your final decision on the namings of both.

Thanks,

> 
> > +{
> > +       if (pmd_present(pmd) && pmd_uffd_wp(pmd))
> > +               WARN_ON_ONCE(pmd_write(pmd));
> > +       else if (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))
> > +               WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
> > +}
> > +
> >  void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
> >  {
> >         if (&init_mm == mm)
> >                 return;
> >
> > +       __page_table_check_pmd(pmd);
> >         __page_table_check_pmd_clear(mm, *pmdp);
> >         if (pmd_user_accessible_page(pmd)) {
> >                 page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,
> > --
> > 2.44.0
> >
> 
> Thanks,
> Pasha
>
Pasha Tatashin April 17, 2024, 5:17 p.m. UTC | #3
On Wed, Apr 17, 2024 at 12:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 16, 2024 at 05:34:50PM -0400, Pasha Tatashin wrote:
> > Hi Peter,
>
> Pasha,
>
> >
> > Thanks for this patch, I like this extra checking logic, my comments below:
>
> Thanks for taking a look.
>
> >
> > On Mon, Apr 15, 2024 at 4:53 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Allow page_table_check hooks to check over userfaultfd wr-protect criteria
> > > upon pgtable updates.  The rule is no co-existance allowed for any writable
> > > flag against userfault wr-protect flag.
> > >
> > > This should be better than c2da319c2e, where we used to only sanitize such
> > > issues during a pgtable walk, but when hitting such issue we don't have a
> > > good chance to know where does that writable bit came from [1], so that
> > > even the pgtable walk exposes a kernel bug (which is still helpful on
> > > triaging) but not easy to track and debug.
> > >
> > > Now we switch to track the source.  It's much easier too with the recent
> > > introduction of page table check.
> > >
> > > There are some limitations with using the page table check here for
> > > userfaultfd wr-protect purpose:
> > >
> > >   - It is only enabled with explicit enablement of page table check configs
> > >   and/or boot parameters, but should be good enough to track at least
> > >   syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
> > >   x86 [1].  We used to have DEBUG_VM but it's now off for most distros,
> > >   while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which
> > >   is similar.
> > >
> > >   - It conditionally works with the ptep_modify_prot API.  It will be
> > >   bypassed when e.g. XEN PV is enabled, however still work for most of the
> > >   rest scenarios, which should be the common cases so should be good
> > >   enough.
> > >
> > >   - Hugetlb check is a bit hairy, as the page table check cannot identify
> > >   hugetlb pte or normal pte via trapping at set_pte_at(), because of the
> > >   current design where hugetlb maps every layers to pte_t... For example,
> > >   the default set_huge_pte_at() can invoke set_pte_at() directly and lose
> > >   the hugetlb context, treating it the same as a normal pte_t. So far it's
> > >   fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as
> > >   long as supported (x86 only).  It'll be a bigger problem when we'll
> > >   define _PAGE_UFFD_WP differently at various pgtable levels, because then
> > >   one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now
> > >   we can leave this for later too.
> > >
> > > This patch also removes commit c2da319c2e altogether, as we have something
> > > better now.
> > >
> > > [1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/
> > >
> > > Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  arch/x86/include/asm/pgtable.h | 18 +-----------------
> > >  mm/page_table_check.c          | 32 +++++++++++++++++++++++++++++++-
> >
> > Please add the new logic to: Documentation/mm/page_table_check.rst
>
> Will do.
>
> >
> > >  2 files changed, 32 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > > index 273f7557218c..65b8e5bb902c 100644
> > > --- a/arch/x86/include/asm/pgtable.h
> > > +++ b/arch/x86/include/asm/pgtable.h
> > > @@ -388,23 +388,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
> > >  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> > >  static inline int pte_uffd_wp(pte_t pte)
> > >  {
> > > -       bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
> > > -
> > > -#ifdef CONFIG_DEBUG_VM
> > > -       /*
> > > -        * Having write bit for wr-protect-marked present ptes is fatal,
> > > -        * because it means the uffd-wp bit will be ignored and write will
> > > -        * just go through.
> > > -        *
> > > -        * Use any chance of pgtable walking to verify this (e.g., when
> > > -        * page swapped out or being migrated for all purposes). It means
> > > -        * something is already wrong.  Tell the admin even before the
> > > -        * process crashes. We also nail it with wrong pgtable setup.
> > > -        */
> > > -       WARN_ON_ONCE(wp && pte_write(pte));
> > > -#endif
> > > -
> > > -       return wp;
> > > +       return pte_flags(pte) & _PAGE_UFFD_WP;
> > >  }
> > >
> > >  static inline pte_t pte_mkuffd_wp(pte_t pte)
> > > diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> > > index af69c3c8f7c2..d4eb1212f0f5 100644
> > > --- a/mm/page_table_check.c
> > > +++ b/mm/page_table_check.c
> > > @@ -7,6 +7,8 @@
> > >  #include <linux/kstrtox.h>
> > >  #include <linux/mm.h>
> > >  #include <linux/page_table_check.h>
> > > +#include <linux/swap.h>
> > > +#include <linux/swapops.h>
> > >
> > >  #undef pr_fmt
> > >  #define pr_fmt(fmt)    "page_table_check: " fmt
> > > @@ -182,6 +184,23 @@ void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud)
> > >  }
> > >  EXPORT_SYMBOL(__page_table_check_pud_clear);
> > >
> > > +/* Whether the swap entry cached writable information */
> > > +static inline bool swap_cached_writable(swp_entry_t entry)
> > > +{
> > > +       unsigned type = swp_type(entry);
> > > +
> > > +       return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
> > > +           type == SWP_MIGRATION_WRITE;
> > > +}
> > > +
> > > +static inline void __page_table_check_pte(pte_t pte)
> >
> > may be something like:
> > page_table_check_new_pte() ? Naming is starting to get confusing. The
> > idea for this function is to check the pte that is about to be set
> > into the page table.
>
> But then we keep __page_table_check_ptes_set() as is?
>
> It feels more natural if we keep using those underscores if all the rest
> does so.  The "_new" is also not matching with what you used to have as

In mm/page_table_check.c, function names with an underscore prefix are
intended for global symbols with internal use only. All local
functions, such as page_table_check_set() and
page_table_check_clear(), do not have this prefix as we do not pollute
the global namespace.

> "_set".  If you see that's how I carefully chose the current name, with the
> hope to match everything..
>
> No strong opinions on these, but let me know your final choice of such
> name.  I'm happy to align that to your preference.
>
> >
> > > +{
> > > +       if (pte_present(pte) && pte_uffd_wp(pte))
> > > +               WARN_ON_ONCE(pte_write(pte));
> > > +       else if (is_swap_pte(pte) && pte_swp_uffd_wp(pte))
> > > +               WARN_ON_ONCE(swap_cached_writable(pte_to_swp_entry(pte)));
> > > +}
> > > +
> > >  void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> > >                 unsigned int nr)
> > >  {
> > > @@ -190,18 +209,29 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> > >         if (&init_mm == mm)
> > >                 return;
> > >
> > > -       for (i = 0; i < nr; i++)
> > > +       for (i = 0; i < nr; i++) {
> > > +               __page_table_check_pte(pte);
> >
> > This should really be called only once after this loop.
>
> This is also my intention to keep it in the loop just to make it as generic
> e.g. to have no assumption of "ignoring PFNs", and I didn't worry on perf
> much as we'll read/write these ptes anyway, also because it's only enabled
> for debugging kernels.
>
> But I made it at least inaccurate by checking pte not *ptep..
> How about I move it out, rename it to __page_table_check_pte_flags(pte)?

Sounds good. I like:
 page_table_check_pte_flags()
 page_table_check_pmd_flags()

Pasha
Pasha Tatashin April 17, 2024, 6:51 p.m. UTC | #4
On Wed, Apr 17, 2024 at 1:17 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Apr 17, 2024 at 12:53 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Apr 16, 2024 at 05:34:50PM -0400, Pasha Tatashin wrote:
> > > Hi Peter,
> >
> > Pasha,
> >
> > >
> > > Thanks for this patch, I like this extra checking logic, my comments below:
> >
> > Thanks for taking a look.
> >
> > >
> > > On Mon, Apr 15, 2024 at 4:53 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Allow page_table_check hooks to check over userfaultfd wr-protect criteria
> > > > upon pgtable updates.  The rule is no co-existance allowed for any writable
> > > > flag against userfault wr-protect flag.
> > > >
> > > > This should be better than c2da319c2e, where we used to only sanitize such
> > > > issues during a pgtable walk, but when hitting such issue we don't have a
> > > > good chance to know where does that writable bit came from [1], so that
> > > > even the pgtable walk exposes a kernel bug (which is still helpful on
> > > > triaging) but not easy to track and debug.
> > > >
> > > > Now we switch to track the source.  It's much easier too with the recent
> > > > introduction of page table check.
> > > >
> > > > There are some limitations with using the page table check here for
> > > > userfaultfd wr-protect purpose:
> > > >
> > > >   - It is only enabled with explicit enablement of page table check configs
> > > >   and/or boot parameters, but should be good enough to track at least
> > > >   syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
> > > >   x86 [1].  We used to have DEBUG_VM but it's now off for most distros,
> > > >   while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which
> > > >   is similar.
> > > >
> > > >   - It conditionally works with the ptep_modify_prot API.  It will be
> > > >   bypassed when e.g. XEN PV is enabled, however still work for most of the
> > > >   rest scenarios, which should be the common cases so should be good
> > > >   enough.
> > > >
> > > >   - Hugetlb check is a bit hairy, as the page table check cannot identify
> > > >   hugetlb pte or normal pte via trapping at set_pte_at(), because of the
> > > >   current design where hugetlb maps every layers to pte_t... For example,
> > > >   the default set_huge_pte_at() can invoke set_pte_at() directly and lose
> > > >   the hugetlb context, treating it the same as a normal pte_t. So far it's
> > > >   fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as
> > > >   long as supported (x86 only).  It'll be a bigger problem when we'll
> > > >   define _PAGE_UFFD_WP differently at various pgtable levels, because then
> > > >   one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now
> > > >   we can leave this for later too.
> > > >
> > > > This patch also removes commit c2da319c2e altogether, as we have something
> > > > better now.
> > > >
> > > > [1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/
> > > >
> > > > Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  arch/x86/include/asm/pgtable.h | 18 +-----------------
> > > >  mm/page_table_check.c          | 32 +++++++++++++++++++++++++++++++-
> > >
> > > Please add the new logic to: Documentation/mm/page_table_check.rst
> >
> > Will do.
> >
> > >
> > > >  2 files changed, 32 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > > > index 273f7557218c..65b8e5bb902c 100644
> > > > --- a/arch/x86/include/asm/pgtable.h
> > > > +++ b/arch/x86/include/asm/pgtable.h
> > > > @@ -388,23 +388,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
> > > >  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> > > >  static inline int pte_uffd_wp(pte_t pte)
> > > >  {
> > > > -       bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
> > > > -
> > > > -#ifdef CONFIG_DEBUG_VM
> > > > -       /*
> > > > -        * Having write bit for wr-protect-marked present ptes is fatal,
> > > > -        * because it means the uffd-wp bit will be ignored and write will
> > > > -        * just go through.
> > > > -        *
> > > > -        * Use any chance of pgtable walking to verify this (e.g., when
> > > > -        * page swapped out or being migrated for all purposes). It means
> > > > -        * something is already wrong.  Tell the admin even before the
> > > > -        * process crashes. We also nail it with wrong pgtable setup.
> > > > -        */
> > > > -       WARN_ON_ONCE(wp && pte_write(pte));
> > > > -#endif
> > > > -
> > > > -       return wp;
> > > > +       return pte_flags(pte) & _PAGE_UFFD_WP;
> > > >  }
> > > >
> > > >  static inline pte_t pte_mkuffd_wp(pte_t pte)
> > > > diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> > > > index af69c3c8f7c2..d4eb1212f0f5 100644
> > > > --- a/mm/page_table_check.c
> > > > +++ b/mm/page_table_check.c
> > > > @@ -7,6 +7,8 @@
> > > >  #include <linux/kstrtox.h>
> > > >  #include <linux/mm.h>
> > > >  #include <linux/page_table_check.h>
> > > > +#include <linux/swap.h>
> > > > +#include <linux/swapops.h>
> > > >
> > > >  #undef pr_fmt
> > > >  #define pr_fmt(fmt)    "page_table_check: " fmt
> > > > @@ -182,6 +184,23 @@ void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud)
> > > >  }
> > > >  EXPORT_SYMBOL(__page_table_check_pud_clear);
> > > >
> > > > +/* Whether the swap entry cached writable information */
> > > > +static inline bool swap_cached_writable(swp_entry_t entry)
> > > > +{
> > > > +       unsigned type = swp_type(entry);
> > > > +
> > > > +       return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
> > > > +           type == SWP_MIGRATION_WRITE;

This breaks linux-next build:

mm/page_table_check.c: In function ‘swap_cached_writable’:
mm/page_table_check.c:192:24: error: ‘SWP_DEVICE_EXCLUSIVE_WRITE’
undeclared (first use in this function)
  192 |         return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page_table_check.c:192:24: note: each undeclared identifier is
reported only once for each function it appears in
mm/page_table_check.c:194:1: error: control reaches end of non-void
function [-Werror=return-type]
  194 | }

Looks like there is a dependence on CONFIG_DEVICE_PRIVATE.

Pasha
Peter Xu April 17, 2024, 6:55 p.m. UTC | #5
On Wed, Apr 17, 2024 at 02:51:48PM -0400, Pasha Tatashin wrote:
> This breaks linux-next build:
> 
> mm/page_table_check.c: In function ‘swap_cached_writable’:
> mm/page_table_check.c:192:24: error: ‘SWP_DEVICE_EXCLUSIVE_WRITE’
> undeclared (first use in this function)
>   192 |         return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
>       |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/page_table_check.c:192:24: note: each undeclared identifier is
> reported only once for each function it appears in
> mm/page_table_check.c:194:1: error: control reaches end of non-void
> function [-Werror=return-type]
>   194 | }
> 
> Looks like there is a dependence on CONFIG_DEVICE_PRIVATE.

Right, same to SWP_MIGRATION_WRITE.

I've already sent a v2 with everything fixed, please have a look.  I also
added one more device private entry type that I overlooked.

Thanks,
diff mbox series

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 273f7557218c..65b8e5bb902c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -388,23 +388,7 @@  static inline pte_t pte_wrprotect(pte_t pte)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
-	bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
-
-#ifdef CONFIG_DEBUG_VM
-	/*
-	 * Having write bit for wr-protect-marked present ptes is fatal,
-	 * because it means the uffd-wp bit will be ignored and write will
-	 * just go through.
-	 *
-	 * Use any chance of pgtable walking to verify this (e.g., when
-	 * page swapped out or being migrated for all purposes). It means
-	 * something is already wrong.  Tell the admin even before the
-	 * process crashes. We also nail it with wrong pgtable setup.
-	 */
-	WARN_ON_ONCE(wp && pte_write(pte));
-#endif
-
-	return wp;
+	return pte_flags(pte) & _PAGE_UFFD_WP;
 }
 
 static inline pte_t pte_mkuffd_wp(pte_t pte)
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index af69c3c8f7c2..d4eb1212f0f5 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -7,6 +7,8 @@ 
 #include <linux/kstrtox.h>
 #include <linux/mm.h>
 #include <linux/page_table_check.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #undef pr_fmt
 #define pr_fmt(fmt)	"page_table_check: " fmt
@@ -182,6 +184,23 @@  void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud)
 }
 EXPORT_SYMBOL(__page_table_check_pud_clear);
 
+/* Whether the swap entry cached writable information */
+static inline bool swap_cached_writable(swp_entry_t entry)
+{
+	unsigned type = swp_type(entry);
+
+	return type == SWP_DEVICE_EXCLUSIVE_WRITE ||
+	    type == SWP_MIGRATION_WRITE;
+}
+
+static inline void __page_table_check_pte(pte_t pte)
+{
+	if (pte_present(pte) && pte_uffd_wp(pte))
+		WARN_ON_ONCE(pte_write(pte));
+	else if (is_swap_pte(pte) && pte_swp_uffd_wp(pte))
+		WARN_ON_ONCE(swap_cached_writable(pte_to_swp_entry(pte)));
+}
+
 void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
 		unsigned int nr)
 {
@@ -190,18 +209,29 @@  void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
 	if (&init_mm == mm)
 		return;
 
-	for (i = 0; i < nr; i++)
+	for (i = 0; i < nr; i++) {
+		__page_table_check_pte(pte);
 		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
+	}
 	if (pte_user_accessible_page(pte))
 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);
 
+static inline void __page_table_check_pmd(pmd_t pmd)
+{
+	if (pmd_present(pmd) && pmd_uffd_wp(pmd))
+		WARN_ON_ONCE(pmd_write(pmd));
+	else if (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))
+		WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
+}
+
 void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
 {
 	if (&init_mm == mm)
 		return;
 
+	__page_table_check_pmd(pmd);
 	__page_table_check_pmd_clear(mm, *pmdp);
 	if (pmd_user_accessible_page(pmd)) {
 		page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,