[v5,05/26] mm/swap: Introduce the idea of special swap ptes

Message ID 20210715201422.211004-6-peterx@redhat.com (mailing list archive)
State New
Headers show
Series userfaultfd-wp: Support shmem and hugetlbfs

Commit Message

Peter Xu July 15, 2021, 8:14 p.m. UTC
We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.

Those "special swap entries" reside in the range that they need to be at least
swap entries first, and their types are decided by swp_type(entry).

This patch introduces another idea called "special swap ptes".

It's very easy to confuse this with "special swap entries", but a special
swap pte should never contain a swap entry at all.  That means it's illegal to
call pte_to_swp_entry() upon a special swap pte.

Make the uffd-wp special pte the first special swap pte.

Before this patch, is_swap_pte()==true means one of the below:

   (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
         example, when an anonymous page got swapped out.

   (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
         example, a migration entry, a hw-poison entry, etc.

After this patch, is_swap_pte()==true means one of the below, where case (b) is
added:

 (a) The pte contains a swap entry.

   (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
         example, when an anonymous page got swapped out.

   (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
         example, a migration entry, a hw-poison entry, etc.

 (b) The pte does not contain a swap entry at all (so it cannot be passed
     into pte_to_swp_entry()).  For example, uffd-wp special swap pte.

Teach the whole mm core about this new idea.  It's done by introducing another
helper called pte_has_swap_entry(), which covers cases (a.1) and (a.2).  For
now it is equivalent to is_swap_pte() because there's no special swap pte yet.
For most of the previous uses of is_swap_pte() in mm core, we'll need to use
the new helper pte_has_swap_entry() instead, to make sure we won't try to parse
a swap entry from a swap special pte (which does not contain a swap entry at
all!).  We either handle the swap special pte explicitly, or it'll naturally
fall through to the default "else" paths.
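
For illustration, a typical converted pte walker then looks roughly like the
below (only a sketch, not an actual call site of this patch):

  pte_t pte = *ptep;

  if (pte_present(pte)) {
          /* A present page, handled as before */
  } else if (pte_has_swap_entry(pte)) {
          /* Safe: the pte really carries a swap entry */
          swp_entry_t entry = pte_to_swp_entry(pte);
          /* ... swap, migration, hw-poison, device private, ... */
  } else if (is_swap_special_pte(pte)) {
          /* E.g. the uffd-wp special pte; never call pte_to_swp_entry() */
  } else {
          /* pte_none(): nothing mapped at all */
  }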

Warn properly (e.g., in do_swap_page()) when we see a special swap pte - we
should never call do_swap_page() upon those ptes, so just bail out early if it
happens.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/kernel/mte.c |  2 +-
 fs/proc/task_mmu.c      | 14 ++++++++------
 include/linux/swapops.h | 39 ++++++++++++++++++++++++++++++++++++++-
 mm/gup.c                |  2 +-
 mm/hmm.c                |  2 +-
 mm/khugepaged.c         | 11 ++++++++++-
 mm/madvise.c            |  4 ++--
 mm/memcontrol.c         |  2 +-
 mm/memory.c             |  7 +++++++
 mm/migrate.c            |  4 ++--
 mm/mincore.c            |  2 +-
 mm/mprotect.c           |  2 +-
 mm/mremap.c             |  2 +-
 mm/page_vma_mapped.c    |  6 +++---
 mm/swapfile.c           |  2 +-
 15 files changed, 78 insertions(+), 23 deletions(-)

Comments

Alistair Popple July 16, 2021, 5:50 a.m. UTC | #1
Hi Peter,

[...]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0cb581..4b46c099ad94 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>  
>  	if (pte_present(ptent))
>  		page = mc_handle_present_pte(vma, addr, ptent);
> -	else if (is_swap_pte(ptent))
> +	else if (pte_has_swap_entry(ptent))
>  		page = mc_handle_swap_pte(vma, ptent, &ent);
>  	else if (pte_none(ptent))
>  		page = mc_handle_file_pte(vma, addr, ptent, &ent);

As I understand things pte_none() == False for a special swap pte, but
shouldn't this be treated as pte_none() here? Ie. does this need to be
pte_none(ptent) || is_swap_special_pte() here?

> diff --git a/mm/memory.c b/mm/memory.c
> index 0e0de08a2cd5..998a4f9a3744 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!pte_unmap_same(vmf))
>  		goto out;
>  
> +	/*
> +	 * We should never call do_swap_page upon a swap special pte; just be
> +	 * safe to bail out if it happens.
> +	 */
> +	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> +		goto out;
> +
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {

Are there other changes required here? Because we can end up with stale special
pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
checks in these functions:

insert_pfn() -> checks for !pte_none
remap_pte_range() -> BUG_ON(!pte_none)
apply_to_pte_range() -> didn't check further but it tests for !pte_none

In general it feels like I might be missing something here though. There are
plenty of checks in the kernel for pte_none() which haven't been updated. Is
there some rule that says none of those paths can see a special pte?

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 23cbd9de030b..b477d0d5f911 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  
>  	spin_lock(ptl);
>  	pte = *ptep;
> -	if (!is_swap_pte(pte))
> +	if (!pte_has_swap_entry(pte))
>  		goto out;
>  
>  	entry = pte_to_swp_entry(pte);
> @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  
>  		pte = *ptep;
>  
> -		if (pte_none(pte)) {
> +		if (pte_none(pte) || is_swap_special_pte(pte)) {

I was wondering if we can lose the special pte information here? However I see
that in migrate_vma_insert_page() we check again and fail the migration if
!pte_none() so I think this is ok.

I think it would be better if this check was moved below so the migration fails
early. Ie:

		if (pte_none(pte)) {
 			if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {

Also how does this work for page migration in general? I can see in
page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
lose the special pte in that instance? Or is that ok for some reason?

>  			if (vma_is_anonymous(vma)) {
>  				mpfn = MIGRATE_PFN_MIGRATE;
>  				migrate->cpages++;
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 9122676b54d6..5728c3e6473f 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -121,7 +121,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>  		pte_t pte = *ptep;
>  
> -		if (pte_none(pte))
> +		if (pte_none(pte) || is_swap_special_pte(pte))
>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>  						 vma, vec);
>  		else if (pte_present(pte))
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 883e2cc85cad..4b743394afbe 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -139,7 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			}
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  			pages++;
> -		} else if (is_swap_pte(oldpte)) {
> +		} else if (pte_has_swap_entry(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>  			pte_t newpte;
>  
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 5989d3990020..122b279333ee 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -125,7 +125,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  #ifdef CONFIG_MEM_SOFT_DIRTY
>  	if (pte_present(pte))
>  		pte = pte_mksoft_dirty(pte);
> -	else if (is_swap_pte(pte))
> +	else if (pte_has_swap_entry(pte))
>  		pte = pte_swp_mksoft_dirty(pte);
>  #endif
>  	return pte;
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index f7b331081791..ff57b67426af 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -36,7 +36,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
>  			 * For more details on device private memory see HMM
>  			 * (include/linux/hmm.h or mm/hmm.c).
>  			 */
> -			if (is_swap_pte(*pvmw->pte)) {
> +			if (pte_has_swap_entry(*pvmw->pte)) {
>  				swp_entry_t entry;
>  
>  				/* Handle un-addressable ZONE_DEVICE memory */
> @@ -90,7 +90,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  
>  	if (pvmw->flags & PVMW_MIGRATION) {
>  		swp_entry_t entry;
> -		if (!is_swap_pte(*pvmw->pte))
> +		if (!pte_has_swap_entry(*pvmw->pte))
>  			return false;
>  		entry = pte_to_swp_entry(*pvmw->pte);
>  
> @@ -99,7 +99,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  			return false;
>  
>  		pfn = swp_offset(entry);
> -	} else if (is_swap_pte(*pvmw->pte)) {
> +	} else if (pte_has_swap_entry(*pvmw->pte)) {
>  		swp_entry_t entry;
>  
>  		/* Handle un-addressable ZONE_DEVICE memory */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1e07d1c776f2..4993b4454c13 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1951,7 +1951,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	si = swap_info[type];
>  	pte = pte_offset_map(pmd, addr);
>  	do {
> -		if (!is_swap_pte(*pte))
> +		if (!pte_has_swap_entry(*pte))
>  			continue;
>  
>  		entry = pte_to_swp_entry(*pte);
>
Peter Xu July 16, 2021, 7:11 p.m. UTC | #2
On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> Hi Peter,
> 
> [...]
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae1f5d0cb581..4b46c099ad94 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> >  
> >  	if (pte_present(ptent))
> >  		page = mc_handle_present_pte(vma, addr, ptent);
> > -	else if (is_swap_pte(ptent))
> > +	else if (pte_has_swap_entry(ptent))
> >  		page = mc_handle_swap_pte(vma, ptent, &ent);
> >  	else if (pte_none(ptent))
> >  		page = mc_handle_file_pte(vma, addr, ptent, &ent);
> 
> As I understand things pte_none() == False for a special swap pte, but
> shouldn't this be treated as pte_none() here? Ie. does this need to be
> pte_none(ptent) || is_swap_special_pte() here?

Looks correct; here the page/swap cache could hide behind the special pte just
like a none pte.  Will fix it.  Thanks!
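
Probably something like below (only a sketch, untested):

	if (pte_present(ptent))
		page = mc_handle_present_pte(vma, addr, ptent);
	else if (pte_has_swap_entry(ptent))
		page = mc_handle_swap_pte(vma, ptent, &ent);
	else if (pte_none(ptent) || is_swap_special_pte(ptent))
		page = mc_handle_file_pte(vma, addr, ptent, &ent);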

> 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e0de08a2cd5..998a4f9a3744 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (!pte_unmap_same(vmf))
> >  		goto out;
> >  
> > +	/*
> > +	 * We should never call do_swap_page upon a swap special pte; just be
> > +	 * safe to bail out if it happens.
> > +	 */
> > +	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > +		goto out;
> > +
> >  	entry = pte_to_swp_entry(vmf->orig_pte);
> >  	if (unlikely(non_swap_entry(entry))) {
> >  		if (is_migration_entry(entry)) {
> 
> Are there other changes required here? Because we can end up with stale special
> pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> checks in these functions:
> 
> insert_pfn() -> checks for !pte_none
> remap_pte_range() -> BUG_ON(!pte_none)
> apply_to_pte_range() -> didn't check further but it tests for !pte_none
> 
> In general it feels like I might be missing something here though. There are
> plenty of checks in the kernel for pte_none() which haven't been updated. Is
> there some rule that says none of those paths can see a special pte?

My rule on doing this was to only care about vma that can be backed by RAM,
majorly shmem/hugetlb, so the special pte can only exist there within those
vmas.  I believe in most pte_none() users this special pte won't exist.

So if it's not related to RAM backed memory at all, maybe it's fine to keep the
pte_none() usage like before.

Take the example of insert_pfn() referenced first - I think it can be used to
map some MMIO regions, but I don't think we'll call that upon a RAM region
(either shmem or hugetlb), nor can it be uffd wr-protected.  So I'm not sure
adding special pte check there would be helpful.

apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
more on whether the special pte will matter.  I had a quick look; it seems to
still be used mostly by all kinds of driver code, not mm core.  It's used in two
forms:

        apply_to_page_range
        apply_to_existing_page_range

The first one creates ptes only, so it ignores the pte_none() check so I skipped.

The second one has two call sites:

*** arch/powerpc/mm/pageattr.c:
change_memory_attr[99]         return apply_to_existing_page_range(&init_mm, start, size,
set_memory_attr[132]           return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,

*** mm/kasan/shadow.c:
kasan_release_vmalloc[485]     apply_to_existing_page_range(&init_mm,

I'll leave the ppc callers for now as uffd-wp is not even supported there.  The
kasan_release_vmalloc() should be for kernel allocated memories only, so should
not be a target for special pte either.

So indeed it's hard to 100% cover all pte_none() users to make sure things are
used right.  As stated above I still believe most callers don't need that, but
the worst case is if someone triggered uffd-wp issues with a specific feature,
we can look into it.  I am not sure whether it's good we add this for all the
pte_none() users, because mostly they'll be useless checks, imho.

So far what I planned to do is to cover most things we know that may be
affected like this patch so the change may bring a difference, hopefully we
won't miss any important spots.

> 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 23cbd9de030b..b477d0d5f911 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> >  
> >  	spin_lock(ptl);
> >  	pte = *ptep;
> > -	if (!is_swap_pte(pte))
> > +	if (!pte_has_swap_entry(pte))
> >  		goto out;
> >  
> >  	entry = pte_to_swp_entry(pte);
> > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  
> >  		pte = *ptep;
> >  
> > -		if (pte_none(pte)) {
> > +		if (pte_none(pte) || is_swap_special_pte(pte)) {
> 
> I was wondering if we can loose the special pte information here? However I see
> that in migrate_vma_insert_page() we check again and fail the migration if
> !pte_none() so I think this is ok.
> 
> I think it would be better if this check was moved below so the migration fails
> early. Ie:
> 
> 		if (pte_none(pte)) {
>  			if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {

Hmm.. but shouldn't vma_is_anonymous()==true already mean it must not be a
swap special pte?  Because swap special pte only exists when !vma_is_anonymous().

> 
> Also how does this work for page migration in general? I can see in
> page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> loose the special pte in that instance? Or is that ok for some reason?

Do you mean try_to_migrate_one()? Does it need to be aware of that?  Per my
understanding that's only for anonymous private memory, while in that world
there should be no swap special pte (page_lock_anon_vma_read will return NULL
early for !vma_is_anonymous).

Thanks,
Alistair Popple July 21, 2021, 11:28 a.m. UTC | #3
On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > Hi Peter,
> > 
> > [...]
> > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ae1f5d0cb581..4b46c099ad94 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > >  
> > >  	if (pte_present(ptent))
> > >  		page = mc_handle_present_pte(vma, addr, ptent);
> > > -	else if (is_swap_pte(ptent))
> > > +	else if (pte_has_swap_entry(ptent))
> > >  		page = mc_handle_swap_pte(vma, ptent, &ent);
> > >  	else if (pte_none(ptent))
> > >  		page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > 
> > As I understand things pte_none() == False for a special swap pte, but
> > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > pte_none(ptent) || is_swap_special_pte() here?
> 
> Looks correct; here the page/swap cache could hide behind the special pte just
> like a none pte.  Will fix it.  Thanks!
> 
> > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >  	if (!pte_unmap_same(vmf))
> > >  		goto out;
> > >  
> > > +	/*
> > > +	 * We should never call do_swap_page upon a swap special pte; just be
> > > +	 * safe to bail out if it happens.
> > > +	 */
> > > +	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > +		goto out;
> > > +
> > >  	entry = pte_to_swp_entry(vmf->orig_pte);
> > >  	if (unlikely(non_swap_entry(entry))) {
> > >  		if (is_migration_entry(entry)) {
> > 
> > Are there other changes required here? Because we can end up with stale special
> > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > checks in these functions:
> > 
> > insert_pfn() -> checks for !pte_none
> > remap_pte_range() -> BUG_ON(!pte_none)
> > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > 
> > In general it feels like I might be missing something here though. There are
> > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > there some rule that says none of those paths can see a special pte?
> 
> My rule on doing this was to only care about vma that can be backed by RAM,
> majorly shmem/hugetlb, so the special pte can only exist there within those
> vmas.  I believe in most pte_none() users this special pte won't exist.
> 
> So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> pte_none() usage like before.
> 
> Take the example of insert_pfn() referenced first - I think it can be used to
> map some MMIO regions, but I don't think we'll call that upon a RAM region
> (either shmem or hugetlb), nor can it be uffd wr-protected.  So I'm not sure
> adding special pte check there would be helpful.
> 
> apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> more on whether the special pte will matter.  I had a quick look, it seems
> still be used mostly by all kinds of driver code not mm core.  It's used in two
> forms:
> 
>         apply_to_page_range
>         apply_to_existing_page_range
> 
> The first one creates ptes only, so it ignores the pte_none() check so I skipped.
> 
> The second one has two call sites:
> 
> *** arch/powerpc/mm/pageattr.c:
> change_memory_attr[99]         return apply_to_existing_page_range(&init_mm, start, size,
> set_memory_attr[132]           return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> 
> *** mm/kasan/shadow.c:
> kasan_release_vmalloc[485]     apply_to_existing_page_range(&init_mm,
> 
> I'll leave the ppc callers for now as uffd-wp is not even supported there.  The
> kasan_release_vmalloc() should be for kernel allocated memories only, so should
> not be a target for special pte either.
> 
> So indeed it's hard to 100% cover all pte_none() users to make sure things are
> used right.  As stated above I still believe most callers don't need that, but
> the worst case is if someone triggered uffd-wp issues with a specific feature,
> we can look into it.  I am not sure whether it's good we add this for all the
> pte_none() users, because mostly they'll be useless checks, imho.

I wonder then - should we make pte_none() return true for these special pte's
as well? It seems if we do miss any callers it could result in some fairly hard
to find bugs if the code follows a different path due to the presence of an
unexpected special pte changing the result of pte_none().

> So far what I planned to do is to cover most things we know that may be
> affected like this patch so the change may bring a difference, hopefully we
> won't miss any important spots.
> 
> > 
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 23cbd9de030b..b477d0d5f911 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > >  
> > >  	spin_lock(ptl);
> > >  	pte = *ptep;
> > > -	if (!is_swap_pte(pte))
> > > +	if (!pte_has_swap_entry(pte))
> > >  		goto out;
> > >  
> > >  	entry = pte_to_swp_entry(pte);
> > > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >  
> > >  		pte = *ptep;
> > >  
> > > -		if (pte_none(pte)) {
> > > +		if (pte_none(pte) || is_swap_special_pte(pte)) {
> > 
> > I was wondering if we can loose the special pte information here? However I see
> > that in migrate_vma_insert_page() we check again and fail the migration if
> > !pte_none() so I think this is ok.
> > 
> > I think it would be better if this check was moved below so the migration fails
> > early. Ie:
> > 
> > 		if (pte_none(pte)) {
> >  			if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {
> 
> Hmm.. but shouldn't vma_is_anonymous()==true already means it must not be a
> swap special pte?  Because swap special pte only exists when !vma_is_anonymous().

Oh ok that makes sense. With the code written that way it is easy to forget
that though so maybe a comment would help?

> > 
> > Also how does this work for page migration in general? I can see in
> > page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> > loose the special pte in that instance? Or is that ok for some reason?
> 
> Do you mean try_to_migrate_one()? Does it need to be aware of that?  Per my
> understanding that's only for anonymous private memory, while in that world
> there should have no swap special pte (page_lock_anon_vma_read will return NULL
> early for !vma_is_anonymous).

As far as I know try_to_migrate_one() gets called for both anonymous pages and
file-backed pages. page_lock_anon_vma_read() is only called in the case of an
anonymous vma. See the implementation of rmap_walk() - it will call either
rmap_walk_anon() or rmap_walk_file() depending on the result of PageAnon().

 - Alistair

> Thanks,
> 
>
Peter Xu July 21, 2021, 9:35 p.m. UTC | #4
On Wed, Jul 21, 2021 at 09:28:49PM +1000, Alistair Popple wrote:
> On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> > On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > > Hi Peter,
> > > 
> > > [...]
> > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index ae1f5d0cb581..4b46c099ad94 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > >  
> > > >  	if (pte_present(ptent))
> > > >  		page = mc_handle_present_pte(vma, addr, ptent);
> > > > -	else if (is_swap_pte(ptent))
> > > > +	else if (pte_has_swap_entry(ptent))
> > > >  		page = mc_handle_swap_pte(vma, ptent, &ent);
> > > >  	else if (pte_none(ptent))
> > > >  		page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > > 
> > > As I understand things pte_none() == False for a special swap pte, but
> > > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > > pte_none(ptent) || is_swap_special_pte() here?
> > 
> > Looks correct; here the page/swap cache could hide behind the special pte just
> > like a none pte.  Will fix it.  Thanks!
> > 
> > > 
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > >  	if (!pte_unmap_same(vmf))
> > > >  		goto out;
> > > >  
> > > > +	/*
> > > > +	 * We should never call do_swap_page upon a swap special pte; just be
> > > > +	 * safe to bail out if it happens.
> > > > +	 */
> > > > +	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > > +		goto out;
> > > > +
> > > >  	entry = pte_to_swp_entry(vmf->orig_pte);
> > > >  	if (unlikely(non_swap_entry(entry))) {
> > > >  		if (is_migration_entry(entry)) {
> > > 
> > > Are there other changes required here? Because we can end up with stale special
> > > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > > checks in these functions:
> > > 
> > > insert_pfn() -> checks for !pte_none
> > > remap_pte_range() -> BUG_ON(!pte_none)
> > > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > > 
> > > In general it feels like I might be missing something here though. There are
> > > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > > there some rule that says none of those paths can see a special pte?
> > 
> > My rule on doing this was to only care about vma that can be backed by RAM,
> > majorly shmem/hugetlb, so the special pte can only exist there within those
> > vmas.  I believe in most pte_none() users this special pte won't exist.
> > 
> > So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> > pte_none() usage like before.
> > 
> > Take the example of insert_pfn() referenced first - I think it can be used to
> > map some MMIO regions, but I don't think we'll call that upon a RAM region
> > (either shmem or hugetlb), nor can it be uffd wr-protected.  So I'm not sure
> > adding special pte check there would be helpful.
> > 
> > apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> > more on whether the special pte will matter.  I had a quick look, it seems
> > still be used mostly by all kinds of driver code not mm core.  It's used in two
> > forms:
> > 
> >         apply_to_page_range
> >         apply_to_existing_page_range
> > 
> > The first one creates ptes only, so it ignores the pte_none() check so I skipped.
> > 
> > The second one has two call sites:
> > 
> > *** arch/powerpc/mm/pageattr.c:
> > change_memory_attr[99]         return apply_to_existing_page_range(&init_mm, start, size,
> > set_memory_attr[132]           return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> > 
> > *** mm/kasan/shadow.c:
> > kasan_release_vmalloc[485]     apply_to_existing_page_range(&init_mm,
> > 
> > I'll leave the ppc callers for now as uffd-wp is not even supported there.  The
> > kasan_release_vmalloc() should be for kernel allocated memories only, so should
> > not be a target for special pte either.
> > 
> > So indeed it's hard to 100% cover all pte_none() users to make sure things are
> > used right.  As stated above I still believe most callers don't need that, but
> > the worst case is if someone triggered uffd-wp issues with a specific feature,
> > we can look into it.  I am not sure whether it's good we add this for all the
> > pte_none() users, because mostly they'll be useless checks, imho.
> 
> I wonder then - should we make pte_none() return true for these special pte's
> as well? It seems if we do miss any callers it could result in some fairly hard
> to find bugs if the code follows a different path due to the presence of an
> unexpected special pte changing the result of pte_none().

I thought about something similar before, but I didn't dare to change
pte_none() as it's been there for ages and I'm afraid people will get confused
when its meaning changes.  So even if we want to have some helper identifying
"either none pte or the swap special pte" it should use a different name.

Modifying the meaning of pte_none() could also bring other risks, e.g. when we
really want an empty pte to mean something else in the future.  It turns out
there's no easy way around identifying the cases one by one, at least to me.
I'm always
open to good suggestions.

Btw, as you mentioned before, we can use a new number out of MAX_SWAPFILES,
that'll make all these easier a bit here, then we don't need to worry on
pte_none() issues too.  Two days ago Hugh has raised some similar concern on
whether it's good to implement this uffd-wp special pte like this.  I think we
can discuss this separately.

> 
> > So far what I planned to do is to cover most things we know that may be
> > affected like this patch so the change may bring a difference, hopefully we
> > won't miss any important spots.
> > 
> > > 
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index 23cbd9de030b..b477d0d5f911 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > > >  
> > > >  	spin_lock(ptl);
> > > >  	pte = *ptep;
> > > > -	if (!is_swap_pte(pte))
> > > > +	if (!pte_has_swap_entry(pte))
> > > >  		goto out;
> > > >  
> > > >  	entry = pte_to_swp_entry(pte);
> > > > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > >  
> > > >  		pte = *ptep;
> > > >  
> > > > -		if (pte_none(pte)) {
> > > > +		if (pte_none(pte) || is_swap_special_pte(pte)) {
> > > 
> > > I was wondering if we can loose the special pte information here? However I see
> > > that in migrate_vma_insert_page() we check again and fail the migration if
> > > !pte_none() so I think this is ok.
> > > 
> > > I think it would be better if this check was moved below so the migration fails
> > > early. Ie:
> > > 
> > > 		if (pte_none(pte)) {
> > >  			if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {
> > 
> > Hmm.. but shouldn't vma_is_anonymous()==true already means it must not be a
> > swap special pte?  Because swap special pte only exists when !vma_is_anonymous().
> 
> Oh ok that makes sense. With the code written that way it is easy to forget
> that though so maybe a comment would help?

I've put most words in comment of is_swap_special_pte().  Do you perhaps have a
suggestion on the comment here?

> 
> > > 
> > > Also how does this work for page migration in general? I can see in
> > > page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> > > loose the special pte in that instance? Or is that ok for some reason?
> > 
> > Do you mean try_to_migrate_one()? Does it need to be aware of that?  Per my
> > understanding that's only for anonymous private memory, while in that world
> > there should have no swap special pte (page_lock_anon_vma_read will return NULL
> > early for !vma_is_anonymous).
> 
> As far as I know try_to_migrate_one() gets called for both anonymous pages and
> file-backed pages. page_lock_anon_vma_read() is only called in the case of an
> anonymous vma. See the implementation of rmap_walk() - it will call either
> rmap_walk_anon() or rmap_walk_file() depending on the result of PageAnon().

I may have replied too soon there. :)  I think you're right.

So I think how it should work with page migration is: we skip that pte just
like what you said (check_pte returns false), then the per-pte info will be
kept there, irrespective of what the backing page is.  When it faults, it'll
bring in either the old or the new page depending on whether migration has
finished.  Does that sound right to you?

Thanks,
Alistair Popple July 22, 2021, 1:08 a.m. UTC | #5
On Thursday, 22 July 2021 7:35:32 AM AEST Peter Xu wrote:
> On Wed, Jul 21, 2021 at 09:28:49PM +1000, Alistair Popple wrote:
> > On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> > > On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > > > Hi Peter,
> > > > 
> > > > [...]
> > > > 
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index ae1f5d0cb581..4b46c099ad94 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > > >  
> > > > >  	if (pte_present(ptent))
> > > > >  		page = mc_handle_present_pte(vma, addr, ptent);
> > > > > -	else if (is_swap_pte(ptent))
> > > > > +	else if (pte_has_swap_entry(ptent))
> > > > >  		page = mc_handle_swap_pte(vma, ptent, &ent);
> > > > >  	else if (pte_none(ptent))
> > > > >  		page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > > > 
> > > > As I understand things pte_none() == False for a special swap pte, but
> > > > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > > > pte_none(ptent) || is_swap_special_pte() here?
> > > 
> > > Looks correct; here the page/swap cache could hide behind the special pte just
> > > like a none pte.  Will fix it.  Thanks!
> > > 
> > > > 
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > >  	if (!pte_unmap_same(vmf))
> > > > >  		goto out;
> > > > >  
> > > > > +	/*
> > > > > +	 * We should never call do_swap_page upon a swap special pte; just be
> > > > > +	 * safe to bail out if it happens.
> > > > > +	 */
> > > > > +	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > > > +		goto out;
> > > > > +
> > > > >  	entry = pte_to_swp_entry(vmf->orig_pte);
> > > > >  	if (unlikely(non_swap_entry(entry))) {
> > > > >  		if (is_migration_entry(entry)) {
> > > > 
> > > > Are there other changes required here? Because we can end up with stale special
> > > > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > > > checks in these functions:
> > > > 
> > > > insert_pfn() -> checks for !pte_none
> > > > remap_pte_range() -> BUG_ON(!pte_none)
> > > > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > > > 
> > > > In general it feels like I might be missing something here though. There are
> > > > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > > > there some rule that says none of those paths can see a special pte?
> > > 
> > > My rule on doing this was to only care about vma that can be backed by RAM,
> > > majorly shmem/hugetlb, so the special pte can only exist there within those
> > > vmas.  I believe in most pte_none() users this special pte won't exist.
> > > 
> > > So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> > > pte_none() usage like before.
> > > 
> > > Take the example of insert_pfn() referenced first - I think it can be used to
> > > map some MMIO regions, but I don't think we'll call that upon a RAM region
> > > (either shmem or hugetlb), nor can it be uffd wr-protected.  So I'm not sure
> > > adding special pte check there would be helpful.
> > > 
> > > apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> > > more on whether the special pte will matter.  I had a quick look, it seems
> > > still be used mostly by all kinds of driver code not mm core.  It's used in two
> > > forms:
> > > 
> > >         apply_to_page_range
> > >         apply_to_existing_page_range
> > > 
> > > The first one creates ptes only, so it ignores the pte_none() check so I skipped.
> > > 
> > > The second one has two call sites:
> > > 
> > > *** arch/powerpc/mm/pageattr.c:
> > > change_memory_attr[99]         return apply_to_existing_page_range(&init_mm, start, size,
> > > set_memory_attr[132]           return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> > > 
> > > *** mm/kasan/shadow.c:
> > > kasan_release_vmalloc[485]     apply_to_existing_page_range(&init_mm,
> > > 
> > > I'll leave the ppc callers for now as uffd-wp is not even supported there.  The
> > > kasan_release_vmalloc() should be for kernel allocated memories only, so should
> > > not be a target for special pte either.
> > > 
> > > So indeed it's hard to 100% cover all pte_none() users to make sure things are
> > > used right.  As stated above I still believe most callers don't need that, but
> > > the worst case is if someone triggered uffd-wp issues with a specific feature,
> > > we can look into it.  I am not sure whether it's good we add this for all the
> > > pte_none() users, because mostly they'll be useless checks, imho.
> > 
> > I wonder then - should we make pte_none() return true for these special pte's
> > as well? It seems if we do miss any callers it could result in some fairly hard
> > to find bugs if the code follows a different path due to the presence of an
> > unexpected special pte changing the result of pte_none().
> 
> I thought about something similar before, but I didn't dare to change
> pte_none() as it's been there for ages and I'm afraid people will get confused
> when it's meaning changed.  So even if we want to have some helper identifying
> "either none pte or the swap special pte" it should use a different name.
> 
> Modifying the meaning of pte_none() could also have other risks that when we
> really want an empty pte to be doing something else now.  It turns out there's
> no easy way to not identify the case one by one, at least to me.  I'm always
> open to good suggestions.

I'm not convinced it's changing the behaviour of pte_none() though and my
concern is that introducing special swap ptes does change it. Prior to this
clearing a pte would result in pte_none()==True. After this series clearing a
pte can sometimes result in pte_none()==False because it doesn't really
get cleared.

Now as you say it's hard to cover 100% of pte_none() uses, so it's possible we
have missed cases that may now encounter a special pte and take a different
path (get_mctgt_type() is one example, I stopped looking for other possible
ones after mm/memory.c).

So perhaps if we want to keep pte_none() to check for really clear pte's then
what is required is converting all callers to a new helper
(pte_none_not_special()?) that treats special swap ptes as pte_none() and warns
if a special pte is encountered?
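
Roughly something like this (name and details only a sketch of the idea):

	static inline bool pte_none_not_special(pte_t pte)
	{
		/* Flag callers that were not expecting a special swap pte */
		WARN_ON_ONCE(is_swap_special_pte(pte));
		/* ... but still treat it as "nothing mapped here" */
		return pte_none(pte) || is_swap_special_pte(pte);
	}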

> Btw, as you mentioned before, we can use a new number out of MAX_SWAPFILES,
> that'll make all these easier a bit here, then we don't need to worry on
> pte_none() issues too.  Two days ago Hugh has raised some similar concern on
> whether it's good to implement this uffd-wp special pte like this.  I think we
> can discuss this separately.

Yes, I saw that and personally I still prefer that approach.

> > 
> > > So far what I planned to do is to cover most things we know that may be
> > > affected like this patch so the change may bring a difference, hopefully we
> > > won't miss any important spots.
> > > 
> > > > 
> > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > index 23cbd9de030b..b477d0d5f911 100644
> > > > > --- a/mm/migrate.c
> > > > > +++ b/mm/migrate.c
> > > > > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > > > >  
> > > > >  	spin_lock(ptl);
> > > > >  	pte = *ptep;
> > > > > -	if (!is_swap_pte(pte))
> > > > > +	if (!pte_has_swap_entry(pte))
> > > > >  		goto out;
> > > > >  
> > > > >  	entry = pte_to_swp_entry(pte);
> > > > > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > > >  
> > > > >  		pte = *ptep;
> > > > >  
> > > > > -		if (pte_none(pte)) {
> > > > > +		if (pte_none(pte) || is_swap_special_pte(pte)) {
> > > > 
> > > > I was wondering if we can loose the special pte information here? However I see
> > > > that in migrate_vma_insert_page() we check again and fail the migration if
> > > > !pte_none() so I think this is ok.
> > > > 
> > > > I think it would be better if this check was moved below so the migration fails
> > > > early. Ie:
> > > > 
> > > > 		if (pte_none(pte)) {
> > > >  			if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {
> > > 
> > > Hmm.. but shouldn't vma_is_anonymous()==true already means it must not be a
> > > swap special pte?  Because swap special pte only exists when !vma_is_anonymous().
> > 
> > Oh ok that makes sense. With the code written that way it is easy to forget
> > that though so maybe a comment would help?
> 
> I've put most words in comment of is_swap_special_pte().  Do you perhaps have a
> suggestion on the comment here?

Perhaps something like "swap special ptes only exist for !vma_is_anonymous(vma)"?
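
Ie. in migrate_vma_collect_pmd() (sketch only):

		if (pte_none(pte) || is_swap_special_pte(pte)) {
			/* Swap special ptes only exist for !vma_is_anonymous(vma) */
			if (vma_is_anonymous(vma)) {
				mpfn = MIGRATE_PFN_MIGRATE;
				migrate->cpages++;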

And I now see my original code suggestion was wrong anyway :)

> > 
> > > > 
> > > > Also how does this work for page migration in general? I can see in
> > > > page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> > > > loose the special pte in that instance? Or is that ok for some reason?
> > > 
> > > Do you mean try_to_migrate_one()? Does it need to be aware of that?  Per my
> > > understanding that's only for anonymous private memory, while in that world
> > > there should have no swap special pte (page_lock_anon_vma_read will return NULL
> > > early for !vma_is_anonymous).
> > 
> > As far as I know try_to_migrate_one() gets called for both anonymous pages and
> > file-backed pages. page_lock_anon_vma_read() is only called in the case of an
> > anonymous vma. See the implementation of rmap_walk() - it will call either
> > rmap_walk_anon() or rmap_walk_file() depending on the result of PageAnon().
> 
> I may have replied too soon there. :)  I think you're right.
> 
> So I think how it should work with page migration is: we skip that pte just
> like what you said (check_pte returns false), then the per-pte info will be
> kept there, irrelevant of what's the backing page is.  When it faults, it'll
> bring up with either the old/new page depending on migration finished or not.
> Does that sound working to you?

Yes actually I think this is ok. check_pte returns false for special pte's so
the existing special pte will be left in place to be dealt with as normal.
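
Ie. with this patch the special pte simply fails the check in check_pte() and
is left untouched, e.g. for the migration case (comment mine):

	if (pvmw->flags & PVMW_MIGRATION) {
		swp_entry_t entry;
		/* A swap special pte fails this check and is left in place */
		if (!pte_has_swap_entry(*pvmw->pte))
			return false;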

 - Alistair

> Thanks,
> 
>
Peter Xu July 22, 2021, 3:21 p.m. UTC | #6
On Thu, Jul 22, 2021 at 11:08:53AM +1000, Alistair Popple wrote:
> On Thursday, 22 July 2021 7:35:32 AM AEST Peter Xu wrote:
> > On Wed, Jul 21, 2021 at 09:28:49PM +1000, Alistair Popple wrote:
> > > On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> > > > On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > > > > Hi Peter,
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index ae1f5d0cb581..4b46c099ad94 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > > > >  
> > > > > >  	if (pte_present(ptent))
> > > > > >  		page = mc_handle_present_pte(vma, addr, ptent);
> > > > > > -	else if (is_swap_pte(ptent))
> > > > > > +	else if (pte_has_swap_entry(ptent))
> > > > > >  		page = mc_handle_swap_pte(vma, ptent, &ent);
> > > > > >  	else if (pte_none(ptent))
> > > > > >  		page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > > > > 
> > > > > As I understand things pte_none() == False for a special swap pte, but
> > > > > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > > > > pte_none(ptent) || is_swap_special_pte() here?
> > > > 
> > > > Looks correct; here the page/swap cache could hide behind the special pte just
> > > > like a none pte.  Will fix it.  Thanks!
> > > > 
> > > > > 
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > >  	if (!pte_unmap_same(vmf))
> > > > > >  		goto out;
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * We should never call do_swap_page upon a swap special pte; just be
> > > > > > +	 * safe to bail out if it happens.
> > > > > > +	 */
> > > > > > +	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > > > > +		goto out;
> > > > > > +
> > > > > >  	entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > >  	if (unlikely(non_swap_entry(entry))) {
> > > > > >  		if (is_migration_entry(entry)) {
> > > > > 
> > > > > Are there other changes required here? Because we can end up with stale special
> > > > > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > > > > checks in these functions:
> > > > > 
> > > > > insert_pfn() -> checks for !pte_none
> > > > > remap_pte_range() -> BUG_ON(!pte_none)
> > > > > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > > > > 
> > > > > In general it feels like I might be missing something here though. There are
> > > > > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > > > > there some rule that says none of those paths can see a special pte?
> > > > 
> > > > My rule on doing this was to only care about vma that can be backed by RAM,
> > > > majorly shmem/hugetlb, so the special pte can only exist there within those
> > > > vmas.  I believe in most pte_none() users this special pte won't exist.
> > > > 
> > > > So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> > > > pte_none() usage like before.
> > > > 
> > > > Take the example of insert_pfn() referenced first - I think it can be used to
> > > > map some MMIO regions, but I don't think we'll call that upon a RAM region
> > > > (either shmem or hugetlb), nor can it be uffd wr-protected.  So I'm not sure
> > > > adding special pte check there would be helpful.
> > > > 
> > > > apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> > > > more on whether the special pte will matter.  I had a quick look, it seems
> > > > still be used mostly by all kinds of driver code not mm core.  It's used in two
> > > > forms:
> > > > 
> > > >         apply_to_page_range
> > > >         apply_to_existing_page_range
> > > > 
> > > > The first one creates ptes only, so it ignores the pte_none() check so I skipped.
> > > > 
> > > > The second one has two call sites:
> > > > 
> > > > *** arch/powerpc/mm/pageattr.c:
> > > > change_memory_attr[99]         return apply_to_existing_page_range(&init_mm, start, size,
> > > > set_memory_attr[132]           return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> > > > 
> > > > *** mm/kasan/shadow.c:
> > > > kasan_release_vmalloc[485]     apply_to_existing_page_range(&init_mm,
> > > > 
> > > > I'll leave the ppc callers for now as uffd-wp is not even supported there.  The
> > > > kasan_release_vmalloc() should be for kernel allocated memories only, so should
> > > > not be a target for special pte either.
> > > > 
> > > > So indeed it's hard to 100% cover all pte_none() users to make sure things are
> > > > used right.  As stated above I still believe most callers don't need that, but
> > > > the worst case is if someone triggered uffd-wp issues with a specific feature,
> > > > we can look into it.  I am not sure whether it's good we add this for all the
> > > > pte_none() users, because mostly they'll be useless checks, imho.
> > > 
> > > I wonder then - should we make pte_none() return true for these special pte's
> > > as well? It seems if we do miss any callers it could result in some fairly hard
> > > to find bugs if the code follows a different path due to the presence of an
> > > unexpected special pte changing the result of pte_none().
> > 
> > I thought about something similar before, but I didn't dare to change
> > pte_none() as it's been there for ages and I'm afraid people will get confused
> > when it's meaning changed.  So even if we want to have some helper identifying
> > "either none pte or the swap special pte" it should use a different name.
> > 
> > Modifying the meaning of pte_none() could also have other risks that when we
> > really want an empty pte to be doing something else now.  It turns out there's
> > no easy way to not identify the case one by one, at least to me.  I'm always
> > open to good suggestions.
> 
> I'm not convinced it's changing the behaviour of pte_none() though and my
> concern is that introducing special swap ptes does change it. Prior to this
> clearing a pte would result in pte_none()==True. After this series clearing a
> pte can some sometimes result in pte_none()==False because it doesn't really
> get cleared.

The thing is the uffd special pte is not "none" literally; there's something
inside.  That's what makes it feel not right to me.  I'm not against trapping
all of pte_none(), but as I mentioned I think at least it needs to be renamed
to something else (maybe pte_none_mostly(), but I don't know..).

> 
> Now as you say it's hard to cover 100% of pte_none() uses, so it's possible we
> have missed cases that may now encounter a special pte and take a different
> path (get_mctgt_type() is one example, I stopped looking for other possible
> ones after mm/memory.c).
> 
> So perhaps if we want to keep pte_none() to check for really clear pte's then
> what is required is converting all callers to a new helper
> (pte_none_not_special()?) that treats special swap ptes as pte_none() and warns
> if a special pte is encountered?

By double-checking all the core mm calls to pte_none()?

The special swap pte shouldn't exist for most cases but only for shmem and
hugetlbfs so far.  So we can sensibly drop a lot of pte_none() users IMHO
depending on the type of memory.

> 
> > Btw, as you mentioned before, we can use a new number out of MAX_SWAPFILES,
> > that'll make all these easier a bit here, then we don't need to worry on
> > pte_none() issues too.  Two days ago Hugh has raised some similar concern on
> > whether it's good to implement this uffd-wp special pte like this.  I think we
> > can discuss this separately.
> 
> Yes, I saw that and personally I still prefer that approach.

Yes I see your preference.  Let's hold off a bit on the pte_none() discussions;
I'll re-raise this in the cover letter soon.  If everyone is okay that we use
yet another MAX_SWAPFILES and that's preferred, then I can switch the design.
Then I think I can also avoid touching the pte_none() bits at all, which seems
to be controversial here.

But still, I am also not convinced that we can blindly replace pte_none() with
"either none pte or some special pte", either in this series or (if this series
will switch to swp_entry) in the future when we want to use !pte_present and
!swp_entry ptes.  If we want to replace it, we'd still need to check over all
the users of pte_none(), which is the same work we should do now, plus a proper
rename of it.

Thanks,

Patch

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 69b3fde8759e..841ff639b4b5 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -35,7 +35,7 @@  EXPORT_SYMBOL_GPL(mte_async_mode);
 static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 			       bool check_swap, bool pte_is_tagged)
 {
-	if (check_swap && is_swap_pte(old_pte)) {
+	if (check_swap && pte_has_swap_entry(old_pte)) {
 		swp_entry_t entry = pte_to_swp_entry(old_pte);
 
 		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..9c5af77b5290 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -498,7 +498,7 @@  static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (pte_present(*pte)) {
 		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
+	} else if (pte_has_swap_entry(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 
 		if (!non_swap_entry(swpent)) {
@@ -516,8 +516,10 @@  static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			}
 		} else if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
-	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
-							&& pte_none(*pte))) {
+	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) &&
+			    mss->check_shmem_swap &&
+			    /* Here swap special pte is the same as none pte */
+			    (pte_none(*pte) || is_swap_special_pte(*pte)))) {
 		page = xa_load(&vma->vm_file->f_mapping->i_pages,
 						linear_page_index(vma, addr));
 		if (xa_is_value(page))
@@ -689,7 +691,7 @@  static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 	if (pte_present(*pte)) {
 		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
+	} else if (pte_has_swap_entry(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 
 		if (is_pfn_swap_entry(swpent))
@@ -1071,7 +1073,7 @@  static inline void clear_soft_dirty(struct vm_area_struct *vma,
 		ptent = pte_wrprotect(old_pte);
 		ptent = pte_clear_soft_dirty(ptent);
 		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
-	} else if (is_swap_pte(ptent)) {
+	} else if (pte_has_swap_entry(ptent)) {
 		ptent = pte_swp_clear_soft_dirty(ptent);
 		set_pte_at(vma->vm_mm, addr, pte, ptent);
 	}
@@ -1374,7 +1376,7 @@  static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
 			flags |= PM_UFFD_WP;
-	} else if (is_swap_pte(pte)) {
+	} else if (pte_has_swap_entry(pte)) {
 		swp_entry_t entry;
 		if (pte_swp_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..7f46dec3be1d 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -5,6 +5,7 @@ 
 #include <linux/radix-tree.h>
 #include <linux/bug.h>
 #include <linux/mm_types.h>
+#include <linux/userfaultfd_k.h>
 
 #ifdef CONFIG_MMU
 
@@ -62,12 +63,48 @@  static inline pgoff_t swp_offset(swp_entry_t entry)
 	return entry.val & SWP_OFFSET_MASK;
 }
 
-/* check whether a pte points to a swap entry */
+/*
+ * is_swap_pte() returns true for three cases:
+ *
+ * (a) The pte contains a swap entry.
+ *
+ *   (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
+ *         example, when an anonymous page got swapped out.
+ *
+ *   (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
+ *         example, a migration entry, a hw-poison entry, etc.
+ *
+ * (b) The pte does not contain a swap entry at all (so it cannot be passed
+ *     into pte_to_swp_entry()).  For example, uffd-wp special swap pte.
+ */
 static inline int is_swap_pte(pte_t pte)
 {
 	return !pte_none(pte) && !pte_present(pte);
 }
 
+/*
+ * A swap-like special pte should only be used as special marker to trigger a
+ * page fault.  We should treat them similarly as pte_none() in most cases,
+ * except that it may contain some special information that can persist within
+ * the pte.  Currently the only special swap pte is UFFD_WP_SWP_PTE_SPECIAL.
+ *
+ * Note: we should never call pte_to_swp_entry() upon a special swap pte,
+ * Because a swap special pte does not contain a swap entry!
+ */
+static inline bool is_swap_special_pte(pte_t pte)
+{
+	return pte_swp_uffd_wp_special(pte);
+}
+
+/*
+ * Returns true if the pte contains a swap entry.  This includes not only the
+ * normal swp entry case, but also for migration entries, etc.
+ */
+static inline bool pte_has_swap_entry(pte_t pte)
+{
+	return is_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
 /*
  * Convert the arch-dependent pte representation of a swp_entry_t into an
  * arch-independent swp_entry_t.
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..425c08788921 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -513,7 +513,7 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 		 */
 		if (likely(!(flags & FOLL_MIGRATION)))
 			goto no_page;
-		if (pte_none(pte))
+		if (!pte_has_swap_entry(pte))
 			goto no_page;
 		entry = pte_to_swp_entry(pte);
 		if (!is_migration_entry(entry))
diff --git a/mm/hmm.c b/mm/hmm.c
index fad6be2bf072..aba1bf2c6742 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,7 +239,7 @@  static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	pte_t pte = *ptep;
 	uint64_t pfn_req_flags = *hmm_pfn;
 
-	if (pte_none(pte)) {
+	if (pte_none(pte) || is_swap_special_pte(pte)) {
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0412be08fa2..7376a9b5bfc9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1018,7 +1018,7 @@  static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 
 		vmf.pte = pte_offset_map(pmd, address);
 		vmf.orig_pte = *vmf.pte;
-		if (!is_swap_pte(vmf.orig_pte)) {
+		if (!pte_has_swap_entry(vmf.orig_pte)) {
 			pte_unmap(vmf.pte);
 			continue;
 		}
@@ -1245,6 +1245,15 @@  static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
+			if (is_swap_special_pte(pteval)) {
+				/*
+				 * Reuse SCAN_PTE_UFFD_WP.  If there will be
+				 * new users of is_swap_special_pte(), we'd
+				 * better introduce a new result type.
+				 */
+				result = SCAN_PTE_UFFD_WP;
+				goto out_unmap;
+			}
 			if (++unmapped <= khugepaged_max_ptes_swap) {
 				/*
 				 * Always be strict with uffd-wp
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d3d348b17f4..2a8d2a9fc514 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -204,7 +204,7 @@  static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
 		pte_unmap_unlock(orig_pte, ptl);
 
-		if (pte_present(pte) || pte_none(pte))
+		if (!pte_has_swap_entry(pte))
 			continue;
 		entry = pte_to_swp_entry(pte);
 		if (unlikely(non_swap_entry(entry)))
@@ -596,7 +596,7 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (pte_none(ptent))
+		if (pte_none(ptent) || is_swap_special_pte(ptent))
 			continue;
 		/*
 		 * If the pte has swp_entry, just clear page table to
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0cb581..4b46c099ad94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5738,7 +5738,7 @@  static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 	if (pte_present(ptent))
 		page = mc_handle_present_pte(vma, addr, ptent);
-	else if (is_swap_pte(ptent))
+	else if (pte_has_swap_entry(ptent))
 		page = mc_handle_swap_pte(vma, ptent, &ent);
 	else if (pte_none(ptent))
 		page = mc_handle_file_pte(vma, addr, ptent, &ent);
diff --git a/mm/memory.c b/mm/memory.c
index 0e0de08a2cd5..998a4f9a3744 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3491,6 +3491,13 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
+	/*
+	 * We should never call do_swap_page upon a swap special pte; just be
+	 * safe to bail out if it happens.
+	 */
+	if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
+		goto out;
+
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 23cbd9de030b..b477d0d5f911 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@  void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 
 	spin_lock(ptl);
 	pte = *ptep;
-	if (!is_swap_pte(pte))
+	if (!pte_has_swap_entry(pte))
 		goto out;
 
 	entry = pte_to_swp_entry(pte);
@@ -2276,7 +2276,7 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 		pte = *ptep;
 
-		if (pte_none(pte)) {
+		if (pte_none(pte) || is_swap_special_pte(pte)) {
 			if (vma_is_anonymous(vma)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
 				migrate->cpages++;
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..5728c3e6473f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -121,7 +121,7 @@  static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		pte_t pte = *ptep;
 
-		if (pte_none(pte))
+		if (pte_none(pte) || is_swap_special_pte(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
 		else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..4b743394afbe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,7 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
-		} else if (is_swap_pte(oldpte)) {
+		} else if (pte_has_swap_entry(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 5989d3990020..122b279333ee 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -125,7 +125,7 @@  static pte_t move_soft_dirty_pte(pte_t pte)
 #ifdef CONFIG_MEM_SOFT_DIRTY
 	if (pte_present(pte))
 		pte = pte_mksoft_dirty(pte);
-	else if (is_swap_pte(pte))
+	else if (pte_has_swap_entry(pte))
 		pte = pte_swp_mksoft_dirty(pte);
 #endif
 	return pte;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f7b331081791..ff57b67426af 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -36,7 +36,7 @@  static bool map_pte(struct page_vma_mapped_walk *pvmw)
 			 * For more details on device private memory see HMM
 			 * (include/linux/hmm.h or mm/hmm.c).
 			 */
-			if (is_swap_pte(*pvmw->pte)) {
+			if (pte_has_swap_entry(*pvmw->pte)) {
 				swp_entry_t entry;
 
 				/* Handle un-addressable ZONE_DEVICE memory */
@@ -90,7 +90,7 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 
 	if (pvmw->flags & PVMW_MIGRATION) {
 		swp_entry_t entry;
-		if (!is_swap_pte(*pvmw->pte))
+		if (!pte_has_swap_entry(*pvmw->pte))
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
@@ -99,7 +99,7 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 
 		pfn = swp_offset(entry);
-	} else if (is_swap_pte(*pvmw->pte)) {
+	} else if (pte_has_swap_entry(*pvmw->pte)) {
 		swp_entry_t entry;
 
 		/* Handle un-addressable ZONE_DEVICE memory */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1e07d1c776f2..4993b4454c13 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1951,7 +1951,7 @@  static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	si = swap_info[type];
 	pte = pte_offset_map(pmd, addr);
 	do {
-		if (!is_swap_pte(*pte))
+		if (!pte_has_swap_entry(*pte))
 			continue;
 
 		entry = pte_to_swp_entry(*pte);