
[v2,1/2] mm/migrate: Fix read-only page got writable when recover pte

Message ID 20221110203132.1498183-2-peterx@redhat.com (mailing list archive)
State New
Series mm/migrate: Fix writable pte for read migration entry

Commit Message

Peter Xu Nov. 10, 2022, 8:31 p.m. UTC
Ives van Hoorne from codesandbox.io reported an issue regarding possible
data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
symptom is that some pages, when read, show data mismatches in the snapshot
child VMs.

I can also reproduce it here with a Rust reproducer provided by Ives that
keeps taking snapshots of a 256MB VM; on a 32G system, when I initiate 80
instances I can trigger the issue within ten minutes.

It turns out that writes to some pages can go through even when uffd-wp is
applied to the pte.

The problem is that, when removing migration entries, we didn't really
worry about the write bit as long as we knew it wasn't a write migration
entry.  That assumption may not hold: for some memory types (e.g. writable
shmem) mk_pte() can return a pte with the write bit set.  To recover the
migration entry to its original state we need to explicitly wr-protect the
pte, or it'll have the write bit set even though it was a read migration
entry.

For uffd it can cause writes to go through.  I didn't verify it, but I
think it'll be the same for mprotect()ed pages, where after migration we
can miss the sigbus instead.
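
To illustrate, here is a minimal sketch of the pre-fix logic in
remove_migration_pte() (paraphrased from mm/migrate.c around the hunk in
this patch; variable names as in the real code, surrounding cases trimmed):

	/* build the pte from the vma default protection -- for a
	 * writable shmem mapping this can already carry the write bit */
	pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
	...
	if (is_writable_migration_entry(entry))
		pte = maybe_mkwrite(pte, vma);
	else if (pte_swp_uffd_wp(*pvmw.pte))
		/* any write bit that mk_pte() set is left in place here */
		pte = pte_mkuffd_wp(pte);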

The relevant uffd code was introduced with the anon support, in commit
f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
2020-04-07).  However anon shouldn't suffer from this problem because anon
ptes should always have the write bit cleared already, so that may not be a
proper Fixes target.  To satisfy the backport need, I'm attaching the
Fixes tag to the uffd-wp shmem support.  Since no one had issues with
mprotect, I assume that's also the kernel version we should start
backporting to stable, and we shouldn't need to worry about anything before
that.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@vger.kernel.org
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/migrate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Nadav Amit Nov. 10, 2022, 9:28 p.m. UTC | #1
On Nov 10, 2022, at 12:31 PM, Peter Xu <peterx@redhat.com> wrote:

> Ives van Hoorne from codesandbox.io reported an issue regarding possible
> data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> sympton is some read page got data mismatch from the snapshot child VMs.

symptom (if you care)

Other than that LGTM
Ives van Hoorne Nov. 10, 2022, 9:53 p.m. UTC | #2
I verified these latest changes with my initial test case, and can confirm
that I haven't been able to reproduce the problem anymore! Usually I would
be able to reproduce the error after ~1-5 runs, given that I would fill the
page cache before the run; now I have run the same test case 50+ times and
I haven't seen the error anymore.

Thanks again Peter for taking the time to find the issue and coming up with
a fix!


On Thu, Nov 10, 2022 at 21:31:31, Peter Xu <peterx@redhat.com> wrote:

> Ives van Hoorne from codesandbox.io reported an issue regarding possible
> data loss of uffd-wp when applied to memfds on heavily loaded systems. The
> sympton is some read page got data mismatch from the snapshot child VMs.
>
> Here I can also reproduce with a Rust reproducer that was provided by Ives
> that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> 80 instances I can trigger the issues in ten minutes.
>
> It turns out that we got some pages write-through even if uffd-wp is
> applied to the pte.
>
> The problem is, when removing migration entries, we didn't really worry
> about write bit as long as we know it's not a write migration entry. That
> may not be true, for some memory types (e.g. writable shmem) mk_pte can
> return a pte with write bit set, then to recover the migration entry to its
> original state we need to explicit wr-protect the pte or it'll has the
> write bit set if it's a read migration entry.
>
> For uffd it can cause write-through. I didn't verify, but I think it'll be
> the same for mprotect()ed pages and after migration we can miss the sigbus
> instead.
>
> The relevant code on uffd was introduced in the anon support, which is
> commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> 2020-04-07). However anon shouldn't suffer from this problem because anon
> should already have the write bit cleared always, so that may not be a
> proper Fixes target. To satisfy the need on the backport, I'm attaching the
> Fixes tag to the uffd-wp shmem support. Since no one had issue with
> mprotect, so I assume that's also the kernel version we should start to
> backport for stable, and we shouldn't need to worry before that.
>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
> Reported-by: Ives van Hoorne <ives@codesandbox.io>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/migrate.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index dff333593a8a..8b6351c08c78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
>  			pte = pte_mkdirty(pte);
>  		if (is_writable_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
> -		else if (pte_swp_uffd_wp(*pvmw.pte))
> +		else
> +			/* NOTE: mk_pte can have write bit set */
> +			pte = pte_wrprotect(pte);
> +
> +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> +			WARN_ON_ONCE(pte_write(pte));
>  			pte = pte_mkuffd_wp(pte);
> +		}
>  
>  		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
>  			rmap_flags |= RMAP_EXCLUSIVE;
> --
> 2.37.3
>
Peter Xu Nov. 10, 2022, 10:08 p.m. UTC | #3
On Thu, Nov 10, 2022 at 01:53:25PM -0800, Ives van Hoorne wrote:
> I verified these latest changes with my initial test case, and can confirm
> that I haven't been able to reproduce the problem anymore! Usually, I would
> be able to reproduce the error after ~1-5 runs, given that I would fill the
> page cache before the run, now I have been running the same test case for
> 50+ times, and I haven't seen the error anymore.
> 
> Thanks again Peter for taking the time to find the issue and coming up with
> a fix!

Thanks for the quick feedback, Ives!
Peter Xu Nov. 10, 2022, 10:09 p.m. UTC | #4
On Thu, Nov 10, 2022 at 01:28:31PM -0800, Nadav Amit wrote:
> On Nov 10, 2022, at 12:31 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Ives van Hoorne from codesandbox.io reported an issue regarding possible
> > data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> > sympton is some read page got data mismatch from the snapshot child VMs.
> 
> symptom (if you care)

Yes, but,

> 
> Other than that LGTM

this definitely matters more. :)  Thanks for taking a look, Nadav.
Alistair Popple Nov. 10, 2022, 11:42 p.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:

> Ives van Hoorne from codesandbox.io reported an issue regarding possible
> data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> sympton is some read page got data mismatch from the snapshot child VMs.
>
> Here I can also reproduce with a Rust reproducer that was provided by Ives
> that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> 80 instances I can trigger the issues in ten minutes.
>
> It turns out that we got some pages write-through even if uffd-wp is
> applied to the pte.
>
> The problem is, when removing migration entries, we didn't really worry
> about write bit as long as we know it's not a write migration entry.  That
> may not be true, for some memory types (e.g. writable shmem) mk_pte can
> return a pte with write bit set, then to recover the migration entry to its
> original state we need to explicit wr-protect the pte or it'll has the
> write bit set if it's a read migration entry.
>
> For uffd it can cause write-through.  I didn't verify, but I think it'll be
> the same for mprotect()ed pages and after migration we can miss the sigbus
> instead.
>
> The relevant code on uffd was introduced in the anon support, which is
> commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> 2020-04-07).  However anon shouldn't suffer from this problem because anon
> should already have the write bit cleared always, so that may not be a
> proper Fixes target.  To satisfy the need on the backport, I'm attaching
> the Fixes tag to the uffd-wp shmem support.  Since no one had issue with
> mprotect, so I assume that's also the kernel version we should start to
> backport for stable, and we shouldn't need to worry before that.

Hi Peter, for the patch feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

I did wonder if this should be backported further for migrate_vma as
well given that a migration failure there might lead a shmem read-only
PTE to become read-write. I couldn't think of an obvious reason why that
would cause an actual problem though.

I think folio_mkclean() will wrprotect the pte for writeback to swap,
but it holds the page lock, which prevents migrate_vma from installing
migration entries in the first place.

I suppose there is a small window there because migrate_vma will unlock
the page before removing the migration entries. So to be safe we could
consider going back to 8763cb45ab96 ("mm/migrate: new memory migration
helper for use with device memory") but I doubt in practice it's a real
problem.

> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
> Reported-by: Ives van Hoorne <ives@codesandbox.io>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/migrate.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index dff333593a8a..8b6351c08c78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
>  			pte = pte_mkdirty(pte);
>  		if (is_writable_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
> -		else if (pte_swp_uffd_wp(*pvmw.pte))
> +		else
> +			/* NOTE: mk_pte can have write bit set */
> +			pte = pte_wrprotect(pte);
> +
> +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> +			WARN_ON_ONCE(pte_write(pte));
>  			pte = pte_mkuffd_wp(pte);
> +		}
>  
>  		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
>  			rmap_flags |= RMAP_EXCLUSIVE;
Peter Xu Nov. 13, 2022, 11:56 p.m. UTC | #6
On Fri, Nov 11, 2022 at 10:42:13AM +1100, Alistair Popple wrote:
> Hi Peter, for the patch feel free to add:
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>

Will do, thanks.

> 
> I did wonder if this should be backported further for migrate_vma as
> well given that a migration failure there might lead a shmem read-only
> PTE to become read-write. I couldn't think of an obvious reason why that
> would cause an actual problem though.
> 
> I think folio_mkclean() will wrprotect the pte for writeback to swap,
> but it holds the page lock which prevents migrate_vma installing
> migration entries in the first place.
> 
> I suppose there is a small window there because migrate_vma will unlock
> the page before removing the migration entries. So to be safe we could
> consider going back to 8763cb45ab96 ("mm/migrate: new memory migration
> helper for use with device memory") but I doubt in practice it's a real
> problem.

IIRC the migrate_vma API only supports anonymous memory, so it's not
affected by this issue?

One thing this reminded me of is that I thought mprotect could be affected,
but I think it's actually not, because mprotect is vma-based, and that
should always be fine with the current mk_pte().  I'll remove the paragraph
on mprotect from the commit message; it could be slightly misleading.
Alistair Popple Nov. 14, 2022, 6:22 a.m. UTC | #7
Peter Xu <peterx@redhat.com> writes:

> On Fri, Nov 11, 2022 at 10:42:13AM +1100, Alistair Popple wrote:
>> Hi Peter, for the patch feel free to add:
>> 
>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>
> Will do, thanks.
>
>> 
>> I did wonder if this should be backported further for migrate_vma as
>> well given that a migration failure there might lead a shmem read-only
>> PTE to become read-write. I couldn't think of an obvious reason why that
>> would cause an actual problem though.
>> 
>> I think folio_mkclean() will wrprotect the pte for writeback to swap,
>> but it holds the page lock which prevents migrate_vma installing
>> migration entries in the first place.
>> 
>> I suppose there is a small window there because migrate_vma will unlock
>> the page before removing the migration entries. So to be safe we could
>> consider going back to 8763cb45ab96 ("mm/migrate: new memory migration
>> helper for use with device memory") but I doubt in practice it's a real
>> problem.
>
> IIRC migrate_vma API only supports anonymous memory, then it's not
> affected by this issue?

It does only support anonymous memory, but it doesn't actually check
that until after the page mapping has already been replaced with
migration entries.

So I was worried that it could remap a previously read-only page as
read-write, and what interaction that would have with a page that was,
say, swapped out to disk. But I haven't thought of a case where that would
be a problem, and I'm pretty convinced it isn't. It's just that I haven't
had the time to entirely prove that for myself.

> One thing reminded me is I thought mprotect could be affected but I think
> it's actually not, because mprotect is vma-based, and that should always be
> fine with current mk_pte().  I'll remove the paragraph on mprotect in the
> commit message; that could be slightly misleading.
David Hildenbrand Nov. 14, 2022, 4:09 p.m. UTC | #8
On 10.11.22 21:31, Peter Xu wrote:
> Ives van Hoorne from codesandbox.io reported an issue regarding possible
> data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> sympton is some read page got data mismatch from the snapshot child VMs.
> 
> Here I can also reproduce with a Rust reproducer that was provided by Ives
> that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> 80 instances I can trigger the issues in ten minutes.
> 
> It turns out that we got some pages write-through even if uffd-wp is
> applied to the pte.
> 
> The problem is, when removing migration entries, we didn't really worry
> about write bit as long as we know it's not a write migration entry.  That
> may not be true, for some memory types (e.g. writable shmem) mk_pte can
> return a pte with write bit set, then to recover the migration entry to its
> original state we need to explicit wr-protect the pte or it'll has the
> write bit set if it's a read migration entry.
> 
> For uffd it can cause write-through.  I didn't verify, but I think it'll be
> the same for mprotect()ed pages and after migration we can miss the sigbus
> instead.

I don't think so. mprotect() handling relies on vma->vm_page_prot, which 
is supposed to do the right thing. E.g., map the pte protnone without 
VM_READ/VM_WRITE/....

> 
> The relevant code on uffd was introduced in the anon support, which is
> commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> 2020-04-07).  However anon shouldn't suffer from this problem because anon
> should already have the write bit cleared always, so that may not be a
> proper Fixes target.  To satisfy the need on the backport, I'm attaching
> the Fixes tag to the uffd-wp shmem support.  Since no one had issue with
> mprotect, so I assume that's also the kernel version we should start to
> backport for stable, and we shouldn't need to worry before that.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
> Reported-by: Ives van Hoorne <ives@codesandbox.io>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/migrate.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index dff333593a8a..8b6351c08c78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
>   			pte = pte_mkdirty(pte);
>   		if (is_writable_migration_entry(entry))
>   			pte = maybe_mkwrite(pte, vma);
> -		else if (pte_swp_uffd_wp(*pvmw.pte))
> +		else
> +			/* NOTE: mk_pte can have write bit set */
> +			pte = pte_wrprotect(pte);


Any particular reason not to simply glue this to pte_swp_uffd_wp(), 
because only that needs special care:

if (pte_swp_uffd_wp(*pvmw.pte)) {
	pte = pte_wrprotect(pte);
	pte = pte_mkuffd_wp(pte);
}


And that would match what actually should have been done in commit 
f45ec5ff16a7 -- only special-case uffd-wp.

Note that I think there are cases where we have a PTE that was 
!writable, but after migration we can map it writable.


BTW, does unuse_pte() need similar care?

new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
if (pte_swp_uffd_wp(*pte))
	new_pte = pte_mkuffd_wp(new_pte);
set_pte_at(vma->vm_mm, addr, pte, new_pte);
Peter Xu Nov. 14, 2022, 8:09 p.m. UTC | #9
On Mon, Nov 14, 2022 at 05:09:32PM +0100, David Hildenbrand wrote:
> On 10.11.22 21:31, Peter Xu wrote:
> > Ives van Hoorne from codesandbox.io reported an issue regarding possible
> > data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> > sympton is some read page got data mismatch from the snapshot child VMs.
> > 
> > Here I can also reproduce with a Rust reproducer that was provided by Ives
> > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> > 80 instances I can trigger the issues in ten minutes.
> > 
> > It turns out that we got some pages write-through even if uffd-wp is
> > applied to the pte.
> > 
> > The problem is, when removing migration entries, we didn't really worry
> > about write bit as long as we know it's not a write migration entry.  That
> > may not be true, for some memory types (e.g. writable shmem) mk_pte can
> > return a pte with write bit set, then to recover the migration entry to its
> > original state we need to explicit wr-protect the pte or it'll has the
> > write bit set if it's a read migration entry.
> > 
> > For uffd it can cause write-through.  I didn't verify, but I think it'll be
> > the same for mprotect()ed pages and after migration we can miss the sigbus
> > instead.
> 
> I don't think so. mprotect() handling relies on vma->vm_page_prot, which is
> supposed to do the right thing. E.g., map the pte protnone without
> VM_READ/VM_WRITE/....

I've removed that example when I posted v3; feel free to have a look.

> 
> > 
> > The relevant code on uffd was introduced in the anon support, which is
> > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> > 2020-04-07).  However anon shouldn't suffer from this problem because anon
> > should already have the write bit cleared always, so that may not be a
> > proper Fixes target.  To satisfy the need on the backport, I'm attaching
> > the Fixes tag to the uffd-wp shmem support.  Since no one had issue with
> > mprotect, so I assume that's also the kernel version we should start to
> > backport for stable, and we shouldn't need to worry before that.
> > 
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: stable@vger.kernel.org
> > Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
> > Reported-by: Ives van Hoorne <ives@codesandbox.io>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/migrate.c | 8 +++++++-
> >   1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index dff333593a8a..8b6351c08c78 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
> >   			pte = pte_mkdirty(pte);
> >   		if (is_writable_migration_entry(entry))
> >   			pte = maybe_mkwrite(pte, vma);
> > -		else if (pte_swp_uffd_wp(*pvmw.pte))
> > +		else
> > +			/* NOTE: mk_pte can have write bit set */
> > +			pte = pte_wrprotect(pte);
> 
> 
> Any particular reason why not to simply glue this to pte_swp_uffd_wp(),
> because only that needs special care:
> 
> if (pte_swp_uffd_wp(*pvmw.pte)) {
> 	pte = pte_wrprotect(pte);
> 	pte = pte_mkuffd_wp(pte);
> }
> 
> 
> And that would match what actually should have been done in commit
> f45ec5ff16a7 -- only special-case uffd-wp.
> 
> Note that I think there are cases where we have a PTE that was !writable,
> but after migration we can map it writable.

The thing is that recovering the pte into its original form is the safest
approach to me, so I think we need justification on why it's always safe to
set the write bit.

Or do you perhaps have a solid clue and think it's always safe?

> 
> BTW, does unuse_pte() need similar care?
> 
> new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
> if (pte_swp_uffd_wp(*pte))
> 	new_pte = pte_mkuffd_wp(new_pte);
> set_pte_at(vma->vm_mm, addr, pte, new_pte);

I think the unuse path is fine because unuse only applies to private
mappings, so we should always have the W bit removed there within mk_pte().
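
As a hedged illustration of why private mappings are safe here (x86
assumed; vm_get_page_prot() and the CoW protection_map[] entries are the
real mechanism, the WARN is only part of this sketch):

	/* vma->vm_page_prot is derived from the vma flags: */
	pgprot_t prot = vm_get_page_prot(vma->vm_flags);
	/*
	 * Without VM_SHARED, protection_map[] picks a CoW entry with no
	 * write bit (historically PAGE_COPY on x86), so unuse_pte()'s
	 * mk_pte() cannot hand back a writable pte:
	 */
	pte_t new_pte = pte_mkold(mk_pte(page, prot));
	WARN_ON_ONCE(!(vma->vm_flags & VM_SHARED) && pte_write(new_pte));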

Thanks,
David Hildenbrand Nov. 15, 2022, 9:13 a.m. UTC | #10
>>
>> Any particular reason why not to simply glue this to pte_swp_uffd_wp(),
>> because only that needs special care:
>>
>> if (pte_swp_uffd_wp(*pvmw.pte)) {
>> 	pte = pte_wrprotect(pte);
>> 	pte = pte_mkuffd_wp(pte);
>> }
>>
>>
>> And that would match what actually should have been done in commit
>> f45ec5ff16a7 -- only special-case uffd-wp.
>>
>> Note that I think there are cases where we have a PTE that was !writable,
>> but after migration we can map it writable.
> 
> The thing is recovering the pte into its original form is the safest
> approach to me, so I think we need justification on why it's always safe to
> set the write bit.
> 
> Or do you perhaps have solid clue and think it's always safe?

The problem I am having with this broader change is that it changes 
something independent of your original patch/problem.

If we identify this to be an actual problem, it should most probably be 
separate fix + backport.


My understanding is that vma->vm_page_prot always tells you what the 
default PTE protection in a mapping is.

If the mapping is private, it is never writable (due to COW). Similarly, 
if a shared file mapping needs writenotify, it is never writable.


I consider UFFD-wp a special case: while the default VMA protection 
might state that it is writable, you actually want individual PTEs to be 
write-protected and have to manually remove the protection.

softdirty tracking is another special case: however, softdirty tracking 
is enabled for the whole VMA. For remove_migration_pte() that should be 
fine (I guess) because writenotify is active when the VMA needs to track 
softdirty bits, and consequently vma->vm_page_prot has the proper 
default permissions.


I wonder if the following (valid) scenario, for example, is possible:


1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect()s the pte.
   -> writenotify is active and vma->vm_page_prot is updated accordingly

2) VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is 
   updated accordingly. See mmap_region() where we set VM_SOFTDIRTY.

If you now migrate the (still write-protected in the PTE) page, it was 
not writable, but it can be writable on the destination.
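
A hedged sketch of how step 1) above is driven from userspace (the
clear_refs interface per Documentation/admin-guide/mm/soft-dirty.rst;
error handling trimmed):

	#include <fcntl.h>
	#include <unistd.h>

	/* Clear soft-dirty bits for the whole process: the kernel
	 * pte_wrprotect()s the ptes and clears VM_SOFTDIRTY, making
	 * writenotify active for the vmas. */
	static void clear_soft_dirty(void)
	{
		int fd = open("/proc/self/clear_refs", O_WRONLY);

		if (fd >= 0) {
			write(fd, "4", 1);	/* "4" selects soft-dirty */
			close(fd);
		}
	}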

> 
>>
>> BTW, does unuse_pte() need similar care?
>>
>> new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
>> if (pte_swp_uffd_wp(*pte))
>> 	new_pte = pte_mkuffd_wp(new_pte);
>> set_pte_at(vma->vm_mm, addr, pte, new_pte);
> 
> I think unuse path is fine because unuse only applies to private mappings,
> so we should always have the W bit removed there within mk_pte().

You're right; however, shmem swapping confuses me. Maybe that does not 
apply here.
Peter Xu Nov. 15, 2022, 4:08 p.m. UTC | #11
On Tue, Nov 15, 2022 at 10:13:24AM +0100, David Hildenbrand wrote:
> > > 
> > > Any particular reason why not to simply glue this to pte_swp_uffd_wp(),
> > > because only that needs special care:
> > > 
> > > if (pte_swp_uffd_wp(*pvmw.pte)) {
> > > 	pte = pte_wrprotect(pte);
> > > 	pte = pte_mkuffd_wp(pte);
> > > }
> > > 
> > > 
> > > And that would match what actually should have been done in commit
> > > f45ec5ff16a7 -- only special-case uffd-wp.
> > > 
> > > Note that I think there are cases where we have a PTE that was !writable,
> > > but after migration we can map it writable.
> > 
> > The thing is recovering the pte into its original form is the safest
> > approach to me, so I think we need justification on why it's always safe to
> > set the write bit.
> > 
> > Or do you perhaps have solid clue and think it's always safe
> The problem I am having with this broader change, is that this changes
> something independent of your original patch/problem.
> 
> If we identify this to be an actual problem, it should most probably be
> separate fix + backport.
> 
> 
> My understanding is that vma->vm_page_prot always tells you what the default
> PTE protection in a mapping is.
> 
> If the mapping is private, it is never writable (due to COW). Similarly, if
> the shared file mapping needs writenotify, it is never writable.

Right.

> 
> 
> I consider UFFD-wp a special case: while the default VMA protection might
> state that it is writable, you actually want individual PTEs to be
> write-protected and have to manually remove the protection.
> 
> softdirty tracking is another special case: however, softdirty tracking is
> enabled for the whole VMA. For remove_migration_pte() that should be fine (I
> guess) because writenotify is active when the VMA needs to track softdirty
> bits, and consequently vma->vm_page_prot has the proper default permissions.
> 
> 
> I wonder if the following (valid), for example is possible:
> 
> 
> 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte.
> -> writenotify is active and vma->vm_page_prot updated accordingly
> 
> VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated
> accordingly. See mmap_region() where we set VM_SOFTDIRTY.
> 
> If you now migrate the (still write-protected in the PTE) page, it was not
> writable, but it can be writable on the destination.

I didn't even notice merging could work with soft-dirty enabled; that's
interesting to know.

Yes, I think it's possible and I agree it's safe, as VM_SOFTDIRTY is set
for the merged vma, so afaiu the write bit is safe to set.  We get a bunch
of false positives but that's how soft-dirty works.

I think the whole problem is easier if we see this at a higher level.
You're discussing this from the vma pov and it's fair to do so; at least I
agree with what you mentioned so far, and I can't see anything outside
uffd-wp that can be affected.  However, as you noticed, we already have
quite a few paragraphs trying to discuss the safety of this and that;
that's the part where I think we need justification, and it's not that
"natural".

For "natural", I meant fundamentally we're talking about page migrations
here.  The natural way (at least to me) for page migration to happen as a
fundamental rule is that, we leverag the migration pte to make sure the pte
being stable so we can do the migration work, then we "recover" the pte to
present either by a full recovery or just (hopefully) only replace the pfn,
keeping all the rest untouched.

One thing to prove that is that we have two migration entries, not one (I'm
temporarily putting the exclusive read one aside since that's solving a
different problem): MIGRATION_READ and MIGRATION_WRITE.  If we only relied
on vma flags, logically we wouldn't need MIGRATION_READ and
MIGRATION_WRITE; we'd only need a generic MIGRATION swap pte, then we'd
recover the write bit from the vma flags and other things (like uffd-wp;
currently we have the bit set in the swap pte besides the swap entry type).

So maybe one day we can use two migration types rather than three
(MIGRATION and MIGRATION_EXCLUSIVE)?  I can't tell, but hopefully that
shows what I meant: we need further justification to grant the write bit
based only on the vma, rather than recovering the write bit based on the
migration entry type.
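
For reference, a sketch of how the entry type encodes the write bit when
the migration entry is installed (paraphrased from try_to_migrate_one() in
mm/rmap.c as of kernels around this series; details trimmed):

	if (pte_write(pteval))
		entry = make_writable_migration_entry(page_to_pfn(subpage));
	else if (anon_exclusive)
		entry = make_readable_exclusive_migration_entry(page_to_pfn(subpage));
	else
		entry = make_readable_migration_entry(page_to_pfn(subpage));

	/* On the remove side, the write bit is recovered from the entry
	 * type alone -- which is why the readable flavors must end up
	 * wr-protected when the pte is re-installed. */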

> 
> > 
> > > 
> > > BTW, does unuse_pte() need similar care?
> > > 
> > > new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
> > > if (pte_swp_uffd_wp(*pte))
> > > 	new_pte = pte_mkuffd_wp(new_pte);
> > > set_pte_at(vma->vm_mm, addr, pte, new_pte);
> > 
> > I think unuse path is fine because unuse only applies to private mappings,
> > so we should always have the W bit removed there within mk_pte().
> 
> You're right, however, shmem swapping confuses me. Maybe that does not apply
> here.

Yeah these are confusing.  Actually I think the unuse path should apply to
private mappings on shmem when the page was CoWed already and then got
swapped out, but again, as long as it's a private mapping I think we don't
have the write bit anyway in the retval of mk_pte(), even if it's shmem.
David Hildenbrand Nov. 15, 2022, 5:22 p.m. UTC | #12
>> I consider UFFD-wp a special case: while the default VMA protection might
>> state that it is writable, you actually want individual PTEs to be
>> write-protected and have to manually remove the protection.
>>
>> softdirty tracking is another special case: however, softdirty tracking is
>> enabled for the whole VMA. For remove_migration_pte() that should be fine (I
>> guess) because writenotify is active when the VMA needs to track softdirty
>> bits, and consequently vma->vm_page_prot has the proper default permissions.
>>
>>
>> I wonder if the following (valid), for example is possible:
>>
>>
>> 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte.
>> -> writenotify is active and vma->vm_page_prot updated accordingly
>>
>> VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated
>> accordingly. See mmap_region() where we set VM_SOFTDIRTY.
>>
>> If you now migrate the (still write-protected in the PTE) page, it was not
>> writable, but it can be writable on the destination.
> 
> I didn't even notice merging could work with soft-dirty enabled, that's
> interesting to know.
> 
> Yes I think it's possible and I agree it's safe, as VM_SOFTDIRTY is set for
> the merged vma so afaiu the write bit is safe to set.  We get a bunch of
> false positives but that's how soft-dirty works.
> 
> I think the whole problem is easier if we see this at a higher level.
> You're discussing this from vma pov and it's fair to do so, at least I
> agree with what you mentioned so far and I can't see anything outside
> uffd-wp that can be affected.  However, it is also true when you noticed we
> already have quite a few paragraphs trying to discuss the safety for this
> and that, that's the part where I think we need justification and it's not
> that "natural".
> 
> For "natural", I meant fundamentally we're talking about page migrations
> here.  The natural way (at least to me) for page migration to happen as a
> fundamental rule is that, we leverag the migration pte to make sure the pte
> being stable so we can do the migration work, then we "recover" the pte to
> present either by a full recovery or just (hopefully) only replace the pfn,
> keeping all the rest untouched.
> 
> One thing to prove that is we have two migration entries not one (I'm
> temporarily put the exclusive read one aside since that's solving different
> problems): MIGRATION_READ and MIGRATION_WRITE.  If we only rely on vma
> flags logically we don't need MIGRATION_READ and MIGRATION_WRITE, we only
> need MIGRATION generic swap pte then we recover the write bit from vma
> flags and other things (like uffd-wp, currently we have the bit set in swap
> pte besides the swap entry type).
> 
> So maybe one day we can use two migration types rather than three
> (MIGRATION and MIGRATION_EXCLUSIVE)?  I can't tell, but hopefully that
> shows what I meant, that we need further justification to grant write bit
> only base on vma, rather than recovering write bit based on migration entry
> type.

That's precisely what I had in mind recently, and I am happy to hear 
that you have a similar idea:

https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com

"
Note that we don't optimize for the actual migration case:
(1) When migration succeeds the new PTE will not be writable because the
     source PTE was not writable (protnone); in the future we
     might just optimize that case similarly by reusing
     can_change_pte_writable()/can_change_pmd_writable() when removing
     migration PTEs.
"

Currently, "readable_migration_entry" is even wrong: it might be 
PROT_NONE and not even readable.
David Hildenbrand Nov. 15, 2022, 5:54 p.m. UTC | #13
On 15.11.22 18:22, David Hildenbrand wrote:
>>> I consider UFFD-wp a special case: while the default VMA protection might
>>> state that it is writable, you actually want individual PTEs to be
>>> write-protected and have to manually remove the protection.
>>>
>>> softdirty tracking is another special case: however, softdirty tracking is
>>> enabled for the whole VMA. For remove_migration_pte() that should be fine (I
>>> guess) because writenotify is active when the VMA needs to track softdirty
>>> bits, and consequently vma->vm_page_prot has the proper default permissions.
>>>
>>>
>>> I wonder if the following (valid), for example is possible:
>>>
>>>
>>> 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte.
>>> -> writenotify is active and vma->vm_page_prot updated accordingly
>>>
>>> VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated
>>> accordingly. See mmap_region() where we set VM_SOFTDIRTY.
>>>
>>> If you now migrate the (still write-protected in the PTE) page, it was not
>>> writable, but it can be writable on the destination.
>>
>> I didn't even notice merging could work with soft-dirty enabled, that's
>> interesting to know.
>>
>> Yes I think it's possible and I agree it's safe, as VM_SOFTDIRTY is set for
>> the merged vma so afaiu the write bit is safe to set.  We get a bunch of
>> false positives but that's how soft-dirty works.
>>
>> I think the whole problem is easier if we see this at a higher level.
>> You're discussing this from vma pov and it's fair to do so, at least I
>> agree with what you mentioned so far and I can't see anything outside
>> uffd-wp that can be affected.  However, it is also true when you noticed we
>> already have quite a few paragraphs trying to discuss the safety for this
>> and that, that's the part where I think we need justification and it's not
>> that "natural".

Forgot to reply to that part:

No it isn't natural. But sneaking such a change into your fix seems 
wrong. Touching !uffd-wp code should be separate, if we want to do this 
at all (as we discussed, maybe the better/cleaner approach is to 
eliminate writable migration entries if possible).
Peter Xu Nov. 15, 2022, 6:03 p.m. UTC | #14
On Tue, Nov 15, 2022 at 06:22:03PM +0100, David Hildenbrand wrote:
> That's precisely what I had in mind recently, and I am happy to hear that
> you have similar idea:
> 
> https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
> 
> "
> Note that we don't optimize for the actual migration case:
> (1) When migration succeeds the new PTE will not be writable because the
>     source PTE was not writable (protnone); in the future we
>     might just optimize that case similarly by reusing
>     can_change_pte_writable()/can_change_pmd_writable() when removing
>     migration PTEs.
> "

I see, sorry I haven't yet read it, but sounds doable indeed.

> 
> Currently, "readable_migration_entry" is even wrong: it might be PROT_NONE
> and not even readable.

Do you mean mprotect(PROT_NONE)?

If we read "read migration entry" as "migration entry with no write bit",
it seems still fine, and code-wise after the pte is recovered it should
still be PROT_NONE iiuc, because mk_pte() will just make a pte without
e.g. the _PRESENT bit set on x86, while it'll have the _PROT_NONE bit.
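
To make that x86 detail concrete, here is a sketch mirroring pte_protnone()
from arch/x86/include/asm/pgtable.h (illustrative only):

	static inline int pte_protnone(pte_t pte)
	{
		/* _PAGE_PROTNONE set with _PAGE_PRESENT clear: the pte
		 * faults on any access but still carries the pfn. */
		return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
			== _PAGE_PROTNONE;
	}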

That may not hold true for numa balancing though: when migration happens
after a numa hint is applied to a pte, it seems to me it's prone to losing
the hint after migration completes (assuming this migration is not the numa
balancing operation itself caused by a page access).  It doesn't sound like
a severe issue though, even if I didn't miss something, since if the page
got moved around the original hint may need to be reconsidered anyway.
David Hildenbrand Nov. 15, 2022, 6:08 p.m. UTC | #15
On 15.11.22 19:03, Peter Xu wrote:
> On Tue, Nov 15, 2022 at 06:22:03PM +0100, David Hildenbrand wrote:
>> That's precisely what I had in mind recently, and I am happy to hear that
>> you have similar idea:
>>
>> https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
>>
>> "
>> Note that we don't optimize for the actual migration case:
>> (1) When migration succeeds the new PTE will not be writable because the
>>      source PTE was not writable (protnone); in the future we
>>      might just optimize that case similarly by reusing
>>      can_change_pte_writable()/can_change_pmd_writable() when removing
>>      migration PTEs.
>> "
> 
> I see, sorry I haven't yet read it, but sounds doable indeed.
> 
>>
>> Currently, "readable_migration_entry" is even wrong: it might be PROT_NONE
>> and not even readable.
> 
> Do you mean mprotect(PROT_NONE)?
> 
> If we read the "read migration entry" as "migration entry with no write
> bit", it seems still fine, and code-wise after pte recovered it should
> still be PROT_NONE iiuc because mk_pte() will just make a pte without
> e.g. _PRESENT bit set on x86 while it'll have the _PROT_NONE bit.

Exactly, that's the unintuitive interpretation of 
"readable_migration_entry". By "wrong" I meant: the naming is wrong.

> 
> May not keep true for numa balancing though: when migration happens after a
> numa hint applied to a pte, it seems to me it's prone to lose the hint
> after migration completes (assuming this migration is not the numa
> balancing operation itself caused by a page access).  Doesn't sound like a
> severe issue though even if I didn't miss something, since if the page got
> moved around the original hint may need to reconsider anyway.

Yes, I think any migration will lose fake PROT_NONE. "Fake" as in "not 
VMA permissions" but "additional permissions imposed by NUMA hinting 
faults."
Peter Xu Nov. 15, 2022, 6:11 p.m. UTC | #16
On Tue, Nov 15, 2022 at 06:54:09PM +0100, David Hildenbrand wrote:
> No it isn't natural. But sneaking such a change into your fix seems wrong.
> Touching !uffd-wp code should be separate, if we want to do this at all (as
> we discussed, maybe the better/cleaner approach is to eliminate writable
> migration entries if possible).

As you can see, I made the subject mm/migrate because I took it as a
migration change, not something uffd-wp specific.  I also added the Fixes
tag to the uffd patch because afaict that's the only thing it affects, but
I can't tell for sure either.

As I said, I'm trying to follow the rule that where we have a read entry,
I want to make sure the write bit is removed.

I hope we can have this patch land soon, because it affects users.  I'm
not against further rework of the migration entries, as I replied in the
other email, but for this patch I keep the same thoughts as when I posted
it.

Thanks,
David Hildenbrand Nov. 15, 2022, 6:16 p.m. UTC | #17
On 15.11.22 19:11, Peter Xu wrote:
> On Tue, Nov 15, 2022 at 06:54:09PM +0100, David Hildenbrand wrote:
>> No it isn't natural. But sneaking such a change into your fix seems wrong.
>> Touching !uffd-wp code should be separate, if we want to do this at all (as
>> we discussed, maybe the better/cleaner approach is to eliminate writable
>> migration entries if possible).
> 
> If you see I made the subject as mm/migrate, because I took it as a
> migration change, not uffd-wp specific.  I also added the Fixes to the uffd
> patch because afaict that's the only thing that it affects, but I can't
> tell either.
> 
> As I said, I'm trying to follow the rule where we have read entry I want to
> make sure write bit removed.

I really don't think we should do that generic change.

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index dff333593a8a..8b6351c08c78 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -213,8 +213,14 @@  static bool remove_migration_pte(struct folio *folio,
 			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
-		else if (pte_swp_uffd_wp(*pvmw.pte))
+		else
+			/* NOTE: mk_pte can have write bit set */
+			pte = pte_wrprotect(pte);
+
+		if (pte_swp_uffd_wp(*pvmw.pte)) {
+			WARN_ON_ONCE(pte_write(pte));
 			pte = pte_mkuffd_wp(pte);
+		}
 
 		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
 			rmap_flags |= RMAP_EXCLUSIVE;