Message ID | 20221110203132.1498183-2-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/migrate: Fix writable pte for read migration entry | expand |
On Nov 10, 2022, at 12:31 PM, Peter Xu <peterx@redhat.com> wrote: > Ives van Hoorne from codesandbox.io reported an issue regarding possible > data loss of uffd-wp when applied to memfds on heavily loaded systems. The > sympton is some read page got data mismatch from the snapshot child VMs. symptom (if you care) Other than that LGTM
I verified these latest changes with my initial test case, and can confirm that I haven't been able to reproduce the problem anymore! Usually, I would be able to reproduce the error after ~1-5 runs, given that I would fill the page cache before the run, now I have been running the same test case for 50+ times, and I haven't seen the error anymore. Thanks again Peter for taking the time to find the issue and coming up with a fix! On Thu, Nov 10, 2022 at 21:31:31, Peter Xu <peterx@redhat.com> wrote: > Ives van Hoorne from codesandbox.io reported an issue regarding possible > data loss of uffd-wp when applied to memfds on heavily loaded systems. The > sympton is some read page got data mismatch from the snapshot child VMs. > > Here I can also reproduce with a Rust reproducer that was provided by Ives > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate > 80 instances I can trigger the issues in ten minutes. > > It turns out that we got some pages write-through even if uffd-wp is > applied to the pte. > > The problem is, when removing migration entries, we didn't really worry > about write bit as long as we know it's not a write migration entry. That > may not be true, for some memory types (e.g. writable shmem) mk_pte can > return a pte with write bit set, then to recover the migration entry to its > original state we need to explicit wr-protect the pte or it'll has the > write bit set if it's a read migration entry. > > For uffd it can cause write-through. I didn't verify, but I think it'll be > the same for mprotect()ed pages and after migration we can miss the sigbus > instead. > > The relevant code on uffd was introduced in the anon support, which is > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration", > 2020-04-07). However anon shouldn't suffer from this problem because anon > should already have the write bit cleared always, so that may not be a > proper Fixes target. To satisfy the need on the backport, I'm attaching the > Fixes tag to the uffd-wp shmem support. Since no one had issue with > mprotect, so I assume that's also the kernel version we should start to > backport for stable, and we shouldn't need to worry before that. > > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: stable@vger.kernel.org > Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & > hugetlbfs") Reported-by: Ives van Hoorne <ives@codesandbox.io> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/migrate.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index dff333593a8a..8b6351c08c78 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio, > pte = pte_mkdirty(pte); > if (is_writable_migration_entry(entry)) > pte = maybe_mkwrite(pte, vma); > - else if (pte_swp_uffd_wp(*pvmw.pte)) > + else > + /* NOTE: mk_pte can have write bit set */ > + pte = pte_wrprotect(pte); > + > + if (pte_swp_uffd_wp(*pvmw.pte)) { > + WARN_ON_ONCE(pte_write(pte)); > pte = pte_mkuffd_wp(pte); > + } > > if (folio_test_anon(folio) && !is_readable_migration_entry(entry)) > rmap_flags |= RMAP_EXCLUSIVE; > -- > 2.37.3 >
On Thu, Nov 10, 2022 at 01:53:25PM -0800, Ives van Hoorne wrote: > I verified these latest changes with my initial test case, and can confirm > that I haven't been able to reproduce the problem anymore! Usually, I would > be able to reproduce the error after ~1-5 runs, given that I would fill the > page cache before the run, now I have been running the same test case for > 50+ times, and I haven't seen the error anymore. > > Thanks again Peter for taking the time to find the issue and coming up with > a fix! Thanks for the quick feedback, Ives!
On Thu, Nov 10, 2022 at 01:28:31PM -0800, Nadav Amit wrote: > On Nov 10, 2022, at 12:31 PM, Peter Xu <peterx@redhat.com> wrote: > > > Ives van Hoorne from codesandbox.io reported an issue regarding possible > > data loss of uffd-wp when applied to memfds on heavily loaded systems. The > > sympton is some read page got data mismatch from the snapshot child VMs. > > symptom (if you care) Yes, but, > > Other than that LGTM this definitely matters more. :) Thanks for taking a look, Nadav.
Peter Xu <peterx@redhat.com> writes: > Ives van Hoorne from codesandbox.io reported an issue regarding possible > data loss of uffd-wp when applied to memfds on heavily loaded systems. The > sympton is some read page got data mismatch from the snapshot child VMs. > > Here I can also reproduce with a Rust reproducer that was provided by Ives > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate > 80 instances I can trigger the issues in ten minutes. > > It turns out that we got some pages write-through even if uffd-wp is > applied to the pte. > > The problem is, when removing migration entries, we didn't really worry > about write bit as long as we know it's not a write migration entry. That > may not be true, for some memory types (e.g. writable shmem) mk_pte can > return a pte with write bit set, then to recover the migration entry to its > original state we need to explicit wr-protect the pte or it'll has the > write bit set if it's a read migration entry. > > For uffd it can cause write-through. I didn't verify, but I think it'll be > the same for mprotect()ed pages and after migration we can miss the sigbus > instead. > > The relevant code on uffd was introduced in the anon support, which is > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration", > 2020-04-07). However anon shouldn't suffer from this problem because anon > should already have the write bit cleared always, so that may not be a > proper Fixes target. To satisfy the need on the backport, I'm attaching > the Fixes tag to the uffd-wp shmem support. Since no one had issue with > mprotect, so I assume that's also the kernel version we should start to > backport for stable, and we shouldn't need to worry before that. Hi Peter, for the patch feel free to add: Reviewed-by: Alistair Popple <apopple@nvidia.com> I did wonder if this should be backported further for migrate_vma as well given that a migration failure there might lead a shmem read-only PTE to become read-write. I couldn't think of an obvious reason why that would cause an actual problem though. I think folio_mkclean() will wrprotect the pte for writeback to swap, but it holds the page lock which prevents migrate_vma installing migration entries in the first place. I suppose there is a small window there because migrate_vma will unlock the page before removing the migration entries. So to be safe we could consider going back to 8763cb45ab96 ("mm/migrate: new memory migration helper for use with device memory") but I doubt in practice it's a real problem. > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: stable@vger.kernel.org > Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") > Reported-by: Ives van Hoorne <ives@codesandbox.io> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/migrate.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index dff333593a8a..8b6351c08c78 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio, > pte = pte_mkdirty(pte); > if (is_writable_migration_entry(entry)) > pte = maybe_mkwrite(pte, vma); > - else if (pte_swp_uffd_wp(*pvmw.pte)) > + else > + /* NOTE: mk_pte can have write bit set */ > + pte = pte_wrprotect(pte); > + > + if (pte_swp_uffd_wp(*pvmw.pte)) { > + WARN_ON_ONCE(pte_write(pte)); > pte = pte_mkuffd_wp(pte); > + } > > if (folio_test_anon(folio) && !is_readable_migration_entry(entry)) > rmap_flags |= RMAP_EXCLUSIVE;
On Fri, Nov 11, 2022 at 10:42:13AM +1100, Alistair Popple wrote: > Hi Peter, for the patch feel free to add: > > Reviewed-by: Alistair Popple <apopple@nvidia.com> Will do, thanks. > > I did wonder if this should be backported further for migrate_vma as > well given that a migration failure there might lead a shmem read-only > PTE to become read-write. I couldn't think of an obvious reason why that > would cause an actual problem though. > > I think folio_mkclean() will wrprotect the pte for writeback to swap, > but it holds the page lock which prevents migrate_vma installing > migration entries in the first place. > > I suppose there is a small window there because migrate_vma will unlock > the page before removing the migration entries. So to be safe we could > consider going back to 8763cb45ab96 ("mm/migrate: new memory migration > helper for use with device memory") but I doubt in practice it's a real > problem. IIRC migrate_vma API only supports anonymous memory, then it's not affected by this issue? One thing reminded me is I thought mprotect could be affected but I think it's actually not, because mprotect is vma-based, and that should always be fine with current mk_pte(). I'll remove the paragraph on mprotect in the commit message; that could be slightly misleading.
Peter Xu <peterx@redhat.com> writes: > On Fri, Nov 11, 2022 at 10:42:13AM +1100, Alistair Popple wrote: >> Hi Peter, for the patch feel free to add: >> >> Reviewed-by: Alistair Popple <apopple@nvidia.com> > > Will do, thanks. > >> >> I did wonder if this should be backported further for migrate_vma as >> well given that a migration failure there might lead a shmem read-only >> PTE to become read-write. I couldn't think of an obvious reason why that >> would cause an actual problem though. >> >> I think folio_mkclean() will wrprotect the pte for writeback to swap, >> but it holds the page lock which prevents migrate_vma installing >> migration entries in the first place. >> >> I suppose there is a small window there because migrate_vma will unlock >> the page before removing the migration entries. So to be safe we could >> consider going back to 8763cb45ab96 ("mm/migrate: new memory migration >> helper for use with device memory") but I doubt in practice it's a real >> problem. > > IIRC migrate_vma API only supports anonymous memory, then it's not > affected by this issue? It does only support anonymous memory, but it doesn't actually check that until after the page mapping has already been replaced with migration entries. So I was worried that could remap a previously read-only page as read-write and what interaction that would have with a page that was swapped out to disk say. But I haven't tought of a case where that would be a problem, and I'm pretty convinced it isn't. It's just I haven't had the time to entirely prove that for myself. > One thing reminded me is I thought mprotect could be affected but I think > it's actually not, because mprotect is vma-based, and that should always be > fine with current mk_pte(). I'll remove the paragraph on mprotect in the > commit message; that could be slightly misleading.
On 10.11.22 21:31, Peter Xu wrote: > Ives van Hoorne from codesandbox.io reported an issue regarding possible > data loss of uffd-wp when applied to memfds on heavily loaded systems. The > sympton is some read page got data mismatch from the snapshot child VMs. > > Here I can also reproduce with a Rust reproducer that was provided by Ives > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate > 80 instances I can trigger the issues in ten minutes. > > It turns out that we got some pages write-through even if uffd-wp is > applied to the pte. > > The problem is, when removing migration entries, we didn't really worry > about write bit as long as we know it's not a write migration entry. That > may not be true, for some memory types (e.g. writable shmem) mk_pte can > return a pte with write bit set, then to recover the migration entry to its > original state we need to explicit wr-protect the pte or it'll has the > write bit set if it's a read migration entry. > > For uffd it can cause write-through. I didn't verify, but I think it'll be > the same for mprotect()ed pages and after migration we can miss the sigbus > instead. I don't think so. mprotect() handling relies on vma->vm_page_prot, which is supposed to do the right thing. E.g., map the pte protnone without VM_READ/VM_WRITE/.... > > The relevant code on uffd was introduced in the anon support, which is > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration", > 2020-04-07). However anon shouldn't suffer from this problem because anon > should already have the write bit cleared always, so that may not be a > proper Fixes target. To satisfy the need on the backport, I'm attaching > the Fixes tag to the uffd-wp shmem support. Since no one had issue with > mprotect, so I assume that's also the kernel version we should start to > backport for stable, and we shouldn't need to worry before that. > > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: stable@vger.kernel.org > Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") > Reported-by: Ives van Hoorne <ives@codesandbox.io> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/migrate.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index dff333593a8a..8b6351c08c78 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio, > pte = pte_mkdirty(pte); > if (is_writable_migration_entry(entry)) > pte = maybe_mkwrite(pte, vma); > - else if (pte_swp_uffd_wp(*pvmw.pte)) > + else > + /* NOTE: mk_pte can have write bit set */ > + pte = pte_wrprotect(pte); Any particular reason why not to simply glue this to pte_swp_uffd_wp(), because only that needs special care: if (pte_swp_uffd_wp(*pvmw.pte)) { pte = pte_wrprotect(pte); pte = pte_mkuffd_wp(pte); } And that would match what actually should have been done in commit f45ec5ff16a7 -- only special-case uffd-wp. Note that I think there are cases where we have a PTE that was !writable, but after migration we can map it writable. BTW, does unuse_pte() need similar care? new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot)); if (pte_swp_uffd_wp(*pte)) new_pte = pte_mkuffd_wp(new_pte); set_pte_at(vma->vm_mm, addr, pte, new_pte);
On Mon, Nov 14, 2022 at 05:09:32PM +0100, David Hildenbrand wrote: > On 10.11.22 21:31, Peter Xu wrote: > > Ives van Hoorne from codesandbox.io reported an issue regarding possible > > data loss of uffd-wp when applied to memfds on heavily loaded systems. The > > sympton is some read page got data mismatch from the snapshot child VMs. > > > > Here I can also reproduce with a Rust reproducer that was provided by Ives > > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate > > 80 instances I can trigger the issues in ten minutes. > > > > It turns out that we got some pages write-through even if uffd-wp is > > applied to the pte. > > > > The problem is, when removing migration entries, we didn't really worry > > about write bit as long as we know it's not a write migration entry. That > > may not be true, for some memory types (e.g. writable shmem) mk_pte can > > return a pte with write bit set, then to recover the migration entry to its > > original state we need to explicit wr-protect the pte or it'll has the > > write bit set if it's a read migration entry. > > > > For uffd it can cause write-through. I didn't verify, but I think it'll be > > the same for mprotect()ed pages and after migration we can miss the sigbus > > instead. > > I don't think so. mprotect() handling relies on vma->vm_page_prot, which is > supposed to do the right thing. E.g., map the pte protnone without > VM_READ/VM_WRITE/.... I've removed that example when I posted v3, feel free to have a look. > > > > > The relevant code on uffd was introduced in the anon support, which is > > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration", > > 2020-04-07). However anon shouldn't suffer from this problem because anon > > should already have the write bit cleared always, so that may not be a > > proper Fixes target. To satisfy the need on the backport, I'm attaching > > the Fixes tag to the uffd-wp shmem support. Since no one had issue with > > mprotect, so I assume that's also the kernel version we should start to > > backport for stable, and we shouldn't need to worry before that. > > > > Cc: Andrea Arcangeli <aarcange@redhat.com> > > Cc: stable@vger.kernel.org > > Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") > > Reported-by: Ives van Hoorne <ives@codesandbox.io> > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > mm/migrate.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index dff333593a8a..8b6351c08c78 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio, > > pte = pte_mkdirty(pte); > > if (is_writable_migration_entry(entry)) > > pte = maybe_mkwrite(pte, vma); > > - else if (pte_swp_uffd_wp(*pvmw.pte)) > > + else > > + /* NOTE: mk_pte can have write bit set */ > > + pte = pte_wrprotect(pte); > > > Any particular reason why not to simply glue this to pte_swp_uffd_wp(), > because only that needs special care: > > if (pte_swp_uffd_wp(*pvmw.pte)) { > pte = pte_wrprotect(pte); > pte = pte_mkuffd_wp(pte); > } > > > And that would match what actually should have been done in commit > f45ec5ff16a7 -- only special-case uffd-wp. > > Note that I think there are cases where we have a PTE that was !writable, > but after migration we can map it writable. The thing is recovering the pte into its original form is the safest approach to me, so I think we need justification on why it's always safe to set the write bit. Or do you perhaps have solid clue and think it's always safe? > > BTW, does unuse_pte() need similar care? > > new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot)); > if (pte_swp_uffd_wp(*pte)) > new_pte = pte_mkuffd_wp(new_pte); > set_pte_at(vma->vm_mm, addr, pte, new_pte); I think unuse path is fine because unuse only applies to private mappings, so we should always have the W bit removed there within mk_pte(). Thanks,
>> >> Any particular reason why not to simply glue this to pte_swp_uffd_wp(), >> because only that needs special care: >> >> if (pte_swp_uffd_wp(*pvmw.pte)) { >> pte = pte_wrprotect(pte); >> pte = pte_mkuffd_wp(pte); >> } >> >> >> And that would match what actually should have been done in commit >> f45ec5ff16a7 -- only special-case uffd-wp. >> >> Note that I think there are cases where we have a PTE that was !writable, >> but after migration we can map it writable. > > The thing is recovering the pte into its original form is the safest > approach to me, so I think we need justification on why it's always safe to > set the write bit. > > Or do you perhaps have solid clue and think it's always safe The problem I am having with this broader change, is that this changes something independent of your original patch/problem. If we identify this to be an actual problem, it should most probably be separate fix + backport. My understanding is that vma->vm_page_prot always tells you what the default PTE protection in a mapping is. If the mapping is private, it is never writable (due to COW). Similarly, if the shared file mapping needs writenotify, it is never writable. I consider UFFD-wp a special case: while the default VMA protection might state that it is writable, you actually want individual PTEs to be write-protected and have to manually remove the protection. softdirty tracking is another special case: however, softdirty tracking is enabled for the whole VMA. For remove_migration_pte() that should be fine (I guess) because writenotify is active when the VMA needs to track softdirty bits, and consequently vma->vm_page_prot has the proper default permissions. I wonder if the following (valid), for example is possible: 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte. -> writenotify is active and vma->vm_page_prot updated accordingly VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated accordingly. See mmap_region() where we set VM_SOFTDIRTY. If you now migrate the (still write-protected in the PTE) page, it was not writable, but it can be writable on the destination. > >> >> BTW, does unuse_pte() need similar care? >> >> new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot)); >> if (pte_swp_uffd_wp(*pte)) >> new_pte = pte_mkuffd_wp(new_pte); >> set_pte_at(vma->vm_mm, addr, pte, new_pte); > > I think unuse path is fine because unuse only applies to private mappings, > so we should always have the W bit removed there within mk_pte(). You're right, however, shmem swapping confuses me. Maybe that does not apply here.
On Tue, Nov 15, 2022 at 10:13:24AM +0100, David Hildenbrand wrote: > > > > > > Any particular reason why not to simply glue this to pte_swp_uffd_wp(), > > > because only that needs special care: > > > > > > if (pte_swp_uffd_wp(*pvmw.pte)) { > > > pte = pte_wrprotect(pte); > > > pte = pte_mkuffd_wp(pte); > > > } > > > > > > > > > And that would match what actually should have been done in commit > > > f45ec5ff16a7 -- only special-case uffd-wp. > > > > > > Note that I think there are cases where we have a PTE that was !writable, > > > but after migration we can map it writable. > > > > The thing is recovering the pte into its original form is the safest > > approach to me, so I think we need justification on why it's always safe to > > set the write bit. > > > > Or do you perhaps have solid clue and think it's always safe > The problem I am having with this broader change, is that this changes > something independent of your original patch/problem. > > If we identify this to be an actual problem, it should most probably be > separate fix + backport. > > > My understanding is that vma->vm_page_prot always tells you what the default > PTE protection in a mapping is. > > If the mapping is private, it is never writable (due to COW). Similarly, if > the shared file mapping needs writenotify, it is never writable. Right. > > > I consider UFFD-wp a special case: while the default VMA protection might > state that it is writable, you actually want individual PTEs to be > write-protected and have to manually remove the protection. > > softdirty tracking is another special case: however, softdirty tracking is > enabled for the whole VMA. For remove_migration_pte() that should be fine (I > guess) because writenotify is active when the VMA needs to track softdirty > bits, and consequently vma->vm_page_prot has the proper default permissions. > > > I wonder if the following (valid), for example is possible: > > > 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte. > -> writenotify is active and vma->vm_page_prot updated accordingly > > VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated > accordingly. See mmap_region() where we set VM_SOFTDIRTY. > > If you now migrate the (still write-protected in the PTE) page, it was not > writable, but it can be writable on the destination. I didn't even notice merging could work with soft-dirty enabled, that's interesting to know. Yes I think it's possible and I agree it's safe, as VM_SOFTDIRTY is set for the merged vma so afaiu the write bit is safe to set. We get a bunch of false positives but that's how soft-dirty works. I think the whole problem is easier if we see this at a higher level. You're discussing this from vma pov and it's fair to do so, at least I agree with what you mentioned so far and I can't see anything outside uffd-wp that can be affected. However, it is also true when you noticed we already have quite a few paragraphs trying to discuss the safety for this and that, that's the part where I think we need justification and it's not that "natural". For "natural", I meant fundamentally we're talking about page migrations here. The natural way (at least to me) for page migration to happen as a fundamental rule is that, we leverag the migration pte to make sure the pte being stable so we can do the migration work, then we "recover" the pte to present either by a full recovery or just (hopefully) only replace the pfn, keeping all the rest untouched. One thing to prove that is we have two migration entries not one (I'm temporarily put the exclusive read one aside since that's solving different problems): MIGRATION_READ and MIGRATION_WRITE. If we only rely on vma flags logically we don't need MIGRATION_READ and MIGRATION_WRITE, we only need MIGRATION generic swap pte then we recover the write bit from vma flags and other things (like uffd-wp, currently we have the bit set in swap pte besides the swap entry type). So maybe one day we can use two migration types rather than three (MIGRATION and MIGRATION_EXCLUSIVE)? I can't tell, but hopefully that shows what I meant, that we need further justification to grant write bit only base on vma, rather than recovering write bit based on migration entry type. > > > > > > > > > BTW, does unuse_pte() need similar care? > > > > > > new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot)); > > > if (pte_swp_uffd_wp(*pte)) > > > new_pte = pte_mkuffd_wp(new_pte); > > > set_pte_at(vma->vm_mm, addr, pte, new_pte); > > > > I think unuse path is fine because unuse only applies to private mappings, > > so we should always have the W bit removed there within mk_pte(). > > You're right, however, shmem swapping confuses me. Maybe that does not apply > here. Yeah these are confusing. Actually I think the unuse path should apply to private mappings on shmem when the page was CoWed already and then got swapped out, but again as long as it's private mapping I think we don't have write bit anyway in retval of mk_pte() even if it's shmem.
>> I consider UFFD-wp a special case: while the default VMA protection might >> state that it is writable, you actually want individual PTEs to be >> write-protected and have to manually remove the protection. >> >> softdirty tracking is another special case: however, softdirty tracking is >> enabled for the whole VMA. For remove_migration_pte() that should be fine (I >> guess) because writenotify is active when the VMA needs to track softdirty >> bits, and consequently vma->vm_page_prot has the proper default permissions. >> >> >> I wonder if the following (valid), for example is possible: >> >> >> 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte. >> -> writenotify is active and vma->vm_page_prot updated accordingly >> >> VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated >> accordingly. See mmap_region() where we set VM_SOFTDIRTY. >> >> If you now migrate the (still write-protected in the PTE) page, it was not >> writable, but it can be writable on the destination. > > I didn't even notice merging could work with soft-dirty enabled, that's > interesting to know. > > Yes I think it's possible and I agree it's safe, as VM_SOFTDIRTY is set for > the merged vma so afaiu the write bit is safe to set. We get a bunch of > false positives but that's how soft-dirty works. > > I think the whole problem is easier if we see this at a higher level. > You're discussing this from vma pov and it's fair to do so, at least I > agree with what you mentioned so far and I can't see anything outside > uffd-wp that can be affected. However, it is also true when you noticed we > already have quite a few paragraphs trying to discuss the safety for this > and that, that's the part where I think we need justification and it's not > that "natural". > > For "natural", I meant fundamentally we're talking about page migrations > here. The natural way (at least to me) for page migration to happen as a > fundamental rule is that, we leverag the migration pte to make sure the pte > being stable so we can do the migration work, then we "recover" the pte to > present either by a full recovery or just (hopefully) only replace the pfn, > keeping all the rest untouched. > > One thing to prove that is we have two migration entries not one (I'm > temporarily put the exclusive read one aside since that's solving different > problems): MIGRATION_READ and MIGRATION_WRITE. If we only rely on vma > flags logically we don't need MIGRATION_READ and MIGRATION_WRITE, we only > need MIGRATION generic swap pte then we recover the write bit from vma > flags and other things (like uffd-wp, currently we have the bit set in swap > pte besides the swap entry type). > > So maybe one day we can use two migration types rather than three > (MIGRATION and MIGRATION_EXCLUSIVE)? I can't tell, but hopefully that > shows what I meant, that we need further justification to grant write bit > only base on vma, rather than recovering write bit based on migration entry > type. That's precisely what I had in mind recently, and I am happy to hear that you have similar idea: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com " Note that we don't optimize for the actual migration case: (1) When migration succeeds the new PTE will not be writable because the source PTE was not writable (protnone); in the future we might just optimize that case similarly by reusing can_change_pte_writable()/can_change_pmd_writable() when removing migration PTEs. " Currently, "readable_migration_entry" is even wrong: it might be PROT_NONE and not even readable.
On 15.11.22 18:22, David Hildenbrand wrote: >>> I consider UFFD-wp a special case: while the default VMA protection might >>> state that it is writable, you actually want individual PTEs to be >>> write-protected and have to manually remove the protection. >>> >>> softdirty tracking is another special case: however, softdirty tracking is >>> enabled for the whole VMA. For remove_migration_pte() that should be fine (I >>> guess) because writenotify is active when the VMA needs to track softdirty >>> bits, and consequently vma->vm_page_prot has the proper default permissions. >>> >>> >>> I wonder if the following (valid), for example is possible: >>> >>> >>> 1) clear_refs() clears VM_SOFTDIRTY and pte_wrprotect() the pte. >>> -> writenotify is active and vma->vm_page_prot updated accordingly >>> >>> VM_SOFTDIRTY is reset due to VMA merging and vma->vm_page_prot is updated >>> accordingly. See mmap_region() where we set VM_SOFTDIRTY. >>> >>> If you now migrate the (still write-protected in the PTE) page, it was not >>> writable, but it can be writable on the destination. >> >> I didn't even notice merging could work with soft-dirty enabled, that's >> interesting to know. >> >> Yes I think it's possible and I agree it's safe, as VM_SOFTDIRTY is set for >> the merged vma so afaiu the write bit is safe to set. We get a bunch of >> false positives but that's how soft-dirty works. >> >> I think the whole problem is easier if we see this at a higher level. >> You're discussing this from vma pov and it's fair to do so, at least I >> agree with what you mentioned so far and I can't see anything outside >> uffd-wp that can be affected. However, it is also true when you noticed we >> already have quite a few paragraphs trying to discuss the safety for this >> and that, that's the part where I think we need justification and it's not >> that "natural". Forgot to reply to that part: No it isn't natural. But sneaking such a change into your fix seems wrong. Touching !uffd-wp code should be separate, if we want to do this at all (as we discussed, maybe the better/cleaner approach is to eliminate writable migration entries if possible).
On Tue, Nov 15, 2022 at 06:22:03PM +0100, David Hildenbrand wrote: > That's precisely what I had in mind recently, and I am happy to hear that > you have similar idea: > > https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com > > " > Note that we don't optimize for the actual migration case: > (1) When migration succeeds the new PTE will not be writable because the > source PTE was not writable (protnone); in the future we > might just optimize that case similarly by reusing > can_change_pte_writable()/can_change_pmd_writable() when removing > migration PTEs. > " I see, sorry I haven't yet read it, but sounds doable indeed. > > Currently, "readable_migration_entry" is even wrong: it might be PROT_NONE > and not even readable. Do you mean mprotect(PROT_NONE)? If we read the "read migration entry" as "migration entry with no write bit", it seems still fine, and code-wise after pte recovered it should still be PROT_NONE iiuc because mk_pte() will just make a pte without e.g. _PRESENT bit set on x86 while it'll have the _PROT_NONE bit. May not keep true for numa balancing though: when migration happens after a numa hint applied to a pte, it seems to me it's prone to lose the hint after migration completes (assuming this migration is not the numa balancing operation itself caused by a page access). Doesn't sound like a severe issue though even if I didn't miss something, since if the page got moved around the original hint may need to reconsider anyway.
On 15.11.22 19:03, Peter Xu wrote: > On Tue, Nov 15, 2022 at 06:22:03PM +0100, David Hildenbrand wrote: >> That's precisely what I had in mind recently, and I am happy to hear that >> you have similar idea: >> >> https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com >> >> " >> Note that we don't optimize for the actual migration case: >> (1) When migration succeeds the new PTE will not be writable because the >> source PTE was not writable (protnone); in the future we >> might just optimize that case similarly by reusing >> can_change_pte_writable()/can_change_pmd_writable() when removing >> migration PTEs. >> " > > I see, sorry I haven't yet read it, but sounds doable indeed. > >> >> Currently, "readable_migration_entry" is even wrong: it might be PROT_NONE >> and not even readable. > > Do you mean mprotect(PROT_NONE)? > > If we read the "read migration entry" as "migration entry with no write > bit", it seems still fine, and code-wise after pte recovered it should > still be PROT_NONE iiuc because mk_pte() will just make a pte without > e.g. _PRESENT bit set on x86 while it'll have the _PROT_NONE bit. Exactly that's the unintuitive interpretation of "readable_migration_entry". By "wrong" I meant: the naming is wrong. > > May not keep true for numa balancing though: when migration happens after a > numa hint applied to a pte, it seems to me it's prone to lose the hint > after migration completes (assuming this migration is not the numa > balancing operation itself caused by a page access). Doesn't sound like a > severe issue though even if I didn't miss something, since if the page got > moved around the original hint may need to reconsider anyway. Yes, I think any migration will lose fake PROT_NONE. "Fake" as in "not VMA permissions" but "additional permissions imposed by NUMA hinting faults."
On Tue, Nov 15, 2022 at 06:54:09PM +0100, David Hildenbrand wrote: > No it isn't natural. But sneaking such a change into your fix seems wrong. > Touching !uffd-wp code should be separate, if we want to do this at all (as > we discussed, maybe the better/cleaner approach is to eliminate writable > migration entries if possible). If you see I made the subject as mm/migrate, because I took it as a migration change, not uffd-wp specific. I also added the Fixes to the uffd patch because afaict that's the only thing that it affects, but I can't tell either. As I said, I'm trying to follow the rule where we have read entry I want to make sure write bit removed. I hope we have this patch land soon, because it affects users. I'm not against further rework on migration entries as I replied in the other email, but for this patch I keep the same thoughts as when I posted. Thanks,
On 15.11.22 19:11, Peter Xu wrote: > On Tue, Nov 15, 2022 at 06:54:09PM +0100, David Hildenbrand wrote: >> No it isn't natural. But sneaking such a change into your fix seems wrong. >> Touching !uffd-wp code should be separate, if we want to do this at all (as >> we discussed, maybe the better/cleaner approach is to eliminate writable >> migration entries if possible). > > If you see I made the subject as mm/migrate, because I took it as a > migration change, not uffd-wp specific. I also added the Fixes to the uffd > patch because afaict that's the only thing that it affects, but I can't > tell either. > > As I said, I'm trying to follow the rule where we have read entry I want to > make sure write bit removed. I really don't think we should do that generic change.
diff --git a/mm/migrate.c b/mm/migrate.c index dff333593a8a..8b6351c08c78 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio, pte = pte_mkdirty(pte); if (is_writable_migration_entry(entry)) pte = maybe_mkwrite(pte, vma); - else if (pte_swp_uffd_wp(*pvmw.pte)) + else + /* NOTE: mk_pte can have write bit set */ + pte = pte_wrprotect(pte); + + if (pte_swp_uffd_wp(*pvmw.pte)) { + WARN_ON_ONCE(pte_write(pte)); pte = pte_mkuffd_wp(pte); + } if (folio_test_anon(folio) && !is_readable_migration_entry(entry)) rmap_flags |= RMAP_EXCLUSIVE;
Ives van Hoorne from codesandbox.io reported an issue regarding possible data loss of uffd-wp when applied to memfds on heavily loaded systems. The sympton is some read page got data mismatch from the snapshot child VMs. Here I can also reproduce with a Rust reproducer that was provided by Ives that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate 80 instances I can trigger the issues in ten minutes. It turns out that we got some pages write-through even if uffd-wp is applied to the pte. The problem is, when removing migration entries, we didn't really worry about write bit as long as we know it's not a write migration entry. That may not be true, for some memory types (e.g. writable shmem) mk_pte can return a pte with write bit set, then to recover the migration entry to its original state we need to explicit wr-protect the pte or it'll has the write bit set if it's a read migration entry. For uffd it can cause write-through. I didn't verify, but I think it'll be the same for mprotect()ed pages and after migration we can miss the sigbus instead. The relevant code on uffd was introduced in the anon support, which is commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration", 2020-04-07). However anon shouldn't suffer from this problem because anon should already have the write bit cleared always, so that may not be a proper Fixes target. To satisfy the need on the backport, I'm attaching the Fixes tag to the uffd-wp shmem support. Since no one had issue with mprotect, so I assume that's also the kernel version we should start to backport for stable, and we shouldn't need to worry before that. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: stable@vger.kernel.org Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Reported-by: Ives van Hoorne <ives@codesandbox.io> Signed-off-by: Peter Xu <peterx@redhat.com> --- mm/migrate.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)