diff mbox series

mm/uffd: UFFD_FEATURE_WP_ZEROPAGE

Message ID 20230215210257.224243-1-peterx@redhat.com (mailing list archive)
State New
Headers show
Series mm/uffd: UFFD_FEATURE_WP_ZEROPAGE | expand

Commit Message

Peter Xu Feb. 15, 2023, 9:02 p.m. UTC
This is a new feature that controls how uffd-wp handles zero pages (aka,
empty ptes), majorly for anonymous pages only.

Note, here we used "zeropage" as a replacement of "empty pte" just to avoid
introducing the pte idea into uapi, since "zero page" is more well known to
an user app developer.

File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not.  For
anonymous it was not as persistent because we used to assume that we don't
need protections on none ptes or known zero pages.

But it's actually not true.

One use case was VM live snapshot, where if without wr-protecting empty
ptes the snapshot can contain random rubbish in the holes of the anonymous
memory, which can cause misbehave of the guest when the guest assumes the
pages should (and were) all zeros.

QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].

Recently there's another need that raised on using userfaultfd wr-protect
for detecting dirty pages (to replace soft-dirty) [2].  In that case if
without being able to wr-protect zero pages by default, the dirty info can
get lost as long as a zero page is written, even after the tracking was
started.

In general, we want to be able to wr-protect empty ptes too even for
anonymous.

This patch implements UFFD_FEATURE_WP_ZEROPAGE so that it'll make uffd-wp
handling on zeropage being consistent no matter what the memory type is
underneath.  It doesn't have any impact on file memories so far because we
already have pte markers taking care of that.  So it only affects
anonymous.

One way to implement this is to also install pte markers for anonymous
memories.  However here we can actually do better (than i.e. shmem) because
we know there's no page that is backing the pte, so the better solution is
to directly install a zeropage read-only pte, so that if there'll be a
upcoming read it'll not trigger a fault at all.  It will also reduce the
changeset to implement this feature too.

To install zeropages, we'll also need to populate the pgtables just like
file memories during ioctl(UFFDIO_WRITEPROTECT), where zeropage needs to be
installed.  Rename uffd_wp_protect_file() to pgtable_populate_needed()
because it's not only about file memory, not anymore.  Add yet another
pgtable_split_needed() because for anonymous we don't need to split a thp
for wr-protections (e.g., when it's only read during the whole process).

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
[1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                         |  8 +++
 include/linux/userfaultfd_k.h            |  6 +++
 include/uapi/linux/userfaultfd.h         | 10 +++-
 mm/mprotect.c                            | 68 ++++++++++++++++++++----
 tools/testing/selftests/mm/userfaultfd.c | 13 ++++-
 5 files changed, 93 insertions(+), 12 deletions(-)

Comments

David Hildenbrand Feb. 16, 2023, 10:47 a.m. UTC | #1
On 15.02.23 22:02, Peter Xu wrote:
> This is a new feature that controls how uffd-wp handles zero pages (aka,
> empty ptes), majorly for anonymous pages only.
> 
> Note, here we used "zeropage" as a replacement of "empty pte" just to avoid
> introducing the pte idea into uapi, since "zero page" is more well known to
> an user app developer.
> 
> File memories handles none ptes consistently by allowing wr-protecting of
> none ptes because of the unawareness of page cache being exist or not.  For
> anonymous it was not as persistent because we used to assume that we don't
> need protections on none ptes or known zero pages.
> 
> But it's actually not true.
> 
> One use case was VM live snapshot, where if without wr-protecting empty
> ptes the snapshot can contain random rubbish in the holes of the anonymous
> memory, which can cause misbehave of the guest when the guest assumes the
> pages should (and were) all zeros.
> 
> QEMU worked it around by pre-populate the section with reads to fill in
> zero page entries before starting the whole snapshot process [1].
> 
> Recently there's another need that raised on using userfaultfd wr-protect
> for detecting dirty pages (to replace soft-dirty) [2].  In that case if
> without being able to wr-protect zero pages by default, the dirty info can
> get lost as long as a zero page is written, even after the tracking was
> started.
> 
> In general, we want to be able to wr-protect empty ptes too even for
> anonymous.
> 
> This patch implements UFFD_FEATURE_WP_ZEROPAGE so that it'll make uffd-wp
> handling on zeropage being consistent no matter what the memory type is
> underneath.  It doesn't have any impact on file memories so far because we
> already have pte markers taking care of that.  So it only affects
> anonymous.
> 
> One way to implement this is to also install pte markers for anonymous
> memories.  However here we can actually do better (than i.e. shmem) because
> we know there's no page that is backing the pte, so the better solution is
> to directly install a zeropage read-only pte, so that if there'll be a
> upcoming read it'll not trigger a fault at all.  It will also reduce the
> changeset to implement this feature too.
> 

There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, 
using PTE markers, would be more benficial:

1) It would be applicable to anon hugetlb
2) It would be applicable even when the zeropage is disallowed
    (mm_forbids_zeropage())
3) It would be possible to optimize even without the huge zeropage, by
    using a PMD marker.
4) It would be possible to optimize even on the PUD level using a PMD
    marker.

Especially when uffd-wp'ing large ranges that are possibly all 
unpopulated (thinking about the existing VM background snapshot use case 
either with untouched memory or with things like free page reporting), 
we might neither be reading or writing that memory any time soon.
Peter Xu Feb. 16, 2023, 4:29 p.m. UTC | #2
On Thu, Feb 16, 2023 at 11:47:23AM +0100, David Hildenbrand wrote:
> On 15.02.23 22:02, Peter Xu wrote:
> > This is a new feature that controls how uffd-wp handles zero pages (aka,
> > empty ptes), majorly for anonymous pages only.
> > 
> > Note, here we used "zeropage" as a replacement of "empty pte" just to avoid
> > introducing the pte idea into uapi, since "zero page" is more well known to
> > an user app developer.
> > 
> > File memories handles none ptes consistently by allowing wr-protecting of
> > none ptes because of the unawareness of page cache being exist or not.  For
> > anonymous it was not as persistent because we used to assume that we don't
> > need protections on none ptes or known zero pages.
> > 
> > But it's actually not true.
> > 
> > One use case was VM live snapshot, where if without wr-protecting empty
> > ptes the snapshot can contain random rubbish in the holes of the anonymous
> > memory, which can cause misbehave of the guest when the guest assumes the
> > pages should (and were) all zeros.
> > 
> > QEMU worked it around by pre-populate the section with reads to fill in
> > zero page entries before starting the whole snapshot process [1].
> > 
> > Recently there's another need that raised on using userfaultfd wr-protect
> > for detecting dirty pages (to replace soft-dirty) [2].  In that case if
> > without being able to wr-protect zero pages by default, the dirty info can
> > get lost as long as a zero page is written, even after the tracking was
> > started.
> > 
> > In general, we want to be able to wr-protect empty ptes too even for
> > anonymous.
> > 
> > This patch implements UFFD_FEATURE_WP_ZEROPAGE so that it'll make uffd-wp
> > handling on zeropage being consistent no matter what the memory type is
> > underneath.  It doesn't have any impact on file memories so far because we
> > already have pte markers taking care of that.  So it only affects
> > anonymous.
> > 
> > One way to implement this is to also install pte markers for anonymous
> > memories.  However here we can actually do better (than i.e. shmem) because
> > we know there's no page that is backing the pte, so the better solution is
> > to directly install a zeropage read-only pte, so that if there'll be a
> > upcoming read it'll not trigger a fault at all.  It will also reduce the
> > changeset to implement this feature too.
> > 
> 
> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
> PTE markers, would be more benficial:
> 
> 1) It would be applicable to anon hugetlb

Anon hugetlb should already work with non ptes with the markers?

> 2) It would be applicable even when the zeropage is disallowed
>    (mm_forbids_zeropage())

Do you mean s390 can disable zeropage with mm_uses_skeys()?  So far uffd-wp
doesn't support s390 yet, I'm not sure whether we over worried on this
effect.

Or is there any other projects / ideas that potentially can enlarge forbid
zero pages to more contexts?

> 3) It would be possible to optimize even without the huge zeropage, by
>    using a PMD marker.

This patch doesn't need huge zeropage being exist.

> 4) It would be possible to optimize even on the PUD level using a PMD
>    marker.

I think 3+4 is in general an interesting idea on using pte markers on
higher than pte levels, but that needs more changes.

Firstly, keep using pte markers is somehow preallocating the pgtables, so a
side effect of it could be speeding up future faults because they'll all
split into pmd locks and read doesn't need to fault at all, only writes.

Imagine when you hit a page fault on a pmd marker, it means you'll need to
spread that "marker" information to child ptes and you must - it moves the
slow operation of WP into future page faults in some way.  In some cases
(I'd say, most cases..) that's not wanted.  The same to PUDs.

> 
> Especially when uffd-wp'ing large ranges that are possibly all unpopulated
> (thinking about the existing VM background snapshot use case either with
> untouched memory or with things like free page reporting), we might neither
> be reading or writing that memory any time soon.

Right, I think that's a trade-off. But I still think large portion of
totally unpopulated memory should be rare case rather than majority, or am
I wrong?  Not to mention that requires a more involved changeset to the
kernel.

So what I proposed here is the (AFAIU) simplest solution towards providing
such a feature in a complete form.  I think we have chance to implement it
in other ways like pte markers, but that's something we can work upon, and
so far I'm not sure how much benefit we can get out of it yet.

Thanks,
David Hildenbrand Feb. 16, 2023, 5 p.m. UTC | #3
>>
>> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
>> PTE markers, would be more benficial:
>>
>> 1) It would be applicable to anon hugetlb
> 
> Anon hugetlb should already work with non ptes with the markers?
> 

... really? I thought we'd do the whole pte marker handling only when 
dealing with hugetlb/shmem. Interesting, thanks. (we could skip 
population in QEMU in that case as well -- we always do it for now)

>> 2) It would be applicable even when the zeropage is disallowed
>>     (mm_forbids_zeropage())
> 
> Do you mean s390 can disable zeropage with mm_uses_skeys()?  So far uffd-wp
> doesn't support s390 yet, I'm not sure whether we over worried on this
> effect.
> 
> Or is there any other projects / ideas that potentially can enlarge forbid
> zero pages to more contexts?

I think it was shown that zeropages can be used to build covert channels 
(similar to memory deduplciation, because it effectively is memory 
deduplication). It's mentioned as a note in [1] under VII. A. ("Only 
Deduplicate Zero Pages.")


[1] https://www.ndss-symposium.org/wp-content/uploads/2022-81-paper.pdf

> 
>> 3) It would be possible to optimize even without the huge zeropage, by
>>     using a PMD marker.
> 
> This patch doesn't need huge zeropage being exist.

Yes, and for that reason I think it may perform worse than what we 
already have in some cases. Instead of populating a single PMD you'll 
have to fill a full PTE table.

> 
>> 4) It would be possible to optimize even on the PUD level using a PMD
>>     marker.
> 
> I think 3+4 is in general an interesting idea on using pte markers on
> higher than pte levels, but that needs more changes.
> 
> Firstly, keep using pte markers is somehow preallocating the pgtables, so a
> side effect of it could be speeding up future faults because they'll all
> split into pmd locks and read doesn't need to fault at all, only writes.
> 
> Imagine when you hit a page fault on a pmd marker, it means you'll need to
> spread that "marker" information to child ptes and you must - it moves the
> slow operation of WP into future page faults in some way.  In some cases
> (I'd say, most cases..) that's not wanted.  The same to PUDs.

Right, but user space already has that option (see below).

> 
>>
>> Especially when uffd-wp'ing large ranges that are possibly all unpopulated
>> (thinking about the existing VM background snapshot use case either with
>> untouched memory or with things like free page reporting), we might neither
>> be reading or writing that memory any time soon.
> 
> Right, I think that's a trade-off. But I still think large portion of
> totally unpopulated memory should be rare case rather than majority, or am
> I wrong?  Not to mention that requires a more involved changeset to the
> kernel.
> 
> So what I proposed here is the (AFAIU) simplest solution towards providing
> such a feature in a complete form.  I think we have chance to implement it
> in other ways like pte markers, but that's something we can work upon, and
> so far I'm not sure how much benefit we can get out of it yet.
> 

What you propose here can already be achieved by user space fairly 
easily (in fact, QEMU implementation could be further sped up using 
MADV_POPULATE_READ). Usually, we only do that when there are very good 
reasons to (performance).

Using PTE markers would provide a real advantage IMHO for some users 
(IMHO background snapshots), where we might want to avoid populating 
zeropages/page tables as best as we can completely if the VM memory is 
mostly untouched.

Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is 
there is another good reason to combine the populate zeropage+wp that I 
am missing (e.g., atomicity by doing both in one operation)?
Peter Xu Feb. 16, 2023, 5:55 p.m. UTC | #4
On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote:
> > > 
> > > There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
> > > PTE markers, would be more benficial:
> > > 
> > > 1) It would be applicable to anon hugetlb
> > 
> > Anon hugetlb should already work with non ptes with the markers?
> > 
> 
> ... really? I thought we'd do the whole pte marker handling only when
> dealing with hugetlb/shmem. Interesting, thanks. (we could skip population
> in QEMU in that case as well -- we always do it for now)

Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :)

For qemu's live snapshot, let me take a look at that later; and I will
definitely do so if this patch lands, then we avoid it completely.

> 
> > > 2) It would be applicable even when the zeropage is disallowed
> > >     (mm_forbids_zeropage())
> > 
> > Do you mean s390 can disable zeropage with mm_uses_skeys()?  So far uffd-wp
> > doesn't support s390 yet, I'm not sure whether we over worried on this
> > effect.
> > 
> > Or is there any other projects / ideas that potentially can enlarge forbid
> > zero pages to more contexts?
> 
> I think it was shown that zeropages can be used to build covert channels
> (similar to memory deduplciation, because it effectively is memory
> deduplication). It's mentioned as a note in [1] under VII. A. ("Only
> Deduplicate Zero Pages.")
> 
> 
> [1] https://www.ndss-symposium.org/wp-content/uploads/2022-81-paper.pdf

Thanks for the link.  I'm slightly confused how dedup of zero pages is a
concern here, though.

IIUC the security risk is when the dedup-ed pages contain valid information
so the attacker can measure latency of requests when the attemped malicious
page contains exactly the same content of the data page, by trying to
detect the CoW from happening.

Here it's the zero page, even if there's CoW difference the data being
exposed can only be all zeros?  Then what's the risk?

Another note for s390: when it comes we can consider moving to pte markers
conditionally when !zeropage.  But we can leave that for later.

> 
> > 
> > > 3) It would be possible to optimize even without the huge zeropage, by
> > >     using a PMD marker.
> > 
> > This patch doesn't need huge zeropage being exist.
> 
> Yes, and for that reason I think it may perform worse than what we already
> have in some cases. Instead of populating a single PMD you'll have to fill a
> full PTE table.

Yes.  If you think that'll worth it, I can conditionally do pmd zero thp in
a new version.  Maybe it will be a good intermediate step between
introducing pte markers to pmd/pud/etc, so at least we don't need other
changes to coordinate pte markers to higher levels.

> 
> > 
> > > 4) It would be possible to optimize even on the PUD level using a PMD
> > >     marker.
> > 
> > I think 3+4 is in general an interesting idea on using pte markers on
> > higher than pte levels, but that needs more changes.
> > 
> > Firstly, keep using pte markers is somehow preallocating the pgtables, so a
> > side effect of it could be speeding up future faults because they'll all
> > split into pmd locks and read doesn't need to fault at all, only writes.
> > 
> > Imagine when you hit a page fault on a pmd marker, it means you'll need to
> > spread that "marker" information to child ptes and you must - it moves the
> > slow operation of WP into future page faults in some way.  In some cases
> > (I'd say, most cases..) that's not wanted.  The same to PUDs.
> 
> Right, but user space already has that option (see below).
> 
> > 
> > > 
> > > Especially when uffd-wp'ing large ranges that are possibly all unpopulated
> > > (thinking about the existing VM background snapshot use case either with
> > > untouched memory or with things like free page reporting), we might neither
> > > be reading or writing that memory any time soon.
> > 
> > Right, I think that's a trade-off. But I still think large portion of
> > totally unpopulated memory should be rare case rather than majority, or am
> > I wrong?  Not to mention that requires a more involved changeset to the
> > kernel.
> > 
> > So what I proposed here is the (AFAIU) simplest solution towards providing
> > such a feature in a complete form.  I think we have chance to implement it
> > in other ways like pte markers, but that's something we can work upon, and
> > so far I'm not sure how much benefit we can get out of it yet.
> > 
> 
> What you propose here can already be achieved by user space fairly easily
> (in fact, QEMU implementation could be further sped up using
> MADV_POPULATE_READ). Usually, we only do that when there are very good
> reasons to (performance).

Yes POPULATE_READ will be faster.  This patch should make it even faster,
because it merges the two walks.

> 
> Using PTE markers would provide a real advantage IMHO for some users (IMHO
> background snapshots), where we might want to avoid populating
> zeropages/page tables as best as we can completely if the VM memory is
> mostly untouched.
> 
> Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there
> is another good reason to combine the populate zeropage+wp that I am missing
> (e.g., atomicity by doing both in one operation)?

It also makes the new WP_ASYNC and pagemap interface clean: we don't want
to have user pre-fault it every time too as a common tactic..  It's hard to
use, and the user doesn't need to know the internals of why it is needed,
either.

The other thing is it provides a way to make anon and !anon behave the same
on empty ptes; it's a pity that it was not already like that.

We can always optimize this behavior in the future with either
PMD/PUD/.. pte markers as you said, but IMHO that just needs further
justification on the complexity, and also on whether that's beneficial to
the majority to become the default behavior.

The worst case (if anyone would like that behavior) is we can have another
feature bit making decision of that behavior, but that'll be something on
top.

Thanks,
David Hildenbrand Feb. 16, 2023, 6:23 p.m. UTC | #5
On 16.02.23 18:55, Peter Xu wrote:
> On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote:
>>>>
>>>> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
>>>> PTE markers, would be more benficial:
>>>>
>>>> 1) It would be applicable to anon hugetlb
>>>
>>> Anon hugetlb should already work with non ptes with the markers?
>>>
>>
>> ... really? I thought we'd do the whole pte marker handling only when
>> dealing with hugetlb/shmem. Interesting, thanks. (we could skip population
>> in QEMU in that case as well -- we always do it for now)
> 
> Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :)

I mean especially MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS, so "in theory" 
without any fd and thus pagecache. ... but anon hugetlb keeps confusing 
me with pagecache handling.

>>
>>>> 2) It would be applicable even when the zeropage is disallowed
>>>>      (mm_forbids_zeropage())
>>>
>>> Do you mean s390 can disable zeropage with mm_uses_skeys()?  So far uffd-wp
>>> doesn't support s390 yet, I'm not sure whether we over worried on this
>>> effect.
>>>
>>> Or is there any other projects / ideas that potentially can enlarge forbid
>>> zero pages to more contexts?
>>
>> I think it was shown that zeropages can be used to build covert channels
>> (similar to memory deduplciation, because it effectively is memory
>> deduplication). It's mentioned as a note in [1] under VII. A. ("Only
>> Deduplicate Zero Pages.")
>>
>>
>> [1] https://www.ndss-symposium.org/wp-content/uploads/2022-81-paper.pdf
> 
> Thanks for the link.  I'm slightly confused how dedup of zero pages is a
> concern here, though.
> 
> IIUC the security risk is when the dedup-ed pages contain valid information
> so the attacker can measure latency of requests when the attemped malicious
> page contains exactly the same content of the data page, by trying to
> detect the CoW from happening. >
> Here it's the zero page, even if there's CoW difference the data being
> exposed can only be all zeros?  Then what's the risk?

The focus of that paper is on CoW latency yes (and deduplication 
instantiating shared zeropages -- but building a covert channel using 
CoW latency might be rather tricky I think, because they will get 
deduplciated independently of a sender action ...).

However, in theory, one could build a covert channel between two VMs 
simply by using cache flushes and reading from the shared zeropage. 
Measuring access time can reveal if the sender read the page (L3 filled) 
or not (L3 not filled).

Having that said, I don't think that we are going to disable the shared 
zeropage because of that for some workloads, I assume in most cases it 
will simply be way too noisy to transmit any kind of data and we have 
more critical covert channels to sort out if we want to.

Just wanted to raise it because you asked :)

> 
> Another note for s390: when it comes we can consider moving to pte markers
> conditionally when !zeropage.  But we can leave that for later.

Sure, we could always have another feature flag.

> 
>>
>>>
>>>> 3) It would be possible to optimize even without the huge zeropage, by
>>>>      using a PMD marker.
>>>
>>> This patch doesn't need huge zeropage being exist.
>>
>> Yes, and for that reason I think it may perform worse than what we already
>> have in some cases. Instead of populating a single PMD you'll have to fill a
>> full PTE table.
> 
> Yes.  If you think that'll worth it, I can conditionally do pmd zero thp in
> a new version.  Maybe it will be a good intermediate step between
> introducing pte markers to pmd/pud/etc, so at least we don't need other
> changes to coordinate pte markers to higher levels.

[...]

>>>> Especially when uffd-wp'ing large ranges that are possibly all unpopulated
>>>> (thinking about the existing VM background snapshot use case either with
>>>> untouched memory or with things like free page reporting), we might neither
>>>> be reading or writing that memory any time soon.
>>>
>>> Right, I think that's a trade-off. But I still think large portion of
>>> totally unpopulated memory should be rare case rather than majority, or am
>>> I wrong?  Not to mention that requires a more involved changeset to the
>>> kernel.
>>>
>>> So what I proposed here is the (AFAIU) simplest solution towards providing
>>> such a feature in a complete form.  I think we have chance to implement it
>>> in other ways like pte markers, but that's something we can work upon, and
>>> so far I'm not sure how much benefit we can get out of it yet.
>>>
>>
>> What you propose here can already be achieved by user space fairly easily
>> (in fact, QEMU implementation could be further sped up using
>> MADV_POPULATE_READ). Usually, we only do that when there are very good
>> reasons to (performance).
> 
> Yes POPULATE_READ will be faster.  This patch should make it even faster,
> because it merges the two walks.
> 

Getting some performance numbers would be nice.

>>
>> Using PTE markers would provide a real advantage IMHO for some users (IMHO
>> background snapshots), where we might want to avoid populating
>> zeropages/page tables as best as we can completely if the VM memory is
>> mostly untouched.
>>
>> Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there
>> is another good reason to combine the populate zeropage+wp that I am missing
>> (e.g., atomicity by doing both in one operation)?
> 
> It also makes the new WP_ASYNC and pagemap interface clean: we don't want
> to have user pre-fault it every time too as a common tactic..  It's hard to
> use, and the user doesn't need to know the internals of why it is needed,
> either.

I feel like we're building a lot of infrastructure on uffd-wp instead of 
having an alternative softdirty mode (using a world switch?) that works 
as expected and doesn't require that many uffd-wp extensions. ;)

Having that said, I have the feeling that you and Muhammad have a plan 
to make it work using uffd-wp and I won't interfere. It would be nicer 
to use softdirty infrastructure IMHO, though.

> 
> The other thing is it provides a way to make anon and !anon behave the same
> on empty ptes; it's a pity that it was not already like that.

In an ideal world, we'd simply be using PTE markers unconditionally I 
think and avoid this zeropage feature :/

Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not 
simply always do that unconditionally? (sure, we have to indicate to 
user space that it now works as expected) Are we really expecting to 
break user space by protecting what was asked for to protect?

> 
> We can always optimize this behavior in the future with either
> PMD/PUD/.. pte markers as you said, but IMHO that just needs further
> justification on the complexity, and also on whether that's beneficial to
> the majority to become the default behavior.

As I said, usually any new features require good justification. Maybe 
there really is a measurable performance gain (less syscalls, less 
pgtable walks).

> 
> The worst case (if anyone would like that behavior) is we can have another
> feature bit making decision of that behavior, but that'll be something on
> top.

Right.
Peter Xu Feb. 16, 2023, 8:08 p.m. UTC | #6
On Thu, Feb 16, 2023 at 07:23:17PM +0100, David Hildenbrand wrote:
> On 16.02.23 18:55, Peter Xu wrote:
> > On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote:
> > > > > 
> > > > > There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
> > > > > PTE markers, would be more benficial:
> > > > > 
> > > > > 1) It would be applicable to anon hugetlb
> > > > 
> > > > Anon hugetlb should already work with non ptes with the markers?
> > > > 
> > > 
> > > ... really? I thought we'd do the whole pte marker handling only when
> > > dealing with hugetlb/shmem. Interesting, thanks. (we could skip population
> > > in QEMU in that case as well -- we always do it for now)
> > 
> > Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :)
> 
> I mean especially MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS, so "in theory"
> without any fd and thus pagecache. ... but anon hugetlb keeps confusing me
> with pagecache handling.

IIUC when mmap(fd==-1) it's the same as MAP_PRIVATE|MAP_HUGETLB.

> 
> > > 
> > > > > 2) It would be applicable even when the zeropage is disallowed
> > > > >      (mm_forbids_zeropage())
> > > > 
> > > > Do you mean s390 can disable zeropage with mm_uses_skeys()?  So far uffd-wp
> > > > doesn't support s390 yet, I'm not sure whether we over worried on this
> > > > effect.
> > > > 
> > > > Or is there any other projects / ideas that potentially can enlarge forbid
> > > > zero pages to more contexts?
> > > 
> > > I think it was shown that zeropages can be used to build covert channels
> > > (similar to memory deduplciation, because it effectively is memory
> > > deduplication). It's mentioned as a note in [1] under VII. A. ("Only
> > > Deduplicate Zero Pages.")
> > > 
> > > 
> > > [1] https://www.ndss-symposium.org/wp-content/uploads/2022-81-paper.pdf
> > 
> > Thanks for the link.  I'm slightly confused how dedup of zero pages is a
> > concern here, though.
> > 
> > IIUC the security risk is when the dedup-ed pages contain valid information
> > so the attacker can measure latency of requests when the attemped malicious
> > page contains exactly the same content of the data page, by trying to
> > detect the CoW from happening. >
> > Here it's the zero page, even if there's CoW difference the data being
> > exposed can only be all zeros?  Then what's the risk?
> 
> The focus of that paper is on CoW latency yes (and deduplication
> instantiating shared zeropages -- but building a covert channel using CoW
> latency might be rather tricky I think, because they will get deduplciated
> independently of a sender action ...).
> 
> However, in theory, one could build a covert channel between two VMs simply
> by using cache flushes and reading from the shared zeropage. Measuring
> access time can reveal if the sender read the page (L3 filled) or not (L3
> not filled).

So the attacker will know when someone reads a zeropage, but I still don't
get how that can leads to data leak..

> 
> Having that said, I don't think that we are going to disable the shared
> zeropage because of that for some workloads, I assume in most cases it will
> simply be way too noisy to transmit any kind of data and we have more
> critical covert channels to sort out if we want to.
> 
> Just wanted to raise it because you asked :)
> 
> > 
> > Another note for s390: when it comes we can consider moving to pte markers
> > conditionally when !zeropage.  But we can leave that for later.
> 
> Sure, we could always have another feature flag.

I think that doesn't need to be another feature flag.  If someone will port
uffd-wp to s390 we can implement pte markers for WP_ZEROPAGE, then we
either use it when zeropage not exist, or we can switch to pte markers
completely too without changing the interface if we want, depending on
whether we think replacing zeropages with pte markers will be a major issue
with existing apps.  I don't worry too much on that part.

> 
> > 
> > > 
> > > > 
> > > > > 3) It would be possible to optimize even without the huge zeropage, by
> > > > >      using a PMD marker.
> > > > 
> > > > This patch doesn't need huge zeropage being exist.
> > > 
> > > Yes, and for that reason I think it may perform worse than what we already
> > > have in some cases. Instead of populating a single PMD you'll have to fill a
> > > full PTE table.
> > 
> > Yes.  If you think that'll worth it, I can conditionally do pmd zero thp in
> > a new version.  Maybe it will be a good intermediate step between
> > introducing pte markers to pmd/pud/etc, so at least we don't need other
> > changes to coordinate pte markers to higher levels.
> 
> [...]
> 
> > > > > Especially when uffd-wp'ing large ranges that are possibly all unpopulated
> > > > > (thinking about the existing VM background snapshot use case either with
> > > > > untouched memory or with things like free page reporting), we might neither
> > > > > be reading or writing that memory any time soon.
> > > > 
> > > > Right, I think that's a trade-off. But I still think large portion of
> > > > totally unpopulated memory should be rare case rather than majority, or am
> > > > I wrong?  Not to mention that requires a more involved changeset to the
> > > > kernel.
> > > > 
> > > > So what I proposed here is the (AFAIU) simplest solution towards providing
> > > > such a feature in a complete form.  I think we have chance to implement it
> > > > in other ways like pte markers, but that's something we can work upon, and
> > > > so far I'm not sure how much benefit we can get out of it yet.
> > > > 
> > > 
> > > What you propose here can already be achieved by user space fairly easily
> > > (in fact, QEMU implementation could be further sped up using
> > > MADV_POPULATE_READ). Usually, we only do that when there are very good
> > > reasons to (performance).
> > 
> > Yes POPULATE_READ will be faster.  This patch should make it even faster,
> > because it merges the two walks.
> > 
> 
> Getting some performance numbers would be nice.

Sure, I'll collect some data and post a new version.

> 
> > > 
> > > Using PTE markers would provide a real advantage IMHO for some users (IMHO
> > > background snapshots), where we might want to avoid populating
> > > zeropages/page tables as best as we can completely if the VM memory is
> > > mostly untouched.
> > > 
> > > Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there
> > > is another good reason to combine the populate zeropage+wp that I am missing
> > > (e.g., atomicity by doing both in one operation)?
> > 
> > It also makes the new WP_ASYNC and pagemap interface clean: we don't want
> > to have user pre-fault it every time too as a common tactic..  It's hard to
> > use, and the user doesn't need to know the internals of why it is needed,
> > either.
> 
> I feel like we're building a lot of infrastructure on uffd-wp instead of
> having an alternative softdirty mode (using a world switch?) that works as
> expected and doesn't require that many uffd-wp extensions. ;)

We used to discuss this WP_ZEROPAGE before, and I thought we were all happy
to have that.  Obviously you changed your mind. :)

I wasn't really eager on this before because the workaround of pre-read
works good already (I assume slightly slower but it's fine; not until
someone starts to worry).  But if we want to extend soft-dirty that's not
good at all to have any new user being requested to prefault memory and
figuring out why it's needed.

> 
> Having that said, I have the feeling that you and Muhammad have a plan to
> make it work using uffd-wp and I won't interfere. It would be nicer to use
> softdirty infrastructure IMHO, though.

Thanks.  If you have any good idea on reusing soft-dirty, please shoot.
I'll be perfectly happy with it as long as it resolves the issue for
Muhammad.  Trust me - I wished the soft dirty thing worked out, but
unfortunately it didn't..  Because at least so far uffd-wp has two major
issues as I can see:

  (1) Memory type limitations (e.g. general fs memories stop working)
  (2) Tracing uffd application is, afaict, impossible

So if there's better way to do with soft-dirty or anything else (and I
assume it'll not be limited to any of above) it's time to say..

> 
> > 
> > The other thing is it provides a way to make anon and !anon behave the same
> > on empty ptes; it's a pity that it was not already like that.
> 
> In an ideal world, we'd simply be using PTE markers unconditionally I think
> and avoid this zeropage feature :/
> 
> Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not
> simply always do that unconditionally? (sure, we have to indicate to user
> space that it now works as expected) Are we really expecting to break user
> space by protecting what was asked for to protect?

I suspect so.

From high level, the major functional changes will be:

  (1) The user will start to receive more WP message with zero page being
      reported,

  (2) Wr-protecting a very sparse memory can be much slower

I would expect there're cases where the app just works as usual.

However in some other cases the user may really not care about zero pages
at all, and I had a feeling that's actually the majority.

Live snapshot is actually special because IIUC the old semantics should
work perfectly if the guest OS won't try to sanity check freed pages being
all zeros..  IOW that's some corner case, and if we can control that we may
not even need WP_ZEROPAGE too for QEMU, iiuc.  For many other apps people
may leverage this (ignoring mem holes) and make the app faster.

Normally when I'm not confident of any functional change, I'd rather use a
flag.  Luckily uffd is very friendly to that, so the user can have better
control of what to expect.  Some future app may explicitly want to always
ignore zero pages when on extremely sparse mem, and without the flag it
can't choose.

> 
> > 
> > We can always optimize this behavior in the future with either
> > PMD/PUD/.. pte markers as you said, but IMHO that just needs further
> > justification on the complexity, and also on whether that's beneficial to
> > the majority to become the default behavior.
> 
> As I said, usually any new features require good justification. Maybe there
> really is a measurable performance gain (less syscalls, less pgtable walks).

Muhammad may have a word to say here; let's see whether he has any comment.

Besides that, as I replied above I'll collect some data in my next post
regardless, with an attempt to optimize with huge zeropages on top.

Thanks,
David Hildenbrand Feb. 17, 2023, 11:41 a.m. UTC | #7
On 16.02.23 21:08, Peter Xu wrote:
>          On Thu, Feb 16, 2023 at 07:23:17PM +0100, David Hildenbrand wrote:
>> On 16.02.23 18:55, Peter Xu wrote:
>>> On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote:
>>>>>>
>>>>>> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
>>>>>> PTE markers, would be more benficial:
>>>>>>
>>>>>> 1) It would be applicable to anon hugetlb
>>>>>
>>>>> Anon hugetlb should already work with non ptes with the markers?
>>>>>
>>>>
>>>> ... really? I thought we'd do the whole pte marker handling only when
>>>> dealing with hugetlb/shmem. Interesting, thanks. (we could skip population
>>>> in QEMU in that case as well -- we always do it for now)
>>>
>>> Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :)
>>
>> I mean especially MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS, so "in theory"
>> without any fd and thus pagecache. ... but anon hugetlb keeps confusing me
>> with pagecache handling.
> 
> IIUC when mmap(fd==-1) it's the same as MAP_PRIVATE|MAP_HUGETLB.

Let me rephrase my original statement: I thought we'd do the whole pte 
marker handling only when dealing with MAP_SHARED hugetlb ("hugetlb 
shared memory (shmem)"). So not on MAP_PRIVATE hugetlb where we might 
have anon hugetlb pages.


>> The focus of that paper is on CoW latency yes (and deduplication
>> instantiating shared zeropages -- but building a covert channel using CoW
>> latency might be rather tricky I think, because they will get deduplciated
>> independently of a sender action ...).
>>
>> However, in theory, one could build a covert channel between two VMs simply
>> by using cache flushes and reading from the shared zeropage. Measuring
>> access time can reveal if the sender read the page (L3 filled) or not (L3
>> not filled).
> 
> So the attacker will know when someone reads a zeropage, but I still don't
> get how that can leads to data leak..

Oh, a covert channel is not for leaking data, but for communicating via 
an unofficial channel (bypassing firewalls etc). I also don't think one 
can really leak data with the shared zeropage ... unless that data is 
all zero :)

>>
>> Having that said, I don't think that we are going to disable the shared
>> zeropage because of that for some workloads, I assume in most cases it will
>> simply be way too noisy to transmit any kind of data and we have more
>> critical covert channels to sort out if we want to.
>>
>> Just wanted to raise it because you asked :)
>>
>>>
>>> Another note for s390: when it comes we can consider moving to pte markers
>>> conditionally when !zeropage.  But we can leave that for later.
>>
>> Sure, we could always have another feature flag.
> 
> I think that doesn't need to be another feature flag.  If someone will port
> uffd-wp to s390 we can implement pte markers for WP_ZEROPAGE, then we
> either use it when zeropage not exist, or we can switch to pte markers
> completely too without changing the interface if we want, depending on
> whether we think replacing zeropages with pte markers will be a major issue
> with existing apps.  I don't worry too much on that part.

Then maybe the feature name/description should be more generic, such 
that it's merely an implementation detail that could change?

"
UFFD_FEATURE_WP_UNPOPULATED: for anonymous memory, if PTEs/PMDs are 
still unpopulated (no page mapped), uffd-wp protection to work will not 
require a previous manual population (e.g., using MADV_POPULATE_READ). 
The kernel might or might not populate the shared zeropage for that 
purpose. So after a uffd-wp protection with UFFD_FEATURE_WP_UNPOPULATED 
enabled, the PTEs/PMDs might or might not be populated.
"

For example, it has to be clear that when doing an uffd-wp protect + 
unprotect, that there could be suddenly zeropages mapped (observable via 
  uffd-missing later, /proc/pagemap).

I'd be fine with something like that.

[...]

>>>> Using PTE markers would provide a real advantage IMHO for some users (IMHO
>>>> background snapshots), where we might want to avoid populating
>>>> zeropages/page tables as best as we can completely if the VM memory is
>>>> mostly untouched.
>>>>
>>>> Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there
>>>> is another good reason to combine the populate zeropage+wp that I am missing
>>>> (e.g., atomicity by doing both in one operation)?
>>>
>>> It also makes the new WP_ASYNC and pagemap interface clean: we don't want
>>> to have user pre-fault it every time too as a common tactic..  It's hard to
>>> use, and the user doesn't need to know the internals of why it is needed,
>>> either.
>>
>> I feel like we're building a lot of infrastructure on uffd-wp instead of
>> having an alternative softdirty mode (using a world switch?) that works as
>> expected and doesn't require that many uffd-wp extensions. ;)
> 
> We used to discuss this WP_ZEROPAGE before, and I thought we were all happy
> to have that.  Obviously you changed your mind. :)

As I said, I'd be sold on the PTE marker idea, or designing the feature 
more generic that it is an implementation detail. :)

The whole "let's place zeropages" is rather a hack and IMHO suboptimal.

For example, when uffd-wp unprotecting again (e.g., after a VM 
snapshot), we'd be left with zeropages mapped. Getting rid of them (for 
example, to reclaim page tables), will require TLB flushes, mmu 
notifiers, because we had present PTEs. In comparison, the nice thing 
about a PTE marker is that it is !present and you can just rip it out.

Similarly, with zeropages (as in your current patch), getting a THP 
later allocated requires going through khugepaged. In comparison, a PMD 
marker could more easily avoid that. The huge zeropage can work around 
that, but you'd still need an MADV_DONTNEED on the hole huge zeropage 
first to remove it, in order to replace it with a "real" THP.

> 
> I wasn't really eager on this before because the workaround of pre-read
> works good already (I assume slightly slower but it's fine; not until
> someone starts to worry).  But if we want to extend soft-dirty that's not
> good at all to have any new user being requested to prefault memory and
> figuring out why it's needed.
> 
>>
>> Having that said, I have the feeling that you and Muhammad have a plan to
>> make it work using uffd-wp and I won't interfere. It would be nicer to use
>> softdirty infrastructure IMHO, though.
> 
> Thanks.  If you have any good idea on reusing soft-dirty, please shoot.
> I'll be perfectly happy with it as long as it resolves the issue for
> Muhammad.  Trust me - I wished the soft dirty thing worked out, but
> unfortunately it didn't..  Because at least so far uffd-wp has two major
> issues as I can see:
> 
>    (1) Memory type limitations (e.g. general fs memories stop working)
>    (2) Tracing uffd application is, afaict, impossible

I guess the nice thing is, that you can only track individual ranges, 
you don't have to enable it for the whole process. I assume softdirty 
tracking could be similarly extended, but with uffd-wp this comes naturally.

> 
> So if there's better way to do with soft-dirty or anything else (and I
> assume it'll not be limited to any of above) it's time to say..

I started discussing that [1] but as nobody seemed to care about my 
input I decided to not further spend my time on that. But maybe it's a 
more fundamental issue we'd have to solve to get this clean.

The problem I had with the original approach is that it required precise 
softdirty tracking even when nobody cared about softdirty tracking, and 
that it made wrong assumptions about the meaning of VM_SOFTDIRTY. We 
shouldn't have to worry about any of that if it's disabled (just like 
with uffd-wp).


The semantical difference I see is that with uffd-wp, initially 
(pte_none()) all PTEs are "dirty". With soft-dirty, initially 
(pte_none()) all PTEs are "clean". We work around that with 
VM_SOFTIDRTY, which sets all PTEs of a VMA effectively dirty.


Maybe we should rethink that approach and logically track soft-clean 
instead. That is, everything is assumed to be soft-dirty until we set it 
clean on a PTE/PMD/PUD level. Setting something clean is wp'ing a PTE 
and marking it soft-clean (which is, clearing the soft-dirty bit logically).

The special case is when we don't have anything mapped yet, which is the 
same thing we are trying to handle AFAICS for uffd-wp here. A PTE 
(pte_none()) in which case we have to install a soft-dirty PTE/PMD 
marker to remember that we marked it as clean -- or map the shared 
zeropage to mark that one clean (not set the soft-dirty bit).

Further, I would propose to have VM_SOFTDIRTY_ENABLED flags per VMA and 
interfaces to (a) enable/disable it either for some VMAs or the whole MM 
and (b) a process toggle to automatically enable softclean tracking on 
new VMAs. So we can really only care about it when someone cares about 
softdirty tracking. The old "VM_SOFTDIRTY" flag could go, because 
everything (pte_none()) starts out softdirty. So VM_SOFTDIRTY -> 
VM_SOFTDIRTY_ENABLED with changed semantics.

Enabling softdirty-tracking on a VMA might have to go over PTEs, to make 
sure we really don't have any soft-clean leftovers due to imprecise 
soft-dirty tracking on VMA level while it was disabled (setting all PTEs 
soft-dirty if they not already are). Might require a thought if it's 
really required.

Note that I think this resembles what we are doing with uffd-wp. Not 
sure if we'd still have to handle some unmap handling on pagecache 
pages. We might want to remember that they are soft-clean using a PTE 
marker.


Requires some more thought, but I'm going to throw it out here that I 
think there might be ways to modify softdirty tracking.

>>>
>>> The other thing is it provides a way to make anon and !anon behave the same
>>> on empty ptes; it's a pity that it was not already like that.
>>
>> In an ideal world, we'd simply be using PTE markers unconditionally I think
>> and avoid this zeropage feature :/
>>
>> Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not
>> simply always do that unconditionally? (sure, we have to indicate to user
>> space that it now works as expected) Are we really expecting to break user
>> space by protecting what was asked for to protect?
> 
> I suspect so.
> 
>  From high level, the major functional changes will be:
> 
>    (1) The user will start to receive more WP message with zero page being
>        reported,
> 
>    (2) Wr-protecting a very sparse memory can be much slower
> 
> I would expect there're cases where the app just works as usual.
> 
> However in some other cases the user may really not care about zero pages
> at all, and I had a feeling that's actually the majority.
> 
> Live snapshot is actually special because IIUC the old semantics should
> work perfectly if the guest OS won't try to sanity check freed pages being
> all zeros..  IOW that's some corner case, and if we can control that we may
> not even need WP_ZEROPAGE too for QEMU, iiuc.  For many other apps people
> may leverage this (ignoring mem holes) and make the app faster.
> 
> Normally when I'm not confident of any functional change, I'd rather use a
> flag.  Luckily uffd is very friendly to that, so the user can have better
> control of what to expect.  Some future app may explicitly want to always
> ignore zero pages when on extremely sparse mem, and without the flag it
> can't choose.
> 
>>
>>>
>>> We can always optimize this behavior in the future with either
>>> PMD/PUD/.. pte markers as you said, but IMHO that just needs further
>>> justification on the complexity, and also on whether that's beneficial to
>>> the majority to become the default behavior.
>>
>> As I said, usually any new features require good justification. Maybe there
>> really is a measurable performance gain (less syscalls, less pgtable walks).
> 
> Muhammad may have a word to say here; let's see whether he has any comment.
> 
> Besides that, as I replied above I'll collect some data in my next post
> regardless, with an attempt to optimize with huge zeropages on top.

If we can agree on making the shared zeropage an implementation detail, 
that would be great and I'd see long-term value in that.

[1] 
https://lkml.kernel.org/r/d95d59d7-308d-831c-d8bd-16d06e66e8af@redhat.com
Muhammad Usama Anjum Feb. 17, 2023, 12:31 p.m. UTC | #8
On 2/17/23 1:08 AM, Peter Xu wrote:
>         On Thu, Feb 16, 2023 at 07:23:17PM +0100, David Hildenbrand wrote:
>> On 16.02.23 18:55, Peter Xu wrote:
>>> On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote:
>>>>>>
>>>>>> There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
>>>>>> PTE markers, would be more benficial:
>>>>>>
>>>>>> 1) It would be applicable to anon hugetlb
>>>>>
>>>>> Anon hugetlb should already work with non ptes with the markers?
>>>>>
>>>>
>>>> ... really? I thought we'd do the whole pte marker handling only when
>>>> dealing with hugetlb/shmem. Interesting, thanks. (we could skip population
>>>> in QEMU in that case as well -- we always do it for now)
>>>
>>> Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :)
>>
>> I mean especially MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS, so "in theory"
>> without any fd and thus pagecache. ... but anon hugetlb keeps confusing me
>> with pagecache handling.
> 
> IIUC when mmap(fd==-1) it's the same as MAP_PRIVATE|MAP_HUGETLB.
> 
>>
>>>>
>>>>>> 2) It would be applicable even when the zeropage is disallowed
>>>>>>      (mm_forbids_zeropage())
>>>>>
>>>>> Do you mean s390 can disable zeropage with mm_uses_skeys()?  So far uffd-wp
>>>>> doesn't support s390 yet, I'm not sure whether we over worried on this
>>>>> effect.
>>>>>
>>>>> Or is there any other projects / ideas that potentially can enlarge forbid
>>>>> zero pages to more contexts?
>>>>
>>>> I think it was shown that zeropages can be used to build covert channels
>>>> (similar to memory deduplciation, because it effectively is memory
>>>> deduplication). It's mentioned as a note in [1] under VII. A. ("Only
>>>> Deduplicate Zero Pages.")
>>>>
>>>>
>>>> [1] https://www.ndss-symposium.org/wp-content/uploads/2022-81-paper.pdf
>>>
>>> Thanks for the link.  I'm slightly confused how dedup of zero pages is a
>>> concern here, though.
>>>
>>> IIUC the security risk is when the dedup-ed pages contain valid information
>>> so the attacker can measure latency of requests when the attemped malicious
>>> page contains exactly the same content of the data page, by trying to
>>> detect the CoW from happening. >
>>> Here it's the zero page, even if there's CoW difference the data being
>>> exposed can only be all zeros?  Then what's the risk?
>>
>> The focus of that paper is on CoW latency yes (and deduplication
>> instantiating shared zeropages -- but building a covert channel using CoW
>> latency might be rather tricky I think, because they will get deduplciated
>> independently of a sender action ...).
>>
>> However, in theory, one could build a covert channel between two VMs simply
>> by using cache flushes and reading from the shared zeropage. Measuring
>> access time can reveal if the sender read the page (L3 filled) or not (L3
>> not filled).
> 
> So the attacker will know when someone reads a zeropage, but I still don't
> get how that can leads to data leak..
> 
>>
>> Having that said, I don't think that we are going to disable the shared
>> zeropage because of that for some workloads, I assume in most cases it will
>> simply be way too noisy to transmit any kind of data and we have more
>> critical covert channels to sort out if we want to.
>>
>> Just wanted to raise it because you asked :)
>>
>>>
>>> Another note for s390: when it comes we can consider moving to pte markers
>>> conditionally when !zeropage.  But we can leave that for later.
>>
>> Sure, we could always have another feature flag.
> 
> I think that doesn't need to be another feature flag.  If someone will port
> uffd-wp to s390 we can implement pte markers for WP_ZEROPAGE, then we
> either use it when zeropage not exist, or we can switch to pte markers
> completely too without changing the interface if we want, depending on
> whether we think replacing zeropages with pte markers will be a major issue
> with existing apps.  I don't worry too much on that part.
> 
>>
>>>
>>>>
>>>>>
>>>>>> 3) It would be possible to optimize even without the huge zeropage, by
>>>>>>      using a PMD marker.
>>>>>
>>>>> This patch doesn't need huge zeropage being exist.
>>>>
>>>> Yes, and for that reason I think it may perform worse than what we already
>>>> have in some cases. Instead of populating a single PMD you'll have to fill a
>>>> full PTE table.
>>>
>>> Yes.  If you think that'll worth it, I can conditionally do pmd zero thp in
>>> a new version.  Maybe it will be a good intermediate step between
>>> introducing pte markers to pmd/pud/etc, so at least we don't need other
>>> changes to coordinate pte markers to higher levels.
>>
>> [...]
>>
>>>>>> Especially when uffd-wp'ing large ranges that are possibly all unpopulated
>>>>>> (thinking about the existing VM background snapshot use case either with
>>>>>> untouched memory or with things like free page reporting), we might neither
>>>>>> be reading or writing that memory any time soon.
>>>>>
>>>>> Right, I think that's a trade-off. But I still think large portion of
>>>>> totally unpopulated memory should be rare case rather than majority, or am
>>>>> I wrong?  Not to mention that requires a more involved changeset to the
>>>>> kernel.
>>>>>
>>>>> So what I proposed here is the (AFAIU) simplest solution towards providing
>>>>> such a feature in a complete form.  I think we have chance to implement it
>>>>> in other ways like pte markers, but that's something we can work upon, and
>>>>> so far I'm not sure how much benefit we can get out of it yet.
>>>>>
>>>>
>>>> What you propose here can already be achieved by user space fairly easily
>>>> (in fact, QEMU implementation could be further sped up using
>>>> MADV_POPULATE_READ). Usually, we only do that when there are very good
>>>> reasons to (performance).
>>>
>>> Yes POPULATE_READ will be faster.  This patch should make it even faster,
>>> because it merges the two walks.
>>>
>>
>> Getting some performance numbers would be nice.
> 
> Sure, I'll collect some data and post a new version.
> 
>>
>>>>
>>>> Using PTE markers would provide a real advantage IMHO for some users (IMHO
>>>> background snapshots), where we might want to avoid populating
>>>> zeropages/page tables as best as we can completely if the VM memory is
>>>> mostly untouched.
>>>>
>>>> Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there
>>>> is another good reason to combine the populate zeropage+wp that I am missing
>>>> (e.g., atomicity by doing both in one operation)?
>>>
>>> It also makes the new WP_ASYNC and pagemap interface clean: we don't want
>>> to have user pre-fault it every time too as a common tactic..  It's hard to
>>> use, and the user doesn't need to know the internals of why it is needed,
>>> either.
>>
>> I feel like we're building a lot of infrastructure on uffd-wp instead of
>> having an alternative softdirty mode (using a world switch?) that works as
>> expected and doesn't require that many uffd-wp extensions. ;)
> 
> We used to discuss this WP_ZEROPAGE before, and I thought we were all happy
> to have that.  Obviously you changed your mind. :)
> 
> I wasn't really eager on this before because the workaround of pre-read
> works good already (I assume slightly slower but it's fine; not until
> someone starts to worry).  But if we want to extend soft-dirty that's not
> good at all to have any new user being requested to prefault memory and
> figuring out why it's needed.
> 
>>
>> Having that said, I have the feeling that you and Muhammad have a plan to
>> make it work using uffd-wp and I won't interfere. It would be nicer to use
>> softdirty infrastructure IMHO, though.
> 
> Thanks.  If you have any good idea on reusing soft-dirty, please shoot.
> I'll be perfectly happy with it as long as it resolves the issue for
> Muhammad.  Trust me - I wished the soft dirty thing worked out, but
> unfortunately it didn't..  Because at least so far uffd-wp has two major
> issues as I can see:
> 
>   (1) Memory type limitations (e.g. general fs memories stop working)
>   (2) Tracing uffd application is, afaict, impossible
> 
> So if there's better way to do with soft-dirty or anything else (and I
> assume it'll not be limited to any of above) it's time to say..
I'm on the same boat with you guys about soft-dirty feature. But I'm very
happy after shifting to UFFD WP. We are closer than ever to complete the
feature.

> 
>>
>>>
>>> The other thing is it provides a way to make anon and !anon behave the same
>>> on empty ptes; it's a pity that it was not already like that.
As you must have guessed, we are looking for same behavior on anon and !anon.

>>
>> In an ideal world, we'd simply be using PTE markers unconditionally I think
>> and avoid this zeropage feature :/
>>
>> Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not
>> simply always do that unconditionally? (sure, we have to indicate to user
>> space that it now works as expected) Are we really expecting to break user
>> space by protecting what was asked for to protect?
> 
> I suspect so.
> 
> From high level, the major functional changes will be:
> 
>   (1) The user will start to receive more WP message with zero page being
>       reported,
> 
>   (2) Wr-protecting a very sparse memory can be much slower
> 
> I would expect there're cases where the app just works as usual.
> 
> However in some other cases the user may really not care about zero pages
> at all, and I had a feeling that's actually the majority.
> 
> Live snapshot is actually special because IIUC the old semantics should
> work perfectly if the guest OS won't try to sanity check freed pages being
> all zeros..  IOW that's some corner case, and if we can control that we may
> not even need WP_ZEROPAGE too for QEMU, iiuc.  For many other apps people
> may leverage this (ignoring mem holes) and make the app faster.
> 
> Normally when I'm not confident of any functional change, I'd rather use a
> flag.  Luckily uffd is very friendly to that, so the user can have better
> control of what to expect.  Some future app may explicitly want to always
> ignore zero pages when on extremely sparse mem, and without the flag it
> can't choose.
> 
>>
>>>
>>> We can always optimize this behavior in the future with either
>>> PMD/PUD/.. pte markers as you said, but IMHO that just needs further
>>> justification on the complexity, and also on whether that's beneficial to
>>> the majority to become the default behavior.
>>
>> As I said, usually any new features require good justification. Maybe there
>> really is a measurable performance gain (less syscalls, less pgtable walks).
> 
> Muhammad may have a word to say here; let's see whether he has any comment.
> 
> Besides that, as I replied above I'll collect some data in my next post
> regardless, with an attempt to optimize with huge zeropages on top.
I've just ran my single threaded selftest [1] over an over again to get
some numbers.

Without zeropage
qemu has 6 cores: 26.0355
With zeropage
qemu has 6 cores: 39.203

33% worse performance with zero pages

Definitely, there can be better benchmark application. Please let me know
if I should write better benchmarks on my end.

[1]
https://lore.kernel.org/all/20230202112915.867409-7-usama.anjum@collabora.com

> 
> Thanks,
>
Peter Xu Feb. 17, 2023, 11:04 p.m. UTC | #9
On Fri, Feb 17, 2023 at 12:41:40PM +0100, David Hildenbrand wrote:
> On 16.02.23 21:08, Peter Xu wrote:
> >          On Thu, Feb 16, 2023 at 07:23:17PM +0100, David Hildenbrand wrote:
> > > On 16.02.23 18:55, Peter Xu wrote:
> > > > On Thu, Feb 16, 2023 at 06:00:51PM +0100, David Hildenbrand wrote:
> > > > > > > 
> > > > > > > There are various reasons why I think a UFFD_FEATURE_WP_UNPOPULATED, using
> > > > > > > PTE markers, would be more benficial:
> > > > > > > 
> > > > > > > 1) It would be applicable to anon hugetlb
> > > > > > 
> > > > > > Anon hugetlb should already work with non ptes with the markers?
> > > > > > 
> > > > > 
> > > > > ... really? I thought we'd do the whole pte marker handling only when
> > > > > dealing with hugetlb/shmem. Interesting, thanks. (we could skip population
> > > > > in QEMU in that case as well -- we always do it for now)
> > > > 
> > > > Hmm, you're talking about "anon hugetlb", so it's still hugetlb, right? :)
> > > 
> > > I mean especially MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS, so "in theory"
> > > without any fd and thus pagecache. ... but anon hugetlb keeps confusing me
> > > with pagecache handling.
> > 
> > IIUC when mmap(fd==-1) it's the same as MAP_PRIVATE|MAP_HUGETLB.
> 
> Let me rephrase my original statement: I thought we'd do the whole pte
> marker handling only when dealing with MAP_SHARED hugetlb ("hugetlb shared
> memory (shmem)"). So not on MAP_PRIVATE hugetlb where we might have anon
> hugetlb pages.

Okay, then we're good - I believe private was handled there.

> 
> 
> > > The focus of that paper is on CoW latency yes (and deduplication
> > > instantiating shared zeropages -- but building a covert channel using CoW
> > > latency might be rather tricky I think, because they will get deduplciated
> > > independently of a sender action ...).
> > > 
> > > However, in theory, one could build a covert channel between two VMs simply
> > > by using cache flushes and reading from the shared zeropage. Measuring
> > > access time can reveal if the sender read the page (L3 filled) or not (L3
> > > not filled).
> > 
> > So the attacker will know when someone reads a zeropage, but I still don't
> > get how that can leads to data leak..
> 
> Oh, a covert channel is not for leaking data, but for communicating via an
> unofficial channel (bypassing firewalls etc). I also don't think one can
> really leak data with the shared zeropage ... unless that data is all zero
> :)

I see.

> 
> > > 
> > > Having that said, I don't think that we are going to disable the shared
> > > zeropage because of that for some workloads, I assume in most cases it will
> > > simply be way too noisy to transmit any kind of data and we have more
> > > critical covert channels to sort out if we want to.
> > > 
> > > Just wanted to raise it because you asked :)
> > > 
> > > > 
> > > > Another note for s390: when it comes we can consider moving to pte markers
> > > > conditionally when !zeropage.  But we can leave that for later.
> > > 
> > > Sure, we could always have another feature flag.
> > 
> > I think that doesn't need to be another feature flag.  If someone will port
> > uffd-wp to s390 we can implement pte markers for WP_ZEROPAGE, then we
> > either use it when zeropage not exist, or we can switch to pte markers
> > completely too without changing the interface if we want, depending on
> > whether we think replacing zeropages with pte markers will be a major issue
> > with existing apps.  I don't worry too much on that part.
> 
> Then maybe the feature name/description should be more generic, such that
> it's merely an implementation detail that could change?
> 
> "
> UFFD_FEATURE_WP_UNPOPULATED: for anonymous memory, if PTEs/PMDs are still
> unpopulated (no page mapped), uffd-wp protection to work will not require a
> previous manual population (e.g., using MADV_POPULATE_READ). The kernel
> might or might not populate the shared zeropage for that purpose. So after a
> uffd-wp protection with UFFD_FEATURE_WP_UNPOPULATED enabled, the PTEs/PMDs
> might or might not be populated.
> "

I would really hope we don't mention "manual population" in the man page or
header comments for app developers.

That's "how to workaound some issue", I'll read it as "with flag XXX you
don't need to work around (issue YYY) by doing ZZZ".

It's much more straightforward to state what the flag will do.  The
workaround can be documented somewhere but that shouldn't be in the general
description of the feature.

I don't have a strong opinion on the name, though.  I guess you would like
the new feature to avoid using the term zeropage; I'm fine with changing it.

> 
> For example, it has to be clear that when doing an uffd-wp protect +
> unprotect, that there could be suddenly zeropages mapped (observable via
> uffd-missing later, /proc/pagemap).
> 
> I'd be fine with something like that.
> 
> [...]
> 
> > > > > Using PTE markers would provide a real advantage IMHO for some users (IMHO
> > > > > background snapshots), where we might want to avoid populating
> > > > > zeropages/page tables as best as we can completely if the VM memory is
> > > > > mostly untouched.
> > > > > 
> > > > > Naturally, I wonder if UFFD_FEATURE_WP_ZEROPAGE is really worth it. Is there
> > > > > is another good reason to combine the populate zeropage+wp that I am missing
> > > > > (e.g., atomicity by doing both in one operation)?
> > > > 
> > > > It also makes the new WP_ASYNC and pagemap interface clean: we don't want
> > > > to have user pre-fault it every time too as a common tactic..  It's hard to
> > > > use, and the user doesn't need to know the internals of why it is needed,
> > > > either.
> > > 
> > > I feel like we're building a lot of infrastructure on uffd-wp instead of
> > > having an alternative softdirty mode (using a world switch?) that works as
> > > expected and doesn't require that many uffd-wp extensions. ;)
> > 
> > We used to discuss this WP_ZEROPAGE before, and I thought we were all happy
> > to have that.  Obviously you changed your mind. :)
> 
> As I said, I'd be sold on the PTE marker idea, or designing the feature more
> generic that it is an implementation detail. :)
> 
> The whole "let's place zeropages" is rather a hack and IMHO suboptimal.
> 
> For example, when uffd-wp unprotecting again (e.g., after a VM snapshot),
> we'd be left with zeropages mapped. Getting rid of them (for example, to
> reclaim page tables), will require TLB flushes, mmu notifiers, because we
> had present PTEs. In comparison, the nice thing about a PTE marker is that
> it is !present and you can just rip it out.

MMU notifiers do not make a difference, afaict, because that's
unconditional and irrelevant of what's changed underneath.

For TLB flush - I'm not sure how large will there be a difference
(e.g. considering tlb batching being there), but it's a very valid point.

> 
> Similarly, with zeropages (as in your current patch), getting a THP later
> allocated requires going through khugepaged. In comparison, a PMD marker
> could more easily avoid that. The huge zeropage can work around that, but
> you'd still need an MADV_DONTNEED on the hole huge zeropage first to remove
> it, in order to replace it with a "real" THP.

IMHO relying on khugepaged is fine.

Note again that the whole wr-protect idea so far is always talking on
PAGE_SIZE.  Any write happened in whatever PMD/PUD marker will go into:

  - A pte marker population storm, which happens _inside_ a page fault of
    the workload thread (rather than the monitor thread),

  - The bigger the pte marker (PUD > PMD > PTE), the higher possibility
    that it'll trigger such a storm as long as any of the page is written.

It means no matter which way we choose, as long as a write happened the
whole THP idea will get lost as long as the tracking is still in progress.

I still think zeropage should be rare in serious productions, I think it's
more likely that above will happen easily in practise and it can have an
negative impact on page faults.  Even if we give more chance of having THPs
in the future, we at least pay something else.  It's hard to tell which is
more worthwhile.

Not to mention that'll introduce more compexity to the kernel besides the
anonymous support, which should still be relatively straightforward.

So I'm fine with switching to pte markers, but I'd leave the PMD+ markers
alone for now.

> 
> > 
> > I wasn't really eager on this before because the workaround of pre-read
> > works good already (I assume slightly slower but it's fine; not until
> > someone starts to worry).  But if we want to extend soft-dirty that's not
> > good at all to have any new user being requested to prefault memory and
> > figuring out why it's needed.
> > 
> > > 
> > > Having that said, I have the feeling that you and Muhammad have a plan to
> > > make it work using uffd-wp and I won't interfere. It would be nicer to use
> > > softdirty infrastructure IMHO, though.
> > 
> > Thanks.  If you have any good idea on reusing soft-dirty, please shoot.
> > I'll be perfectly happy with it as long as it resolves the issue for
> > Muhammad.  Trust me - I wished the soft dirty thing worked out, but
> > unfortunately it didn't..  Because at least so far uffd-wp has two major
> > issues as I can see:
> > 
> >    (1) Memory type limitations (e.g. general fs memories stop working)

[1]

> >    (2) Tracing uffd application is, afaict, impossible
> 
> I guess the nice thing is, that you can only track individual ranges, you
> don't have to enable it for the whole process. I assume softdirty tracking
> could be similarly extended, but with uffd-wp this comes naturally.
> 
> > 
> > So if there's better way to do with soft-dirty or anything else (and I
> > assume it'll not be limited to any of above) it's time to say..
> 
> I started discussing that [1] but as nobody seemed to care about my input I
> decided to not further spend my time on that. But maybe it's a more
> fundamental issue we'd have to solve to get this clean.

The idea of merging VMAs during clear_refs sounds interesting.

> 
> The problem I had with the original approach is that it required precise
> softdirty tracking even when nobody cared about softdirty tracking, and that
> it made wrong assumptions about the meaning of VM_SOFTDIRTY. We shouldn't
> have to worry about any of that if it's disabled (just like with uffd-wp).
> 
> 
> The semantical difference I see is that with uffd-wp, initially (pte_none())
> all PTEs are "dirty". With soft-dirty, initially (pte_none()) all PTEs are
> "clean". We work around that with VM_SOFTIDRTY, which sets all PTEs of a VMA
> effectively dirty.
> 
> 
> Maybe we should rethink that approach and logically track soft-clean
> instead. That is, everything is assumed to be soft-dirty until we set it
> clean on a PTE/PMD/PUD level. Setting something clean is wp'ing a PTE and
> marking it soft-clean (which is, clearing the soft-dirty bit logically).

To me both solutions should be mostly the same but just inverted bits.  The
bits (for async) doesn't make sense before the "trigger" is pulled, then
it's about 0/1 with inverted meanings to me.  It's just that soft-dirty
will naturally work well with none ptes but uffd-wp is not (since none pte
has all bits 0).

> 
> The special case is when we don't have anything mapped yet, which is the
> same thing we are trying to handle AFAICS for uffd-wp here. A PTE
> (pte_none()) in which case we have to install a soft-dirty PTE/PMD marker to
> remember that we marked it as clean -- or map the shared zeropage to mark
> that one clean (not set the soft-dirty bit).

Yes, soft-dirty is another valid user to pte markers; that's also in my
very original proposal of it.  If we can have async-uffd-wp then it's
easier because the uffd-wp marker will simply be reused naturally.

> 
> Further, I would propose to have VM_SOFTDIRTY_ENABLED flags per VMA and
> interfaces to (a) enable/disable it either for some VMAs or the whole MM and
> (b) a process toggle to automatically enable softclean tracking on new VMAs.
> So we can really only care about it when someone cares about softdirty
> tracking. The old "VM_SOFTDIRTY" flag could go, because everything
> (pte_none()) starts out softdirty. So VM_SOFTDIRTY -> VM_SOFTDIRTY_ENABLED
> with changed semantics.
> 
> Enabling softdirty-tracking on a VMA might have to go over PTEs, to make
> sure we really don't have any soft-clean leftovers due to imprecise
> soft-dirty tracking on VMA level while it was disabled (setting all PTEs
> soft-dirty if they not already are). Might require a thought if it's really
> required.
> 
> Note that I think this resembles what we are doing with uffd-wp. Not sure if
> we'd still have to handle some unmap handling on pagecache pages. We might
> want to remember that they are soft-clean using a PTE marker.
> 
> 
> Requires some more thought, but I'm going to throw it out here that I think
> there might be ways to modify softdirty tracking.

As you mentioned, I think the new proposal is growing into uffd-wp-alike.
I think that's (maybe) also another reason why reusing uffd here is a good
idea, because we don't need to reinvent everything.

Need a new vma flag to identify from old?  Uffd-wp already has it.

What happens if shmem swapped out?  Uffd-wp handles it.

Need to be accurate and walk the pgtables?  Uffd-wp does it..

One thing I didn't mention before (mostly referring to the 1st major
"defect" of using uffd-wp above I said [1] on memory types): _maybe_ we can
someday extend at least async mode of uffd-wp to all memory types, so it'll
even get everything covered.  So far I don't see a strong requirement of
doing so, but I don't see a major blocker either.

While the other "uffd cannot be nested" defect is actually the same to
soft-dirty (no way to have a tracee being able to clear_refs itself or
it'll also go a mess), it's just that we can still use soft-dirty to track
an uffd application.

> 
> > > > 
> > > > The other thing is it provides a way to make anon and !anon behave the same
> > > > on empty ptes; it's a pity that it was not already like that.
> > > 
> > > In an ideal world, we'd simply be using PTE markers unconditionally I think
> > > and avoid this zeropage feature :/
> > > 
> > > Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not
> > > simply always do that unconditionally? (sure, we have to indicate to user
> > > space that it now works as expected) Are we really expecting to break user
> > > space by protecting what was asked for to protect?
> > 
> > I suspect so.
> > 
> >  From high level, the major functional changes will be:
> > 
> >    (1) The user will start to receive more WP message with zero page being
> >        reported,
> > 
> >    (2) Wr-protecting a very sparse memory can be much slower
> > 
> > I would expect there're cases where the app just works as usual.
> > 
> > However in some other cases the user may really not care about zero pages
> > at all, and I had a feeling that's actually the majority.
> > 
> > Live snapshot is actually special because IIUC the old semantics should
> > work perfectly if the guest OS won't try to sanity check freed pages being
> > all zeros..  IOW that's some corner case, and if we can control that we may
> > not even need WP_ZEROPAGE too for QEMU, iiuc.  For many other apps people
> > may leverage this (ignoring mem holes) and make the app faster.
> > 
> > Normally when I'm not confident of any functional change, I'd rather use a
> > flag.  Luckily uffd is very friendly to that, so the user can have better
> > control of what to expect.  Some future app may explicitly want to always
> > ignore zero pages when on extremely sparse mem, and without the flag it
> > can't choose.
> > 
> > > 
> > > > 
> > > > We can always optimize this behavior in the future with either
> > > > PMD/PUD/.. pte markers as you said, but IMHO that just needs further
> > > > justification on the complexity, and also on whether that's beneficial to
> > > > the majority to become the default behavior.
> > > 
> > > As I said, usually any new features require good justification. Maybe there
> > > really is a measurable performance gain (less syscalls, less pgtable walks).
> > 
> > Muhammad may have a word to say here; let's see whether he has any comment.
> > 
> > Besides that, as I replied above I'll collect some data in my next post
> > regardless, with an attempt to optimize with huge zeropages on top.
> 
> If we can agree on making the shared zeropage an implementation detail, that
> would be great and I'd see long-term value in that.
> 
> [1]
> https://lkml.kernel.org/r/d95d59d7-308d-831c-d8bd-16d06e66e8af@redhat.com

So I assume there's no major issue to not continue with a new version, then
I'll move on.

Thanks,
Peter Xu Feb. 17, 2023, 11:10 p.m. UTC | #10
Hi, Muhammad,

On Fri, Feb 17, 2023 at 05:31:19PM +0500, Muhammad Usama Anjum wrote:
> I've just ran my single threaded selftest [1] over an over again to get
> some numbers.
> 
> Without zeropage
> qemu has 6 cores: 26.0355

Did you count in the time of read prefault?  Or did you not prefault at
all?

> With zeropage
> qemu has 6 cores: 39.203
> 
> 33% worse performance with zero pages
> 
> Definitely, there can be better benchmark application. Please let me know
> if I should write better benchmarks on my end.
> 
> [1]
> https://lore.kernel.org/all/20230202112915.867409-7-usama.anjum@collabora.com

I'll have a closer look too next week.

Thanks,
Muhammad Usama Anjum Feb. 20, 2023, 7:15 a.m. UTC | #11
Hi Peter,

Thank you so much for working on this.

On 2/18/23 4:10 AM, Peter Xu wrote:
> Hi, Muhammad,
> 
> On Fri, Feb 17, 2023 at 05:31:19PM +0500, Muhammad Usama Anjum wrote:
>> I've just ran my single threaded selftest [1] over an over again to get
>> some numbers.
>>
>> Without zeropage
>> qemu has 6 cores: 26.0355
> 
> Did you count in the time of read prefault?  Or did you not prefault at
> all?
No, pre-faulting is not being done in both of the runs.

Without zeropage, I'm checking pte_none() to decide if page is dirty.
With zeropage, I'm just checking if WP flag isn't set to decide if page is
dirty.

> 
>> With zeropage
>> qemu has 6 cores: 39.203
>>
>> 33% worse performance with zero pages
>>
>> Definitely, there can be better benchmark application. Please let me know
>> if I should write better benchmarks on my end.
>>
>> [1]
>> https://lore.kernel.org/all/20230202112915.867409-7-usama.anjum@collabora.com
> 
> I'll have a closer look too next week.
> 
> Thanks,
>
David Hildenbrand Feb. 21, 2023, 12:43 p.m. UTC | #12
>> "
>> UFFD_FEATURE_WP_UNPOPULATED: for anonymous memory, if PTEs/PMDs are still
>> unpopulated (no page mapped), uffd-wp protection to work will not require a
>> previous manual population (e.g., using MADV_POPULATE_READ). The kernel
>> might or might not populate the shared zeropage for that purpose. So after a
>> uffd-wp protection with UFFD_FEATURE_WP_UNPOPULATED enabled, the PTEs/PMDs
>> might or might not be populated.
>> "
> 
> I would really hope we don't mention "manual population" in the man page or
> header comments for app developers.
> 
> That's "how to workaound some issue", I'll read it as "with flag XXX you
> don't need to work around (issue YYY) by doing ZZZ".
> 
> It's much more straightforward to state what the flag will do.  The
> workaround can be documented somewhere but that shouldn't be in the general
> description of the feature.
> 
> I don't have a strong opinion on the name, though.  I guess you would like
> the new feature to avoid using the term zeropage; I'm fine with changing it.
> 
>>
>> For example, it has to be clear that when doing an uffd-wp protect +
>> unprotect, that there could be suddenly zeropages mapped (observable via
>> uffd-missing later, /proc/pagemap).
>>
>> I'd be fine with something like that.
>>
>> [...]
>>

[...]

>> Similarly, with zeropages (as in your current patch), getting a THP later
>> allocated requires going through khugepaged. In comparison, a PMD marker
>> could more easily avoid that. The huge zeropage can work around that, but
>> you'd still need an MADV_DONTNEED on the hole huge zeropage first to remove
>> it, in order to replace it with a "real" THP.
> 
> IMHO relying on khugepaged is fine.
> 
> Note again that the whole wr-protect idea so far is always talking on
> PAGE_SIZE.  Any write happened in whatever PMD/PUD marker will go into:
> 
>    - A pte marker population storm, which happens _inside_ a page fault of
>      the workload thread (rather than the monitor thread),
> 
>    - The bigger the pte marker (PUD > PMD > PTE), the higher possibility
>      that it'll trigger such a storm as long as any of the page is written.
> 
> It means no matter which way we choose, as long as a write happened the
> whole THP idea will get lost as long as the tracking is still in progress.
> 
> I still think zeropage should be rare in serious productions, I think it's
> more likely that above will happen easily in practise and it can have an
> negative impact on page faults.  Even if we give more chance of having THPs
> in the future, we at least pay something else.  It's hard to tell which is
> more worthwhile.
> 
> Not to mention that'll introduce more compexity to the kernel besides the
> anonymous support, which should still be relatively straightforward.
> 
> So I'm fine with switching to pte markers, but I'd leave the PMD+ markers
> alone for now.
> 
[...]

>>
>>>
>>> I wasn't really eager on this before because the workaround of pre-read
>>> works good already (I assume slightly slower but it's fine; not until
>>> someone starts to worry).  But if we want to extend soft-dirty that's not
>>> good at all to have any new user being requested to prefault memory and
>>> figuring out why it's needed.
>>>
>>>>
>>>> Having that said, I have the feeling that you and Muhammad have a plan to
>>>> make it work using uffd-wp and I won't interfere. It would be nicer to use
>>>> softdirty infrastructure IMHO, though.
>>>
>>> Thanks.  If you have any good idea on reusing soft-dirty, please shoot.
>>> I'll be perfectly happy with it as long as it resolves the issue for
>>> Muhammad.  Trust me - I wished the soft dirty thing worked out, but
>>> unfortunately it didn't..  Because at least so far uffd-wp has two major
>>> issues as I can see:
>>>
>>>     (1) Memory type limitations (e.g. general fs memories stop working)
> 
> [1]
> 
>>>     (2) Tracing uffd application is, afaict, impossible
>>
>> I guess the nice thing is, that you can only track individual ranges, you
>> don't have to enable it for the whole process. I assume softdirty tracking
>> could be similarly extended, but with uffd-wp this comes naturally.
>>
>>>
>>> So if there's better way to do with soft-dirty or anything else (and I
>>> assume it'll not be limited to any of above) it's time to say..
>>
>> I started discussing that [1] but as nobody seemed to care about my input I
>> decided to not further spend my time on that. But maybe it's a more
>> fundamental issue we'd have to solve to get this clean.
> 
> The idea of merging VMAs during clear_refs sounds interesting.

Yes. the required changes would be fairly minimal. But I guess we'd 
still have to tackle some of the other issues to make it work really 
precise. (e.g., MADV_DONTNEED)

> 
>>
>> The problem I had with the original approach is that it required precise
>> softdirty tracking even when nobody cared about softdirty tracking, and that
>> it made wrong assumptions about the meaning of VM_SOFTDIRTY. We shouldn't
>> have to worry about any of that if it's disabled (just like with uffd-wp).
>>
>>
>> The semantical difference I see is that with uffd-wp, initially (pte_none())
>> all PTEs are "dirty". With soft-dirty, initially (pte_none()) all PTEs are
>> "clean". We work around that with VM_SOFTIDRTY, which sets all PTEs of a VMA
>> effectively dirty.
>>
>>
>> Maybe we should rethink that approach and logically track soft-clean
>> instead. That is, everything is assumed to be soft-dirty until we set it
>> clean on a PTE/PMD/PUD level. Setting something clean is wp'ing a PTE and
>> marking it soft-clean (which is, clearing the soft-dirty bit logically).
> 
> To me both solutions should be mostly the same but just inverted bits.  The
> bits (for async) doesn't make sense before the "trigger" is pulled, then
> it's about 0/1 with inverted meanings to me.  It's just that soft-dirty
> will naturally work well with none ptes but uffd-wp is not (since none pte
> has all bits 0).

I think what we really want to avoid is, creating a new VMA and 
requiring to populate page tables just to set the PTEs softdirty.

The VMA flag is one way, but it might prevent merging as we discovered. 
Changing the semantic of "pte_none()" to mean " dirty" is another one.

> 
>>
>> The special case is when we don't have anything mapped yet, which is the
>> same thing we are trying to handle AFAICS for uffd-wp here. A PTE
>> (pte_none()) in which case we have to install a soft-dirty PTE/PMD marker to
>> remember that we marked it as clean -- or map the shared zeropage to mark
>> that one clean (not set the soft-dirty bit).
> 
> Yes, soft-dirty is another valid user to pte markers; that's also in my
> very original proposal of it.  If we can have async-uffd-wp then it's
> easier because the uffd-wp marker will simply be reused naturally.
> 
>>
>> Further, I would propose to have VM_SOFTDIRTY_ENABLED flags per VMA and
>> interfaces to (a) enable/disable it either for some VMAs or the whole MM and
>> (b) a process toggle to automatically enable softclean tracking on new VMAs.
>> So we can really only care about it when someone cares about softdirty
>> tracking. The old "VM_SOFTDIRTY" flag could go, because everything
>> (pte_none()) starts out softdirty. So VM_SOFTDIRTY -> VM_SOFTDIRTY_ENABLED
>> with changed semantics.
>>
>> Enabling softdirty-tracking on a VMA might have to go over PTEs, to make
>> sure we really don't have any soft-clean leftovers due to imprecise
>> soft-dirty tracking on VMA level while it was disabled (setting all PTEs
>> soft-dirty if they not already are). Might require a thought if it's really
>> required.
>>
>> Note that I think this resembles what we are doing with uffd-wp. Not sure if
>> we'd still have to handle some unmap handling on pagecache pages. We might
>> want to remember that they are soft-clean using a PTE marker.
>>
>>
>> Requires some more thought, but I'm going to throw it out here that I think
>> there might be ways to modify softdirty tracking.
> 
> As you mentioned, I think the new proposal is growing into uffd-wp-alike.
> I think that's (maybe) also another reason why reusing uffd here is a good
> idea, because we don't need to reinvent everything.
> 
> Need a new vma flag to identify from old?  Uffd-wp already has it.
> 
> What happens if shmem swapped out?  Uffd-wp handles it.
> 
> Need to be accurate and walk the pgtables?  Uffd-wp does it..

Simply because we cared about getting it precise for uffd-wp, which 
nobody cared for before for soft-dirty. And yes, there are similar 
issues to be solve.

You are much rather turning uffd-wp with the async mode into a 
soft-dirty replacement, instead using what we learned with uffd-wp to 
make soft-dirty more precise.

Fair enough, I won't interfere. The natural way for me to tackle this 
would be to try fixing soft-dirty instead, or handle the details on how 
soft-dirty is implemented internally: not exposing to user space that we 
are using uffd-wp under the hood, for example.


Maybe that would be a reasonable approach? Handle this all internally if 
possible, and remove the old soft-dirty infrastructure once it's working.

We wouldn't be able to use uffd-wp + softdirty, but who really cares I 
guess ...

> 
> One thing I didn't mention before (mostly referring to the 1st major
> "defect" of using uffd-wp above I said [1] on memory types): _maybe_ we can
> someday extend at least async mode of uffd-wp to all memory types, so it'll
> even get everything covered.  So far I don't see a strong requirement of
> doing so, but I don't see a major blocker either.

Architecture support is, of course, another issue. Of course, if we 
could replace soft-dirty tracking by uffd-wp internally that would make 
things easier ...

> 
> While the other "uffd cannot be nested" defect is actually the same to
> soft-dirty (no way to have a tracee being able to clear_refs itself or
> it'll also go a mess), it's just that we can still use soft-dirty to track
> an uffd application.

I wonder if we really care about that. Would be good to know if there 
are any relevant softdirty users still around ... from what I 
understoodm even CRIU wants to handle it using uffd-wp.

> 
>>
>>>>>
>>>>> The other thing is it provides a way to make anon and !anon behave the same
>>>>> on empty ptes; it's a pity that it was not already like that.
>>>>
>>>> In an ideal world, we'd simply be using PTE markers unconditionally I think
>>>> and avoid this zeropage feature :/
>>>>
>>>> Is there any particular reason to have UFFD_FEATURE_WP_ZEROPAGE and not
>>>> simply always do that unconditionally? (sure, we have to indicate to user
>>>> space that it now works as expected) Are we really expecting to break user
>>>> space by protecting what was asked for to protect?
>>>
>>> I suspect so.
>>>
>>>   From high level, the major functional changes will be:
>>>
>>>     (1) The user will start to receive more WP message with zero page being
>>>         reported,
>>>
>>>     (2) Wr-protecting a very sparse memory can be much slower
>>>
>>> I would expect there're cases where the app just works as usual.
>>>
>>> However in some other cases the user may really not care about zero pages
>>> at all, and I had a feeling that's actually the majority.
>>>
>>> Live snapshot is actually special because IIUC the old semantics should
>>> work perfectly if the guest OS won't try to sanity check freed pages being
>>> all zeros..  IOW that's some corner case, and if we can control that we may
>>> not even need WP_ZEROPAGE too for QEMU, iiuc.  For many other apps people
>>> may leverage this (ignoring mem holes) and make the app faster.
>>>
>>> Normally when I'm not confident of any functional change, I'd rather use a
>>> flag.  Luckily uffd is very friendly to that, so the user can have better
>>> control of what to expect.  Some future app may explicitly want to always
>>> ignore zero pages when on extremely sparse mem, and without the flag it
>>> can't choose.
>>>
>>>>
>>>>>
>>>>> We can always optimize this behavior in the future with either
>>>>> PMD/PUD/.. pte markers as you said, but IMHO that just needs further
>>>>> justification on the complexity, and also on whether that's beneficial to
>>>>> the majority to become the default behavior.
>>>>
>>>> As I said, usually any new features require good justification. Maybe there
>>>> really is a measurable performance gain (less syscalls, less pgtable walks).
>>>
>>> Muhammad may have a word to say here; let's see whether he has any comment.
>>>
>>> Besides that, as I replied above I'll collect some data in my next post
>>> regardless, with an attempt to optimize with huge zeropages on top.
>>
>> If we can agree on making the shared zeropage an implementation detail, that
>> would be great and I'd see long-term value in that.
>>
>> [1]
>> https://lkml.kernel.org/r/d95d59d7-308d-831c-d8bd-16d06e66e8af@redhat.com
> 
> So I assume there's no major issue to not continue with a new version, then
> I'll move on.

Jup.
Peter Xu Feb. 21, 2023, 11:13 p.m. UTC | #13
On Tue, Feb 21, 2023 at 01:43:28PM +0100, David Hildenbrand wrote:
> I think what we really want to avoid is, creating a new VMA and requiring to
> populate page tables just to set the PTEs softdirty.
> 
> The VMA flag is one way, but it might prevent merging as we discovered.
> Changing the semantic of "pte_none()" to mean " dirty" is another one.

AFAIU, seeing pte_none() as dirty obviously adds false positives in another
way, comparing to what happens when we merge vmas.

> Simply because we cared about getting it precise for uffd-wp, which nobody
> cared for before for soft-dirty. And yes, there are similar issues to be
> solve.
> 
> You are much rather turning uffd-wp with the async mode into a soft-dirty
> replacement,

Exactly.  When I was discussing uffd-wp years ago with Andrea, Andrea
already mentioned about replacing soft-dirty with uffd-wp since then.

We wasn't really clear about what interface it would look like; at that
time the plan was not using pagemap, but probably something else to avoid
the pgtable walking.

I thought about that later with other forms like ring structures, not so
much. Later on I figured that maybe it's not that trivial to do so, and the
benefit is not clear, either.  We know we may avoid pgtable walks, but we
don't yet know what to lose.

> instead using what we learned with uffd-wp to make soft-dirty more
> precise.

I hope it's not in a way we duplicate many things from userfaultfd, though.

As I mentioned before, we can have yet another bit reserved in pte markers
for soft-dirty and that was actually the plan, but if they'll grow into
something even more similar, it'll be fair if someone asks "why bother?".

The other thing is IIUC soft dirty just took the burden of compatibility,
if that works out we don't probably need uffd-wp async mode on the other
way round - in short, if we can have one thing working for all cases IMHO
we don't bother duplicating in the other.

> 
> Fair enough, I won't interfere. The natural way for me to tackle this would
> be to try fixing soft-dirty instead, or handle the details on how soft-dirty
> is implemented internally: not exposing to user space that we are using
> uffd-wp under the hood, for example.
> 
> 
> Maybe that would be a reasonable approach? Handle this all internally if
> possible, and remove the old soft-dirty infrastructure once it's working.
> 
> We wouldn't be able to use uffd-wp + softdirty, but who really cares I guess
> ...

The thing is userfaultfd is an exposed and formal kernel interface to
userspace already, before / if this new async mode will land.  IMHO it's
necessary in this case to let the user know what's happening inside rather
than thinking this is not important and make decision for the user. We
don't want to surprise anyone I guess..

It's not only from the angle where an user may be using userfault in its
tracee app, so the user will know why the "new soft-dirty" won't work.

It's also about maintaining compatible with soft-dirty even if we want to
replace it some day with uffd-wp - it means there'll at least be a period
of having both of them exist, not until we know they're solidly replaceable
between each other.

So far it's definitely not in that stage.. and they're not alike - it's
just that some of us wanted to have soft-dirty change into something like
uffd-wp, then since the 1st way is not easily achievable, we can try the
other way round.

> 
> > 
> > One thing I didn't mention before (mostly referring to the 1st major
> > "defect" of using uffd-wp above I said [1] on memory types): _maybe_ we can
> > someday extend at least async mode of uffd-wp to all memory types, so it'll
> > even get everything covered.  So far I don't see a strong requirement of
> > doing so, but I don't see a major blocker either.
> 
> Architecture support is, of course, another issue. Of course, if we could
> replace soft-dirty tracking by uffd-wp internally that would make things
> easier ...

Yes, here it was about page caches, but arch support is another thing.
Uffd-wp is just not as widely spread as soft-dirty to multi-archs, and also
many users may not need that accuracy (by paying off performance).

> 
> > 
> > While the other "uffd cannot be nested" defect is actually the same to
> > soft-dirty (no way to have a tracee being able to clear_refs itself or
> > it'll also go a mess), it's just that we can still use soft-dirty to track
> > an uffd application.
> 
> I wonder if we really care about that. Would be good to know if there are
> any relevant softdirty users still around ... from what I understoodm even
> CRIU wants to handle it using uffd-wp.

Yeah I don't know either.

> Jup.

What does this mean?

Thanks,
David Hildenbrand Feb. 22, 2023, 5:02 p.m. UTC | #14
On 22.02.23 00:13, Peter Xu wrote:
> On Tue, Feb 21, 2023 at 01:43:28PM +0100, David Hildenbrand wrote:
>> I think what we really want to avoid is, creating a new VMA and requiring to
>> populate page tables just to set the PTEs softdirty.
>>
>> The VMA flag is one way, but it might prevent merging as we discovered.
>> Changing the semantic of "pte_none()" to mean " dirty" is another one.
> 
> AFAIU, seeing pte_none() as dirty obviously adds false positives in another
> way, comparing to what happens when we merge vmas.

Yeah ... I think it's all tuned for "well, let's allow some false 
positives, but keep it simple such that we don't have to allocate a 
bunch of page tables just to track memory that won't eventually be 
written either way".

So when merging, we say "well, let's consider all pte_none() as dirty 
again" by setting the softdirty flag. Otherwise, we'd have to handle 
allocation pf page tables and whatnot to make merging work without 
setting the VM flag.


Allocating all these page tables to install uffd-wp flags is also one of 
the things I actually dislike about the new approach just to get more 
precision. I wondered if it could be avoided, but my brain started to 
hurt. Just an idea how to eventually avoid it:


We can catch access to these virtual memory that are not populated using 
UFFD_MISSING mode. When installing a zeropage, we could set the uffd-wp 
bit. But we don't want to mix in the missing mode I guess. But maybe we 
could use a similar approach for the uffd-wp async mode? Something like 
the following.


We'd want another mode(s?) for that, in addition to _ASYNC mode:

(a) When we hit an unpopulated PTE using read-access, we map a fresh 
page (e.g., zeropage) and set the uffd-wp bit. This will make sure that 
the next write access triggers uffd-wp.

(b) When we hit an unpopulated PTE using write-access, we only map a 
fresh page (not setting the bit). We would want to trigger uffd-wp in 
!_ASYNC mode after that. In _ASYNC mode, all is good.


pte_none() without the uffd-wp bit set would be assumed to be clean -- 
because touching them would turn them dirty (!pte_none() and not have 
the uffd-wp bit set). We might have to handle discarding of memory, by 
using a uffd-wp marker. Then we could detect that (where we already had 
page tables allcoated!) directly.


Such a mode (excluding the _ASYNC stuff) would even make sense for 
background snapshots in QEMU, and would avoid us having to populate page 
tables completely. Simply uffd-wp all that's populated and don't care 
about pte_none(). These will be properly handled on fault.


Does something like that make sense?


> 
>> Simply because we cared about getting it precise for uffd-wp, which nobody
>> cared for before for soft-dirty. And yes, there are similar issues to be
>> solve.
>>
>> You are much rather turning uffd-wp with the async mode into a soft-dirty
>> replacement,
> 
> Exactly.  When I was discussing uffd-wp years ago with Andrea, Andrea
> already mentioned about replacing soft-dirty with uffd-wp since then.
> 
> We wasn't really clear about what interface it would look like; at that
> time the plan was not using pagemap, but probably something else to avoid
> the pgtable walking.
> 
> I thought about that later with other forms like ring structures, not so
> much. Later on I figured that maybe it's not that trivial to do so, and the
> benefit is not clear, either.  We know we may avoid pgtable walks, but we
> don't yet know what to lose.

Interesting idea.

> 
>> instead using what we learned with uffd-wp to make soft-dirty more
>> precise.
> 
> I hope it's not in a way we duplicate many things from userfaultfd, though.
> 
> As I mentioned before, we can have yet another bit reserved in pte markers
> for soft-dirty and that was actually the plan, but if they'll grow into
> something even more similar, it'll be fair if someone asks "why bother?".
> 
> The other thing is IIUC soft dirty just took the burden of compatibility,
> if that works out we don't probably need uffd-wp async mode on the other
> way round - in short, if we can have one thing working for all cases IMHO
> we don't bother duplicating in the other.

Right. I wish we didn't have softdirty and could start with something 
clean. :)

> 
>>
>> Fair enough, I won't interfere. The natural way for me to tackle this would
>> be to try fixing soft-dirty instead, or handle the details on how soft-dirty
>> is implemented internally: not exposing to user space that we are using
>> uffd-wp under the hood, for example.
>>
>>
>> Maybe that would be a reasonable approach? Handle this all internally if
>> possible, and remove the old soft-dirty infrastructure once it's working.
>>
>> We wouldn't be able to use uffd-wp + softdirty, but who really cares I guess
>> ...
> 
> The thing is userfaultfd is an exposed and formal kernel interface to
> userspace already, before / if this new async mode will land.  IMHO it's
> necessary in this case to let the user know what's happening inside rather
> than thinking this is not important and make decision for the user. We
> don't want to surprise anyone I guess..
> 
> It's not only from the angle where an user may be using userfault in its
> tracee app, so the user will know why the "new soft-dirty" won't work.
> 
> It's also about maintaining compatible with soft-dirty even if we want to
> replace it some day with uffd-wp - it means there'll at least be a period
> of having both of them exist, not until we know they're solidly replaceable
> between each other.
> 
> So far it's definitely not in that stage.. and they're not alike - it's
> just that some of us wanted to have soft-dirty change into something like
> uffd-wp, then since the 1st way is not easily achievable, we can try the
> other way round.

Right. And uffd-wp even supports hugetlb :)

>>>
>>> While the other "uffd cannot be nested" defect is actually the same to
>>> soft-dirty (no way to have a tracee being able to clear_refs itself or
>>> it'll also go a mess), it's just that we can still use soft-dirty to track
>>> an uffd application.
>>
>> I wonder if we really care about that. Would be good to know if there are
>> any relevant softdirty users still around ... from what I understoodm even
>> CRIU wants to handle it using uffd-wp.
> 
> Yeah I don't know either.
> 
>> Jup.
> 
> What does this mean?

Yes to the statement "So I assume there's no major issue to not continue 
with a new version, then I'll move on." :)

But my idea at the very beginning might make sense to consider: can we 
instead handle this at fault time and avoid allocating all these page 
tables. Happy to hear if I am missing something important.
Peter Xu Feb. 22, 2023, 8:37 p.m. UTC | #15
On Wed, Feb 22, 2023 at 06:02:37PM +0100, David Hildenbrand wrote:
> On 22.02.23 00:13, Peter Xu wrote:
> > On Tue, Feb 21, 2023 at 01:43:28PM +0100, David Hildenbrand wrote:
> > > I think what we really want to avoid is, creating a new VMA and requiring to
> > > populate page tables just to set the PTEs softdirty.
> > > 
> > > The VMA flag is one way, but it might prevent merging as we discovered.
> > > Changing the semantic of "pte_none()" to mean " dirty" is another one.
> > 
> > AFAIU, seeing pte_none() as dirty obviously adds false positives in another
> > way, comparing to what happens when we merge vmas.
> 
> Yeah ... I think it's all tuned for "well, let's allow some false positives,
> but keep it simple such that we don't have to allocate a bunch of page
> tables just to track memory that won't eventually be written either way".
> 
> So when merging, we say "well, let's consider all pte_none() as dirty again"
> by setting the softdirty flag. Otherwise, we'd have to handle allocation pf
> page tables and whatnot to make merging work without setting the VM flag.
> 
> 
> Allocating all these page tables to install uffd-wp flags is also one of the
> things I actually dislike about the new approach just to get more precision.

My take is that this is unavoidable if we need the accuracy.  More below.

> I wondered if it could be avoided, but my brain started to hurt. Just an
> idea how to eventually avoid it:
> 
> 
> We can catch access to these virtual memory that are not populated using
> UFFD_MISSING mode. When installing a zeropage, we could set the uffd-wp bit.

Good point. :)

> But we don't want to mix in the missing mode I guess. But maybe we could use
> a similar approach for the uffd-wp async mode? Something like the following.
> 
> 
> We'd want another mode(s?) for that, in addition to _ASYNC mode:
> 
> (a) When we hit an unpopulated PTE using read-access, we map a fresh page
> (e.g., zeropage) and set the uffd-wp bit. This will make sure that the next
> write access triggers uffd-wp.
> 
> (b) When we hit an unpopulated PTE using write-access, we only map a fresh
> page (not setting the bit). We would want to trigger uffd-wp in !_ASYNC mode

Not setting uffd-wp bit sounds dangerous here.  What if right after the
pgtable pte got setup then another thread writting to it?  I think it's
data loss.

> after that. In _ASYNC mode, all is good.

IIUC you're suggesting to have a new vma flag (or VM_UFFD_WP + some other
feature bit, which is fundamentally similar to a new vma flag) to show that
"when register uffd-wp on this region, protection starts right away".  Then
it's not pte based, and we don't have problem on pgtable populations
either.

True, but it goes back to why we need pte markers.  It has the accuracy,
alongside with the trade off of using the pgtables.

Without pte markers and uffd-wp bits everywhere, how do we tell "this pte
is none" or "even if this pte is none, it has been written before but just
got zapped, so we don't need to notify again"?

> 
> 
> pte_none() without the uffd-wp bit set would be assumed to be clean --
> because touching them would turn them dirty (!pte_none() and not have the
> uffd-wp bit set). We might have to handle discarding of memory, by using a
> uffd-wp marker. Then we could detect that (where we already had page tables
> allcoated!) directly.
> 
> 
> Such a mode (excluding the _ASYNC stuff) would even make sense for
> background snapshots in QEMU, and would avoid us having to populate page
> tables completely. Simply uffd-wp all that's populated and don't care about
> pte_none(). These will be properly handled on fault.
> 
> 
> Does something like that make sense?
> 
> 
> > 
> > > Simply because we cared about getting it precise for uffd-wp, which nobody
> > > cared for before for soft-dirty. And yes, there are similar issues to be
> > > solve.
> > > 
> > > You are much rather turning uffd-wp with the async mode into a soft-dirty
> > > replacement,
> > 
> > Exactly.  When I was discussing uffd-wp years ago with Andrea, Andrea
> > already mentioned about replacing soft-dirty with uffd-wp since then.
> > 
> > We wasn't really clear about what interface it would look like; at that
> > time the plan was not using pagemap, but probably something else to avoid
> > the pgtable walking.
> > 
> > I thought about that later with other forms like ring structures, not so
> > much. Later on I figured that maybe it's not that trivial to do so, and the
> > benefit is not clear, either.  We know we may avoid pgtable walks, but we
> > don't yet know what to lose.
> 
> Interesting idea.
> 
> > 
> > > instead using what we learned with uffd-wp to make soft-dirty more
> > > precise.
> > 
> > I hope it's not in a way we duplicate many things from userfaultfd, though.
> > 
> > As I mentioned before, we can have yet another bit reserved in pte markers
> > for soft-dirty and that was actually the plan, but if they'll grow into
> > something even more similar, it'll be fair if someone asks "why bother?".
> > 
> > The other thing is IIUC soft dirty just took the burden of compatibility,
> > if that works out we don't probably need uffd-wp async mode on the other
> > way round - in short, if we can have one thing working for all cases IMHO
> > we don't bother duplicating in the other.
> 
> Right. I wish we didn't have softdirty and could start with something clean.
> :)
> 
> > 
> > > 
> > > Fair enough, I won't interfere. The natural way for me to tackle this would
> > > be to try fixing soft-dirty instead, or handle the details on how soft-dirty
> > > is implemented internally: not exposing to user space that we are using
> > > uffd-wp under the hood, for example.
> > > 
> > > 
> > > Maybe that would be a reasonable approach? Handle this all internally if
> > > possible, and remove the old soft-dirty infrastructure once it's working.
> > > 
> > > We wouldn't be able to use uffd-wp + softdirty, but who really cares I guess
> > > ...
> > 
> > The thing is userfaultfd is an exposed and formal kernel interface to
> > userspace already, before / if this new async mode will land.  IMHO it's
> > necessary in this case to let the user know what's happening inside rather
> > than thinking this is not important and make decision for the user. We
> > don't want to surprise anyone I guess..
> > 
> > It's not only from the angle where an user may be using userfault in its
> > tracee app, so the user will know why the "new soft-dirty" won't work.
> > 
> > It's also about maintaining compatible with soft-dirty even if we want to
> > replace it some day with uffd-wp - it means there'll at least be a period
> > of having both of them exist, not until we know they're solidly replaceable
> > between each other.
> > 
> > So far it's definitely not in that stage.. and they're not alike - it's
> > just that some of us wanted to have soft-dirty change into something like
> > uffd-wp, then since the 1st way is not easily achievable, we can try the
> > other way round.
> 
> Right. And uffd-wp even supports hugetlb :)
> 
> > > > 
> > > > While the other "uffd cannot be nested" defect is actually the same to
> > > > soft-dirty (no way to have a tracee being able to clear_refs itself or
> > > > it'll also go a mess), it's just that we can still use soft-dirty to track
> > > > an uffd application.
> > > 
> > > I wonder if we really care about that. Would be good to know if there are
> > > any relevant softdirty users still around ... from what I understoodm even
> > > CRIU wants to handle it using uffd-wp.
> > 
> > Yeah I don't know either.
> > 
> > > Jup.
> > 
> > What does this mean?
> 
> Yes to the statement "So I assume there's no major issue to not continue
> with a new version, then I'll move on." :)
> 
> But my idea at the very beginning might make sense to consider: can we
> instead handle this at fault time and avoid allocating all these page
> tables. Happy to hear if I am missing something important.

I've raised my questions above.  I had a feeling that you're thinking for
anonymous mostly, because shmem is even trickier IIUC, because ptes can
easily got zapped, then if we only rely on a per-vma attribute, there'll be
tons of false positives.

Anonymous is actually similar, as long as the user app wants to drop the
data and repopulate it again, we'll get yet another wr-protect message
which is probably not useful at all (even if the tracer registered with
UFFD_FEATURE_EVENT_REMOVE to trap DONTNEED, we can't avoid the rubbish
message from coming after that point).

It's sad because that's one major goal of having uffd-wp - we're still
trying our best to avoid false positives, especially logical duplicated
messages (e.g., we got some message that page is wr-protected, we
unprotected it, after a long time, it's notified again it's written).

Thanks,
David Hildenbrand Feb. 23, 2023, 2:35 p.m. UTC | #16
On 22.02.23 21:37, Peter Xu wrote:
>> Allocating all these page tables to install uffd-wp flags is also one of the
>> things I actually dislike about the new approach just to get more precision.
> 
> My take is that this is unavoidable if we need the accuracy.  More below.

Yes, it's tricky. Ideally, we'd be able to not allocate any page tables 
at all, and only handle the unmap+remap case in a special way, hmmm

> 
>> I wondered if it could be avoided, but my brain started to hurt. Just an
>> idea how to eventually avoid it:
>>
>>
>> We can catch access to these virtual memory that are not populated using
>> UFFD_MISSING mode. When installing a zeropage, we could set the uffd-wp bit.
> 
> Good point. :)
> 
>> But we don't want to mix in the missing mode I guess. But maybe we could use
>> a similar approach for the uffd-wp async mode? Something like the following.
>>
>>
>> We'd want another mode(s?) for that, in addition to _ASYNC mode:
>>
>> (a) When we hit an unpopulated PTE using read-access, we map a fresh page
>> (e.g., zeropage) and set the uffd-wp bit. This will make sure that the next
>> write access triggers uffd-wp.
>>
>> (b) When we hit an unpopulated PTE using write-access, we only map a fresh
>> page (not setting the bit). We would want to trigger uffd-wp in !_ASYNC mode
> 
> Not setting uffd-wp bit sounds dangerous here.  What if right after the
> pgtable pte got setup then another thread writting to it?  I think it's
> data loss.

Sorry, I think what I meant is that we don't set the bit for _ASYNC 
mode. For !_ASYNC mode, we have to set the bit and notify.

> 
>> after that. In _ASYNC mode, all is good.
> 
> IIUC you're suggesting to have a new vma flag (or VM_UFFD_WP + some other
> feature bit, which is fundamentally similar to a new vma flag) to show that
> "when register uffd-wp on this region, protection starts right away".  Then
> it's not pte based, and we don't have problem on pgtable populations
> either.

Yes, that's the problematic bit I've mentioned, thanks for spelling it out.

> 
> True, but it goes back to why we need pte markers.  It has the accuracy,
> alongside with the trade off of using the pgtables.
> 
> Without pte markers and uffd-wp bits everywhere, how do we tell "this pte
> is none" or "even if this pte is none, it has been written before but just
> got zapped, so we don't need to notify again"?

I agree that we do need pte markers. And it's all very tricky :)

I had some idea of using two markers: PTE_UFFD_WP and PT_UFFD_NO_WP, and 
being pte_none() being something fuzzy in between that the application 
knows how to deal with ("not touched since we registered uffd-wp").

The goal would be to not populate page tables just to insert PTE 
markers/zeropages, but to only special case on the "there is a page 
table with a present PTE and we're unmapping something with uffd-wp set 
or uffd-wp not set". Because when we're unmapping we already have a page 
table entry we can just set instead of allocating a page table.

Sorry for throwing semi-finished ideas at you, but the basic idea would 
be to only special case when we're unmapping something: there, we 
already do have a page mapped and can remember uffd-wp-set (clean) vs. 
!uffd-wp-set (dirty).


uffd-wp protecting a range:
* !pte_none() -> set uffd-wp bit and wrprotect
* pte_none() -> nothing to do
* PTE_UFFD_WP -> nothing to do
* PTE_UFFD_NO_WP -> set PTE_UFFD_WP

unmapping a page (old way: !pte_none() -> pte_none()):
* uffd-wp bit set: set PTE_UFFD_WP
* uffd-wp bit not set: set PTE_UFFD_NO_WP

(re)mapping a page (old: pte_none() -> !pte_none()):
* PTE_UFFD_WP -> set pte bit for new PTE
* PTE_UFFD_NO_WP -> don't set pte bit for new PTE
* pte_none() -> set pte bit for new PTE

Zapping an anon page using MADV_DONTNEED is a bit confusing. It's 
actually similar to a memory write (-> write zeroes), but we don't 
notify uffd-wp for that (I think that's something you comment on below). 
Theoretically, we'd want to set PTE_UFFD_NO_WP ("dirty") in the async 
mode. But that might need more thought of what the expected semantics 
actually are.


When we walk over the page tables we would get the following information 
after protecting the range:

* PTE_UFFD_WP -> clean, not modified since last protection round
* PTE_UFFD_NO_WP -> dirty, modified since last protection round
* pte_none() -> not mapped and therefore not modified since beginning of
                 protection.
* !pte_none() -> uffd-wp bit decides

[...]

>>>>
>>>> Fair enough, I won't interfere. The natural way for me to tackle this would
>>>> be to try fixing soft-dirty instead, or handle the details on how soft-dirty
>>>> is implemented internally: not exposing to user space that we are using
>>>> uffd-wp under the hood, for example.
>>>>
>>>>
>>>> Maybe that would be a reasonable approach? Handle this all internally if
>>>> possible, and remove the old soft-dirty infrastructure once it's working.
>>>>
>>>> We wouldn't be able to use uffd-wp + softdirty, but who really cares I guess
>>>> ...
>>>
>>> The thing is userfaultfd is an exposed and formal kernel interface to
>>> userspace already, before / if this new async mode will land.  IMHO it's
>>> necessary in this case to let the user know what's happening inside rather
>>> than thinking this is not important and make decision for the user. We
>>> don't want to surprise anyone I guess..
>>>
>>> It's not only from the angle where an user may be using userfault in its
>>> tracee app, so the user will know why the "new soft-dirty" won't work.
>>>
>>> It's also about maintaining compatible with soft-dirty even if we want to
>>> replace it some day with uffd-wp - it means there'll at least be a period
>>> of having both of them exist, not until we know they're solidly replaceable
>>> between each other.
>>>
>>> So far it's definitely not in that stage.. and they're not alike - it's
>>> just that some of us wanted to have soft-dirty change into something like
>>> uffd-wp, then since the 1st way is not easily achievable, we can try the
>>> other way round.
>>
>> Right. And uffd-wp even supports hugetlb :)
>>
>>>>>
>>>>> While the other "uffd cannot be nested" defect is actually the same to
>>>>> soft-dirty (no way to have a tracee being able to clear_refs itself or
>>>>> it'll also go a mess), it's just that we can still use soft-dirty to track
>>>>> an uffd application.
>>>>
>>>> I wonder if we really care about that. Would be good to know if there are
>>>> any relevant softdirty users still around ... from what I understoodm even
>>>> CRIU wants to handle it using uffd-wp.
>>>
>>> Yeah I don't know either.
>>>
>>>> Jup.
>>>
>>> What does this mean?
>>
>> Yes to the statement "So I assume there's no major issue to not continue
>> with a new version, then I'll move on." :)
>>
>> But my idea at the very beginning might make sense to consider: can we
>> instead handle this at fault time and avoid allocating all these page
>> tables. Happy to hear if I am missing something important.
> 
> I've raised my questions above.  I had a feeling that you're thinking for
> anonymous mostly, because shmem is even trickier IIUC, because ptes can
> easily got zapped, then if we only rely on a per-vma attribute, there'll be
> tons of false positives.

Yes, I focused on anon. Let's see if any of the above I said makes sense. :)


Anyhow, what we're discussing here is yet another uffd-wp addition, if 
ever, so don't feel blocked by my comments.
Peter Xu Feb. 28, 2023, 7:42 p.m. UTC | #17
On Thu, Feb 23, 2023 at 03:35:13PM +0100, David Hildenbrand wrote:
> I had some idea of using two markers: PTE_UFFD_WP and PT_UFFD_NO_WP, and
> being pte_none() being something fuzzy in between that the application knows
> how to deal with ("not touched since we registered uffd-wp").
> 
> The goal would be to not populate page tables just to insert PTE
> markers/zeropages, but to only special case on the "there is a page table
> with a present PTE and we're unmapping something with uffd-wp set or uffd-wp
> not set". Because when we're unmapping we already have a page table entry we
> can just set instead of allocating a page table.
> 
> Sorry for throwing semi-finished ideas at you, but the basic idea would be
> to only special case when we're unmapping something: there, we already do
> have a page mapped and can remember uffd-wp-set (clean) vs. !uffd-wp-set
> (dirty).
> 
> 
> uffd-wp protecting a range:
> * !pte_none() -> set uffd-wp bit and wrprotect
> * pte_none() -> nothing to do
> * PTE_UFFD_WP -> nothing to do
> * PTE_UFFD_NO_WP -> set PTE_UFFD_WP
> 
> unmapping a page (old way: !pte_none() -> pte_none()):
> * uffd-wp bit set: set PTE_UFFD_WP
> * uffd-wp bit not set: set PTE_UFFD_NO_WP
> 
> (re)mapping a page (old: pte_none() -> !pte_none()):
> * PTE_UFFD_WP -> set pte bit for new PTE
> * PTE_UFFD_NO_WP -> don't set pte bit for new PTE
> * pte_none() -> set pte bit for new PTE
> 
> Zapping an anon page using MADV_DONTNEED is a bit confusing. It's actually
> similar to a memory write (-> write zeroes), but we don't notify uffd-wp for
> that (I think that's something you comment on below). Theoretically, we'd
> want to set PTE_UFFD_NO_WP ("dirty") in the async mode. But that might need
> more thought of what the expected semantics actually are.
> 
> When we walk over the page tables we would get the following information
> after protecting the range:
> 
> * PTE_UFFD_WP -> clean, not modified since last protection round
> * PTE_UFFD_NO_WP -> dirty, modified since last protection round
> * pte_none() -> not mapped and therefore not modified since beginning of
>                 protection.
> * !pte_none() -> uffd-wp bit decides

I can't say I thought a lot but I feel like it may work. I'd probably avoid
calling it PTE_UFFD_NO_WP or it'll be confusing.. maybe WP_WRITTEN or
WP_RESOLVED instead.

But that interface looks weird in that the protection happens right after
VM_UFFD_WP applied to VMA and that keeps true until unregister.  One needs
to reprotect using ioctl(UFFDIO_WRITEPROTECT) OTOH after the 1st round of
tracking.  It just looks a little bit over-complicated, not to mention we
will need two markers only for userfault-wp.  I had a feeling this
complexity can cause us some trouble elsewhere.

IIUC this can be something done on top even if it'll work (I think the
userspace API doesn't need to change at all), so I'd suggest giving it some
more thoughts and we start with simple and working.

In general, I'll be happy with anything simpler if Muhammad is happy with
its current performance..  For myself, WP_UNPOPULATED is definitely much
better than the old workaround in QEMU live snapshots, so I never worry that.

> Yes, I focused on anon. Let's see if any of the above I said makes sense. :)
> 
> Anyhow, what we're discussing here is yet another uffd-wp addition, if ever,
> so don't feel blocked by my comments.

Thanks.  I've just posted a new version.
David Hildenbrand March 2, 2023, 2:12 p.m. UTC | #18
>>
>> uffd-wp protecting a range:
>> * !pte_none() -> set uffd-wp bit and wrprotect
>> * pte_none() -> nothing to do
>> * PTE_UFFD_WP -> nothing to do
>> * PTE_UFFD_NO_WP -> set PTE_UFFD_WP
>>
>> unmapping a page (old way: !pte_none() -> pte_none()):
>> * uffd-wp bit set: set PTE_UFFD_WP
>> * uffd-wp bit not set: set PTE_UFFD_NO_WP
>>
>> (re)mapping a page (old: pte_none() -> !pte_none()):
>> * PTE_UFFD_WP -> set pte bit for new PTE
>> * PTE_UFFD_NO_WP -> don't set pte bit for new PTE
>> * pte_none() -> set pte bit for new PTE
>>
>> Zapping an anon page using MADV_DONTNEED is a bit confusing. It's actually
>> similar to a memory write (-> write zeroes), but we don't notify uffd-wp for
>> that (I think that's something you comment on below). Theoretically, we'd
>> want to set PTE_UFFD_NO_WP ("dirty") in the async mode. But that might need
>> more thought of what the expected semantics actually are.
>>
>> When we walk over the page tables we would get the following information
>> after protecting the range:
>>
>> * PTE_UFFD_WP -> clean, not modified since last protection round
>> * PTE_UFFD_NO_WP -> dirty, modified since last protection round
>> * pte_none() -> not mapped and therefore not modified since beginning of
>>                  protection.
>> * !pte_none() -> uffd-wp bit decides
> 
> I can't say I thought a lot but I feel like it may work. I'd probably avoid
> calling it PTE_UFFD_NO_WP or it'll be confusing.. maybe WP_WRITTEN or
> WP_RESOLVED instead.

I haven't thought about this further, but I maybe we might be able to 
just use a single PTE marker , because pte_none() would translate to 
PTE_UFFD_WP in such VMAs. So we could make the meaning of the 
PTE_UFFD_WP marker simply depend on the type of VMA.

If I am not wrong, we could stop setting PTE_UFFD_WP completely then, 
for any memory type (shmem/anon/hugetlb).

> 
> But that interface looks weird in that the protection happens right after
> VM_UFFD_WP applied to VMA and that keeps true until unregister.  One needs
> to reprotect using ioctl(UFFDIO_WRITEPROTECT) OTOH after the 1st round of
> tracking.  It just looks a little bit over-complicated, not to mention we
> will need two markers only for userfault-wp.  I had a feeling this
> complexity can cause us some trouble elsewhere.

When to apply this logic is indeed the interesting part. Doing it right 
from the beginning when the feature is enabled sounds like the simplest 
approach to me.  For background snapshots and dirty tracking that might 
just be good enough.


> 
> IIUC this can be something done on top even if it'll work (I think the

I think the API would have to change eventually, to enable it via a new 
feature ("unpopulated implies uffd-wp is active").

> userspace API doesn't need to change at all), so I'd suggest giving it some
> more thoughts and we start with simple and working.

Yes.
diff mbox series

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..0117e409cc07 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -108,6 +108,13 @@  static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
 	return ctx->features & UFFD_FEATURE_INITIALIZED;
 }
 
+bool userfaultfd_wp_zeropage(struct vm_area_struct *vma)
+{
+	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
+
+	return ctx && (ctx->features & UFFD_FEATURE_WP_ZEROPAGE);
+}
+
 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 				     vm_flags_t flags)
 {
@@ -1968,6 +1975,7 @@  static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+	uffdio_api.features &= ~UFFD_FEATURE_WP_ZEROPAGE;
 #endif
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3767f18114ef..f539a9a45189 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -179,6 +179,7 @@  extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
 				  unsigned long end, struct list_head *uf);
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
+extern bool userfaultfd_wp_zeropage(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
@@ -274,6 +275,11 @@  static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp_zeropage(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..ba5da3a521b3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@ 
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_WP_ZEROPAGE)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,12 @@  struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_WP_ZEROPAGE indicates that userfaultfd
+	 * write-protection mode will always apply to zero pages (aka,
+	 * empty ptes).  This will be the default behavior for shmem &
+	 * hugetlbfs, so this flag only affects anonymous memory behavior
+	 * when userfault write-protection mode is registered.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +224,7 @@  struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_WP_ZEROPAGE		(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1d4843c97c2a..c157d0830807 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -276,7 +276,15 @@  static long change_pte_range(struct mmu_gather *tlb,
 		} else {
 			/* It must be an none page, or what else?.. */
 			WARN_ON_ONCE(!pte_none(oldpte));
-			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+
+			/*
+			 * Nobody plays with any none ptes besides
+			 * userfaultfd when applying the protections.
+			 */
+			if (likely(!uffd_wp))
+				continue;
+
+			if (!vma_is_anonymous(vma)) {
 				/*
 				 * For file-backed mem, we need to be able to
 				 * wr-protect a none pte, because even if the
@@ -286,6 +294,17 @@  static long change_pte_range(struct mmu_gather *tlb,
 				set_pte_at(vma->vm_mm, addr, pte,
 					   make_pte_marker(PTE_MARKER_UFFD_WP));
 				pages++;
+			} else if (userfaultfd_wp_zeropage(vma)) {
+				/*
+				 * Anonymous memory, wr-protecting it with
+				 * WP_ZEROPAGE, injecting zero pages to
+				 * persist uffd-wp bit.
+				 */
+				pte_t entry = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+								    vma->vm_page_prot));
+				entry = pte_mkuffd_wp(entry);
+				set_pte_at(vma->vm_mm, addr, pte, entry);
+				pages++;
 			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
@@ -320,23 +339,52 @@  static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
-/* Return true if we're uffd wr-protecting file-backed memory, or false */
+/*
+ * Return true if we want to split huge thps in change protection
+ * procedure, false otherwise.
+ */
 static inline bool
-uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 {
+	/*
+	 * pte markers only resides in pte level, if we need pte markers,
+	 * we need to split.
+	 */
 	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
 }
 
 /*
- * If wr-protecting the range for file-backed, populate pgtable for the case
- * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
- * failed we treat it the same way as pgtable allocation failures during
- * page faults by kicking OOM and returning error.
+ * Return true if we want to populate pgtables in change protection
+ * procedure, false otherwise
+ */
+static inline bool
+pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	/* If not within ioctl(UFFDIO_WRITEPROTECT), then don't bother */
+	if (!(cp_flags & MM_CP_UFFD_WP))
+		return false;
+
+	/* Either if this is file-based, we need it for pte markers */
+	if (!vma_is_anonymous(vma))
+		return true;
+
+	/*
+	 * Or anonymous, we only need this if WP_ZEROPAGE enabled (to
+	 * install zero pages).
+	 */
+	return userfaultfd_wp_zeropage(vma);
+}
+
+/*
+ * Populate the pgtable underneath for whatever reason if requested.
+ * When {pte|pmd|...}_alloc() failed we treat it the same way as pgtable
+ * allocation failures during page faults by kicking OOM and returning
+ * error.
  */
 #define  change_pmd_prepare(vma, pmd, cp_flags)				\
 	({								\
 		long err = 0;						\
-		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+		if (unlikely(pgtable_populate_needed(vma, cp_flags))) {	\
 			if (pte_alloc(vma->vm_mm, pmd))			\
 				err = -ENOMEM;				\
 		}							\
@@ -351,7 +399,7 @@  uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
 #define  change_prepare(vma, high, low, addr, cp_flags)			\
 	  ({								\
 		long err = 0;						\
-		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+		if (unlikely(pgtable_populate_needed(vma, cp_flags))) {	\
 			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
 			if (p == NULL)					\
 				err = -ENOMEM;				\
@@ -404,7 +452,7 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
-			    uffd_wp_protect_file(vma, cp_flags)) {
+			    pgtable_split_needed(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				/*
 				 * For file-backed, the pmd could have been
diff --git a/tools/testing/selftests/mm/userfaultfd.c b/tools/testing/selftests/mm/userfaultfd.c
index 7f22844ed704..c600ea4ee9b9 100644
--- a/tools/testing/selftests/mm/userfaultfd.c
+++ b/tools/testing/selftests/mm/userfaultfd.c
@@ -1462,7 +1462,7 @@  static void userfaultfd_pagemap_test(unsigned int test_pgsize)
 	/* Flush so it doesn't flush twice in parent/child later */
 	fflush(stdout);
 
-	uffd_test_ctx_init(0);
+	uffd_test_ctx_init(UFFD_FEATURE_WP_ZEROPAGE);
 
 	if (test_pgsize > page_size) {
 		/* This is a thp test */
@@ -1482,6 +1482,17 @@  static void userfaultfd_pagemap_test(unsigned int test_pgsize)
 
 	pagemap_fd = pagemap_open();
 
+	if (test_pgsize == page_size) {
+		/* Test WP_ZEROPAGE first */
+		wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);
+		value = pagemap_read_vaddr(pagemap_fd, area_dst);
+		pagemap_check_wp(value, true);
+
+		wp_range(uffd, (uint64_t)area_dst, page_size, false);
+		value = pagemap_read_vaddr(pagemap_fd, area_dst);
+		pagemap_check_wp(value, false);
+	}
+
 	/* Touch the page */
 	*area_dst = 1;
 	wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);