Message ID: 20240320005024.3216282-1-seanjc@google.com
Series: KVM: x86/mmu: Rework marking folios dirty/accessed
On 20.03.24 01:50, Sean Christopherson wrote:
> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> i.e. when creating mappings for KVM guests, instead of when zapping or
> modifying SPTEs, e.g. when dropping mappings.
>
> The motivation is twofold:
>
> 1. Marking folios dirty and accessed when zapping can be extremely
>    expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>    512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>    dirty and accessed for all 512*512 SPTEs.
>
> 2. x86 diverges from literally every other architecture, which updates
>    folios when mappings are created. AFAIK, x86 is unique in that it's
>    the only KVM arch that prefetches PTEs, so it's not quite an apples-
>    to-apples comparison, but I don't see any reason for the dirty logic
>    in particular to be different.

Sorry in advance for the lengthy reply.

For "ordinary" process page tables on x86, the behavior is as follows:

1) A page might be mapped writable but the PTE might not be dirty. Once
   written to, HW will set the PTE dirty bit.

2) A page might be mapped but the PTE might not be young. Once accessed,
   HW will set the PTE young bit.

3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
   PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
   the folio (folio_mark_accessed()). The latter is done conditionally
   only (vma_has_recency()).

   BUT, when zapping an anon folio, we don't do that, because there
   zapping implies "gone for good" and not "content must go to a file".

4) When temporarily unmapping a folio for migration/swapout, we
   primarily only move the dirty PTE bit to the folio.

GUP is different, because the PTEs might change after we pinned the page
and wrote to it. We don't modify the PTEs and expect the GUP user to do
the right thing (set dirty/accessed). For example,
unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
where the PTE might be long gone.

So GUP does not really behave like HW access.

Secondary page tables are different from ordinary GUP, and KVM ends up
using GUP to some degree to simulate HW access; regarding NUMA-hinting,
KVM already turned out to be very different from all other GUP users. [1]

And I recall that at some point I raised that we might want to have a
dedicated interface for these "mmu-notifier"-based page table
synchronization mechanisms.

But KVM ends up setting folio dirty/accessed flags itself, like other GUP
users. I do wonder if secondary page tables should be messing with folio
flags *at all*, and if there would be ways to do it differently using PTEs.

We make sure to synchronize the secondary page tables to the process
page tables using MMU notifiers: when we write-protect/unmap a PTE, we
write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
differently.

I once had the following idea, but I am not sure about all implications;
I just wanted to raise it because it matches the topic here:

Secondary page tables kind-of behave like "HW" access. If there is a
write access, we would expect the original PTE to become dirty, not the
mapped folio.

1) When KVM wants to map a page into the secondary page table, we
   require the PTE to be young (like a HW access). The SPTE can remain
   old.

2) When KVM wants to map a page writable into the secondary page table,
   we require the PTE to be dirty (like a HW access). The SPTE can
   remain clean.
3) When core MM clears the PTE dirty/young bit, we notify the secondary
   page table to adjust: for example, if the dirty bit gets cleared, the
   page cannot be writable in the secondary MMU.

4) GUP-fast cannot set the PTE dirty/young, so we would fall back to slow
   GUP, where we hold the PTL, and simply modify the PTE to have the
   accessed/dirty bit set.

5) Prefetching would similarly be limited to that (only prefetch if the
   PTE is already dirty etc.).

6) Dirty/accessed bits no longer have to be synced from the secondary
   page table to the process page table, because an SPTE being dirty
   implies that the PTE is dirty.

One tricky bit -- and the reason why ordinary GUP modifies the folio and
not the PTE -- is concurrent HW access. For example, when we want to mark
a PTE accessed, it could happen that HW concurrently tries marking the
PTE dirty. We must not lose that update, so we have to guarantee an
atomic update (maybe avoidable in some cases).

What would be the implications? We'd leave setting folio flags to the MM
core. That also implies that if you shut down a VM and zap all anon
folios, you wouldn't have to mark any folio dirty: the PTE is dirty, and
the MM core can decide to ignore that flag since it will discard the page
either way.

Downsides? Likely many I have not yet thought about (TLB flushes etc.).
Just mentioning it because, in the context of [1], I was wondering whether
something that uses MMU notifiers should really be messing with
dirty/young flags :)

> I tagged this RFC as it is barely tested, and because I'm not 100% positive
> there isn't some weird edge case I'm missing, which is why I Cc'd David H.
> and Matthew.

We'd be in trouble if someone detected that all PTEs are clean and
consequently cleared the folio dirty flag (for example, after writeback).
Then, we would write using the SPTE and the folio+PTE would be clean. If
we then evicted the "clean" folio that is actually dirty, we would be in
trouble.

Well, we would set the SPTE dirty flag, I guess. But I cannot immediately
tell whether that one would be synced back to the folio. Would we have a
mechanism in place to prevent that?

> Note, I'm going to be offline from ~now until April 1st. I rushed this out
> as it could impact David S.'s kvm_follow_pfn series[*], which is imminent.
> E.g. if KVM stops marking pages dirty and accessed everywhere, adding
> SPTE_MMU_PAGE_REFCOUNTED just to sanity check that the refcount is elevated
> seems like a poor tradeoff (medium complexity and annoying to maintain, for
> not much benefit).
>
> Regarding David S.'s series, I wouldn't be at all opposed to going even
> further and having x86 follow all architectures by marking pages accessed
> _only_ at map time, at which point I think KVM could simply pass in FOLL_TOUCH
> as appropriate, and thus dedup a fair bit of arch code.

FOLL_TOUCH is weird (excluding the weird devmap stuff):

1) For PTEs (follow_page_pte), we set the page dirty and accessed, and do
   not modify the PTE. For THP (follow_trans_huge_pmd), we set the PMD
   young/dirty and don't mess with the folio.

2) FOLL_TOUCH is not implemented for hugetlb.

3) FOLL_TOUCH is not implemented for GUP-fast.

I'd leave that alone :)

[1] https://lore.kernel.org/lkml/20230727212845.135673-1-david@redhat.com/T/#u
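To make the contrast concrete: below is a minimal sketch of the zap-time behavior described at the top of this mail, loosely following zap_present_folio_ptes() in mm/memory.c. Locking, TLB flushing, and rmap bookkeeping are omitted, and the helper name is illustrative, not the kernel's.

	/*
	 * Illustrative sketch, not the actual kernel code: when zapping
	 * a present PTE, transfer its accessed/dirty bits to the folio.
	 */
	static void transfer_pte_bits_to_folio(struct vm_area_struct *vma,
					       struct folio *folio, pte_t ptent)
	{
		/* Anon folios are "gone for good"; nothing to write back. */
		if (folio_test_anon(folio))
			return;

		if (pte_dirty(ptent))
			folio_mark_dirty(folio);

		/* Accessed state is only worth preserving for some mappings. */
		if (pte_young(ptent) && vma_has_recency(vma))
			folio_mark_accessed(folio);
	}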
On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.03.24 01:50, Sean Christopherson wrote:
> > Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> > i.e. when creating mappings for KVM guests, instead of when zapping or
> > modifying SPTEs, e.g. when dropping mappings.
[...]
> We make sure to synchronize the secondary page tables to the process
> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
> differently.

Accessed bits have the test/clear young MMU notifiers. But I agree
there aren't any notifiers for dirty tracking.

Are there any cases where the primary MMU transfers the PTE dirty bit
to the folio _other_ than zapping (which already sends an MMU notifier
to KVM)? If not, then there might not be any reason to add a new
notifier. Instead, the contract should just be that secondary MMUs must
also transfer their dirty bits to folios in sync with (or before) the
primary MMU zapping its PTE.
>
> I once had the following idea, but I am not sure about all implications;
> I just wanted to raise it because it matches the topic here:
>
> Secondary page tables kind-of behave like "HW" access. If there is a
> write access, we would expect the original PTE to become dirty, not the
> mapped folio.

Propagating SPTE dirty bits to folios indirectly via the primary MMU
PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
order to avoid having two different ways to propagate SPTE dirty bits,
KVM should probably be responsible for updating the folio directly.
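A hypothetical sketch of the contract described above, with invented helper names (mmu_spte_clear() and spte_is_dirty() do not exist in KVM); the only point is the ordering, i.e. that the secondary MMU moves its dirty state to the folio no later than when it drops the mapping in response to the invalidation.

	/*
	 * Hypothetical sketch; both helpers are invented names. Called
	 * from the secondary MMU's invalidation path, before the primary
	 * MMU finishes zapping its PTE.
	 */
	static void secondary_mmu_zap_mapping(u64 *sptep, struct folio *folio)
	{
		u64 old_spte = mmu_spte_clear(sptep);	/* invented */

		if (spte_is_dirty(old_spte))		/* invented */
			folio_mark_dirty(folio);
	}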
On 02.04.24 19:38, David Matlack wrote:
> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.03.24 01:50, Sean Christopherson wrote:
>>> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
>>> i.e. when creating mappings for KVM guests, instead of when zapping or
>>> modifying SPTEs, e.g. when dropping mappings.
[...]
>> We make sure to synchronize the secondary page tables to the process
>> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
>> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
>> differently.
>
> Accessed bits have the test/clear young MMU notifiers. But I agree
> there aren't any notifiers for dirty tracking.

Yes, and I am questioning whether the "test" part should exist -- or
whether having an SPTE in the secondary MMU should require the access bit
to be set (derived from the primary MMU).
(Again, see my explanation above about fake HW page table walkers.)

There might be a good reason to do it the current way nowadays, so I'm
only raising it as something I was wondering about. Likely, frequent
clearing of the access bit would result in many PTEs in the secondary MMU
getting invalidated, requiring a new GUP-fast lookup where we would set
the access bit in the primary MMU PTE. But I'm not an expert on the
implications of MMU notifiers and access bit clearing.

> Are there any cases where the primary MMU transfers the PTE dirty bit
> to the folio _other_ than zapping (which already sends an MMU notifier
> to KVM)? If not, then there might not be any reason to add a new
> notifier. Instead, the contract should just be that secondary MMUs must
> also transfer their dirty bits to folios in sync with (or before) the
> primary MMU zapping its PTE.

Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
notifiers, because they either clear the PTE or also remove write
permissions.

But there are madvise_free_pte_range() and
clean_record_shared_mapping_range()...->clean_record_pte(), which might
only clear the dirty bit without clearing/changing permissions and
consequently without calling MMU notifiers.

Getting a writable PTE without the dirty bit set should be possible.

So I am questioning whether having a writable PTE in the secondary MMU
with a clean PTE in the primary MMU should be valid. It can exist today,
and I am not sure if that's the right approach.

>> I once had the following idea, but I am not sure about all implications;
>> I just wanted to raise it because it matches the topic here:
>>
>> Secondary page tables kind-of behave like "HW" access. If there is a
>> write access, we would expect the original PTE to become dirty, not the
>> mapped folio.
>
> Propagating SPTE dirty bits to folios indirectly via the primary MMU
> PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
> order to avoid having two different ways to propagate SPTE dirty bits,
> KVM should probably be responsible for updating the folio directly.

But who really cares about accessed/dirty bits for guest_memfd?

guest_memfd already wants to disable/bypass all of core MM, so different
rules are to be expected. This discussion is all about integration with
core MM that relies on correct dirty bits, which does not really apply to
guest_memfd.
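For reference, the PTE manipulation in question, abridged from madvise_free_pte_range() in mm/madvise.c as of the kernels under discussion: the dirty and young bits are cleared and the PTE is rewritten, with no per-PTE notification to secondary MMUs.

	if (pte_young(ptent) || pte_dirty(ptent)) {
		/*
		 * Clear the PTE atomically, so that accessed/dirty bits
		 * set concurrently by HW are not lost, then rewrite it
		 * as old and clean.
		 */
		ptent = ptep_get_and_clear_full(mm, addr, pte,
						tlb->fullmm);

		ptent = pte_mkold(ptent);
		ptent = pte_mkclean(ptent);
		set_pte_at(mm, addr, pte, ptent);
		tlb_remove_tlb_entry(tlb, pte, addr);
	}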
On Tue, Apr 02, 2024, David Hildenbrand wrote:
> On 02.04.24 19:38, David Matlack wrote:
> > On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> > Accessed bits have the test/clear young MMU notifiers. But I agree
> > there aren't any notifiers for dirty tracking.
>
> Yes, and I am questioning whether the "test" part should exist -- or
> whether having an SPTE in the secondary MMU should require the access bit
> to be set (derived from the primary MMU). (Again, see my explanation above
> about fake HW page table walkers.)
>
> There might be a good reason to do it the current way nowadays, so I'm
> only raising it as something I was wondering about. Likely, frequent
> clearing of the access bit would result in many PTEs in the secondary MMU
> getting invalidated, requiring a new GUP-fast lookup where we would set
> the access bit in the primary MMU PTE. But I'm not an expert on the
> implications of MMU notifiers and access bit clearing.

Ya. KVM already does something similar to this for dirty logging, where SPTEs are
write-protected, i.e. generate faults to track dirty status. But KVM x86 has a
lot of code to mitigate the resulting pain, e.g. has a lockless fast path to
make the SPTE writable and propagate the dirty status to bitmaps, and userspace
is heavily/carefully optimized to balance harvesting/clearing dirty status against
guest activity.

In general, assumptions that core MM makes about the cost of PTE modifications
tend to fall apart for KVM. Much of that comes down to mmu_notifier invalidations
being an imperfect boundary between the primary MMU and secondary MMUs. E.g.
any prot changes (NUMA balancing in particular) can have a disastrous impact on
KVM guests, because those changes invalidate secondary MMUs at PMD granularity
and effectively force KVM to do a remote TLB flush for every PMD that is affected,
whereas the primary MMU is able to batch TLB flushes until the entire VMA is
processed.

So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would be
comically slow for KVM guests without a crazy amount of optimization.

> > Are there any cases where the primary MMU transfers the PTE dirty bit
> > to the folio _other_ than zapping (which already sends an MMU notifier
> > to KVM)? If not, then there might not be any reason to add a new
> > notifier. Instead, the contract should just be that secondary MMUs must
> > also transfer their dirty bits to folios in sync with (or before) the
> > primary MMU zapping its PTE.
>
> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
> notifiers, because they either clear the PTE or also remove write
> permissions.
>
> But there are madvise_free_pte_range() and
> clean_record_shared_mapping_range()...->clean_record_pte(), which might
> only clear the dirty bit without clearing/changing permissions and
> consequently without calling MMU notifiers.

Heh, I was staring at these earlier today. They are all wrapped with
mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper-aggressive response
to mmu_notifier invalidations ensures all KVM SPTEs get dropped.

> Getting a writable PTE without the dirty bit set should be possible.
>
> So I am questioning whether having a writable PTE in the secondary MMU
> with a clean PTE in the primary MMU should be valid. It can exist today,
> and I am not sure if that's the right approach.

I came to the same conclusion (that it can exist today). Without (sane)
FOLL_TOUCH, KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier
invalidation, and thus end up with a writable SPTE that isn't dirty in core MM.

For MADV_FREE, I don't see how KVM's current behavior of marking the folio dirty
at zap time changes anything. Ah, try_to_unmap_one() checks folio_test_dirty()
*after* invoking mmu_notifier_invalidate_range_start(), which ensures that KVM's
dirty status is propagated to the folio, and thus try_to_unmap_one() keeps the
folio.

Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
that it needs to play nice with a GUP'd page being marked dirty before the
reference is dropped.

	 * Must be careful with the order of the tests. When someone has
	 * a ref to the folio, it may be possible that they dirty it then
	 * drop the reference. So if the dirty flag is tested before the
	 * refcount here, then the following race may occur:

So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
while holding a reference or (b) marks the folio dirty before returning from its
mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
mappings in response to mmu_notifier_invalidate_range_start().

> > > I once had the following idea, but I am not sure about all implications;
> > > I just wanted to raise it because it matches the topic here:
> > >
> > > Secondary page tables kind-of behave like "HW" access. If there is a
> > > write access, we would expect the original PTE to become dirty, not the
> > > mapped folio.
> >
> > Propagating SPTE dirty bits to folios indirectly via the primary MMU
> > PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
> > order to avoid having two different ways to propagate SPTE dirty bits,
> > KVM should probably be responsible for updating the folio directly.
>
> But who really cares about accessed/dirty bits for guest_memfd?
>
> guest_memfd already wants to disable/bypass all of core MM, so different
> rules are to be expected. This discussion is all about integration with
> core MM that relies on correct dirty bits, which does not really apply to
> guest_memfd.

+1, this is one of many reasons I don't want to support swap/reclaim for
guest_memfd. The instant guest_memfd gets involved with anything LRU-related,
the interactions and complexity skyrocket.
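The ordering the quoted comment refers to, abridged from __remove_mapping() in mm/vmscan.c: the refcount is frozen *before* the dirty flag is tested, so a GUP user that dirties the folio and then drops its reference cannot slip through.

	/* Abridged from __remove_mapping(); surrounding context omitted. */
	refcount = 1 + folio_nr_pages(folio);
	if (!folio_ref_freeze(folio, refcount))
		goto cannot_free;

	/* The cmpxchg in folio_ref_freeze() provides the smp_rmb(). */
	if (unlikely(folio_test_dirty(folio))) {
		folio_ref_unfreeze(folio, refcount);
		goto cannot_free;
	}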
On 03.04.24 02:17, Sean Christopherson wrote:
> On Tue, Apr 02, 2024, David Hildenbrand wrote:
>> On 02.04.24 19:38, David Matlack wrote:
>>> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
[...]
>> There might be a good reason to do it the current way nowadays, so I'm
>> only raising it as something I was wondering about. Likely, frequent
>> clearing of the access bit would result in many PTEs in the secondary MMU
>> getting invalidated, requiring a new GUP-fast lookup where we would set
>> the access bit in the primary MMU PTE. But I'm not an expert on the
>> implications of MMU notifiers and access bit clearing.
>
> Ya. KVM already does something similar to this for dirty logging, where SPTEs are
> write-protected, i.e. generate faults to track dirty status. But KVM x86 has a
> lot of code to mitigate the resulting pain, e.g. has a lockless fast path to
> make the SPTE writable and propagate the dirty status to bitmaps, and userspace
> is heavily/carefully optimized to balance harvesting/clearing dirty status against
> guest activity.
>
> In general, assumptions that core MM makes about the cost of PTE modifications
> tend to fall apart for KVM. Much of that comes down to mmu_notifier invalidations
> being an imperfect boundary between the primary MMU and secondary MMUs. E.g.
> any prot changes (NUMA balancing in particular) can have a disastrous impact on
> KVM guests, because those changes invalidate secondary MMUs at PMD granularity
> and effectively force KVM to do a remote TLB flush for every PMD that is affected,
> whereas the primary MMU is able to batch TLB flushes until the entire VMA is
> processed.
>
> So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would be
> comically slow for KVM guests without a crazy amount of optimization.

Right, so access bits are certainly special. Fortunately, they are not as
important to get right as dirty bits.

>>> Are there any cases where the primary MMU transfers the PTE dirty bit
>>> to the folio _other_ than zapping (which already sends an MMU notifier
>>> to KVM)? If not, then there might not be any reason to add a new
>>> notifier.
>>> Instead, the contract should just be that secondary MMUs must
>>> also transfer their dirty bits to folios in sync with (or before) the
>>> primary MMU zapping its PTE.
>>
>> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
>> notifiers, because they either clear the PTE or also remove write
>> permissions.
>>
>> But there are madvise_free_pte_range() and
>> clean_record_shared_mapping_range()...->clean_record_pte(), which might
>> only clear the dirty bit without clearing/changing permissions and
>> consequently without calling MMU notifiers.
>
> Heh, I was staring at these earlier today. They are all wrapped with
> mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper-aggressive response
> to mmu_notifier invalidations ensures all KVM SPTEs get dropped.
>
>> Getting a writable PTE without the dirty bit set should be possible.
>>
>> So I am questioning whether having a writable PTE in the secondary MMU
>> with a clean PTE in the primary MMU should be valid. It can exist today,
>> and I am not sure if that's the right approach.
>
> I came to the same conclusion (that it can exist today). Without (sane)
> FOLL_TOUCH, KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier
> invalidation, and thus end up with a writable SPTE that isn't dirty in core MM.
>
> For MADV_FREE, I don't see how KVM's current behavior of marking the folio dirty
> at zap time changes anything. Ah, try_to_unmap_one() checks folio_test_dirty()
> *after* invoking mmu_notifier_invalidate_range_start(), which ensures that KVM's
> dirty status is propagated to the folio, and thus try_to_unmap_one() keeps the
> folio.

Right, we have to check whether it has been re-dirtied. And any unexpected
reference (i.e., GUP) could dirty it.

> Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
> that it needs to play nice with a GUP'd page being marked dirty before the
> reference is dropped.
>
>	 * Must be careful with the order of the tests. When someone has
>	 * a ref to the folio, it may be possible that they dirty it then
>	 * drop the reference. So if the dirty flag is tested before the
>	 * refcount here, then the following race may occur:
>
> So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> while holding a reference or (b) marks the folio dirty before returning from its
> mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> mappings in response to mmu_notifier_invalidate_range_start().

Yes, I agree that it should work in the context of vmscan. But (b) is
certainly a bit harder to swallow than "ordinary" (a) :)

As raised, if having a writable SPTE implied having a writable+dirty PTE,
then KVM MMU code wouldn't have to worry about syncing any dirty bits ever
back to core MM, so patch #2 would not be required. ... well, it would be
replaced by an MMU notifier that notifies about clearing the PTE dirty
bit :)

... because, then, there is also a subtle difference between
folio_set_dirty() and folio_mark_dirty(), and I am still confused about
the difference and not competent enough to explain it ... and KVM always
does the former, while the zapping code for pagecache folios does the
latter ... hm.

Related note: IIRC, we usually expect most anon folios to be dirty.
kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
SetPageDirty()->folio_set_dirty(). Doing a test-before-set might
frequently avoid atomic ops.

>>>> I once had the following idea, but I am not sure about all implications;
>>>> I just wanted to raise it because it matches the topic here:
>>>>
>>>> Secondary page tables kind-of behave like "HW" access. If there is a
>>>> write access, we would expect the original PTE to become dirty, not the
>>>> mapped folio.
>>>
>>> Propagating SPTE dirty bits to folios indirectly via the primary MMU
>>> PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
>>> order to avoid having two different ways to propagate SPTE dirty bits,
>>> KVM should probably be responsible for updating the folio directly.
>>
>> But who really cares about accessed/dirty bits for guest_memfd?
>>
>> guest_memfd already wants to disable/bypass all of core MM, so different
>> rules are to be expected. This discussion is all about integration with
>> core MM that relies on correct dirty bits, which does not really apply to
>> guest_memfd.
>
> +1, this is one of many reasons I don't want to support swap/reclaim for
> guest_memfd. The instant guest_memfd gets involved with anything LRU-related,
> the interactions and complexity skyrocket.

Agreed ... but it's unfortunate :/
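A minimal sketch of the test-before-set idea raised above, using existing folio API calls; the function name is illustrative, not KVM's current kvm_set_page_dirty(). Whether the saved atomic op matters in practice would need measuring.

	/*
	 * Sketch only: skip the atomic RMW when the folio is already
	 * dirty, which should be the common case for anon folios.
	 */
	static void mark_folio_dirty_once(struct folio *folio)
	{
		if (!folio_test_dirty(folio))
			folio_set_dirty(folio);
	}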
On Wed, Apr 03, 2024, David Hildenbrand wrote:
> On 03.04.24 02:17, Sean Christopherson wrote:
> > On Tue, Apr 02, 2024, David Hildenbrand wrote:
> > Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> > also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
> > that it needs to play nice with a GUP'd page being marked dirty before the
> > reference is dropped.
> >
> >	 * Must be careful with the order of the tests. When someone has
> >	 * a ref to the folio, it may be possible that they dirty it then
> >	 * drop the reference. So if the dirty flag is tested before the
> >	 * refcount here, then the following race may occur:
> >
> > So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> > code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> > while holding a reference or (b) marks the folio dirty before returning from its
> > mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> > mappings in response to mmu_notifier_invalidate_range_start().
>
> Yes, I agree that it should work in the context of vmscan. But (b) is
> certainly a bit harder to swallow than "ordinary" (a) :)

Heh, all the more reason to switch KVM x86 from (b) => (a).

> As raised, if having a writable SPTE implied having a writable+dirty PTE,
> then KVM MMU code wouldn't have to worry about syncing any dirty bits ever
> back to core MM, so patch #2 would not be required. ... well, it would be
> replaced by an MMU notifier that notifies about clearing the PTE dirty
> bit :)

Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
to be invalidated before consuming dirty status. Isn't the end result essentially
a sane FOLL_TOUCH?

> ... because, then, there is also a subtle difference between
> folio_set_dirty() and folio_mark_dirty(), and I am still confused about
> the difference and not competent enough to explain it ... and KVM always
> does the former, while the zapping code for pagecache folios does the
> latter ... hm.

Ugh, just when I thought I finally had my head wrapped around this.

> Related note: IIRC, we usually expect most anon folios to be dirty.
>
> kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
> SetPageDirty()->folio_set_dirty(). Doing a test-before-set might
> frequently avoid atomic ops.

Noted, definitely worth poking at.
On 04.04.24 00:19, Sean Christopherson wrote:
> On Wed, Apr 03, 2024, David Hildenbrand wrote:
>> On 03.04.24 02:17, Sean Christopherson wrote:
>>> On Tue, Apr 02, 2024, David Hildenbrand wrote:
[...]
>> Yes, I agree that it should work in the context of vmscan. But (b) is
>> certainly a bit harder to swallow than "ordinary" (a) :)
>
> Heh, all the more reason to switch KVM x86 from (b) => (a).

:)

>> As raised, if having a writable SPTE implied having a writable+dirty PTE,
>> then KVM MMU code wouldn't have to worry about syncing any dirty bits ever
>> back to core MM, so patch #2 would not be required. ... well, it would be
>> replaced by an MMU notifier that notifies about clearing the PTE dirty
>> bit :)
>
> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> to be invalidated before consuming dirty status. Isn't the end result essentially
> a sane FOLL_TOUCH?

Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.

Having something that makes sure the writable PTE/PMD is dirty (or
alternatively sets it dirty), paired with MMU notifiers notifying on any
mkclean, would be one option that leaves the handling of folio dirtying
completely to the core. It would behave just like a CPU write going
through the page table, which would set the PTE dirty bit.

Of course, if frequent clearing of the dirty PTE/PMD bit were a problem
(like we discussed for the accessed bit), that would not be an option. But
from what I recall, clearing only the PTE/PMD dirty bit is rather rare.

I think this GUP / GUP-user setting of folio dirty bits is a bit of a
mess, but it seems to be working so far, so ... who am I to judge :)
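A hypothetical sketch of the option described above: the secondary-MMU mapping path makes the primary MMU's PTE look like it was touched by hardware, instead of dirtying the folio. The function name is invented, the caller is assumed to hold the PTL, and a real implementation would also have to reason about TLB flushing.

	/*
	 * Hypothetical sketch, name invented; caller holds the PTL. Mimic
	 * a HW access: make the PTE young, and also dirty for a writable
	 * mapping.
	 */
	static int pte_mimic_hw_access(struct mm_struct *mm, unsigned long addr,
				       pte_t *ptep, bool write)
	{
		pte_t ptent = ptep_get(ptep);

		if (!pte_present(ptent) || (write && !pte_write(ptent)))
			return -EAGAIN;	/* let the caller take a real fault */

		if (!pte_young(ptent) || (write && !pte_dirty(ptent))) {
			/*
			 * Clear the PTE atomically so that accessed/dirty
			 * updates racing in from HW are folded in rather
			 * than lost.
			 */
			ptent = ptep_get_and_clear(mm, addr, ptep);
			ptent = pte_mkyoung(ptent);
			if (write)
				ptent = pte_mkdirty(ptent);
			set_pte_at(mm, addr, ptep, ptent);
		}
		return 0;
	}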
On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 00:19, Sean Christopherson wrote:
> > Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> > to be invalidated before consuming dirty status. Isn't the end result essentially
> > a sane FOLL_TOUCH?
>
> Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
>
> Having something that makes sure the writable PTE/PMD is dirty (or
> alternatively sets it dirty), paired with MMU notifiers notifying on any
> mkclean, would be one option that leaves the handling of folio dirtying
> completely to the core. It would behave just like a CPU write going
> through the page table, which would set the PTE dirty bit.
>
> Of course, if frequent clearing of the dirty PTE/PMD bit were a problem
> (like we discussed for the accessed bit), that would not be an option. But
> from what I recall, clearing only the PTE/PMD dirty bit is rather rare.

And AFAICT, all cases already invalidate secondary MMUs anyway, so if anything
it would probably be a net positive, e.g. the notification could more precisely
say that SPTEs need to be made read-only, not blasted away completely.
On 04.04.24 19:31, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 00:19, Sean Christopherson wrote:
[...]
>> Of course, if frequent clearing of the dirty PTE/PMD bit were a problem
>> (like we discussed for the accessed bit), that would not be an option. But
>> from what I recall, clearing only the PTE/PMD dirty bit is rather rare.
>
> And AFAICT, all cases already invalidate secondary MMUs anyway, so if anything
> it would probably be a net positive, e.g. the notification could more precisely
> say that SPTEs need to be made read-only, not blasted away completely.

As discussed, I think at least madvise_free_pte_range() wouldn't do that.
Notifiers would only get called later, when actually zapping the folio.

So, at least for some time, you would have the PTE not dirty, but the SPTE
writable or even dirty. You'd then have to set the page dirty when zapping
the SPTE ... and IMHO that is what we should maybe try to avoid :)
On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 19:31, Sean Christopherson wrote:
> > On Thu, Apr 04, 2024, David Hildenbrand wrote:
> > > On 04.04.24 00:19, Sean Christopherson wrote:
[...]
> > And AFAICT, all cases already invalidate secondary MMUs anyway, so if anything
> > it would probably be a net positive, e.g. the notification could more precisely
> > say that SPTEs need to be made read-only, not blasted away completely.
>
> As discussed, I think at least madvise_free_pte_range() wouldn't do that.

I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
would do in this future world, or what madvise_free_pte_range() does today? Because
today, unless I'm really misreading the code, secondary MMUs are invalidated before
the dirty bit is cleared.

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
				range.start, range.end);

	lru_add_drain();
	tlb_gather_mmu(&tlb, mm);
	update_hiwater_rss(mm);

	mmu_notifier_invalidate_range_start(&range);
	tlb_start_vma(&tlb, vma);
	walk_page_range(vma->vm_mm, range.start, range.end,
			&madvise_free_walk_ops, &tlb);
	tlb_end_vma(&tlb, vma);
	mmu_notifier_invalidate_range_end(&range);

KVM (or any other secondary MMU) can re-establish a mapping with W=1,D=0 in the
PTE, but the costly invalidation (zap+flush+fault) still happens.

> Notifiers would only get called later, when actually zapping the folio.

And in case we're talking about a hypothetical future, I was thinking the above
could do MMU_NOTIFY_WRITE_PROTECT instead of MMU_NOTIFY_CLEAR.

> So, at least for some time, you would have the PTE not dirty, but the SPTE
> writable or even dirty. You'd then have to set the page dirty when zapping
> the SPTE ... and IMHO that is what we should maybe try to avoid :)
On 05.04.24 00:02, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 19:31, Sean Christopherson wrote:
>>> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>>>> On 04.04.24 00:19, Sean Christopherson wrote:
[...]
>> As discussed, I think at least madvise_free_pte_range() wouldn't do that.
>
> I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
> would do in this future world, or what madvise_free_pte_range() does today? Because
> today, unless I'm really misreading the code, secondary MMUs are invalidated before
> the dirty bit is cleared.

Likely I missed it, sorry! I was talking about the possible future. :)

>	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
>				range.start, range.end);
>
>	lru_add_drain();
>	tlb_gather_mmu(&tlb, mm);
>	update_hiwater_rss(mm);
>
>	mmu_notifier_invalidate_range_start(&range);
>	tlb_start_vma(&tlb, vma);
>	walk_page_range(vma->vm_mm, range.start, range.end,
>			&madvise_free_walk_ops, &tlb);
>	tlb_end_vma(&tlb, vma);
>	mmu_notifier_invalidate_range_end(&range);

Indeed, we do set up the MMU notifier invalidation. We do the start/end
... I was looking for PTE notifications.

I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
other code. Maybe that's not required here (digging through the
documentation I'm still left clueless):

"
set_pte_at_notify() sets the pte _after_ running the notifier. This is
safe to start by updating the secondary MMUs, because the primary MMU
pte invalidate must have already happened with a ptep_clear_flush()
before set_pte_at_notify() has been invoked. ...
"

Absolutely unclear to me when we *must* use it, or whether we must at
all. Likely it's a pure optimization for when we're *changing* a PTE.

KSM and wp_page_copy() use it with MMU_NOTIFY_CLEAR, when replacing the
mapped page with another page (or write-protecting an existing one).
__replace_page() uses it with MMU_NOTIFY_CLEAR when replacing the mapped
page. And migrate_vma_insert_page() uses it with MMU_NOTIFY_MIGRATE.

Other code (e.g., khugepaged) doesn't seem to use it at all.

... so I guess it's fine? Confusing.
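For reference, set_pte_at_notify() as of the kernels under discussion is just set_pte_at() preceded by the .change_pte() notification (from include/linux/mmu_notifier.h, lightly trimmed), which is why it can only ever be an optimization on top of the invalidation:

	#define set_pte_at_notify(__mm, __address, __ptep, __pte)	\
	({								\
		struct mm_struct *___mm = __mm;				\
		unsigned long ___address = __address;			\
		pte_t ___pte = __pte;					\
									\
		mmu_notifier_change_pte(___mm, ___address, ___pte);	\
		set_pte_at(___mm, ___address, __ptep, ___pte);		\
	})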
On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <david@redhat.com> wrote:
> > mmu_notifier_invalidate_range_start(&range);
> > tlb_start_vma(&tlb, vma);
> > walk_page_range(vma->vm_mm, range.start, range.end,
> > 		&madvise_free_walk_ops, &tlb);
> > tlb_end_vma(&tlb, vma);
> > mmu_notifier_invalidate_range_end(&range);
>
> Indeed, we do set up the MMU notifier invalidation. We do the start/end
> ... I was looking for PTE notifications.
>
> I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
> other code. Maybe that's not required here (digging through the
> documentation I'm still left clueless). [...]
> Absolutely unclear to me when we *must* use it, or whether we must at
> all. Likely it's a pure optimization for when we're *changing* a PTE.

Yes, .change_pte() is just an optimization. The original point of it
was for KSM, so that KVM could flip the sPTE to a new location without
first zapping it. At the time there was also an .invalidate_page()
callback, and both of them were *not* bracketed by calls to
mmu_notifier_invalidate_range_{start,end}().

Later on, both callbacks were changed to occur within an
invalidate_range_start/end() block.

Commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) did so
for .change_pte(). The reason to do so was a bit roundabout, namely to
allow sleepable .invalidate_page() hooks (because .change_pte() is not
sleepable and at the time .invalidate_page() was used as a fallback
for .change_pte()).

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of succeeding.

(Commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2", 2017-08-31) is where the other set of non-bracketed calls to MMU
notifier callbacks was changed; calls to
mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range(), bracketed by calls to
mmu_notifier_invalidate_range_{start,end}().)

Since KVM is the only user of .change_pte(), we can remove
.change_pte() and set_pte_at_notify() completely.

Paolo
On 05.04.24 11:37, Paolo Bonzini wrote:
> On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> Since KVM is the only user of .change_pte(), we can remove
> .change_pte() and set_pte_at_notify() completely.

Nice, thanks for all that information!
On Fri, Apr 05, 2024, David Hildenbrand wrote:
> On 05.04.24 11:37, Paolo Bonzini wrote:
> > On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> > Since KVM is the only user of .change_pte(), we can remove
> > .change_pte() and set_pte_at_notify() completely.
>
> Nice, thanks for all that information!

Ya, from commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
.change_pte()"):

    x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
    is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
    which again should be a nop due to the invalidation. arm64 is a bit
    murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
    called without a cache pointer, which means it will map an entry if and
    only if an existing PTE was found.

I'm 100% in favor of removing .change_pte(). As I've said multiple times, the
only reason I haven't sent a patch is because I didn't want it to prompt someone
into resurrecting the original behavior. :-)
On 4/5/24 15:59, Sean Christopherson wrote:
> Ya, from commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
> .change_pte()"):
>
>     x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
>     is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
>     which again should be a nop due to the invalidation. arm64 is a bit
>     murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
>     called without a cache pointer, which means it will map an entry if and
>     only if an existing PTE was found.
>
> I'm 100% in favor of removing .change_pte(). As I've said multiple times, the
> only reason I haven't sent a patch is because I didn't want it to prompt someone
> into resurrecting the original behavior.