Message ID: 20240320005024.3216282-1-seanjc@google.com
Series: KVM: x86/mmu: Rework marking folios dirty/accessed
On 20.03.24 01:50, Sean Christopherson wrote:
> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> i.e. when creating mappings for KVM guests, instead of when zapping or
> modifying SPTEs, e.g. when dropping mappings.
>
> The motivation is twofold:
>
> 1. Marking folios dirty and accessed when zapping can be extremely
>    expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>    512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>    dirty and accessed for all 512*512 SPTEs.
>
> 2. x86 diverges from literally every other architecture, which updates
>    folios when mappings are created. AFAIK, x86 is unique in that it's
>    the only KVM arch that prefetches PTEs, so it's not quite an apples-
>    to-apples comparison, but I don't see any reason for the dirty logic
>    in particular to be different.

Sorry in advance for the lengthy reply.

For "ordinary" process page tables on x86, the behavior is as follows:

1) A page might be mapped writable but the PTE might not be dirty. Once
   written to, HW will set the PTE dirty bit.

2) A page might be mapped but the PTE might not be young. Once accessed,
   HW will set the PTE young bit.

3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
   PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
   the folio (folio_mark_accessed()). The latter is done conditionally
   only (vma_has_recency()).

   BUT, when zapping an anon folio, we don't do that, because there
   zapping implies "gone for good" and not "content must go to a file".

4) When temporarily unmapping a folio for migration/swapout, we
   primarily only move the dirty PTE bit to the folio.

GUP is different, because the PTEs might change after we pinned the page
and wrote to it. We don't modify the PTEs and expect the GUP user to do
the right thing (set dirty/accessed). For example,
unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
where the PTE might be long gone.

So GUP does not really behave like HW access.

Secondary page tables are different from ordinary GUP, and KVM ends up
using GUP to some degree to simulate HW access; regarding NUMA-hinting,
KVM already turned out to be very different from all other GUP users. [1]

And I recall that at some point I raised that we might want to have a
dedicated interface for these "mmu-notifier"-based page table
synchronization mechanisms.

But KVM ends up setting folio dirty/accessed flags itself, like other GUP
users. I do wonder if secondary page tables should be messing with folio
flags *at all*, and if there would be ways to do it differently using PTEs.

We make sure to synchronize the secondary page tables to the process
page tables using MMU notifiers: when we write-protect/unmap a PTE, we
write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
differently.

I once had the following idea, but I am not sure about all implications;
I just wanted to raise it because it matches the topic here:

Secondary page tables kind-of behave like "HW" access. If there is a
write access, we would expect the original PTE to become dirty, not the
mapped folio.

1) When KVM wants to map a page into the secondary page table, we
   require the PTE to be young (like a HW access). The SPTE can remain
   old.

2) When KVM wants to map a page writable into the secondary page table,
   we require the PTE to be dirty (like a HW access). The SPTE can
   remain clean.
3) When core MM clears the PTE dirty/young bit, we notify the secondary
   page table to adjust: for example, if the dirty bit gets cleared, the
   page cannot be writable in the secondary MMU.

4) GUP-fast cannot set the PTE dirty/young, so we would fall back to slow
   GUP, where we hold the PTL, and simply modify the PTE to have the
   accessed/dirty bit set.

5) Prefetching would similarly be limited to that (only prefetch if the
   PTE is already dirty etc.).

6) Dirty/accessed bits no longer have to be synced from the secondary
   page table to the process page table, because an SPTE being dirty
   implies that the PTE is dirty.

One tricky bit -- and the reason why ordinary GUP modifies the folio and
not the PTE -- is concurrent HW access. For example, when we want to mark
a PTE accessed, it could happen that HW concurrently tries marking the
PTE dirty. We must not lose that update, so we have to guarantee an
atomic update (maybe avoidable in some cases).

What would be the implications? We'd leave setting folio flags to the MM
core. That also implies that if you shut down a VM and zap all anon
folios, you wouldn't have to mark any folio dirty: the PTE is dirty, and
the MM core can decide to ignore that flag since it will discard the page
either way.

Downsides? Likely many I have not yet thought about (TLB flushes etc.).
Just mentioning it because, in the context of [1], I was wondering whether
something that uses MMU notifiers should really be messing with
dirty/young flags :)

> I tagged this RFC as it is barely tested, and because I'm not 100% positive
> there isn't some weird edge case I'm missing, which is why I Cc'd David H.
> and Matthew.

We'd be in trouble if someone detected that all PTEs are clean and
consequently cleared the folio dirty flag (for example, after writeback).
Then, we would write using the SPTE and the folio+PTE would be clean. If
we then evicted the "clean" folio that is actually dirty, we would be in
trouble.

Well, we would set the SPTE dirty flag, I guess. But I cannot immediately
tell whether that one would be synced back to the folio. Would we have a
mechanism in place to prevent that?

> Note, I'm going to be offline from ~now until April 1st. I rushed this out
> as it could impact David S.'s kvm_follow_pfn series[*], which is imminent.
> E.g. if KVM stops marking pages dirty and accessed everywhere, adding
> SPTE_MMU_PAGE_REFCOUNTED just to sanity check that the refcount is elevated
> seems like a poor tradeoff (medium complexity and annoying to maintain, for
> not much benefit).
>
> Regarding David S.'s series, I wouldn't be at all opposed to going even
> further and having x86 follow all architectures by marking pages accessed
> _only_ at map time, at which point I think KVM could simply pass in FOLL_TOUCH
> as appropriate, and thus dedup a fair bit of arch code.

FOLL_TOUCH is weird (excluding the weird devmap stuff):

1) For PTEs (follow_page_pte), we set the page dirty and accessed, and do
   not modify the PTE. For THP (follow_trans_huge_pmd), we set the PMD
   young/dirty and don't mess with the folio.

2) FOLL_TOUCH is not implemented for hugetlb.

3) FOLL_TOUCH is not implemented for GUP-fast.

I'd leave that alone :)

[1] https://lore.kernel.org/lkml/20230727212845.135673-1-david@redhat.com/T/#u
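To make the contrast concrete: below is a minimal sketch of the zap-time behavior described at the top of this mail, loosely following zap_present_folio_ptes() in mm/memory.c. Locking, TLB flushing, and rmap bookkeeping are omitted, and the helper name is illustrative, not the kernel's.

	/*
	 * Illustrative sketch, not the actual kernel code: when zapping
	 * a present PTE, transfer its accessed/dirty bits to the folio.
	 */
	static void transfer_pte_bits_to_folio(struct vm_area_struct *vma,
					       struct folio *folio, pte_t ptent)
	{
		/* Anon folios are "gone for good"; nothing to write back. */
		if (folio_test_anon(folio))
			return;

		if (pte_dirty(ptent))
			folio_mark_dirty(folio);

		/* Accessed state is only worth preserving for some mappings. */
		if (pte_young(ptent) && vma_has_recency(vma))
			folio_mark_accessed(folio);
	}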
On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.03.24 01:50, Sean Christopherson wrote:
> > Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> > i.e. when creating mappings for KVM guests, instead of when zapping or
> > modifying SPTEs, e.g. when dropping mappings.
[...]
> We make sure to synchronize the secondary page tables to the process
> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
> differently.

Accessed bits have the test/clear young MMU notifiers. But I agree
there aren't any notifiers for dirty tracking.

Are there any cases where the primary MMU transfers the PTE dirty bit
to the folio _other_ than zapping (which already sends an MMU notifier
to KVM)? If not, then there might not be any reason to add a new
notifier. Instead, the contract should just be that secondary MMUs must
also transfer their dirty bits to folios in sync with (or before) the
primary MMU zapping its PTE.
>
> I once had the following idea, but I am not sure about all implications;
> I just wanted to raise it because it matches the topic here:
>
> Secondary page tables kind-of behave like "HW" access. If there is a
> write access, we would expect the original PTE to become dirty, not the
> mapped folio.

Propagating SPTE dirty bits to folios indirectly via the primary MMU
PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
order to avoid having two different ways to propagate SPTE dirty bits,
KVM should probably be responsible for updating the folio directly.
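A hypothetical sketch of the contract described above, with invented helper names (mmu_spte_clear() and spte_is_dirty() do not exist in KVM); the only point is the ordering, i.e. that the secondary MMU moves its dirty state to the folio no later than when it drops the mapping in response to the invalidation.

	/*
	 * Hypothetical sketch; both helpers are invented names. Called
	 * from the secondary MMU's invalidation path, before the primary
	 * MMU finishes zapping its PTE.
	 */
	static void secondary_mmu_zap_mapping(u64 *sptep, struct folio *folio)
	{
		u64 old_spte = mmu_spte_clear(sptep);	/* invented */

		if (spte_is_dirty(old_spte))		/* invented */
			folio_mark_dirty(folio);
	}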
On 02.04.24 19:38, David Matlack wrote:
> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.03.24 01:50, Sean Christopherson wrote:
>>> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
>>> i.e. when creating mappings for KVM guests, instead of when zapping or
>>> modifying SPTEs, e.g. when dropping mappings.
[...]
>> We make sure to synchronize the secondary page tables to the process
>> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
>> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
>> differently.
>
> Accessed bits have the test/clear young MMU notifiers. But I agree
> there aren't any notifiers for dirty tracking.

Yes, and I am questioning whether the "test" part should exist -- or
whether having an SPTE in the secondary MMU should require the access bit
to be set (derived from the primary MMU).
(Again, see my explanation above about fake HW page table walkers.)

There might be a good reason to do it the current way nowadays, so I'm
only raising it as something I was wondering about. Likely, frequent
clearing of the access bit would result in many PTEs in the secondary MMU
getting invalidated, requiring a new GUP-fast lookup where we would set
the access bit in the primary MMU PTE. But I'm not an expert on the
implications of MMU notifiers and access bit clearing.

> Are there any cases where the primary MMU transfers the PTE dirty bit
> to the folio _other_ than zapping (which already sends an MMU notifier
> to KVM)? If not, then there might not be any reason to add a new
> notifier. Instead, the contract should just be that secondary MMUs must
> also transfer their dirty bits to folios in sync with (or before) the
> primary MMU zapping its PTE.

Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
notifiers, because they either clear the PTE or also remove write
permissions.

But there are madvise_free_pte_range() and
clean_record_shared_mapping_range()...->clean_record_pte(), which might
only clear the dirty bit without clearing/changing permissions and
consequently without calling MMU notifiers.

Getting a writable PTE without the dirty bit set should be possible.

So I am questioning whether having a writable PTE in the secondary MMU
with a clean PTE in the primary MMU should be valid. It can exist today,
and I am not sure if that's the right approach.

>> I once had the following idea, but I am not sure about all implications;
>> I just wanted to raise it because it matches the topic here:
>>
>> Secondary page tables kind-of behave like "HW" access. If there is a
>> write access, we would expect the original PTE to become dirty, not the
>> mapped folio.
>
> Propagating SPTE dirty bits to folios indirectly via the primary MMU
> PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
> order to avoid having two different ways to propagate SPTE dirty bits,
> KVM should probably be responsible for updating the folio directly.

But who really cares about accessed/dirty bits for guest_memfd?

guest_memfd already wants to disable/bypass all of core MM, so different
rules are to be expected. This discussion is all about integration with
core MM that relies on correct dirty bits, which does not really apply to
guest_memfd.
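For reference, the PTE manipulation in question, abridged from madvise_free_pte_range() in mm/madvise.c as of the kernels under discussion: the dirty and young bits are cleared and the PTE is rewritten, with no per-PTE notification to secondary MMUs.

	if (pte_young(ptent) || pte_dirty(ptent)) {
		/*
		 * Clear the PTE atomically, so that accessed/dirty bits
		 * set concurrently by HW are not lost, then rewrite it
		 * as old and clean.
		 */
		ptent = ptep_get_and_clear_full(mm, addr, pte,
						tlb->fullmm);

		ptent = pte_mkold(ptent);
		ptent = pte_mkclean(ptent);
		set_pte_at(mm, addr, pte, ptent);
		tlb_remove_tlb_entry(tlb, pte, addr);
	}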
On Tue, Apr 02, 2024, David Hildenbrand wrote:
> On 02.04.24 19:38, David Matlack wrote:
> > On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> > Accessed bits have the test/clear young MMU notifiers. But I agree
> > there aren't any notifiers for dirty tracking.
>
> Yes, and I am questioning whether the "test" part should exist -- or
> whether having an SPTE in the secondary MMU should require the access bit
> to be set (derived from the primary MMU). (Again, see my explanation above
> about fake HW page table walkers.)
>
> There might be a good reason to do it the current way nowadays, so I'm
> only raising it as something I was wondering about. Likely, frequent
> clearing of the access bit would result in many PTEs in the secondary MMU
> getting invalidated, requiring a new GUP-fast lookup where we would set
> the access bit in the primary MMU PTE. But I'm not an expert on the
> implications of MMU notifiers and access bit clearing.

Ya. KVM already does something similar to this for dirty logging, where SPTEs are
write-protected, i.e. generate faults to track dirty status. But KVM x86 has a
lot of code to mitigate the resulting pain, e.g. has a lockless fast path to
make the SPTE writable and propagate the dirty status to bitmaps, and userspace
is heavily/carefully optimized to balance harvesting/clearing dirty status against
guest activity.

In general, assumptions that core MM makes about the cost of PTE modifications
tend to fall apart for KVM. Much of that comes down to mmu_notifier invalidations
being an imperfect boundary between the primary MMU and secondary MMUs. E.g.
any prot changes (NUMA balancing in particular) can have a disastrous impact on
KVM guests, because those changes invalidate secondary MMUs at PMD granularity
and effectively force KVM to do a remote TLB flush for every PMD that is affected,
whereas the primary MMU is able to batch TLB flushes until the entire VMA is
processed.

So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would be
comically slow for KVM guests without a crazy amount of optimization.

> > Are there any cases where the primary MMU transfers the PTE dirty bit
> > to the folio _other_ than zapping (which already sends an MMU notifier
> > to KVM)? If not, then there might not be any reason to add a new
> > notifier. Instead, the contract should just be that secondary MMUs must
> > also transfer their dirty bits to folios in sync with (or before) the
> > primary MMU zapping its PTE.
>
> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
> notifiers, because they either clear the PTE or also remove write
> permissions.
>
> But there are madvise_free_pte_range() and
> clean_record_shared_mapping_range()...->clean_record_pte(), which might
> only clear the dirty bit without clearing/changing permissions and
> consequently without calling MMU notifiers.

Heh, I was staring at these earlier today. They are all wrapped with
mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper-aggressive response
to mmu_notifier invalidations ensures all KVM SPTEs get dropped.

> Getting a writable PTE without the dirty bit set should be possible.
>
> So I am questioning whether having a writable PTE in the secondary MMU
> with a clean PTE in the primary MMU should be valid. It can exist today,
> and I am not sure if that's the right approach.

I came to the same conclusion (that it can exist today). Without (sane)
FOLL_TOUCH, KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier
invalidation, and thus end up with a writable SPTE that isn't dirty in core MM.

For MADV_FREE, I don't see how KVM's current behavior of marking the folio dirty
at zap time changes anything. Ah, try_to_unmap_one() checks folio_test_dirty()
*after* invoking mmu_notifier_invalidate_range_start(), which ensures that KVM's
dirty status is propagated to the folio, and thus try_to_unmap_one() keeps the
folio.

Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
that it needs to play nice with a GUP'd page being marked dirty before the
reference is dropped.

	 * Must be careful with the order of the tests. When someone has
	 * a ref to the folio, it may be possible that they dirty it then
	 * drop the reference. So if the dirty flag is tested before the
	 * refcount here, then the following race may occur:

So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
while holding a reference or (b) marks the folio dirty before returning from its
mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
mappings in response to mmu_notifier_invalidate_range_start().

> > > I once had the following idea, but I am not sure about all implications;
> > > I just wanted to raise it because it matches the topic here:
> > >
> > > Secondary page tables kind-of behave like "HW" access. If there is a
> > > write access, we would expect the original PTE to become dirty, not the
> > > mapped folio.
> >
> > Propagating SPTE dirty bits to folios indirectly via the primary MMU
> > PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
> > order to avoid having two different ways to propagate SPTE dirty bits,
> > KVM should probably be responsible for updating the folio directly.
>
> But who really cares about accessed/dirty bits for guest_memfd?
>
> guest_memfd already wants to disable/bypass all of core MM, so different
> rules are to be expected. This discussion is all about integration with
> core MM that relies on correct dirty bits, which does not really apply to
> guest_memfd.

+1, this is one of many reasons I don't want to support swap/reclaim for
guest_memfd. The instant guest_memfd gets involved with anything LRU-related,
the interactions and complexity skyrocket.
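The ordering the quoted comment refers to, abridged from __remove_mapping() in mm/vmscan.c: the refcount is frozen *before* the dirty flag is tested, so a GUP user that dirties the folio and then drops its reference cannot slip through.

	/* Abridged from __remove_mapping(); surrounding context omitted. */
	refcount = 1 + folio_nr_pages(folio);
	if (!folio_ref_freeze(folio, refcount))
		goto cannot_free;

	/* The cmpxchg in folio_ref_freeze() provides the smp_rmb(). */
	if (unlikely(folio_test_dirty(folio))) {
		folio_ref_unfreeze(folio, refcount);
		goto cannot_free;
	}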
On 03.04.24 02:17, Sean Christopherson wrote:
> On Tue, Apr 02, 2024, David Hildenbrand wrote:
>> On 02.04.24 19:38, David Matlack wrote:
>>> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@redhat.com> wrote:
[...]
>> There might be a good reason to do it the current way nowadays, so I'm
>> only raising it as something I was wondering about. Likely, frequent
>> clearing of the access bit would result in many PTEs in the secondary MMU
>> getting invalidated, requiring a new GUP-fast lookup where we would set
>> the access bit in the primary MMU PTE. But I'm not an expert on the
>> implications of MMU notifiers and access bit clearing.
>
> Ya. KVM already does something similar to this for dirty logging, where SPTEs are
> write-protected, i.e. generate faults to track dirty status. But KVM x86 has a
> lot of code to mitigate the resulting pain, e.g. has a lockless fast path to
> make the SPTE writable and propagate the dirty status to bitmaps, and userspace
> is heavily/carefully optimized to balance harvesting/clearing dirty status against
> guest activity.
>
> In general, assumptions that core MM makes about the cost of PTE modifications
> tend to fall apart for KVM. Much of that comes down to mmu_notifier invalidations
> being an imperfect boundary between the primary MMU and secondary MMUs. E.g.
> any prot changes (NUMA balancing in particular) can have a disastrous impact on
> KVM guests, because those changes invalidate secondary MMUs at PMD granularity
> and effectively force KVM to do a remote TLB flush for every PMD that is affected,
> whereas the primary MMU is able to batch TLB flushes until the entire VMA is
> processed.
>
> So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would be
> comically slow for KVM guests without a crazy amount of optimization.

Right, so access bits are certainly special. Fortunately, they are not as
important to get right as dirty bits.

>>> Are there any cases where the primary MMU transfers the PTE dirty bit
>>> to the folio _other_ than zapping (which already sends an MMU notifier
>>> to KVM)? If not, then there might not be any reason to add a new
>>> notifier.
>>> Instead, the contract should just be that secondary MMUs must
>>> also transfer their dirty bits to folios in sync with (or before) the
>>> primary MMU zapping its PTE.
>>
>> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
>> notifiers, because they either clear the PTE or also remove write
>> permissions.
>>
>> But there are madvise_free_pte_range() and
>> clean_record_shared_mapping_range()...->clean_record_pte(), which might
>> only clear the dirty bit without clearing/changing permissions and
>> consequently without calling MMU notifiers.
>
> Heh, I was staring at these earlier today. They are all wrapped with
> mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper-aggressive response
> to mmu_notifier invalidations ensures all KVM SPTEs get dropped.
>
>> Getting a writable PTE without the dirty bit set should be possible.
>>
>> So I am questioning whether having a writable PTE in the secondary MMU
>> with a clean PTE in the primary MMU should be valid. It can exist today,
>> and I am not sure if that's the right approach.
>
> I came to the same conclusion (that it can exist today). Without (sane)
> FOLL_TOUCH, KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier
> invalidation, and thus end up with a writable SPTE that isn't dirty in core MM.
>
> For MADV_FREE, I don't see how KVM's current behavior of marking the folio dirty
> at zap time changes anything. Ah, try_to_unmap_one() checks folio_test_dirty()
> *after* invoking mmu_notifier_invalidate_range_start(), which ensures that KVM's
> dirty status is propagated to the folio, and thus try_to_unmap_one() keeps the
> folio.

Right, we have to check whether it has been re-dirtied. And any unexpected
reference (i.e., GUP) could dirty it.

> Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
> that it needs to play nice with a GUP'd page being marked dirty before the
> reference is dropped.
>
>	 * Must be careful with the order of the tests. When someone has
>	 * a ref to the folio, it may be possible that they dirty it then
>	 * drop the reference. So if the dirty flag is tested before the
>	 * refcount here, then the following race may occur:
>
> So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> while holding a reference or (b) marks the folio dirty before returning from its
> mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> mappings in response to mmu_notifier_invalidate_range_start().

Yes, I agree that it should work in the context of vmscan. But (b) is
certainly a bit harder to swallow than "ordinary" (a) :)

As raised, if having a writable SPTE implied having a writable+dirty PTE,
then KVM MMU code wouldn't have to worry about syncing any dirty bits ever
back to core MM, so patch #2 would not be required. ... well, it would be
replaced by an MMU notifier that notifies about clearing the PTE dirty
bit :)

... because, then, there is also a subtle difference between
folio_set_dirty() and folio_mark_dirty(), and I am still confused about
the difference and not competent enough to explain it ... and KVM always
does the former, while the zapping code for pagecache folios does the
latter ... hm.

Related note: IIRC, we usually expect most anon folios to be dirty.
kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
SetPageDirty()->folio_set_dirty(). Doing a test-before-set might
frequently avoid atomic ops.

>>>> I once had the following idea, but I am not sure about all implications;
>>>> I just wanted to raise it because it matches the topic here:
>>>>
>>>> Secondary page tables kind-of behave like "HW" access. If there is a
>>>> write access, we would expect the original PTE to become dirty, not the
>>>> mapped folio.
>>>
>>> Propagating SPTE dirty bits to folios indirectly via the primary MMU
>>> PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
>>> order to avoid having two different ways to propagate SPTE dirty bits,
>>> KVM should probably be responsible for updating the folio directly.
>>
>> But who really cares about accessed/dirty bits for guest_memfd?
>>
>> guest_memfd already wants to disable/bypass all of core MM, so different
>> rules are to be expected. This discussion is all about integration with
>> core MM that relies on correct dirty bits, which does not really apply to
>> guest_memfd.
>
> +1, this is one of many reasons I don't want to support swap/reclaim for
> guest_memfd. The instant guest_memfd gets involved with anything LRU-related,
> the interactions and complexity skyrocket.

Agreed ... but it's unfortunate :/
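A minimal sketch of the test-before-set idea raised above, using existing folio API calls; the function name is illustrative, not KVM's current kvm_set_page_dirty(). Whether the saved atomic op matters in practice would need measuring.

	/*
	 * Sketch only: skip the atomic RMW when the folio is already
	 * dirty, which should be the common case for anon folios.
	 */
	static void mark_folio_dirty_once(struct folio *folio)
	{
		if (!folio_test_dirty(folio))
			folio_set_dirty(folio);
	}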
On Wed, Apr 03, 2024, David Hildenbrand wrote:
> On 03.04.24 02:17, Sean Christopherson wrote:
> > On Tue, Apr 02, 2024, David Hildenbrand wrote:
> > Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> > also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
> > that it needs to play nice with a GUP'd page being marked dirty before the
> > reference is dropped.
> >
> >	 * Must be careful with the order of the tests. When someone has
> >	 * a ref to the folio, it may be possible that they dirty it then
> >	 * drop the reference. So if the dirty flag is tested before the
> >	 * refcount here, then the following race may occur:
> >
> > So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> > code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> > while holding a reference or (b) marks the folio dirty before returning from its
> > mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> > mappings in response to mmu_notifier_invalidate_range_start().
>
> Yes, I agree that it should work in the context of vmscan. But (b) is
> certainly a bit harder to swallow than "ordinary" (a) :)

Heh, all the more reason to switch KVM x86 from (b) => (a).

> As raised, if having a writable SPTE implied having a writable+dirty PTE,
> then KVM MMU code wouldn't have to worry about syncing any dirty bits ever
> back to core MM, so patch #2 would not be required. ... well, it would be
> replaced by an MMU notifier that notifies about clearing the PTE dirty
> bit :)

Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
to be invalidated before consuming dirty status. Isn't the end result essentially
a sane FOLL_TOUCH?

> ... because, then, there is also a subtle difference between
> folio_set_dirty() and folio_mark_dirty(), and I am still confused about
> the difference and not competent enough to explain it ... and KVM always
> does the former, while the zapping code for pagecache folios does the
> latter ... hm.

Ugh, just when I thought I finally had my head wrapped around this.

> Related note: IIRC, we usually expect most anon folios to be dirty.
>
> kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
> SetPageDirty()->folio_set_dirty(). Doing a test-before-set might
> frequently avoid atomic ops.

Noted, definitely worth poking at.
On 04.04.24 00:19, Sean Christopherson wrote:
> On Wed, Apr 03, 2024, David Hildenbrand wrote:
>> On 03.04.24 02:17, Sean Christopherson wrote:
>>> On Tue, Apr 02, 2024, David Hildenbrand wrote:
[...]
>> Yes, I agree that it should work in the context of vmscan. But (b) is
>> certainly a bit harder to swallow than "ordinary" (a) :)
>
> Heh, all the more reason to switch KVM x86 from (b) => (a).

:)

>> As raised, if having a writable SPTE implied having a writable+dirty PTE,
>> then KVM MMU code wouldn't have to worry about syncing any dirty bits ever
>> back to core MM, so patch #2 would not be required. ... well, it would be
>> replaced by an MMU notifier that notifies about clearing the PTE dirty
>> bit :)
>
> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> to be invalidated before consuming dirty status. Isn't the end result essentially
> a sane FOLL_TOUCH?

Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.

Having something that makes sure the writable PTE/PMD is dirty (or
alternatively sets it dirty), paired with MMU notifiers notifying on any
mkclean, would be one option that leaves the handling of folio dirtying
completely to the core. It would behave just like a CPU write going
through the page table, which would set the PTE dirty bit.

Of course, if frequent clearing of the dirty PTE/PMD bit were a problem
(like we discussed for the accessed bit), that would not be an option. But
from what I recall, clearing only the PTE/PMD dirty bit is rather rare.

I think this GUP / GUP-user setting of folio dirty bits is a bit of a
mess, but it seems to be working so far, so ... who am I to judge :)
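A hypothetical sketch of the option described above: the secondary-MMU mapping path makes the primary MMU's PTE look like it was touched by hardware, instead of dirtying the folio. The function name is invented, the caller is assumed to hold the PTL, and a real implementation would also have to reason about TLB flushing.

	/*
	 * Hypothetical sketch, name invented; caller holds the PTL. Mimic
	 * a HW access: make the PTE young, and also dirty for a writable
	 * mapping.
	 */
	static int pte_mimic_hw_access(struct mm_struct *mm, unsigned long addr,
				       pte_t *ptep, bool write)
	{
		pte_t ptent = ptep_get(ptep);

		if (!pte_present(ptent) || (write && !pte_write(ptent)))
			return -EAGAIN;	/* let the caller take a real fault */

		if (!pte_young(ptent) || (write && !pte_dirty(ptent))) {
			/*
			 * Clear the PTE atomically so that accessed/dirty
			 * updates racing in from HW are folded in rather
			 * than lost.
			 */
			ptent = ptep_get_and_clear(mm, addr, ptep);
			ptent = pte_mkyoung(ptent);
			if (write)
				ptent = pte_mkdirty(ptent);
			set_pte_at(mm, addr, ptep, ptent);
		}
		return 0;
	}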
On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 00:19, Sean Christopherson wrote:
> > Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> > to be invalidated before consuming dirty status. Isn't the end result essentially
> > a sane FOLL_TOUCH?
>
> Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
>
> Having something that makes sure the writable PTE/PMD is dirty (or
> alternatively sets it dirty), paired with MMU notifiers notifying on any
> mkclean, would be one option that leaves the handling of folio dirtying
> completely to the core. It would behave just like a CPU write going
> through the page table, which would set the PTE dirty bit.
>
> Of course, if frequent clearing of the dirty PTE/PMD bit were a problem
> (like we discussed for the accessed bit), that would not be an option. But
> from what I recall, clearing only the PTE/PMD dirty bit is rather rare.

And AFAICT, all cases already invalidate secondary MMUs anyway, so if anything
it would probably be a net positive, e.g. the notification could more precisely
say that SPTEs need to be made read-only, not blasted away completely.
On 04.04.24 19:31, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 00:19, Sean Christopherson wrote:
[...]
>> Of course, if frequent clearing of the dirty PTE/PMD bit were a problem
>> (like we discussed for the accessed bit), that would not be an option. But
>> from what I recall, clearing only the PTE/PMD dirty bit is rather rare.
>
> And AFAICT, all cases already invalidate secondary MMUs anyway, so if anything
> it would probably be a net positive, e.g. the notification could more precisely
> say that SPTEs need to be made read-only, not blasted away completely.

As discussed, I think at least madvise_free_pte_range() wouldn't do that.
Notifiers would only get called later, when actually zapping the folio.

So, at least for some time, you would have the PTE not dirty, but the SPTE
writable or even dirty. You'd then have to set the page dirty when zapping
the SPTE ... and IMHO that is what we should maybe try to avoid :)
On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 19:31, Sean Christopherson wrote:
> > On Thu, Apr 04, 2024, David Hildenbrand wrote:
> > > On 04.04.24 00:19, Sean Christopherson wrote:
[...]
> > And AFAICT, all cases already invalidate secondary MMUs anyway, so if anything
> > it would probably be a net positive, e.g. the notification could more precisely
> > say that SPTEs need to be made read-only, not blasted away completely.
>
> As discussed, I think at least madvise_free_pte_range() wouldn't do that.

I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
would do in this future world, or what madvise_free_pte_range() does today? Because
today, unless I'm really misreading the code, secondary MMUs are invalidated before
the dirty bit is cleared.

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
				range.start, range.end);

	lru_add_drain();
	tlb_gather_mmu(&tlb, mm);
	update_hiwater_rss(mm);

	mmu_notifier_invalidate_range_start(&range);
	tlb_start_vma(&tlb, vma);
	walk_page_range(vma->vm_mm, range.start, range.end,
			&madvise_free_walk_ops, &tlb);
	tlb_end_vma(&tlb, vma);
	mmu_notifier_invalidate_range_end(&range);

KVM (or any other secondary MMU) can re-establish a mapping with W=1,D=0 in the
PTE, but the costly invalidation (zap+flush+fault) still happens.

> Notifiers would only get called later, when actually zapping the folio.

And in case we're talking about a hypothetical future, I was thinking the above
could do MMU_NOTIFY_WRITE_PROTECT instead of MMU_NOTIFY_CLEAR.

> So, at least for some time, you would have the PTE not dirty, but the SPTE
> writable or even dirty. You'd then have to set the page dirty when zapping
> the SPTE ... and IMHO that is what we should maybe try to avoid :)
On 05.04.24 00:02, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 19:31, Sean Christopherson wrote:
>>> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>>>> On 04.04.24 00:19, Sean Christopherson wrote:
[...]
>> As discussed, I think at least madvise_free_pte_range() wouldn't do that.
>
> I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
> would do in this future world, or what madvise_free_pte_range() does today? Because
> today, unless I'm really misreading the code, secondary MMUs are invalidated before
> the dirty bit is cleared.

Likely I missed it, sorry! I was talking about the possible future. :)

>	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
>				range.start, range.end);
>
>	lru_add_drain();
>	tlb_gather_mmu(&tlb, mm);
>	update_hiwater_rss(mm);
>
>	mmu_notifier_invalidate_range_start(&range);
>	tlb_start_vma(&tlb, vma);
>	walk_page_range(vma->vm_mm, range.start, range.end,
>			&madvise_free_walk_ops, &tlb);
>	tlb_end_vma(&tlb, vma);
>	mmu_notifier_invalidate_range_end(&range);

Indeed, we do set up the MMU notifier invalidation. We do the start/end
... I was looking for PTE notifications.

I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
other code. Maybe that's not required here (digging through the
documentation I'm still left clueless):

"
set_pte_at_notify() sets the pte _after_ running the notifier. This is
safe to start by updating the secondary MMUs, because the primary MMU
pte invalidate must have already happened with a ptep_clear_flush()
before set_pte_at_notify() has been invoked. ...
"

Absolutely unclear to me when we *must* use it, or whether we must at
all. Likely it's a pure optimization for when we're *changing* a PTE.

KSM and wp_page_copy() use it with MMU_NOTIFY_CLEAR, when replacing the
mapped page with another page (or write-protecting an existing one).
__replace_page() uses it with MMU_NOTIFY_CLEAR when replacing the mapped
page. And migrate_vma_insert_page() uses it with MMU_NOTIFY_MIGRATE.

Other code (e.g., khugepaged) doesn't seem to use it at all.

... so I guess it's fine? Confusing.
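For reference, set_pte_at_notify() as of the kernels under discussion is just set_pte_at() preceded by the .change_pte() notification (from include/linux/mmu_notifier.h, lightly trimmed), which is why it can only ever be an optimization on top of the invalidation:

	#define set_pte_at_notify(__mm, __address, __ptep, __pte)	\
	({								\
		struct mm_struct *___mm = __mm;				\
		unsigned long ___address = __address;			\
		pte_t ___pte = __pte;					\
									\
		mmu_notifier_change_pte(___mm, ___address, ___pte);	\
		set_pte_at(___mm, ___address, __ptep, ___pte);		\
	})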
On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <david@redhat.com> wrote:
> > mmu_notifier_invalidate_range_start(&range);
> > tlb_start_vma(&tlb, vma);
> > walk_page_range(vma->vm_mm, range.start, range.end,
> > 		&madvise_free_walk_ops, &tlb);
> > tlb_end_vma(&tlb, vma);
> > mmu_notifier_invalidate_range_end(&range);
>
> Indeed, we do set up the MMU notifier invalidation. We do the start/end
> ... I was looking for PTE notifications.
>
> I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
> other code. Maybe that's not required here (digging through the
> documentation I'm still left clueless). [...]
> Absolutely unclear to me when we *must* use it, or whether we must at
> all. Likely it's a pure optimization for when we're *changing* a PTE.

Yes, .change_pte() is just an optimization. The original point of it
was for KSM, so that KVM could flip the sPTE to a new location without
first zapping it. At the time there was also an .invalidate_page()
callback, and both of them were *not* bracketed by calls to
mmu_notifier_invalidate_range_{start,end}().

Later on, both callbacks were changed to occur within an
invalidate_range_start/end() block.

Commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) did so
for .change_pte(). The reason to do so was a bit roundabout, namely to
allow sleepable .invalidate_page() hooks (because .change_pte() is not
sleepable and at the time .invalidate_page() was used as a fallback
for .change_pte()).

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of succeeding.

(Commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2", 2017-08-31) is where the other set of non-bracketed calls to MMU
notifier callbacks was changed; calls to
mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range(), bracketed by calls to
mmu_notifier_invalidate_range_{start,end}().)

Since KVM is the only user of .change_pte(), we can remove
.change_pte() and set_pte_at_notify() completely.

Paolo
On 05.04.24 11:37, Paolo Bonzini wrote:
> On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> Since KVM is the only user of .change_pte(), we can remove
> .change_pte() and set_pte_at_notify() completely.

Nice, thanks for all that information!
On Fri, Apr 05, 2024, David Hildenbrand wrote:
> On 05.04.24 11:37, Paolo Bonzini wrote:
> > On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> > Since KVM is the only user of .change_pte(), we can remove
> > .change_pte() and set_pte_at_notify() completely.
>
> Nice, thanks for all that information!

Ya, from commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
.change_pte()"):

    x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
    is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
    which again should be a nop due to the invalidation. arm64 is a bit
    murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
    called without a cache pointer, which means it will map an entry if and
    only if an existing PTE was found.

I'm 100% in favor of removing .change_pte(). As I've said multiple times, the
only reason I haven't sent a patch is because I didn't want it to prompt someone
into resurrecting the original behavior. :-)
On 4/5/24 15:59, Sean Christopherson wrote:
> Ya, from commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
> .change_pte()"):
>
>     x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
>     is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
>     which again should be a nop due to the invalidation. arm64 is a bit
>     murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
>     called without a cache pointer, which means it will map an entry if and
>     only if an existing PTE was found.
>
> I'm 100% in favor of removing .change_pte(). As I've said multiple times, the
> only reason I haven't sent a patch is because I didn't want it to prompt someone
> into resurrecting the original behavior.