Message ID | 20240611002145.2078921-5-jthoughton@google.com |
---|---|
State | New |
Series | mm: multi-gen LRU: Walk secondary MMU page tables while aging |
On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote: > > This new notifier is for multi-gen LRU specifically Let me call it out before others do: we can't be this self-serving. > as it wants to be > able to get and clear age information from secondary MMUs only if it can > be done "fast". > > By having this notifier specifically created for MGLRU, what "fast" > means comes down to what is "fast" enough to improve MGLRU's ability to > reclaim most of the time. > > Signed-off-by: James Houghton <jthoughton@google.com> If we'd like this to pass other MM reviewers, especially the MMU notifier maintainers, we'd need to design a generic API that can benefit all the *existing* users: idle page tracking [1], DAMON [2] and MGLRU. Also I personally prefer to extend the existing callbacks by adding new parameters, and on top of that, I'd try to consolidate the existing callbacks -- it'd be less of a hard sell if my changes result in less code, not more. (v2 did all these, btw.) [1] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html [2] https://www.kernel.org/doc/html/latest/mm/damon/index.html
On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <yuzhao@google.com> wrote: > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote: > > > > This new notifier is for multi-gen LRU specifically > > Let me call it out before others do: we can't be this self-serving. > > > as it wants to be > > able to get and clear age information from secondary MMUs only if it can > > be done "fast". > > > > By having this notifier specifically created for MGLRU, what "fast" > > means comes down to what is "fast" enough to improve MGLRU's ability to > > reclaim most of the time. > > > > Signed-off-by: James Houghton <jthoughton@google.com> > > If we'd like this to pass other MM reviewers, especially the MMU > notifier maintainers, we'd need to design a generic API that can > benefit all the *existing* users: idle page tracking [1], DAMON [2] > and MGLRU. > > Also I personally prefer to extend the existing callbacks by adding > new parameters, and on top of that, I'd try to consolidate the > existing callbacks -- it'd be less of a hard sell if my changes result > in less code, not more. > > (v2 did all these, btw.) I think consolidating the callbacks is cleanest, like you had it in v2. I really wasn't sure about this change honestly, but it was my attempt to incorporate feedback like this[3] from v4. I'll consolidate the callbacks like you had in v2. Instead of the bitmap like you had, I imagine we'll have some kind of flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR, MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does that sound ok? Do idle page tracking and DAMON need this new "fast-only" notifier? Or do they benefit from a generic API in other ways? Sorry if I missed this from some other mail. I've got feedback saying that tying the definition of "fast" to MGLRU specifically is helpful. So instead of MMU_NOTIFIER_YOUNG_FAST_ONLY, maybe MMU_NOTIFIER_YOUNG_LRU_GEN_FAST to mean "do fast-for-MGLRU notifier". It sounds like you'd prefer the more generic one. Thanks for the feedback -- I don't want to keep this series lingering on the list, so I'll try and get newer versions out sooner rather than later. [3]: https://lore.kernel.org/linux-mm/Zl5LqcusZ88QOGQY@google.com/ > > [1] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html > [2] https://www.kernel.org/doc/html/latest/mm/damon/index.html
On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote: > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote: > > > > > > This new notifier is for multi-gen LRU specifically > > > > Let me call it out before others do: we can't be this self-serving. Establishing motivation for a change is always a good idea. The wording could be a bit crisper, but the connection between the new MMU notifier and MGLRU is valuable. I do not view the wording of the changeset as excluding other users of the 'fast' notifier. > I think consolidating the callbacks is cleanest, like you had it in > v2. I really wasn't sure about this change honestly, but it was my > attempt to incorporate feedback like this[3] from v4. I'll consolidate > the callbacks like you had in v2. My strong preference is to have the callers expectations of the secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this abundantly clear both at the callsite and in the implementation. > Instead of the bitmap like you had, I imagine we'll have some kind of > flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR, > MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does > that sound ok? > > Do idle page tracking and DAMON need this new "fast-only" notifier? Or > do they benefit from a generic API in other ways? Sorry if I missed > this from some other mail. Let's also keep in mind we aren't establishing an ABI here. If we have direct line of sight (i.e. patches) on how to leverage the new MMU notifier for DAMON and idle page tracking then great, let's try and build something that satisfies all users. Otherwise, it isn't that big of a deal if the interface needs to change slightly when someone decides to leverage the MMU notifier for something else. > I've got feedback saying that tying the definition of "fast" to MGLRU > specifically is helpful. So instead of MMU_NOTIFIER_YOUNG_FAST_ONLY, > maybe MMU_NOTIFIER_YOUNG_LRU_GEN_FAST to mean "do fast-for-MGLRU > notifier". It sounds like you'd prefer the more generic one. > > Thanks for the feedback -- I don't want to keep this series lingering > on the list, so I'll try and get newer versions out sooner rather than > later. Let's make sure we get alignment on this before you proceed, I don't get the sense that we're getting to a common understanding of where to go with this.
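To be explicit about the shape I mean: a second op sitting next to the existing
one in mmu_notifier_ops, e.g. (sketch only -- the _fast_only op doesn't exist
today and the exact name doesn't matter):

	/* Existing op: may block / take "slow" locks as needed. */
	int (*clear_young)(struct mmu_notifier *subscription,
			   struct mm_struct *mm,
			   unsigned long start,
			   unsigned long end);

	/* Hypothetical dedicated op: cheap, best-effort aging only. */
	int (*clear_young_fast_only)(struct mmu_notifier *subscription,
				     struct mm_struct *mm,
				     unsigned long start,
				     unsigned long end);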
On Tue, Jun 11, 2024, James Houghton wrote: > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote: > > > > > > This new notifier is for multi-gen LRU specifically > > > > Let me call it out before others do: we can't be this self-serving. > > > > > as it wants to be > > > able to get and clear age information from secondary MMUs only if it can > > > be done "fast". > > > > > > By having this notifier specifically created for MGLRU, what "fast" > > > means comes down to what is "fast" enough to improve MGLRU's ability to > > > reclaim most of the time. > > > > > > Signed-off-by: James Houghton <jthoughton@google.com> > > > > If we'd like this to pass other MM reviewers, especially the MMU > > notifier maintainers, we'd need to design a generic API that can > > benefit all the *existing* users: idle page tracking [1], DAMON [2] > > and MGLRU. > > > > Also I personally prefer to extend the existing callbacks by adding > > new parameters, and on top of that, I'd try to consolidate the > > existing callbacks -- it'd be less of a hard sell if my changes result > > in less code, not more. > > > > (v2 did all these, btw.) > > I think consolidating the callbacks is cleanest, like you had it in > v2. I really wasn't sure about this change honestly, but it was my > attempt to incorporate feedback like this[3] from v4. I'll consolidate > the callbacks like you had in v2. James, wait for others to chime in before committing yourself to a course of action, otherwise you're going to get ping-ponged to hell and back. > Instead of the bitmap like you had, I imagine we'll have some kind of > flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR, > MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does > that sound ok? Why do we need a bundle of flags? If we extend .clear_young() and .test_young() as Yu suggests, then we only need a single "bool fast_only". As for adding a fast_only versus dedicated APIs, I don't have a strong preference. Extending will require a small amount of additional churn, e.g. to pass in false, but that doesn't seem problematic on its own. On the plus side, there would be less copy+paste in include/linux/mmu_notifier.h (though that could be solved with macros :-) ). E.g. 
--
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7b77ad6cf833..07872ae00fa6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 
 int __mmu_notifier_clear_young(struct mm_struct *mm,
 			       unsigned long start,
-			       unsigned long end)
+			       unsigned long end,
+			       bool fast_only)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->clear_young)
-			young |= subscription->ops->clear_young(subscription,
-								mm, start, end);
+		if (!subscription->ops->clear_young ||
+		    fast_only && !subscription->ops->has_fast_aging)
+			continue;
+
+		young |= subscription->ops->clear_young(subscription,
+							mm, start, end);
 	}
 	srcu_read_unlock(&srcu, id);
 
@@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 }
 
 int __mmu_notifier_test_young(struct mm_struct *mm,
-			      unsigned long address)
+			      unsigned long address,
+			      bool fast_only)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->test_young) {
-			young = subscription->ops->test_young(subscription, mm,
-							      address);
-			if (young)
-				break;
-		}
+		if (!subscription->ops->test_young)
+			continue;
+
+		if (fast_only && !subscription->ops->has_fast_aging)
+			continue;
+
+		young = subscription->ops->test_young(subscription, mm, address);
+		if (young)
+			break;
 	}
 	srcu_read_unlock(&srcu, id);
On Tue, Jun 11, 2024, Oliver Upton wrote:
> On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > I think consolidating the callbacks is cleanest, like you had it in
> > v2. I really wasn't sure about this change honestly, but it was my
> > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > the callbacks like you had in v2.
>
> My strong preference is to have the callers expectations of the
> secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> abundantly clear both at the callsite and in the implementation.

Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
new ptep_clear_young_notify_fast_only() does the obvious.

On the back end, odds are very good KVM is going to squish the "fast" and "slow"
paths back into a common helper, so IMO having dedicated fast_only() APIs for the
mmu_notifier hooks doesn't add much value in the end.

I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
that passing a fast_only flag will end up being cleaner for all parties.
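E.g. the callsite plumbing could be as simple as this (sketch only; it assumes
mmu_notifier_clear_young() grows the fast_only parameter from the diff earlier
in the thread, and the __ptep_clear_young_notify() helper name is invented here):

#define __ptep_clear_young_notify(__vma, __address, __ptep, __fast_only)	\
({										\
	int __young;								\
	struct vm_area_struct *___vma = __vma;					\
	unsigned long ___address = __address;					\
	__young = ptep_test_and_clear_young(___vma, ___address, __ptep);	\
	__young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,		\
					    ___address + PAGE_SIZE,		\
					    __fast_only);			\
	__young;								\
})

#define ptep_clear_young_notify(__vma, __address, __ptep)			\
	__ptep_clear_young_notify(__vma, __address, __ptep, false)

#define ptep_clear_young_notify_fast_only(__vma, __address, __ptep)		\
	__ptep_clear_young_notify(__vma, __address, __ptep, true)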
On Tue, Jun 11, 2024 at 10:50 AM James Houghton <jthoughton@google.com> wrote: > > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote: > > > > > > This new notifier is for multi-gen LRU specifically > > > > Let me call it out before others do: we can't be this self-serving. > > > > > as it wants to be > > > able to get and clear age information from secondary MMUs only if it can > > > be done "fast". > > > > > > By having this notifier specifically created for MGLRU, what "fast" > > > means comes down to what is "fast" enough to improve MGLRU's ability to > > > reclaim most of the time. > > > > > > Signed-off-by: James Houghton <jthoughton@google.com> > > > > If we'd like this to pass other MM reviewers, especially the MMU > > notifier maintainers, we'd need to design a generic API that can > > benefit all the *existing* users: idle page tracking [1], DAMON [2] > > and MGLRU. > > > > Also I personally prefer to extend the existing callbacks by adding > > new parameters, and on top of that, I'd try to consolidate the > > existing callbacks -- it'd be less of a hard sell if my changes result > > in less code, not more. > > > > (v2 did all these, btw.) > > I think consolidating the callbacks is cleanest, like you had it in > v2. I really wasn't sure about this change honestly, but it was my > attempt to incorporate feedback like this[3] from v4. I'll consolidate > the callbacks like you had in v2. Only if you think updating the MMU notifier API is overall the best option. As I said earlier, I don't mind not doing the extra work, i.e., the bitmap/_fast_only parameters and the _FAST_YOUNG return value, as long as we know where we win and lose, i.e., simplicity and regressions. I would call all the potential regressions (the nested VM case, the arm64 case and the memcached benchmark) minor, compared with what we got from v2. But my opinion doesn't really matter much because I'm not your customer :)
On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Jun 11, 2024, James Houghton wrote: > > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > This new notifier is for multi-gen LRU specifically > > > > > > Let me call it out before others do: we can't be this self-serving. > > > > > > > as it wants to be > > > > able to get and clear age information from secondary MMUs only if it can > > > > be done "fast". > > > > > > > > By having this notifier specifically created for MGLRU, what "fast" > > > > means comes down to what is "fast" enough to improve MGLRU's ability to > > > > reclaim most of the time. > > > > > > > > Signed-off-by: James Houghton <jthoughton@google.com> > > > > > > If we'd like this to pass other MM reviewers, especially the MMU > > > notifier maintainers, we'd need to design a generic API that can > > > benefit all the *existing* users: idle page tracking [1], DAMON [2] > > > and MGLRU. > > > > > > Also I personally prefer to extend the existing callbacks by adding > > > new parameters, and on top of that, I'd try to consolidate the > > > existing callbacks -- it'd be less of a hard sell if my changes result > > > in less code, not more. > > > > > > (v2 did all these, btw.) > > > > I think consolidating the callbacks is cleanest, like you had it in > > v2. I really wasn't sure about this change honestly, but it was my > > attempt to incorporate feedback like this[3] from v4. I'll consolidate > > the callbacks like you had in v2. > > James, wait for others to chime in before committing yourself to a course of > action, otherwise you're going to get ping-ponged to hell and back. Ah yeah. I really mean "I'll do it, provided the other feedback is in line with this". > > > Instead of the bitmap like you had, I imagine we'll have some kind of > > flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR, > > MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does > > that sound ok? > > Why do we need a bundle of flags? If we extend .clear_young() and .test_young() > as Yu suggests, then we only need a single "bool fast_only". We don't need to. In my head it's a little easier to collapse them (slightly less code, and at the callsite you have a flag with a name instead of a true/false). Making it a bool SGTM. > As for adding a fast_only versus dedicated APIs, I don't have a strong preference. > Extending will require a small amount of additional churn, e.g. to pass in false, > but that doesn't seem problematic on its own. On the plus side, there would be > less copy+paste in include/linux/mmu_notifier.h (though that could be solved with > macros :-) ). I think having the extra bool is cleaner than the new fast_only notifier, definitely. > > E.g. 
> > -- > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > index 7b77ad6cf833..07872ae00fa6 100644 > --- a/mm/mmu_notifier.c > +++ b/mm/mmu_notifier.c > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > > int __mmu_notifier_clear_young(struct mm_struct *mm, > unsigned long start, > - unsigned long end) > + unsigned long end, > + bool fast_only) > { > struct mmu_notifier *subscription; > int young = 0, id; > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, > hlist_for_each_entry_rcu(subscription, > &mm->notifier_subscriptions->list, hlist, > srcu_read_lock_held(&srcu)) { > - if (subscription->ops->clear_young) > - young |= subscription->ops->clear_young(subscription, > - mm, start, end); > + if (!subscription->ops->clear_young || > + fast_only && !subscription->ops->has_fast_aging) > + continue; > + > + young |= subscription->ops->clear_young(subscription, > + mm, start, end); KVM changing has_fast_aging dynamically would be slow, wouldn't it? I feel like it's simpler to just pass in fast_only into `clear_young` itself (and this is how I interpreted what you wrote above anyway). > } > srcu_read_unlock(&srcu, id); > > @@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, > } > > int __mmu_notifier_test_young(struct mm_struct *mm, > - unsigned long address) > + unsigned long address, > + bool fast_only) > { > struct mmu_notifier *subscription; > int young = 0, id; > @@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm, > hlist_for_each_entry_rcu(subscription, > &mm->notifier_subscriptions->list, hlist, > srcu_read_lock_held(&srcu)) { > - if (subscription->ops->test_young) { > - young = subscription->ops->test_young(subscription, mm, > - address); > - if (young) > - break; > - } > + if (!subscription->ops->test_young) > + continue; > + > + if (fast_only && !subscription->ops->has_fast_aging) > + continue; > + > + young = subscription->ops->test_young(subscription, mm, address); > + if (young) > + break; > } > srcu_read_unlock(&srcu, id); > -- > > It might also require multiplexing the return value to differentiate between > "young" and "failed". Ugh, but the code already does that, just in a bespoke way. Yeah, that is necessary. > Double ugh. Peeking ahead at the "failure" code, NAK to adding > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like > that, I will NAK each every attempt to have core mm/ code call directly into KVM. Sorry to make you repeat yourself; I'll leave it out of v6. I don't like it either, but I wasn't sure how important it was to avoid calling into unnecessary notifiers if the TDP MMU were completely disabled. > Anyways, back to this code, before we spin another version, we need to agree on > exactly what behavior we want out of secondary MMUs. Because to me, the behavior > proposed in this version doesn't make any sense. > > Signalling failure because KVM _might_ have relevant aging information in SPTEs > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then > KVM should return "young", not "failed". Sorry for this oversight. What about something like: 1. test (and maybe clear) A bits on TDP MMU 2. If accessed && !should_clear: return (fast) 3. if (fast_only): return (fast) 4. If !(must check shadow MMU): return (fast) 5. 
test (and maybe clear) A bits in shadow MMU 6. return (slow) Some of this reordering (and maybe a change from kvm_shadow_root_allocated() to checking indirect_shadow_pages or something else) can be done in its own patch. > If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps > if and only if L1 ran a nested VM at some point. But as proposed, KVM doesn't > actually check if there are any shadow TDP entries to process. That could be > fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear > that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense. > > E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to > L2, then "failure" is pretty much guaranteed to a false positive. And even for > the pages that are exposed to L2, "failure" will occur if and only if the pages > are being accessed _only_ by L2. > > There most definitely are use cases where the majority of a VM's memory is accessed > only by L2. But if those use cases are performing poorly under MGLRU, then IMO > we should figure out a way to enhance KVM to do a fast harvest of nested TDP > Accessed information, not make MGRLU+KVM suck for a VMs that run nested VMs. This makes sense. I don't have data today to say that we would get a huge win from speeding up harvesting Accessed information from the shadow MMU would be helpful. Getting this information for the TDP MMU is at least better than no information at all. > > Oh, and calling into mmu_notifiers to do the "slow" version if the fast version > fails is suboptimal. Agreed. I didn't like this when I wrote it. This can be easily fixed by making mmu_notifier_clear_young() return "fast" and "young or not", which I will do. > So rather than failing the fast aging, I think what we want is to know if an > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps() > is an optional optimization to avoid taking mmu_lock for write in paths where a > (very rare) false negative is acceptable. > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm) > { > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages); > } > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range, > bool fast_only) > { > int young = 0; > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) { > write_lock(&kvm->mmu_lock); > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > write_unlock(&kvm->mmu_lock); > } > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range)) > young = 1 | MMU_NOTIFY_WAS_FAST; I don't think this line is quite right. We might set MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what you mean though, thanks. > > return (int)young; > } > > and then in lru_gen_look_around(): > > if (spin_is_contended(pvmw->ptl)) > return false; > > /* exclude special VMAs containing anon pages from COW */ > if (vma->vm_flags & VM_SPECIAL) > return false; > > young = ptep_clear_young_notify(vma, addr, pte); > if (!young) > return false; > > if (!(young & MMU_NOTIFY_WAS_FAST)) > return true; > > young = 1; > > with the lookaround done using ptep_clear_young_notify_fast(). > > The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without > needing to update all users of ptep_clear_young_notify() and friends. Sounds good to me. Thanks for all the feedback!
On Tue, Jun 11, 2024, James Houghton wrote: > On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > -- > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > > index 7b77ad6cf833..07872ae00fa6 100644 > > --- a/mm/mmu_notifier.c > > +++ b/mm/mmu_notifier.c > > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > > > > int __mmu_notifier_clear_young(struct mm_struct *mm, > > unsigned long start, > > - unsigned long end) > > + unsigned long end, > > + bool fast_only) > > { > > struct mmu_notifier *subscription; > > int young = 0, id; > > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, > > hlist_for_each_entry_rcu(subscription, > > &mm->notifier_subscriptions->list, hlist, > > srcu_read_lock_held(&srcu)) { > > - if (subscription->ops->clear_young) > > - young |= subscription->ops->clear_young(subscription, > > - mm, start, end); > > + if (!subscription->ops->clear_young || > > + fast_only && !subscription->ops->has_fast_aging) > > + continue; > > + > > + young |= subscription->ops->clear_young(subscription, > > + mm, start, end); > > KVM changing has_fast_aging dynamically would be slow, wouldn't it? No, it could/would be done quite quickly. But, I'm not suggesting has_fast_aging be dynamic, i.e. it's not an "all aging is guaranteed to be fast", it's a "this MMU _can_ do fast aging". It's a bit fuzzy/weird mostly because KVM can essentially have multiple secondary MMUs wired up to the same mmu_notifier. > I feel like it's simpler to just pass in fast_only into `clear_young` itself > (and this is how I interpreted what you wrote above anyway). Eh, maybe? A "has_fast_aging" flag is more robust in the sense that it requires secondary MMUs to opt-in, i.e. all secondary MMUs will be considered "slow" by default. It's somewhat of a moot point because KVM is the only secondary MMU that implements .clear_young() and .test_young() (which I keep forgetting), and that seems unlikely to change. A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y, i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's probably a pretty unlikely combination, so it's probably not a valid argument. So, I guess I don't have a strong opinion? > > Double ugh. Peeking ahead at the "failure" code, NAK to adding > > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to > > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like > > that, I will NAK each every attempt to have core mm/ code call directly into KVM. > > Sorry to make you repeat yourself; I'll leave it out of v6. I don't > like it either, but I wasn't sure how important it was to avoid > calling into unnecessary notifiers if the TDP MMU were completely > disabled. If it's important, e.g. for performance, then the mmu_notifier should have a flag so that the behavior doesn't assume a KVM backend. Hence my has_fast_aging suggestion. > > Anyways, back to this code, before we spin another version, we need to agree on > > exactly what behavior we want out of secondary MMUs. Because to me, the behavior > > proposed in this version doesn't make any sense. > > > > Signalling failure because KVM _might_ have relevant aging information in SPTEs > > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young > > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then > > KVM should return "young", not "failed". > > Sorry for this oversight. 
> What about something like:
>
> 1. test (and maybe clear) A bits on TDP MMU
> 2. If accessed && !should_clear: return (fast)
> 3. if (fast_only): return (fast)
> 4. If !(must check shadow MMU): return (fast)
> 5. test (and maybe clear) A bits in shadow MMU
> 6. return (slow)

I don't understand where the "must check shadow MMU" in #4 comes from. I also
don't think it's necessary; see below.

> Some of this reordering (and maybe a change from
> kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> something else) can be done in its own patch.
>
> > So rather than failing the fast aging, I think what we want is to know if an
> > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > (very rare) false negative is acceptable.
> >
> > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > {
> > 	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > }
> >
> > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > 			 bool fast_only)
> > {
> > 	int young = 0;
> >
> > 	if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > 		write_lock(&kvm->mmu_lock);
> > 		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > 		write_unlock(&kvm->mmu_lock);
> > 	}
> >
> > 	if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > 		young = 1 | MMU_NOTIFY_WAS_FAST;
>
> I don't think this line is quite right. We might set
> MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
> you mean though, thanks.

The name sucks, but I believe the logic is correct. As posted here in v5, the
MGLRU code wants to age both fast _and_ slow MMUs. AIUI, the intent is to always
get aging information, but only look around at other PTEs if it can be done fast.

	if (should_walk_secondary_mmu()) {
		notifier_result =
			mmu_notifier_test_clear_young_fast_only(
				vma->vm_mm, addr, addr + PAGE_SIZE,
				/*clear=*/true);
	}

	if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
		secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
							   addr + PAGE_SIZE);
	else {
		secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
		notifier_was_fast = true;
	}

The change, relative to v5, that I am proposing is that MGLRU looks around if
the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
only if _all_ secondary MMUs are fast.

In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
fast_only flag.
On Tue, Jun 11, 2024 at 12:49:49PM -0700, Sean Christopherson wrote: > On Tue, Jun 11, 2024, Oliver Upton wrote: > > On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote: > > > I think consolidating the callbacks is cleanest, like you had it in > > > v2. I really wasn't sure about this change honestly, but it was my > > > attempt to incorporate feedback like this[3] from v4. I'll consolidate > > > the callbacks like you had in v2. > > > > My strong preference is to have the callers expectations of the > > secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this > > abundantly clear both at the callsite and in the implementation. > > Partially agreed. We don't need a dedicated mmu_notifier API to achieve that > for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a > new ptep_clear_young_notify_fast_only() does the obvious. > > On the back end, odds are very good KVM is going to squish the "fast" and "slow" > paths back into a common helper, so IMO having dedicated fast_only() APIs for the > mmu_notifier hooks doesn't add much value in the end. > > I'm not opposed to dedicated hooks, but I after poking around a bit, I suspect > that passing a fast_only flag will end up being less cleaner for all parties. Yeah, I think I'm headed in the same direction after actually reading the MM side of this, heh.
On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Jun 11, 2024, James Houghton wrote: > > On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > > -- > > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > > > index 7b77ad6cf833..07872ae00fa6 100644 > > > --- a/mm/mmu_notifier.c > > > +++ b/mm/mmu_notifier.c > > > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > > > > > > int __mmu_notifier_clear_young(struct mm_struct *mm, > > > unsigned long start, > > > - unsigned long end) > > > + unsigned long end, > > > + bool fast_only) > > > { > > > struct mmu_notifier *subscription; > > > int young = 0, id; > > > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, > > > hlist_for_each_entry_rcu(subscription, > > > &mm->notifier_subscriptions->list, hlist, > > > srcu_read_lock_held(&srcu)) { > > > - if (subscription->ops->clear_young) > > > - young |= subscription->ops->clear_young(subscription, > > > - mm, start, end); > > > + if (!subscription->ops->clear_young || > > > + fast_only && !subscription->ops->has_fast_aging) > > > + continue; > > > + > > > + young |= subscription->ops->clear_young(subscription, > > > + mm, start, end); > > > > KVM changing has_fast_aging dynamically would be slow, wouldn't it? > > No, it could/would be done quite quickly. But, I'm not suggesting has_fast_aging > be dynamic, i.e. it's not an "all aging is guaranteed to be fast", it's a "this > MMU _can_ do fast aging". It's a bit fuzzy/weird mostly because KVM can essentially > have multiple secondary MMUs wired up to the same mmu_notifier. > > > I feel like it's simpler to just pass in fast_only into `clear_young` itself > > (and this is how I interpreted what you wrote above anyway). > > Eh, maybe? A "has_fast_aging" flag is more robust in the sense that it requires > secondary MMUs to opt-in, i.e. all secondary MMUs will be considered "slow" by > default. > > It's somewhat of a moot point because KVM is the only secondary MMU that implements > .clear_young() and .test_young() (which I keep forgetting), and that seems unlikely > to change. > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y, > i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's > probably a pretty unlikely combination, so it's probably not a valid argument. > > So, I guess I don't have a strong opinion? (Sorry for the somewhat delayed response... spent some time actually writing what this would look like.) I see what you mean, thanks! So has_fast_aging might be set by KVM if the architecture sets a Kconfig saying that it understands the concept of fast aging, basically what the presence of this v5's test_clear_young_fast_only() indicates. > > > > Double ugh. Peeking ahead at the "failure" code, NAK to adding > > > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to > > > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like > > > that, I will NAK each every attempt to have core mm/ code call directly into KVM. > > > > Sorry to make you repeat yourself; I'll leave it out of v6. I don't > > like it either, but I wasn't sure how important it was to avoid > > calling into unnecessary notifiers if the TDP MMU were completely > > disabled. > > If it's important, e.g. for performance, then the mmu_notifier should have a flag > so that the behavior doesn't assume a KVM backend. Hence my has_fast_aging > suggestion. 
Thanks! That makes sense. > > > Anyways, back to this code, before we spin another version, we need to agree on > > > exactly what behavior we want out of secondary MMUs. Because to me, the behavior > > > proposed in this version doesn't make any sense. > > > > > > Signalling failure because KVM _might_ have relevant aging information in SPTEs > > > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young > > > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then > > > KVM should return "young", not "failed". > > > > Sorry for this oversight. What about something like: > > > > 1. test (and maybe clear) A bits on TDP MMU > > 2. If accessed && !should_clear: return (fast) > > 3. if (fast_only): return (fast) > > 4. If !(must check shadow MMU): return (fast) > > 5. test (and maybe clear) A bits in shadow MMU > > 6. return (slow) > > I don't understand where the "must check shadow MMU" in #4 comes from. I also > don't think it's necessary; see below. I just meant `kvm_has_shadow_mmu_sptes()` or `kvm_memslots_have_rmaps()`. I like the logic you suggest below. :) > > Some of this reordering (and maybe a change from > > kvm_shadow_root_allocated() to checking indirect_shadow_pages or > > something else) can be done in its own patch. So just to be clear, for test_young(), I intend to have a patch in v6 to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems like a pure win; no reason not to include it if we're making logic changes here anyway. > > > > > So rather than failing the fast aging, I think what we want is to know if an > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps() > > > is an optional optimization to avoid taking mmu_lock for write in paths where a > > > (very rare) false negative is acceptable. > > > > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm) > > > { > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages); > > > } > > > > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range, > > > bool fast_only) > > > { > > > int young = 0; > > > > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) { > > > write_lock(&kvm->mmu_lock); > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > > write_unlock(&kvm->mmu_lock); > > > } > > > > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range)) > > > young = 1 | MMU_NOTIFY_WAS_FAST; The most straightforward way (IMHO) to return something like `1 | MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make gfn_handler_t return int instead of bool. In this v5, I worked around this need by using `bool *failed` in patch 5[1]. I think the way this is going to look now in v6 would be cleaner by actually changing gfn_handler_t to return int, and then we can write something like what you wrote here. What do you think? [1]: https://lore.kernel.org/linux-mm/20240611002145.2078921-6-jthoughton@google.com/ > > I don't think this line is quite right. We might set > > MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what > > you mean though, thanks. > > The name sucks, but I believe the logic is correct. As posted here in v5, the > MGRLU code wants to age both fast _and_ slow MMUs. AIUI, the intent is to always > get aging information, but only look around at other PTEs if it can be done fast. 
> > if (should_walk_secondary_mmu()) { > notifier_result = > mmu_notifier_test_clear_young_fast_only( > vma->vm_mm, addr, addr + PAGE_SIZE, > /*clear=*/true); > } > > if (notifier_result & MMU_NOTIFIER_FAST_FAILED) > secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr, > addr + PAGE_SIZE); > else { > secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG; > notifier_was_fast = true; > } > > The change, relative to v5, that I am proposing is that MGLRU looks around if > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and > only if _all_ secondary MMUs are fast. > > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the > fast_only flag. Oh, yeah, that's a lot more intelligent than what I had. I think I fully understand your suggestion; I guess we'll see in v6. :) I wonder if this still makes sense if whether or not an MMU is "fast" is determined by how contended some lock(s) are at the time. I think it does, but I guess we can discuss more if it turns out that having an architecture participate like this is actually something we want to do (i.e., that performance results say it's a good idea). Thanks Sean!
On Wed, Jun 12, 2024 at 11:53 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Tue, Jun 11, 2024 at 12:49:49PM -0700, Sean Christopherson wrote: > > On Tue, Jun 11, 2024, Oliver Upton wrote: > > > On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote: > > > > I think consolidating the callbacks is cleanest, like you had it in > > > > v2. I really wasn't sure about this change honestly, but it was my > > > > attempt to incorporate feedback like this[3] from v4. I'll consolidate > > > > the callbacks like you had in v2. > > > > > > My strong preference is to have the callers expectations of the > > > secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this > > > abundantly clear both at the callsite and in the implementation. > > > > Partially agreed. We don't need a dedicated mmu_notifier API to achieve that > > for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a > > new ptep_clear_young_notify_fast_only() does the obvious. > > > > On the back end, odds are very good KVM is going to squish the "fast" and "slow" > > paths back into a common helper, so IMO having dedicated fast_only() APIs for the > > mmu_notifier hooks doesn't add much value in the end. > > > > I'm not opposed to dedicated hooks, but I after poking around a bit, I suspect > > that passing a fast_only flag will end up being less cleaner for all parties. > > Yeah, I think I'm headed in the same direction after actually reading > the MM side of this, heh. Yeah, I agree. What I have now for v6 is that the test_young() and clear_young() notifiers themselves now take a `bool fast_only`. When called with the existing helpers (e.g. `mmu_notifier_test_young()`, `fast_only` is false, and I have new helpers (e.g. `mmu_notifier_test_young_fast_only()`) that will set `fast_only` to true. Seems clean to me. Thanks!
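The new helpers end up being thin wrappers over the existing ones, i.e.
something like this (a sketch of the shape, not the actual v6 code):

static inline int mmu_notifier_test_young(struct mm_struct *mm,
					  unsigned long address)
{
	if (mm_has_notifiers(mm))
		return __mmu_notifier_test_young(mm, address, /*fast_only=*/false);
	return 0;
}

static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
						    unsigned long address)
{
	if (mm_has_notifiers(mm))
		return __mmu_notifier_test_young(mm, address, /*fast_only=*/true);
	return 0;
}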
On Thu, Jun 13, 2024, James Houghton wrote: > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <seanjc@google.com> wrote: > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y, > > i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's > > probably a pretty unlikely combination, so it's probably not a valid argument. > > > > So, I guess I don't have a strong opinion? > > (Sorry for the somewhat delayed response... spent some time actually > writing what this would look like.) > > I see what you mean, thanks! So has_fast_aging might be set by KVM if > the architecture sets a Kconfig saying that it understands the concept > of fast aging, basically what the presence of this v5's > test_clear_young_fast_only() indicates. It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false doesn't support fast aging (uses the shadow MMU even for TDP). > > I don't understand where the "must check shadow MMU" in #4 comes from. I also > > don't think it's necessary; see below. > > I just meant `kvm_has_shadow_mmu_sptes()` or > `kvm_memslots_have_rmaps()`. I like the logic you suggest below. :) > > > > Some of this reordering (and maybe a change from > > > kvm_shadow_root_allocated() to checking indirect_shadow_pages or > > > something else) can be done in its own patch. > > So just to be clear, for test_young(), I intend to have a patch in v6 > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems > like a pure win; no reason not to include it if we're making logic > changes here anyway. I don't think that's correct. The initial fast_only=false aging should process shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would get a false positive on young due to failing to clear the Accessed bit in the shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never accessed again, the Accessed bit would still be set in the page tables for L2. My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to communicate to MGLRU that the page was found to be young in an MMU that supports fast aging, i.e. that looking around at other SPTEs is worth doing. > > > > So rather than failing the fast aging, I think what we want is to know if an > > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this > > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps() > > > > is an optional optimization to avoid taking mmu_lock for write in paths where a > > > > (very rare) false negative is acceptable. > > > > > > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm) > > > > { > > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages); > > > > } > > > > > > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range, > > > > bool fast_only) > > > > { > > > > int young = 0; > > > > > > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) { > > > > write_lock(&kvm->mmu_lock); > > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > > > write_unlock(&kvm->mmu_lock); > > > > } > > > > > > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range)) > > > > young = 1 | MMU_NOTIFY_WAS_FAST; > > The most straightforward way (IMHO) to return something like `1 | > MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make > gfn_handler_t return int instead of bool. Hrm, all the options are unpleasant. 
Modifying gfn_handler_t to return an int will require an absurd amount of churn
(all implementations in all architectures), and I don't love that the APIs that
return true/false to indicate "flush" would lose their boolean-ness.

One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then
refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have
it morph the KVM-internal flag into an MMU_NOTIFIER flag. It's not perfect
either, but it requires far less churn and keeps some of the KVM<=>mmu_notifier
details in common KVM code.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b9d2633a931..c11a359b6ff5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	unsigned long attributes;
+	bool aging_was_fast;
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 436ca41f61e5..a936f6bedd97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							  unsigned long start,
-							  unsigned long end,
-							  gfn_handler_t handler)
+static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
+					     unsigned long start,
+					     unsigned long end,
+					     bool flush_if_young)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_mmu_notifier_range range = {
@@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.end		= end,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= false,
+		.flush_on_ret	= flush_if_young,
 		.may_block	= false,
+		.aging_was_fast	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range).ret;
+	bool young = __kvm_handle_hva_range(kvm, &range).ret;
+
+	return (int)young | (range.aging_was_fast ? MMU_NOTIFIER_FAST_AGING : 0);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
+	return kvm_age_hva_range(mn, start, end, true);
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	/*
-	 * Even though we do not flush TLB, this will still adversely
-	 * affect performance on pre-Haswell Intel EPT, where there is
-	 * no EPT Access Bit to clear so that we have to tear down EPT
-	 * tables instead. If we find this unacceptable, we can always
-	 * add a parameter to kvm_age_hva so that it effectively doesn't
-	 * do anything on clear_young.
-	 *
-	 * Also note that currently we never issue secondary TLB flushes
-	 * from clear_young, leaving this job up to the regular system
-	 * cadence. If we find this inaccurate, we might come up with a
-	 * more sophisticated heuristic later.
-	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+	return kvm_age_hva_range(mn, start, end, false);
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 {
 	trace_kvm_test_age_hva(address);
 
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
+	return kvm_age_hva_range(mn, address, address + 1, false);
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,

> > The change, relative to v5, that I am proposing is that MGLRU looks around if
> > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> > only if _all_ secondary MMUs are fast.
> >
> > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> > fast_only flag.
>
> Oh, yeah, that's a lot more intelligent than what I had. I think I
> fully understand your suggestion; I guess we'll see in v6. :)
>
> I wonder if this still makes sense if whether or not an MMU is "fast"
> is determined by how contended some lock(s) are at the time.

No. Just because a lock wasn't contended on the initial aging doesn't mean it
won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
takes mmu_lock for write for all operations, an aging operation could get lucky
and sneak in while mmu_lock happened to be free, but then get stuck behind a large
queue of operations.

The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
an MMU that takes mmu_lock for read in all but the most rare paths.
On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Jun 13, 2024, James Houghton wrote: > > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <seanjc@google.com> wrote: > > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y, > > > i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's > > > probably a pretty unlikely combination, so it's probably not a valid argument. > > > > > > So, I guess I don't have a strong opinion? > > > > (Sorry for the somewhat delayed response... spent some time actually > > writing what this would look like.) > > > > I see what you mean, thanks! So has_fast_aging might be set by KVM if > > the architecture sets a Kconfig saying that it understands the concept > > of fast aging, basically what the presence of this v5's > > test_clear_young_fast_only() indicates. > > It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false > doesn't support fast aging (uses the shadow MMU even for TDP). I see. I'm not sure if it makes sense to put this in `ops` as you originally had it then (it seems like a bit of a pain anyway). I could just make it a member of `struct mmu_notifier` itself. > > So just to be clear, for test_young(), I intend to have a patch in v6 > > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems > > like a pure win; no reason not to include it if we're making logic > > changes here anyway. > > I don't think that's correct. The initial fast_only=false aging should process > shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would > get a false positive on young due to failing to clear the Accessed bit in the > shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never > accessed again, the Accessed bit would still be set in the page tables for L2. For clear_young(fast_only=false), yeah we need to check and clear Accessed for both MMUs. But for test_young(fast_only=false), I don't see why we couldn't just return early if the TDP MMU reports young. > My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to > communicate to MGLRU that the page was found to be young in an MMU that supports > fast aging, i.e. that looking around at other SPTEs is worth doing. That makes sense; I don't think this little test_young() optimization affects that. > > > > > So rather than failing the fast aging, I think what we want is to know if an > > > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this > > > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps() > > > > > is an optional optimization to avoid taking mmu_lock for write in paths where a > > > > > (very rare) false negative is acceptable. 
> > > > > > > > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm) > > > > > { > > > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages); > > > > > } > > > > > > > > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range, > > > > > bool fast_only) > > > > > { > > > > > int young = 0; > > > > > > > > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) { > > > > > write_lock(&kvm->mmu_lock); > > > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > > > > write_unlock(&kvm->mmu_lock); > > > > > } > > > > > > > > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range)) > > > > > young = 1 | MMU_NOTIFY_WAS_FAST; > > > > The most straightforward way (IMHO) to return something like `1 | > > MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make > > gfn_handler_t return int instead of bool. > > Hrm, all the options are unpleasant. Modifying gfn_handler_t to return an int > will require an absurd amount of churn (all implementations in all archictures), > and I don't love that the APIs that return true/false to indicate "flush" would > lose their boolean-ness. > > One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then > refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have > it morph the KVM-internal flag into an MMU_NOTIFIER flag. It's not perect either, > but it requires far less churn and keeps some of the KVM<=>mmu_notifer details in > common KVM code. SGTM. I think this will work. Thanks! > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 7b9d2633a931..c11a359b6ff5 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > union kvm_mmu_notifier_arg { > unsigned long attributes; > + bool aging_was_fast; > }; > > struct kvm_gfn_range { > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 436ca41f61e5..a936f6bedd97 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, > return __kvm_handle_hva_range(kvm, &range).ret; > } > > -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn, > - unsigned long start, > - unsigned long end, > - gfn_handler_t handler) > +static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn, > + unsigned long start, > + unsigned long end, > + bool flush_if_young) > { > struct kvm *kvm = mmu_notifier_to_kvm(mn); > const struct kvm_mmu_notifier_range range = { > @@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn > .end = end, > .handler = handler, > .on_lock = (void *)kvm_null_fn, > - .flush_on_ret = false, > + .flush_on_ret = flush_if_young, > .may_block = false, > + .aging_was_fast = false, > }; > > - return __kvm_handle_hva_range(kvm, &range).ret; > + bool young = __kvm_handle_hva_range(kvm, &range).ret; > + > + return (int)young | (range.aging_was_fast ? 
MMU_NOTIFIER_FAST_AGING : 0); > } > > void kvm_mmu_invalidate_begin(struct kvm *kvm) > @@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, > { > trace_kvm_age_hva(start, end); > > - return kvm_handle_hva_range(mn, start, end, kvm_age_gfn); > + return kvm_age_hva_range(mn, start, end, true); > } > > static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn, > @@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn, > { > trace_kvm_age_hva(start, end); > > - /* > - * Even though we do not flush TLB, this will still adversely > - * affect performance on pre-Haswell Intel EPT, where there is > - * no EPT Access Bit to clear so that we have to tear down EPT > - * tables instead. If we find this unacceptable, we can always > - * add a parameter to kvm_age_hva so that it effectively doesn't > - * do anything on clear_young. > - * > - * Also note that currently we never issue secondary TLB flushes > - * from clear_young, leaving this job up to the regular system > - * cadence. If we find this inaccurate, we might come up with a > - * more sophisticated heuristic later. > - */ > - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn); > + return kvm_age_hva_range(mn, start, end, false); > } > > static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn, > @@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn, > { > trace_kvm_test_age_hva(address); > > - return kvm_handle_hva_range_no_flush(mn, address, address + 1, > - kvm_test_age_gfn); > + return kvm_age_hva_range(mn, address, address + 1, false); > } > > static void kvm_mmu_notifier_release(struct mmu_notifier *mn, > > > > > The change, relative to v5, that I am proposing is that MGLRU looks around if > > > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and > > > only if _all_ secondary MMUs are fast. > > > > > > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the > > > fast_only flag. > > > > Oh, yeah, that's a lot more intelligent than what I had. I think I > > fully understand your suggestion; I guess we'll see in v6. :) > > > > I wonder if this still makes sense if whether or not an MMU is "fast" > > is determined by how contended some lock(s) are at the time. > > No. Just because a lock wasn't contended on the initial aging doesn't mean it > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which > takes mmu_lock for write for all operations, an aging operation could get lucky > and sneak in while mmu_lock happened to be free, but then get stuck behind a large > queue of operations. > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in > an MMU that takes mmu_lock for read in all but the most rare paths. Aging and look-around themselves only use the fast-only notifiers, so they won't ever wait on a lock (well... provided KVM is written like that, which I think is a given). should_look_around() will use the slow notifier because it (despite its name) is responsible for accurately determining if a page is young lest we evict a young page. So in this case where "fast" means "lock not contended for now", I don't think it's necessarily wrong for MGLRU to attempt to find young pages, even if sometimes it will bail out because a lock is contended/held for a few or even a majority of the pages. Not doing look-around is the same as doing look-around and finding that no pages are young. 
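Roughly, in pseudo-code (reusing the MMU_NOTIFY_WAS_FAST placeholder name from
earlier in the thread; this is just to illustrate the split, not the actual
patch):

	/*
	 * should_look_around(): accuracy matters here, so use the normal
	 * (possibly slow) notifier -- we must not evict a page that is
	 * young only in a secondary MMU.
	 */
	*young = ptep_clear_young_notify(vma, addr, pte);

	/*
	 * Look-around (which uses the fast-only notifier) is only worth
	 * doing if a fast MMU actually reported the page young.
	 */
	return *young & MMU_NOTIFY_WAS_FAST;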
Anyway, I don't think this bit is really all that important unless we can demonstrate that KVM participating like this actually results in a measurable win.
On Fri, Jun 14, 2024, James Houghton wrote: > On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <seanjc@google.com> wrote: > > > > On Thu, Jun 13, 2024, James Houghton wrote: > > > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <seanjc@google.com> wrote: > > > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y, > > > > i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's > > > > probably a pretty unlikely combination, so it's probably not a valid argument. > > > > > > > > So, I guess I don't have a strong opinion? > > > > > > (Sorry for the somewhat delayed response... spent some time actually > > > writing what this would look like.) > > > > > > I see what you mean, thanks! So has_fast_aging might be set by KVM if > > > the architecture sets a Kconfig saying that it understands the concept > > > of fast aging, basically what the presence of this v5's > > > test_clear_young_fast_only() indicates. > > > > It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false > > doesn't support fast aging (uses the shadow MMU even for TDP). > > I see. I'm not sure if it makes sense to put this in `ops` as you > originally had it then (it seems like a bit of a pain anyway). I could > just make it a member of `struct mmu_notifier` itself. Ah, right, because the ops are const. Yeah, losing the const just to set a flag would be unfortunate. > > > So just to be clear, for test_young(), I intend to have a patch in v6 > > > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems > > > like a pure win; no reason not to include it if we're making logic > > > changes here anyway. > > > > I don't think that's correct. The initial fast_only=false aging should process > > shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would > > get a false positive on young due to failing to clear the Accessed bit in the > > shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never > > accessed again, the Accessed bit would still be set in the page tables for L2. > > For clear_young(fast_only=false), yeah we need to check and clear > Accessed for both MMUs. But for test_young(fast_only=false), I don't > see why we couldn't just return early if the TDP MMU reports young. Ooh, good point. > > > Oh, yeah, that's a lot more intelligent than what I had. I think I > > > fully understand your suggestion; I guess we'll see in v6. :) > > > > > > I wonder if this still makes sense if whether or not an MMU is "fast" > > > is determined by how contended some lock(s) are at the time. > > > > No. Just because a lock wasn't contended on the initial aging doesn't mean it > > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which > > takes mmu_lock for write for all operations, an aging operation could get lucky > > and sneak in while mmu_lock happened to be free, but then get stuck behind a large > > queue of operations. > > > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in > > an MMU that takes mmu_lock for read in all but the most rare paths. > > Aging and look-around themselves only use the fast-only notifiers, so > they won't ever wait on a lock (well... provided KVM is written like > that, which I think is a given). Regarding aging, is that actually the behavior that we want? I thought the plan is to have the initial test look at all MMUs, i.e. be potentially slow, but only do the lookaround if it can be fast. 
IIUC, that was Yu's intent (and peeking back at v2, that is indeed the case, unless I'm misreading the code). If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will end up with hot pages (used by L2) swapped out. Or are you saying that the "test" could be slow, but not "clear"? That's also suboptimal, because any pages accessed by L2 will always appear hot. > should_look_around() will use the slow notifier because it (despite its name) > is responsible for accurately determining if a page is young lest we evict a > young page. > > So in this case where "fast" means "lock not contended for now", No. In KVM, "fast" needs to be a property of the MMU, not a reflection of system state at some random snapshot in time. > I don't think it's necessarily wrong for MGLRU to attempt to find young > pages, even if sometimes it will bail out because a lock is contended/held lru_gen_look_around() skips lookaround if something else is waiting on the page table lock, but that is a far cry from what KVM would be doing. (a) the PTL is already held, and (b) it is scoped precisely to the range being processed. Not looking around makes sense because there's a high probability of the PTEs in question being modified by a different task, i.e. of the look around being a waste of time. In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid waiting, and would need to bail from the middle of the aging walk if a different task contends mmu_lock. I agree that's not "wrong", but in part because mmu_lock is scoped to the entire VM, it risks ending up with semi-random, hard to debug behavior. E.g. a user could see intermittent "failures" that come and go based on seemingly unrelated behavior in KVM. And implementing the "bail" behavior in the shadow MMU would require non-trivial changes. In other words, I would very strongly prefer that the shadow MMU be all or nothing, i.e. is either part of look-around or isn't. And if nested TDP doesn't fair well with MGLRU, then we (or whoever cares) can spend the time+effort to make it work with fast-aging. Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow MMU would be, I'm pretty sure we can do straight there for nested TDP. Or rather, I suspect/hope we can get close enough for an initial merge, which would allow aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification. Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done in such a way that a lockless walk would be painfully complex. But if there is exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g. mapping the same page into multiple L2s, that overwhelming vast majority of rmaps have only one entry. That's not the case for legacy shadow paging because kernels almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map along with any userspace mappings. E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory during boot). So, if we bifurcate aging behavior based on whether or not the TDP MMU is enabled, then whether or not aging is fast is constant (after KVM loads). 
Rougly, the KVM side of things would be the below, plus a bunch of conversions to WRITE_ONCE() to ensure a stable rmap value (KVM already plays nice with lockless accesses to SPTEs, thanks to the fast page fault path). If KVM adds an rmap entry after the READ_ONCE(), then functionally all is still well because the original SPTE pointer is still valid. If the rmap entry is removed, then KVM just needs to ensure the owning page table isn't freed. That could be done either with a cmpxchg() (KVM zaps leafs SPTE before freeing page tables, and the rmap stuff doesn't actually walk upper level entries), or by enhancing the shadow MMU's lockless walk logic to allow lockless walks from non-vCPU tasks. And in the (hopefully) unlikely scenario someone has a use case where L1 maps a gfn into multiple L2s (or aliases in bizarre ways), then we can tackle making the nested TDP shadow MMU rmap walks always lockless. E.g. again very roughly, if we went with the latter: @@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access); } +static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm, + struct kvm_gfn_range *range, + typedefme handler) +{ + gfn_t gfn; + int level; + u64 *spte; + bool ret; + + walk_shadow_page_lockless_begin(???); + + for (gfn = range->start; gfn < range->end; gfn++) { + for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) { + spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val); + + /* Skip the gfn if there are multiple SPTEs. */ + if ((unsigned long)spte & 1) + continue; + + ret |= handler(spte); + } + } + + walk_shadow_page_lockless_end(???); +} + static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range, bool fast_only) { bool young = false; - if (kvm_memslots_have_rmaps(kvm)) { - if (fast_only) - return -1; - - write_lock(&kvm->mmu_lock); - young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); - write_unlock(&kvm->mmu_lock); - } - - if (tdp_mmu_enabled) + if (tdp_mmu_enabled) { young |= kvm_tdp_mmu_age_gfn_range(kvm, range); + young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast); + } else if (!fast_only) { + write_lock(&kvm->mmu_lock); + young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); + write_unlock(&kvm->mmu_lock); + } return (int)young; } > for a few or even a majority of the pages. Not doing look-around is the same > as doing look-around and finding that no pages are young. No, because the former is deterministic and predictable, the latter is not. > Anyway, I don't think this bit is really all that important unless we > can demonstrate that KVM participating like this actually results in a > measurable win. Participating like what? You've lost me a bit. Are we talking past each other? What I am saying is that we do this (note that this is slightly different than an earlier sketch; I botched the ordering of spin_is_contend() in that one, and didn't account for the page being young in the primary MMU). if (pte_young(ptep_get(pte))) young = 1 | MMU_NOTIFY_WAS_FAST; young |= ptep_clear_young_notify(vma, addr, pte); if (!young) return false; if (!(young & MMU_NOTIFY_WAS_FAST)) return true; if (spin_is_contended(pvmw->ptl)) return false; /* exclude special VMAs containing anon pages from COW */ if (vma->vm_flags & VM_SPECIAL) return false;
On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Jun 14, 2024, James Houghton wrote: > > On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Thu, Jun 13, 2024, James Houghton wrote: > > > > I wonder if this still makes sense if whether or not an MMU is "fast" > > > > is determined by how contended some lock(s) are at the time. > > > > > > No. Just because a lock wasn't contended on the initial aging doesn't mean it > > > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which > > > takes mmu_lock for write for all operations, an aging operation could get lucky > > > and sneak in while mmu_lock happened to be free, but then get stuck behind a large > > > queue of operations. > > > > > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in > > > an MMU that takes mmu_lock for read in all but the most rare paths. > > > > Aging and look-around themselves only use the fast-only notifiers, so > > they won't ever wait on a lock (well... provided KVM is written like > > that, which I think is a given). > > Regarding aging, is that actually the behavior that we want? I thought the plan > is to have the initial test look at all MMUs, i.e. be potentially slow, but only > do the lookaround if it can be fast. IIUC, that was Yu's intent (and peeking back > at v2, that is indeed the case, unless I'm misreading the code). I believe what I said is correct. There are three separate things going on here: 1. Aging (when we hit the low watermark, scan PTEs to find young pages) 2. Eviction (pick a page to evict; if it is definitely not young, evict it) 3. Look-around (upon finding a page is young upon attempted eviction, check adjacent pages if they are young too) (1) and (3) both use the fast-only notifier, (2) does not. In v2[1], this is true, as in the (1) and (3) paths, the notifier being called is the test_clear_young() notifier with the bitmap present. If the bitmap is present, the shadow MMU is not consulted. For (2), there is an mmu_notifier_clear_young() with no bitmap called in should_look_around(), just like in this v5. (2) is the only one that needs to use the slow notifier; the (1) and (3) are best-effort attempts to improve the decision for which pages to evict. [1]: https://lore.kernel.org/linux-mm/20230526234435.662652-11-yuzhao@google.com/ > > If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will > end up with hot pages (used by L2) swapped out. The shadow MMU is consulted at eviction time -- only at eviction time. So pages used by L2 won't be swapped out unless they're still cold at eviction time. In my (and Yu's) head, not being able to do aging for nested TDP is ok because running nested VMs is much more rare than running non-nested VMs. And in the non-nested case, being able to do aging is a strict improvement over what we have now. We could look into being able to do aging with the shadow MMU, but I don't think that should necessarily block this series. > Or are you saying that the "test" could be slow, but not "clear"? That's also > suboptimal, because any pages accessed by L2 will always appear hot. No. I hope what I said above makes sense. > > should_look_around() will use the slow notifier because it (despite its name) > > is responsible for accurately determining if a page is young lest we evict a > > young page. > > > > So in this case where "fast" means "lock not contended for now", > > No. 
In KVM, "fast" needs to be a property of the MMU, not a reflection of system > state at some random snapshot in time. Ok, that's fine with me. > > I don't think it's necessarily wrong for MGLRU to attempt to find young > > pages, even if sometimes it will bail out because a lock is contended/held > > lru_gen_look_around() skips lookaround if something else is waiting on the page > table lock, but that is a far cry from what KVM would be doing. (a) the PTL is > already held, and (b) it is scoped precisely to the range being processed. Not > looking around makes sense because there's a high probability of the PTEs in > question being modified by a different task, i.e. of the look around being a > waste of time. > > In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid > waiting, and would need to bail from the middle of the aging walk if a different > task contends mmu_lock. > > I agree that's not "wrong", but in part because mmu_lock is scoped to the entire > VM, it risks ending up with semi-random, hard to debug behavior. E.g. a user > could see intermittent "failures" that come and go based on seemingly unrelated > behavior in KVM. And implementing the "bail" behavior in the shadow MMU would > require non-trivial changes. This is a great point, thanks Sean. I won't send any patches that make other architectures trylock() for the fast notifier. > > In other words, I would very strongly prefer that the shadow MMU be all or nothing, > i.e. is either part of look-around or isn't. And if nested TDP doesn't fair well > with MGLRU, then we (or whoever cares) can spend the time+effort to make it work > with fast-aging. > > Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow > MMU would be, I'm pretty sure we can do straight there for nested TDP. Or rather, > I suspect/hope we can get close enough for an initial merge, which would allow > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification. > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done > in such a way that a lockless walk would be painfully complex. But if there is > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at > that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g. > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps > have only one entry. That's not the case for legacy shadow paging because kernels > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map > along with any userspace mappings. > > E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has > multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory during > boot). > > So, if we bifurcate aging behavior based on whether or not the TDP MMU is enabled, > then whether or not aging is fast is constant (after KVM loads). Rougly, the KVM > side of things would be the below, plus a bunch of conversions to WRITE_ONCE() to > ensure a stable rmap value (KVM already plays nice with lockless accesses to SPTEs, > thanks to the fast page fault path). > > If KVM adds an rmap entry after the READ_ONCE(), then functionally all is still > well because the original SPTE pointer is still valid. If the rmap entry is > removed, then KVM just needs to ensure the owning page table isn't freed. 
That > could be done either with a cmpxchg() (KVM zaps leafs SPTE before freeing page > tables, and the rmap stuff doesn't actually walk upper level entries), or by > enhancing the shadow MMU's lockless walk logic to allow lockless walks from > non-vCPU tasks. > > And in the (hopefully) unlikely scenario someone has a use case where L1 maps a > gfn into multiple L2s (or aliases in bizarre ways), then we can tackle making the > nested TDP shadow MMU rmap walks always lockless. > > E.g. again very roughly, if we went with the latter: > > @@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, > __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access); > } > > +static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm, > + struct kvm_gfn_range *range, > + typedefme handler) > +{ > + gfn_t gfn; > + int level; > + u64 *spte; > + bool ret; > + > + walk_shadow_page_lockless_begin(???); > + > + for (gfn = range->start; gfn < range->end; gfn++) { > + for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) { > + spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val); > + > + /* Skip the gfn if there are multiple SPTEs. */ > + if ((unsigned long)spte & 1) > + continue; > + > + ret |= handler(spte); > + } > + } > + > + walk_shadow_page_lockless_end(???); > +} > + > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range, > bool fast_only) > { > bool young = false; > > - if (kvm_memslots_have_rmaps(kvm)) { > - if (fast_only) > - return -1; > - > - write_lock(&kvm->mmu_lock); > - young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > - write_unlock(&kvm->mmu_lock); > - } > - > - if (tdp_mmu_enabled) > + if (tdp_mmu_enabled) { > young |= kvm_tdp_mmu_age_gfn_range(kvm, range); > + young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast); > + } else if (!fast_only) { > + write_lock(&kvm->mmu_lock); > + young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > + write_unlock(&kvm->mmu_lock); > + } > > return (int)young; > } Hmm, interesting. I need to spend a little bit more time digesting this. Would you like to see this included in v6? (It'd be nice to avoid the WAS_FAST stuff....) Should we leave it for a later series? I haven't formed my own opinion yet. > > for a few or even a majority of the pages. Not doing look-around is the same > > as doing look-around and finding that no pages are young. > > No, because the former is deterministic and predictable, the latter is not. Fair enough. I just meant in terms of the end result. > > Anyway, I don't think this bit is really all that important unless we > > can demonstrate that KVM participating like this actually results in a > > measurable win. > > Participating like what? You've lost me a bit. Are we talking past each other? Participating in aging and look-around by trylock()ing instead of being lockless. All I meant was: it's not important to try to figure out how to get this non-deterministic trylock()-based "fast" notifier implementation to work, because I'm not actually suggesting we actually do this. There was some suggestion previously to make arm64 work like this (in lieu of a lockless implementation), but I'd need to actually verify that this performs/behaves well before actually sending a patch. > What I am saying is that we do this (note that this is slightly different than > an earlier sketch; I botched the ordering of spin_is_contend() in that one, and > didn't account for the page being young in the primary MMU). 
> > if (pte_young(ptep_get(pte))) > young = 1 | MMU_NOTIFY_WAS_FAST; > > young |= ptep_clear_young_notify(vma, addr, pte); > if (!young) > return false; > > if (!(young & MMU_NOTIFY_WAS_FAST)) > return true; > > if (spin_is_contended(pvmw->ptl)) > return false; > > /* exclude special VMAs containing anon pages from COW */ > if (vma->vm_flags & VM_SPECIAL) > return false; SGTM. I think we're on the same page here.
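For reference, a minimal sketch of how the aging / eviction / look-around split discussed above might map onto the notifier API in this series. The wrapper names here are made up for illustration only; mmu_notifier_test_clear_young_fast_only() and the MMU_NOTIFIER_FAST_* bits are from the proposed interface, and mmu_notifier_clear_young() is the existing (slow) notifier.

/*
 * Illustrative only: how steps (1)-(3) above might call into the secondary
 * MMUs. The wrapper names are hypothetical; the notifier call and its return
 * bits come from the interface proposed in this series.
 */

/* (1) Aging and (3) look-around: best effort, must never wait on mmu_lock. */
static bool secondary_mmus_young_fast_only(struct mm_struct *mm,
                                           unsigned long addr, bool clear)
{
        int ret = mmu_notifier_test_clear_young_fast_only(mm, addr,
                                                          addr + PAGE_SIZE,
                                                          clear);

        /* A notifier that could not be fast simply reports "not young". */
        return ret & MMU_NOTIFIER_FAST_YOUNG;
}

/* (2) Eviction-time check: must be accurate, so taking the slow path is fine. */
static bool secondary_mmus_clear_young(struct mm_struct *mm, unsigned long addr)
{
        return mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
}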
On Mon, Jun 17, 2024, James Houghton wrote: > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Fri, Jun 14, 2024, James Houghton wrote: > > > On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Thu, Jun 13, 2024, James Houghton wrote: > > > > > I wonder if this still makes sense if whether or not an MMU is "fast" > > > > > is determined by how contended some lock(s) are at the time. > > > > > > > > No. Just because a lock wasn't contended on the initial aging doesn't mean it > > > > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which > > > > takes mmu_lock for write for all operations, an aging operation could get lucky > > > > and sneak in while mmu_lock happened to be free, but then get stuck behind a large > > > > queue of operations. > > > > > > > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in > > > > an MMU that takes mmu_lock for read in all but the most rare paths. > > > > > > Aging and look-around themselves only use the fast-only notifiers, so > > > they won't ever wait on a lock (well... provided KVM is written like > > > that, which I think is a given). > > > > Regarding aging, is that actually the behavior that we want? I thought the plan > > is to have the initial test look at all MMUs, i.e. be potentially slow, but only > > do the lookaround if it can be fast. IIUC, that was Yu's intent (and peeking back > > at v2, that is indeed the case, unless I'm misreading the code). > > I believe what I said is correct. There are three separate things going on here: > > 1. Aging (when we hit the low watermark, scan PTEs to find young pages) > 2. Eviction (pick a page to evict; if it is definitely not young, evict it) > 3. Look-around (upon finding a page is young upon attempted eviction, > check adjacent pages if they are young too) Ah, I now see the difference between #1 and #2, and your responses make a lot more sense. Thanks! > > If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will > > end up with hot pages (used by L2) swapped out. > > The shadow MMU is consulted at eviction time -- only at eviction time. > So pages used by L2 won't be swapped out unless they're still cold at > eviction time. > > In my (and Yu's) head, not being able to do aging for nested TDP is ok > because running nested VMs is much more rare than running non-nested > VMs. And in the non-nested case, being able to do aging is a strict > improvement over what we have now. Yes and no. Running nested VMs is indeed rare when viewing them as a percentage of all VMs in the fleet, but for many use cases, the primary workload of a VM is to run nested VMs. E.g. say x% of VMs in the fleet run nested VMs, where 'x' is likely very small, but for those x% VMs, they run nested VMs 99% of the time (completely made up number). So yes, I completely agree that aging for non-nested VMs is a strict improvement, but I also don't think we should completely dismiss nested VMs as a problem not worth solving. > We could look into being able to do aging with the shadow MMU, but I > don't think that should necessarily block this series. ... > > Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow > > MMU would be, I'm pretty sure we can do straight there for nested TDP. Or rather, > > I suspect/hope we can get close enough for an initial merge, which would allow > > aging_is_fast to be a property of the mmu_notifier, i.e.
would simplify things > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification. > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done > > in such a way that a lockless walk would be painfully complex. But if there is > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at > > that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g. > > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps > > have only one entry. That's not the case for legacy shadow paging because kernels > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map > > along with any userspace mappings. ... > Hmm, interesting. I need to spend a little bit more time digesting this. > > Would you like to see this included in v6? (It'd be nice to avoid the > WAS_FAST stuff....) Should we leave it for a later series? I haven't > formed my own opinion yet. I would say it depends on the viability and complexity of my idea. E.g. if it pans out more or less like my rough sketch, then it's probably worth taking on the extra code+complexity in KVM to avoid the whole WAS_FAST goo. Note, if we do go this route, the implementation would need to be tweaked to handle the difference in behavior between aging and last-minute checks for eviction, which I obviously didn't understand when I threw together that hack-a-patch. I need to think more about how best to handle that though, e.g. skipping GFNs with multiple mappings is probably the worst possible behavior, as we'd risk evicting hot pages. But falling back to taking mmu_lock for write isn't all that desirable either.
On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Jun 17, 2024, James Houghton wrote: > > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@google.com> wrote: > > > Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow > > > MMU would be, I'm pretty sure we can do straight there for nested TDP. Or rather, > > > I suspect/hope we can get close enough for an initial merge, which would allow > > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things > > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification. > > > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done > > > in such a way that a lockless walk would be painfully complex. But if there is > > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at > > > that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g. > > > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps > > > have only one entry. That's not the case for legacy shadow paging because kernels > > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map > > > along with any userspace mappings. Hi Sean, sorry for taking so long to get back to you. So just to make sure I have this right: if L1 is using TDP, the gfns in L0 will usually only be mapped by a single spte. If L1 is not using TDP, then all bets are off. Is that true? If that is true, given that we don't really have control over whether or not L1 decides to use TDP, the lockless shadow MMU walk will work, but, if L1 is not using TDP, it will often return false negatives (says "old" for an actually-young gfn). So then I don't really understand conditioning the lockless shadow MMU walk on us (L0) using the TDP MMU[1]. We care about L1, right? (Maybe you're saying that, when the TDP MMU is enabled, the only cases where the shadow MMU is used are cases where gfns are practically always mapped by a single shadow PTE. This isn't how I understood your mail, but this is what your hack-a-patch[1] makes me think.) [1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@google.com/ > > ... > > > Hmm, interesting. I need to spend a little bit more time digesting this. > > > > Would you like to see this included in v6? (It'd be nice to avoid the > > WAS_FAST stuff....) Should we leave it for a later series? I haven't > > formed my own opinion yet. > > I would say it depends on the viability and complexity of my idea. E.g. if it > pans out more or less like my rough sketch, then it's probably worth taking on > the extra code+complexity in KVM to avoid the whole WAS_FAST goo. > > Note, if we do go this route, the implementation would need to be tweaked to > handle the difference in behavior between aging and last-minute checks for eviction, > which I obviously didn't understand when I threw together that hack-a-patch. > > I need to think more about how best to handle that though, e.g. skipping GFNs with > multiple mappings is probably the worst possible behavior, as we'd risk evicting > hot pages. But falling back to taking mmu_lock for write isn't all that desirable > either. I think falling back to the write lock is more desirable than evicting a young page. I've attached what I think could work, a diff on top of this series. It builds at least. 
It uses rcu_read_lock/unlock() for walk_shadow_page_lockless_begin/end(NULL), and it puts a synchronize_rcu() in kvm_mmu_commit_zap_page(). It doesn't get rid of the WAS_FAST things because it doesn't do exactly what [1] does. It basically makes three calls now: lockless TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the locked shadow MMU bits if the lockless bits say !young (instead of being conditioned on tdp_mmu_enabled). My choice is definitely questionable for the clear path. Thanks!
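For readers without the attachment, a rough sketch of the three-call shape described above, reusing the helper names from Sean's earlier sketch; kvm_handle_gfn_range_lockless() and kvm_age_rmap_fast() are hypothetical helpers, and the WAS_FAST plumbing from the series is omitted here for brevity.

/*
 * Illustrative only: lockless TDP MMU walk, then lockless shadow MMU walk,
 * then the write-locked shadow MMU walk only if nothing was found young.
 */
static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
                         bool fast_only)
{
        bool young = false;

        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

        if (kvm_memslots_have_rmaps(kvm)) {
                /* Lockless walk of single-SPTE rmaps, regardless of tdp_mmu_enabled. */
                young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);

                /*
                 * Fall back to the write-locked walk only if the lockless
                 * walks found nothing young; as noted above, this choice is
                 * questionable for the clear path.
                 */
                if (!young && !fast_only) {
                        write_lock(&kvm->mmu_lock);
                        young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
                        write_unlock(&kvm->mmu_lock);
                }
        }

        return (int)young;
}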
On Fri, Jun 28, 2024 at 7:38 PM James Houghton <jthoughton@google.com> wrote: > > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Jun 17, 2024, James Houghton wrote: > > > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@google.com> wrote: > > > > Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow > > > > MMU would be, I'm pretty sure we can do straight there for nested TDP. Or rather, > > > > I suspect/hope we can get close enough for an initial merge, which would allow > > > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things > > > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification. > > > > > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done > > > > in such a way that a lockless walk would be painfully complex. But if there is > > > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at > > > > that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g. > > > > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps > > > > have only one entry. That's not the case for legacy shadow paging because kernels > > > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map > > > > along with any userspace mappings. > > Hi Sean, sorry for taking so long to get back to you. > > So just to make sure I have this right: if L1 is using TDP, the gfns > in L0 will usually only be mapped by a single spte. If L1 is not using > TDP, then all bets are off. Is that true? > > If that is true, given that we don't really have control over whether > or not L1 decides to use TDP, the lockless shadow MMU walk will work, > but, if L1 is not using TDP, it will often return false negatives > (says "old" for an actually-young gfn). So then I don't really > understand conditioning the lockless shadow MMU walk on us (L0) using > the TDP MMU[1]. We care about L1, right? Ok I think I understand now. If L1 is using shadow paging, L2 is accessing memory the same way L1 would, so we use the TDP MMU at L0 for this case (if tdp_mmu_enabled). If L1 is using TDP, then we must use the shadow MMU, so that's the interesting case. > (Maybe you're saying that, when the TDP MMU is enabled, the only cases > where the shadow MMU is used are cases where gfns are practically > always mapped by a single shadow PTE. This isn't how I understood your > mail, but this is what your hack-a-patch[1] makes me think.) So it appears that this interpretation is actually what you meant. > > [1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@google.com/ > > > > > ... > > > > > Hmm, interesting. I need to spend a little bit more time digesting this. > > > > > > Would you like to see this included in v6? (It'd be nice to avoid the > > > WAS_FAST stuff....) Should we leave it for a later series? I haven't > > > formed my own opinion yet. > > > > I would say it depends on the viability and complexity of my idea. E.g. if it > > pans out more or less like my rough sketch, then it's probably worth taking on > > the extra code+complexity in KVM to avoid the whole WAS_FAST goo. > > > > Note, if we do go this route, the implementation would need to be tweaked to > > handle the difference in behavior between aging and last-minute checks for eviction, > > which I obviously didn't understand when I threw together that hack-a-patch. 
> > > > I need to think more about how best to handle that though, e.g. skipping GFNs with > > multiple mappings is probably the worst possible behavior, as we'd risk evicting > > hot pages. But falling back to taking mmu_lock for write isn't all that desirable > > either. > > I think falling back to the write lock is more desirable than evicting > a young page. > > I've attached what I think could work, a diff on top of this series. > It builds at least. It uses rcu_read_lock/unlock() for > walk_shadow_page_lockless_begin/end(NULL), and it puts a > synchronize_rcu() in kvm_mmu_commit_zap_page(). > > It doesn't get rid of the WAS_FAST things because it doesn't do > exactly what [1] does. It basically makes three calls now: lockless > TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the > locked shadow MMU bits if the lockless bits say !young (instead of > being conditioned on tdp_mmu_enabled). My choice is definitely > questionable for the clear path. I still don't think we should get rid of the WAS_FAST stuff. The assumption that the L1 VM will almost never share pages between L2 VMs is questionable. The real question becomes: do we care to have accurate age information for this case? I think so. It's not completely trivial to get the lockless walking of the shadow MMU rmaps correct either (please see the patch I attached here[1]). And the WAS_FAST functionality isn't even that complex to begin with. Thanks for your patience. [1]: https://lore.kernel.org/linux-mm/CADrL8HW=kCLoWBwoiSOCd8WHFvBdWaguZ2ureo4eFy9D67+owg@mail.gmail.com/
On Mon, Jul 08, 2024, James Houghton wrote: > On Fri, Jun 28, 2024 at 7:38 PM James Houghton <jthoughton@google.com> wrote: > > > > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Mon, Jun 17, 2024, James Houghton wrote: > > > > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow > > > > > MMU would be, I'm pretty sure we can do straight there for nested TDP. Or rather, > > > > > I suspect/hope we can get close enough for an initial merge, which would allow > > > > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things > > > > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification. > > > > > > > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done > > > > > in such a way that a lockless walk would be painfully complex. But if there is > > > > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at > > > > > that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g. > > > > > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps > > > > > have only one entry. That's not the case for legacy shadow paging because kernels > > > > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map > > > > > along with any userspace mappings. > > > > Hi Sean, sorry for taking so long to get back to you. > > > > So just to make sure I have this right: if L1 is using TDP, the gfns > > in L0 will usually only be mapped by a single spte. If L1 is not using > > TDP, then all bets are off. Is that true? > > > > If that is true, given that we don't really have control over whether > > or not L1 decides to use TDP, the lockless shadow MMU walk will work, > > but, if L1 is not using TDP, it will often return false negatives > > (says "old" for an actually-young gfn). So then I don't really > > understand conditioning the lockless shadow MMU walk on us (L0) using > > the TDP MMU[1]. We care about L1, right? > > Ok I think I understand now. If L1 is using shadow paging, L2 is > accessing memory the same way L1 would, so we use the TDP MMU at L0 > for this case (if tdp_mmu_enabled). If L1 is using TDP, then we must > use the shadow MMU, so that's the interesting case. Yep. > > (Maybe you're saying that, when the TDP MMU is enabled, the only cases > > where the shadow MMU is used are cases where gfns are practically > > always mapped by a single shadow PTE. This isn't how I understood your > > mail, but this is what your hack-a-patch[1] makes me think.) > > So it appears that this interpretation is actually what you meant. Yep. > > [1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@google.com/ > > > > > > > > ... > > > > > > > Hmm, interesting. I need to spend a little bit more time digesting this. > > > > > > > > Would you like to see this included in v6? (It'd be nice to avoid the > > > > WAS_FAST stuff....) Should we leave it for a later series? I haven't > > > > formed my own opinion yet. > > > > > > I would say it depends on the viability and complexity of my idea. E.g. if it > > > pans out more or less like my rough sketch, then it's probably worth taking on > > > the extra code+complexity in KVM to avoid the whole WAS_FAST goo. 
> > > > > > Note, if we do go this route, the implementation would need to be tweaked to > > > handle the difference in behavior between aging and last-minute checks for eviction, > > > which I obviously didn't understand when I threw together that hack-a-patch. > > > > > > I need to think more about how best to handle that though, e.g. skipping GFNs with > > > multiple mappings is probably the worst possible behavior, as we'd risk evicting > > > hot pages. But falling back to taking mmu_lock for write isn't all that desirable > > > either. > > > > I think falling back to the write lock is more desirable than evicting > > a young page. > > > > I've attached what I think could work, a diff on top of this series. > > It builds at least. It uses rcu_read_lock/unlock() for > > walk_shadow_page_lockless_begin/end(NULL), and it puts a > > synchronize_rcu() in kvm_mmu_commit_zap_page(). > > > > It doesn't get rid of the WAS_FAST things because it doesn't do > > exactly what [1] does. It basically makes three calls now: lockless > > TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the > > locked shadow MMU bits if the lockless bits say !young (instead of > > being conditioned on tdp_mmu_enabled). My choice is definitely > > questionable for the clear path. > > I still don't think we should get rid of the WAS_FAST stuff. I do :-) > The assumption that the L1 VM will almost never share pages between L2 > VMs is questionable. The real question becomes: do we care to have > accurate age information for this case? I think so. I think you're conflating two different things. WAS_FAST isn't about accuracy, it's about supporting lookaround in conditionally fast secondary MMUs. Accuracy only comes into play when we're talking about the last-minute check, which, IIUC, has nothing to do with WAS_FAST because any potential lookaround has already been performed. > It's not completely trivial to get the lockless walking of the shadow > MMU rmaps correct either (please see the patch I attached here[1]). Heh, it's not correct. Invoking synchronize_rcu() in kvm_mmu_commit_zap_page() is illegal, as mmu_lock (rwlock) is held and synchronize_rcu() might_sleep(). For kvm_test_age_rmap_fast(), KVM can blindly read READ_ONCE(*sptep). KVM might read garbage, but that would be an _extremely_ rare scenario, and reporting a zapped page as being young is acceptable in that 1 in a billion situation. For kvm_age_rmap_fast(), i.e. where KVM needs to write, I'm pretty sure KVM can handle that by rechecking the rmap and using CMPXCHG to write the SPTE. If the rmap is unchanged, then the old SPTE value is guaranteed to be valid, in the sense that its value most definitely came from a KVM shadow page table. Ah, drat, that won't work, because very theoretically, the page table could be freed, reallocated, and rewritten with the exact same value by something other than KVM. Hrm. Looking more closely, I think we can go straight to supporting rmap walks outside of mmu_lock. There will still be a "lock", but it will be a *very* rudimentary lock, akin to the TDP MMU's REMOVED_SPTE approach. Bit 0 of rmap_head->val is used to indicate "many", while bits 63:3/31:2 on 64-bit/32-bit KVM hold the pointer (to a SPTE or a list). That means bit 1 is available for shenanigans. If we use bit 1 to lock the rmap, then the fast mmu_notifier can safely walk the entire rmap chain. 
And with a reader/write scheme, the rmap walks that are performed under mmu_lock don't need to lock the rmap, which means flows like kvm_mmu_zap_collapsible_spte() don't need to be modified to avoid recursive self-deadlock. Lastly, the locking can be conditioned on the rmap being valid, i.e. having at least one SPTE. That way the common case of a gfn not having any rmaps is a glorified nop. Adding the locking isn't actually all that difficult, with the *huge* caveat that the below patch is compile-tested only. The vast majority of the churn is to make it so existing code ignores the new KVM_RMAP_LOCKED bit. I don't know that we should pursue such an approach in this series unless we have to. E.g. if we can avoid WAS_FAST or don't have to carry too much intermediate complexity, then it'd probably be better to land the TDP MMU support first and then add nested TDP support later. At the very least, it does make me more confident that a fast walk of the rmaps is very doable (at least for nested TDP), i.e. makes me even more steadfast against adding WAS_FAST. > And the WAS_FAST functionality isn't even that complex to begin with. I agree the raw code isn't terribly complex, but it's not trivial either. And the concept and *behavior* is complex, which is just as much of a maintenance burden as the code itself. E.g. it requires knowing that KVM has multiple MMUs buried behind a single mmu_notifier, and that a "hit" on the fast MMU will trigger lookaround on the fast MMU, but not the slow MMU. Understanding and describing the implications of that behavior isn't easy. E.g. if GFN=X is young in the TDP MMU, but X+1..X+N are young only in the shadow MMU, is doing lookaround and making decisions based purely on the TDP MMU state the "right" behavior? I also really don't like bleeding KVM details into the mmu_nofitier APIs. The need for WAS_FAST is 100% a KVM limitation. AFAIK, no other secondary MMU has multiple MMU implementations active behind a single notifier, and other than lack of support, nothing fundamentally prevents a fast query in the shadow MMU. --- arch/x86/kvm/mmu/mmu.c | 163 ++++++++++++++++++++++++++++++++--------- 1 file changed, 128 insertions(+), 35 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 842a3a4cdfe9..bfcfdc0a8600 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -935,9 +935,59 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu * About rmap_head encoding: * * If the bit zero of rmap_head->val is clear, then it points to the only spte - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct * pte_list_desc containing more mappings. 
*/ +#define KVM_RMAP_MANY BIT(0) +#define KVM_RMAP_LOCKED BIT(1) + +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) +{ + unsigned long old_val, new_val; + + old_val = READ_ONCE(rmap_head->val); + if (!old_val) + return 0; + + do { + while (old_val & KVM_RMAP_LOCKED) { + old_val = READ_ONCE(rmap_head->val); + cpu_relax(); + } + if (!old_val) + return 0; + + new_val = old_val | KVM_RMAP_LOCKED; + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); + + return old_val; +} + +static unsigned long kvm_rmap_write_lock(struct kvm_rmap_head *rmap_head) +{ + return kvm_rmap_lock(rmap_head); +} + +static void kvm_rmap_write_ulock(struct kvm_rmap_head *rmap_head, + unsigned long new_val) +{ + WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED); + WRITE_ONCE(rmap_head->val, new_val); +} + +static unsigned long kvm_rmap_read_lock(struct kvm_rmap_head *rmap_head) +{ + return kvm_rmap_lock(rmap_head); +} + +static void kvm_rmap_read_unlock(struct kvm_rmap_head *rmap_head, + unsigned long old_val) +{ + if (!old_val) + return; + + WRITE_ONCE(rmap_head->val, old_val & ~KVM_RMAP_LOCKED); +} /* * Returns the number of pointers in the rmap chain, not counting the new one. @@ -945,21 +995,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, struct kvm_rmap_head *rmap_head) { + unsigned long old_val, new_val; struct pte_list_desc *desc; int count = 0; - if (!rmap_head->val) { - rmap_head->val = (unsigned long)spte; - } else if (!(rmap_head->val & 1)) { + old_val = kvm_rmap_write_lock(rmap_head); + + if (!old_val) { + new_val = (unsigned long)spte; + } else if (!(old_val & KVM_RMAP_MANY)) { desc = kvm_mmu_memory_cache_alloc(cache); - desc->sptes[0] = (u64 *)rmap_head->val; + desc->sptes[0] = (u64 *)old_val; desc->sptes[1] = spte; desc->spte_count = 2; desc->tail_count = 0; - rmap_head->val = (unsigned long)desc | 1; + new_val = (unsigned long)desc | KVM_RMAP_MANY; ++count; } else { - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY); count = desc->tail_count + desc->spte_count; /* @@ -968,21 +1021,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, */ if (desc->spte_count == PTE_LIST_EXT) { desc = kvm_mmu_memory_cache_alloc(cache); - desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY); desc->spte_count = 0; desc->tail_count = count; - rmap_head->val = (unsigned long)desc | 1; + new_val = (unsigned long)desc | KVM_RMAP_MANY; + } else { + new_val = old_val; } desc->sptes[desc->spte_count++] = spte; } + + kvm_rmap_write_ulock(rmap_head, new_val); + return count; } -static void pte_list_desc_remove_entry(struct kvm *kvm, - struct kvm_rmap_head *rmap_head, +static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val, struct pte_list_desc *desc, int i) { - struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY); int j = head_desc->spte_count - 1; /* @@ -1009,9 +1066,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm, * head at the next descriptor, i.e. the new head. 
*/ if (!head_desc->more) - rmap_head->val = 0; + *rmap_val = 0; else - rmap_head->val = (unsigned long)head_desc->more | 1; + *rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY; mmu_free_pte_list_desc(head_desc); } @@ -1019,24 +1076,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte, struct kvm_rmap_head *rmap_head) { struct pte_list_desc *desc; + unsigned long rmap_val; int i; - if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm)) - return; + rmap_val = kvm_rmap_write_lock(rmap_head); + if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm)) + goto out; - if (!(rmap_head->val & 1)) { - if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm)) - return; + if (!(rmap_val & KVM_RMAP_MANY)) { + if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm)) + goto out; - rmap_head->val = 0; + rmap_val = 0; } else { - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY); while (desc) { for (i = 0; i < desc->spte_count; ++i) { if (desc->sptes[i] == spte) { - pte_list_desc_remove_entry(kvm, rmap_head, + pte_list_desc_remove_entry(kvm, &rmap_val, desc, i); - return; + goto out; } } desc = desc->more; @@ -1044,6 +1103,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte, KVM_BUG_ON_DATA_CORRUPTION(true, kvm); } + +out: + kvm_rmap_write_ulock(rmap_head, rmap_val); } static void kvm_zap_one_rmap_spte(struct kvm *kvm, @@ -1058,17 +1120,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, struct kvm_rmap_head *rmap_head) { struct pte_list_desc *desc, *next; + unsigned long rmap_val; int i; - if (!rmap_head->val) + rmap_val = kvm_rmap_write_lock(rmap_head); + if (!rmap_val) return false; - if (!(rmap_head->val & 1)) { - mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val); + if (!(rmap_val & KVM_RMAP_MANY)) { + mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val); goto out; } - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY); for (; desc; desc = next) { for (i = 0; i < desc->spte_count; i++) @@ -1078,20 +1142,21 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, } out: /* rmap_head is meaningless now, remember to reset it */ - rmap_head->val = 0; + kvm_rmap_write_ulock(rmap_head, 0); return true; } unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) { + unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED; struct pte_list_desc *desc; - if (!rmap_head->val) + if (!rmap_val) return 0; - else if (!(rmap_head->val & 1)) + else if (!(rmap_val & KVM_RMAP_MANY)) return 1; - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY); return desc->tail_count + desc->spte_count; } @@ -1134,6 +1199,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) */ struct rmap_iterator { /* private fields */ + struct rmap_head *head; struct pte_list_desc *desc; /* holds the sptep if not NULL */ int pos; /* index of the sptep */ }; @@ -1148,18 +1214,19 @@ struct rmap_iterator { static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head, struct rmap_iterator *iter) { + unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED; u64 *sptep; - if (!rmap_head->val) + if (!rmap_val) return NULL; - if (!(rmap_head->val & 1)) { + if (!(rmap_val & KVM_RMAP_MANY)) { iter->desc = NULL; - sptep = (u64 *)rmap_head->val; + sptep = (u64 *)rmap_val; goto out; } - iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + iter->desc = (struct pte_list_desc *)(rmap_val & 
~KVM_RMAP_MANY); iter->pos = 0; sptep = iter->desc->sptes[iter->pos]; out: @@ -1553,6 +1620,32 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm, return ret; } +static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm, + struct kvm_gfn_range *range, + rmap_handler_t handler) +{ + struct kvm_rmap_head *rmap_head; + unsigned long rmap_val; + bool ret = false; + gfn_t gfn; + int level; + + for (gfn = range->start; gfn < range->end; gfn++) { + for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) { + rmap_head = gfn_to_rmap(gfn, level, range->slot); + rmap_val = kvm_rmap_read_lock(rmap_head); + + if (rmap_val) + ret |= handler(kvm, rmap_head, range->slot, gfn, level); + + kvm_rmap_read_unlock(rmap_head, rmap_val); + } + } + + return ret; +} + + bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) { bool flush = false; base-commit: 771df9ffadb8204e61d3e98f36c5067102aab78f --
On Tue, Jul 9, 2024 at 10:49 AM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Jul 08, 2024, James Houghton wrote: > > On Fri, Jun 28, 2024 at 7:38 PM James Houghton <jthoughton@google.com> wrote: > > > > > > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@google.com> wrote: > > I still don't think we should get rid of the WAS_FAST stuff. > > I do :-) > > > The assumption that the L1 VM will almost never share pages between L2 > > VMs is questionable. The real question becomes: do we care to have > > accurate age information for this case? I think so. > > I think you're conflating two different things. WAS_FAST isn't about accuracy, > it's about supporting lookaround in conditionally fast secondary MMUs. > > Accuracy only comes into play when we're talking about the last-minute check, > which, IIUC, has nothing to do with WAS_FAST because any potential lookaround has > already been performed. Sorry, I thought you meant: have the MMU notifier only ever be lockless (when tdp_mmu_enabled), and just return a potentially wrong result in the unlikely case that L1 is sharing pages between L2s. I think it's totally fine to just drop WAS_FAST. So then we can either do look-around (1) always, or (2) only when there is a secondary MMU with has_fast_aging. (2) is pretty simple, I'll just do that. We can add some shadow MMU lockless support later to make the look-around not as useless for the nested TDP case. > > It's not completely trivial to get the lockless walking of the shadow > > MMU rmaps correct either (please see the patch I attached here[1]). > > Heh, it's not correct. Invoking synchronize_rcu() in kvm_mmu_commit_zap_page() > is illegal, as mmu_lock (rwlock) is held and synchronize_rcu() might_sleep(). > > For kvm_test_age_rmap_fast(), KVM can blindly read READ_ONCE(*sptep). KVM might > read garbage, but that would be an _extremely_ rare scenario, and reporting a > zapped page as being young is acceptable in that 1 in a billion situation. > > For kvm_age_rmap_fast(), i.e. where KVM needs to write, I'm pretty sure KVM can > handle that by rechecking the rmap and using CMPXCHG to write the SPTE. If the > rmap is unchanged, then the old SPTE value is guaranteed to be valid, in the sense > that its value most definitely came from a KVM shadow page table. Ah, drat, that > won't work, because very theoretically, the page table could be freed, reallocated, > and rewritten with the exact same value by something other than KVM. Hrm. > > Looking more closely, I think we can go straight to supporting rmap walks outside > of mmu_lock. There will still be a "lock", but it will be a *very* rudimentary > lock, akin to the TDP MMU's REMOVED_SPTE approach. Bit 0 of rmap_head->val is > used to indicate "many", while bits 63:3/31:2 on 64-bit/32-bit KVM hold the > pointer (to a SPTE or a list). That means bit 1 is available for shenanigans. > > If we use bit 1 to lock the rmap, then the fast mmu_notifier can safely walk the > entire rmap chain. And with a reader/write scheme, the rmap walks that are > performed under mmu_lock don't need to lock the rmap, which means flows like > kvm_mmu_zap_collapsible_spte() don't need to be modified to avoid recursive > self-deadlock. Lastly, the locking can be conditioned on the rmap being valid, > i.e. having at least one SPTE. That way the common case of a gfn not having any > rmaps is a glorified nop. > > Adding the locking isn't actually all that difficult, with the *huge* caveat that > the below patch is compile-tested only. 
The vast majority of the churn is to make > it so existing code ignores the new KVM_RMAP_LOCKED bit. This is very interesting, thanks for laying out how this could be done. I don't want to hold this series up on getting the details of the shadow MMU lockless walk exactly right. :) > I don't know that we should pursue such an approach in this series unless we have > to. E.g. if we can avoid WAS_FAST or don't have to carry too much intermediate > complexity, then it'd probably be better to land the TDP MMU support first and > then add nested TDP support later. Agreed! > At the very least, it does make me more confident that a fast walk of the rmaps > is very doable (at least for nested TDP), i.e. makes me even more steadfast > against adding WAS_FAST. > > > And the WAS_FAST functionality isn't even that complex to begin with. > > I agree the raw code isn't terribly complex, but it's not trivial either. And the > concept and *behavior* is complex, which is just as much of a maintenance burden > as the code itself. E.g. it requires knowing that KVM has multiple MMUs buried > behind a single mmu_notifier, and that a "hit" on the fast MMU will trigger > lookaround on the fast MMU, but not the slow MMU. Understanding and describing > the implications of that behavior isn't easy. E.g. if GFN=X is young in the TDP > MMU, but X+1..X+N are young only in the shadow MMU, is doing lookaround and making > decisions based purely on the TDP MMU state the "right" behavior? > > I also really don't like bleeding KVM details into the mmu_nofitier APIs. The > need for WAS_FAST is 100% a KVM limitation. AFAIK, no other secondary MMU has > multiple MMU implementations active behind a single notifier, and other than lack > of support, nothing fundamentally prevents a fast query in the shadow MMU. Makes sense. So in v6, I will make the following changes: 1. Drop the WAS_FAST complexity. 2. Add a function like mm_has_fast_aging_notifiers(), use that to determine if we should be doing look-around. 3. Maybe change the notifier calls slightly[1], still need to check performance. Does that sound good to you? Thanks! [1]: https://lore.kernel.org/linux-mm/CAOUHufb2f_EwHY5LQ59k7Nh7aS1-ZbOKtkoysb8BtxRNRFMypQ@mail.gmail.com/
On Wed, Jul 10, 2024, James Houghton wrote: > On Tue, Jul 9, 2024 at 10:49 AM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Jul 08, 2024, James Houghton wrote: > > > On Fri, Jun 28, 2024 at 7:38 PM James Houghton <jthoughton@google.com> wrote: > > > > > > > > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@google.com> wrote: > > > I still don't think we should get rid of the WAS_FAST stuff. > > > > I do :-) > > > > > The assumption that the L1 VM will almost never share pages between L2 > > > VMs is questionable. The real question becomes: do we care to have > > > accurate age information for this case? I think so. > > > > I think you're conflating two different things. WAS_FAST isn't about accuracy, > > it's about supporting lookaround in conditionally fast secondary MMUs. > > > > Accuracy only comes into play when we're talking about the last-minute check, > > which, IIUC, has nothing to do with WAS_FAST because any potential lookaround has > > already been performed. > > Sorry, I thought you meant: have the MMU notifier only ever be > lockless (when tdp_mmu_enabled), and just return a potentially wrong > result in the unlikely case that L1 is sharing pages between L2s. > > I think it's totally fine to just drop WAS_FAST. So then we can either > do look-around (1) always, or (2) only when there is a secondary MMU > with has_fast_aging. (2) is pretty simple, I'll just do that. > > We can add some shadow MMU lockless support later to make the > look-around not as useless for the nested TDP case. ... > > Adding the locking isn't actually all that difficult, with the *huge* caveat that > > the below patch is compile-tested only. The vast majority of the churn is to make > > it so existing code ignores the new KVM_RMAP_LOCKED bit. > > This is very interesting, thanks for laying out how this could be > done. I don't want to hold this series up on getting the details of > the shadow MMU lockless walk exactly right. :) ... > 1. Drop the WAS_FAST complexity. > 2. Add a function like mm_has_fast_aging_notifiers(), use that to > determine if we should be doing look-around. I would prefer a flag over a function. Long-term, if my pseudo-lockless rmap idea pans out, KVM can set the flag during VM creation. Until then, KVM can set the flag during creation and then toggle it in (un)account_shadowed(). Races will be possible, but they should be extremely rare and quite benign, all things considered.
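For illustration, a rough sketch of how KVM could maintain such a flag from (un)account_shadowed(), as suggested above. The has_fast_aging field is an assumption (no such field exists today); the existing bodies of the two functions are elided, and only the added lines are the point.

/*
 * Hypothetical sketch: clear a fast-aging flag on KVM's mmu_notifier when
 * the first shadow (non-TDP) page is created, and set it again when the
 * last one goes away. Races are tolerated, as discussed above.
 */
static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/* ... existing accounting elided ... */
	if (!kvm->arch.indirect_shadow_pages++)
		WRITE_ONCE(kvm->mmu_notifier.has_fast_aging, false);
}

static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/* ... existing accounting elided ... */
	if (!--kvm->arch.indirect_shadow_pages)
		WRITE_ONCE(kvm->mmu_notifier.has_fast_aging, true);
}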
On Fri, Jul 12, 2024 at 8:06 AM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Jul 10, 2024, James Houghton wrote: > > 1. Drop the WAS_FAST complexity. > > 2. Add a function like mm_has_fast_aging_notifiers(), use that to > > determine if we should be doing look-around. > > I would prefer a flag over a function. Long-term, if my pseudo-lockless rmap > idea pans out, KVM can set the flag during VM creation. Until then, KVM can set > the flag during creation and then toggle it in (un)account_shadowed(). Races > will be possible, but they should be extremely rare and quite benign, all things > considered. So I think you're talking about .has_fast_aging bool on struct mmu_notifier, and we would do look-around if that is set, right? So we'd need to have something like mm_has_fast_aging_notifiers() to check if any of the subscribed mmu notifiers have .has_fast_aging. We could toggle it in (un)account_shadowed(), but I think I'd need to demonstrate some performance win to justify it.
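On the mm side, here is a sketch of what the helper could look like if it walks the subscription list, modeled on the other helpers in mm/mmu_notifier.c. Both mm_has_fast_aging_notifiers() and the has_fast_aging field are assumptions taken from the discussion above, not existing code.

/*
 * Hypothetical: return true if any registered notifier advertises
 * fast-only aging. Uses the same SRCU-protected list walk as the other
 * __mmu_notifier_* helpers in mm/mmu_notifier.c.
 */
bool mm_has_fast_aging_notifiers(struct mm_struct *mm)
{
	struct mmu_notifier *subscription;
	bool ret = false;
	int id;

	if (!mm_has_notifiers(mm))
		return false;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(subscription,
				 &mm->notifier_subscriptions->list, hlist,
				 srcu_read_lock_held(&srcu)) {
		if (READ_ONCE(subscription->has_fast_aging)) {
			ret = true;
			break;
		}
	}
	srcu_read_unlock(&srcu, id);

	return ret;
}

Caching the result as a flag in mm->notifier_subscriptions at registration time, as Sean suggests, would reduce this to a single load on every look-around decision instead of a list walk.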
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index d39ebb10caeb..2655d841a409 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,15 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +/* + * Bits in the return value for test_clear_young_fast_only. + * + * MMU_NOTIFIER_FAST_YOUNG: notifier succeeded, secondary MMU reports young. + * MMU_NOTIFIER_FAST_FAILED: notifier failed. + */ +#define MMU_NOTIFIER_FAST_YOUNG (1 << 0) +#define MMU_NOTIFIER_FAST_FAILED (1 << 1) + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -122,6 +131,24 @@ struct mmu_notifier_ops { struct mm_struct *mm, unsigned long address); + /* + * test_clear_young_fast_only is called to check (and optionally clear) + * the young/accessed bitflag in the secondary pte such that the + * secondary MMU must implement it in a way that will not significantly + * disrupt other MMU operations. In other words, speed is more + * important than accuracy. + * + * Returns MMU_NOTIFIER_FAST_YOUNG if the secondary pte(s) were young. + * Returns MMU_NOTIFIER_FAST_FAILED if the secondary MMU could not do + * an accurate fast-only test and/or clear of the young/accessed + * flag. + */ + int (*test_clear_young_fast_only)(struct mmu_notifier *subscription, + struct mm_struct *mm, + unsigned long start, + unsigned long end, + bool clear); + /* * invalidate_range_start() and invalidate_range_end() must be * paired and are called only when the mmap_lock and/or the @@ -383,6 +410,10 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm, unsigned long end); extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); +extern int __mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm, + unsigned long start, + unsigned long end, + bool clear); extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r); extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, @@ -428,6 +459,17 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm, return 0; } +static inline int mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm, + unsigned long start, + unsigned long end, + bool clear) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_test_clear_young_fast_only(mm, start, end, + clear); + return 0; +} + static inline void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { @@ -612,6 +654,14 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm, return 0; } +static inline int mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm, + unsigned long start, + unsigned long end, + bool clear) +{ + return 0; +} + static inline void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 8982e6139d07..7b77ad6cf833 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -424,6 +424,32 @@ int __mmu_notifier_test_young(struct mm_struct *mm, return young; } +int __mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm, + unsigned long start, + unsigned long end, + bool clear) +{ + struct mmu_notifier *subscription; + int ret = 0, id; + + id = srcu_read_lock(&srcu); + hlist_for_each_entry_rcu(subscription, + &mm->notifier_subscriptions->list, hlist, + srcu_read_lock_held(&srcu)) { + if (subscription->ops->test_clear_young_fast_only) { + ret = 
subscription->ops->test_clear_young_fast_only( + subscription, mm, start, end, clear); + if (ret & MMU_NOTIFIER_FAST_FAILED) + break; + if (!clear && (ret & MMU_NOTIFIER_FAST_YOUNG)) + break; + } + } + srcu_read_unlock(&srcu, id); + + return ret; +} + static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions, const struct mmu_notifier_range *range) {
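As a usage illustration for the API added above: the MGLRU call sites live in other patches of this series, so the caller below is hypothetical, but it shows how the MMU_NOTIFIER_FAST_* return bits are meant to be consumed.

/*
 * Hypothetical caller: ask the secondary MMUs whether a single page is
 * young (optionally clearing the Accessed state), but only if they can
 * answer without taking a slow path.
 */
static bool secondary_mmus_young_fast_only(struct mm_struct *mm,
					   unsigned long addr, bool clear)
{
	int ret = mmu_notifier_test_clear_young_fast_only(mm, addr,
							  addr + PAGE_SIZE,
							  clear);

	/* The notifier could not answer quickly; rely on primary-MMU state only. */
	if (ret & MMU_NOTIFIER_FAST_FAILED)
		return false;

	return ret & MMU_NOTIFIER_FAST_YOUNG;
}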
This new notifier is for multi-gen LRU specifically, as it wants to be able to get and clear age information from secondary MMUs only if it can be done "fast".

By having this notifier specifically created for MGLRU, what "fast" means comes down to what is "fast" enough to improve MGLRU's ability to reclaim most of the time.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/mmu_notifier.h | 50 ++++++++++++++++++++++++++++++++++++
 mm/mmu_notifier.c            | 26 +++++++++++++++++++
 2 files changed, 76 insertions(+)